CN114937446A - Voice synthesis method, device, equipment and storage medium - Google Patents

Voice synthesis method, device, equipment and storage medium

Info

Publication number
CN114937446A
Authority
CN
China
Prior art keywords
text
processed
target object
voice
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210425994.6A
Other languages
Chinese (zh)
Inventor
张永超
张征
虞国桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202210425994.6A priority Critical patent/CN114937446A/en
Publication of CN114937446A publication Critical patent/CN114937446A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The method determines each target object contained in a text to be processed according to feature information of the text, determines sound effect parameters for each target object, and then determines the speech corresponding to the text based on those sound effect parameters and a pre-trained speech synthesis model. By automatically identifying the target objects contained in the text to be processed and automatically determining their sound effect parameters, characteristics such as speech rate and volume of the speech corresponding to each target object can be controlled automatically even in large-scale speech synthesis scenarios, achieving the goal of emphasizing the target objects.

Description

Voice synthesis method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech synthesis.
Background
With the development of text-to-speech (TTS) technology, intelligent voice systems can automatically convert text stored in a computer into increasingly natural-sounding speech. Speech synthesis is widely applied in fields such as human-computer voice interaction systems and voice information publishing systems. In general, however, the intonation of synthesized speech is relatively flat; even when a user asks for the speech to be repeated, the repeated speech has the same speech rate and volume as the speech output the previous time.
In the prior art, to highlight text that should be emphasized in a restatement scenario, the text to be emphasized is usually marked manually using Speech Synthesis Markup Language (SSML), and characteristics such as the speech rate and volume of the corresponding speech are controlled through manually configured adjustment parameters, so that speech with an emphasized tone is output in the restatement scenario.
However, in some application fields of speech synthesis technology, the amount of text to be converted into speech is large and the text logic is complex. Labeling all text to be emphasized through manual SSML annotation is highly labor-intensive and cannot meet the requirements of large-scale speech synthesis.
Disclosure of Invention
The present specification provides a speech synthesis method and apparatus to partially solve the above problems in the prior art.
The technical scheme adopted by the specification is as follows:
the present specification provides a speech synthesis method including:
acquiring a text to be processed;
responding to a request of a user for acquiring voice of the text to be processed, and determining each target object contained in the text to be processed according to the characteristic information of the text to be processed;
determining sound effect parameters of the target objects according to the target objects;
and determining the voice of the text to be processed containing each target object according to the sound effect parameters of each target object and a pre-trained voice synthesis model.
Optionally, determining each target object included in the text to be processed according to the feature information of the text to be processed specifically includes:
acquiring a plurality of preset target object identification templates;
searching whether a preset target object identification template matched with the text to be processed exists in the preset target object identification templates or not according to the characteristic information of the text to be processed;
if yes, determining each target object contained in the text to be processed according to a preset target object identification template matched with the text to be processed.
Optionally, determining each target object included in the text to be processed according to the feature information of the text to be processed specifically includes:
performing word segmentation on the text to be processed to obtain each word contained in the text to be processed;
acquiring a feature vector corresponding to each word contained in the text to be processed;
and inputting the feature vectors corresponding to the words contained in the text to be processed into a pre-trained named entity recognition model, and obtaining each named entity output by the pre-trained named entity recognition model as each target object contained in the text to be processed.
Optionally, determining the sound effect parameters of each target object according to each target object specifically includes:
acquiring characteristic information of the user;
determining a first volume parameter of the text to be processed according to the characteristic information of the user;
determining a second volume parameter and a speech rate parameter of each target object according to the characteristic information of each target object;
and determining the sound effect parameters of each target object according to the second volume parameter and the speech speed parameter of each target object and the first volume parameter of the text to be processed.
Optionally, determining the voice of the text to be processed including each target object according to the sound effect parameters of each target object and a pre-trained voice synthesis model, specifically including:
acquiring an original phoneme sequence corresponding to the text to be processed;
determining a target phoneme sequence corresponding to the text to be processed according to the original phoneme sequence corresponding to the text to be processed and sound effect parameters of each target object contained in the text to be processed;
and inputting the target phoneme sequence corresponding to the text to be processed into a pre-trained speech synthesis model to obtain the speech of the text to be processed output by the pre-trained speech synthesis model.
Optionally, determining the speech including the text to be processed of each target object according to the sound effect parameters of each target object and a pre-trained speech synthesis model, specifically including:
inputting the target phoneme sequence corresponding to the text to be processed into a pre-trained speech synthesis model to obtain a first predicted duration and a first predicted volume corresponding to the speech of each target object in the text to be processed output by the pre-trained speech synthesis model and a second predicted duration and a second predicted volume corresponding to the speech of the text to be processed;
and determining the voice corresponding to the text to be processed according to the first predicted time length and the first predicted volume corresponding to the voice of the target object and the second predicted time length and the second predicted volume corresponding to the voice of the text to be processed.
Optionally, after the feature information of the text to be processed and the sound effect parameters of the target objects are input to a pre-trained speech synthesis model to obtain the speech of the text to be processed output by the pre-trained speech synthesis model, the method further includes:
acquiring an appointed frequency band corresponding to the voice of the text to be processed;
determining a gain coefficient of a designated frequency band corresponding to the voice of the text to be processed according to the acquired designated frequency band;
and adjusting the appointed frequency band of the voice of the text to be processed according to the gain coefficient to obtain the target voice of the text to be processed.
The present specification provides a speech synthesis apparatus including:
the text to be processed acquisition module is used for acquiring a text to be processed;
the target object determining module is used for responding to a request of a user for acquiring the voice of the text to be processed and determining each target object contained in the text to be processed according to the characteristic information of the text to be processed;
the sound effect parameter determining module is used for determining the sound effect parameters of the target objects according to the target objects;
and the voice determining module is used for determining the voice of the text to be processed containing each target object according to the sound effect parameters of each target object and a pre-trained voice synthesis model.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described speech synthesis method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned speech synthesis method when executing the program.
The technical scheme adopted by the specification can achieve the following beneficial effects:
according to the method, each target object contained in the text to be processed is determined according to the characteristic information of the text to be processed, the sound effect parameter of each target object is determined, and then the voice corresponding to the text to be processed containing each target object is determined based on the sound effect parameter of each target object and a pre-trained voice synthesis model. Therefore, by means of automatically identifying each target object contained in the text to be processed and automatically determining the sound effect parameters of each target object, even in a large-scale speech synthesis scene, the characteristics of speech speed, volume and the like corresponding to the target object can be automatically controlled, and the purpose of emphasizing the target object is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of it, illustrate embodiments of the specification and, together with the description, serve to explain its principles; they are not intended to limit the specification. In the drawings:
FIG. 1 is a schematic flow chart of a speech synthesis method in this specification;
FIG. 2 is a schematic flow chart of a speech synthesis method in this specification;
FIG. 3 is a schematic diagram of a speech synthesis apparatus provided herein;
fig. 4 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
To make the objects, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be described clearly and completely below with reference to specific embodiments and the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the protection scope of this specification.
In addition, it should be noted that all actions of acquiring signals, information, or data in this specification are performed in compliance with the applicable data protection laws and policies of the relevant jurisdiction and with authorization from the owner of the corresponding device.
With the continuous development of electronic information processing technology, speech, as an important carrier for people to obtain information, is widely used in daily life and work. Voice interaction is an important form of human-computer interaction that frees the user's hands and eyes to the greatest extent. As the output of the whole system, speech synthesis is expected not only to reproduce the timbre of a real person but also to match the expressiveness of real human communication. For example, during human-computer interaction, a user may not clearly hear the speech output by the intelligent voice system the first time; the user then only needs to reply "I didn't hear that" to trigger the intelligent voice system to repeat the speech. This voice interaction scenario is referred to as a restatement scenario in the embodiments of this specification. In a restatement scenario, if the speech repeated by the intelligent voice system does not differ from the first output in speech rate or volume, the user may still fail to hear it clearly, which degrades the user's interaction experience.
At present, to meet the need for outputting speech that emphasizes important content in a restatement scenario, manual SSML annotation is used: the important content is identified manually, and speech characteristics such as the speech rate and volume of the corresponding speech are entered in SSML format. For example, in a map service scenario a voice prompt function may be provided for the user, such as outputting the prompt "the distance from Beijing to Tianjin". The two pieces of important content to be emphasized, "Beijing" and "Tianjin", can be labeled through manual SSML annotation to achieve the emphasis. However, when other similar prompts need to be output, such as "the distance from Xi'an to Nanjing", the labeling must be repeated: even though the text structure is similar to that of the previous prompt, "Xi'an" and "Nanjing" still have to be labeled manually before speech with the changed sound effect can be output.
In this process, both the identification of important content and the annotation of its speech rate and volume characteristics rely on manual labor, which is costly and makes it difficult to meet the requirements of large-scale speech synthesis.
To address the above problems, the embodiments of this specification automatically identify each target object in a text to be processed and automatically determine the sound effect parameters of each target object, so that speech emphasizing the target objects is output to the user without human intervention, achieving an effect similar to emphasis by a real human speaker.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a speech synthesis method in this specification, which specifically includes the following steps:
s100: and acquiring a text to be processed.
In the embodiments of this specification, the text to be processed refers to text that needs to be synthesized into speech. The executing entity of the speech synthesis method may be an intelligent voice system, an intelligent voice terminal device with a speech synthesis function, or a server with a speech synthesis function. For ease of description, the embodiments are described using an intelligent voice system with a speech synthesis function as the executing entity. In these embodiments, a user refers to the party to whom the intelligent voice system provides intelligent voice services, and a worker refers to a person who maintains the intelligent voice system.
In practical applications, the intelligent voice system acquires different texts to be processed according to the specific voice interaction scenario. For example, during question-and-answer interaction between the user and the intelligent voice system, the answer corresponding to a user's question is taken as the text to be processed; in an intelligent voice publishing system, the texts to be published to users are taken as the texts to be processed. The text to be processed may be the original text or text that has already undergone text analysis, where text analysis includes processing such as text regularization, word segmentation, and phonetic notation.
S102: and responding to a request of a user for acquiring the voice of the text to be processed, and determining each target object contained in the text to be processed according to the characteristic information of the text to be processed.
Specifically, the feature information of the text to be processed is obtained first. The feature information may include a regular-expression form of the text after text analysis, semantic features of the text, feature vectors of the words it contains, and relationship features between those words.
In practical applications, depending on the specific scenario, the target objects contained in the text to be processed may be obtained according to preset target object determination rules, for example, target object recognition templates preset by a worker, or a rule that takes the named entities in the text as target objects. The type and granularity of the target objects may also be customized by the user: multiple candidate target object types can be presented to the user in advance, and the user selects a type according to their needs; if the user does not select a type, the target objects are determined according to the preset rules. The granularity of a target object refers to the ratio of the number of words in the target object to the total number of words in the text to be processed, and may also refer to how fine-grained the classification of target objects is. For example, with coarse granularity the target objects may be classified into number and entity classes, while with fine granularity they may be classified into classes such as time and organization name.
S104: and determining sound effect parameters of the target objects according to the target objects.
The speech synthesis method provided in the embodiments of this specification aims to output speech in which important content is emphasized the way a real person would, for example by slowing down the speech rate and increasing the volume of the content that needs emphasis. Therefore, in this step, the sound effect parameters of each target object need to be determined; they include characteristic parameters such as a speech rate (duration) parameter and a volume parameter. The original speech corresponding to the text to be processed is adjusted based on the determined sound effect parameters of the target objects, so that the adjusted speech presents an emphasis effect for each target object contained in the text, achieving an effect similar to emphasis by a real person.
S106: and determining the voice of the text to be processed containing each target object according to the sound effect parameters of each target object and a pre-trained voice synthesis model.
In this step, since the sound effect parameters of a target object describe adjustments to the speech rate, volume, and similar characteristics of the corresponding speech, the speech obtained with the pre-trained speech synthesis model is the original speech of the text to be processed adjusted according to the sound effect parameters of each target object. The adjusted speech emphasizes and articulates the target objects, so that the speech heard by the user carries an emphasis effect and is closer to real human speech.
To summarize, the method determines each target object contained in the text to be processed, determines the sound effect parameters of each target object based on its feature information, and then feeds those sound effect parameters into a pre-trained speech synthesis model to obtain the speech corresponding to the text. By automatically identifying each target object contained in the text to be processed, characteristics such as the speech rate and volume of the speech corresponding to each target object can be controlled automatically even in large-scale speech synthesis scenarios, achieving the goal of emphasizing the target objects.
In the embodiments of this specification, in step S100 shown in FIG. 1, the text to be processed may also be text that has undergone text analysis. Text analysis specifically covers the following three aspects.
First aspect: text regularization. In a Chinese-language scenario, text regularization refers to converting non-Chinese characters (punctuation, numbers, etc.) into Chinese characters, that is, normalizing the non-standard text in the text to be processed. This is necessary because non-standard text corresponds to different pronunciations in different linguistic contexts. For example, when the text to be processed contains "2022" as part of a year, it is read digit by digit as "two zero two two", whereas when "2022" appears in "2022 g" it is read as the cardinal "two thousand and twenty-two". Performing text regularization on the text to be processed first therefore avoids mispronouncing non-standard text and avoids ambiguity.
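The following minimal sketch illustrates the kind of rule-based regularization described above; the rule set and function name are illustrative assumptions, not the patent's implementation:

import re

DIGIT_CHARS = "零一二三四五六七八九"

def regularize_years(text):
    """Read a 4-digit year digit by digit, e.g. '2022年' -> '二零二二年'."""
    def repl(m):
        return "".join(DIGIT_CHARS[int(d)] for d in m.group(1)) + "年"
    return re.sub(r"(\d{4})年", repl, text)

print(regularize_years("2022年"))  # 二零二二年

A full regularizer would add further rules, for example a cardinal-number reader for quantities such as "2022克", which is read differently from a year.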
Second aspect: word segmentation. Word segmentation refers to grouping the characters that belong to the same word into that word. It is the foundation of natural language processing, and its accuracy directly determines the quality of part-of-speech tagging, syntactic analysis, and downstream text analysis. Unlike English, where words and sentences are separated by spaces, Chinese naturally lacks delimiters, so words and sentences must be separated with the help of context. The word segmentation algorithm adopted in the embodiments of this specification may be any existing segmentation algorithm; this specification does not limit it. In addition, the part of speech of each word, such as verb, noun, or adjective, can be tagged during the word segmentation stage of text analysis. Syntactic analysis may also be performed on the segmentation result, such as analyzing the syntactic structure of a sentence (subject-predicate-object) and the dependency relations (coordination and dependency) between words.
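As an illustration only (the specification does not name a particular segmenter), the open-source jieba library can perform word segmentation together with part-of-speech tagging:

import jieba.posseg as pseg

for pair in pseg.cut("今天天气很好"):
    # pair.word is the segmented token, pair.flag its part-of-speech tag
    print(pair.word, pair.flag)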
Third aspect: phonetic notation. Phonetic notation in text analysis refers to grapheme-to-phoneme conversion (G2P); a phoneme is the smallest unit of speech divided from the perspective of sound quality. G2P can resolve problems such as polyphonic characters, erhua (retroflex finals), and tone sandhi. For example, G2P converts the sentence "the weather is very good today" ("今天天气很好") into the phoneme sequence "jin1 tian1 tian1 qi4 hen3 hao3". In the embodiments of this specification, the original phoneme sequence of the text to be processed can be obtained through the phonetic notation step of text analysis, and this sequence may include features such as the vowel phonemes, consonant phonemes, and tones of each word in the text to be processed.
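As an assumed illustration of G2P for Mandarin (the specification does not name a particular tool), pinyin syllables with numeric tones, as produced by the open-source pypinyin library, can serve as such a phoneme sequence:

from pypinyin import lazy_pinyin, Style

# Pinyin syllables with numeric tones serve here as the phoneme sequence.
phonemes = lazy_pinyin("今天天气很好", style=Style.TONE3)
print(phonemes)  # ['jin1', 'tian1', 'tian1', 'qi4', 'hen3', 'hao3']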
In the embodiments of this specification, for step S102 shown in FIG. 1, depending on the application scenario, the user's request to acquire the speech corresponding to the text to be processed may be the first such request, or it may be a repeat request, i.e., not the first time the user requests the speech corresponding to that text.
Case 1: the user requests the speech corresponding to the text to be processed for the first time. In some scenarios, even a first request can trigger the step of determining the target objects in the text, so that the first speech output already emphasizes the important content. For example, a voice reminder function can be provided in a food delivery scenario: a rider (the user) can configure in advance that the pickup location and the delivery destination are announced by voice automatically whenever an order is accepted, so that the rider knows the order's start and end points immediately upon accepting it. In this scenario, the rider's preset instruction to announce the pickup location and delivery destination by voice after accepting an order serves as the user's request to acquire the speech corresponding to the text to be processed; when an order is dispatched to the rider, the voice publishing system automatically outputs prompt speech such as "Order dispatched, from location A to location B".
Case 2: the user's request is not the first one. In a restatement scenario, the user did not hear the speech clearly when the intelligent voice system output it the first time, and triggers the intelligent voice system to repeat the speech by replying with an instruction such as "I didn't hear that". In this process, the user's "I didn't hear that" voice command serves as the user's request to re-acquire the speech corresponding to the text to be processed. To make the repeated output emphasize the important content, the intelligent voice system responds to the request by automatically determining each target object contained in the text to be processed; that is, the target objects contained in the text are the important content to be emphasized.
It can be seen that the applicable scenarios of the speech synthesis method provided in the embodiments of this specification are not limited to the human-computer restatement scenario; important content can also be emphasized the first time speech is output.
In the embodiments of this specification, in step S102 shown in FIG. 1, each target object contained in the text to be processed is determined according to the feature information of the text. As shown in FIG. 2, this determination may specifically be performed through the following steps:
s200: and acquiring a plurality of preset target object identification templates.
In this step, a number of target object recognition templates set in advance by workers are acquired. Here, workers are the staff who set up and maintain the target object recognition templates used by the intelligent voice system. The templates can be created, classified, stored, and maintained by workers according to business type and are used to recognize the target objects in the text to be processed. When target objects in a text need to be identified, the preset templates are retrieved and queried for the template that the text hits. A target object recognition template is typically a normalized text template. For example, in a voice prompt scenario that tells the user a departure place and a destination, a worker may preset the template "from (C) to (D)", where "(C)" is target object C and "(D)" is target object D. As before, the user is the party to whom the intelligent voice system provides its services. When the text to be processed acquired by the intelligent voice system is "from Beijing to Tianjin", the text can be matched against this template, yielding "Beijing" as target object C and "Tianjin" as target object D.
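A minimal sketch of the template mechanism described above, assuming templates are stored as regular expressions with named slots (the template text and the business-class key are illustrative):

import re

# Each template marks the slots that become target objects; the template
# below corresponds to the "from (C) to (D)" example above.
TEMPLATES = {
    "route_prompt": re.compile(r"^从(?P<C>.+?)到(?P<D>.+)$"),
}

def match_targets(text):
    for name, pattern in TEMPLATES.items():
        m = pattern.match(text)
        if m:
            return name, m.groupdict()   # target objects keyed by slot name
    return None, {}

print(match_targets("从北京到天津"))  # ('route_prompt', {'C': '北京', 'D': '天津'})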
Therefore, by presetting target object recognition templates for a specific scenario, workers enable the target objects in the common high-frequency texts of that scenario to be identified automatically. A large volume of high-frequency text can thus be recognized automatically without workers having to fill in the specific content of each target object, greatly reducing the waste of human resources.
S202: and searching whether a preset target object identification template matched with the text to be processed exists in the preset target object identification templates or not according to the characteristic information of the text to be processed. If so, go to step S204, otherwise go to step S206.
Because the preset target object recognition templates take the form of normalized regular expressions, the template matching the text to be processed can be determined using existing regular expression matching methods. Specifically, the feature information of the text to be processed is obtained first; it may include the regular-expression form of the text after text analysis. The business class of the text is then identified from its feature information, the target object recognition templates corresponding to that business class are obtained, and the template matching the text is searched for among them. For example, for the text "Order dispatched, from location A to location B", analysis of the word "dispatched" determines that its business class is the delivery class, and whether a matching template exists is then searched for among all preset target object recognition templates of the delivery class, thereby determining the template corresponding to the text to be processed.
In addition, because the templates are configured in advance by workers according to business needs, the analysed text may not match any template exactly. In that case, each word of the text can be matched against a preset template in order, and a matching degree between the text and the template is computed. If many words have a matching degree above a threshold, the text matches the template well; that template is then taken as the one matching the text to be processed, and the text is recognized according to it to obtain its target objects, as sketched below.
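A rough sketch of this fuzzy matching; the scoring rule and the 0.8 threshold are assumptions rather than values given in the specification:

def match_degree(text_words, template_literals):
    """Fraction of the template's literal (non-slot) words found in the text."""
    hits = sum(1 for w in template_literals if w in text_words)
    return hits / max(len(template_literals), 1)

def matches(text_words, template_literals, threshold=0.8):
    return match_degree(text_words, template_literals) >= threshold

print(matches(["派送", "从", "A地", "到", "B地"], ["派送", "从", "到"]))  # True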
S204: and determining each target object contained in the text to be processed according to a preset target object identification template matched with the text to be processed.
If a target object recognition template matching the text to be processed exists, each target object of the text is determined according to the positions of the target objects within the matched template.
S206: and acquiring a feature vector corresponding to each word contained in the text to be processed.
S208: and inputting the feature vectors corresponding to the words contained in the text to be processed into a pre-trained named body recognition model, and obtaining each named body output by the pre-trained named body recognition model as each target object contained in the text to be processed.
If no target object recognition template matches the text to be processed, the named entities in the text can be identified as its target objects using named entity recognition technology. Named Entity Recognition (NER) is an important step in natural language processing and is widely applied in tasks such as information extraction, information retrieval, recommendation, and machine translation. A named entity is a proper noun with a specific meaning in natural language. Its type may be an entity type, such as a person name, place name, or organization name; a time type, such as a date; or a number type, such as a currency amount, percentage, or quantity.
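A minimal sketch of this NER fallback; ner_model stands in for whatever pre-trained sequence-labelling model is used, and its interface here is an assumption:

def extract_target_objects(words, word_vectors, ner_model):
    """Collect entity spans from BIO tags predicted by the pre-trained model."""
    tags = ner_model.predict(word_vectors)   # e.g. ["B-LOC", "I-LOC", "O", ...]
    targets, current = [], []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                targets.append("".join(current))
            current = [word]
        elif tag.startswith("I-") and current:
            current.append(word)
        else:
            if current:
                targets.append("".join(current))
            current = []
    if current:
        targets.append("".join(current))
    return targets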
Optionally, recognizing named entities as target objects need not be conditioned on the absence of a matching target object recognition template: named entity recognition may be performed on the text to be processed directly, with each recognized named entity taken as a target object, which also saves manpower.
Thus, the speech synthesis method provided in the embodiments of this specification can identify the target objects in a text to be processed automatically. The worker-preset target object recognition templates fully cover the function of the manual SSML labeling scheme while sparing workers the task of filling in the specific text of each target object: only the target object recognition templates for the application scenario need to be maintained. This greatly reduces the workload of workers and lowers the threshold for using the target object emphasis function. In addition, the method provides an intelligent recognition mode based on named entity recognition, so that the target objects contained in the text can be identified directly and automatically; even when the text matches no preset template, its named entities can be recognized automatically and emphasized.
In an optional embodiment of this specification, in the process of determining each target object contained in the text to be processed shown in step S102 of FIG. 1, whether each word is a target object may also be decided by evaluating the importance of each word contained in the text. This is implemented through the following steps:
firstly, acquiring a feature vector corresponding to each word contained in the text to be processed.
In this step, the words contained in the text to be processed may be obtained from the analysed text described in step S100 of FIG. 1, where the words are determined by the word segmentation performed during text analysis. The feature vector corresponding to each word may include semantic features, syntactic features, and relationship features with the word's adjacent words.
Then, for each word contained in the text to be processed, determining the weight of the word according to the feature vector corresponding to the word and the feature vectors corresponding to the adjacent words of the word.
The weight of a word is determined from the feature vector of the word and the feature vectors of its adjacent words. The weight represents the word's importance in the text to be processed. If a word's weight is not lower than a preset weight threshold, the word needs to be emphasized in the speech of the text, so every word whose weight reaches the threshold can be taken as a target object; its sound effect parameters are then determined in the subsequent steps and the word is emphasized in volume and speech rate in the output speech.
Finally, according to the weights of the words contained in the text to be processed, the words whose weights are not lower than the preset weight threshold are taken as target objects; a sketch of this selection follows.
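An illustrative weighting scheme, assuming a cosine-similarity score between a word's vector and the mean of its neighbours' vectors; the actual scoring function and threshold value are not specified in the source:

import numpy as np

def target_objects_by_weight(words, vectors, threshold=0.5):
    """Keep the words whose weight reaches the threshold."""
    vectors = np.asarray(vectors, dtype=float)
    targets = []
    for i, word in enumerate(words):
        left = vectors[i - 1] if i > 0 else vectors[i]
        right = vectors[i + 1] if i + 1 < len(words) else vectors[i]
        context = (left + right) / 2.0
        # Cosine similarity between the word vector and its local context
        weight = float(vectors[i] @ context /
                       (np.linalg.norm(vectors[i]) * np.linalg.norm(context) + 1e-8))
        if weight >= threshold:
            targets.append(word)
    return targets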
Optionally, in some scenarios it may happen that the weights of all words in the text to be processed are above the preset threshold, or that they are all below it. In either case, to improve the intelligibility of the speech for the user, all words in the text may be taken as target objects, that is, the entire text is treated as important content. The sound effect parameters of all words are then determined in the subsequent steps, and the whole text is emphasized in volume and speech rate in the output speech.
For example, in a scenario where the user interacts with an intelligent voice customer service agent, the agent determines a text to be processed that answers the user's question. Semantic analysis of that text may reveal that the words it contains are highly specialized, so that outputting the corresponding speech at the default volume and speech rate would leave the user unable to follow its meaning; in effect, all the words in the text are important, and the agent cannot automatically distinguish which of them matter more to the user. In this scenario, all words in the text can be taken as target objects: the volume of every word is raised and the speech rate slowed, giving the user more time to react to the speech. The user can then pick out the information they need, avoiding having to listen to the speech repeatedly and improving the interaction experience.
It should be noted that determining target objects by word weight, as described above, may be used together with the approach shown in FIG. 2 or on its own; this specification does not limit this.
In the embodiments of this specification, in step S104 shown in FIG. 1, the sound effect parameters of each target object are determined according to the feature information of each target object, specifically through the following steps:
firstly, determining a first volume parameter of the text to be processed according to the acquired characteristic information of the user;
in practical application, when the intelligent voice system performs voice interaction with a user, if the environment where the user is located is noisy, the user may not hear the voice with the default volume output by the intelligent voice system clearly, at this time, the first volume parameter of the text to be processed is determined on the basis of the default volume by acquiring the characteristic information of the user, and the volume of the voice of the text to be processed is amplified. The feature information of the user may be a signal-to-noise ratio of an environment where the user is located, which is acquired while responding to a request of the user to acquire the speech of the text to be processed. In the embodiments of the present specification, the signal-to-noise ratio of the environment where the user is located refers to a ratio of a normal sound signal intensity of the environment where the user is located to an environmental noise signal intensity determined according to a request of the user and noise of the environment where the user is located. The first volume parameter of the text to be processed is determined according to the signal-to-noise ratio of the environment where the user is located, and then the voice of the text to be processed can be amplified in the subsequent steps on the basis of the default volume of the voice output by the intelligent voice system, so that the situation that the user cannot hear the voice of the text to be processed output by the intelligent voice system due to the fact that the environment where the user is located is too noisy is avoided.
And secondly, determining a second volume parameter and a speech speed parameter of each target object according to each target object.
In practical applications, because the speech finally output by the intelligent voice system must highlight and emphasize each target object, once the target objects are determined, the amounts by which their speech is amplified in volume and slowed in rate are determined as well. Specifically, the second volume parameter and the speech rate parameter of each target object are relative values: the volume of the target object's speech is amplified relative to the original volume of the text's speech according to the second volume parameter, and its speech rate is slowed relative to the original rate according to the speech rate parameter. These two parameters may be fixed, or they may be adjustable values determined by the specific application scenario, allowing the volume amplification and speech rate slowing effects to be adjusted dynamically.
And finally, determining the sound effect parameters of each target object according to the second volume parameter and the speech speed parameter of each target object and the first volume parameter of the text to be processed.
Thus, the speech of the text to be processed is amplified in volume when the user's environment is noisy, and the target objects within the text are additionally amplified in volume and slowed in speech rate. Note that the first volume parameter of the text, the second volume parameter of a target object, and the sound effect parameters of a target object are all relative values: when the volume and rate of the text's speech and of the target objects' speech are adjusted according to these parameters, the adjustments are all applied relative to the volume and rate of the original speech.
For example, when the user's environment is quiet, the original volume is 50% and the original speech rate is 1; the user can hear the system's output clearly at a moderate volume. When the user's environment is noisy, the first volume parameter of the text to be processed is determined from the environment's signal-to-noise ratio to be 10%; that is, the original volume is raised by 10 percentage points, so the output volume becomes 60% while the speech rate stays at 1, ensuring the user can hear the output clearly in a noisy environment. On top of this, when the text to be processed contains a target object, its second volume parameter is determined to be 5% and its speech rate parameter 0.2. The speech of the text is then adjusted so that the target object's volume is 65% and its speech rate 0.8, while the rest of the text is output at 60% volume and a speech rate of 1, as in the sketch below.
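A small sketch reproducing the worked example above; the base values come from the example, while the function itself and its parameter names are illustrative assumptions:

BASE_VOLUME = 0.50   # default output volume (50%)
BASE_RATE = 1.0      # default speech rate

def sound_effect_params(first_volume, second_volume, rate_delta):
    """first_volume: whole-utterance boost from the environment SNR;
    second_volume / rate_delta: per-target-object adjustments."""
    sentence_volume = BASE_VOLUME + first_volume     # e.g. 0.50 + 0.10 = 0.60
    target_volume = sentence_volume + second_volume  # e.g. 0.60 + 0.05 = 0.65
    target_rate = BASE_RATE - rate_delta             # e.g. 1.0 - 0.2 = 0.8
    return sentence_volume, target_volume, target_rate

print(tuple(round(v, 2) for v in sound_effect_params(0.10, 0.05, 0.2)))  # (0.6, 0.65, 0.8)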
In the embodiments of this specification, in step S106 shown in FIG. 1, the speech of the text to be processed containing each target object is determined according to the sound effect parameters of the target objects and the pre-trained speech synthesis model, specifically through the following steps:
firstly, acquiring an original phoneme sequence corresponding to the text to be processed, and determining a target phoneme sequence corresponding to the text to be processed according to the original phoneme sequence corresponding to the text to be processed and sound effect parameters of each target object contained in the text to be processed.
In practical application, the original phoneme sequence of the text to be processed can be obtained through the phonetic notation step in the text analysis process. The phonetic notation step is already explained in the text analysis process shown in step S100 of fig. 1, and is not described herein again.
The original phoneme sequence of the text to be processed is obtained by arranging the phonemes of each word in the order in which the words appear in the text. Because the sound effect parameters of a target object include a speech rate parameter, the pronunciation duration of the target object's speech differs from that of the rest of the text, so phoneme alignment is required in order to fuse the sound effect parameters of each target object with that object's phonemes. The phoneme alignment method adopted in the embodiments of this specification may be any existing alignment method; this specification does not limit it.
Secondly, inputting the target phoneme sequence corresponding to the text to be processed into a pre-trained speech synthesis model, and obtaining a first predicted duration and a first predicted volume corresponding to the speech of each target object in the text to be processed output by the pre-trained speech synthesis model and a second predicted duration and a second predicted volume corresponding to the speech of the text to be processed.
The target phoneme sequence of the text to be processed carries the speech rate and volume characteristics of each target object as well as the default speech rate and volume characteristics of the text. Accordingly, the second predicted duration output by the pre-trained speech synthesis model corresponds to the default speech rate and the second predicted volume is the default volume, while the first predicted duration of a target object's speech is determined from its speech rate parameter relative to the default rate, and its first predicted volume from its volume parameter relative to the default volume. In general, the speech corresponding to a target object is slower than the default speech rate of the text and louder than its default volume.
Then, the speech corresponding to the text to be processed is determined according to the first predicted duration and first predicted volume of each target object's speech and the second predicted duration and second predicted volume of the text's speech.
The speech rate and volume characteristics of the target objects are fused with the default speech rate and volume characteristics of the text to be processed to obtain the speech corresponding to the text, thereby emphasizing and highlighting the target objects within it. A high-level sketch of this step follows.
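In the sketch below, the model interface (predict_prosody, vocode) and the scaling factors are assumptions standing in for the pre-trained speech synthesis model's actual behaviour:

def synthesize(phonemes, target_spans, tts_model, rate_scale=1.25, volume_gain_db=2.0):
    # Default prosody predicted by the pre-trained model for the whole sequence
    durations, volumes = tts_model.predict_prosody(phonemes)
    # Scale the spans belonging to target objects: longer durations (slower
    # speech) and a higher volume, relative to the default predictions.
    for start, end in target_spans:          # phoneme index ranges of target objects
        for i in range(start, end):
            durations[i] *= rate_scale
            volumes[i] += volume_gain_db
    # Fuse the adjusted and default characteristics into one waveform
    return tts_model.vocode(phonemes, durations, volumes)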
In an optional embodiment of this specification, after obtaining the speech of the text to be processed output by the pre-trained speech synthesis model in step S106 in fig. 1, the specified frequency band of the speech of the text to be processed may be further adjusted according to a gain coefficient of the specified frequency band, which is specifically implemented in the following manner.
First, the specified frequency band corresponding to the speech of the text to be processed is acquired. The specified frequency band is a speech band to which the human ear is highly sensitive, determined according to the auditory characteristics of the human ear. Using an equalizer, the portion of the speech falling within this band is adjusted according to the band's gain coefficient, which markedly improves the clarity and intelligibility of the speech in that band. The adjustment is performed after the speech of the text has been determined because completing the compensation in a post-processing stage improves the synthesized sound quality without affecting the quality of synthesis by the pre-trained speech synthesis model.
Then, according to the acquired specified frequency band, determining a gain coefficient of the specified frequency band corresponding to the voice of the text to be processed;
specifically, the gain coefficient of the designated frequency band is used for enhancing the strength of the voice signal of the designated frequency band. Since the frequency in the sound wave determines the level of the sound tone. The actual speech is not a wave of a single frequency, but a superposition of waves of various frequencies, thereby forming distinctive speech. The timbre of sound differs in that the sound signals of different frequencies have different intensities. And the equalizer is implemented according to this principle. In the embodiment of the present specification, the specified frequency band determined according to the auditory characteristics of the human ear is a frequency band from 2kHz to 4kHz, and the speech of the text to be processed in the frequency band is adjusted, so that the definition and intelligibility of the speech in the specified frequency band can be enhanced.
Alternatively, a user-specified timbre effect, such as a male voice or a female voice, may be obtained while responding to the user's request for the speech of the text to be processed, and the gain coefficient used by the equalizer to adjust the specified frequency band may be determined according to the frequency-band characteristics of the user-specified timbre effect.
Finally, the specified frequency band of the speech of the text to be processed is adjusted according to the gain coefficient to obtain the target speech of the text to be processed.
After the speech of the text to be processed is determined, sound effect post-processing is applied to it with an equalizer. On the basis of preserving the sound quality of the synthesized speech output by the speech synthesis model, the specified frequency band of the speech is adjusted so that the speech heard by the user is closer to smooth, natural human speech.
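As a purely illustrative sketch of this post-processing step (not the patented implementation), the snippet below boosts the 2 kHz-4 kHz band of a synthesized waveform by a gain coefficient using a simple band-pass filter; it assumes numpy and scipy are available, and the function name boost_band is hypothetical.

```python
# Illustrative equalizer-style post-processing: raise the 2 kHz-4 kHz band
# of the synthesized speech by a gain coefficient while leaving the rest of
# the spectrum essentially unchanged.
import numpy as np
from scipy.signal import butter, lfilter

def boost_band(waveform, sample_rate, low_hz=2000.0, high_hz=4000.0, gain=1.5):
    # 4th-order Butterworth band-pass isolating the specified frequency band.
    b, a = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate)
    band = lfilter(b, a, waveform)
    # Adding (gain - 1) times the band-passed copy boosts that band only.
    boosted = waveform + (gain - 1.0) * band
    peak = np.max(np.abs(boosted))
    return boosted / peak if peak > 1.0 else boosted  # avoid clipping

# Example: post-process one second of a (placeholder) waveform at 22.05 kHz.
speech = np.random.randn(22050).astype(np.float32) * 0.1
target_speech = boost_band(speech, sample_rate=22050, gain=1.5)
```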
Based on the same idea, one or more embodiments of the present specification further provide a speech synthesis apparatus corresponding to the speech synthesis method described above, as shown in fig. 3.
Fig. 3 is a schematic diagram of a speech synthesis apparatus provided in this specification, specifically including:
a to-be-processed text acquisition module 300, configured to acquire a to-be-processed text;
a target object determining module 302, configured to determine, in response to a request from a user to obtain the voice of the text to be processed, each target object contained in the text to be processed according to the characteristic information of the text to be processed;
a sound effect parameter determining module 304, configured to determine, according to the target objects, sound effect parameters of the target objects;
a voice determining module 306, configured to determine, according to the sound effect parameters of each target object and a pre-trained speech synthesis model, the speech of the text to be processed containing each target object.
Optionally, the target object determining module 302 is specifically configured to obtain a plurality of preset target object identification templates; search, according to the characteristic information of the text to be processed, whether any of the plurality of preset target object identification templates matches the text to be processed; and if so, determine each target object contained in the text to be processed according to the preset target object identification template matched with the text to be processed.
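By way of example only, template-based identification could be sketched as below; the two regular-expression templates are invented placeholders standing in for whatever preset target object identification templates the business scenario actually defines.

```python
# Illustrative template matching; the templates below are invented examples,
# not templates disclosed in this specification.
import re

# Each preset target object identification template is a regular expression
# whose named group "target" marks the span to be emphasized.
PRESET_TEMPLATES = [
    re.compile(r"your order will arrive at (?P<target>[\w: ]+)"),
    re.compile(r"please pick up your parcel at (?P<target>[\w ]+)"),
]

def match_target_objects(text):
    targets = []
    for template in PRESET_TEMPLATES:
        match = template.search(text)
        if match:  # a matching preset template exists
            targets.append(match.group("target"))
    return targets

print(match_target_objects("your order will arrive at 12:30"))  # ['12:30']
```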
Optionally, the target object determining module 302 is specifically configured to perform word segmentation on the text to be processed to obtain the words contained in the text to be processed; acquire a feature vector corresponding to each word contained in the text to be processed; and input the feature vectors corresponding to the words contained in the text to be processed into a pre-trained named entity recognition model, taking each named entity output by the pre-trained named entity recognition model as a target object contained in the text to be processed.
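The following sketch illustrates this route under stated assumptions: jieba is used for word segmentation, while embed and ner_model stand in for the feature-vector lookup and the pre-trained named entity recognition model, whose concrete interfaces are not specified here.

```python
# Illustrative sketch: word segmentation with jieba, then a pre-trained
# named entity recognition model over per-word feature vectors. `embed`
# and `ner_model` are hypothetical stand-ins, not real APIs.
import jieba

def extract_target_objects(text, embed, ner_model):
    words = jieba.lcut(text)              # word segmentation
    features = [embed(w) for w in words]  # one feature vector per word
    tags = ner_model.predict(features)    # e.g. one BIO tag per word
    # Words tagged as (part of) a named entity become target objects;
    # in practice, contiguous tagged words would be merged into one entity.
    return [w for w, tag in zip(words, tags) if tag != "O"]
```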
Optionally, the sound effect parameter determining module 304 is specifically configured to acquire the characteristic information of the user; determine a first volume parameter of the text to be processed according to the characteristic information of the user; determine a second volume parameter and a speech rate parameter of each target object according to the characteristic information of each target object; and determine the sound effect parameters of each target object according to the second volume parameter and the speech rate parameter of each target object and the first volume parameter of the text to be processed.
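Purely as an illustration of how such parameters might be organized (the thresholds and numeric values below are assumptions, not values disclosed in this specification):

```python
# Purely illustrative parameter selection; every threshold and value here
# is an assumption, not a value taken from this specification.
def determine_sound_effect_params(user_age, target_objects):
    # First volume parameter for the whole text, from the user's features
    # (e.g. an older user gets a louder default reading).
    first_volume = 1.2 if user_age >= 60 else 1.0
    params = {}
    for obj in target_objects:
        params[obj] = {
            "volume": first_volume * 1.3,  # second volume parameter
            "speech_rate": 0.8,            # < 1.0 means slower than default
        }
    return first_volume, params
```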
Optionally, the speech determining module 306 is specifically configured to acquire an original phoneme sequence corresponding to the text to be processed; determine a target phoneme sequence corresponding to the text to be processed according to the original phoneme sequence and the sound effect parameters of each target object contained in the text to be processed; and input the target phoneme sequence corresponding to the text to be processed into the pre-trained speech synthesis model to obtain the speech of the text to be processed output by the pre-trained speech synthesis model.
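A minimal sketch of building the target phoneme sequence, assuming the original phoneme sequence is available as (phoneme, word) pairs and the sound effect parameters are keyed by target object; the data layout is an assumption.

```python
# Illustrative construction of the target phoneme sequence: attach the sound
# effect parameters of each target object to the phonemes it covers in the
# original phoneme sequence.
def build_target_phoneme_sequence(original_phonemes, target_params):
    """original_phonemes: list of (phoneme, word) pairs.
    target_params: dict mapping a target object to its parameters."""
    target_sequence = []
    for phoneme, word in original_phonemes:
        effect = target_params.get(word)  # None for non-target words
        target_sequence.append({
            "phoneme": phoneme,
            "volume": effect["volume"] if effect else 1.0,
            "speech_rate": effect["speech_rate"] if effect else 1.0,
        })
    return target_sequence
```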
Optionally, the speech determining module 306 is specifically configured to input the target phoneme sequence corresponding to the text to be processed into the pre-trained speech synthesis model to obtain a first predicted duration and a first predicted volume corresponding to the speech of each target object in the text to be processed output by the pre-trained speech synthesis model, and a second predicted duration and a second predicted volume corresponding to the speech of the text to be processed; and determine the speech corresponding to the text to be processed according to the first predicted duration and the first predicted volume corresponding to the speech of each target object and the second predicted duration and the second predicted volume corresponding to the speech of the text to be processed.
Optionally, the voice determining module 306 is further configured to, after inputting the feature information of the text to be processed and the sound effect parameters of each target object into the pre-trained speech synthesis model to obtain the speech of the text to be processed output by the model, acquire the specified frequency band corresponding to the speech of the text to be processed; determine, according to the acquired specified frequency band, the gain coefficient of the specified frequency band corresponding to the speech of the text to be processed; and adjust the specified frequency band of the speech of the text to be processed according to the gain coefficient to obtain the target speech of the text to be processed.
The present specification also provides a computer-readable storage medium storing a computer program operable to execute the speech synthesis method provided in fig. 1 above.
This specification also provides a schematic block diagram of the electronic device shown in fig. 4. As shown in fig. 4, at the hardware level the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may further include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the speech synthesis method described in fig. 1 above. Of course, besides a software implementation, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology has developed, many of today's improvements in method flows can be regarded as direct improvements in hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement in a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a particular programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by that (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: the ARC 625D, the Atmel AT91SAM, the Microchip PIC18F26K20, and the Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, respectively. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile memory in a computer-readable medium, such as a random access memory (RAM), and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the system embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a text to be processed;
responding to a request of a user for acquiring voice of the text to be processed, and determining each target object contained in the text to be processed according to the characteristic information of the text to be processed;
determining sound effect parameters of the target objects according to the target objects;
and determining the voice of the text to be processed containing each target object according to the sound effect parameters of each target object and a pre-trained voice synthesis model.
2. The method according to claim 1, wherein determining each target object included in the text to be processed according to the feature information of the text to be processed specifically comprises:
acquiring a plurality of preset target object identification templates;
searching whether a preset target object identification template matched with the text to be processed exists in the plurality of preset target object identification templates or not according to the characteristic information of the text to be processed;
if yes, determining each target object contained in the text to be processed according to a preset target object identification template matched with the text to be processed.
3. The method according to claim 1, wherein determining each target object included in the text to be processed according to the feature information of the text to be processed specifically includes:
performing word segmentation on the text to be processed to obtain each word contained in the text to be processed;
acquiring a feature vector corresponding to each word contained in the text to be processed;
and inputting the feature vectors corresponding to the words contained in the text to be processed into a pre-trained named entity recognition model, and obtaining the named entities contained in the text to be processed and output by the pre-trained named entity recognition model as the target objects contained in the text to be processed.
4. The method of claim 1, wherein determining the sound effect parameters of each target object according to each target object specifically comprises:
acquiring characteristic information of the user;
determining a first volume parameter of the text to be processed according to the characteristic information of the user;
determining a second volume parameter and a speech rate parameter of each target object according to each target object;
and determining the sound effect parameters of each target object according to the second volume parameter and the speech speed parameter of each target object and the first volume parameter of the text to be processed.
5. The method according to claim 1, wherein determining the speech including the text to be processed of each target object according to the sound effect parameters of each target object and a pre-trained speech synthesis model specifically comprises:
acquiring an original phoneme sequence corresponding to the text to be processed;
determining a target phoneme sequence corresponding to the text to be processed according to the original phoneme sequence corresponding to the text to be processed and the sound effect parameters of each target object contained in the text to be processed;
and inputting the target phoneme sequence corresponding to the text to be processed into a pre-trained speech synthesis model to obtain the speech of the text to be processed output by the pre-trained speech synthesis model.
6. The method according to claim 5, wherein inputting the target phoneme sequence corresponding to the text to be processed into a pre-trained speech synthesis model to obtain the speech of the text to be processed output by the pre-trained speech synthesis model specifically comprises:
inputting the target phoneme sequence corresponding to the text to be processed into a pre-trained speech synthesis model to obtain a first predicted duration and a first predicted volume corresponding to the speech of each target object in the text to be processed output by the pre-trained speech synthesis model and a second predicted duration and a second predicted volume corresponding to the speech of the text to be processed;
and determining the voice corresponding to the text to be processed according to the first predicted duration and the first predicted volume corresponding to the voice of each target object and the second predicted duration and the second predicted volume corresponding to the voice of the text to be processed.
7. The method of claim 1, wherein after inputting the sound effect parameters of each target object into the pre-trained speech synthesis model to obtain the speech of the text to be processed containing each target object output by the pre-trained speech synthesis model, the method further comprises:
acquiring a specified frequency band corresponding to the voice of the text to be processed;
determining a gain coefficient of the specified frequency band corresponding to the voice of the text to be processed according to the specified frequency band;
and adjusting the specified frequency band of the voice of the text to be processed according to the gain coefficient to obtain the target voice of the text to be processed.
8. A speech synthesis apparatus, comprising:
the text to be processed acquisition module is used for acquiring a text to be processed;
the target object determining module is used for responding to a request of a user for acquiring the voice of the text to be processed and determining each target object contained in the text to be processed according to the characteristic information of the text to be processed;
the sound effect parameter determining module is used for determining the sound effect parameters of the target objects according to the target objects;
and the voice determining module is used for determining the voice of the text to be processed containing each target object according to the sound effect parameters of each target object and a pre-trained voice synthesis model.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the program.
CN202210425994.6A 2022-04-21 2022-04-21 Voice synthesis method, device, equipment and storage medium Pending CN114937446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425994.6A CN114937446A (en) 2022-04-21 2022-04-21 Voice synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210425994.6A CN114937446A (en) 2022-04-21 2022-04-21 Voice synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114937446A true CN114937446A (en) 2022-08-23

Family

ID=82861602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425994.6A Pending CN114937446A (en) 2022-04-21 2022-04-21 Voice synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114937446A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination