CN114664283A - Text processing method in speech synthesis and electronic equipment - Google Patents

Text processing method in speech synthesis and electronic equipment Download PDF

Info

Publication number
CN114664283A
CN114664283A · Application CN202210193309.1A (CN202210193309A)
Authority
CN
China
Prior art keywords
target
sound
text content
sound material
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210193309.1A
Other languages
Chinese (zh)
Inventor
包鑫彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210193309.1A priority Critical patent/CN114664283A/en
Publication of CN114664283A publication Critical patent/CN114664283A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/2866 - Architectures; Arrangements
    • H04L 67/30 - Profiles
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 - Processing of audio elementary streams
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 - End-user applications
    • H04N 21/488 - Data services, e.g. news ticker
    • H04N 21/4882 - Data services, e.g. news ticker for displaying messages, e.g. warnings, reminders

Abstract

Embodiments of the present application disclose a text processing method in speech synthesis and an electronic device. The method includes: determining text content that requires speech synthesis; determining a target position in the text content where sound expressiveness needs to be enhanced, and a target sound material for enhancing sound expressiveness at the target position; and inserting the target sound material at the target position for playback during conversion of the text content into a speech synthesis result. With the embodiments of the present application, the sound expressiveness of the overall speech synthesis result can be enhanced.

Description

Text processing method in speech synthesis and electronic equipment
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a text processing method and an electronic device in speech synthesis.
Background
In many application systems, there are applications that interact with users through artificial intelligence (which may have an avatar, such as a virtual character or a virtual animal). For example, in a live-streaming application of a commodity information service system, a commodity is explained by voice through a "virtual host", and so on.
During voice broadcasting through artificial intelligence, text content (which may be referred to as a "script") is generally prepared in advance and then converted into a speech signal through TTS (Text To Speech) technology for broadcasting. However, if the conversion from text to speech is performed directly by a simple TTS technique, expressiveness is poor: the broadcast sound is flat, lacks emotional expression, has unnatural sentence breaks, and so on.
In order to improve expressiveness during voice broadcasting, some TTS optimization algorithms exist in the prior art. These algorithms can be used for prosody prediction so that sentence breaks are more natural during broadcasting, or can model certain linguistic features to make the synthesized speech sound closer to a real person, and so on.
Although existing TTS optimization algorithms can improve sound expressiveness in voice broadcasting to a certain extent, there is still room for improvement.
Disclosure of Invention
The application provides a text processing method in speech synthesis and electronic equipment, which can enhance the sound expressive force of the whole speech synthesis result.
The present application provides the following:
a method of text processing in speech synthesis, comprising:
determining text content needing voice synthesis;
determining a target location in the text content where enhanced sound expression is required and a target sound material for sound expression enhancement at the target location;
and inserting the target sound material into the target position for playing in the process of converting the text content into a voice synthesis result.
The target sound material is obtained by recording a real person reading a target word, phrase or sentence in a target scene.
Wherein, still include:
reading a configuration file, wherein the configuration file comprises: matching rules and insertion position rule information corresponding to the plurality of sound materials respectively;
the determining a target location in the text content for which enhanced sound expression is desired and a target sound material for sound expression enhancement at the target location includes:
dividing the text content into a plurality of text segments;
and judging whether the text fragment accords with a matching rule corresponding to a certain sound material, if so, determining the target position according to the insertion position rule information corresponding to the sound material, and determining the sound material as the target sound material.
Wherein the matching rule comprises a keyword and/or a regular expression;
the judging whether the text segment meets the matching rule corresponding to a certain sound material includes:
and if the text segment comprises keywords corresponding to a certain sound material and/or accords with a corresponding regular expression, determining that the text segment accords with a matching rule corresponding to the sound material.
Wherein, the configuration file further comprises: loudness and/or pause duration information corresponding to the plurality of sound materials, respectively.
Wherein, the configuration file further comprises: and the rule effective probability information corresponding to the plurality of sound materials is used for controlling the insertion frequency of the sound materials.
Wherein, after determining a target location in the text content for which an enhancement of sound expression is required and a target sound material for sound expression enhancement at the target location, the method further comprises:
adding an expression label at the target position of the text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
the inserting the target sound material at the target location comprises:
and in the process of converting the text content into a voice synthesis result, loading the target sound material according to the expression label, and replacing the target sound material at the target position for playing.
The target voice material is stored in a server;
the loading the target sound material according to the expression label and replacing the target sound material to the target position for playing comprises the following steps:
and loading the target sound material from the server according to the expression label, and replacing the target sound material to the target position for playing.
Wherein the text content comprises: explanation text content generated for a commodity that needs to be explained by voice through an avatar.
A method for commodity explanation through an avatar, comprising:
determining a commodity needing voice explanation through a virtual image, and generating explanation text contents for the commodity;
determining a target position in the explanation text content, wherein the sound expressive force needs to be enhanced, and a target sound material used for enhancing the sound expressive force at the target position;
adding an expression label at the target position in the explanation text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
and in the process of converting the explanation text content into a voice synthesis result, loading the target sound material according to the expression label, and replacing the target sound material to the target position for playing.
Wherein, the information that the expression label carried still includes: and the loudness and/or pause duration information corresponding to the target sound material respectively is used for controlling the loudness and/or pause duration when the target sound material is played.
A method of live broadcasting through an avatar, comprising:
determining text content needing live broadcasting through the virtual image;
determining a target location in the text content for which enhanced sound expression is desired, and a target sound material for sound expression enhancement at the target location;
adding an expression label at the target position in the text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
and in the process of converting the text content into a voice synthesis result, loading the target sound material according to the expression label, and replacing the target sound material at the target position for playing.
A text processing apparatus in speech synthesis, comprising:
the text content determining unit is used for determining the text content needing voice synthesis;
a position and material determining unit for determining a target position in the text content where the sound expression needs to be enhanced and a target sound material for sound expression enhancement at the target position;
and the inserting unit is used for inserting the target sound material into the target position for playing in the process of converting the text content into a voice synthesis result.
An apparatus for commodity explanation through an avatar, comprising:
the system comprises a text content generating unit, a voice analyzing unit and a voice analyzing unit, wherein the text content generating unit is used for determining a commodity needing voice explanation through a virtual image and generating explanation text content for the commodity;
a position and material determination unit for determining a target position where the sound expression is to be enhanced in the lecture text content, and a target sound material for sound expression enhancement at the target position;
an emotion tag adding unit, configured to add an emotion tag at the target location in the explanation text content, where information carried by the emotion tag at least includes: identification information of the target sound material;
and the replacing unit is used for loading the target sound material from a server according to the expression label and replacing the target sound material to the target position for playing in the process of converting the explanation text content into a voice synthesis result.
An apparatus for live broadcasting through an avatar, comprising:
the text content determining unit is used for determining text content needing live broadcast through the virtual image;
a position and material determination unit for determining a target position in the text content at which the sound expression is to be enhanced, and a target sound material for sound expression enhancement at the target position;
an expression label adding unit, configured to add an expression label at the target position in the text content, where information carried by the expression label at least includes: identification information of the target sound material;
and the replacing unit is used for loading the target sound material according to the expression label and replacing the target sound material to the target position for playing in the process of converting the text content into a voice synthesis result.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the preceding claims.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding claims.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
according to the embodiment of the application, some sound materials related to enhancing the sound expressive force can be obtained in advance, and for the text content needing to be subjected to speech synthesis, before the speech synthesis is carried out, the target position needing to be subjected to sound expressive force enhancement in the text content and the target sound material capable of being subjected to sound expressive force enhancement at the position can be judged. In this way, such target sound material can be inserted at a specific target position at the time of speech synthesis to achieve enhancement of sound expression of the overall speech synthesis result.
Since the content of the sound material is related to tones determined in advance (such as praise, guiding an operation, attracting attention, or emphasis) and can be read and recorded by a real person, it has stronger and more natural expressiveness, so that the voice broadcast is more emotional, has more rise and fall and a sense of breathing, and is more vivid.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flow chart of a first method provided by an embodiment of the present application;
FIG. 3 is a diagram of emoticon addition results provided in an embodiment of the present application;
FIG. 4 is a flow chart of a second method provided by embodiments of the present application;
FIG. 5 is a flow chart of a third method provided by embodiments of the present application;
FIG. 6 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a third apparatus provided by an embodiment of the present application;
fig. 9 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort fall within the scope of the present disclosure.
In the embodiment of the present application, in order to enhance the sound expression in the speech synthesis process, some sound materials (for example, sound materials may be obtained by human recording in advance) may be prepared in advance, and when some text content needs to be subjected to speech synthesis, a target position where the sound expression needs to be enhanced and a target sound material that can act to enhance the sound expression at the target position may be determined in such text content. In this way, in the process of converting text content into a speech synthesis result, the target sound material can be inserted at such a target position for playing, so that the sound expression in the speech synthesis process is enhanced by the inserted target sound material.
That is to say, the way of enhancing sound expressiveness provided by the embodiment of the present application is not limited to optimizing the parameters (including prosody, tone, etc.) of the speech synthesis result converted from the text content; instead, it enhances the sound expressiveness of the speech synthesis result by inserting some pre-prepared sound materials into it. In this way, natural sound expressions that are difficult to reproduce through TTS optimization alone can be obtained, and the voice broadcast becomes emotional, with more "rise and fall" and a "sense of breathing", vivid and lively. Since such a sound material is inserted into the speech synthesis result to enhance voice expression, playing a role similar to an "emoticon" inserted into the text of a chat conversation, it can also vividly be called a "sound expression", and multiple sound materials with the same characteristics can be combined into a "sound expression package", and so on.
In specific implementation, in order to achieve the above purpose, some short texts that can be used to enhance sound expressiveness may be collected in advance at the granularity of words or phrases, for example "ha ha", "baking pan", "wa", "here I come", and so on. Real persons can then read these short texts aloud; the readers are told information such as the scene each short text corresponds to, so that the tone of the reading is better suited to enhancing sound expressiveness in that scene. Recording can be carried out while reading, so that the recording result of a short text corresponds to one sound material, that is, one "sound expression", and multiple different "sound expressions" from the same reader can form a "sound expression package". Of course, multiple different "sound expressions" from the same reader may also form several different "sound expression packages", and so on. In addition, the same short text can be read by different readers, producing even more different "sound expressions", and so on.
The readers may correspond to the real persons simulated in the speech synthesis process. That is, in order to make the speech uttered by the artificial intelligence closer to a real person's utterance, voice packets of real persons may be used for simulation. For example, if voice packets of three real persons A, B and C are prepared in advance, one of them can be selected for the speech synthesis process; if the voice packet of person A is selected, the synthesis simulates the voice of person A, and so on. In that case, the "sound expressions" described in the embodiment of the present application may also be recorded separately by persons A, B, C, etc., so that after a voice packet is selected for speech synthesis, the "sound expressions" recorded by that same person can be used to enhance the expressiveness of the result, keeping acoustic characteristics such as timbre consistent between the synthesis result and the inserted "sound expressions".
From the perspective of system architecture, as shown in fig. 1, in one manner, since text content for voice broadcast may be generally automatically generated in advance, the embodiment of the present application may provide a service that may automatically add a "sound expression" to known text content, so that an application may add a specific "sound expression" to a target location of the text content by invoking the service before speech synthesis is required for a certain known text content (at this time, since a "sound expression" is added to the text content, the specific sound expression may specifically exist in the form of an emoji tag (similar to a placeholder), and the specific emoji tag may carry information such as an identifier of the "sound expression"). For example, for a product that needs to be played through an avatar, a "scenario" corresponding to the product may be prepared in advance (may be automatically generated through a related algorithm), and during the voice broadcast process, the "scenario" may be voice-synthesized to be converted into an audio content for play and output. After the text content is prepared, the service of automatically adding the "sound expression" provided by the embodiment of the application can be called, so as to determine which "sound expression" is added at what position. Subsequently, when the text content is specifically subjected to speech synthesis, if the position with the expression label is met, the corresponding "sound expression" can be replaced, so that the specific "sound expression" can be inserted into the speech synthesis result, and the sound expressive force of the speech synthesis result is improved.
In this way, as shown in fig. 1, the audio content corresponding to the specific "sound emoticon" may be stored in a server (which may be a cloud server provided by a cloud service provider), and when performing speech synthesis specifically, the audio content corresponding to the "sound emoticon" corresponding to the specific emoticon tag may be loaded from the server. In addition, in order to determine what position in the specific text content is suitable for adding what kind of "sound expression", an algorithm model may be established in advance, or a configuration file may be provided, in which matching rules such as specific keywords and/or regular expressions and specific position matching rules are provided (for example, a specific "sound expression" may be added to a sentence beginning, a sentence middle, or a sentence end). In addition, the loudness, the pause time, the rule effective probability and the like of the specific 'sound expression' can be configured in the configuration file. The loudness can be used for controlling the strength of specific tone expression, the pause time is mainly used for controlling the pause duration of playing specific 'voice expression', and the rule effective probability can be used for controlling the occurrence probability of specific 'voice expression' so as to control the fatigue degree, and the like. The configuration file has a small volume, so that the configuration file can be directly stored in a server which specifically provides a function of automatically adding the "sound expression" (different from the server for storing the "sound expression package"), when the "sound expression" needs to be added to the specific text content, the specific configuration file can be read from the server, and the position of the "sound expression" needing to be added to the specific text content and the specific "sound expression" needing to be added are determined according to the matching rule configured in the configuration file, and the like.
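For concreteness, the following is a minimal sketch of what such a configuration file might contain, expressed as a Python dictionary. The field names (keywords, regex, position, loudness_db, pause_ms, probability) and the example values are illustrative assumptions introduced here; the patent does not specify a concrete file format.

```python
# Illustrative sketch only: the field names, structure, and URL are assumptions,
# not the actual configuration format used by this application.
SOUND_EXPRESSION_CONFIG = {
    "materials": [
        {
            "id": "hello_hello",               # identifier of the sound material ("sound expression")
            "audio_url": "https://example.com/materials/hello_hello.wav",  # hypothetical storage location
            "keywords": ["hello everyone", "here I come"],  # keyword matching rule
            "regex": r"^(?!hello hello).*",    # optional regular-expression matching rule
            "position": "sentence_start",      # insertion-position rule: start / middle / end of segment
            "loudness_db": -3.0,               # playback loudness relative to the synthesized speech
            "pause_ms": 200,                   # pause after the material finishes playing
            "probability": 0.5,                # rule effective probability, to limit insertion frequency
        },
    ]
}
```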
Alternatively, in another implementation, if text content editing such as "script" is required by manual writing, a "sound expression" component may be provided in the "script" editing tool, so that the user may choose to insert a specific "sound expression" at a position during editing of the "script". Or, the above-mentioned "sound expression" component may also be provided in the process of manually checking and confirming text contents such as automatically generated "script", and the like. In the above manner, a function of downloading the "sound expression package" can be provided, and the user can download the "sound expression package" which is interested by the user to the local terminal equipment where the client is located, so that the "sound expression" can be added at a proper position of the text content. At this time, the "sound expression" added to the text content can also be embodied in the form of an expression label, and when speech synthesis is specifically performed, a specific sound expression can be locally loaded and played and output according to information such as a sound expression identifier associated with the specific expression label. Of course, the operation of downloading the "voice expression package" may not be executed, and at this time, a specific voice expression may be loaded from the server in the process of speech synthesis, and so on.
The following describes specific implementations provided in embodiments of the present application in detail.
Example one
First, the embodiment provides a text processing method in speech synthesis, and referring to fig. 2, the method may specifically include:
s201: text content that needs to be speech synthesized is determined.
The text content that needs speech synthesis is typically a "script" to be broadcast by voice through artificial intelligence such as a virtual character. In practical applications, it may be an explanation script automatically generated from the detailed information of the commodity to be explained, and so on. For example, during a merchant user's live broadcast, if the human anchor needs to step away for a while, the avatar may take over the voice broadcast; or, if the merchant user has no human anchor at all, the avatar may broadcast the commodity directly. After the commodity to be broadcast and the corresponding "script" are determined, and before the "script" is actually converted into speech, sound materials for enhancing sound expressiveness can be inserted at appropriate positions in the "script" through the service provided by the embodiment of the present application. At this point, such a "script" becomes the text content that needs speech synthesis described in the embodiment of the present application. It should be noted that, when a commodity is broadcast by voice through an avatar, the merchant end may provide the merchant user with an option to enable the service of automatically adding "sound expressions", and the merchant can select it if the service is needed. When the service is selected, since the specific "script" may be generated at the "script" server, the "script" server may invoke the service of automatically adding "sound expressions" provided in the embodiment of the present application (which may also be deployed on another server), and speech synthesis is performed after the specific "sound expressions" have been inserted.
S202: a target location in the textual content for which enhanced sound expression is desired is determined, as well as target sound material for sound expression enhancement at the target location.
The sound materials may specifically be audio files obtained by pre-recording, for example by recording a real person reading short texts such as words and phrases in a given scene. These short texts may express moods such as praise, guiding a certain action, attracting attention, or emphasis, so the specific tone words may include general tone words, emotional tone words, intentional tone words, and so on. In addition, tone words may be customized by merchants and the like. For example, "hello hello", "small pan" and "coming" may be used; by having a real person read them in a specific scene, a natural expression of the sound is obtained, which fully enhances sound expressiveness. Of course, since short texts of this nature can already play a role in enhancing expression, in some cases, for purposes such as reducing cost, the specific short text may instead be converted into speech directly by speech synthesis and recorded, and so on.
Specifically, since the specific situations of different text contents are very different, the specific locations at which the enhancement of the sound expressive force is required are also different, and in addition, the specific sound materials are also multiple, and the locations at which the sound materials are required to be inserted are also different. Thus, after the specific text content is determined, the target position in the text content can be determined first, and the specific position indicates which specific sound material needs to be added.
The target position may refer to the beginning, middle or end of a sentence or paragraph in the text content, or the like. There are many implementations for specifically determining the target location and which sound material needs to be inserted at the target location. For example, for the service of automatically adding "voice expression" shown in fig. 1, the determination may be performed by a pre-trained algorithm model, or may also be performed by a specific matching rule configured in a pre-generated configuration file, or the like. For the way of manually entering or editing the specific text content, the determination may be made according to the user-specified location and the identification of the selected "voice expression", and so on.
In an implementation manner that the determination is performed by the matching rule configured in the configuration file, the specific configuration file may at least include the matching rule and the insertion position rule information corresponding to each sound material. In this way, particularly when the above-described target position and target sound material are determined, the text content may be first divided into a plurality of text segments, for example, may be divided in units of sentences, or the like. Then, whether the text segment meets a matching rule corresponding to a certain sound material or not can be judged, if yes, the target position is determined according to the insertion position rule information corresponding to the sound material, and the sound material is determined as the target sound material. That is, a particular text content may be divided into segments, and then based on a particular profile, it may be determined which segments require enhanced sound expression and at what location with which sound material to enhance.
In a specific implementation, the matching rule corresponding to a specific sound material may include a keyword and/or a regular expression, so that, specifically, when performing matching judgment, it may be judged whether a specific text segment includes a keyword corresponding to a certain sound material and/or conforms to a corresponding regular expression, and if so, it may be determined that the text segment conforms to the matching rule corresponding to the sound material. In addition, according to the insertion position rule corresponding to the sound material, it can be determined whether the sound material needs to be inserted at the beginning, the middle or the end of a specific text segment. For example, the content of a certain sound material is "hello hello", and the matching rule corresponding to the sound material in the configuration file is: the text segment comprises a keyword 'I Lala', and the beginning position of the segment does not comprise 'hello hello', so that the text segment conforms to the matching rule of the 'hello hello'; moreover, the insertion position rule corresponding to the sound material is as follows: inserted at the beginning of the segment, then "hello hello" can be inserted at the beginning of the text segment, and so on. Of course, in addition to the foregoing matching by the keyword and/or the regular expression, the determination may be performed by natural language understanding of the text segment, and the like.
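As an illustration of the matching judgment described above, the sketch below checks a text segment against keyword and regular-expression rules and, on a match, returns the insertion-position rule. It assumes the illustrative configuration fields sketched earlier and is not the actual implementation of the embodiment.

```python
import re

def match_material(segment: str, material: dict):
    """Return the insertion position if `segment` matches `material`'s rules, else None.

    A minimal sketch assuming the illustrative fields (keywords, regex, position);
    the real matching logic may differ, e.g. it could use natural language understanding.
    """
    keyword_hit = any(kw in segment for kw in material.get("keywords", []))
    regex_hit = bool(re.search(material["regex"], segment)) if material.get("regex") else False
    if keyword_hit or regex_hit:
        return material.get("position", "sentence_start")
    return None
```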
In addition, in an optional manner, the specific configuration file may further include loudness and/or pause duration information corresponding to the sound materials, respectively. In this way it is possible to determine, in addition to which sound material is inserted at which position of the text segment, at which loudness the playback is to take place, how long a pause is needed after the end of the playback, etc.
Further, if multiple text segments in the same text content successfully match the same sound material, directly inserting the material into all of them means the user may hear the same sound material repeatedly within a short time, causing listening fatigue and making it difficult to actually enhance sound expressiveness. Therefore, in an optional manner, the specific configuration file may further include: rule effective probability information corresponding to each of the plurality of sound materials, which is used to control the insertion frequency of the sound materials.
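One simple way to realize such a rule effective probability is to gate each successful match with a random draw, so that a frequently matched sound material is inserted only part of the time. The helper below is a sketch under that assumption, using the illustrative `probability` field named earlier.

```python
import random

def should_insert(material: dict) -> bool:
    """Gate insertion by the material's rule effective probability (assumed field name)."""
    return random.random() < material.get("probability", 1.0)
```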
S203: and inserting the target sound material into the target position for playing in the process of converting the text content into a voice synthesis result.
Since the sound material is inserted into some target positions in the specific text content, the target sound material can be inserted into the target positions for playing in the process of converting the text content into the speech synthesis result. The original text content is converted into voice through a voice synthesis algorithm for playing, and the content of the inserted sound material may be related to the determined tone of "lyrics praise", "guide to perform a certain operation", "attract attention", "emphasize", etc., and may be read and recorded by a real person in advance, so that the original text content may have stronger and more natural expressive force, thereby enhancing the voice expressive force of the voice synthesis result as a whole.
In specific implementation, since a specific "sound expression" is added to the text content, for ease of implementation an expression tag may be added at the target position of the text content, and the information carried by the expression tag may at least include: identification information of the target sound material. In specific implementation, the text corresponding to the sound material can be used directly as the expression tag. For example, assume a text segment is: "Babies, hello everyone, here I come!" When judging according to the matching rule corresponding to a specific sound material, since this segment is located at the beginning of the text content and includes keywords such as "hello everyone" and "here I come", it conforms to the matching rule corresponding to "hello hello", and the specific target position can be the beginning of the sentence; therefore, as shown in Fig. 3, the "hello hello" sound material can be inserted at the beginning of this segment. In the example shown in Fig. 3, "hello hello" is shown as an expression tag. That is, when performing speech synthesis, "hello hello" does not need to be synthesized; instead, before the speech synthesis result corresponding to "Babies, hello everyone, here I come!" is output, the pre-recorded audio file corresponding to the material is loaded and played. Because this content can be pre-recorded by a real person reading it rather than artificially synthesized, its sound expression is more natural. Of course, other text segments can be processed similarly, with other sound materials inserted at positions such as the middle or end of a sentence (the content in the black rectangular boxes in Fig. 3 belongs to the sound-material expression tags inserted in the embodiment of the present application), and so on.
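The tag-insertion step can be pictured as a plain text transformation in which a placeholder carrying the material identifier is written into the "script" at the target position. The angle-bracket tag syntax and helper below are hypothetical illustrations; the embodiment only requires that the tag carry at least the identification information of the target sound material.

```python
def add_expression_tag(segment: str, material_id: str, position: str) -> str:
    """Insert a placeholder tag such as <sound_expr id="hello_hello"/> at the target position.

    The tag syntax is a hypothetical choice; only the idea of a placeholder
    carrying the material identifier comes from the description.
    """
    tag = f'<sound_expr id="{material_id}"/>'
    if position == "sentence_start":
        return f"{tag}{segment}"
    if position == "sentence_end":
        return f"{segment}{tag}"
    return segment  # a "sentence_middle" rule would need finer-grained anchoring

# Example:
# add_expression_tag("Babies, hello everyone, here I come!", "hello_hello", "sentence_start")
# -> '<sound_expr id="hello_hello"/>Babies, hello everyone, here I come!'
```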
That is to say, in the process of converting text content into a speech synthesis result, the target sound material may be loaded according to the emoji tag and inserted into the target position for playing. And if the target voice material is stored in the server, the target voice material can be loaded from the server according to the expression label and inserted into the target position for playing. The specific expression label can also carry parameters such as loudness and pause duration of the sound material, so that the sound material can be played according to the parameters. Of course, if these parameters are not present, playback may be performed at a default loudness, pause duration, etc., and so on.
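At synthesis time, the tagged text can be split on those placeholders: the plain-text pieces go through TTS, and each tag is resolved to its pre-recorded audio (loaded locally or from the server) and spliced in at that point. In the sketch below, `tts()` and `load_material_audio()` are stand-ins for whatever speech-synthesis engine and material storage service are actually used, and the tag syntax is the hypothetical one introduced above.

```python
import re

TAG_PATTERN = re.compile(r'<sound_expr id="([^"]+)"/>')

def synthesize_with_expressions(tagged_text: str, tts, load_material_audio) -> list:
    """Return an ordered list of audio clips: synthesized speech interleaved with sound materials.

    `tts` and `load_material_audio` are placeholder callables standing in for the
    actual TTS engine and the server that stores the sound materials.
    """
    clips = []
    cursor = 0
    for match in TAG_PATTERN.finditer(tagged_text):
        plain = tagged_text[cursor:match.start()]
        if plain.strip():
            clips.append(tts(plain))                       # synthesize the ordinary text
        clips.append(load_material_audio(match.group(1)))  # pre-recorded "sound expression"
        cursor = match.end()
    tail = tagged_text[cursor:]
    if tail.strip():
        clips.append(tts(tail))
    return clips
```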
In summary, with the embodiments of the present application, some sound materials related to enhancing sound expressiveness can be obtained in advance, and for text content requiring speech synthesis, before performing speech synthesis, a target location in the text content requiring sound expressiveness enhancement, and specifically a target sound material at the location for sound expressiveness enhancement, can be determined first. In this way, such target sound material can be inserted at a specific target position at the time of speech synthesis to achieve enhancement of sound expressiveness of the overall speech synthesis result.
Since the content of the sound material is related to tones determined in advance (such as praise, guiding an operation, attracting attention, or emphasis) and can be read and recorded by a real person, it has stronger and more natural expressiveness, so that the voice broadcast is more emotional, has more rise and fall and a sense of breathing, and is more vivid.
Example two
The second embodiment provides a method for explaining commodities through an avatar, referring to fig. 4, for an application of the solution provided in the second embodiment of the present application in a scene where the commodities are explained through the avatar, including:
s401: determining a commodity needing to be subjected to voice explanation through a virtual image, and generating explanation text contents for the commodity;
determining a target position in the explanation text content, wherein the sound expressive force needs to be enhanced, and a target sound material used for enhancing the sound expressive force at the target position;
s402: adding an expression label at the target position in the explanation text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
s403: and in the process of converting the explanation text content into a voice synthesis result, loading the target sound material from a server according to the expression label, and replacing the target sound material to the target position for playing.
Wherein, the information that the expression label carried still includes: and the loudness and/or pause duration information corresponding to the target sound material respectively is used for controlling the loudness and/or pause duration when the target sound material is played.
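If the expression tag also carries loudness and pause-duration information, as described here, the placeholder might look like <sound_expr id="hello_hello" loudness_db="-3.0" pause_ms="200"/>. The attribute names continue the earlier hypothetical syntax and are assumptions; the sketch below only shows how a player could parse such a tag in order to apply the gain and trailing pause when the material is played.

```python
import re

ATTR_TAG = re.compile(
    r'<sound_expr id="(?P<id>[^"]+)"'
    r'(?: loudness_db="(?P<loud>[-\d.]+)")?'
    r'(?: pause_ms="(?P<pause>\d+)")?/>'
)

def parse_expression_tag(tag: str) -> dict:
    """Extract identifier, loudness and pause duration from a hypothetical tag string."""
    m = ATTR_TAG.match(tag)
    return {
        "id": m.group("id"),
        "loudness_db": float(m.group("loud")) if m.group("loud") else None,
        "pause_ms": int(m.group("pause")) if m.group("pause") else None,
    }

# parse_expression_tag('<sound_expr id="hello_hello" loudness_db="-3.0" pause_ms="200"/>')
# -> {'id': 'hello_hello', 'loudness_db': -3.0, 'pause_ms': 200}
```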
EXAMPLE III
In practical application, the scheme provided in the embodiment of the present application may also be applied to more scenes in which live broadcast is performed through an avatar, and is not limited to live broadcast of goods, so a third embodiment of the present application further provides a method for live broadcast through an avatar, referring to fig. 5, where the method may specifically include:
s501: determining text content needing live broadcasting through the virtual image;
the specific text content may be obtained in advance in a variety of ways.
S502: determining a target location in the text content for which enhanced sound expression is desired, and a target sound material for sound expression enhancement at the target location;
s503: adding an expression label at the target position in the text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
s504: and in the process of converting the text content into a voice synthesis result, loading the target sound material according to the expression label, and replacing the target sound material to the target position for playing.
For the parts of the second and third embodiments that are not described in detail, reference may be made to the descriptions of the first embodiment and other parts of this specification, which are not described herein again.
It should be noted that the embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the solution described herein only within the scope permitted by the applicable laws and regulations of the country concerned, and subject to their requirements (for example, with the user's explicit consent, after informing the user, etc.).
Corresponding to the first embodiment, an embodiment of the present application further provides a text processing apparatus in speech synthesis, and referring to fig. 6, the apparatus may include:
a text content determining unit 601, configured to determine a text content that needs to be subjected to speech synthesis;
a location and material determination unit 602, configured to determine a target location in the text content where sound expressiveness enhancement is required, and a target sound material for sound expressiveness enhancement at the target location;
an inserting unit 603, configured to insert the target sound material into the target position for playing in the process of converting the text content into a speech synthesis result.
The target sound material is obtained by recording a real person reading a target word, phrase or sentence in a target scene.
In a specific implementation, the apparatus may further include:
a configuration file reading unit, configured to read a configuration file, where the configuration file includes: matching rules and insertion position rule information corresponding to the plurality of sound materials respectively;
the position and material determining unit may specifically include:
a segment dividing subunit, configured to divide the text content into a plurality of text segments;
and the judging subunit is used for judging whether the text segment conforms to a matching rule corresponding to a certain sound material, if so, determining the target position according to the insertion position rule information corresponding to the sound material, and determining the sound material as the target sound material.
Wherein the matching rule comprises a keyword and/or a regular expression;
the judgment subunit is specifically configured to:
and if the text segment comprises keywords corresponding to a certain sound material and/or accords with a corresponding regular expression, determining that the text segment accords with a matching rule corresponding to the sound material.
Specifically, the configuration file may further include: loudness and/or pause duration information corresponding to the plurality of sound materials, respectively.
In addition, the configuration file may further include: and the rule effective probability information corresponding to the plurality of sound materials is used for controlling the insertion frequency of the sound materials.
In a specific implementation, the apparatus may further include:
an expression label adding unit, configured to add an expression label at the target position of the text content, where information carried by the expression label at least includes: identification information of the target sound material;
the insertion unit may specifically be configured to:
and in the process of converting the text content into a voice synthesis result, loading the target sound material according to the expression label, and replacing the target sound material to the target position for playing.
The target voice material is stored in a server;
the insertion unit may specifically be configured to:
and loading the target sound material from the server according to the expression label, and replacing the target sound material to the target position for playing.
Wherein the text content comprises: and generating an explanation text content of the commodity which carries out voice explanation through the virtual image according to the requirement.
Corresponding to the second embodiment, the embodiment of the present application further provides an apparatus for explaining a commodity through an avatar, and referring to fig. 7, the apparatus may include:
a text content generating unit 701 for determining a commodity to be subjected to voice explanation through an avatar, and generating an explanation text content for the commodity;
a location and material determination unit 702 for determining a target location in the lecture text content at which the sound expressiveness enhancement is required and a target sound material for sound expressiveness enhancement at the target location;
an expression label adding unit 703, configured to add an expression label to the target position in the explanation text content, where information carried by the expression label at least includes: identification information of the target sound material;
and a replacing unit 704, configured to load the target sound material from a server according to the emoji tag in a process of converting the explanation text content into a speech synthesis result, and replace the target sound material to the target position for playing.
Wherein, the information that the expression label carried still includes: and the loudness and/or pause duration information corresponding to the target sound material respectively is used for controlling the loudness and/or pause duration when the target sound material is played.
Corresponding to the embodiment, the embodiment of the present application further provides a device for live broadcasting through an avatar, and referring to fig. 8, the device may include:
a text content determining unit 801, configured to determine text content that needs to be live broadcast through an avatar;
a location and material determination unit 802 for determining a target location in the text content where the sound expression needs to be enhanced, and a target sound material for sound expression enhancement at the target location;
an expression label adding unit 803, configured to add an expression label at the target position in the text content, where information carried by the expression label at least includes: identification information of the target sound material;
and a replacing unit 804, configured to load the target sound material according to the emotion tag and replace the target sound material to the target position for playing in a process of converting the text content into a speech synthesis result.
In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
Fig. 9 illustrates an architecture of an electronic device, which may specifically include a processor 910, a video display adapter 911, a disk drive 912, an input/output interface 913, a network interface 914, and a memory 920. The processor 910, the video display adapter 911, the disk drive 912, the input/output interface 913, and the network interface 914 may be communicatively connected to the memory 920 via a communication bus 930.
The processor 910 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided in the present Application.
The memory 920 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 920 may store an operating system 921 for controlling the operation of the electronic device 900 and a Basic Input Output System (BIOS) for controlling low-level operations of the electronic device 900. In addition, a web browser 923, a data storage management system 924, a speech synthesis processing system 925, and the like may also be stored. The speech synthesis processing system 925 may be an application program that implements the operations of the foregoing steps in this embodiment. In summary, when the technical solution provided in the present application is implemented by software or firmware, the relevant program code is stored in the memory 920 and invoked by the processor 910 for execution.
The input/output interface 913 is used to connect the input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 914 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 930 includes a path to transfer information between the various components of the device, such as the processor 910, the video display adapter 911, the disk drive 912, the input/output interface 913, the network interface 914, and the memory 920.
It should be noted that although the above-mentioned devices only show the processor 910, the video display adapter 911, the disk drive 912, the input/output interface 913, the network interface 914, the storage 920, the bus 930 and so on, in the implementation process, the device may also include other components necessary for realizing normal operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The text processing method and the electronic device in speech synthesis provided by the present application are introduced in detail, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific embodiments and the application range may be changed. In view of the above, the description should not be taken as limiting the application.

Claims (14)

1. A method for processing text in speech synthesis, comprising:
determining text content requiring speech synthesis;
determining a target position in the text content where sound expressiveness needs to be enhanced, and a target sound material for enhancing sound expressiveness at the target position;
inserting the target sound material at the target position for playing during the process of converting the text content into a speech synthesis result.
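By way of illustration only (it forms no part of the claims), the following minimal Python sketch shows how the three steps recited in claim 1 might be realized; the helper names, the shape of the sound-material records, and the TTS interface are all assumptions rather than anything prescribed by the application.

```python
# Illustrative sketch only: the data layout and the tts_engine.synthesize()
# interface are assumptions, not part of the claimed method.

def enhance_and_synthesize(text, sound_materials, tts_engine):
    # Step 1: the text content requiring speech synthesis is given as `text`.
    # Step 2: find target positions and the sound material to use at each one.
    insertions = []  # (character position, material) pairs
    for material in sound_materials:
        for keyword in material["keywords"]:
            idx = text.find(keyword)
            if idx != -1:
                # Here the position rule is simply "right after the keyword".
                insertions.append((idx + len(keyword), material))

    # Step 3: while converting the text to speech, play the material at each target position.
    audio_parts = []
    cursor = 0
    for position, material in sorted(insertions, key=lambda p: p[0]):
        audio_parts.append(tts_engine.synthesize(text[cursor:position]))
        audio_parts.append(material["audio_clip"])  # pre-recorded expressive clip
        cursor = position
    audio_parts.append(tts_engine.synthesize(text[cursor:]))
    return audio_parts
```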
2. The method of claim 1, wherein
the target sound material is obtained by recording a real person reading a target word, phrase, or sentence in a target scene.
3. The method of claim 1, further comprising:
reading a configuration file, wherein the configuration file comprises: matching rules and insertion-position rule information respectively corresponding to a plurality of sound materials;
the determining a target position in the text content where sound expressiveness needs to be enhanced, and a target sound material for enhancing sound expressiveness at the target position, comprises:
dividing the text content into a plurality of text segments;
judging whether a text segment conforms to the matching rule corresponding to a certain sound material, and if so, determining the target position according to the insertion-position rule information corresponding to that sound material and determining that sound material as the target sound material.
4. The method of claim 3, wherein
the matching rules comprise keywords and/or regular expressions;
the judging whether the text segment conforms to the matching rule corresponding to a certain sound material comprises:
determining that the text segment conforms to the matching rule corresponding to the sound material if the text segment contains a keyword corresponding to that sound material and/or matches the corresponding regular expression.
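Purely as a non-limiting illustration of claims 3 and 4, a configuration file of this kind could be expressed in JSON and evaluated with keyword and regular-expression matching as sketched below; the JSON format, field names, and segmentation rule are assumptions, not part of the application.

```python
import json
import re

# Assumed configuration layout: each sound material carries a matching rule
# (keywords and/or a regular expression) plus insertion-position rule information.
CONFIG_JSON = """
{
  "materials": [
    {
      "id": "wow_01",
      "keywords": ["amazing", "incredible"],
      "regex": "\\\\d+% off",
      "insert_position": "after_segment"
    }
  ]
}
"""

def split_into_segments(text):
    # Divide the text content into segments, e.g. on sentence-ending punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def match_materials(text, config):
    matches = []
    for segment in split_into_segments(text):
        for material in config["materials"]:
            hit_keyword = any(k in segment for k in material.get("keywords", []))
            hit_regex = bool(material.get("regex")) and re.search(material["regex"], segment)
            if hit_keyword or hit_regex:
                # The target position follows the material's insertion-position rule.
                matches.append((segment, material["insert_position"], material["id"]))
    return matches

config = json.loads(CONFIG_JSON)
print(match_materials("This deal is amazing. Today only: 50% off!", config))
```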
5. The method of claim 3, wherein
the configuration file further comprises: loudness and/or pause-duration information respectively corresponding to the plurality of sound materials.
6. The method of claim 3, wherein
the configuration file further comprises: rule effective-probability information respectively corresponding to the plurality of sound materials, for controlling the insertion frequency of the sound materials.
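As a non-limiting illustration of claim 6, such an effective probability could be applied as a random gate so that a matching rule does not trigger an insertion every time; the field name "probability" and the use of random gating are assumptions.

```python
import random

def rule_fires(material, rng=random.random):
    # A per-material value in [0, 1] controls how often an otherwise-matching
    # rule actually leads to an insertion, limiting the insertion frequency.
    return rng() < material.get("probability", 1.0)

# Example: a material with probability 0.3 fires on roughly 30% of matches.
print(sum(rule_fires({"probability": 0.3}) for _ in range(10000)) / 10000)
```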
7. The method according to any one of claims 1 to 6, wherein,
after determining the target position in the text content where sound expressiveness needs to be enhanced and the target sound material for enhancing sound expressiveness at the target position, the method further comprises:
adding an expression label at the target position of the text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
the inserting the target sound material at the target position comprises:
during the process of converting the text content into a speech synthesis result, loading the target sound material according to the expression label, and playing the target sound material at the target position in place of the label.
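As a purely illustrative, non-limiting sketch of claims 7 and 8, the expression label could be an inline tag carrying the material's identifier, which the synthesis stage later swaps for the recorded clip; the tag syntax and the load_material callback (e.g. fetching the clip from a server) are assumptions.

```python
import re

def add_expression_label(text, position, material_id):
    # Insert an inline expression label carrying the target material's identification info.
    return text[:position] + f'<expr id="{material_id}"/>' + text[position:]

def synthesize_with_labels(labeled_text, tts_engine, load_material):
    # While converting the text content to speech, each expression label is
    # replaced by the corresponding sound material (e.g. loaded by id from a server).
    audio_parts = []
    cursor = 0
    for m in re.finditer(r'<expr id="([^"]+)"/>', labeled_text):
        audio_parts.append(tts_engine.synthesize(labeled_text[cursor:m.start()]))
        audio_parts.append(load_material(m.group(1)))  # plays in place of the label
        cursor = m.end()
    audio_parts.append(tts_engine.synthesize(labeled_text[cursor:]))
    return audio_parts
```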
8. The method of claim 7, wherein
the target sound material is stored on a server;
the loading the target sound material according to the expression label and playing it at the target position comprises:
loading the target sound material from the server according to the expression label, and playing the target sound material at the target position in place of the label.
9. The method according to any one of claims 1 to 6, wherein
the text content comprises: explanation text content generated, as required, for a commodity that is to be explained by voice through an avatar.
10. A method for commodity explanation through an avatar, comprising:
determining a commodity to be explained by voice through an avatar, and generating explanation text content for the commodity;
determining a target position in the explanation text content where sound expressiveness needs to be enhanced, and a target sound material for enhancing sound expressiveness at the target position;
adding an expression label at the target position in the explanation text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
during the process of converting the explanation text content into a speech synthesis result, loading the target sound material according to the expression label, and playing the target sound material at the target position in place of the label.
11. The method of claim 10, wherein
the information carried by the expression label further comprises: loudness and/or pause-duration information corresponding to the target sound material, for controlling the loudness and/or pause duration when the target sound material is played.
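Illustratively, and only as an assumption about how such information might be encoded, the expression label of claim 11 could carry loudness and pause-duration attributes alongside the material identifier; the attribute names below are hypothetical.

```python
def build_expression_label(material_id, loudness_db=None, pause_ms=None):
    # Attribute names are illustrative; the label carries the material id plus
    # optional loudness and pause-duration information applied at playback time.
    attrs = [f'id="{material_id}"']
    if loudness_db is not None:
        attrs.append(f'loudness="{loudness_db:+.1f}dB"')
    if pause_ms is not None:
        attrs.append(f'pause_ms="{pause_ms}"')
    return f'<expr {" ".join(attrs)}/>'

# e.g. <expr id="wow_01" loudness="+3.0dB" pause_ms="300"/>
print(build_expression_label("wow_01", loudness_db=3, pause_ms=300))
```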
12. A method for live broadcasting through an avatar, comprising:
determining text content to be broadcast live through an avatar;
determining a target position in the text content where sound expressiveness needs to be enhanced, and a target sound material for enhancing sound expressiveness at the target position;
adding an expression label at the target position in the text content, wherein the information carried by the expression label at least comprises: identification information of the target sound material;
during the process of converting the text content into a speech synthesis result, loading the target sound material according to the expression label, and playing the target sound material at the target position in place of the label.
13. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 12.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors and configured to store program instructions which, when read and executed by the one or more processors, cause the steps of the method of any one of claims 1 to 12 to be performed.
CN202210193309.1A 2022-02-28 2022-02-28 Text processing method in speech synthesis and electronic equipment Pending CN114664283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210193309.1A CN114664283A (en) 2022-02-28 2022-02-28 Text processing method in speech synthesis and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210193309.1A CN114664283A (en) 2022-02-28 2022-02-28 Text processing method in speech synthesis and electronic equipment

Publications (1)

Publication Number Publication Date
CN114664283A true CN114664283A (en) 2022-06-24

Family

ID=82026626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210193309.1A Pending CN114664283A (en) 2022-02-28 2022-02-28 Text processing method in speech synthesis and electronic equipment

Country Status (1)

Country Link
CN (1) CN114664283A (en)

Similar Documents

Publication Publication Date Title
CN108962219B (en) method and device for processing text
US9318100B2 (en) Supplementing audio recorded in a media file
US10607595B2 (en) Generating audio rendering from textual content based on character models
US9037466B2 (en) Email administration for rendering email on a digital audio player
US9330657B2 (en) Text-to-speech for digital literature
CN107423364B (en) Method, device and storage medium for answering operation broadcasting based on artificial intelligence
US9196241B2 (en) Asynchronous communications using messages recorded on handheld devices
KR101628050B1 (en) Animation system for reproducing text base data by animation
CN107040452B (en) Information processing method and device and computer readable storage medium
US20080162559A1 (en) Asynchronous communications regarding the subject matter of a media file stored on a handheld recording device
CN115329206A (en) Voice outbound processing method and related device
CN117529773A (en) User-independent personalized text-to-speech sound generation
CN113850898A (en) Scene rendering method and device, storage medium and electronic equipment
KR102184053B1 (en) Method for generating webtoon video for delivering lines converted into different voice for each character
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN114664283A (en) Text processing method in speech synthesis and electronic equipment
KR102350359B1 (en) A method of video editing using speech recognition algorithm
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN109241331B (en) Intelligent robot-oriented story data processing method
JP6289950B2 (en) Reading apparatus, reading method and program
CN112562733A (en) Media data processing method and device, storage medium and computer equipment
US8219402B2 (en) Asynchronous receipt of information from a user
KR20190075765A (en) Webtoon tts system
CN112383722B (en) Method and apparatus for generating video
CN116226411B (en) Interactive information processing method and device for interactive project based on animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40074560

Country of ref document: HK