CN109582271B

CN109582271B - Method, device and equipment for dynamically setting TTS (text to speech) playing parameters

Info

Publication number: CN109582271B
Application number: CN201811261770.6A
Authority: CN
Inventors: 戴帅湘; 袁志伟; 李龙飞
Original assignee: Beijing Moran Cognitive Technology Co Ltd
Current assignee: Gansu Longdian Yunchuang Technology Consulting Co.,Ltd.
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2020-04-03
Anticipated expiration: 2038-10-26
Also published as: CN109582271A

Abstract

The embodiment of the invention discloses a method, a device and equipment for dynamically setting TTS (text to speech) playing parameters. The setting method of the TTS playing mark can dynamically set the playing acceleration mark bit, the volume mark bit and the tone mark bit by extracting the parameter information which is acquired from an external system and is related to the playing object, the environment and the user and endowing different weights to the parameters, thereby realizing the TTS intelligent playing.

Description

Method, device and equipment for dynamically setting TTS (text to speech) playing parameters

Technical Field

The embodiment of the invention relates to the field of artificial intelligence, in particular to TTS playing setting.

Background

TTS (Text To Speech) converts text to voice and is an important human-machine interaction technology in artificial intelligence. Improvements to TTS technology typically involve changing the playback speed, for example controlling the speed according to how urgent the broadcast object is. However, TTS operates in complex environments, and different users have different requirements for broadcasting. How to intelligently perceive and analyze the environment and these requirements, so that the broadcasting characteristics adapt accordingly, has become a new problem.

Disclosure of Invention

The embodiment of the invention provides a method for dynamically setting TTS playing parameters, which comprises the following steps: setting an acceleration mark bit in a playing object, wherein the acceleration mark bit is used for controlling the playing speed of TTS; the acceleration flag bit comprises an attribute bit and a numerical bit; the acceleration mark bit is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user requirement parameters and user state parameters of the playing object; the parameters are obtained automatically.

The value of the attribute bit represents the increase or decrease of the playing speed relative to the reference speed.

The attribute bit value is 0 or 1: a value of 1 indicates that the playing rate is increased relative to the reference rate, and a value of 0 indicates that the playing rate is reduced relative to the reference rate.

The value of the numerical bit indicates the degree to which the playback rate is increased or decreased relative to the reference rate.

The scenario is set by a set of parameters with different weights, which are set by the user, or automatically generated by the system.

The attribute of the playing object comprises at least one of the length, the size and the content attribute of the playing object.

The content attribute of the playing object comprises at least one of an emergency degree index, an aging characteristic index and a function attribute index of the playing object.

The emergency degree index of the playing object represents whether the playing object is an emergency message or not, the aging characteristic index of the playing object refers to the validity period of the playing object, the function attribute index represents the function of the playing object, and the function comprises at least one of reminding, warning and entertainment.

The environmental parameter includes at least one of a speed index, a noise index, and a location index.

The user state parameter comprises at least one of an emotion index, a fatigue index and a health index.

The parameters and indexes are dynamically sensed and obtained from an external system.

The scenario is updated in real time according to the relevant parameters and indicators.

Alternatively, the scene is updated periodically and remains unchanged within each period.

A volume flag bit is also set in the playing object and is used to control the playing volume of the TTS; the volume is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters and user state parameters of the playing object; the parameters are obtained automatically.

A tone mark bit is also set in the playing object and used for controlling the tone of TTS playing, and the tone is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user state parameters and user requirement parameters of the playing object; the parameters are obtained automatically.

The method further comprises flag priority setting, which is used to set the priority of the acceleration flag bit, the volume flag bit or the tone flag bit, or to switch the function of a flag bit on or off.

The embodiment of the invention also provides a device for dynamically setting TTS playing parameters, which comprises: a playing unit, used to call the relevant control parameters and play the TTS content according to the set flag bits; a flag bit setting unit, connected to the playing unit, used to set at least one of an acceleration flag bit, a volume flag bit and a tone flag bit and to output the setting result to the playing unit; a scene setting unit, connected to the flag bit setting unit, used to calculate scene parameters and indexes that control the generation of the flag bits; a parameter extraction unit, used to extract scene-related parameter information from the information received by the receiving unit and forward it to the scene setting unit; and a receiving unit, used to receive information from an external system in real time or periodically and forward it to the parameter extraction unit;

wherein, the acceleration mark bit is used for controlling the playing speed of the TTS; the acceleration flag bit comprises an attribute bit and a numerical bit; the acceleration mark bit is automatically generated according to the scene and is dynamically adjustable;

the scenario set by the scenario setting unit is defined by a set of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user requirement parameters and user state parameters of a playing object.

The indexes are dynamically sensed and obtained by the system.

The volume marking bit is used for controlling the playing volume of TTS, and the volume is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters and user requirement parameters of the playing object; the parameters are obtained automatically.

The tone flag bit is used to control the tone of TTS playing, and the tone is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters and user requirement parameters of the playing object; the parameters are obtained automatically.

The mark bit setting unit comprises a mark priority setting module which is used for setting the priority of an acceleration mark bit, a volume mark bit or a tone mark bit or opening and closing the mark bit to enable the mark bit to be opened or closed.

The invention also discloses an intelligent terminal which is characterized by comprising the device for dynamically setting the TTS playing parameters.

The invention also discloses a computer device, which is characterized by comprising a processor and a memory, wherein the processor is used for executing the instructions stored in the memory, and the instructions are used for executing the method.

The invention also discloses a computer readable storage medium which is characterized by storing instructions for executing the method.

Drawings

FIG. 1 illustrates a method for dynamically setting TTS playing parameters according to the present invention;

FIG. 2 is a block diagram of the apparatus for dynamically setting TTS playing parameters according to the present invention;

FIG. 3 is a block diagram of a flag bit setting unit in the apparatus for dynamically setting TTS playing parameters according to the present invention;

FIG. 4 is a block diagram of a scenario setting unit in the apparatus for dynamically setting TTS playing parameters of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Example one

Referring to fig. 1, the present invention discloses a method for dynamically setting TTS playing parameters, which controls the playing speed, volume and tone attribute by setting a flag bit that dynamically changes with the external environment, playing object and playing audience state, so as to implement intelligent playing.

In the invention, after the TTS playing object is read, an acceleration flag bit is added to the playing object and used to control the playing speed. A reference rate is preset; it is the playing speed used when no acceleration flag bit is present, for example approximately 250 Chinese characters/minute. The acceleration flag bit is 4 bits long. Bit 1 is the attribute bit and takes the value 0 or 1: 0 indicates that the current playing rate is reduced relative to the reference rate, and 1 indicates that it is increased. Bits 2 to 4 indicate how far the playing rate deviates from the reference rate. Table 1 below shows the relationship between the value of the acceleration flag bit and the playing speed.

Table 1: relation between acceleration mark bit value and playing speed

The above correspondence only illustrates how the relationship between the acceleration flag bit and the playing speed can be set, and should not be regarded as the only possibility. The user can set other modes or values according to need or preference. For example, the step used when decreasing the rate may be set to 10 words/min, and the step used when increasing the rate may be set to 20 words/min. The upper and lower rate limits should match the user's habits and stay within what the user can accept; for example, the upper limit of the speech rate should still allow the user to hear the content clearly.
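The decoding of a 4-bit acceleration flag can be sketched as follows. This is a minimal illustration, not the patent's Table 1: it assumes a linear mapping in which each level of bits 2 to 4 changes the rate by a fixed step, using the example values from the text (reference rate 250, step 10 when slowing down, step 20 when speeding up). The function and constant names are hypothetical.

```python
BASE_RATE = 250   # characters per minute, the example reference rate from the text
STEP_UP = 20      # assumed change per level when speeding up
STEP_DOWN = 10    # assumed change per level when slowing down


def decode_acceleration_flag(flag: str, base_rate: int = BASE_RATE) -> int:
    """Return the playing rate implied by a 4-bit acceleration flag such as '1011'."""
    if len(flag) != 4 or any(c not in "01" for c in flag):
        raise ValueError(f"expected a 4-bit string, got {flag!r}")
    faster = flag[0] == "1"      # bit 1: attribute bit (1 = faster, 0 = slower)
    level = int(flag[1:], 2)     # bits 2-4: numerical bits, a level from 0 to 7
    step = STEP_UP if faster else -STEP_DOWN
    return base_rate + level * step
```

Under these assumptions both `0000` and `1000` decode to the reference rate, which matches the later statement that either value means playing at the reference rate.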

The acceleration flag bit is generated automatically according to the scene and is dynamically adjustable. The scene is defined by a group of parameters with different weights, comprising at least one of the attribute parameters of the playing object, environment parameters, user requirement parameters and user state parameters; the parameters are obtained automatically. The attribute parameters of the playing object include its content attribute. One content attribute is the emergency degree index: for example, when an in-vehicle playing system receives, through its interface with the navigation software, a message about an accident close ahead, the playing object is highly urgent, and if this is the only parameter present, the acceleration flag bit is set to 1111. Another content attribute is the function attribute index, which classifies the playing object as, for example, current-affairs news, entertainment or work content, each with its own acceleration flag value: current-affairs news may be set to 1110, and work content to 1000 or 0000, that is, playing at the reference rate.

The environment parameters may include, for example, an in-vehicle environment, a night environment or a private environment; for example, the acceleration flag bit in a private environment may be set to 0001.

The user requirement parameter is set according to user preference. For example, a user who habitually speaks quickly and also prefers faster playback can have the overall playing speed raised, either by increasing the step, for example to 40 words/minute, or by setting a faster reference rate, for example 290 words/minute.

The user state parameters are obtained from the user's own condition. For example, the user's fatigue can be judged by detecting the user's blinking frequency, and the user's health or emotional state can be judged by remotely monitoring indexes such as blood pressure and heart rate.
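The flag values quoted above for scenes driven by a single parameter can be collected in a small lookup. This is only an illustrative sketch: the key names are hypothetical, the values are the examples from the description, and falling back to `0000` (the reference rate) for unlisted scenes is an assumption.

```python
# Example flag values from the description, keyed by (parameter category, value).
# The key strings are hypothetical labels chosen for this sketch.
SINGLE_PARAMETER_FLAGS = {
    ("content", "urgent_accident"): "1111",
    ("content", "current_affairs_news"): "1110",
    ("content", "work"): "1000",
    ("environment", "private"): "0001",
}


def flag_for(category: str, value: str) -> str:
    """Look up the example flag; unlisted scenes fall back to the reference rate."""
    return SINGLE_PARAMETER_FLAGS.get((category, value), "0000")
```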

When several parameters jointly affect the setting of the acceleration flag bit, they may conflict; for example, the attribute parameter of the playing object may call for a higher playing speed while the environment parameter calls for a lower one. To avoid the conflict and obtain a more intelligent result, the parameters may be given different priorities. For example, if the attribute parameter of the playing object has the first priority and the environment parameter the second, then in the situation above the playing speed is increased, that is, the attribute bit of the acceleration flag bit is set to 1. Alternatively, the parameters may be given different weights. For example, if the weight of the playing object attribute is 8 and the weight of the environment parameter is 2, then when the playing object attribute is urgent and the environment parameter is night, the second-highest playing speed may be set, which is more intelligent and user-friendly than setting the highest playing speed based on the playing object attribute alone.
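The two conflict-resolution strategies described here, priority and weight, can be sketched as follows. The text does not specify how weights are combined, so the weighted average below is an assumption, as are the signed level scale (-7 to +7, with 0 as the reference rate) and all names. The example weights 8 and 2 come from the text; the level proposed for a night environment is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    source: str     # which parameter made the proposal, e.g. "object" or "environment"
    level: int      # signed level: -7 (slowest) to +7 (fastest), 0 = reference rate
    priority: int   # 1 = highest priority
    weight: float   # relative weight, e.g. 8 or 2


def encode_flag(level: int) -> str:
    """Encode a signed level as one attribute bit followed by three numerical bits."""
    level = max(-7, min(7, level))
    attribute_bit = "1" if level >= 0 else "0"
    return attribute_bit + format(abs(level), "03b")


def resolve_by_priority(proposals: list[Proposal]) -> str:
    """Priority strategy: the highest-priority parameter decides alone."""
    winner = min(proposals, key=lambda p: p.priority)
    return encode_flag(winner.level)


def resolve_by_weight(proposals: list[Proposal]) -> str:
    """Weight strategy (assumed): blend all proposals by a weighted average."""
    total = sum(p.weight for p in proposals)
    blended = sum(p.level * p.weight for p in proposals) / total
    return encode_flag(round(blended))


urgent = Proposal("object", level=7, priority=1, weight=8)
night = Proposal("environment", level=0, priority=2, weight=2)  # hypothetical level
```

With these inputs the priority strategy yields `1111`, while the weighted strategy yields `1110`, one step below the maximum, in line with the "second-highest playing speed" outcome described in the text.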

Example two

Referring to fig. 1, the present invention discloses a method for dynamically setting TTS playing parameters, which controls the playing speed, volume and timbre attributes by setting a flag bit that dynamically changes with the external environment, the playing object, the playing audience demand and the playing audience state change, so as to implement intelligent playing.

Besides the playing speed, the playing volume is an important parameter affecting the listener's perception, so a volume flag bit is also needed to adjust the volume intelligently. The playing volume is adjusted by setting a volume flag bit in the playing object. For example, the volume flag bit may be 4 bits long: bit 1 indicates whether the volume is increased or decreased relative to the reference volume, and bits 2 to 4 indicate the degree of the increase or decrease. The volume flag bit is likewise adjusted automatically according to parameter information that the playing system receives through its interfaces with external systems, for example accident information from a car navigation system, location information from a positioning system, and user state attributes from a biometric system, such as the user's emotion, health indexes such as blood pressure, and drowsiness. For example, if accident information is received from the vehicle-mounted system, its attribute parameter is urgent, and the volume flag bit is set to its maximum value so that the message is broadcast at maximum volume to warn the driver. If the location received from the positioning system is the user's home, there is no concern about disturbing other people, and the message can be broadcast at a higher volume. If the location is the home and the biometric system detects from the user's blinking that the user is about to fall asleep, the volume flag bit is set to its minimum value and the message is broadcast at the softest volume. If the location is a vehicle-mounted environment and the biometric system detects from blinking that the user is about to fall asleep, the volume flag bit is set to its maximum value and the message is broadcast at the loudest volume as a warning.

Similarly, the parameters affecting the volume may conflict: some may call for a higher volume and others for a lower one. Setting priorities for the parameters resolves this. For example, if the attribute parameter of the playing object has the highest priority, the volume is turned up to the highest level whenever the attribute of the playing object is urgent. Alternatively, the parameters may be given different weights. For example, if the weight of the attribute parameter of the playing object is 8 and the weight of the user state parameter is 2, then when the attribute is urgent and the user is drowsy, the message is broadcast at a volume slightly below the maximum so that the user is not startled.
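The volume examples in this embodiment can be sketched as an ordered rule list, which corresponds to the priority-based variant with the playing object attribute ranked first. Several details are assumptions made for the sketch: that bit 1 equal to 1 means "louder" (by analogy with the acceleration flag bit), the specific value chosen for "a higher volume", the fallback to the reference volume, and all names.

```python
MAX_VOLUME = "1111"   # louder, highest degree
MIN_VOLUME = "0111"   # quieter, highest degree
LOUDER = "1100"       # assumed value for "a higher volume"
REFERENCE = "0000"    # assumed fallback: reference volume


def volume_flag(urgent: bool, location: str, drowsy: bool) -> str:
    """Pick a 4-bit volume flag; earlier rules take priority over later ones."""
    if urgent:
        return MAX_VOLUME        # accident warning: loudest
    if drowsy and location == "vehicle":
        return MAX_VOLUME        # wake a drowsy driver
    if drowsy and location == "home":
        return MIN_VOLUME        # user about to sleep at home: softest
    if location == "home":
        return LOUDER            # nobody else to disturb
    return REFERENCE
```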

Example three

The tone (timbre) of TTS broadcasting also influences user experience: a comfortable tone improves the experience and, in turn, user stickiness. The playing setting system of the invention therefore also provides a tone flag bit for adjusting the played tone. The width of the tone flag bit depends on the number of tones recorded in the system; for example, if 8 tones are recorded in advance, the tone flag bit is 3 bits long and each value corresponds to one tone. The played tone is adjusted by setting the tone flag bit in the playing object. The tone flag bit is likewise adjusted automatically according to parameter information received by the playing system through its interfaces with external systems, for example attributes of the playing object, the source attributes of information obtained from different apps, and user state attributes received from the biometric system, such as the user's emotion, health indexes such as blood pressure, and drowsiness. For example, if accident information is received from the vehicle-mounted system, its attribute parameter is urgent, and the tone is set to an objective, calm male voice; if the playing object comes from a family chat in WeChat, the tone can be set to a soft female voice.
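Sizing the tone flag by the number of recorded tones can be sketched as follows. The text gives only the example of 8 tones and 3 bits; generalizing this to the smallest width that can index every tone is an assumption, and the function names are hypothetical.

```python
def tone_flag_width(num_tones: int) -> int:
    """Smallest number of bits that can index num_tones distinct tones."""
    if num_tones < 1:
        raise ValueError("at least one tone must be recorded")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1; use at least 1 bit.
    return max(1, (num_tones - 1).bit_length())


def encode_tone(index: int, num_tones: int) -> str:
    """Encode a zero-based tone index as a fixed-width bit string."""
    if not 0 <= index < num_tones:
        raise ValueError(f"tone index {index} out of range")
    return format(index, f"0{tone_flag_width(num_tones)}b")
```

For 8 recorded tones this gives a 3-bit flag, so tone index 5 would be written as `101`.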

Also, conflicts may arise between various attributes or parameters affecting the tone mark bits, which may be coordinated by setting different priorities or different weights.

The playing acceleration flag bit, the volume flag bit and the tone flag bit can be determined and assigned based on a series of same parameters, indexes and same weights and priorities, that is, the same index can influence the setting of the three flag bits; different respective parameters, metrics, and different priority and weight information may also be selected for determination and assignment.

Example four

Referring to fig. 2, the present invention discloses a device for dynamically setting TTS playing parameters, which includes: a playing unit, used to call the relevant control parameters and play the TTS content according to the set flag bits; a flag bit setting unit, connected to the playing unit, used to set at least one of an acceleration flag bit, a volume flag bit and a tone flag bit and to output the setting result to the playing unit; a scene setting unit, connected to the flag bit setting unit, used to calculate scene parameters and indexes that control the generation of the flag bits; a parameter extraction unit, used to extract scene-related parameter information from the information received by the receiving unit and forward it to the scene setting unit; and a receiving unit, used to receive information from an external system in real time or periodically and forward it to the parameter extraction unit;

the flag bit setting unit comprises an acceleration flag bit setting module, and the acceleration flag bit is used for controlling the playing speed of the TTS; the acceleration flag bit comprises an attribute bit and a numerical bit; the acceleration mark bit is automatically generated according to the scene and is dynamically adjustable;

preferably, the flag bit setting unit may further include a volume flag bit setting module, where the volume flag bit is used to control the playing volume of the TTS, and the volume is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user requirement parameters and user state parameters of the playing object; the parameters are obtained automatically.

Preferably, the flag bit setting unit may further include a tone flag bit setting module, where the tone flag bit is used to control the tone of TTS playing, and the tone is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user requirement parameters and user state parameters of the playing object; the parameters are obtained automatically.

As shown in fig. 3, a flag priority setting module is disposed in the flag bit setting unit and is configured to set the priorities of the acceleration flag bit, the volume flag bit and the tone flag bit, or to switch the function of each flag bit on or off.
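A flag priority setting module of this kind might be sketched as below. The text does not say what priority among flag bits governs, so treating it as the order in which enabled flags are handed to the playing unit is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlagConfig:
    name: str             # "acceleration", "volume" or "tone"
    priority: int         # lower number = handled first (assumed meaning)
    enabled: bool = True  # switches the flag bit's function on or off


def active_flags(configs: list[FlagConfig]) -> list[str]:
    """Return the names of enabled flags, ordered by priority."""
    ordered = sorted(configs, key=lambda c: c.priority)
    return [c.name for c in ordered if c.enabled]


configs = [
    FlagConfig("tone", priority=3),
    FlagConfig("acceleration", priority=1),
    FlagConfig("volume", priority=2, enabled=False),
]
```

Here `active_flags(configs)` returns the acceleration flag followed by the tone flag, and the disabled volume flag is skipped.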

Example five

Referring to fig. 4, the scene set by the scene setting unit is defined by a set of parameters with different weights, where the parameters include at least one of an attribute parameter of the playing object, an environment parameter, a user requirement parameter and a user state parameter. The scene setting unit comprises a playing object attribute parameter setting module, an environment parameter setting module, a user requirement parameter setting module and a user state parameter setting module. The scene setting unit is connected with the parameter extraction unit, which in turn is connected with the receiving unit; the parameter extraction unit extracts the scene-related parameter information from the information received by the receiving unit and forwards it to the scene setting unit. The receiving unit receives information from external systems in real time or periodically and forwards it to the parameter extraction unit. The receiving unit is connected with different external systems through different interfaces, including but not limited to navigation systems, biometric sensing systems and various apps, and its set of interfaces can be extended to be compatible with more external systems. The scene setting unit is also provided with a parameter weight setting module, which assigns different weights to the different parameter setting modules; the weights can be preset initially, set automatically, or set from user input obtained through an external interface. The scene setting unit finally defines the scene according to the weighted parameters set by the different modules. The scene setting unit also comprises a checking module, which checks each parameter to ensure that a correct result is output.
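The data flow from the receiving unit through parameter extraction, checking and weighting can be sketched as a few small functions. This is a simplified illustration under stated assumptions: each parameter is reduced to a signed level between -7 and +7, the scene is summarized as a weighted average, and the message layout, key names and function names are all hypothetical.

```python
from typing import Any

# The four parameter categories named in the text (key strings are hypothetical).
SCENE_KEYS = {"object", "environment", "user_requirement", "user_state"}


def extract_parameters(message: dict[str, Any]) -> dict[str, float]:
    """Parameter extraction unit: keep only the scene-related fields."""
    return {k: float(v) for k, v in message.items() if k in SCENE_KEYS}


def check_parameters(params: dict[str, float]) -> None:
    """Checking module: reject levels outside the assumed -7..7 range."""
    for name, level in params.items():
        if not -7 <= level <= 7:
            raise ValueError(f"{name} level {level} out of range")


def define_scene(params: dict[str, float], weights: dict[str, float]) -> float:
    """Scene setting unit: weighted average of the checked parameter levels."""
    check_parameters(params)
    total = sum(weights[k] for k in params)
    if total == 0:
        raise ValueError("no weighted parameters available")
    return sum(params[k] * weights[k] for k in params) / total


# A message as it might arrive from an external system (hypothetical layout).
message = {"object": 7, "environment": 0, "timestamp": 1234}
weights = {"object": 8, "environment": 2, "user_requirement": 1, "user_state": 1}
scene_level = define_scene(extract_parameters(message), weights)
```

The resulting scene level would then be passed to the flag bit setting unit, which turns it into the acceleration, volume or tone flag bits.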

Example six

The play setting system defined in the fourth and fifth embodiments may be integrated into different hardware or operating systems such as a mobile terminal, a vehicle-mounted system, a player, a computer, and the like.

Example seven

The invention also protects a computer program medium for storing computer instructions for implementing the method for dynamically setting TTS playing parameters of the invention.

The invention also protects a computer device comprising a memory and a processor. The processor can call the instructions stored in the memory and realize the method for dynamically setting the TTS playing parameters by executing the instructions.

Claims

1. A method for dynamically setting TTS playing parameters, the method comprising: setting an acceleration mark bit in a playing object, wherein the acceleration mark bit is used for controlling the playing speed of TTS; presetting a reference speed, wherein the reference speed refers to a playing speed when no acceleration mark bit exists; the acceleration mark bit comprises an attribute bit and a numerical bit, and the value of the attribute bit represents the increase or decrease of the playing speed relative to the reference speed; the attribute bit value is 0 or 1, wherein when the attribute bit value is 1, the representation playing rate is increased relative to the reference rate; when the attribute bit is 0, the representation playing speed is reduced relative to the reference speed; the value of the numerical bit represents the degree of increase or decrease of the playing rate relative to the reference rate;

the acceleration mark bit is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise attribute parameters, environment parameters and user requirement parameters of a playing object; the parameters are automatically obtained; the weight is set by a user, and the scene is updated in real time according to related parameters and indexes;

the device also comprises mark priority setting used for setting the priority of the acceleration mark bit or opening and closing the function of the mark bit to enable the mark bit to be opened or closed.

2. The method of claim 1, wherein the attributes of the playing object comprise at least one of a length, a size, and a content attribute of the playing object.

3. The method according to claim 2, wherein the content attribute of the playing object comprises at least one of an urgency index, an aging characteristic index and a function attribute index of the playing object.

4. The method according to claim 3, wherein the emergency degree index of the playing object represents whether the playing object is an emergency message, the aging characteristic index of the playing object represents the effective period of the playing object, the function attribute index represents the function of the playing object, and the function comprises at least one of reminding, warning and entertainment.

5. The method of claim 1, the environmental parameter comprising at least one of a speed indicator, a noise indicator, a location indicator.

6. The method of claim 1, the parameters further comprising user status parameters comprising at least one of an emotional index, a fatigue index, a health index.

7. The method according to one of claims 3 to 6, wherein the parameters and indicators are dynamically sensed and obtained from an external system.

8. The method of claim 1, further comprising a volume flag bit for controlling the playing volume of the TTS, wherein the volume is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters and user state parameters of the playing object; the parameters are obtained automatically.

9. The method of claim 1, further comprising a tone mark bit for controlling the tone of TTS playing, wherein the tone is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user state parameters and user requirement parameters of the playing object; the parameters are obtained automatically.

10. The method of claim 1, further comprising a flag priority setting for priority of volume flag bits or tone flag bits, or a function to turn the flag bits on or off.

11. An apparatus for dynamically setting TTS playing parameters, the apparatus comprising: the playing unit is used for calling the related control parameters to play the TTS content according to the set mark bit; the mark position setting unit is connected with the playing unit and used for setting at least one of an acceleration mark position, a volume mark position and a tone mark position, controlling the opening and closing of the mark position and outputting a setting result to the playing unit; the scene setting unit is connected with the marking bit setting unit and used for calculating scene parameters and indexes to control the generation of the marking bits; the parameter extraction unit is used for extracting the parameter information related to the scene from the information received by the receiving unit and forwarding the parameter information to the scene setting unit; the receiving unit is used for receiving information from an external system in real time or periodically and forwarding the information to the parameter extraction unit;

presetting a reference speed, wherein the reference speed refers to a playing speed when no acceleration mark bit exists; the flag bit setting unit is used for setting an acceleration flag bit in a playing object, and the acceleration flag bit is used for controlling the playing speed of TTS; the acceleration flag bit comprises an attribute bit and a numerical bit; the value of the attribute bit represents the increase or decrease of the playing speed relative to the reference speed; the attribute bit value is 0 or 1, wherein when the attribute bit value is 1, the representation playing rate is increased relative to the reference rate; when the attribute bit is 0, the representation playing speed is reduced relative to the reference speed; the value of the numerical bit represents the degree of increase or decrease of the playing rate relative to the reference rate; the acceleration mark bit is automatically generated according to the scene and is dynamically adjustable; the scene is updated in real time according to the relevant parameters and indexes;

the scene set by the scene setting unit is defined by a group of parameters with different weights, the weights are set by a user, and the parameters comprise attribute parameters, environment parameters and user requirement parameters of a playing object; the device also comprises mark priority setting used for setting the priority of the acceleration mark bit or opening and closing the function of the mark bit to enable the mark bit to be opened or closed.

12. The apparatus of claim 11, wherein the attribute of the playing object comprises at least one of a length, a size, and a content attribute of the playing object.

13. The apparatus according to claim 12, wherein the content attribute of the playback object comprises at least one of an urgency index, an aging characteristic index, and a function attribute index of the playback object.

14. The apparatus according to claim 13, wherein the emergency index of the playing object represents whether the playing object is an emergency message, the aging characteristic index of the playing object represents a valid period of the playing object, and the function attribute index represents a function of the playing object, the function includes at least one of reminding, warning and entertainment.

15. The apparatus of claim 11, the environmental parameter comprising at least one of a speed indicator, a noise indicator, a location indicator.

16. The apparatus of claim 11, the parameters further comprising user status parameters comprising at least one of an emotional index, a fatigue index, a health index.

17. The apparatus of one of claims 12-16, said metrics each being dynamically perceived and obtained by the system.

18. The apparatus of claim 11, further comprising a volume flag bit for controlling the playing volume of the TTS, wherein the volume is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user requirement parameters and user state parameters of the playing object; the parameters are obtained automatically.

19. The apparatus of claim 11, further comprising a tone mark bit for controlling the tone of the TTS playing, wherein the tone is automatically generated according to the scene and is dynamically adjustable; the scene is defined by a group of parameters with different weights, and the parameters comprise at least one of attribute parameters, environment parameters, user requirement parameters and user state parameters of the playing object; the parameters are obtained automatically.

20. The apparatus of claim 11, further comprising a flag priority setting for setting a priority of a volume flag bit or a tone flag bit, or a function of turning the flag bit on or off.

21. An intelligent terminal, characterized in that it comprises means for dynamically setting TTS playing parameters according to any one of claims 11 to 20.

22. A computer device, comprising a processor and a memory, the processor configured to execute instructions stored in the memory, the instructions configured to perform the method of claims 1-10.

23. A computer-readable storage medium having stored thereon instructions for performing the method of claims 1-10.