CN113066474A - Voice broadcasting method, device, equipment and medium - Google Patents

Voice broadcasting method, device, equipment and medium

Info

Publication number
CN113066474A
Authority
CN
China
Prior art keywords
attribute value
text information
audio file
information
attribute
Prior art date
Legal status
Pending
Application number
CN202110352361.2A
Other languages
Chinese (zh)
Inventor
刘浩
Current Assignee
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN202110352361.2A
Publication of CN113066474A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice broadcasting method, apparatus, device and medium. In the embodiment of the invention, when the network is connected, the configured first text information and at least one attribute value group representing attribute information of the intelligent device's voice broadcast are sent to a TTS server, and the audio file synthesized by the TTS server according to the first text information and the attribute value group is received and stored. As a result, when it is determined that a voice broadcast is required, the needed audio file can be found among the locally stored audio files and played even if the network is interrupted, which improves the user experience.

Description

Voice broadcasting method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice broadcasting method, device, equipment and medium.
Background
With the rapid development of intelligent devices such as robots and of human-computer interaction, more and more robots adopt Text To Speech (TTS) instead of presenting text for reading, converting text output into speech output. Turning characters into spoken audio is a more convenient playback mode and improves the user experience.
However, before the robot can broadcast with TTS, it needs to obtain the online TTS audio from the server. During this acquisition the network may be poor; for example, the robot may move into an area with weak coverage. The network between the intelligent device and the server is then interrupted, the online TTS cannot be broadcast, and the user experience is affected.
Disclosure of Invention
The invention provides a voice broadcasting method, apparatus, device and medium, which are used to solve the problem in the prior art that online TTS cannot be broadcast when the network between the intelligent device and the server is interrupted, which degrades the user experience.
The invention provides a voice broadcasting method, which comprises the following steps:
if the intelligent device has a network connection with a Text To Speech (TTS) server, sending the configured first text information and at least one attribute value group representing attribute information of the intelligent device's voice broadcast to the TTS server;
receiving and storing an audio file returned by the TTS server, wherein the audio file is obtained by performing voice synthesis by the TTS server according to the first text information and the attribute value set;
and if the voice broadcasting is determined to be needed, searching a corresponding target audio file in the stored audio files, and controlling the intelligent equipment to play the target audio file.
In one possible embodiment, the method further comprises:
monitoring a network state if a network is interrupted after the first text information and the at least one attribute value set are sent to the TTS server;
if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information;
and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
In one possible embodiment, the method further comprises:
receiving an updating request aiming at the first text information, and determining the updated first text information;
if the stored audio file does not contain the audio file corresponding to the updated first text information, sending the updated first text information and the attribute value set of the updated first text information to the TTS server;
and receiving and storing an audio file synthesized by the TTS server according to the updated first text information and the attribute value set of the updated first text information.
In one possible embodiment, the method further comprises:
receiving a switching request aiming at the attribute value of the attribute information, and determining a second target attribute value set after switching;
if the stored audio file does not contain the audio file corresponding to the second target attribute value set, sending the first text information and the second target attribute value set to the TTS server;
and receiving and storing an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
In one possible embodiment, the method further comprises:
if the first updating condition is met, acquiring the usage frequency of each stored audio file, and deleting the audio files whose usage frequency is lower than a set threshold value; and/or
and if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
In one possible embodiment, the attribute information includes at least one of:
speaker character, language, TTS synthesis volume, speech rate, and audio sampling rate.
In a possible implementation manner, the sending, to the TTS server, the configured first text information and at least one attribute value group of attribute information representing the smart device voice broadcast includes:
determining a third target attribute value set currently used by the intelligent device, and sending the configured first text information and the third target attribute value set to the TTS server; or
And determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
The invention provides a voice broadcasting device, comprising:
the sending module is used for sending the configured first text information and at least one attribute value group representing attribute information of the intelligent device's voice broadcast to a Text To Speech (TTS) server if the intelligent device has a network connection with the TTS server;
a receiving module, configured to receive and store an audio file returned by the TTS server, where the audio file is obtained by performing speech synthesis by the TTS server according to the first text information and the attribute value set;
and the processing module is used for searching a corresponding target audio file in the stored audio files and controlling the intelligent equipment to play the target audio file if the voice broadcasting is determined to be required.
In a possible implementation manner, the processing module is further configured to monitor a network status if a network is interrupted after the first text information and the at least one attribute value set are sent to the TTS server; if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information; and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
In a possible implementation manner, the receiving module is further configured to receive an update request for the first text information, and determine updated first text information;
the sending module is further configured to send the updated first text information and the attribute value set of the updated first text information to the TTS server if the stored audio file does not contain the audio file corresponding to the updated first text information;
the receiving module is further configured to receive and store an audio file synthesized by the TTS server according to the updated first text information and the attribute value set of the updated first text information.
In a possible implementation manner, the receiving module is further configured to receive a switching request for an attribute value of the attribute information, and determine a second target attribute value set after switching;
the sending module is further configured to send the first text information and the second target attribute value set to the TTS server if the stored audio file does not include an audio file corresponding to the second target attribute value set;
and the receiving module is further configured to receive and store an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
In a possible implementation manner, the processing module is further configured to, if a first update condition is met, acquire the usage frequency of each saved audio file, and delete the audio files whose usage frequency is lower than a set threshold; and/or
and if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
In a possible implementation manner, the sending module is specifically configured to determine a third target attribute value set currently used by the smart device, and send the configured first text information and the third target attribute value set to the TTS server; or determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
The invention provides an electronic device, which comprises a processor, wherein the processor is used for realizing the steps of the voice broadcasting method when executing a computer program stored in a memory.
The present invention provides a computer-readable storage medium storing a computer program executable by a terminal, the program causing the terminal to perform any one of the steps of the voice broadcasting method described above when the program is run on the terminal.
In the embodiment of the invention, if the intelligent device has a network connection with a TTS server, the configured first text information and at least one attribute value group representing attribute information of the intelligent device's voice broadcast are sent to the TTS server, and an audio file returned by the TTS server is received and stored, where the audio file is obtained by the TTS server performing voice synthesis according to the first text information and the attribute value group. If it is determined that a voice broadcast is required, the corresponding target audio file is searched for among the stored audio files, and the intelligent device is controlled to play the target audio file. Because the configured first text information and the attribute value groups are sent, and the synthesized audio files are received and stored, while the network is connected, the needed audio file can be found among the locally stored audio files and played even if the network is later interrupted, which improves the user experience.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic diagram of a voice broadcast process according to some embodiments of the present invention;
fig. 2 is a schematic diagram illustrating a detailed process of a voice broadcast according to some embodiments of the present invention;
fig. 3 is a schematic structural diagram of a voice broadcast device according to some embodiments of the present invention;
fig. 4 is an electronic device according to some embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived from the embodiments of the present invention by a person skilled in the art are within the scope of the present invention.
In order to still perform high-quality broadcast when a network of an intelligent device and a server is interrupted and improve user experience, embodiments of the present invention provide a voice broadcast method, apparatus, device, and medium.
Example 1:
Fig. 1 is a schematic diagram of a voice broadcast process according to some embodiments of the present invention; the process is described below with reference to Fig. 1:
s101: and if the intelligent equipment is communicated with the network of the TTS server, sending the configured first text information and at least one attribute value group representing the attribute information of the voice broadcast of the intelligent equipment to the TTS server.
The voice broadcasting method provided by the embodiment of the invention is applied to intelligent equipment, wherein the intelligent equipment can be equipment such as a terminal, a robot and the like.
In the embodiment of the invention, a TTS application is deployed in the intelligent device, and a user can pre-configure the first text information based on the TTS application. When configuring the first text information, the user can send a first text information configuration instruction to the intelligent device; after receiving the instruction, the intelligent device determines the pre-configured first text information according to the first text information contained in the instruction. The first text information differs according to the application scene where the intelligent device is located, such as a meal delivery scene, an office scene, a museum scene or a hotel scene. For example, if the current application scene is a meal delivery scene, the first text information can be "Hello, enjoy your meal". The first text information corresponding to each application scene is preset according to the requirements of that scene.
In addition, for text information with a high use frequency that is not frequently replaced, if the text information and its attribute information had to be sent to the TTS server every time they are set, the workload of the TTS server and the cost would increase.

In a specific implementation, for the application scenario, the text information with the highest use frequency may be determined as the first text information, or text information whose use frequency exceeds a set threshold may be determined as the first text information, or the top N pieces of text information in descending order of use frequency may be determined as the first text information, where N is a set number, and so on. The embodiment of the present invention is not limited to the specific implementation.

Furthermore, the first text information also needs to satisfy a second set condition on its replacement frequency, that is, text information which is not frequently replaced is selected.

In a specific implementation, for the application scenario, text information that is used in each sub-period within a set time length, or that is used more than a set threshold number of times, is selected. The time length may be continuous or discontinuous, and the set time length includes at least two sub-periods. For example, it may be N consecutive days with each day as a sub-period, i.e. text information that is used every day for N consecutive days is selected. The embodiment of the present invention is not limited to the specific implementation.
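The patent does not define concrete thresholds or data structures for this selection; purely as an illustration, the following Python sketch shows one way frequently used, rarely replaced text could be picked as the first text information. The record fields, thresholds, and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TextRecord:
    text: str
    use_count: int      # how many times the text was broadcast in the set time length
    days_used: int      # sub-periods (e.g. days) in which it was used at least once
    replacements: int   # how many times the text was edited or replaced

def select_first_text(records: list[TextRecord], total_days: int,
                      use_threshold: int = 100, max_replacements: int = 1) -> list[str]:
    """Keep text whose usage frequency meets the first set condition and whose
    replacement frequency meets the second set condition (used in every sub-period,
    rarely replaced); the thresholds here are illustrative only."""
    selected = []
    for r in records:
        frequent = r.use_count >= use_threshold
        stable = r.replacements <= max_replacements and r.days_used == total_days
        if frequent and stable:
            selected.append(r.text)
    return selected
```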
For example, if the application scene is a museum scene, the user may be a volunteer or a manager of the museum; that is, the user changes flexibly with the usage scene.
Developers preset, in the TTS application, the attribute information corresponding to the first text information that can be selected by the user and the selectable attribute values of each piece of attribute information, so that when the user uses the TTS application for the first time, the attribute values of the attribute information of the first text information, i.e. the attribute values of the attribute information representing the intelligent device's voice broadcast, can be configured. If the attribute values of one or more pieces of attribute information are not set, the pre-stored default attribute values of the unconfigured attribute information, together with the configured attribute values, are taken as the attribute value group corresponding to the first text information, and the attribute value group and the first text information are sent to the TTS server. If no attribute values are set for any attribute information, the pre-stored default attribute values of all the attribute information are taken as the attribute value group corresponding to the first text information, and the attribute value group and the first text information are sent to the TTS server.
If a user pre-configures, based on the TTS application in the intelligent device, an attribute value group representing the attribute information of the intelligent device's voice broadcast, the user can send an attribute value configuration instruction to the intelligent device. After receiving the attribute value configuration instruction, the intelligent device determines the pre-configured attribute value group according to the instruction; there may be one attribute value group or multiple attribute value groups.
The attribute information of the voice broadcast of the intelligent device may be at least one of a speaker role, a language, a synthetic volume, a speech rate and an audio sampling rate, wherein the speaker role is a role of a broadcast person who performs the voice broadcast, and the attribute value of the speaker role may be a man, a woman, an old man, a child, a star and the like. The language is a broadcast language for voice broadcast, and the attribute value of the language can be Chinese, English, French and the like. The TTS synthesis volume is the volume of the broadcast volume of the voice broadcast, and the attribute value of the TTS synthesis volume may be large, medium, small, or a specific value of the broadcast volume. The audio sampling rate is the number of audio points sampled per second and is used for changing the tone quality of the played audio file. Specifically, the higher the audio sampling rate, the higher the sound quality of the synthesized audio file, and the smaller the audio sampling rate, the lower the sound quality of the synthesized audio file. Wherein the attribute value of the audio sampling rate is an arbitrary value. In the using process, the user can select the corresponding attribute value from the configured attribute information according to the self requirement.
In addition, after the user sets the first text information and the attribute value corresponding to the first text information and representing the attribute information of the smart device voice broadcast based on the TTS application in the smart device, the smart device stores the first text information and the at least one attribute value group into a local database of the smart device after determining the first text information and the at least one attribute value group.
Because the network of the intelligent device and the TTS server may or may not be connected, in order to realize information interaction between the intelligent device and the TTS server, if the intelligent device is connected with the network of the TTS server, the intelligent device sends the configured first text information and at least one attribute value group representing the attribute information broadcasted by the intelligent device in a voice mode to the TTS server.
In addition, when sending the configured first text information and at least one attribute value group representing the attribute information of the intelligent device's voice broadcast to the TTS server, the first text information may be sent together with one of its attribute value groups, or together with multiple of its attribute value groups. After receiving the synthesis request, the TTS server synthesizes the audio file according to the first text information and the corresponding attribute value group.
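As a non-authoritative sketch of the data described above (first text information plus one or more attribute value groups), the following Python shows how an intelligent device might package a synthesis request while the network is connected. The endpoint URL, JSON field names, and use of the `requests` library are assumptions; the patent does not specify a transport.

```python
from dataclasses import dataclass, asdict
import requests  # assumed HTTP client; the patent does not name a transport protocol

@dataclass
class AttributeValueGroup:
    speaker: str        # speaker character, e.g. "woman"
    language: str       # e.g. "Chinese"
    volume: str         # TTS synthesis volume
    speech_rate: str
    sampling_rate: int  # audio sampling rate in Hz

TTS_SERVER_URL = "http://tts-server.example/synthesize"   # hypothetical endpoint

def send_synthesis_request(first_text: str, groups: list[dict]) -> dict:
    """Send the configured first text information and one or more attribute
    value groups to the TTS server (only called while the network is connected)."""
    payload = {"text": first_text, "attribute_value_groups": groups}
    response = requests.post(TTS_SERVER_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()   # e.g. the synthesized audio or its download addresses

# Example: one configured attribute value group for a hotel greeting
group = asdict(AttributeValueGroup("woman", "Chinese", "medium", "medium", 16000))
# send_synthesis_request("Welcome to our hotel", [group])
```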
S102: and receiving and storing an audio file returned by the TTS server, wherein the audio file is obtained by performing voice synthesis by the TTS server according to the first text information and the attribute value set.
In the invention, after receiving the first text information sent by the intelligent device and at least one attribute value group representing the attribute information of the intelligent device's voice broadcast, the TTS server performs voice synthesis according to the first text information and each attribute value group to synthesize an audio file. Because the TTS server synthesizes audio files extremely quickly, the synthesis time can be ignored; therefore, once the TTS application in the intelligent device sends the synthesis request, the TTS server can synthesize the audio files immediately after receiving the request and send them to the intelligent device, and the intelligent device downloads the audio files from the TTS server and stores them locally. Specifically, the intelligent device may store the audio files in a local SD card and update the storage path of the audio files.
In addition, in order to reduce the workload of the CPU during the downloading process, the intelligent device may download a specific number of audio files each time the audio files are downloaded, and download in a queue manner.
In addition, the TTS server may receive synthesis requests sent by more than one TTS application, i.e. multiple TTS applications deployed on the same intelligent device. In order to avoid repeated synthesis, the intelligent device may first check whether an audio file based on the first text information and its corresponding attribute value group is already stored locally. If so, an audio file has already been synthesized for that first text information and attribute value group, so they do not need to be sent to the TTS server, and the TTS server does not need to perform the synthesis again.

Alternatively, after receiving the synthesis request sent by each TTS application, the TTS server may store the synthesized audio file locally or on a designated device. If the first text information and the corresponding attribute value group received by the TTS server are the same as a first text information and attribute value group received before, i.e. the TTS server has already synthesized the corresponding audio file, then in order to avoid wasting synthesis resources, the TTS server may send the download address of the corresponding audio file to the intelligent device, and the intelligent device receives the download address and downloads the audio file pre-stored on the TTS server or the other designated device.
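Purely as an illustrative sketch, the following Python shows one way the device-side behaviour described above could be organized: keying each audio file by its first text information and attribute value group, skipping synthesis when a matching file is already stored, saving downloads to the SD card while updating the storage path, and queueing downloads. All names and the hashing scheme are assumptions.

```python
import hashlib
import os
import queue

AUDIO_DIR = "/sdcard/tts_cache"       # hypothetical SD-card directory on the device
path_index: dict[str, str] = {}       # (text, attribute value group) key -> file path

def cache_key(text: str, group: dict) -> str:
    """Identify an audio file by its first text information plus attribute value group."""
    raw = text + "|" + "|".join(f"{k}={group[k]}" for k in sorted(group))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

def already_synthesized(text: str, group: dict) -> bool:
    """Skip the synthesis request if a matching audio file is already stored locally."""
    return cache_key(text, group) in path_index

def save_audio(text: str, group: dict, audio_bytes: bytes) -> str:
    """Store a downloaded audio file on the SD card and update its storage path."""
    os.makedirs(AUDIO_DIR, exist_ok=True)
    path = os.path.join(AUDIO_DIR, cache_key(text, group) + ".mp3")
    with open(path, "wb") as f:
        f.write(audio_bytes)
    path_index[cache_key(text, group)] = path
    return path

# Downloads can be queued so only a limited number run at once, reducing CPU load.
download_queue: "queue.Queue[tuple[str, dict]]" = queue.Queue()
```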
S103: and if the voice broadcasting is determined to be needed, searching a corresponding target audio file in the stored audio files, and controlling the intelligent equipment to play the target audio file.
In the invention, when there is a need for voice broadcast, i.e. it is determined that a voice broadcast is required, the intelligent device searches locally among the multiple stored audio files for one that meets the requirement, namely the target audio file. If the corresponding target audio file is found among the stored audio files, the intelligent device can be directly controlled to play the target audio file.

When it is determined that a voice broadcast is required, whether the target audio file exists is first checked against the locally stored audio files. Searching for the target audio file among the locally stored audio files requires no information interaction with the TTS server, so even if the network is interrupted, the target audio file can still be found and played.
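Continuing the hypothetical helpers from the previous sketch, the following shows the offline lookup-and-play step: the target audio file is resolved entirely from the local store, so no server interaction is needed. The player call is a placeholder assumption.

```python
import os

def broadcast(text: str, group: dict) -> bool:
    """Look up the target audio file locally and play it; no interaction with the
    TTS server is needed, so this also works while the network is interrupted."""
    path = path_index.get(cache_key(text, group))
    if path and os.path.exists(path):
        play_audio_file(path)
        return True
    return False   # caller may fall back to an online synthesis request

def play_audio_file(path: str) -> None:
    # Placeholder for the device's audio player (e.g. a MediaPlayer-style API).
    print(f"playing {path}")
```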
In the embodiment of the invention, the audio file synthesized by the TTS server according to the first text information and the attribute value group is obtained and stored in advance, so that when a voice broadcast is required, the needed audio file can be found among the stored audio files and played even if the network is interrupted, which improves the user experience.
Example 2:
in order to solve the problem that the audio file is not completely downloaded to the intelligent device due to the network interruption, on the basis of the above embodiment, in the embodiment of the present invention, the method further includes:
monitoring a network state if a network is interrupted after the first text information and the at least one attribute value set are sent to the TTS server;
if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information; and
and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
In the invention, after the intelligent device sends the first text information and the at least one attribute value set to the TTS server, the TTS server synthesizes the audio file. The network may be interrupted while the TTS server is synthesizing the audio file, or while the intelligent device is downloading the synthesized audio file; at that point, part or all of the audio files have not been downloaded and stored locally.

In order to ensure the integrity of the audio files stored locally by the intelligent device, that is, to obtain all the audio files synthesized from the first text information and the at least one attribute value set, in the embodiment of the present invention, if the network is interrupted, the network state is monitored in real time or periodically after the interruption. When a network connection is detected, the unsynthesized data information is determined based on the locally stored audio files, where the data information comprises second text information in the first text information for which no audio file has been synthesized and/or a first target attribute value group in the first text information for which no audio file has been synthesized. In order to obtain all the audio files, the intelligent device sends the related information of the unsynthesized data information to the TTS server, so that the TTS server synthesizes the audio files according to the data information, ensuring that the intelligent device can download the synthesized audio files to the local store.
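A minimal sketch of this recovery behaviour, reusing the hypothetical helpers from the earlier sketches, might look as follows; the polling approach and the connectivity check are assumptions.

```python
import time

def resume_after_reconnect(first_text: str, all_groups: list[dict], poll_seconds: int = 5) -> None:
    """Monitor the network state after an interruption; once the connection is back,
    determine which (text, attribute value group) combinations still have no locally
    stored audio file and ask the TTS server to synthesize only those."""
    while not network_is_connected():
        time.sleep(poll_seconds)
    missing = [g for g in all_groups if not already_synthesized(first_text, g)]
    if missing:
        send_synthesis_request(first_text, missing)

def network_is_connected() -> bool:
    # Placeholder: a real device would query the operating system's connectivity state.
    return True
```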
Example 3:
in order to facilitate the modification of the audio file, on the basis of the foregoing embodiments, in an embodiment of the present invention, the method further includes:
receiving an updating request aiming at the first text information, and determining the updated first text information;
if the stored audio file does not contain the audio file corresponding to the updated first text information, sending the updated first text information and the attribute value set of the updated first text information to the TTS server; and
and receiving and storing an audio file synthesized by the TTS server according to the updated first text information and the attribute value set of the updated first text information.
In the embodiment of the present invention, there may be a need to update the first text information, and if there is a need to update the first text information, the user may update the first text information based on a TTS application in the smart device, that is, the smart device may receive an update request for the first text information. After receiving the update request for the first text information, the intelligent device determines the updated first text information.
Because the audio files saved in the intelligent device are synthesized based on the first text information, an audio file corresponding to the updated first text information may or may not already exist among the saved audio files. If no audio file corresponding to the updated first text information exists among the saved audio files, the required audio file cannot be obtained from the locally saved audio files. In order to obtain the required audio file, the intelligent device sends the updated first text information and the attribute value set of the attribute information of the updated first text information to the TTS server. If only the first text information is updated, the attribute values of the attribute information of the first text can be considered unchanged, so the attribute value group of the updated first text information is the same as the attribute value group of the first text before the update.
In addition, the first text information may be updated, and the attribute value of the attribute information of the updated first text information may be updated at the same time, specifically, the attribute value of any one of the attribute information of the speaker character, the language, the TTS synthesis volume, the speech rate, and the audio sampling rate in the first text information may be updated, or the attribute values of at least two of the attribute information of the speaker character, the language, the TTS synthesis volume, the speech rate, and the audio sampling rate in the first text information may be updated.
For example, the first text information is "Enjoy your meal", and its attribute values are speaker character A, language A, TTS synthesis volume a, speech rate a, and audio sampling rate a. If the first text information is updated to "Welcome to our hotel" and the speaker attribute value is switched from speaker A to speaker B, then the updated first text information is "Welcome to our hotel", the attribute value group of its attribute information is speaker character B, language A, TTS synthesis volume a, speech rate a, and audio sampling rate a, and the updated first text information together with this attribute value group is sent to the TTS server.

For another example, if the first text information is updated to "Welcome to stay in our hotel", the speaker attribute value is switched from speaker A to speaker B, and the TTS synthesis volume is switched from volume a to volume b, then the updated first text information is "Welcome to stay in our hotel", the attribute value group of its attribute information is speaker character B, language A, TTS synthesis volume b, speech rate a, and audio sampling rate a, and the updated first text information together with this attribute value group is sent to the TTS server.
After receiving the updated first text information and the attribute value set of the updated first text information, the TTS server synthesizes an audio file according to them and sends the synthesized audio file to the intelligent device, and the intelligent device receives the audio file and stores it locally.
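Continuing the hypothetical helpers above, the update flow for the first text information could be sketched as follows: the TTS server is contacted only when no audio file for the updated text is stored locally, and the attribute value group is left unchanged unless the request also changes it.

```python
def handle_text_update(new_text: str, group: dict) -> None:
    """On an update request for the first text information, only contact the TTS
    server if no audio file for the updated text is stored locally; the attribute
    value group is kept unless the request also changes attribute values."""
    if already_synthesized(new_text, group):
        return                                    # reuse the stored audio file
    if network_is_connected():
        send_synthesis_request(new_text, [group])
```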
In order to facilitate the modification of the attribute value, on the basis of the foregoing embodiments, in an embodiment of the present invention, the method further includes:
receiving a switching request aiming at the attribute value of the attribute information, and determining a second target attribute value set after switching;
if the stored audio file does not contain the audio file corresponding to the second target attribute value set, sending the first text information and the second target attribute value set to the TTS server; and
and receiving and storing an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
In the embodiment of the present invention, if there is a need to switch the attribute value of the attribute information, the user may reset the attribute value of the attribute information based on the TTS application in the smart device, that is, the smart device may receive a request for switching the attribute value of the attribute information. After receiving a switching request for the attribute values of the attribute information, determining the switched second target attribute value set. Specifically, the switching may be performed for the attribute value of one piece of attribute information, or may be performed for the attribute values of a plurality of pieces of attribute information (i.e., two or more pieces of attribute information).
In the process of switching the attribute values of the attribute information, when the second target attribute value group is confirmed, the attribute values of the attribute information to be switched are switched, and the attribute values of the rest of the attribute information are kept unchanged.
Since the audio files stored in the intelligent device are synthesized based on the attribute values, an audio file corresponding to the switched second target attribute value set may or may not already exist among the stored audio files. To obtain the desired audio file when it does not exist, the intelligent device sends the first text information and the second target attribute value set to the TTS server.

After receiving the first text information and the second target attribute value set, the TTS server synthesizes an audio file according to them and sends the synthesized audio file to the intelligent device, and the intelligent device receives the audio file and stores it locally.
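A corresponding sketch for switching attribute values, again using the hypothetical helpers above: the requested changes are applied, the remaining attribute values are kept, and synthesis is requested only when no matching audio file is stored.

```python
def handle_attribute_switch(first_text: str, current_group: dict, changes: dict) -> dict:
    """Build the switched (second target) attribute value group: apply the requested
    changes, keep the remaining attribute values unchanged, and synthesize only if
    no matching audio file is already stored."""
    second_target_group = {**current_group, **changes}
    if not already_synthesized(first_text, second_target_group):
        send_synthesis_request(first_text, [second_target_group])
    return second_target_group

# Example: switch only the speaker character; everything else stays the same.
# handle_attribute_switch("Welcome to our hotel", group, {"speaker": "man"})
```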
In order to accurately determine the attribute information, on the basis of the foregoing embodiments, in an embodiment of the present invention, the attribute information includes at least one of the following:
speaker character, language, TTS synthesis volume, speech rate, and audio sampling rate.
In the present invention, the attribute information may be at least one of a speaker role, a language, a TTS synthesis volume, a speech rate, and an audio sampling rate, wherein the speaker role is a role of a broadcast person performing a voice broadcast, and the attribute value of the speaker role may be a man, a woman, an old person, a child, a star, or the like. The language is a broadcast language for voice broadcast, and the attribute value of the language can be Chinese, English, French and the like. The TTS synthesis volume is the volume of the broadcast volume of the voice broadcast, and the attribute value of the TTS synthesis volume may be large, medium, small, or a specific value of the broadcast volume. The audio sampling rate is the number of audio points sampled per second and is used for changing the tone quality of a played audio file, specifically, the larger the audio sampling rate is, the higher the tone quality of a synthesized audio file is, the smaller the audio sampling rate is, and the lower the tone quality of the synthesized audio file is, wherein the attribute value of the audio sampling rate is any numerical value.
In the present invention, a developer sets in advance all attribute value groups that may be matched with the attribute values of each attribute information, wherein, for an attribute value of any attribute information, the attribute value may be matched with some attribute values in other attribute information except the attribute information, the attribute value may also be matched with all attribute values in other attribute information except the attribute information, and specifically, whether matching between the attribute values of each attribute information is possible may be set in advance according to requirements.
In the process of determining the second target attribute value group, when a switching request for an attribute value of a piece of attribute information is received, it is determined whether any attribute value of the other attribute information (i.e. the attribute values in use before the change) cannot be matched with the switched attribute value. If there is no such mismatch, the switched attribute value together with the attribute values of the other attribute information before the change is determined as the second target attribute value group. If there is a mismatch, the switched attribute value together with attribute values of the other attribute information that can be matched with it is determined as the second target attribute value group.

In the present invention, except when the language is switched, switching the attribute value of any attribute information other than the language leaves all attribute values of the other attribute information matchable with the switched attribute value. Therefore, in the process of determining the second target attribute value group, when a switching request for the speaker attribute value is received, the target attribute value of the speaker character after the switch is determined; for ease of distinction, the attribute values of the attribute information other than the speaker character before the change are referred to as second attribute values, and the second attribute values together with the target attribute value of the speaker character are determined as the second target attribute value group. Similarly, if a switching request for the TTS synthesis volume is received, the target attribute value of the TTS synthesis volume is determined; the attribute values of the attribute information other than the TTS synthesis volume before the change are referred to as third attribute values, and the third attribute values together with the target attribute value of the TTS synthesis volume are determined as the second target attribute value group.

If a switching request for the speech rate attribute value is received, the target attribute value of the speech rate is determined; the attribute values of the attribute information other than the speech rate before the change are referred to as fourth attribute values, and the fourth attribute values together with the target attribute value of the speech rate are determined as the second target attribute value group. If a switching request for the audio sampling rate attribute value is received, the target attribute value of the audio sampling rate is determined; the attribute values of the attribute information other than the audio sampling rate before the change are referred to as fifth attribute values, and the fifth attribute values together with the target attribute value of the audio sampling rate are determined as the second target attribute value group.
For example, suppose the attribute value of the speaker character is switched and the attribute information includes the speaker character, language, synthesis volume, speech rate and sampling rate, with the attribute values before the change being speaker character A, language a, synthesis volume a, speech rate a and audio sampling rate a. If a switching instruction to change the speaker character from speaker A to speaker character B is received, the determined second target attribute value group after the change is: speaker character B, language a, synthesis volume a, speech rate a and audio sampling rate a.

Further, when the attribute value of the language is switched, the number of speaker characters corresponding to some minority languages is limited. That is, if the attribute values of the other attribute information before the change were directly combined with the switched language to form the second target attribute value group, the speaker character in the attribute value group before the language switch might not correspond to the switched language. For example, suppose the attribute value group contains the two attributes language and speaker character, the values before the switch are Chinese and child, and the speaker characters preset for Arabic are only man and woman. If Arabic and child were directly determined as the second target attribute value group, the voice could not be played normally, because Arabic has no child speaker character. Therefore, in order to ensure normal voice broadcast, when the language is switched, the second target attribute value group is determined according to the attribute value of the switched language and the pre-stored default attribute values. Specifically, for each language, default attribute values of the other attribute information are preset for that language, so the attribute value of the switched language and the default attribute values of the other attribute information for that language can be confirmed as the second target attribute value group.
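As an illustration of the language-compatibility handling described above, the following self-contained sketch uses a hypothetical compatibility table and per-language defaults; the concrete speaker lists and default values are assumptions, not taken from the patent.

```python
# Hypothetical compatibility table: speaker characters available for each language.
SPEAKERS_BY_LANGUAGE = {
    "Chinese": ["man", "woman", "child", "elderly"],
    "Arabic":  ["man", "woman"],
}
# Hypothetical pre-stored default attribute values for each language.
DEFAULTS_BY_LANGUAGE = {
    "Arabic": {"speaker": "woman", "volume": "medium", "speech_rate": "medium", "sampling_rate": 16000},
}

def switch_language(current_group: dict, new_language: str) -> dict:
    """Determine the second target attribute value group for a language switch: if the
    current speaker character cannot be matched with the new language, fall back to the
    pre-stored default attribute values for that language."""
    if current_group.get("speaker") in SPEAKERS_BY_LANGUAGE.get(new_language, []):
        return {**current_group, "language": new_language}
    return {"language": new_language, **DEFAULTS_BY_LANGUAGE.get(new_language, {})}
```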
Fig. 2 is a schematic diagram of a detailed process of a voice broadcast according to some embodiments of the present invention, and is described with reference to fig. 2.
An upper-layer application, i.e. the TTS application, is deployed on the intelligent device. Based on this application, the intelligent device receives the preloaded TTS text content and the synthesis parameters carried with it, or adopts default synthesis parameters when no synthesis parameters are carried. Here the TTS text content is the first text information, the synthesis parameters are the attribute values of each piece of attribute information, and different attribute values of the attribute information form attribute value groups. That is, the intelligent device receives a first text information configuration instruction and determines the first text information, receives at least one attribute value of the attribute information corresponding to the first text information and correspondingly determines at least one attribute value group representing the attribute information of the intelligent device's voice broadcast, and stores the synthesis parameters and the text content, i.e. the first text information and the attribute value groups, in the database. The first text information and the attribute value groups are sent to the TTS server; after receiving them, the TTS server synthesizes an MP3 file, i.e. the synthesized audio file, according to the first text information and the attribute value groups, and returns the MP3 file to the intelligent device. The intelligent device downloads the audio file, stores it in the SD card, and updates the storage path of the audio file.
Specifically, in the process of determining the target audio file based on the second target attribute value group, multiple synthesized and downloaded audio files are pre-stored in the local database of the intelligent device. An audio file corresponding to the second target attribute value group may or may not exist among the audio files stored in the database. Therefore, to acquire the audio file corresponding to the second target attribute value group, if such an audio file exists in the local database, it is played directly.

If no audio file corresponding to the second target attribute value group exists in the local database, i.e. the required audio file cannot be obtained from the local database of the intelligent device, then when the network is connected, the intelligent device sends a synthesis request to the TTS server together with the first text information and the second target attribute value group. After receiving the synthesis request, the TTS server synthesizes the audio file according to the first text information and the second target attribute value group and sends it to the intelligent device; the intelligent device downloads the audio file from the TTS server, stores it in its SD card, and updates the storage path of the audio file.

In the voice broadcasting process, the local database of the intelligent device is searched for an audio file corresponding to the configured first text information and its corresponding attribute value group. If such a file exists, the local audio file is played directly; if not, the audio file synthesized by the TTS server based on the first text information and its corresponding attribute value group is obtained, stored in the SD card of the intelligent device, and played.
Example 4:
in order to reduce the local memory occupation of the intelligent device, on the basis of the above embodiments, in the embodiment of the present invention, a deletion operation is further performed on the stored audio file, which specifically includes the following two ways:
if the first updating condition is met, acquiring the usage frequency of each stored audio file, and deleting the audio files whose usage frequency is lower than a set threshold value; and/or
and if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
In the embodiment of the invention, in order to reduce the occupation of the intelligent device's local memory, part of the audio files can be deleted to release storage. Before deleting audio files, it is determined whether a first updating condition is met. The first updating condition may be that the space occupied by the locally stored audio files is larger than a set threshold; if so, the audio files that are not frequently used are deleted, that is, the usage frequency of each stored audio file is obtained and the audio files whose usage frequency is lower than a set threshold are determined and deleted. The first updating condition may also be that part of the audio files are deleted periodically, that is, the time interval since the last deletion reaches a preset first time length, where the preset first time length is set according to requirements; if this condition is met, the audio files that are not frequently used are deleted.

Because the attribute values of the speaker character or the TTS synthesis volume are not switched frequently, in order to ensure that the memory of the intelligent device's local database is not excessively occupied, a second time length is preset. After the attribute value of the speaker character or the TTS synthesis volume is switched, if the preset second time length is reached, the third target attribute value group currently used by the intelligent device is determined, and the audio files corresponding to attribute value groups other than the third target attribute value group are deleted. If the attribute value of the speaker character is switched, the target attribute value of the speaker character is determined, the attribute values of the other attribute information before the change are determined, and the attribute values of the other attribute information together with the target attribute value of the speaker character are determined as the third target attribute value group. After the third target attribute value group is determined, the audio files corresponding to attribute value groups other than the third target attribute value group are deleted.
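A sketch of the two deletion strategies, reusing the hypothetical `path_index` and `cache_key` helpers from the earlier sketch; the thresholds are assumptions.

```python
import os

def prune_by_usage(use_counts: dict[str, int], threshold: int = 3) -> None:
    """First updating condition: delete stored audio files whose usage frequency is
    lower than a set threshold."""
    for key, count in list(use_counts.items()):
        if count < threshold and key in path_index:
            path = path_index.pop(key)
            if os.path.exists(path):
                os.remove(path)

def prune_by_current_group(first_texts: list[str], third_target_group: dict) -> None:
    """Second updating condition: keep only the audio files that correspond to the
    attribute value group the device is currently using; delete the rest."""
    keep = {cache_key(text, third_target_group) for text in first_texts}
    for key in list(path_index):
        if key not in keep:
            path = path_index.pop(key)
            if os.path.exists(path):
                os.remove(path)
```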
Example 5:
to synthesize an audio file and store the audio file in a local database, on the basis of the foregoing embodiments, in an embodiment of the present invention, the sending, to the TTS server, the configured first text information and at least one attribute value group representing attribute information of a voice broadcast of the smart device includes:
determining a third target attribute value set currently used by the intelligent device, and sending the configured first text information and the third target attribute value set to the TTS server; or
And determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
In the invention, when the configured first text information and at least one attribute value group representing the attribute information broadcasted by the intelligent device in a voice mode are sent to the TTS server, the configured first text information and the third target attribute value group can be sent to the TTS server, so that the TTS server can compose an audio file according to the configured first text information and the third target attribute value.
Since each attribute information has multiple optional attribute values, all of the attribute value groups that the intelligent device can be configured with can be determined from the attribute values of each attribute information. The intelligent device may then send the configured first text information together with all of these configurable attribute value groups to the TTS server, or send the first text information together with only a part of the configurable attribute value groups.
When selecting a part of the attribute value groups, the attribute value groups used most frequently among all configurable attribute value groups may be selected, for example the top M attribute value groups in order of frequency of use, where M is a set number. Alternatively, based on the currently used third target attribute value group, attribute value groups that share some attribute values with the third target attribute value group may be selected.
Specifically, all combinable attribute value groups may be determined according to the attribute values of each attribute information corresponding to the first text information. For example, if the attribute information of the first text information is speaker and language, the attribute value of the speaker may be male or female, and the language may be Chinese or English, the attribute value groups of all possible combinations are: male, Chinese; female, Chinese; male, English; female, English.
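A short sketch of this enumeration, together with the optional top-M selection by frequency of use mentioned above, is given below; the dictionary representation of the attribute information and the usage-frequency bookkeeping are assumptions made only for illustration.

```python
from itertools import product


def all_attribute_value_groups(attribute_info):
    """Enumerate every combinable attribute value group, e.g.
    {"speaker": ["male", "female"], "language": ["Chinese", "English"]} yields
    (male, Chinese), (female, Chinese), (male, English) and (female, English)."""
    names = sorted(attribute_info)  # fix the attribute order
    return [dict(zip(names, values))
            for values in product(*(attribute_info[name] for name in names))]


def select_top_m_groups(groups, usage_frequency, m):
    """Pick the M attribute value groups used most frequently (unused groups count as 0)."""
    frequency = lambda g: usage_frequency.get(tuple(sorted(g.items())), 0)
    return sorted(groups, key=frequency, reverse=True)[:m]


# Example with the speaker/language attributes from the paragraph above.
groups = all_attribute_value_groups({"speaker": ["male", "female"],
                                     "language": ["Chinese", "English"]})
top_two = select_top_m_groups(
    groups, {(("language", "Chinese"), ("speaker", "female")): 7}, m=2)
```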
Example 6:
Fig. 3 is a schematic structural diagram of a voice broadcasting device according to some embodiments of the present invention, where the device includes the following modules (an illustrative sketch of how they might cooperate follows the list):
a sending module 301, configured to send the configured first text information and at least one attribute value group representing the attribute information of the voice broadcast of the intelligent device to a Text To Speech (TTS) server if the intelligent device is connected to the TTS server through a network;
a receiving module 302, configured to receive and store an audio file returned by the TTS server, where the audio file is obtained by performing speech synthesis by the TTS server according to the first text information and the attribute value set;
and the processing module 303 is configured to, if it is determined that voice broadcasting is required, search for a corresponding target audio file in the stored audio files, and control the intelligent device to play the target audio file.
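The following skeleton is a hypothetical illustration of how the three modules of Fig. 3 might cooperate over a local store; tts_client.synthesize stands in for whatever request interface the TTS server actually exposes, and the in-memory dictionary store is an assumption, not part of the disclosed device.

```python
class VoiceBroadcastDevice:
    """Illustrative skeleton of the device of Fig. 3: a sending module, a receiving
    module and a processing module sharing a local store of audio files."""

    def __init__(self, tts_client, local_store):
        self.tts_client = tts_client    # assumed interface to the TTS server
        self.local_store = local_store  # dict: (text, attribute_value_group) -> audio bytes

    def send(self, first_text, attribute_value_groups):
        # Sending module: forward the configured text and attribute value group(s)
        # to the TTS server while the network is available.
        for group in attribute_value_groups:
            audio = self.tts_client.synthesize(first_text, group)
            self.receive(first_text, group, audio)

    def receive(self, text, group, audio_bytes):
        # Receiving module: store the returned audio file locally.
        self.local_store[(text, tuple(group))] = audio_bytes

    def broadcast(self, text, current_group, play):
        # Processing module: when a broadcast is needed, look up the target audio file
        # locally and play it, which still works if the network is interrupted.
        audio = self.local_store.get((text, tuple(current_group)))
        if audio is not None:
            play(audio)
        return audio is not None
```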
In a possible implementation manner, the processing module 303 is further configured to monitor a network status if a network is interrupted after the first text information and the at least one attribute value set are sent to the TTS server; if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information; and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
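A sketch of how the non-synthesized data information might be recomputed and resent after the network recovers, under the same (text, attribute value group) keying as above; the synthesize callable is a placeholder for the request to the TTS server.

```python
def unsynthesized_items(first_texts, attribute_value_groups, local_store):
    """Compare the configured (text, attribute value group) pairs with the locally
    stored audio files and return the pairs that still lack a synthesized audio file."""
    return [(text, group)
            for text in first_texts
            for group in attribute_value_groups
            if (text, tuple(group)) not in local_store]


def resume_synthesis(first_texts, attribute_value_groups, local_store, synthesize):
    """Once the connection is back, send only the missing items to the TTS server."""
    for text, group in unsynthesized_items(first_texts, attribute_value_groups, local_store):
        local_store[(text, tuple(group))] = synthesize(text, group)
```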
In a possible implementation manner, the receiving module 302 is further configured to receive an update request for the first text information, and determine updated first text information;
the sending module 301 is further configured to send the updated first text information and the attribute value group of the updated first text information to the TTS server if the stored audio file does not contain the audio file corresponding to the updated first text information;
the receiving module 302 is further configured to receive and store an audio file synthesized by the TTS server according to the updated first text information and the attribute value group of the updated first text information.
In a possible implementation manner, the receiving module 302 is further configured to receive a switching request for an attribute value of the attribute information, and determine a second target attribute value set after switching;
the sending module 301 is further configured to send the first text information and the second target attribute value set to the TTS server if the stored audio file does not include an audio file corresponding to the second target attribute value set;
the receiving module 302 is further configured to receive and store an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
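Both the text-update flow and the attribute-value-switch flow reduce to the same check, sketched below under the same assumptions as above: reuse the stored audio file if one already exists for the (text, attribute value group) key, otherwise request a new synthesis and cache the result.

```python
def ensure_synthesized(text, attribute_value_group, local_store, synthesize):
    """Shared handling for updated first text information or a switched (second target)
    attribute value group: only request synthesis when no matching audio file is stored."""
    key = (text, tuple(attribute_value_group))
    if key not in local_store:
        local_store[key] = synthesize(text, attribute_value_group)
    return local_store[key]
```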
In a possible implementation manner, the processing module 303 is further configured to, if a first update condition is met, obtain a frequency of using each saved audio file, and delete an audio file whose frequency of using is lower than a set threshold; and/or,
and if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
In a possible implementation manner, the sending module 301 is specifically configured to determine a third target attribute value set currently used by the smart device, and send the configured first text information and the third target attribute value set to the TTS server; or determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
Example 7:
On the basis of the foregoing embodiments, some embodiments of the present invention further provide an electronic device, as shown in Fig. 4, comprising a processor 401, a communication interface 402, a memory 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with each other through the communication bus 404.
The memory 403 has stored therein a computer program which, when executed by the processor 401, causes the processor 401 to perform the steps of:
if the intelligent device is connected to a Text To Speech (TTS) server through a network, sending the configured first text information and at least one attribute value group representing the attribute information of the voice broadcast of the intelligent device to the TTS server;
receiving and storing an audio file returned by the TTS server, wherein the audio file is obtained by performing voice synthesis by the TTS server according to the first text information and the attribute value set;
and if the voice broadcasting is determined to be needed, searching a corresponding target audio file in the stored audio files, and controlling the intelligent equipment to play the target audio file.
In a possible implementation, the processor 401 is further configured to monitor a network status if a network is interrupted after the first text information and the at least one attribute value set are transmitted to the TTS server; if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information; and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
In a possible implementation manner, the processor 401 is further configured to receive an update request for the first text information, and determine updated first text information; if the stored audio files do not contain an audio file corresponding to the updated first text information, send the updated first text information and the attribute value group of the updated first text information to the TTS server; and receive and store an audio file synthesized by the TTS server according to the updated first text information and the attribute value group of the updated first text information.
In a possible implementation manner, the processor 401 is further configured to receive a switching request for an attribute value of the attribute information, and determine a second target attribute value set after switching; if the stored audio file does not contain the audio file corresponding to the second target attribute value set, sending the first text information and the second target attribute value set to the TTS server; and receiving and storing an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
In a possible implementation manner, the processor 401 is further configured to, if the first update condition is satisfied, obtain a frequency of using each saved audio file, and delete an audio file whose frequency of using is lower than a set threshold; and/or if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
In a possible implementation manner, the processor 401 is further configured to determine a third target attribute value set currently used by the smart device, and send the configured first text information and the third target attribute value set to the TTS server; or determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 402 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application-specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
Example 8:
On the basis of the foregoing embodiments, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program executable by an electronic device is stored, and when the program is run on the electronic device, the electronic device is caused to execute the following steps:
the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of:
if the intelligent device is connected to a Text To Speech (TTS) server through a network, sending the configured first text information and at least one attribute value group representing the attribute information of the voice broadcast of the intelligent device to the TTS server;
receiving and storing an audio file returned by the TTS server, wherein the audio file is obtained by performing voice synthesis by the TTS server according to the first text information and the attribute value set;
and if the voice broadcasting is determined to be needed, searching a corresponding target audio file in the stored audio files, and controlling the intelligent equipment to play the target audio file.
In one possible embodiment, the method further comprises:
monitoring a network state if a network is interrupted after the first text information and the at least one attribute value set are sent to the TTS server;
if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information;
and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
In one possible embodiment, the method further comprises:
receiving an updating request aiming at the first text information, and determining the updated first text information;
if the stored audio file does not contain the audio file corresponding to the updated first text information, sending the updated first text information and the attribute value set of the updated first text information to the TTS server;
and receiving and storing an audio file synthesized by the TTS server according to the updated first text information and the attribute value group of the updated first text information.
In one possible embodiment, the method further comprises:
receiving a switching request aiming at the attribute value of the attribute information, and determining a second target attribute value set after switching;
if the stored audio file does not contain the audio file corresponding to the second target attribute value set, sending the first text information and the second target attribute value set to the TTS server;
and receiving and storing an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
In one possible embodiment, the method further comprises:
if the first updating condition is met, acquiring the frequency of use of each stored audio file, and deleting the audio files of which the frequency of use is lower than a set threshold value; and/or,
and if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
In one possible embodiment, the attribute information includes at least one of:
speaker character, language, TTS synthesis volume, speech rate, and audio sampling rate.
In a possible implementation manner, the sending, to the TTS server, the configured first text information and at least one attribute value group of attribute information representing the smart device voice broadcast includes:
determining a third target attribute value set currently used by the intelligent device, and sending the configured first text information and the third target attribute value set to the TTS server; or
And determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
In the embodiment of the invention, the audio file synthesized by the TTS server according to the first text information and the attribute value group is obtained and stored in advance, so that when it is determined that voice broadcasting is required, the required audio file can be found among the locally stored audio files and played even if the network is interrupted, improving the user experience.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A voice broadcasting method, applied to an intelligent device, characterized by comprising the following steps:
if the intelligent device is connected to a Text To Speech (TTS) server through a network, sending the configured first text information and at least one attribute value group representing the attribute information of the voice broadcast of the intelligent device to the TTS server;
receiving and storing an audio file returned by the TTS server, wherein the audio file is obtained by performing voice synthesis by the TTS server according to the first text information and the attribute value set;
and if the voice broadcasting is determined to be needed, searching a corresponding target audio file in the stored audio files, and controlling the intelligent equipment to play the target audio file.
2. The method of claim 1, further comprising:
monitoring a network state if a network is interrupted after the first text information and the at least one attribute value set are sent to the TTS server;
if network connection is monitored, determining non-synthesized data information according to the stored audio file, wherein the data information comprises second text information of the non-synthesized audio file in the first text information and/or a first target attribute value group of the non-synthesized audio file in the first text information;
and sending the related information of the data information to the TTS server so that the TTS server synthesizes an audio file according to the data information.
3. The method of claim 1, further comprising:
receiving an updating request aiming at the first text information, and determining the updated first text information;
if the stored audio file does not contain the audio file corresponding to the updated first text information, sending the updated first text information and the attribute value set of the updated first text information to the TTS server;
and receiving and storing an audio file synthesized by the TTS server according to the updated first text information and the attribute value group of the updated first text information.
4. The method of claim 1, further comprising:
receiving a switching request aiming at the attribute value of the attribute information, and determining a second target attribute value set after switching;
if the stored audio file does not contain the audio file corresponding to the second target attribute value set, sending the first text information and the second target attribute value set to the TTS server;
and receiving and storing an audio file synthesized by the TTS server according to the second target attribute value set and the first text information.
5. The method of claim 1, further comprising:
if the first updating condition is met, acquiring the frequency of use of each stored audio file, and deleting the audio files of which the frequency of use is lower than a set threshold value; and/or,
and if the second updating condition is met, determining a third target attribute value set currently used by the intelligent equipment, and deleting the audio files corresponding to the attribute value sets except the third target attribute value set.
6. The method according to any of claims 1-5, wherein the attribute information comprises at least one of:
speaker character, language, TTS synthesis volume, speech rate, and audio sampling rate.
7. The method of claim 1, wherein sending the configured first text information and the at least one set of attribute values representing the attribute information of the smart device voice broadcast to the TTS server comprises:
determining a third target attribute value set currently used by the intelligent device, and sending the configured first text information and the third target attribute value set to the TTS server; or
And determining a plurality of attribute value groups which can be configured by the intelligent device according to each attribute value of each attribute information, and sending the configured first text information and the plurality of attribute value groups to the TTS server.
8. A voice broadcasting device, applied to an intelligent device, characterized in that the device comprises:
a sending module, configured to send the configured first text information and at least one attribute value group representing the attribute information of the voice broadcast of the intelligent device to a Text To Speech (TTS) server if the intelligent device is connected to the TTS server through a network;
a receiving module, configured to receive and store an audio file returned by the TTS server, where the audio file is obtained by performing speech synthesis by the TTS server according to the first text information and the attribute value set;
and the processing module is used for searching a corresponding target audio file in the stored audio files and controlling the intelligent equipment to play the target audio file if the voice broadcasting is determined to be required.
9. An electronic device, characterized in that the electronic device comprises a processor for implementing the steps of the method according to any of claims 1-7 when executing a computer program stored in a memory.
10. A computer-readable storage medium, characterized in that it stores a computer program executable by a terminal, which program, when run on the terminal, causes the terminal to carry out the steps of the method according to any one of claims 1 to 7.
CN202110352361.2A 2021-03-31 2021-03-31 Voice broadcasting method, device, equipment and medium Pending CN113066474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110352361.2A CN113066474A (en) 2021-03-31 2021-03-31 Voice broadcasting method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110352361.2A CN113066474A (en) 2021-03-31 2021-03-31 Voice broadcasting method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113066474A true CN113066474A (en) 2021-07-02

Family

ID=76565175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110352361.2A Pending CN113066474A (en) 2021-03-31 2021-03-31 Voice broadcasting method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113066474A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101299332A (en) * 2008-06-13 2008-11-05 嘉兴闻泰通讯科技有限公司 Method for implementing speech synthesis function by GSM mobile phone
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart household voice broadcasting system and voice broadcasting method
CN109389967A (en) * 2018-09-04 2019-02-26 深圳壹账通智能科技有限公司 Voice broadcast method, device, computer equipment and storage medium
CN111367490A (en) * 2020-02-28 2020-07-03 广州华多网络科技有限公司 Voice playing method and device and electronic equipment
CN112133281A (en) * 2020-09-15 2020-12-25 北京百度网讯科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium
CN112164379A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Audio file generation method, device, equipment and computer readable storage medium
CN112329563A (en) * 2020-10-23 2021-02-05 复旦大学 Intelligent reading auxiliary method and system based on raspberry pie
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120964A (en) * 2021-11-04 2022-03-01 广州小鹏汽车科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN114120964B (en) * 2021-11-04 2022-10-14 广州小鹏汽车科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
WO2023078199A1 (en) * 2021-11-04 2023-05-11 广州小鹏汽车科技有限公司 Voice interaction method and apparatus, and electronic device and readable storage medium

Similar Documents

Publication Publication Date Title
CN113268226B (en) Page data generation method, device, storage medium and equipment
CN105657171B (en) The setting method and system of incoming ring tone
CN109947388B (en) Page playing and reading control method and device, electronic equipment and storage medium
CN107995249B (en) Voice broadcasting method and device
CN110674624B (en) Method and system for editing graphics context
WO2023185166A1 (en) Service call method and apparatus, device and storage medium
CN111128153B (en) Voice interaction method and device
CN108271096A (en) A kind of task executing method, device, intelligent sound box and storage medium
CN112394932A (en) Automatic browser webpage skin changing method and device
CN113066474A (en) Voice broadcasting method, device, equipment and medium
CN112201264A (en) Audio processing method and device, electronic equipment, server and storage medium
CN111768755A (en) Information processing method, information processing apparatus, vehicle, and computer storage medium
WO2011097875A1 (en) Method and platform for implementing short message customization service
CN112396511B (en) Distributed wind control variable data processing method, device and system
CN106550111B (en) Method for setting ring tone in music playing and intelligent terminal
WO2015176498A1 (en) Software upgrade method, mobile terminal and computer storage medium
CN110347454A (en) Application program theme setting method, terminal equipment control method and device, terminal device and computer readable storage medium
CN111091825B (en) Method and device for constructing interaction engine cluster
CN113163255A (en) Video playing method, device, terminal and storage medium
CN111739510A (en) Information processing method, information processing apparatus, vehicle, and computer storage medium
CN110931014A (en) Speech recognition method and device based on regular matching rule
CN111124347A (en) Method and device for forming interaction engine cluster by aggregation
CN109064527A (en) Implementation method, device, storage medium and the android terminal of dynamic configuration animation
JP5704348B2 (en) Mobile communication terminal, communication program, and information communication system
CN105653229A (en) Method and device for implementing voice control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination