CN112863479A

CN112863479A - TTS voice processing method, device, equipment and system

Info

Publication number: CN112863479A
Application number: CN202110008812.0A
Authority: CN
Inventors: 林辉
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2021-05-28

Abstract

The embodiment of the application provides a method, a device, equipment and a system for processing TTS voice. The method comprises the following steps: acquiring text information to be processed, and performing offline speech synthesis processing on the text information according to a preset speech synthesis rule to obtain TTS voice; and, if it is determined that the playing strategy of the TTS voice is to play through a smart device in a distributed network, sending the TTS voice to the smart device, so that the smart device plays the TTS voice when it determines that the TTS voice meets a preset playing condition. The application improves the efficiency of information dissemination and removes the current limit on the number of times information can be disseminated.

Description

TTS voice processing method, device, equipment and system

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a system for processing TTS speech.

Background

In the management of a community or an industrial park, effective information dissemination is of great importance. Currently, information is disseminated in a community or an industrial park mainly in two ways: in one, the manager notifies each household door to door; in the other, the manager announces the information through a broadcasting system. However, both ways require considerable human resources and time, which not only reduces the efficiency of information dissemination but also greatly limits the number of times information can be disseminated.

Disclosure of Invention

The embodiment of the application aims to provide a method, a device, equipment and a system for processing TTS voice so as to realize efficient information transmission.

In order to solve the above technical problem, the embodiment of the present application is implemented as follows:

in a first aspect, an embodiment of the present application provides a method for processing TTS speech, including:

acquiring text information to be processed;

performing off-line speech synthesis processing on the text information according to a preset speech synthesis rule to obtain TTS speech;

and, if it is determined that the playing strategy of the TTS voice is to play through a smart device in a distributed network, sending the TTS voice to the smart device, so that the smart device plays the TTS voice when it determines that the TTS voice meets a preset playing condition.

In a second aspect, an embodiment of the present application provides a device for processing TTS speech, including:

a memory for storing speech synthesis rules;

a processor for acquiring text information to be processed and performing offline speech synthesis processing on the text information according to the speech synthesis rule to obtain TTS voice; and for determining a playing strategy of the TTS voice and, if the playing strategy is to play through a smart device in a distributed network, sending the TTS voice to the smart device, so that the smart device plays the TTS voice when it determines that a preset playing condition is met.

In a third aspect, an embodiment of the present application provides a system for processing TTS speech, including: a supervisor device and at least one smart device;

the supervisor equipment is used for carrying out related processing on the TTS voice according to the TTS voice processing method;

and the intelligent device receives the TTS voice sent by the supervisor device, and plays the TTS voice if the TTS voice is determined to meet the preset playing condition.

In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a communication interface, a memory, and a communication bus; the processor, the communication interface and the memory complete mutual communication through a bus; a memory for storing a computer program; and the processor is used for executing the program stored in the memory and realizing the steps of the TTS voice processing method.

In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the method for processing TTS speech.

In the embodiment of the application, the supervisor device and at least one smart device form a distributed network in advance. When the supervisor device acquires text information to be processed, it performs offline speech synthesis processing on the text information according to a preset speech synthesis rule to obtain TTS voice; if it determines that the playing strategy of the TTS voice is to play through a smart device in the distributed network, it sends the TTS voice to the corresponding smart device, so that the smart device plays the TTS voice when it determines that the TTS voice meets a preset playing condition. The text information is thus converted into TTS voice and played automatically over the distributed network without manual reading, which greatly improves the efficiency of information dissemination. Because the smart device can play the TTS voice whenever the playing condition is met, the current limit on the number of times information can be disseminated is effectively removed. Moreover, performing the speech synthesis offline reduces the cost of information dissemination and avoids the situation in which speech synthesis cannot be performed because no server for online speech synthesis has been deployed.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.

Fig. 1 is a schematic diagram of an application scenario of a TTS speech processing method according to an embodiment of the present application;

Fig. 2 is a first flowchart illustrating a method for processing TTS speech according to an embodiment of the present application;

Fig. 3 is a second flowchart illustrating a TTS speech processing method according to an embodiment of the present application;

Fig. 4 is a third flowchart illustrating a TTS speech processing method according to an embodiment of the present application;

Fig. 5 is a fourth flowchart illustrating a method for processing TTS speech according to an embodiment of the present application;

Fig. 6 is a fifth flowchart illustrating a TTS speech processing method according to an embodiment of the present application;

Fig. 7 is a sixth flowchart illustrating a TTS speech processing method according to an embodiment of the present application;

Fig. 8 is a seventh flowchart illustrating a TTS speech processing method according to an embodiment of the present application;

Fig. 9 is a schematic block diagram illustrating a TTS speech processing apparatus according to an embodiment of the present application;

Fig. 10 is a first schematic composition diagram of a TTS speech processing system according to an embodiment of the present application;

Fig. 11 is a second schematic composition diagram of a TTS speech processing system according to an embodiment of the present application;

Fig. 12 is a schematic composition diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a schematic diagram of an application scenario of a TTS voice processing method according to one or more embodiments of the present application. As shown in Fig. 1, the scenario includes a supervisor device, a routing device and at least one smart device. The supervisor device is connected to the smart devices through the routing device to establish a distributed network. The supervisor device has functions such as synthesizing, managing, distributing and playing TTS (Text To Speech) voice, as well as functions such as inputting, querying and displaying text information; in practical applications, the supervisor device may be an access control device, a video intercom device, an attendance device or the like that has these functions. The smart device has functions such as playing and querying TTS voice, and may also have functions such as querying and displaying text information; in practical applications, the smart device may likewise be an access control device, a video intercom device, an attendance device or the like that has these functions. Deploying the distributed network in a community, an industrial park or the like provides the corresponding access control and attendance functions while also enabling efficient information dissemination.

Specifically, when the supervisor device acquires text information to be processed, it performs offline speech synthesis processing on the acquired text information according to a preset speech synthesis rule to obtain TTS voice. It then determines the playing strategy of the TTS voice and, if the determined playing strategy is to play through a smart device in the distributed network, sends the TTS voice to the corresponding smart device through the routing device. After the smart device receives the TTS voice, it plays the TTS voice if it determines that the TTS voice meets the preset playing condition. The text information is thus converted into TTS voice and played automatically over the distributed network without manual reading, which greatly improves the efficiency of information dissemination. Because the smart device can play the TTS voice whenever the playing condition is met, the current limit on the number of times information can be disseminated is effectively removed. Moreover, performing the speech synthesis offline reduces the cost of information dissemination and avoids the situation in which speech synthesis cannot be performed because no server for online speech synthesis has been deployed.
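The flow described above can be sketched in Python as follows. This is only an illustrative sketch: the names `SmartDevice`, `synthesize_offline` and `process_text` are hypothetical, the synthesizer is a placeholder that returns tagged bytes rather than audio, the playing condition is assumed to be a simple callable, and the routing device is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SmartDevice:
    """Hypothetical stand-in for a smart device in the distributed network."""
    name: str
    can_play: Callable[[], bool]  # assumed form of the preset playing condition
    played: List[bytes] = field(default_factory=list)

    def receive(self, tts_voice: bytes) -> None:
        # The device plays the voice only when its playing condition is met.
        if self.can_play():
            self.played.append(tts_voice)


def synthesize_offline(text: str) -> bytes:
    """Placeholder for local synthesis; a real engine would return audio data."""
    return ("TTS:" + text).encode("utf-8")


def process_text(text: str, play_via_network: bool,
                 devices: List[SmartDevice]) -> bytes:
    # Synthesis runs locally on the supervisor device (no server involved).
    tts_voice = synthesize_offline(text)
    # Playing strategy: play through smart devices in the distributed network.
    if play_via_network:
        for device in devices:
            device.receive(tts_voice)
    return tts_voice
```

In this sketch a device whose condition is not met simply ignores the voice; a real device might instead store it until the condition is satisfied.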

It should be noted that the types of the supervisor device and the smart device are not limited to the above-mentioned access control device, video intercom device, attendance device and the like, and may be chosen as needed in practical applications.

Based on the application scenario architecture, one or more embodiments of the present application provide a method for processing TTS speech. Fig. 2 is a flowchart illustrating a method for processing TTS speech according to one or more embodiments of the present application, where the method in fig. 2 can be executed by the supervisor apparatus in fig. 1. Referring to fig. 2, the method may include the steps of:

Step 102, acquiring text information to be processed;

Specifically, an information processing request sent by a designated device is received, and the text information to be processed is acquired from the information processing request; alternatively, an information processing request is obtained in response to an information processing operation of the user, and the text information to be processed is acquired from that request.

The designated device may be a terminal device of a user, such as a mobile phone, a tablet computer or a portable notebook; it may also be a control device that is communicatively connected to the supervisor device, such as a server or a server cluster formed by a plurality of servers. Specifically, a control program for the supervisor device is installed in the terminal device of the user; the control program may be an independent application program (App) or an applet embedded in another application program. When a user needs to broadcast information, the user operates the control program in the terminal device, edits the corresponding text information, and performs a submission operation (such as clicking a confirm button) after the editing is completed. The terminal device, in response to the submission operation, acquires the text information submitted by the user and sends an information processing request to the supervisor device according to the acquired text information; the supervisor device receives the information processing request sent by the terminal device and acquires the text information to be processed from it. Alternatively, the control device sends an information processing request to the supervisor device according to the text information it has acquired, and the supervisor device receives the information processing request sent by the control device and acquires the text information to be processed from it.

Alternatively, when the supervisor device is provided with an input module, the user may operate the input module to input text information and, after the input is completed, click a button such as OK or Submit to send an information processing request to the supervisor device; the supervisor device obtains the information processing request in response to this information processing operation and acquires the text information to be processed from it.
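As an illustration of acquiring the text information from an information processing request, the following Python sketch assumes that the request is a JSON object with a `text` field and a `source` field. The patent does not specify a request format, so the format, the field names and the function names are all hypothetical.

```python
import json


def build_request(text: str, source: str) -> str:
    """Build a hypothetical information processing request.

    `source` names the sender, e.g. "terminal", "control-device" or
    "input-module"; these labels are assumptions for illustration.
    """
    return json.dumps({"source": source, "text": text}, ensure_ascii=False)


def extract_text(raw_request: str) -> str:
    """Acquire the text information to be processed from a request."""
    payload = json.loads(raw_request)
    text = payload.get("text", "")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("information processing request carries no text")
    return text
```

Whichever path the request takes (terminal device, control device or local input module), the supervisor device ends up with the same text payload, so a single extraction routine can serve all three.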

Step 104, performing off-line speech synthesis processing on the text information according to a preset speech synthesis rule to obtain TTS speech;

The offline speech synthesis processing in the embodiment of the application refers to speech synthesis performed locally on the supervisor device. Correspondingly, online speech synthesis processing means that the supervisor device sends the acquired text information to a server through the network, the server performs the speech synthesis to obtain TTS voice, and the TTS voice is sent back to the supervisor device. Because deploying a server is costly in practice, not every community or industrial park has one. For this reason, in the embodiment of the application, the supervisor device performs offline speech synthesis processing on the text information according to the preset speech synthesis rule, which not only reduces the cost of information dissemination but also avoids the situation in which speech synthesis cannot be performed because no server for online speech synthesis has been deployed.

Step 106, if it is determined that the playing strategy of the TTS voice is to play through a smart device in the distributed network, sending the TTS voice to the corresponding smart device, so that the smart device plays the TTS voice when it determines that the TTS voice meets the preset playing condition.

The playing strategy may include playing through a smart device in the distributed network, direct playing, and suspended playing. Playing through a smart device in the distributed network means that the TTS voice is sent to the corresponding smart device in the distributed network and played by that smart device; direct playing means that the supervisor device itself plays the TTS voice; suspended playing means that playing is deferred until the playing condition is met. It should be noted that multiple playing strategies may be executed simultaneously; for example, playing through a smart device in the distributed network and direct playing may be executed at the same time. For the networking manner of the distributed network, refer to the foregoing description, which is not repeated here. The preset playing condition may be set as needed in practical applications, for example playing at preset time intervals or playing when a specified time is reached.
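Because several playing strategies may be in effect at once, they can be modelled as combinable flags. The Python sketch below is illustrative only: the names `PlayStrategy`, `dispatch` and `interval_elapsed` and the action labels are hypothetical, and the interval check is just one assumed form of a preset playing condition.

```python
from datetime import datetime, timedelta
from enum import Flag, auto
from typing import List


class PlayStrategy(Flag):
    VIA_SMART_DEVICE = auto()  # send to smart devices in the distributed network
    DIRECT = auto()            # the supervisor device plays the voice itself
    SUSPENDED = auto()         # hold the voice until the playing condition is met


def dispatch(strategy: PlayStrategy) -> List[str]:
    """Return the actions implied by a (possibly combined) strategy."""
    actions = []
    if PlayStrategy.VIA_SMART_DEVICE in strategy:
        actions.append("send_to_smart_device")
    if PlayStrategy.DIRECT in strategy:
        actions.append("play_on_supervisor")
    if PlayStrategy.SUSPENDED in strategy:
        actions.append("hold_until_condition_met")
    return actions


def interval_elapsed(last_played: datetime, now: datetime,
                     interval: timedelta) -> bool:
    """Example playing condition: replay once a preset interval has passed."""
    return now - last_played >= interval
```

For example, `dispatch(PlayStrategy.VIA_SMART_DEVICE | PlayStrategy.DIRECT)` yields both the network action and the local action, mirroring the simultaneous execution mentioned above.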

The computing resources of the supervisor device are limited, and the computing resources consumed by speech synthesis are positively correlated with the length of the text information. An overly long text could therefore consume more computing resources than the supervisor device can provide and cause the speech synthesis to fail. To avoid this, in one or more embodiments of the present application, step 104 may include the following steps 104-2 to 104-6:

Step 104-2, if it is determined that the length of the text information is greater than a first preset length, segmenting the text information according to a preset segmentation rule to obtain a plurality of sub-texts;

Further, if it is determined that the length of the text information is not greater than the first preset length, offline speech synthesis processing is performed on the text information directly to obtain the corresponding TTS voice. The first preset length may be regarded as the maximum length of text information that the supervisor device can process in a single speech synthesis.

Further, when the playing policy is to play through the intelligent device in the distributed network, considering that the voice playing capability of the intelligent device is usually limited, when the length of the TTS voice exceeds the maximum voice playing length of the intelligent device for performing voice playing once, there is a risk that the TTS voice cannot be played smoothly. Based on this, in one or more embodiments of the present application, the supervisor device performs the segmented speech synthesis process in combination with its own speech synthesis capability and the speech playing capability of the smart device. Specifically, as shown in fig. 4, step 104-2 may include the following steps 104-22 to 104-26:

step 104-22, if the playing strategy of the TTS voice to be synthesized is determined to be played through the intelligent equipment in the distributed network, determining the target text length corresponding to the maximum voice playing length of the intelligent equipment for performing single voice playing;

In order to enable the supervisor device to quickly determine the text length corresponding to the maximum voice playing length of a single voice playback by the smart device, in one or more embodiments of the present application, a correspondence between the device information of each smart device and the text length corresponding to the maximum voice playing length that the smart device can play may be established in advance. Accordingly, step 104-22 may include: the supervisor device obtains the corresponding text length from the preset correspondence according to the device information of the smart device to which the TTS voice is to be sent, and determines the obtained text length as the target text length corresponding to the maximum voice playing length of a single voice playback by the smart device. Further, when the TTS voice is to be sent to a plurality of smart devices, the smallest of the obtained text lengths is determined as the target text length.

In one or more embodiments of the present application, the supervisor device may also obtain the target text length from the smart device. Specifically, step 104-22 may include: sending a length obtaining request to the smart device to which the TTS voice is to be sent, receiving the text length returned by the smart device, and determining the received text length as the target text length corresponding to the maximum voice playing length of a single voice playback by the smart device. Further, when the TTS voice is to be sent to a plurality of smart devices, the smallest of the received text lengths is determined as the target text length.

Step 104-24, comparing the first preset length with the target text length, and determining the smaller of the two as the standard length;

Step 104-26, if it is determined that the length of the text information is greater than the standard length, segmenting the text information according to a preset segmentation rule to obtain a plurality of sub-texts.

Therefore, when the length of the text information is greater than the standard length, the text information is segmented and synthesized segment by segment, so that the result stays within both the speech synthesis capability of the supervisor device and the voice playing capability of the smart device.
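Steps 104-22 to 104-26 amount to taking two minimums and one comparison. A minimal Python sketch follows, assuming a pre-established lookup table keyed by device model; the table contents, the model names and the function names are hypothetical, while the 25-character first preset length is the example value given in this description.

```python
from typing import Dict, Iterable

FIRST_PRESET_LENGTH = 25  # characters; example value from the description

# Hypothetical correspondence: device model -> text length that matches the
# longest voice the device can play in a single playback.
DEVICE_TEXT_LENGTH: Dict[str, int] = {"intercom-a": 40, "access-b": 20}


def target_text_length(device_models: Iterable[str]) -> int:
    """Step 104-22: with several target devices, take the smallest length."""
    return min(DEVICE_TEXT_LENGTH[model] for model in device_models)


def standard_length(device_models: Iterable[str],
                    first_preset_length: int = FIRST_PRESET_LENGTH) -> int:
    """Step 104-24: the smaller of the two limits is the standard length."""
    return min(first_preset_length, target_text_length(device_models))


def needs_segmentation(text: str, device_models: Iterable[str]) -> bool:
    """Step 104-26: segment only when the text exceeds the standard length."""
    return len(text) > standard_length(device_models)
```

The sketch assumes at least one target device is given; with an empty device list `min` raises `ValueError`, and a caller would fall back to the first preset length alone.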

Corresponding to steps 104-22 through 104-26, as shown in FIG. 4, step 106 may include the following step 106-2:

Step 106-2, sending the TTS voice to the corresponding smart device, so that the smart device plays the TTS voice when it determines that the preset playing condition is met.

Step 104-4, dividing the obtained sub-texts into a plurality of sub-text sets according to preset synthesis conditions;

Step 104-6, performing offline speech synthesis processing on the sub-texts in each sub-text set to obtain the corresponding TTS voice.

Because punctuation marks generally delimit relatively complete units of meaning in text information, in one or more embodiments of the present application the segmentation processing is performed based on the punctuation marks in the text information. Specifically, as shown in Fig. 5, step 104-26 may include the following steps 104-262 and 104-264:

step 104-;

the first preset length may be set in practical application as required, for example, the first preset length is 25 characters.

Step 104-.

Specifically, each time a punctuation mark is detected in the text to be divided, it is determined whether the detected punctuation mark is one of the first preset symbols; if so, the text is segmented at the position of the detected punctuation mark to obtain a corresponding sub-text; if not, detection of punctuation marks continues. The first preset symbols may be set as needed in practical applications; for example, the first preset symbols include the period, the exclamation mark, the semicolon and the like.
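The detection loop described above can be sketched in Python as follows. The symbol set and the function name are assumptions; the sketch also assumes that each punctuation mark stays attached to the sub-text it terminates, which the description does not state explicitly.

```python
from typing import List

# Assumed first preset symbols: full-width and ASCII period, exclamation
# mark and semicolon.
FIRST_PRESET_SYMBOLS = frozenset("。！；.!;")


def split_at_first_preset_symbols(text: str) -> List[str]:
    """Scan the text once and cut after every first preset symbol."""
    sub_texts = []
    start = 0
    for index, char in enumerate(text):
        if char in FIRST_PRESET_SYMBOLS:
            # Segment at the position of the detected symbol.
            sub_texts.append(text[start:index + 1])
            start = index + 1
        # Any other character (including commas): keep scanning.
    if start < len(text):
        # Trailing text without a terminating symbol becomes the last sub-text.
        sub_texts.append(text[start:])
    return sub_texts
```

Joining the returned sub-texts reproduces the input exactly, so no characters are lost. Note that treating the ASCII period as a first preset symbol would also split decimals such as "3.5"; a production rule set would need to account for that.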

Further, in practical applications a sub-text obtained by segmenting at a first preset symbol may still be too long. For this reason, performing segmentation at the position of the punctuation mark in step 104-264 to obtain the corresponding sub-text may include:

determining whether the length of the text to be divided before the punctuation mark exceeds the standard length; if so, determining a segmentation position in the text to be divided according to a preset manner and segmenting at the determined position, so that the length of the obtained sub-text does not exceed the standard length; if not, segmenting at the position of the punctuation mark to obtain the corresponding sub-text.

Specifically, when a first preset symbol is detected for the first time, the text information before that first preset symbol is determined as the text to be divided; otherwise, the text between the currently detected first preset symbol and the previously detected first preset symbol is determined as the text to be divided. To satisfy the text length requirement, in one or more embodiments of the present application, determining the segmentation position in the text to be divided according to the preset manner may include: determining the segmentation position in the text to be divided according to the standard length. Specifically, proceeding from front to back, target text information whose length equals the standard length is determined in the text to be divided, the position after the last character of the target text information is determined as the segmentation position, and segmentation is performed at that position to obtain the corresponding sub-text.

Further, when the segmentation position is determined purely by the standard length, a word may be split into two parts; for example, the word "today" may be cut so that "to" falls in the current sub-text and "day" in the next one, which results in a poor listening experience when the TTS voice is played. In practice, an overly long text usually contains punctuation marks such as commas and pause marks in the middle. Therefore, in order to avoid splitting the meaning, in one or more embodiments of the present application, determining the segmentation position in the text to be divided according to the preset manner may include: detecting whether the text to be divided contains a second preset symbol, and, if it does and the length of the text to be divided before the second preset symbol does not exceed the standard length, determining the position of the second preset symbol as the segmentation position. It should be noted that the manner of determining the segmentation position in the text to be divided is not limited to the above and may be chosen as needed in practical applications, for example dividing the text to be divided evenly.
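The two ways of choosing a segmentation position in an overlong text can be combined as in the Python sketch below: prefer a second preset symbol, and fall back to a cut at the standard length only when none is available. The symbol set and the function name are assumptions, as is the choice of the last qualifying symbol (which keeps each sub-text as long as possible); the sketch also counts the symbol itself toward the length, a slightly stricter reading than the description requires.

```python
from typing import List

# Assumed second preset symbols: full-width comma, pause mark, ASCII comma.
SECOND_PRESET_SYMBOLS = frozenset("，、,")


def split_overlong(text: str, standard_length: int) -> List[str]:
    """Cut an overlong text into pieces no longer than standard_length."""
    if standard_length < 1:
        raise ValueError("standard_length must be positive")
    pieces = []
    rest = text
    while len(rest) > standard_length:
        window = rest[:standard_length]
        marks = [i for i, ch in enumerate(window) if ch in SECOND_PRESET_SYMBOLS]
        # Cut after the last second preset symbol inside the window; if there
        # is none, fall back to a hard cut at the standard length.
        cut = marks[-1] + 1 if marks else standard_length
        pieces.append(rest[:cut])
        rest = rest[cut:]
    if rest:
        pieces.append(rest)
    return pieces
```

The hard cut can still split a word, which is exactly the drawback noted above; it is kept only as a last resort so that every piece respects the standard length.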

Further, after a segmentation position is determined in the text to be divided and segmentation is performed to obtain the corresponding sub-text, the position of the currently detected first preset symbol of the text to be divided is determined as the next segmentation position; alternatively, when the length of the text remaining in the text to be divided is smaller than a third preset length (for example, 3 characters), text detection continues, and when the next first preset symbol is detected, processing continues in the manner described above.

A sub-text obtained by segmentation may contain only a few characters; for example, a sub-text may be just "good morning". To avoid producing too many TTS voice segments, which would make the TTS voice harder to manage, in one or more embodiments of the present application the obtained sub-texts are divided into a plurality of sub-text sets according to preset synthesis conditions. Specifically, as shown in Fig. 6, step 104-4 may include the following steps 104-42 and 104-44:

Step 104-42, sequentially determining the current sub-texts to be divided;

Step 104-44, determining, in order from front to back, N sub-texts among the current sub-texts to be divided whose total length is greater than the standard length while the total length of the first N-1 sub-texts is not greater than the standard length, and dividing the first N-1 sub-texts into a sub-text set; wherein N is an integer greater than 1;

To improve the response speed to the information processing request, optionally, in one or more embodiments of the present application, the segmentation processing, the division into sub-text sets and the TTS speech synthesis may be interleaved: each time a sub-text is obtained, it is determined whether a sub-text set can be divided, and when a sub-text set is divided, offline synthesis of the corresponding TTS voice is performed. Specifically, as shown in Fig. 7, step 104-42 may include the following step 104-422:

step 104-;

Corresponding to step 104-422, as shown in Fig. 7, step 104-44 may include the following steps 104-44-2 and 104-44-4, and step 104-6 may include the following steps 104-62 and 104-64:

step 104-44-2, determining whether the total length of the current subfolders to be divided is greater than the standard length, if so, executing step 104-44-4, otherwise, returning to step 104-;

It should be noted that if the current sub-texts to be divided contain only one sub-text, its length is not greater than the standard length, so the process returns to step 104-264.

Step 104-44-4, dividing the first N-1 sub texts in the N sub texts to be divided into a sub text set;

For example, suppose the current sub-texts to be divided are 3 sub-texts (i.e., N is 3) whose lengths are 15, 8 and 9 characters respectively, and the standard length is 25 characters. The first two sub-texts total 23 characters, which does not exceed the standard length, while all three total 32 characters, which does; the first and second sub-texts are therefore divided into one sub-text set.
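The grouping rule can be sketched in Python as a single greedy pass. The function name is hypothetical, and the sketch assumes every individual sub-text is already no longer than the standard length, as the preceding segmentation is meant to guarantee.

```python
from typing import List


def group_sub_texts(sub_texts: List[str], standard_length: int) -> List[List[str]]:
    """Greedily pack consecutive sub-texts into sub-text sets.

    A set is closed as soon as adding the next sub-text would push its total
    length above standard_length, i.e. the first N-1 of N sub-texts are
    grouped and the Nth starts the next set.
    """
    sets: List[List[str]] = []
    current: List[str] = []
    total = 0
    for sub_text in sub_texts:
        if current and total + len(sub_text) > standard_length:
            sets.append(current)
            current, total = [], 0
        current.append(sub_text)
        total += len(sub_text)
    if current:
        sets.append(current)  # whatever remains forms the last set
    return sets
```

With sub-texts of 15, 8 and 9 characters and a standard length of 25, the first two land in one set and the third starts the next, matching the example above.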

Step 104-62, performing offline speech synthesis processing on the obtained sub-text set to obtain the corresponding TTS voice;

Step 104-64, determining whether unsegmented text information remains; if so, determining the unsegmented text information as the text to be divided and returning to step 104-264; otherwise, executing step 106-2;

it should be noted that, after the segmentation processing is performed at the position where the punctuation mark is located in step 104-264 to obtain the corresponding sub-text, a preset mark to be segmented may be further added at the position where the punctuation mark is located, so that when the next segmentation processing is performed, the position of the mark to be segmented may be determined as the initial detection position, and the punctuation mark in the text information is detected backward from the initial detection position.

The above describes the process in which, each time a sub-text is obtained by segmentation, it is determined whether a sub-text set can be divided, and offline speech synthesis is performed as soon as a sub-text set is obtained. Because the corresponding TTS voice is synthesized as soon as each sub-text set is available, the TTS voice can be played promptly, which improves the response speed to the information processing request.
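This interleaved behaviour maps naturally onto a generator that yields each TTS voice as soon as its sub-text set is complete. The Python sketch below is illustrative: the function names and the symbol set are assumptions, the synthesizer is passed in as a callable, and the handling of overlong sub-texts is omitted for brevity.

```python
from typing import Callable, Iterator

FIRST_PRESET_SYMBOLS = frozenset("。！；.!;")  # assumed symbol set


def iter_sub_texts(text: str) -> Iterator[str]:
    """Lazily yield sub-texts, cutting after each first preset symbol."""
    start = 0
    for index, char in enumerate(text):
        if char in FIRST_PRESET_SYMBOLS:
            yield text[start:index + 1]
            start = index + 1
    if start < len(text):
        yield text[start:]


def synthesize_incrementally(text: str, standard_length: int,
                             synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Segment, group and synthesize in one pass, yielding voices early."""
    pending = []
    total = 0
    for sub_text in iter_sub_texts(text):
        if pending and total + len(sub_text) > standard_length:
            # A sub-text set is complete: synthesize it before scanning on.
            yield synthesize("".join(pending))
            pending, total = [], 0
        pending.append(sub_text)
        total += len(sub_text)
    if pending:
        yield synthesize("".join(pending))
```

A caller can start playing the first yielded voice while later parts of the text have not yet been scanned, which is the source of the faster response.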

In one or more embodiments of the present application, when the requirement on the response rate of the information processing request is not very high, the set division processing may be performed on the obtained sub-texts only after the text information has been fully segmented into sub-texts, with the corresponding off-line speech synthesis processing performed each time a sub-text set is obtained. Specifically, as shown in FIG. 8, step 104-42 may include the following step 104-424:

Step 104-424, determining whether the division processing of the sub-text set is performed for the first time; if so, determining each sub-text obtained by the segmentation processing as the current sub-texts to be divided; otherwise, determining, among the sub-texts obtained by the segmentation processing, the sub-text to which the preset to-be-divided mark has been added, and determining that sub-text and the sub-texts after it, in order from front to back, as the current sub-texts to be divided.

Corresponding to step 104-424, step 104-44 may include the following steps 104-44-6 to 104-44-12, and step 104-6 may include the following step 104-66:

Step 104-44-6, determining the first two sub-texts among the current sub-texts to be divided as the current sub-texts;

Step 104-44-8, determining whether the total length of the current sub-texts is greater than the standard length; if so, executing step 104-44-10; otherwise, executing step 104-44-12;

Step 104-44-10, dividing the first N-1 sub-texts of the N current sub-texts into a sub-text set, and determining whether any sub-text to be divided remains after the Nth sub-text; if so, adding the preset to-be-divided mark to the Nth sub-text and executing step 104-66; otherwise, dividing the Nth sub-text into a sub-text set, taking it as the last sub-text set, and executing step 104-66;

Step 104-44-12, determining whether any sub-text to be divided remains after the Nth sub-text of the N current sub-texts; if so, determining the N current sub-texts together with the next sub-text after the Nth sub-text as the current sub-texts, and returning to step 104-44-8; otherwise, dividing the N current sub-texts into a sub-text set, taking it as the last sub-text set, and executing step 104-66;

Step 104-66, performing off-line speech synthesis processing on the obtained sub-text set to obtain corresponding TTS speech, and determining whether the obtained sub-text set is the last sub-text set; if so, executing step 106; otherwise, returning to step 104-424;

The above is the process of dividing the sub-texts into sub-text sets after the segmentation processing has been completed, and performing off-line speech synthesis processing each time a sub-text set is obtained. This not only realizes text-to-speech conversion, but also ensures a certain TTS speech synthesis rate when the requirement on the response rate of the information processing request is not very high.
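The mark-based division loop can be sketched as follows. This is a hedged illustration only: the preset to-be-divided mark is modelled as a list index (`mark`), the names `next_subtext_set` and `divide_all` are hypothetical, and lengths are counted in characters. A sub-text that would overflow the current set is simply left for the next round, which yields the same sets as the steps described above.

```python
from typing import List, Optional, Tuple


def next_subtext_set(subtexts: List[str], mark: int,
                     standard_length: int) -> Tuple[List[str], Optional[int]]:
    """Return the next sub-text set and the index of the newly marked sub-text.

    `mark` is the index of the first sub-text not yet placed in a set.
    The returned mark is None when the returned set is the last one.
    """
    end = mark + 1                        # current sub-texts are subtexts[mark:end]
    total = len(subtexts[mark])
    while end < len(subtexts):
        if total + len(subtexts[end]) > standard_length:
            return subtexts[mark:end], end    # first N-1 form a set; mark the Nth
        total += len(subtexts[end])
        end += 1
    return subtexts[mark:], None          # nothing left after these: last set


def divide_all(subtexts: List[str], standard_length: int) -> List[List[str]]:
    """Repeat the division until the last set; synthesis would run per set."""
    sets: List[List[str]] = []
    mark: Optional[int] = 0 if subtexts else None
    while mark is not None:
        group, mark = next_subtext_set(subtexts, mark, standard_length)
        sets.append(group)                # off-line synthesis for `group` goes here
    return sets


lengths = [[len(s) for s in g] for g in divide_all(["a" * 10, "b" * 10, "c" * 10, "d" * 4], 25)]
print(lengths)  # [[10, 10], [10, 4]]
```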

Further, the offline synthesis processing of TTS speech is not limited to the above-mentioned processing performed each time a sub-text set is obtained, and for the play strategies such as deferred play, because the requirement on the response rate of the information processing request is not high, it is also possible to perform offline speech synthesis processing on each obtained sub-text set after the sub-text set is divided to obtain all sub-text sets. The specific time of the off-line speech synthesis processing can be set in practical application according to the requirement.

Therefore, the text information to be processed is divided into a plurality of sub-texts, the sub-texts are divided into sub-text sets, the sub-texts in each sub-text set are subjected to off-line speech synthesis processing, and corresponding TTS speech is obtained, so that the text can be accurately converted into speech, and the requirement of high response rate can be met.

Further, in order to meet the information publicity requirement and play the TTS voice at an appropriate time, step S102 may further include: determining a playing strategy of the TTS voice to be synthesized.

Specifically, when an information processing request is sent to the supervisor device, playing information may also be specified; correspondingly, the supervisor device determines whether playing information has been acquired; if so, it determines the playing strategy of the TTS voice to be synthesized according to the playing information; if not, it determines the default playing strategy as the playing strategy of the TTS voice to be synthesized.

The playing information may include playing strategy information, playing time information, and the like. The default playing strategy can be set as needed in practical applications; for example, the default playing strategy is playing through the intelligent device in the distributed network.
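The fallback to a default playing strategy can be sketched as below. The enum members follow the strategies named in this description; the class and function names (`PlayStrategy`, `PlayInfo`, `resolve_strategy`) and the string format of `play_time` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PlayStrategy(Enum):
    VIA_INTELLIGENT_DEVICE = "via_intelligent_device"  # play through devices in the distributed network
    DIRECT = "direct"                                  # play on the supervisor device itself
    DEFERRED = "deferred"                              # store now, play when a condition is met


@dataclass
class PlayInfo:
    strategy: Optional[PlayStrategy] = None
    play_time: Optional[str] = None    # e.g. "18:30"; the format is an assumption


# Assumed default, matching the example given in the text.
DEFAULT_STRATEGY = PlayStrategy.VIA_INTELLIGENT_DEVICE


def resolve_strategy(play_info: Optional[PlayInfo]) -> PlayStrategy:
    """Use the strategy carried by the playing information, else the default."""
    if play_info is not None and play_info.strategy is not None:
        return play_info.strategy
    return DEFAULT_STRATEGY
```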

Furthermore, an industrial park is often very large and can be divided into a plurality of areas, and some information only needs to be publicized in a certain area; therefore, the user can also specify, in the playing information, the device information of the intelligent device that is to play the TTS voice. Correspondingly, sending the TTS voice to the intelligent device in step 106 may include: determining a target intelligent device in the distributed network according to the playing information, and sending an information sending request to the routing device in the distributed network according to the device information of the target intelligent device and the TTS voice, so that the routing device sends the TTS voice to the target intelligent device. Specifically, the supervisor device obtains the device information of the target intelligent device from the playing information and sends an information sending request to the routing device in the distributed network according to the obtained device information and the TTS voice; the routing device then sends the TTS voice to the intelligent device corresponding to the device information according to the received information sending request. The device information includes, for example, a device identifier, an IP address of the device, and the like.
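A minimal in-memory sketch of this targeted delivery is shown below. No real network protocol is implied: `DeviceInfo`, `SendRequest`, `RoutingDevice` and `send_to_targets` are hypothetical names, and the routing device is reduced to a dictionary that records what each target would receive.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class DeviceInfo:
    device_id: str    # device identifier
    ip: str           # IP address of the device


@dataclass
class SendRequest:
    targets: List[DeviceInfo]
    tts_audio: bytes


class RoutingDevice:
    """Stand-in for the routing device; records the audio delivered per target."""

    def __init__(self) -> None:
        self.delivered: Dict[str, bytes] = {}

    def handle(self, request: SendRequest) -> None:
        for target in request.targets:
            self.delivered[target.device_id] = request.tts_audio


def send_to_targets(router: RoutingDevice, targets: Iterable[DeviceInfo],
                    tts_audio: bytes) -> SendRequest:
    """Build the information sending request and hand it to the routing device."""
    request = SendRequest(targets=list(targets), tts_audio=tts_audio)
    router.handle(request)
    return request


router = RoutingDevice()
send_to_targets(router, [DeviceInfo("door-01", "192.168.1.21")], b"\x00\x01")
```

Only the devices listed in the playing information receive the audio, which is the area-specific behaviour the paragraph describes.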

Considering that some users may not be present when the intelligent device plays TTS voice, in order to enable the users who are not present to perform related information query, in one or more embodiments of the present application, the method may further include: and sending the text information to the intelligent equipment so that the intelligent equipment responds to a query request of a user for the text information and displays the corresponding text information. The text information and the TTS voice can be simultaneously sent to the intelligent device, and can also be respectively sent to the intelligent device.

Further, when the intelligent device receives the text information, the received text information is stored; when a user wants to inquire certain text information, operating the intelligent equipment to select the text information to be inquired; the intelligent device responds to a query request of a user for the text information and displays the corresponding text information.

It should be noted that, because the storage space of the intelligent device is limited and storing audio data occupies a large amount of storage space, the intelligent device deletes the TTS speech sent by the supervisor device after the TTS speech has been played. When the intelligent device has speech synthesis capability, the user can select a query mode, such as querying in text form or querying in voice form, when operating the intelligent device to select the text information to be queried; correspondingly, the intelligent device responds to the query operation of the user and, if it determines that the query is to be performed in voice form, performs off-line speech synthesis processing based on the corresponding text information to obtain TTS speech and plays the obtained TTS speech. The off-line speech synthesis process of the intelligent device is similar to that of the supervisor device; the difference is that the intelligent device segments the text information according to a preset segmentation rule to obtain a plurality of sub-texts when it determines that the length of the text information is greater than a third preset length, where the third preset length is the maximum length of text information that the intelligent device can process in a single speech synthesis operation.
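The device-side behaviour (keep the text, drop the audio after playback, re-synthesize on a voice query) can be sketched as follows. The class `IntelligentDevice`, its method names, the message identifier `msg_id` and the injected `synthesize` callable are all hypothetical; the `played` list merely stands in for a loudspeaker.

```python
from typing import Callable, Dict, List


class IntelligentDevice:
    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._texts: Dict[str, str] = {}     # text is small, so it is kept
        self._audio: Dict[str, bytes] = {}   # audio is large, so it is transient
        self._synthesize = synthesize        # assumed on-device off-line TTS engine
        self.played: List[bytes] = []        # stand-in for the loudspeaker

    def receive(self, msg_id: str, text: str, tts_audio: bytes) -> None:
        self._texts[msg_id] = text
        self._audio[msg_id] = tts_audio

    def play(self, msg_id: str) -> None:
        audio = self._audio.pop(msg_id)      # the audio is deleted once it is played
        self.played.append(audio)

    def has_audio(self, msg_id: str) -> bool:
        return msg_id in self._audio

    def query(self, msg_id: str, mode: str = "text") -> str:
        text = self._texts[msg_id]
        if mode == "voice":
            # Re-synthesize from the stored text instead of keeping the audio.
            self.played.append(self._synthesize(text))
        return text


dev = IntelligentDevice(synthesize=lambda t: t.encode("utf-8"))
dev.receive("n1", "Water off at 9", b"AUDIO")
dev.play("n1")
```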

Further, considering that when the synthesized TTS speech reaches a certain length, the synthesis of the remaining TTS speech may be performed while playing the TTS speech, so as to achieve a high response rate, in one or more embodiments of the present application, after step 104-6, the method may further include: and determining the length of the TTS voice which is synthesized currently, and if the length of the TTS voice which is synthesized is determined to be not less than the second preset length, sending the TTS voice which is synthesized currently to the intelligent equipment. Specifically, if it is determined that the length of the synthesized TTS speech is not less than the second preset length and the playing strategy is to play through the intelligent device in the distributed network, the currently synthesized TTS speech is sent to the corresponding intelligent device.
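The send-while-synthesizing idea can be sketched as below. Several points are assumptions rather than statements from the source: the length of the synthesized TTS voice is measured in bytes, the threshold is re-applied to every later chunk, and any remainder is flushed at the end. The names `synthesize_and_stream`, `synthesize` and `send` are hypothetical.

```python
from typing import Callable, Iterable, List


def synthesize_and_stream(subtext_sets: Iterable[List[str]],
                          synthesize: Callable[[str], bytes],
                          send: Callable[[bytes], None],
                          second_preset_length: int) -> None:
    """Send synthesized audio onward as soon as enough of it has accumulated."""
    buffered = bytearray()
    for group in subtext_sets:
        buffered += synthesize("".join(group))
        if len(buffered) >= second_preset_length:
            send(bytes(buffered))     # playback can start while synthesis continues
            buffered.clear()
    if buffered:
        send(bytes(buffered))         # flush the tail (an assumption of this sketch)


sent: List[bytes] = []
synthesize_and_stream([["ab"], ["cd"], ["e"]], lambda t: t.encode("utf-8"), sent.append, 4)
print(sent)  # [b'abcd', b'e']
```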

Considering that the smart device often has a format requirement when playing the speech, based on this, sending the TTS speech to the smart device may include: and coding the TTS voice to obtain the TTS voice in the first preset format, and sending the TTS voice in the first preset format to the intelligent equipment. The first preset format can be set in practical application according to needs, such as MP3 format.
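As a rough illustration of encoding into a preset format, the sketch below wraps raw PCM samples in a WAV container using Python's standard `wave` module. WAV is used here only as a stand-in because the standard library has no MP3 encoder; the 16 kHz, 16-bit, mono parameters and the name `encode_wav` are assumptions, not values from the source.

```python
import io
import wave


def encode_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap 16-bit mono PCM samples in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)            # mono (assumed)
        wf.setsampwidth(2)            # 2 bytes per sample, i.e. 16-bit (assumed)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


encoded = encode_wav(b"\x00\x00" * 160)   # 160 silent samples
```

A deployment that needs MP3 would replace this function with a call to whatever encoder its platform provides.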

Since the playing policy may also be direct playing, deferred playing, and the like, the method may further include:

If it is determined that the playing strategy of the TTS voice is direct playing, the text information is stored and the TTS voice is played.

If it is determined that the playing strategy of the TTS voice is deferred playing, the TTS voice is encoded to obtain the TTS voice in a second preset format; the TTS voice in the second preset format is saved; and if it is determined that the playing condition of the TTS voice in the second preset format is met, the TTS voice in the second preset format is played. The second preset format may be the same as or different from the first preset format and may be set as needed in practical applications, for example the MP3 format; the playing conditions include receiving playing instruction information sent by a designated device, reaching a specified playing time, and the like. Playing the TTS voice in the second preset format may mean playing it directly, or sending it to the intelligent device in the distributed network so that the intelligent device plays it, and the like.
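The two playing conditions named above (a playing instruction from a designated device, or a specified playing time being reached) can be checked with a small helper. The class `DeferredTTS` and its fields are hypothetical and serve only to illustrate the condition test.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DeferredTTS:
    audio: bytes                          # TTS voice saved in the second preset format
    play_at: Optional[datetime] = None    # specified playing time, if any
    instruction_received: bool = False    # set when a designated device asks for playback

    def ready(self, now: datetime) -> bool:
        """True once either playing condition is met."""
        if self.instruction_received:
            return True
        return self.play_at is not None and now >= self.play_at


item = DeferredTTS(b"...", play_at=datetime(2021, 1, 5, 18, 0))
```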

It should be noted that, when the playback strategy is direct playback, the speech synthesis capability of the smart device may not be considered in the process of segmented speech synthesis, that is, the first preset length is determined as the target text length.

Further, when the supervisor device has the function of inquiring and presenting information, the method may further include: and responding to a query request of a user for the text information, and displaying the corresponding text information.

Based on the same technical concept, one or more embodiments of the present application further provide a device for processing TTS speech. FIG. 9 is a schematic block diagram of the device for processing TTS speech according to one or more embodiments of the present application. As shown in FIG. 9, the device includes:

a memory 201 for storing speech synthesis rules;

the processor 202 is configured to acquire text information to be processed, and perform off-line speech synthesis processing on the text information according to the speech synthesis rule to obtain TTS speech; and, if it is determined that the playing strategy of the TTS voice is playing through the intelligent device in the distributed network, send the TTS voice to the intelligent device, so that the intelligent device plays the TTS voice when it determines that the preset playing condition is met.

According to the device for processing TTS voice, when the text information to be processed is acquired, off-line speech synthesis processing is performed on the text information according to the preset speech synthesis rule to obtain the TTS voice; and if it is determined that the playing strategy of the TTS voice is playing through the intelligent device in the distributed network, the TTS voice is sent to the corresponding intelligent device, so that the intelligent device plays the TTS voice when it determines that the preset playing condition is met. Text information is thus converted into TTS voice and played automatically over the distributed network without manual reading, which greatly improves the information transmission efficiency; because the intelligent device can play the TTS voice whenever the playing condition is determined to be met, the problem that the number of times information can currently be disseminated is limited is effectively solved; moreover, performing off-line speech synthesis processing on the text information reduces the information dissemination cost and avoids the problem that speech synthesis cannot be performed when no server for on-line speech synthesis is deployed.

Optionally, the processor 202 receives an information processing request sent by a designated device, and acquires text information to be processed from the information processing request; or,

acquires an information processing request in response to an information processing operation of a user, and acquires text information to be processed from the information processing request.

Optionally, if it is determined that the length of the text information is greater than a first preset length, the processor 202 performs segmentation processing on the text information according to a preset segmentation rule to obtain a plurality of sub-texts;

dividing the sub-texts into a plurality of sub-text sets according to preset synthesis conditions;

and performing off-line speech synthesis processing on the sub-texts in each sub-text set to obtain corresponding TTS speech.

Optionally, the processor 202 determines a target text length corresponding to a maximum voice playing length of the intelligent device for performing single voice playing;

comparing the first preset length with the target text length, and determining the smaller of the two as the standard length;

and if the length of the text message is determined to be larger than the standard length, segmenting the text message according to a preset segmentation rule to obtain a plurality of sub-texts.

Optionally, the processor 202 determines the text information as a text to be divided, and detects punctuation marks in the text to be divided according to a sequence from front to back;

and determining whether the detected punctuation mark is a first preset mark, and if so, performing segmentation processing at the position of the punctuation mark to obtain the corresponding sub-text.

Optionally, the processor 202 determines whether the length of the text to be divided before the punctuation mark is greater than the standard length;

if so, determining a segmentation position in the text to be segmented according to a preset mode, and performing segmentation processing at the segmentation position so that the length of the obtained sub-text does not exceed the standard length;

if not, performing segmentation processing at the position of the punctuation mark to obtain a corresponding sub-text.
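The standard-length selection and punctuation-based segmentation described in the preceding paragraphs can be sketched together. This is an illustration under stated assumptions: the set `SPLIT_MARKS` is a guess at what the "first preset marks" might contain, the "preset mode" for over-long spans is taken to be a hard cut every `std_len` characters, each punctuation mark is kept with the sub-text it ends, and lengths are counted in characters. All function names are hypothetical.

```python
from typing import List

# Assumed "first preset marks": sentence-ending punctuation, Chinese and Latin.
SPLIT_MARKS = set("。！？；.!?;")


def standard_length(first_preset_length: int, target_text_length: int) -> int:
    """The smaller of the synthesis limit and the playback-derived limit."""
    return min(first_preset_length, target_text_length)


def _hard_split(piece: str, std_len: int) -> List[str]:
    """Cut a span into chunks of at most std_len characters (assumed preset mode)."""
    return [piece[i:i + std_len] for i in range(0, len(piece), std_len)]


def segment_text(text: str, std_len: int) -> List[str]:
    """Split at preset punctuation; force extra cuts where a span is too long."""
    subtexts: List[str] = []
    start = 0
    for i, ch in enumerate(text):
        if ch in SPLIT_MARKS:
            subtexts.extend(_hard_split(text[start:i + 1], std_len))
            start = i + 1
    subtexts.extend(_hard_split(text[start:], std_len))  # trailing text without punctuation
    return subtexts


print(segment_text("ab.cdefgh!ij", standard_length(6, 4)))  # ['ab.', 'cdef', 'gh!', 'ij']
```

Every returned sub-text is at most `std_len` characters long, which is the property the later set-division steps rely on.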

Optionally, the processor 202 sequentially determines the current sub-texts to be divided;

determining, in order from front to back among the current sub-texts to be divided, N sub-texts whose total length is greater than the standard length and of which the first N-1 sub-texts have a total length not greater than the standard length, wherein N is an integer greater than 1;

and dividing the first N-1 sub-texts into a sub-text set.

Optionally, the processor 202 determines the corresponding sub-text obtained by the segmentation processing as a current sub-text;

determining, in order from front to back, whether an undivided sub-text exists before the current sub-text;

if so, determining the current sub-text and the sub-texts which are not divided as the sub-texts to be divided currently;

if not, determining the current sub-text as the sub-text to be divided.

Optionally, the processor 202 determines whether the division processing of the sub-text set is performed for the first time;

if so, determining each sub-text obtained by the segmentation processing as the current sub-texts to be divided;

if not, determining, among the sub-texts obtained by the segmentation processing, the sub-text to which the preset to-be-divided mark has been added, and determining that sub-text and the sub-texts after it, in order from front to back, as the current sub-texts to be divided;

after dividing the first N-1 sub-texts into a sub-text set, the processor 202 further performs the following:

if it is determined that sub-texts to be divided remain after the Nth sub-text, adding the to-be-divided mark to the Nth sub-text.

Optionally, the processor 202 determines the length of the TTS speech that has been currently synthesized;

and if the length of the TTS voice synthesized currently is determined to be not less than a second preset length, sending the TTS voice synthesized currently to the intelligent equipment.

Optionally, the processor 202 is configured to encode the TTS speech to obtain a TTS speech in a first preset format;

and sending the TTS voice in the first preset format to the intelligent equipment.

Optionally, the processor 202 sends the text information to the smart device, so that the smart device displays the text information in response to a query request of a user for the text information.

Optionally, the processor 202 determines whether playing information of the TTS voice has been acquired;

if yes, determining a playing strategy of the TTS voice according to playing information;

if not, determining the default playing strategy as the TTS voice playing strategy.

Optionally, the processor 202 determines, according to the playing information, a target intelligent device to be sent in the distributed network where the target intelligent device is located;

and sending an information sending request to the routing equipment in the distributed network according to the equipment information of the target intelligent equipment and the TTS voice so that the routing equipment sends the TTS voice to the target intelligent equipment.

Optionally, the apparatus further comprises: a speaker module;

the processor 202 stores the text information if the playing strategy is determined to be direct playing;

and the loudspeaker module plays the TTS voice.

Optionally, if it is determined that the playing strategy is deferred playing, the processor 202 encodes the TTS speech to obtain a TTS speech in a second preset format;

the memory 201 stores the TTS voice in the second preset format;

and if the playing condition of the TTS voice in the second preset format is determined to be met, the processor 202 performs playing processing on the TTS voice in the second preset format.

In addition, for the above device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to partial description of the method embodiment. Further, it should be noted that, among the respective components of the apparatus of the present invention, the components thereof are logically divided according to the functions to be realized, but the present invention is not limited thereto, and the respective components may be newly divided or combined as necessary.

Based on the same technical concept, one or more embodiments of the present application further provide a system for processing TTS speech. FIG. 10 is a schematic composition diagram of the system for processing TTS speech according to one or more embodiments of the present application. As shown in FIG. 10, the system includes: a supervisor device 301 and at least one intelligent device 302;

the supervisor device 301 is configured to perform relevant processing on the TTS speech according to the TTS speech processing method;

the intelligent device 302 receives the TTS voice sent by the supervisor device 301, and if it is determined that the TTS voice meets a preset playing condition, plays the TTS voice.

Optionally, as shown in FIG. 11, the system further includes a routing device 303, and the supervisor device 301 is connected with the intelligent device 302 through the routing device 303 to establish a distributed network;

the supervisor device 301 is specifically configured to send the TTS voice obtained according to the TTS voice processing method to the routing device 303;

the routing device 303 is configured to send the received TTS voice to the intelligent device 302;

the intelligent device 302 receives the TTS voice sent by the routing device 303.

According to the system for processing TTS voice, when the text information to be processed is acquired, off-line speech synthesis processing is performed on the text information according to the preset speech synthesis rule to obtain the TTS voice; and if it is determined that the playing strategy of the TTS voice is playing through the intelligent device in the distributed network, the TTS voice is sent to the corresponding intelligent device, so that the intelligent device plays the TTS voice when it determines that the preset playing condition is met. Text information is thus converted into TTS voice and played automatically over the distributed network without manual reading, which greatly improves the information transmission efficiency; because the intelligent device can play the TTS voice whenever the playing condition is determined to be met, the problem that the number of times information can currently be disseminated is limited is effectively solved; moreover, performing off-line speech synthesis processing on the text information reduces the information dissemination cost and avoids the problem that speech synthesis cannot be performed when no server for on-line speech synthesis is deployed.

In addition, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for relevant points, reference may be made to partial description of the method embodiment.

Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and referring to fig. 12, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, but may also include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the TTS speech processing device on the logic level. Of course, besides the software implementation, the present application does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or logic devices.

The network interface, the processor and the memory may be interconnected by a bus system. The bus may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 12, but that does not indicate only one bus or one type of bus.

The memory is used for storing programs. In particular, the program may include program code comprising computer operating instructions. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. The memory may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.

The processor is used for executing the program stored in the memory and specifically executing:

acquiring text information to be processed;

The method executed by the device for processing TTS speech according to the embodiment shown in FIG. 9 of the present application can be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads information in the memory and completes the steps of the method in combination with its hardware.

Based on the same technical concept, embodiments of the present application further provide a computer-readable storage medium storing one or more programs, which when executed by an electronic device including a plurality of application programs, cause the electronic device to execute the method for processing TTS speech provided by any one of fig. 1 to 8.

The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
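As a non-limiting illustration of a software embodiment, the punctuation-based segmentation of overlong text information may be sketched as follows. The function name `segment_text`, the parameter names, the punctuation set, and the fallback of hard-splitting a piece that exceeds the standard length are all assumptions made for this example only and are not specified by the present application.

```python
# Illustrative sketch only; names and the hard-split fallback are assumptions.
PUNCTUATION = set("，。！？；、,.!?;")  # hypothetical set of cut points


def _emit(piece, standard_length, out):
    """Append piece to out, hard-splitting it if it exceeds standard_length."""
    while len(piece) > standard_length:
        out.append(piece[:standard_length])
        piece = piece[standard_length:]
    if piece:
        out.append(piece)


def segment_text(text, first_preset_length, standard_length):
    """Split text into sub-texts at punctuation marks, scanning front to back."""
    # Text that does not exceed the first preset length is not segmented.
    if len(text) <= first_preset_length:
        return [text]
    sub_texts = []
    start = 0
    for i, ch in enumerate(text):
        if ch in PUNCTUATION:
            # Cut right after the punctuation mark.
            _emit(text[start:i + 1], standard_length, sub_texts)
            start = i + 1
    # Any trailing text without a closing punctuation mark.
    _emit(text[start:], standard_length, sub_texts)
    return sub_texts


if __name__ == "__main__":
    for part in segment_text("Hello, world. This is a notice!", 10, 20):
        print(repr(part))
```

In this sketch every sub-text is at most `standard_length` characters long and the sub-texts concatenate back to the original text, so each one can be synthesized separately without losing content.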

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method of TTS speech processing, comprising:

acquiring text information to be processed;

2. The method according to claim 1, wherein the acquiring text information to be processed comprises:

receiving an information processing request sent by a designated device, and acquiring text information to be processed from the information processing request; or,

3. The method of claim 1, wherein performing an offline speech synthesis process on the text information according to a preset speech synthesis rule to obtain TTS speech comprises:

if the length of the text information is determined to be greater than a first preset length, segmenting the text information according to a preset segmentation rule to obtain a plurality of sub-texts;

4. The method according to claim 3, wherein if it is determined that the length of the text information is greater than a first preset length, the segmenting the text information according to a preset segmentation rule to obtain a plurality of sub-texts comprises:

determining a target text length corresponding to the maximum voice playing length of the intelligent equipment for performing single voice playing;

5. The method according to claim 4, wherein the segmenting the text information according to a preset segmentation rule to obtain a plurality of sub-texts comprises:

determining the text information as a text to be cut, and detecting punctuation marks in the text to be cut according to the sequence from front to back;

6. The method according to claim 5, wherein the performing segmentation processing at the position of the punctuation mark to obtain a corresponding sub-text comprises:

determining whether the length of the text to be cut before the punctuation mark is greater than the standard length;

7. The method of claim 5, wherein the dividing the sub-texts into a plurality of sub-text sets according to a preset synthesis condition comprises:

sequentially determining the current sub-text to be divided;

and dividing the first N-1 sub texts into a sub text set.

8. The method of claim 7, wherein the determining the current sub-text to be divided comprises:

determining the corresponding sub-text obtained by the segmentation processing as a current sub-text;

if not, determining the current sub-text as the sub-text to be divided.

9. The method of claim 7, wherein the determining the current sub-text to be divided comprises:

determining whether the division processing of the sub-text set is performed for the first time;

10. The method of claim 3, further comprising, after obtaining the corresponding TTS speech:

determining the length of the currently synthesized TTS speech;

11. The method of claim 1, wherein said sending said TTS speech to said smart device comprises:

coding the TTS voice to obtain the TTS voice in a first preset format;

12. The method of claim 1, further comprising:

sending the text information to the intelligent equipment, so that the intelligent equipment displays the text information in response to a query request of a user for the text information.

13. The method of claim 1, further comprising:

determining whether playing information of the TTS voice is acquired;

14. The method of claim 1, further comprising:

15. The method of claim 1, further comprising:

if it is determined that the playing strategy of the TTS voice is to suspend playing, coding the TTS voice to obtain the TTS voice in a second preset format;

saving the TTS voice in the second preset format; and

if it is determined that the playing condition of the TTS voice in the second preset format is met, playing the TTS voice in the second preset format.

16. A device for processing TTS speech, comprising:

a memory for storing speech synthesis rules;

a processor for acquiring text information to be processed and performing off-line speech synthesis processing on the text information according to the speech synthesis rules to obtain TTS speech; and, if it is determined that the playing strategy of the TTS voice is playing through intelligent equipment in a distributed network, sending the TTS voice to the intelligent equipment, so that the intelligent equipment plays the TTS voice when it determines that the TTS voice meets a preset playing condition.

17. A system for processing TTS speech, comprising: a supervisor device and at least one smart device;

the supervisor device being configured to perform TTS speech related processing according to the method of any of claims 1-15;

18. An electronic device, comprising: a processor, a communication interface, a memory, and a communication bus; wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to perform the steps of the method of any of claims 1 to 15.

19. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 15.