CN112735372A - Outbound voice output method, device and equipment - Google Patents


Info

Publication number
CN112735372A
Authority
CN
China
Prior art keywords
outbound
clause
variable
voice file
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011608326.4A
Other languages
Chinese (zh)
Inventor
简仁贤
邓茜
王海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Emotibot Technologies Ltd
Original Assignee
Emotibot Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Emotibot Technologies Ltd filed Critical Emotibot Technologies Ltd
Priority to CN202011608326.4A priority Critical patent/CN112735372A/en
Publication of CN112735372A publication Critical patent/CN112735372A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/42: Systems providing special services or facilities to subscribers
    • H04M3/42136: Administration or customisation of services

Abstract

Through the content of the variable script slots in variable-script clauses, the scheme can flexibly adapt to variables that differ between outbound tasks, such as the call object and the address, so that pre-generated voice files can cover most of the content of an outbound statement in an actual outbound scenario. Voice files for most of the content of an outbound statement can therefore be generated in advance and directly called and output when needed, which effectively increases the outbound speech speed, adapts flexibly to different application scenarios, and avoids stalls in the voice broadcast.

Description

Outbound voice output method, device and equipment
Technical Field
The present application relates to the field of information technology, and in particular, to a method, an apparatus, and a device for outputting outbound voice.
Background
With the development of artificial-intelligence algorithms, intelligent outbound systems are being adopted by more and more call centers. However, current intelligent outbound systems still face a series of problems, such as inaccurate speech recognition and slow speech output during man-machine conversation. Slow speech output during a conversation gives users a very poor experience, and vendors of intelligent outbound systems address it in different ways. One approach adds a voice cache to the TTS (Text To Speech) service, so that audio for exactly the same text content is served from the cache instead of being generated in real time. Another approach generates the voice file in real time but in segments: the outbound statement is split at punctuation marks, the TTS service is called to generate and play the voice file of the first clause, the voice files of the subsequent clauses are generated while the earlier ones play, and the files are played in sequence.
The above methods solve some of the problems but each has drawbacks. Although a TTS voice cache can address the output speed of some statements, an actual intelligent outbound system needs different outbound statements for different call objects, and it is difficult to exhaust in the cache every outbound statement that might be output under all conditions. In practice, therefore, the applicable scenarios of the TTS voice-cache technique are not flexible enough, and the problem cannot be fully solved this way.
Segmented speech synthesis can indeed speed up some outbound speech output. However, because the outbound statement is split at punctuation marks, the output speed depends entirely on where the punctuation falls, and some clauses can easily be too long. If the text of the first clause is too long, the speech output is still slow; if the text of a middle clause is too long, the voice of the previous clause may finish playing before the voice of the next clause has been generated, so the call object hears a stalled voice broadcast.
Therefore, the schemes adopted in existing intelligent outbound systems do not adequately solve the problem of slow speech output.
Disclosure of Invention
An object of the present application is to provide an outbound voice output scheme that solves the problem of slow speech output in an outbound system.
In order to achieve the above object, the present application provides an outbound voice output method, including:
acquiring a fixed-script clause and a variable-script clause in an outbound statement, and calling a text-to-speech service to generate a first voice file of the fixed-script clause, wherein the variable-script clause comprises a variable script slot;
determining the content of the variable script slot in the variable-script clause according to variable information of an outbound task, and calling the text-to-speech service to generate a second voice file of the variable-script clause;
splicing the first voice file and/or the second voice file to obtain a pre-generated voice file corresponding to the outbound statement; and
calling and outputting the pre-generated voice file according to a target outbound statement corresponding to the outbound task.
The present application also provides an outbound voice output device, comprising:
a script acquisition module, configured to acquire a fixed-script clause and a variable-script clause in an outbound statement, wherein the variable-script clause comprises a variable script slot;
a voice pre-generation module, configured to call a text-to-speech service to generate a first voice file of the fixed-script clause, determine the content of the variable script slot in the variable-script clause according to variable information of an outbound task, and call the text-to-speech service to generate a second voice file of the variable-script clause;
a voice splicing module, configured to splice the first voice file and/or the second voice file to obtain a pre-generated voice file corresponding to the outbound statement; and
a voice output module, configured to call and output the pre-generated voice file according to a target outbound statement corresponding to the outbound task.
The present application further provides an outbound voice output device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the device to perform the outbound voice output method described above.
In addition, a computer-readable medium is provided, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the outbound voice output method described above.
Compared with the prior art, in the outbound voice output scheme provided by the present application, the fixed-script clause and the variable-script clause in an outbound statement are acquired and the text-to-speech service is called to generate the first voice file of the fixed-script clause. Then, before the outbound task in an outbound scenario is executed, the content of the variable script slot in the variable-script clause is determined according to the variable information of the outbound task, the text-to-speech service is called to generate the second voice file of the variable-script clause, and the pre-generated voice file corresponding to the outbound statement is obtained by splicing the first voice file and/or the second voice file. Because the content of the variable script slot can flexibly adapt to the variables that differ between outbound tasks, such as the call object and the address, the pre-generated voice file can cover most of the content of the outbound statement in an actual outbound scenario, generally more than 95 percent and, in some outbound scenarios, even 100 percent. Voice files for most of the content of the outbound statement can therefore be generated in advance and directly called and output when needed, which effectively increases the outbound speech speed, adapts flexibly to different application scenarios, and avoids stalls in the voice broadcast.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a method for outputting outbound voice in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an outbound system for implementing outbound voice output by using the solution provided in the embodiment of the present application;
fig. 3 is an interaction flowchart of an outbound call system when implementing a fast outgoing call in an embodiment of the present application;
the same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a typical configuration of the present application, the terminal and the devices serving the network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Through the content of the variable script slots in variable-script clauses, the method can flexibly adapt to variables that differ between outbound tasks, such as the call object and the address, and the pre-generated voice files can cover most of the content of the outbound statements in an actual outbound scenario. Voice files for most of the content of an outbound statement can therefore be generated in advance and directly called and output when needed, which effectively increases the outbound speech speed, adapts flexibly to different application scenarios, and avoids stalls in the voice broadcast.
In an actual scenario, the execution subject of the method may be a user device, a network device, a device formed by integrating a user device and a network device through a network, or an application program running on such a device. The user device includes, but is not limited to, terminal devices such as computers, mobile phones, and tablet computers; the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a cloud-computing-based collection of computers. Here, the cloud is made up of a large number of hosts or network servers based on cloud computing, a type of distributed computing in which one virtual computer consists of a collection of loosely coupled computers.
Fig. 1 shows a processing flow of an outbound voice output method in an embodiment of the present application, where the method at least includes the following processing steps:
Step S101: acquire a fixed-script clause and a variable-script clause in the outbound statement, and call a text-to-speech service to generate a first voice file of the fixed-script clause.
The outbound statement may correspond to an outbound scenario, that is, a specific application scenario of the outbound call; different outbound scenarios, such as return visit, notification, and collection, may be set according to the purpose of the call. Taking the return-visit scenario as an example, 10 sentences that may be used in multiple rounds of conversation with a user may be preset as the outbound statements corresponding to that scenario. During the actual voice conversation with the user, the required sentences are selected from these 10 outbound statements as needed to complete the conversation.
Each outbound statement can be divided into at least one clause, and the division is performed according to division identifiers. A division identifier may be specific content of the outbound statement itself, such as the punctuation marks (periods, commas, semicolons, and so on) in the statement, or it may be an additional identifier inserted at specific positions in the statement, according to the division requirements, when the outbound statement is configured. The two kinds of identifiers can also be combined, serving respectively as a first division identifier and a second division identifier for dividing clauses in two stages.
The clauses in an outbound statement include at least two types: fixed-script clauses and variable-script clauses. A fixed-script clause is fixed content that contains no variable part and does not change with the variable information of the call task. Taking the outbound statement "Hello, Mr. XX, welcome to call" as an example, with punctuation marks as division identifiers it can be divided into three clauses: "Hello", "Mr. XX", and "welcome to call". The clauses "Hello" and "welcome to call" are fixed-script clauses; even if two call tasks pay return visits to client A and client B respectively, the content of these two clauses does not change. For a fixed-script clause, the TTS service can be called directly after acquisition to generate the first voice file of that clause.
A variable-script clause is clause content that comprises a variable script slot, where the specific content of the slot is related to the specific call task in the outbound scenario and is determined by the variable information of the actual call task. For example, "XX" in the clause "Mr. XX" above corresponds to a variable script slot whose content is determined by the call-object variable of the call task: if the task is a return visit to client A, the clause becomes "Mr. A"; if it is a return visit to client B, the clause becomes "Mr. B".
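The clause division and classification described so far can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `$global{...}` slot syntax is taken from the examples later in this description, and the function names are hypothetical.

```python
import re

# Slot pattern based on the examples later in this description:
# "$global{meta_name}" marks a variable script slot.
VARIABLE_SLOT = re.compile(r"\$global\{[^}]+\}")

def split_clauses(outbound_statement):
    """Divide an outbound statement into clauses at punctuation marks
    (the division identifiers), dropping empty fragments."""
    parts = re.split(r"[,.;!?，。；！？]", outbound_statement)
    return [p.strip() for p in parts if p.strip()]

def classify_clause(clause):
    """A clause with no slot is a fixed-script clause; a clause
    containing a variable script slot is a variable-script clause."""
    return "variable" if VARIABLE_SLOT.search(clause) else "fixed"

clauses = split_clauses("Hello, $global{meta_name}, welcome to call.")
kinds = [classify_clause(c) for c in clauses]
```

Here the fixed-script clauses ("Hello", "welcome to call") would go straight to TTS, while the variable-script clause waits until the task's variable information is known.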
Step S102: determine the content of the variable script slot in the variable-script clause according to the variable information of the outbound task, and call the text-to-speech service to generate a second voice file of the variable-script clause.
Because the content of the variable script slot is related to the specific variable information of each call task, the TTS service cannot be called to generate the corresponding second voice file immediately after the variable-script clause is obtained. In an actual scenario, however, there is a certain time gap between starting a call task and executing it, and the variable information of the task is already determined by then. Therefore, before the call task in the outbound scenario is executed, the content of the variable script slot in the variable-script clause can be determined from the variable information of the outbound task, and the TTS service can be called to generate the second voice file of the variable-script clause.
Taking the foregoing scenario as an example, when a customer-service agent needs to pay a return visit to client A, the agent searches for or enters the client's information through the interactive interface of the outbound system and starts the call task. After the call scheduling center of the outbound system receives the command to start the task, it can determine that the call object is client A. Before the call is made, it therefore triggers the other functional modules of the outbound system to call the TTS service and generate the second voice file of the variable-script clause "Mr. A", thereby pre-generating the voice file of the variable-script clause before the actual call.
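The slot-filling step before the call can be sketched as below. The slot syntax follows the later examples; `tts_synthesize` is a hypothetical stand-in for the real TTS service, not part of the patent's scheme.

```python
import re

SLOT = re.compile(r"\$global\{([^}]+)\}")

def fill_variable_slots(clause, task_variables):
    """Resolve each variable script slot from the outbound task's
    variable information (e.g. the call object's name)."""
    return SLOT.sub(lambda m: task_variables[m.group(1)], clause)

def tts_synthesize(text):
    """Hypothetical stand-in for invoking the TTS service."""
    return b"WAV:" + text.encode("utf-8")  # placeholder audio bytes

# Variable information known once the call task is started:
task = {"meta_name": "Mr. A"}
filled = fill_variable_slots("$global{meta_name}", task)
second_voice_file = tts_synthesize(filled)
```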
Step S103: splice the first voice file and/or the second voice file to obtain a pre-generated voice file corresponding to the outbound statement. In practice, a specific outbound statement may contain only one type of clause or several types. For example, the outbound statement "Thank you, goodbye" contains two fixed-script clauses, while the statement "Hello, Mr. XX, welcome to call" mentioned above contains both fixed-script and variable-script clauses. Therefore, when splicing the pre-generated voice file, the first voice file and/or the second voice file are selected according to the needs of the actual scenario to obtain the file corresponding to the outbound statement.
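Splicing per-clause audio into one pre-generated voice file can be illustrated with the standard WAV container, assuming all clause files share one audio format. This is only a sketch of the splicing step; the patent does not specify a file format, and the silent clips stand in for TTS output.

```python
import io
import wave

def make_silence(ms, rate=8000):
    """Create an in-memory mono 16-bit PCM WAV of silence (stands in
    for a TTS-generated clause voice file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * ms / 1000))
    buf.seek(0)
    return buf

def splice(voice_files):
    """Concatenate WAV files sharing one format into a single
    pre-generated voice file for the outbound statement."""
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        for i, f in enumerate(voice_files):
            with wave.open(f, "rb") as src:
                if i == 0:
                    dst.setparams(src.getparams())
                dst.writeframes(src.readframes(src.getnframes()))
    out.seek(0)
    return out

# Splice a 100 ms and a 200 ms clip, as vc-1 + vc-2 -> vc-12:
vc_12 = splice([make_silence(100), make_silence(200)])
with wave.open(vc_12, "rb") as w:
    total_frames = w.getnframes()
```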
Step S104: call and output the pre-generated voice file according to the target outbound statement corresponding to the outbound task. Executing the outbound task is the process of holding a voice conversation with the call object during the actual call, and at that point the pre-generated voice file corresponding to the outbound statement needs to be output to the call object. The outbound statements obtained in advance may cover every statement that might be used, while a particular outbound task may use only some of them; for example, 4 of the 10 candidate outbound statements may be selected to complete 4 rounds of conversation with the user. The target outbound statement is the outbound statement that the outbound task actually needs, and when it needs to be output, its pre-generated voice file is called and output directly.
In this way, the content of the variable script slot in the variable-script clause flexibly adapts to the variables that differ between outbound tasks, such as the call object and the address, and the pre-generated voice file covers most of the content of the outbound statement in the actual outbound scenario. Voice files for most of the content of the outbound statements can therefore be generated in advance and directly called and output when needed (for example, when the outbound task is executed), which effectively increases the outbound speech speed, adapts flexibly to different application scenarios, and avoids stalls in the voice broadcast.
In an actual scenario, the variable content in a clause may also be a real-time script slot rather than a variable script slot. The content of a real-time script slot cannot be predicted before the outbound task is executed; it must be acquired in real time during the voice conversation with the call object. For example, if the call object says, "Sorry, it is not convenient right now, please call me back in an hour", the outbound statement that needs to be replied is, "OK, I will call you again in an hour". The real-time redial information, "an hour", corresponds to a real-time script slot. Because "an hour" cannot be known before the outbound task is executed, a voice file for a statement containing this content cannot be pre-generated; it must be generated and output in real time as needed.
Therefore, in some embodiments of the present application, when the fixed-script clause and the variable-script clause in the outbound statement are acquired, the outbound statement is first divided into several clauses according to the division identifiers, and each clause is then checked for variable script slots and real-time script slots. If a clause contains neither a variable script slot nor a real-time script slot, it is determined to be a fixed-script clause; if it contains a variable script slot but no real-time script slot, it is determined to be a variable-script clause. In this way the fixed-script clauses and variable-script clauses in the outbound statement are obtained.
In addition, if a clause contains a real-time script slot, it is determined to be a real-time-script clause. For a real-time-script clause, the content of the real-time script slot is determined from the real-time information acquired while the outbound task is executed, the text-to-speech service is called in real time to generate a third voice file of the real-time-script clause, and the third voice file is output.
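The three-way classification above can be sketched with the slot syntax of the example that follows ("$global{...}" for variable script slots, "${...}" for real-time script slots); the regular expressions are an assumption inferred from that example.

```python
import re

VARIABLE_SLOT = re.compile(r"\$global\{[^}]+\}")  # e.g. $global{meta_name}
REALTIME_SLOT = re.compile(r"\$\{[^}]+\}")        # e.g. ${meta_redial_time}

def classify(clause):
    """Per the description: a clause with a real-time script slot is a
    real-time-script clause; otherwise one with a variable script slot
    is a variable-script clause; otherwise it is a fixed-script clause."""
    if REALTIME_SLOT.search(clause):
        return "real-time"
    if VARIABLE_SLOT.search(clause):
        return "variable"
    return "fixed"

kinds = [classify(c) for c in
         ["Hello",
          "$global{meta_name}",
          "I will call you again in ${meta_redial_time}"]]
```

Note that the real-time pattern requires "{" immediately after "$", so it does not match the variable-slot form "$global{...}".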
For example, consider the outbound statement "Hello, Mr. $global{meta_name}, I will call you again in ${meta_redial_time}", where $global{meta_name} is a variable script slot and ${meta_redial_time} is a real-time script slot. After dividing it at the punctuation marks into three clauses, it can be determined that "Hello" is a fixed-script clause, "Mr. $global{meta_name}" is a variable-script clause, and "I will call you again in ${meta_redial_time}" is a real-time-script clause.
For the fixed-script clause "Hello", the TTS service can be called directly to generate the corresponding first voice file vc-1. For the variable-script clause "Mr. $global{meta_name}", after the variable information is determined and before the call task is executed, the TTS service is called to generate the corresponding second voice file vc-2, and vc-1 and vc-2 are spliced as needed into the pre-generated voice file vc-12 of the outbound statement. For the real-time-script clause "I will call you again in ${meta_redial_time}", the specific content of the real-time script slot ${meta_redial_time} must be determined from real-time information in the conversation with the call object while the call task is executed; for example, once the redial time in the current conversation is determined to be one hour, the TTS service can be called in real time to generate the third voice file vc-3 of the real-time-script clause, and vc-3 is then called and output.
In an actual scenario, it can be agreed in advance that when an outbound statement containing a real-time script slot is output, the related voice files may be called and output in several parts. For example, in this embodiment, the pre-generated voice file vc-12 for the first half of the sentence may be called and played first, and the third voice file vc-3, generated after the real-time information is determined, may then be called and played, achieving fast voice output of the sentence "Hello, Mr. XX, I will call you again in an hour".
In other embodiments of the present application, when the fixed-script clause and the variable-script clause in the outbound statement are acquired, the outbound statement may first be divided according to a first division identifier into a first clause containing a real-time script slot and a second clause containing no real-time script slot. The second clause is then divided into several clauses according to a second division identifier: a clause containing no variable script slot is determined to be a fixed-script clause, and a clause containing a variable script slot is determined to be a variable-script clause.
The first division identifier may be an additional identifier added, when the outbound statement for the outbound scenario is configured, around the part containing the real-time script slot; for example, "[[ ]]" may be used as the first division identifier in an actual scenario, and the punctuation marks in the statement may serve as the second division identifier. Thus, the aforementioned outbound statement "Hello, Mr. $global{meta_name}, I will call you again in ${meta_redial_time}" may be stored at configuration time as "Hello, Mr. $global{meta_name}. [[I will call you again in ${meta_redial_time}.]]", and during processing it is divided into the two clauses "Hello, Mr. $global{meta_name}" and "I will call you again in ${meta_redial_time}".
In this process, the second clause "Hello, Mr. $global{meta_name}" is further divided at the punctuation mark into the two clauses "Hello" and "Mr. $global{meta_name}", a fixed-script clause and a variable-script clause respectively. The corresponding voice files vc-1 and vc-2 can then be generated in advance by calling the TTS service in the manner described above and spliced into the pre-generated voice file vc-12. For the first clause, the corresponding voice file cannot be generated in advance; instead, while the outbound task is executed, the content of the real-time script slot in the first clause is determined from the real-time information acquired during execution, the text-to-speech service is called in real time to generate a fourth voice file vc-4 of the first clause, and the fourth voice file is output. In an actual scenario, the related voice files can be called and output in several parts in the manner described above, achieving fast voice output of the sentence "Hello, Mr. XX, I will call you again in an hour".
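The two-stage division above can be sketched as follows, assuming "[[ ]]" as the first division identifier and punctuation as the second; the identifier syntax and function names are illustrative reconstructions, not a definitive implementation.

```python
import re

FIRST_DIVISION = re.compile(r"\[\[(.+?)\]\]")  # assumed "[[ ]]" marker

def two_stage_split(configured_statement):
    """Stage 1: extract first clauses (wrapped in the additional
    "[[ ]]" identifier, containing real-time script slots).
    Stage 2: split the remaining second clause at punctuation."""
    first_clauses = FIRST_DIVISION.findall(configured_statement)
    remainder = FIRST_DIVISION.sub("", configured_statement)
    second_clauses = [p.strip() for p in re.split(r"[,.;，。；]", remainder)
                      if p.strip()]
    return first_clauses, second_clauses

first, second = two_stage_split(
    "Hello, $global{meta_name}. [[I will call you again in ${meta_redial_time}.]]")
```

The second clauses can then be classified and pre-generated as before, while each first clause waits for its real-time information during the call.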
In addition, in some embodiments of the present application, after the pre-generated voice file corresponding to the outbound statement is obtained by splicing the first voice file and/or the second voice file, tag information for the pre-generated voice file may be determined from the outbound statement, added to the file, and the file then stored in the database. The tag information is any information that can identify and retrieve the pre-generated voice file. For example, the digest obtained by hashing the text of the outbound statement with MD5 (Message-Digest Algorithm 5) may be used: when the pre-generated voice file needs to be stored in the database, an MD5 value is computed over the text of the outbound statement, the corresponding pre-generated voice file is named with that MD5 value, and the file is then stored in the database.
In other embodiments of the present application, the tag information may also combine the MD5 value with generation-parameter information of the pre-generated voice file. The generation parameters may be the TTS parameters used when the TTS service generated the voice file, such as voice type, speech speed, and volume. These TTS parameters can be provided by the user according to the requirements of the outbound scenario before the TTS service is called; in an actual scenario, the user selects them in the interactive interface, and the TTS service synthesizes speech with those parameters to produce the required voice file. Accordingly, when the pre-generated voice file needs to be stored in the database, the MD5 value of the text of the outbound statement is computed, the TTS parameters used by the TTS service to generate the first voice file and/or the second voice file of the pre-generated file are obtained, the file is named with the MD5 value plus the TTS parameters, and it is then stored in the database.
Correspondingly, when a pre-generated voice file needs to be output (for example, when an outbound task is executed), the database can be searched for the corresponding pre-generated voice file using target tag information as the search condition, where the target tag information is the tag information determined from the target outbound statement of the current outbound task. If a pre-generated voice file matching the search condition is found, it is output. If none is found, the target outbound statement corresponding to the search condition contains a real-time script slot and could not be generated in advance, so the TTS service is called in real time to generate a fifth voice file of the target outbound statement, and the fifth voice file is output.
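The naming and lookup scheme can be sketched as below. The key format (MD5 value plus TTS parameters) follows the description; the dictionary store and the `tts_service` callback are hypothetical stand-ins for the MinIO database and the real TTS service.

```python
import hashlib

def cache_key(statement_text, tts_params):
    """Name a pre-generated voice file by the MD5 of the outbound
    statement text plus the TTS parameters used to synthesize it."""
    md5 = hashlib.md5(statement_text.encode("utf-8")).hexdigest()
    params = "-".join(f"{k}={tts_params[k]}" for k in sorted(tts_params))
    return f"{md5}-{params}"

database = {}  # stands in for the voice-file store

def store(text, params, voice_file):
    database[cache_key(text, params)] = voice_file

def fetch_or_synthesize(text, params, tts_service):
    """Search by target tag information; if no pre-generated file
    matches (e.g. the statement involves a real-time script slot),
    fall back to calling the TTS service in real time."""
    key = cache_key(text, params)
    return database.get(key) or tts_service(text)

params = {"voice": "female-1", "speed": 1.0, "volume": 80}
store("Hello, welcome to call.", params, b"pre-generated")
hit = fetch_or_synthesize("Hello, welcome to call.", params,
                          lambda t: b"realtime")
miss = fetch_or_synthesize("I will call you again in one hour.", params,
                           lambda t: b"realtime")
```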
Fig. 2 shows the structure of an outbound system that implements outbound voice output using the solution provided in the embodiments of the present application. The outbound system includes a call dispatch center 210, a multi-turn task engine 220, a TTS pre-generation service 230, a TTS service 240, a MinIO database 250, an MRCP service 260 and a soft switch service 270, and is used to realize voice communication between a robot 280 and a calling object 290. The call dispatch center 210 triggers pre-generation of voice files before a call, the multi-turn task engine 220 stores the outbound sentences required in a given application scene, the TTS pre-generation service 230 calls the TTS service to pre-generate voice files, the TTS service 240 synthesizes voice files from the text of outbound sentences, the MinIO database 250 stores the generated voice files, the MRCP service 260 retrieves the generated voice files during a call for the soft switch service to use, and the soft switch service 270 manages the communication connection between the robot 280 and the calling object 290.
When implementing fast outbound calls, the outbound system may use the interactive flow shown in Fig. 3, which includes the following steps:
1. Multi-turn scene configuration
Before an outbound task is carried out, the multi-turn dialogue information of the outbound scene needs to be written in advance; this information controls the outbound sentences of the robot. In the multi-turn dialogue information, the client can configure the robot's outbound sentences according to the outbound scene, and the outbound sentences may include fixed-script sentences, variable-script sentences and real-time-script sentences. After the multi-turn scene is written, the fixed-script sentences that need to be pre-generated can be determined, while the variable-script sentences can only be determined by matching the variable information of a call task at the starting stage of that task.
The process of writing the multi-turn task scene may include the following steps:
Step S301, the user inputs the outbound sentences required by the outbound scene through the user interface provided by the outbound system, thereby configuring the multi-turn scene.
Step S302, the multi-turn dialogue information is stored in the multi-turn task engine as the outbound sentences corresponding to the outbound scene.
2. Fixed-script pre-generation
After the multi-turn scene configuration is completed, the user needs to configure the relevant parameters, including the TTS parameters, which may specifically include the voice type, speech rate, volume and so on used when generating speech. After the TTS parameters are determined, the TTS pre-generation service receives a fixed-script pre-generation request, pulls all outbound sentences of the outbound scene from the multi-turn task engine, obtains the fixed-script sentences according to the rules, calls the TTS service to synthesize voice files for the fixed-script sentences, and stores the voice files in the MinIO database.
The process of fixed-script pre-generation may include the following steps:
Step S303, the TTS pre-generation service acquires a fixed-script pre-generation request;
Step S304, the TTS pre-generation service acquires the fixed-script sentences from the multi-turn task engine;
Step S305, the TTS pre-generation service calls the TTS service to synthesize the text of each fixed-script sentence into a voice file;
Step S306, the voice files corresponding to the fixed-script sentences are stored in the MinIO database.
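The steps above can be sketched as a single pre-generation pass. Here `synthesize` and `db` are hypothetical stand-ins for the TTS service (S305) and the MinIO database (S306); splitting on commas and treating "no `$` marker" as fixed script are simplified assumptions for illustration:

```python
import re

def split_clauses(sentence):
    # Division identifier: punctuation marks (simplified to , and ;).
    return [c.strip() for c in re.split(r"[,;]", sentence) if c.strip()]

def is_fixed_script(clause):
    # A clause without a slot marker "$" is treated as fixed script.
    return "$" not in clause

def pregenerate_fixed(sentences, synthesize, db):
    # S304: walk the sentences pulled from the multi-turn task engine.
    for sentence in sentences:
        for clause in split_clauses(sentence):
            if is_fixed_script(clause):
                # S305 + S306: synthesize each fixed-script clause once
                # and store it, keyed here by its text.
                db.setdefault(clause, synthesize(clause))
    return db

store = pregenerate_fixed(
    ["Hello, Mr. $global{meta_name}, welcome to call"],
    synthesize=lambda text: b"PCM:" + text.encode("utf-8"),
    db={},
)
```

Because `setdefault` is used, a fixed-script clause shared by several sentences is synthesized only once, which matches the cost argument made later in the text.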
3. Variable-script pre-generation and splicing
The user starts the outbound task from the user interface of the outbound system, and after receiving the task start command, the call dispatch center service sends a variable-script pre-generation request to the TTS pre-generation service. The TTS pre-generation service obtains, according to the rules, the variable-script sentences containing variable-script slots, and fills the variable-script slots with the variable information of the outbound task to form the actual sentence content required by the call. The TTS service is then called to generate the voice files of the variable-script sentences. If a variable-script sentence is a clause divided according to punctuation marks, the TTS pre-generation service splices its voice file with the voice files of the other clauses, such as fixed-script clauses or other variable-script clauses, into the voice file of the longer sentence, and stores it in the MinIO database to be retrieved during the actual call.
The process of variable-script pre-generation and splicing may include the following steps:
Step S307, the user starts the outbound task through the user interface provided by the outbound system.
Step S308, the call dispatch center receives the task start command.
Step S309, the TTS pre-generation service acquires a variable-script pre-generation request from the call dispatch center.
Step S310, the TTS pre-generation service obtains the variable-script sentences from the multi-turn task engine.
Step S311, the TTS pre-generation service calls the TTS service, according to the variable information of the outbound task, to synthesize the text of the variable-script sentences into voice files.
Step S312, the voice files are spliced.
Step S313, the spliced voice files are stored in the MinIO database.
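Steps S310-S311 hinge on filling the variable-script slot with the task's variable information before synthesis. A sketch of the slot filling, assuming the `$global{...}` slot syntax used in the examples elsewhere in this text:

```python
import re

def fill_variable_slots(clause, variables):
    # Replace every $global{name} slot with this task's variable information.
    def lookup(match):
        return variables[match.group(1)]
    return re.sub(r"\$global\{(\w+)\}", lookup, clause)

# Variable information of the current outbound task (a return visit to client A).
filled = fill_variable_slots("Mr. $global{meta_name}", {"meta_name": "A"})
```

The filled text is then what the TTS service receives in step S311.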
4. Voice file retrieval and invocation
After a call is connected, the MRCP service is responsible for sending the required audio stream to the soft switch service. After receiving the text of the robot's outbound sentence, the MRCP service calculates the MD5 value of that text and searches the MinIO database with the MD5 value as the search condition, to determine whether a voice file corresponding to the text exists. If the voice file exists, it is retrieved directly and sent to the soft switch for output. If no voice file corresponding to the MD5 value exists, the current outbound sentence is a real-time-script sentence, and the MRCP service needs to call the TTS service in real time to generate the corresponding voice file, which is then sent to the soft switch for output.
The process of voice file retrieval and invocation may include the following steps:
Step S314, the MRCP service calculates the MD5 value of the outbound sentence and searches the MinIO database.
Step S315, if the corresponding voice file is found, it is retrieved and sent to the soft switch for output.
Step S316, if the corresponding voice file is not found, the TTS service is called in real time to generate the corresponding voice file, which is then sent to the soft switch for output.
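Steps S314-S316 amount to a lookup with a real-time fallback. In this sketch, `db` and `tts_realtime` are hypothetical stand-ins for the MinIO database and the real-time TTS call; sending the result to the soft switch is left to the caller:

```python
import hashlib

def fetch_or_synthesize(sentence_text, db, tts_realtime):
    # S314: search the database by the MD5 value of the sentence text.
    key = hashlib.md5(sentence_text.encode("utf-8")).hexdigest()
    audio = db.get(key)
    if audio is None:
        # S316: no pre-generated file means this is a real-time-script
        # sentence, so call the TTS service in real time.
        audio = tts_realtime(sentence_text)
    # S315 / S316: the caller sends the audio to the soft switch for output.
    return audio

db = {hashlib.md5("Hello".encode("utf-8")).hexdigest(): b"pregen"}
hit = fetch_or_synthesize("Hello", db, tts_realtime=lambda t: b"realtime")
miss = fetch_or_synthesize("in an hour", db, tts_realtime=lambda t: b"realtime")
```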
Therefore, with the voice output scheme provided by the embodiments of the present application, the time spent waiting for the robot to speak in the outbound system can be greatly shortened: since most outbound sentences in an outbound scene are generated in advance, the preparation time of a voice file can be compressed to within 1 ms. In contrast, calling TTS to generate a voice file in real time generally requires about 2 s of preparation time; the waiting time is proportional to the length of the outbound sentence, and for long sentences may even exceed 5 s, resulting in poor customer experience.
In addition, the usage of the TTS service can be greatly reduced. Since most fixed-script sentences of an outbound scene are generated in advance and only need to be generated once, they do not require TTS service calls during large-scale concurrent calls, and fewer TTS service calls save the computing cost of the outbound system.
Moreover, because most voice files are already stored in the database before the call task is executed, the speaking speed of the outbound system is stable, and the phenomenon of alternately slow and fast responses is unlikely to occur.
Based on the same inventive concept, an embodiment of the present application also provides an outbound voice output device; the method corresponding to the device is the outbound voice output method of the foregoing embodiments, and its principle of solving the problem is similar.
The outbound voice output device provided by the embodiments of the present application may include a script acquisition module, a voice pre-generation module, a voice splicing module and a voice output module. The script acquisition module is used to acquire the fixed-script clauses and the variable-script clauses in an outbound sentence, where a variable-script clause contains a variable-script slot. The voice pre-generation module is used to call a text-to-speech service to generate a first voice file of a fixed-script clause, determine the content of the variable-script slot in a variable-script clause according to the variable information of the outbound task, and call the text-to-speech service to generate a second voice file of the variable-script clause. The voice splicing module is used to splice the first voice file and/or the second voice file into a pre-generated voice file corresponding to the outbound sentence. The voice output module is used to retrieve and output the pre-generated voice file according to the target outbound sentence corresponding to the outbound task.
An outbound sentence may correspond to an outbound scene, where an outbound scene refers to a specific application scene of the outbound call; different outbound scenes, such as return visit, notification and collection, may be set according to the different purposes of the call. Taking the return-visit scene as an example, 10 sentences that may be used in multiple rounds of conversation with the user may be preset as the outbound sentences corresponding to the return-visit scene. During the actual voice conversation with the user, the required sentences can be selected from these 10 outbound sentences as needed to complete the conversation.
Each outbound sentence can be divided into at least one clause, and the script acquisition module can perform the division according to division identifiers. A division identifier may be specific content of the outbound sentence itself, or an additional identifier added into the sentence when the outbound sentence is configured. For example, punctuation marks in the outbound sentence, such as the period and the semicolon, may be used as division identifiers; alternatively, when the outbound sentence is configured, additional identifiers may be added at specific positions in the sentence, according to the division requirements, to serve as division identifiers. In addition, the script acquisition module may combine the two modes, using the specific content of the outbound sentence as a first division identifier and the added additional identifier as a second division identifier, and using both at the same time to divide the sentence into clauses.
The clauses in an outbound sentence may include at least two types: fixed-script clauses and variable-script clauses. A fixed-script clause is fixed sentence content that does not contain any variable part and does not change with the variable information of a call task. Taking the outbound sentence "Hello, Mr. XX, welcome to call" as an example, punctuation marks can be used as division identifiers, and the sentence can be divided into three clauses: "Hello", "Mr. XX" and "welcome to call". The clauses "Hello" and "welcome to call" are fixed-script clauses; even if two call tasks pay return visits to client A and client B respectively, the content of these two clauses does not change. For a fixed-script clause, the TTS service may be called directly after acquisition to generate the first voice file of the clause.
A variable-script clause is sentence content that includes a variable-script slot, where the specific content of the variable-script slot is related to the specific call task in the outbound scene and is determined by the variable information of the actual call task. For example, the "XX" in the clause "Mr. XX" above corresponds to a variable-script slot whose specific content is determined by the variable information of the call object in the call task: if the call task is a return visit to client A, the clause is "Mr. A", and if it is a return visit to client B, the clause is "Mr. B".
Because the content of the variable-script slot in a variable-script clause is related to the specific variable information of each call task, the TTS service cannot be called to generate the corresponding second voice file directly after the variable-script clause is acquired. In an actual scene, however, there is a certain time difference between starting a call task and executing it, and the variable information of the task is already determined at the starting moment. Therefore, before the call task of the outbound scene is executed, the content of the variable-script slot in the variable-script clause can be determined according to the variable information of the outbound task, and the TTS service can be called to generate the second voice file of the variable-script clause.
Taking the foregoing scene as an example, when a customer service member needs to pay a return visit to client A, the call task is started after searching for or inputting the client's information through the interactive interface of the outbound system. After the call dispatch center of the outbound system receives the command for starting the call task, it can determine that the call object is client A, and therefore triggers the other function modules of the outbound system, before the call is made, to call the TTS service to generate the second voice file of the variable-script clause "Mr. A", thereby pre-generating the voice file of the variable-script clause before the actual call.
In practice, a specific outbound sentence may include only one type of clause or several types. For example, the outbound sentence "Thank you, see you again" includes two fixed-script clauses, while the outbound sentence "Hello, Mr. XX, welcome to call" mentioned above includes both fixed-script clauses and variable-script clauses. Therefore, when splicing the pre-generated voice file, the voice splicing module selects the first voice file and/or the second voice file, according to the requirements of the actual scene, to splice into the file corresponding to the outbound sentence.
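A sketch of the splicing step, assuming the clause voice files are WAV files with an identical sample format (using Python's standard `wave` module; the `make_wav` helper that fabricates tiny audio blobs is purely illustrative):

```python
import io
import wave

def make_wav(frames):
    # Fabricate a tiny mono 16-bit 8 kHz WAV file for illustration.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(frames)
    return buf.getvalue()

def splice_voice_files(wav_blobs):
    # Concatenate the clause voice files, in order, into one longer file,
    # copying the audio parameters from the first clause.
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        first = True
        for blob in wav_blobs:
            with wave.open(io.BytesIO(blob), "rb") as src:
                if first:
                    dst.setparams(src.getparams())
                    first = False
                dst.writeframes(src.readframes(src.getnframes()))
    return out.getvalue()

vc_1 = make_wav(b"\x00\x00" * 10)   # e.g. a first voice file
vc_2 = make_wav(b"\x01\x00" * 5)    # e.g. a second voice file
vc_12 = splice_voice_files([vc_1, vc_2])
```

A real splicer would also need to handle mismatched sample rates or encodings, which this sketch assumes away.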
Executing an outbound task is the process of holding a voice conversation with the call object during the actual call, and at this time the pre-generated voice files corresponding to the outbound sentences need to be output to the call object. The outbound sentences obtained in advance may be all the sentences that might be used, while only part of them may actually be used in a given outbound task; for example, 4 of the 10 candidate outbound sentences may be selected to complete 4 rounds of conversation with the user. The target outbound sentences are the outbound sentences that the current outbound task needs to use; when a target outbound sentence needs to be output, its pre-generated voice file is retrieved and output directly.
In this way, the content of the variable-script slot in a variable-script clause can flexibly adapt to the variables of different outbound tasks, such as different call objects and addresses, so that the pre-generated voice files can cover most of the content of the outbound sentences in an actual outbound scene. The voice files of most of the content in the outbound sentences can therefore be pre-generated and retrieved for output directly when needed, which effectively increases the outbound speed, adapts flexibly to different application scenes, and avoids stutters in the voice broadcast.
In an actual scene, the variable content in a clause may be a real-time-script slot in addition to a variable-script slot. The content of a real-time-script slot cannot be predicted before the outbound task is executed; it must be acquired in real time during the voice conversation with the call object. For example, when the call object says: "Sorry, it is inconvenient now, please call me again in an hour", the outbound sentence that needs to be replied is: "OK, I will call you again in an hour". The real-time information about the redial time, namely "an hour", corresponds to a real-time-script slot. Since "an hour" cannot be predicted before the outbound task is executed, the voice file of a sentence containing this content cannot be pre-generated, and can only be generated and output in real time as needed.
Therefore, in some embodiments of the present application, when acquiring the fixed-script clauses and variable-script clauses in an outbound sentence, the script acquisition module may first divide the outbound sentence into several clauses according to the division identifiers, and then detect the variable-script slots and real-time-script slots in the clauses. If a clause contains neither a variable-script slot nor a real-time-script slot, it is determined to be a fixed-script clause; if a clause contains a variable-script slot but no real-time-script slot, it is determined to be a variable-script clause. In this way, the fixed-script clauses and variable-script clauses in the outbound sentence are obtained.
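This detection logic can be sketched as follows, assuming (from the examples in this text) that `$global{...}` marks a variable-script slot and `${...}` marks a real-time-script slot:

```python
import re

VARIABLE_SLOT = re.compile(r"\$global\{\w+\}")  # e.g. $global{meta_name}
REALTIME_SLOT = re.compile(r"\$\{\w+\}")        # e.g. ${meta_redial_time}

def classify_clause(clause):
    # A clause with a real-time-script slot is a real-time-script clause;
    # one with only a variable-script slot is a variable-script clause;
    # one with neither is a fixed-script clause.
    if REALTIME_SLOT.search(clause):
        return "real-time"
    if VARIABLE_SLOT.search(clause):
        return "variable"
    return "fixed"
```

Note that the real-time check must come first, since a clause may in principle contain both kinds of slot.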
In addition, if a clause contains a real-time-script slot, the script acquisition module determines it to be a real-time-script clause. For a real-time-script clause, the voice output module can determine the content of the real-time-script slot according to the real-time information acquired during the execution of the outbound task, call the text-to-speech service in real time to generate a third voice file of the clause, and output the third voice file.
For example, for the outbound sentence "Hello, Mr. $global{meta_name}, I will call you again in ${meta_redial_time}", $global{meta_name} is a variable-script slot and ${meta_redial_time} is a real-time-script slot. After dividing the sentence into three clauses by punctuation marks, it can be determined that "Hello" is a fixed-script clause, "Mr. $global{meta_name}" is a variable-script clause, and "I will call you again in ${meta_redial_time}" is a real-time-script clause.
For the fixed-script clause "Hello", the TTS service may be called directly to generate the corresponding first voice file vc-1. For the variable-script clause "Mr. $global{meta_name}", after the variable information is determined and before the call task is executed, the TTS service is called to generate the corresponding second voice file vc-2, and vc-1 and vc-2 are spliced as needed into the pre-generated voice file vc-12 of the outbound sentence. For the real-time-script clause "I will call you again in ${meta_redial_time}", the real-time information must be acquired from the conversation content of the call object, during the execution of the call task, to determine the specific content of the real-time-script slot ${meta_redial_time}; for example, if the redial time is determined to be one hour during the current conversation, the TTS service can be called in real time, after the slot content is determined, to generate the third voice file vc-3 of the clause, and vc-3 then only needs to be retrieved and output.
In an actual scene, it may be agreed in advance that an outbound sentence containing a real-time-script slot may retrieve and output the related voice files in several parts. For example, in this embodiment, the pre-generated voice file vc-12 of the first half of the sentence may be retrieved and played first, and the third voice file vc-3, generated after the real-time information is determined, may then be retrieved and played, thereby realizing fast voice output of the sentence "Hello, Mr. XX, I will call you again in an hour".
In other embodiments of the present application, when acquiring the fixed-script clauses and variable-script clauses in an outbound sentence, the script acquisition module may also first divide the outbound sentence, according to a first division identifier, into first clauses containing real-time-script slots and second clauses containing no real-time-script slot, and then divide the second clauses into several clauses according to a second division identifier. If such a clause contains no variable-script slot, it is determined to be a fixed-script clause; if it contains a variable-script slot, it is determined to be a variable-script clause.
The first division identifier may be a division identifier additionally added, when configuring the outbound sentences corresponding to the outbound scene, around the part of the sentence containing the real-time-script slot; for example, "[[ ]]" may be used as the first division identifier in an actual scene, while the punctuation marks in the sentence serve as the second division identifier. Thus, taking the aforementioned outbound sentence "Hello, Mr. $global{meta_name}, I will call you again in ${meta_redial_time}" as an example, it may be stored at configuration time as "Hello, Mr. $global{meta_name}, [[I will call you again in ${meta_redial_time}]]", and divided during processing into the two clauses "Hello, Mr. $global{meta_name}" and "I will call you again in ${meta_redial_time}".
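The two-stage division can be sketched as follows, assuming "[[ ]]" as the first division identifier and punctuation (simplified here to commas and semicolons) as the second; for brevity the sketch applies both divisions in one pass:

```python
import re

def two_stage_split(sentence):
    # First division identifier: "[[ ]]" wraps the first clauses, i.e. the
    # parts of the sentence containing a real-time-script slot.
    first_clauses = re.findall(r"\[\[(.*?)\]\]", sentence)
    remainder = re.sub(r"\[\[.*?\]\]", "", sentence)
    # Second division identifier: punctuation marks in the remainder.
    second_clauses = [c.strip() for c in re.split(r"[,;]", remainder) if c.strip()]
    return first_clauses, second_clauses

stored = ("Hello, Mr. $global{meta_name}, "
          "[[I will call you again in ${meta_redial_time}]]")
first, second = two_stage_split(stored)
```

The first clauses go to real-time synthesis during the call, while the second clauses feed the pre-generation path described above.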
In this process, the second clause "Hello, Mr. $global{meta_name}" is further divided into the two clauses "Hello" and "Mr. $global{meta_name}", which are a fixed-script clause and a variable-script clause respectively. The TTS service can then be called in advance, in the manner described above, to generate the corresponding voice files vc-1 and vc-2, which are spliced into the pre-generated voice file vc-12. For the first clause, the corresponding voice file cannot be generated in advance; therefore, when the outbound task is executed, the voice output module can determine the content of the real-time-script slot in the first clause according to the real-time information acquired during the execution of the task, call the text-to-speech service in real time to generate a fourth voice file vc-4 of the first clause, and output the fourth voice file. In an actual scene, the related voice files can be retrieved and output in several parts in the manner described above, thereby realizing fast voice output of the sentence "Hello, Mr. XX, I will call you again in an hour".
In addition, in some embodiments of the present application, after the pre-generated voice file corresponding to the outbound sentence is obtained by splicing the first voice file and/or the second voice file, the voice splicing module may further determine the tag information of the pre-generated voice file according to the outbound sentence, add the tag information to the pre-generated voice file, and store it in the database. The tag information is any information that can be used to identify and search for the pre-generated voice file, for example, the digest information obtained by hashing the outbound sentence with the Message-Digest Algorithm 5 (MD5). When the pre-generated voice file needs to be stored in the database, the MD5 algorithm may be used to hash the text of the outbound sentence to obtain an MD5 value, and the pre-generated voice file is then named with the MD5 value and stored in the database.
In addition, in other embodiments of the present application, the tag information may also be combined with the generation parameter information of the pre-generated voice file. The generation parameter information may be the TTS parameters used when the TTS service is called to generate the voice file, such as the voice type, speech rate and volume. The TTS parameters may be provided by the user, according to the requirements of the outbound scene, before the TTS service is called; in an actual scene, the user may select the TTS parameters in the interactive interface, so that the TTS service synthesizes speech with these parameters and generates the required voice file. Accordingly, when the pre-generated voice file needs to be stored in the database, the MD5 algorithm may be used to hash the text of the outbound sentence to obtain an MD5 value, the TTS parameters used by the TTS service when generating the first voice file and/or the second voice file corresponding to the pre-generated voice file are then obtained, and the pre-generated voice file is named with "MD5 value + TTS parameters" before being stored in the database.
Correspondingly, when a pre-generated voice file needs to be output (for example, when an outbound task is executed), the voice output module may search the database for the corresponding pre-generated voice file using target tag information as the search condition, where the target tag information is the tag information determined from the target outbound sentence of the current outbound task; if a pre-generated voice file meeting the search condition is found, it is output. If no pre-generated voice file meeting the search condition is found, the target outbound sentence corresponding to the search condition contains a real-time-script slot and could not be generated in advance, so the TTS service may be called in real time to generate a fifth voice file of the target outbound sentence, and the fifth voice file is output.
The present application also provides another outbound voice output device, which includes a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to execute the outbound voice output method described above.
In an actual scene, the outbound voice output device may be user equipment, network equipment, or a device formed by integrating user equipment and network equipment through a network, or it may be an application running on such a device. The user equipment includes, but is not limited to, computers, mobile phones, tablet computers and other terminal devices; the network equipment includes, but is not limited to, a network host, a single network server, a cluster of network servers, or a cloud-computing-based collection of computers. Here, the cloud is made up of a large number of hosts or network servers based on cloud computing, a type of distributed computing in which one virtual computer consists of a collection of loosely coupled computers.
In particular, the methods and/or embodiments in the embodiments of the present application may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. The computer program, when executed by a processing unit, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus or device described in the foregoing embodiments; or may be present alone without being assembled into the device or apparatus. The computer-readable medium carries one or more computer-readable instructions executable by a processor to implement the methods and/or aspects of the embodiments of the present application as described above.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In some embodiments, the software programs of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, and not to denote any particular order.

Claims (10)

1. An outbound voice output method, the method comprising:
acquiring a fixed script clause and a variable script clause in an outbound sentence, and calling a text-to-speech service to generate a first voice file for the fixed script clause, wherein the variable script clause contains a variable script slot;
determining the content of the variable script slot in the variable script clause according to variable information of an outbound task, and calling the text-to-speech service to generate a second voice file for the variable script clause;
splicing the first voice file and/or the second voice file to obtain a pre-generated voice file corresponding to the outbound sentence; and
calling and outputting the pre-generated voice file according to a target outbound sentence corresponding to the outbound task.
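The pipeline of claim 1 can be illustrated with a minimal sketch. The slot syntax (`{name}`), the `fake_tts` stub, and the example sentence are hypothetical stand-ins, not part of the patent; a real system would call an actual text-to-speech service and splice audio with a media library.

```python
# Illustrative sketch of the claimed pipeline (claim 1). All names here
# (fake_tts, SENTENCE, the {slot} syntax) are hypothetical examples.
import re

def fake_tts(text: str) -> bytes:
    """Stand-in for the text-to-speech service: returns fake audio bytes."""
    return f"<audio:{text}>".encode("utf-8")

# An outbound sentence: fixed script clauses plus one variable script
# slot {due_date}, whose value is known when the outbound task is created.
SENTENCE = "Hello, your bill is due on {due_date}, please pay promptly."

def pre_generate(sentence: str, variables: dict) -> bytes:
    """Pre-generate the spliced voice file for one outbound sentence."""
    # Split while keeping the slot tokens, so fixed and variable parts alternate.
    parts = re.split(r"(\{[a-z_]+\})", sentence)
    audio = b""
    for part in parts:
        if not part:
            continue
        m = re.fullmatch(r"\{([a-z_]+)\}", part)
        if m:                                           # variable script clause:
            audio += fake_tts(variables[m.group(1)])    # "second voice file"
        else:                                           # fixed script clause:
            audio += fake_tts(part)                     # "first voice file"
    return audio                                        # spliced pre-generated file

voice = pre_generate(SENTENCE, {"due_date": "March 1"})
```

Because the fixed clauses and the task's variable values are known before the call is placed, the whole file can be synthesized and spliced ahead of time, and merely played back when the target outbound sentence is reached.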
2. The method of claim 1, wherein acquiring the fixed script clause and the variable script clause in the outbound sentence comprises:
dividing the outbound sentence into a plurality of clauses according to a division identifier; and
if a clause contains neither a variable script slot nor a real-time script slot, determining the clause to be a fixed script clause, and if a clause contains a variable script slot but no real-time script slot, determining the clause to be a variable script clause.
3. The method of claim 2, further comprising:
if a clause contains a real-time script slot, determining the clause to be a real-time script clause; and
determining the content of the real-time script slot in the real-time script clause according to real-time information acquired during execution of the outbound task, calling the text-to-speech service in real time to generate a third voice file for the real-time script clause, and outputting the third voice file.
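The three-way classification of claims 2-3 can be sketched as follows. The slot markers (`{var:...}` for variable slots, `{rt:...}` for real-time slots) and the comma division identifier are assumed syntax for illustration only; the patent does not specify a concrete notation.

```python
# Sketch of clause classification per claims 2-3. The {var:...}/{rt:...}
# slot syntax and the comma division identifier are hypothetical.
import re

VARIABLE_SLOT = re.compile(r"\{var:[a-z_]+\}")   # known at task creation
REALTIME_SLOT = re.compile(r"\{rt:[a-z_]+\}")    # known only during the call

def split_sentence(sentence: str, division_id: str = ",") -> list:
    """Divide an outbound sentence into clauses at the division identifier."""
    return [c for c in sentence.split(division_id) if c]

def classify_clause(clause: str) -> str:
    """Classify one clause as fixed, variable, or real-time."""
    if REALTIME_SLOT.search(clause):
        return "realtime"   # synthesized live during the call (third voice file)
    if VARIABLE_SLOT.search(clause):
        return "variable"   # pre-generated from outbound-task variables
    return "fixed"          # pre-generated once and reused
```

Only real-time clauses pay the latency cost of on-the-fly synthesis; everything else is prepared before the call.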
4. The method of claim 1, wherein acquiring the fixed script clause and the variable script clause in the outbound sentence comprises:
dividing the outbound sentence according to a first division identifier to obtain a first clause containing a real-time script slot and a second clause containing no real-time script slot;
dividing the second clause into a plurality of clauses according to a second division identifier; and
if a clause contains no variable script slot, determining the clause to be a fixed script clause, and if a clause contains a variable script slot, determining the clause to be a variable script clause.
5. The method of claim 4, further comprising:
determining the content of the real-time script slot in the first clause according to real-time information acquired during execution of the outbound task, calling the text-to-speech service in real time to generate a fourth voice file for the first clause, and outputting the fourth voice file.
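The two-stage division of claims 4-5 can be sketched like this. The `|` first division identifier, the `,` second division identifier, and the slot syntax are all assumed for illustration.

```python
# Sketch of the two-stage split per claim 4. The "|" and "," division
# identifiers and the {rt:...}/{var:...} slot syntax are hypothetical.
import re

REALTIME_SLOT = re.compile(r"\{rt:[a-z_]+\}")
VARIABLE_SLOT = re.compile(r"\{var:[a-z_]+\}")

def two_stage_split(sentence: str, first_id: str = "|", second_id: str = ","):
    """First isolate real-time clauses, then subdivide the remainder
    into fixed and variable script clauses."""
    first_clauses, second_clauses = [], []
    for seg in sentence.split(first_id):
        if REALTIME_SLOT.search(seg):
            first_clauses.append(seg)       # real-time: TTS at call time
        else:
            second_clauses.extend(c for c in seg.split(second_id) if c)
    fixed = [c for c in second_clauses if not VARIABLE_SLOT.search(c)]
    variable = [c for c in second_clauses if VARIABLE_SLOT.search(c)]
    return first_clauses, fixed, variable
```

The coarse first split keeps real-time material together, while the finer second split maximizes the number of fixed clauses that can share one cached recording.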
6. The method of claim 1, further comprising, after splicing the first voice file and/or the second voice file to obtain the pre-generated voice file corresponding to the outbound sentence:
determining mark information for the pre-generated voice file according to the outbound sentence; and
adding the mark information to the pre-generated voice file and storing the pre-generated voice file in a database;
wherein calling and outputting the pre-generated voice file according to the target outbound sentence corresponding to the outbound task comprises:
searching the database for a pre-generated voice file using target mark information as a search condition, wherein the target mark information is mark information determined according to the target outbound sentence corresponding to the current outbound task; and
if a pre-generated voice file satisfying the search condition is found, outputting the pre-generated voice file.
7. The method of claim 6, further comprising:
if no pre-generated voice file satisfying the search condition is found, calling the text-to-speech service in real time to generate a fifth voice file for the target outbound sentence, and outputting the fifth voice file.
8. The method of claim 6 or 7, wherein the mark information comprises summary information of the outbound sentence, or a combination of the summary information and generation parameter information.
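The caching scheme of claims 6-8 can be sketched as follows. The choice of SHA-256 for the "summary information", the in-memory `cache` dict standing in for the database, and the parameter encoding are all assumptions for illustration; the patent only requires that the mark information identify the sentence (and optionally its generation parameters).

```python
# Sketch of mark-information lookup with real-time fallback (claims 6-8).
# SHA-256 as the summary and the dict "database" are hypothetical choices.
import hashlib

def mark_info(sentence: str, params: dict) -> str:
    """Summary (digest) of the outbound sentence combined with TTS
    generation parameters, used as the database search key."""
    digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
    param_part = "|".join(f"{k}={params[k]}" for k in sorted(params))
    return f"{digest}:{param_part}"

cache = {}  # stand-in for the database of pre-generated voice files

def output_voice(sentence: str, params: dict, tts) -> bytes:
    """Look up a pre-generated file by mark information; on a miss,
    fall back to real-time TTS (claim 7) and store the result."""
    key = mark_info(sentence, params)
    if key not in cache:
        cache[key] = tts(sentence)  # "fifth voice file", generated live
    return cache[key]
```

Including generation parameters (voice, speed, and so on) in the key ensures that the same text synthesized under different settings is cached as distinct files.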
9. An outbound voice output apparatus, comprising:
a script acquisition module, configured to acquire a fixed script clause and a variable script clause in an outbound sentence, wherein the variable script clause contains a variable script slot;
a voice pre-generation module, configured to call a text-to-speech service to generate a first voice file for the fixed script clause, determine the content of the variable script slot in the variable script clause according to variable information of an outbound task, and call the text-to-speech service to generate a second voice file for the variable script clause;
a voice splicing module, configured to splice the first voice file and/or the second voice file to obtain a pre-generated voice file corresponding to the outbound sentence; and
a voice output module, configured to call and output the pre-generated voice file according to a target outbound sentence corresponding to the outbound task.
10. An outbound voice output device, comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the device to perform the method of any one of claims 1 to 8.
CN202011608326.4A 2020-12-29 2020-12-29 Outbound voice output method, device and equipment Pending CN112735372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011608326.4A CN112735372A (en) 2020-12-29 2020-12-29 Outbound voice output method, device and equipment


Publications (1)

Publication Number Publication Date
CN112735372A true CN112735372A (en) 2021-04-30

Family

ID=75610922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011608326.4A Pending CN112735372A (en) 2020-12-29 2020-12-29 Outbound voice output method, device and equipment

Country Status (1)

Country Link
CN (1) CN112735372A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534088A (en) * 2019-09-25 2019-12-03 招商局金融科技有限公司 Phoneme synthesizing method, electronic device and storage medium
CN110600001A (en) * 2019-09-09 2019-12-20 大唐网络有限公司 Voice generation method and device
KR20200050104A (en) * 2018-11-01 2020-05-11 주식회사 케이티 Method for providing Text-To-Speech service and relay server for the same
CN111599341A (en) * 2020-05-13 2020-08-28 北京百度网讯科技有限公司 Method and apparatus for generating speech
CN111653262A (en) * 2020-08-06 2020-09-11 上海荣数信息技术有限公司 Intelligent voice interaction system and method
CN111949784A (en) * 2020-08-14 2020-11-17 中国工商银行股份有限公司 Outbound method and device based on intention recognition


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421549A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium
CN113744712A (en) * 2021-07-29 2021-12-03 中国工商银行股份有限公司 Intelligent outbound voice splicing method, device, equipment, medium and program product
CN115329206A (en) * 2022-10-13 2022-11-11 深圳市人马互动科技有限公司 Voice outbound processing method and related device
CN115329206B (en) * 2022-10-13 2022-12-20 深圳市人马互动科技有限公司 Voice outbound processing method and related device

Similar Documents

Publication Publication Date Title
US9240187B2 (en) Identification of utterance subjects
CN112735372A (en) Outbound voice output method, device and equipment
US8862478B2 (en) Speech translation system, first terminal apparatus, speech recognition server, translation server, and speech synthesis server
JP2021505032A (en) Automatic blocking of sensitive data contained in audio streams
US9190048B2 (en) Speech dialogue system, terminal apparatus, and data center apparatus
US9595255B2 (en) Single interface for local and remote speech synthesis
US20080162125A1 (en) Method and apparatus for language independent voice indexing and searching
JP2015127758A (en) Response control device and control program
US10395658B2 (en) Pre-processing partial inputs for accelerating automatic dialog response
US20210134296A1 (en) Project issue tracking via automated voice recognition
JP2022554149A (en) Text information processing method and apparatus
CN110808028B (en) Embedded voice synthesis method and device, controller and medium
EP4352630A1 (en) Reducing biases of generative language models
WO2021047209A1 (en) Optimization for a call that waits in queue
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN109065016B (en) Speech synthesis method, speech synthesis device, electronic equipment and non-transient computer storage medium
KR102376552B1 (en) Voice synthetic apparatus and voice synthetic method
US20230317057A1 (en) Assigning ssml tags to an audio corpus
CN117524191A (en) Method, apparatus, device and computer readable medium for speech synthesis
Moyal et al. Keyword spotting methods
CN113299271A (en) Voice synthesis method, voice interaction method, device and equipment
US20210158798A1 (en) Intermediary virtual assistant for improved task fulfillment
CN115357756A (en) Video retrieval method, device, equipment and storage medium
CN117496941A (en) Voice data processing method, device and system
CN112837680A (en) Audio keyword retrieval method, intelligent outbound method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination