WO2021051506A1

WO2021051506A1 - Voice interaction method and apparatus, computer device and storage medium

Info

Publication number: WO2021051506A1
Application number: PCT/CN2019/116512
Authority: WO
Inventors: 周定军; 王健宗; 彭俊清
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-18
Filing date: 2019-11-08
Publication date: 2021-03-25
Also published as: CN110661927A; CN110661927B

Abstract

A voice interaction method, the method comprising: acquiring an audio signal of a customer channel when playing back a dialogue voice, and determining whether a specified parameter of the audio signal is greater than a first preset threshold (S10); if the specified parameter of the audio signal is greater than the first preset threshold, then suspending playback of the dialogue voice (S20); parsing the audio signal and acquiring a parsing result of the audio signal, and determining a response sentence according to the parsing result (S30); and when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to a customer corresponding to the customer channel (S40). The described method may improve the adaptability of intelligent voice, enhance interaction with customers, and improve the fluency of communication with customers.

Description

Voice interaction method, device, computer equipment and storage medium

This application is based on the Chinese invention application with the application number 201910883213.6 filed on September 18, 2019, entitled "Voice interaction method, device, computer equipment and storage medium", and claims its priority.

Technical field

This application relates to the field of natural language processing, and in particular to a voice interaction method, device, computer equipment and storage medium.

Background technique

At present, the system architecture of an intelligent voice outbound platform is generally based on a telephone exchange platform and a variety of voice processing engines, such as a speech recognition engine (ASR), a semantic understanding engine (NLP), a speech synthesis engine (TTS), etc. The basic processing flow of this intelligent voice outbound platform includes: recognizing the customer’s voice into text information through the voice recognition engine, and then further analyzing the text information through the semantic understanding engine to obtain the analysis result, and selecting the response sentence based on the analysis result. Finally, the response sentence is synthesized into the response voice through the speech synthesis engine, and the response voice is transmitted to the customer.

However, this kind of interaction is very mechanical and tedious, making the smart voice less adaptable, unable to respond flexibly to customer feedback in time, reducing the interaction with customers, and affecting the fluency of smart voice communication with customers.

Summary of the invention

Based on this, it is necessary to provide a voice interaction method, device, computer equipment, and storage medium for the above technical problems to improve the adaptability of intelligent voice, enhance the interaction with customers, and improve the fluency of communication with customers.

A voice interaction method, including:

When playing the dialogue voice, obtain the audio signal of the client channel, and determine whether the specified parameter of the audio signal is greater than the first preset threshold;

If the designated parameter of the audio signal is greater than the first preset threshold, stop playing the dialogue voice;

Parse the audio signal and obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the response voice is sent to the client corresponding to the client channel.

A voice interaction device includes:

The audio judgment module is used to obtain the audio signal of the client channel when the dialogue voice is played, and judge whether the specified parameter of the audio signal is greater than the first preset threshold;

The suspension playing module is configured to stop playing the dialogue voice if the designated parameter of the audio signal is greater than a first preset threshold;

The determining response sentence module is used to analyze the audio signal and obtain the analysis result of the audio signal, and determine the response sentence according to the analysis result;

The sending response voice module is used to generate a response voice according to the response sentence when the designated parameter of the audio signal of the client channel is less than a second preset threshold, and send the response voice to the corresponding client channel client.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor. The processor implements the following steps when the processor executes the computer-readable instructions: During the conversation voice, obtain the audio signal of the client channel, and determine whether the specified parameter of the audio signal is greater than the first preset threshold;

One or more readable storage media storing computer readable instructions, when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps: when the dialogue voice is played , Obtain the audio signal of the client channel, and determine whether the specified parameter of the audio signal is greater than the first preset threshold;

The details of one or more embodiments of the present application are set forth in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings, and claims.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings that need to be used in the description of the embodiments of the present application. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative labor.

FIG. 1 is a schematic diagram of an application environment of a voice interaction method in an embodiment of the present application;

2 is a schematic flowchart of a voice interaction method in an embodiment of the present application;

FIG. 3 is a schematic flowchart of a voice interaction method in an embodiment of the present application;

FIG. 4 is a schematic flowchart of a voice interaction method in an embodiment of the present application;

FIG. 5 is a schematic flowchart of a voice interaction method in an embodiment of the present application;

FIG. 6 is a schematic flowchart of a voice interaction method in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a voice interaction device in an embodiment of the present application;

Fig. 8 is a schematic diagram of a computer device in an embodiment of the present application.

detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The voice interaction method provided in this embodiment can be applied in the application environment as shown in FIG. 1, where the terminal device communicates with the server through the network. Among them, terminal devices include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented with an independent server or a server cluster composed of multiple servers.

In an embodiment, as shown in FIG. 2, a voice interaction method is provided. The method is applied to the server in FIG. 1 as an example for description, including the following steps:

S10. Obtain the audio signal of the client channel when the dialogue voice is played, and determine whether the designated parameter of the audio signal is greater than a first preset threshold;

S20: If the designated parameter of the audio signal is greater than the first preset threshold, stop playing the dialogue voice;

S30. Analyze the audio signal and obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;

S40: When the designated parameter of the audio signal of the client channel is less than a second preset threshold, generate a response voice according to the response sentence, and send the response voice to the client corresponding to the client channel.

In this embodiment, the voice interaction method can be applied to an intelligent outbound call platform, and can also be applied to an intelligent response platform, or other intelligent interactive platforms. The server can be set with multiple processing processes for processing audio signals transmitted through the client channel. In some cases, the client may refer to the client carried by the client, and the server establishes a communication connection with the client (in some cases, it may be a call connection) to realize intelligent interaction with the client. In this case, the voice interaction method provided in this embodiment can be applied to scenarios such as customer return visits and questionnaire surveys. In other cases, the client may refer to an application terminal with a voice recording device, such as a terminal for self-management of business.

In an example, the voice interaction method can also be applied to one-to-many interaction scenarios. For example, the server simultaneously establishes a call connection with multiple clients. At this time, the server can be based on the telephone soft switch platform (FreeSwitch) and use shared memory technology to realize the storage of audio data of a specific client channel. Here, the shared memory can realize the input and output voice of the same voice channel, sharing the same memory buffer; when input or output voice operations, the memory buffer is locked to ensure the exclusivity of the operation; after the operation is completed , Release the lock, and use the memory cache again for subsequent operations. In the specific implementation process, shared memory can be organically combined with message queues, state machines, multi-thread synchronization and other technologies to achieve multi-channel speech recognition and speech synthesis.

Specifically, the dialogue voice may be generated based on the client's last speech data, or may be generated based on a preset response text. In particular, playing the dialogue voice can send the synthesized dialogue voice to the client. In some cases, such as when an adapted application is installed on the client, the dialogue voice can be played to send the corresponding dialogue text and speech parameters to the client, and then the client synthesizes the dialogue speech according to the aforementioned dialogue text and speech parameters.

The server is also provided with a special process for monitoring whether the designated parameter of the audio signal of the client channel is greater than the first preset threshold. Here, the specified parameter may refer to the volume of the audio signal, and the first preset threshold may refer to the volume threshold. In some cases, the designated parameters may also be other audio parameters. The value of the first preset threshold can be set according to actual needs, for example, it can be set to 15-25 decibels. In other cases, the first preset threshold may be determined based on the signal-to-noise ratio of the client channel. Here, the signal in the signal-to-noise ratio of the client channel refers to the audio signal with the highest volume in the specified time period, and the noise refers to the average value of the background noise in the specified time period (can be based on the preset algorithm Determine that the audio signal within the specified time period belongs to the background noise part).

When the audio signal of the client channel is greater than the first preset threshold, it indicates that the dialogue voice played by the current server is interrupted (may be caused by the client's voice, or may be caused by the environment where the client is located, such as large noise). At this time, the server stops playing the above-mentioned dialogue voice. If the server transmits audio data to the client in real-time, the way to stop playing the dialogue voice is to stop transmitting audio data to the client; if the server transmits the audio data to the client in the form of dialogue text and voice parameters, and the client synthesizes If the dialogue voice is output, the way to stop playing the dialogue voice is to send a stop playing instruction to the client to make the client stop playing the dialogue voice.

After the dialogue voice is suspended, the corresponding response strategy needs to be determined according to the analysis result of the audio signal of the client channel. The audio signal of the client channel corresponding to the analysis result may include the audio signal when it is determined whether the specified parameter is greater than the first preset threshold and the audio signal for a certain period of time later. The longest end point may refer to the audio signal of the client channel determined A moment less than the second preset threshold. There may be a variety of different analysis results. For example, the audio signal is initially analyzed to determine whether it contains human voice. If the audio signal contains human voice, the audio signal needs to be further analyzed, and the analyzed content includes, but is not limited to, text data and tone information. You can also perform semantic analysis on the text data parsed in the previous step to determine the customer's intentions. Each analysis result can correspond to a specific response sentence.

For example, the final analysis result is "The wrong number was dialed", and the corresponding response sentence could be "Oh sorry, the call is wrong, then I will register here to avoid disturbing you in the future". The final analysis result is "the customer does not need the services currently provided", and the corresponding response sentence can be "then do not disturb you first, please hang up first, wish you happiness and safety, goodbye". The final analysis result is "the customer's intention is unclear", and the corresponding response sentence can be "Excuse me, I didn't catch it very well just now, can you repeat the question just now". The final analysis result is "The customer suspects that the customer service is a robot", and the corresponding response sentence can be "Yeah~~You are really good, you have heard it all, I am an intelligent customer service, I am honored to serve you". The final analysis result is "the customer's environment is very noisy", and the corresponding response sentence can be "the environment on your side is noisy, I don't know if you can hear what you just said".

After confirming the response sentence, you need to choose an appropriate time to send out the corresponding response voice. You can choose to generate and issue the response voice when the audio signal is less than the second preset threshold. The second preset threshold can be adjusted according to different analysis results. For example, if the analysis result determines that the audio signal is not a human voice, the second preset threshold may be 55-75 decibels; if the analysis result determines that the audio signal is a human voice, the second preset threshold may be the same as the first preset threshold. After it is determined that the response voice can be issued, the response voice can be generated according to the response sentence, and the response voice can be sent to the customer so that the customer can hear the response voice.

According to survey data, after using the voice interaction method provided by the embodiments of this application, customer satisfaction has increased from the original 50% to 80%, and the business compliance rate has also increased from the original 40% to 70%. The reason is that because the embodiments of the application have good adaptability (monitoring the audio signal of the customer channel), they can respond flexibly to customer feedback in time, improve the interaction with customers, and improve the smoothness of intelligent voice communication with customers. The degree of customer satisfaction and business compliance rate has also been greatly improved.

In steps S10-S40, when the dialogue voice is played, the audio signal of the client channel is obtained, and it is judged whether the specified parameter of the audio signal is greater than the first preset threshold, so as to monitor whether the client channel has the interrupted voice of the client or is larger. Environmental noise. If the designated parameter of the audio signal is greater than the first preset threshold, the playback of the dialogue voice is suspended to pause the voice output and prevent interference with the customer's speech. The audio signal is analyzed and the analysis result of the audio signal is obtained, and the response sentence is determined according to the analysis result, so as to generate corresponding feedback information (that is, the response sentence) in combination with the actual situation. When the designated parameter of the audio signal of the client channel is less than the second preset threshold, a response voice is generated according to the response sentence, and the response voice is sent to the client corresponding to the client channel, so that at an appropriate time Interact with customers with appropriate response voice.

Optionally, as shown in FIG. 3, before step S10, the method further includes:

S101. Obtain customer information;

S102: Establish a call connection with the customer according to the customer information;

S103: Determine initial voice parameters and initial dialogue text according to the customer information and preset interactive tasks;

S104: Generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text.

S105. Send the initial dialogue voice to the client.

In this embodiment, the customer information includes, but is not limited to, the customer's name, age, occupation, contact information, and historical communication records. Here, the contact method can refer to a mobile phone number or a landline. You can establish a call connection with the customer by calling the customer's mobile phone number or fixed-line phone.

The preset interactive tasks can refer to the purpose of this exchange, such as user return visits, user surveys, business recommendations, and so on. The initial speech parameters may include pronunciation gender, speaking speed, intonation, volume and so on. The initial dialogue text can be the first sentence or multiple sentences of dialogue text after the server establishes a call connection with the client. For example, if the last name of the customer is "Li" obtained from the customer information, the following initial dialogue text is adopted when calling the customer-"Hello, is this Mr. Li". After the customer confirms his identity, he can take the following initial dialogue text-"Hello, Mr. Li, I have a questionnaire survey, which will take you about 3 minutes. Is it convenient for you now".

After the initial speech parameters and the initial dialogue text are determined, the corresponding initial dialogue speech can be synthesized by the speech synthesis engine. Here, a speech synthesis engine with a higher degree of immersion can be selected to generate an initial dialogue voice that is closer to the voice of a real person.

After the initial dialogue voice is generated, the initial dialogue voice can be sent to the client carried by the client through the call connection, and the client receives the initial dialogue voice through the client.

In steps S101-S102, customer information is obtained to obtain the customer's contact information. Establish a call connection with the customer according to the customer data to establish a call with the customer. The initial voice parameters and initial dialogue text are determined according to the customer profile and preset interactive tasks, and data is prepared for generating the initial dialogue voice. According to the initial speech parameters and the initial conversation text, an initial conversation speech is generated to convert the text data into audio data. The initial dialogue voice is sent to the client so that the client can receive the initial dialogue voice.

Optionally, as shown in FIG. 4, step S30 includes:

S301. Analyze the audio signal of the client channel and obtain an analysis result of the audio signal, where the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;

S302: If the obtained analysis result is that the audio signal does not contain human voice, select the connection sentence and the first voice adjustment parameter corresponding to the analysis result that does not contain human voice;

S303. Generate the response sentence according to the connection sentence and the dialogue voice, and associate the response sentence with the first voice adjustment parameter.

In this embodiment, the server can set a human voice recognition program to determine whether the audio signal contains human voice. There are two judgment results of the human voice recognition program, including human voice and non-human voice. A number of different connection sentences can be preset, which are associated with different judgment results. For example, when it is judged that the audio signal does not contain human voice, and the environment of the customer is determined to be relatively noisy, the connection sentence can be "Mr. X, your side is a bit noisy, do I need to increase the volume and repeat it again". The first voice adjustment parameter may be generated based on the judgment result to change the volume of the response voice. Here, conversational speech refers to conversational speech interrupted by noise. Part or all of the content can be selected from the conversation speech interrupted by noise, and the connection sentence can be used to generate the response sentence. The generated response sentence is associated with the adjusted first voice adjustment parameter, and the two can synthesize a corresponding response voice.

In steps S301-S303, the audio signal of the client channel is analyzed and an analysis result of the audio signal is obtained, where the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice to distinguish Different coping scenarios. If the obtained analysis result is that the audio signal does not contain human voice, then select the connection sentence and the first voice adjustment parameter corresponding to the analysis result that does not contain human voice, so as to do when the analysis result is environmental noise Take the corresponding response steps. The response sentence is generated according to the connection sentence and the dialogue voice, and the response sentence is associated with the first voice adjustment parameter to generate a response sentence suitable for environmental noise.

Optionally, as shown in FIG. 5, after step S301, the method further includes:

S304. If the obtained analysis result is that the audio signal contains human voice, convert the audio signal of the customer channel into text data through a voice recognition engine, and recognize the audio of the customer channel through a preset tone recognition model The tone type of the signal;

S305: Recognizing the semantic information of the text data through a semantic understanding engine;

S306. Select the response sentence matching the semantic information from a preset response sentence database, and obtain a second voice adjustment parameter matching the tone type, and the second voice adjustment parameter is associated with the response sentence .

In this embodiment, if the audio signal of the client channel includes human voice, the human voice in the audio signal needs to be further identified to learn the needs of the client. The specific recognition steps include: first converting the audio signal into text data through the speech recognition engine, and then recognizing the semantic information of the text data through the semantic understanding engine. When the audio signal is converted into text data, the tone type of the audio signal can be recognized at the same time. A preset tone recognition model can be used to recognize the tone type of the audio signal. In a simplified tone recognition model, the recognized tone types include two types, one is positive and the other is negative. In the advanced tone recognition model, more than two tone types can be identified. After the tone type of the audio signal is recognized, the second voice adjustment parameter matching the tone type can be selected to adjust the voice parameter of the response voice.

The preset response sentence database is pre-stored with multiple response sentences, which are matched with specific semantic information. After recognizing the semantic information in the audio information, the response sentence with the highest matching degree can be found in the preset response sentence database. At the same time, the second voice adjustment parameter is associated with the response sentence.

In steps S304-S306, if the obtained analysis result is that the audio signal contains human voice, the audio signal of the customer channel is converted into text data by a voice recognition engine, and the voice recognition model is used to recognize the The tone type of the audio signal of the customer channel to identify the content and tone of the current customer's sentence. The semantic information of the text data is recognized by the semantic understanding engine to further determine the customer's needs. The response sentence matching the semantic information is selected from a preset response sentence database, and a second voice adjustment parameter matching the tone type is obtained, and the second voice adjustment parameter is associated with the response sentence to Select the appropriate response sentence to respond to the customer's words.

Optionally, as shown in FIG. 6, step S40 includes:

S401: Identify the background noise type of the audio signal of the client channel;

S402. Obtain the second preset threshold that matches the background noise type.

S403. When the designated parameter of the audio signal of the client channel is less than a second preset threshold, generate the response voice according to the response sentence and the first voice adjustment parameter, and send the response voice to Describe the customer corresponding to the customer channel.

In this embodiment, multiple background noise types can be preset, the similarity between the current audio signal and the feature values of each background noise type is calculated, and the background noise type with the highest similarity is selected as the background noise type of the audio signal. The preset background noise type can be a road scene, a commercial street scene, a supermarket scene, etc. Each background noise type matches a second preset threshold. For example, the second preset threshold for road scene matching may be 80 decibels, and the second preset threshold for commercial street scene matching may be 70 decibels.

If the audio signal is greater than the second preset threshold, it means that the background noise is very large. At this time, even if the dialogue voice is played, it is difficult for the customer to hear the content. Therefore, it is necessary to wait for the audio signal to be below the second preset threshold before the response voice Out. When judging whether the audio signal is less than the second preset threshold, a segment of audio signal can be buffered at a preset buffering time interval, and if the highest volume of the audio signal in the buffering time interval is less than the second preset threshold, the audio signal is judged The signal is less than the second preset threshold; if the highest volume of the audio signal in the buffer time interval is greater than or equal to the second preset threshold, it is determined that the audio signal is greater than or equal to the second preset threshold. The buffering time interval can be 0.3 to 0.5 seconds, and it can vary with the type of background noise.

In steps S401-S403, the background noise type of the audio signal of the client channel is identified to determine the type of scene the client is currently in. Acquire the second preset threshold that matches the type of background noise to select an appropriate response threshold (ie, the second preset threshold). When the designated parameter of the audio signal of the client channel is less than a second preset threshold, the response voice is generated according to the response sentence and the first voice adjustment parameter, and the response voice is sent to the client The customer corresponding to the channel can interact with the customer at a better time.

The voice interaction method provided by the embodiments of the present application can improve the adaptability of intelligent voice, enhance the interaction with customers, and improve the fluency of communication with customers.

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.

In one embodiment, a voice interaction device is provided, and the voice interaction device corresponds to the voice interaction method in the foregoing embodiment one-to-one. As shown in FIG. 7, the voice interaction device includes an audio judgment module 10, a playback suspension module 20, a confirmation response sentence module 30, and a response voice sending module 40. The detailed description of each functional module is as follows:

The audio judging module 10 is used to obtain the audio signal of the client channel when the dialogue voice is played, and to judge whether the specified parameter of the audio signal is greater than the first preset threshold;

The suspension playing module 20 is configured to stop playing the dialogue voice if the designated parameter of the audio signal is greater than a first preset threshold;

The determining response sentence module 30 is configured to analyze the audio signal and obtain the analysis result of the audio signal, and determine the response sentence according to the analysis result;

The sending response voice module 40 is configured to generate a response voice according to the response sentence when the designated parameter of the audio signal of the client channel is less than a second preset threshold, and send the response voice to the corresponding client channel customer of.

Optionally, the voice interaction device further includes:

Get information module, used to obtain customer information;

A call connection establishment module, configured to establish a call connection with the customer according to the customer information;

The dialog text determining module is used to determine the initial voice parameters and the initial dialog text according to the customer information and preset interactive tasks;

Generating an initial dialogue voice module, configured to generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text;

The initial dialogue voice sending module is used to send the initial dialogue voice to the client.

Optionally, the determining response sentence module 30 includes:

A parsing unit, configured to analyze the audio signal of the client channel and obtain an analysis result of the audio signal, wherein the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;

Selecting a connection sentence unit for selecting a connection sentence and a first voice adjustment parameter corresponding to the analysis result that does not contain human voice if the obtained analysis result is that the audio signal does not contain human voice;

The first generating response sentence unit is configured to generate the response sentence according to the connection sentence and the dialogue voice, and associate the response sentence with the first voice adjustment parameter.

Optionally, the determining response sentence module 30 further includes:

The voice recognition unit is configured to, if the obtained analysis result is that the audio signal contains human voice, convert the audio signal of the customer channel into text data through a voice recognition engine, and recognize the voice through a preset tone recognition model The tone type of the audio signal of the client channel;

The semantic understanding unit is used to identify the semantic information of the text data through the semantic understanding engine;

The second generating response sentence unit is configured to select the response sentence matching the semantic information from a preset response sentence database, and obtain a second voice adjustment parameter matching the tone type, and the second voice adjustment The parameter is associated with the response sentence.

Optionally, the sending and answering voice module 40 includes:

A background noise recognition unit, configured to recognize the background noise type of the audio signal of the client channel;

An acquiring threshold unit, configured to acquire the second preset threshold matching the background noise type;

Sending a response voice unit, configured to generate the response voice according to the response sentence and the first voice adjustment parameter when the designated parameter of the audio signal of the client channel is less than a second preset threshold, and send the response The voice is sent to the customer corresponding to the customer channel.

For the specific limitation of the voice interaction device, please refer to the above limitation of the voice interaction method, which will not be repeated here. Each module in the above-mentioned voice interaction device can be implemented in whole or in part by software, hardware, and a combination thereof. The above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 8. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium. The database of the computer equipment is used to store data related to the above-mentioned voice interaction method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by the processor to realize a voice interaction method. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.

In one embodiment, a computer device is provided, including a memory, a processor, and computer readable instructions stored in the memory and capable of running on the processor. When the processor executes the computer readable instructions, the following steps are implemented:

In one embodiment, a computer-readable storage medium is provided. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium. The readable storage medium stores computer readable instructions, and when the computer readable instructions are executed by the processor, the following steps are implemented:

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer. In a readable storage medium or a volatile readable storage medium, when the computer readable instruction is executed, it may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Those skilled in the art can clearly understand that for the convenience and conciseness of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be allocated to different functional units and modules as required. Module completion, that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that it can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of protection of this application.

Claims

A voice interaction method, characterized in that it comprises:

When playing the dialogue voice, obtain the audio signal of the client channel, and determine whether the specified parameter of the audio signal is greater than the first preset threshold;

If the designated parameter of the audio signal is greater than the first preset threshold, stop playing the dialogue voice;

Parse the audio signal and obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the response voice is sent to the client corresponding to the client channel.
The voice interaction method according to claim 1, characterized in that, before acquiring the audio signal of the client channel when the dialogue voice is played, and determining whether the designated parameter of the audio signal is greater than the first preset threshold, the method further comprises :

Obtain customer information;

Establishing a call connection with the customer according to the customer information;

Determining initial voice parameters and initial dialogue text according to the customer profile and preset interactive tasks;

Generating an initial dialogue voice according to the initial voice parameters and the initial dialogue text;

The initial dialogue voice is sent to the customer.
5. The voice interaction method according to claim 1, wherein the analyzing the audio signal and obtaining the analysis result of the audio signal, and determining the response sentence according to the analysis result, comprises:

Parse the audio signal of the client channel and obtain an analysis result of the audio signal, where the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;

If the obtained analysis result is that the audio signal does not contain human voice, select the connection sentence and the first voice adjustment parameter corresponding to the analysis result that does not contain human voice;

The response sentence is generated according to the connection sentence and the dialogue voice, and the response sentence is associated with the first voice adjustment parameter.
The voice interaction method of claim 3, wherein the analysis result of the audio signal of the client channel is analyzed and the analysis result of the audio signal is obtained, wherein the analysis result includes that the audio signal contains human voice or After the audio signal does not contain human voice, it also includes:

If the obtained analysis result is that the audio signal contains human voice, the audio signal of the customer channel is converted into text data by a speech recognition engine, and the audio signal of the customer channel is recognized through a preset tone recognition model Tone type

Recognizing the semantic information of the text data through a semantic understanding engine;

The response sentence matching the semantic information is selected from a preset response sentence database, and a second voice adjustment parameter matching the tone type is obtained, and the second voice adjustment parameter is associated with the response sentence.
The voice interaction method of claim 3, wherein when the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the response The voice sent to the customer corresponding to the customer channel includes:

Identifying the background noise type of the audio signal of the client channel;

Acquiring the second preset threshold that matches the type of background noise;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, the response voice is generated according to the response sentence and the first voice adjustment parameter, and the response voice is sent to the client The customer corresponding to the channel.
A voice interaction device, characterized in that it comprises:

The audio judgment module is used to obtain the audio signal of the client channel when the dialogue voice is played, and judge whether the specified parameter of the audio signal is greater than the first preset threshold;

The suspension playing module is configured to stop playing the dialogue voice if the designated parameter of the audio signal is greater than a first preset threshold;

The determining response sentence module is used to analyze the audio signal and obtain the analysis result of the audio signal, and determine the response sentence according to the analysis result;

The sending response voice module is used to generate a response voice according to the response sentence when the designated parameter of the audio signal of the client channel is less than a second preset threshold, and send the response voice to the corresponding client channel client.
8. The voice interaction device of claim 6, further comprising:

Get information module, used to obtain customer information;

A call connection establishment module, configured to establish a call connection with the customer according to the customer information;

The dialog text determining module is used to determine the initial voice parameters and the initial dialog text according to the customer information and preset interactive tasks;

Generating an initial dialogue voice module, configured to generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text;

The initial dialogue voice sending module is used to send the initial dialogue voice to the client.
7. The voice interaction device according to claim 6, wherein the determining response sentence module comprises:

A parsing unit, configured to analyze the audio signal of the client channel and obtain an analysis result of the audio signal, wherein the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;

Selecting a connection sentence unit for selecting a connection sentence and a first voice adjustment parameter corresponding to the analysis result that does not contain human voice if the obtained analysis result is that the audio signal does not contain human voice;

The first generating response sentence unit is configured to generate the response sentence according to the connection sentence and the dialogue voice, and associate the response sentence with the first voice adjustment parameter.
8. The voice interaction device according to claim 8, wherein the determining response sentence module further comprises:

The voice recognition unit is configured to, if the obtained analysis result is that the audio signal contains human voice, convert the audio signal of the customer channel into text data through a voice recognition engine, and recognize the voice through a preset tone recognition model The tone type of the audio signal of the client channel;

The semantic understanding unit is used to identify the semantic information of the text data through the semantic understanding engine;

The second generating response sentence unit is configured to select the response sentence matching the semantic information from a preset response sentence database, and obtain a second voice adjustment parameter matching the tone type, and the second voice adjustment The parameter is associated with the response sentence.
8. The voice interaction device according to claim 8, wherein the voice sending and response module comprises:

A background noise recognition unit, configured to recognize the background noise type of the audio signal of the client channel;

An acquiring threshold unit, configured to acquire the second preset threshold matching the background noise type;

Sending a response voice unit, configured to generate the response voice according to the response sentence and the first voice adjustment parameter when the designated parameter of the audio signal of the client channel is less than a second preset threshold, and send the response The voice is sent to the customer corresponding to the customer channel.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor executes the computer-readable instructions as follows Steps: when the dialogue voice is played, the audio signal of the client channel is obtained, and it is judged whether the specified parameter of the audio signal is greater than the first preset threshold;

If the designated parameter of the audio signal is greater than the first preset threshold, stop playing the dialogue voice;

Parse the audio signal and obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the response voice is sent to the client corresponding to the client channel.
The computer device according to claim 11, wherein, when the dialog voice is played, the audio signal of the client channel is acquired, and before the specified parameter of the audio signal is greater than the first preset threshold, the processing When the device executes the computer-readable instructions, the following steps are also implemented:

Obtain customer information;

Establishing a call connection with the customer according to the customer information;

Determining initial voice parameters and initial dialogue text according to the customer profile and preset interactive tasks;

Generating an initial dialogue voice according to the initial voice parameters and the initial dialogue text;

The initial dialogue voice is sent to the customer.
11. The computer device according to claim 11, wherein the analyzing the audio signal and obtaining the analysis result of the audio signal, and determining the response sentence according to the analysis result, comprises:

Parse the audio signal of the client channel and obtain an analysis result of the audio signal, where the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;

If the obtained analysis result is that the audio signal does not contain human voice, select the connection sentence and the first voice adjustment parameter corresponding to the analysis result that does not contain human voice;

The response sentence is generated according to the connection sentence and the dialogue voice, and the response sentence is associated with the first voice adjustment parameter.
The computer device according to claim 13, wherein the analysis result of the audio signal of the client channel is analyzed and the analysis result of the audio signal is obtained, wherein the analysis result includes that the audio signal contains a human voice or a voice signal. After the audio signal does not contain human voice, the processor further implements the following steps when executing the computer-readable instruction:

If the obtained analysis result is that the audio signal contains human voice, the audio signal of the customer channel is converted into text data by a speech recognition engine, and the audio signal of the customer channel is recognized through a preset tone recognition model Tone type

Recognizing the semantic information of the text data through a semantic understanding engine;

The response sentence matching the semantic information is selected from a preset response sentence database, and a second voice adjustment parameter matching the tone type is obtained, and the second voice adjustment parameter is associated with the response sentence.
The computer device according to claim 13, wherein when the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the response voice Send to the customer corresponding to the customer channel, including:

Identifying the background noise type of the audio signal of the client channel;

Acquiring the second preset threshold that matches the type of background noise;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, the response voice is generated according to the response sentence and the first voice adjustment parameter, and the response voice is sent to the client The customer corresponding to the channel.
One or more readable storage media storing computer readable instructions, when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps: when the dialogue voice is played , Obtain the audio signal of the client channel, and determine whether the specified parameter of the audio signal is greater than the first preset threshold;

If the designated parameter of the audio signal is greater than the first preset threshold, stop playing the dialogue voice;

Parse the audio signal and obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the response voice is sent to the client corresponding to the client channel.
The readable storage medium according to claim 16, wherein when the dialogue voice is played, the audio signal of the client channel is acquired, and before the specified parameter of the audio signal is determined to be greater than the first preset threshold, When the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Obtain customer information;

Establishing a call connection with the customer according to the customer information;

Determining initial voice parameters and initial dialogue text according to the customer profile and preset interactive tasks;

Generating an initial dialogue voice according to the initial voice parameters and the initial dialogue text;

The initial dialogue voice is sent to the customer.
15. The readable storage medium according to claim 16, wherein the analyzing the audio signal and obtaining the analysis result of the audio signal, and determining the response sentence according to the analysis result, comprises:

Parse the audio signal of the client channel and obtain an analysis result of the audio signal, where the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;

If the obtained analysis result is that the audio signal does not contain human voice, select the connection sentence and the first voice adjustment parameter corresponding to the analysis result that does not contain human voice;

The response sentence is generated according to the connection sentence and the dialogue voice, and the response sentence is associated with the first voice adjustment parameter.
The readable storage medium of claim 18, wherein the analysis result of the audio signal of the client channel is analyzed and the analysis result of the audio signal is obtained, wherein the analysis result includes that the audio signal contains a human voice. Or after the audio signal does not contain human voice, when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

If the obtained analysis result is that the audio signal contains human voice, the audio signal of the customer channel is converted into text data by a speech recognition engine, and the audio signal of the customer channel is recognized through a preset tone recognition model Tone type

Recognizing the semantic information of the text data through a semantic understanding engine;

The response sentence matching the semantic information is selected from a preset response sentence database, and a second voice adjustment parameter matching the tone type is obtained, and the second voice adjustment parameter is associated with the response sentence.
The readable storage medium of claim 18, wherein when the designated parameter of the audio signal of the client channel is less than a second preset threshold, a response voice is generated according to the response sentence, and the The response voice sent to the customer corresponding to the customer channel includes:

Identifying the background noise type of the audio signal of the client channel;

Acquiring the second preset threshold that matches the background noise type;

When the designated parameter of the audio signal of the client channel is less than a second preset threshold, the response voice is generated according to the response sentence and the first voice adjustment parameter, and the response voice is sent to the client The customer corresponding to the channel.