CN112489642A - Method, apparatus, device and storage medium for controlling voice robot response - Google Patents

Method, apparatus, device and storage medium for controlling voice robot response

Info

Publication number
CN112489642A (application CN202011130332.3A)
Authority
CN
China
Prior art keywords
voice
robot
call state
scene
played
Prior art date
Legal status (assumed; not a legal conclusion)
Granted
Application number
CN202011130332.3A
Other languages
Chinese (zh)
Other versions
CN112489642B (en)
Inventor
刘彦华
邓锐涛
王艺霏
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202011130332.3A priority Critical patent/CN112489642B/en
Publication of CN112489642A publication Critical patent/CN112489642A/en
Application granted granted Critical
Publication of CN112489642B publication Critical patent/CN112489642B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/1822: Parsing for meaning understanding (speech classification or search using natural language modelling)
    • G10L15/222: Barge in, i.e. overridable guidance for interrupting prompts
    • G10L2015/223: Execution procedure of a spoken command
    • H: ELECTRICITY
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/527: Centralised call answering arrangements not requiring operator intervention

Abstract

The application relates to a method, an apparatus, a device and a storage medium for controlling a voice robot response. The method comprises the following steps: collecting voice from the user terminal during a voice call between the voice robot and the user terminal; determining the current call state scene type according to the voice collection result, where the call state scene type is used for representing the states of the user corresponding to the user terminal and of the voice robot in the voice call; acquiring a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme. By adopting the method, the accuracy of controlling the robot's response can be improved.

Description

Method, apparatus, device and storage medium for controlling voice robot response
Technical Field
The present application relates to the field of artificial intelligence technology and the field of voice call technology, and in particular, to a method, an apparatus, a device, and a storage medium for controlling a voice robot response.
Background
With the development of artificial intelligence technology, many scenarios in which robots replace human beings have appeared. The voice robot is a commonly used intelligent robot, and can replace manual customer service to communicate with a user, so that part of customer service affairs are executed. For example, it is a common scenario to use a voice robot to make an outbound call. The outbound call refers to actively calling a user through the voice robot to establish a voice call.
In the traditional method, the voice robot communicates with the user only by playing fixed voice according to a preset fixed flow. However, different users react differently to the same voice content played by the voice robot, so responding to every user with a single fixed voice is too limited, which lowers the response accuracy of the robot. The low response accuracy of the traditional method is therefore a problem to be urgently solved.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a method, an apparatus, a computer device, and a storage medium for controlling a voice robot response that can improve response accuracy.
A method of controlling voice robot response, the method comprising:
in the voice communication process between the voice robot and the user terminal, voice collection is carried out on the user terminal;
determining the type of the current call state scene according to the voice acquisition result; the call state scene type is used for representing the states of a user corresponding to the user terminal and the voice robot in voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
In one embodiment, the performing voice collection on the user terminal includes:
and starting to record voice data from the detection of the initial voice signal of the user terminal until the voice signal of the user terminal is not detected within the continuous preset time, and stopping to obtain the recorded voice data.
In one embodiment, the determining the current call state scene type according to the voice acquisition result includes:
converting the collected voice data into text content;
and analyzing the text content and determining the type of the current call state scene.
In one embodiment, the parsing the text content to determine the current call state scene type includes:
if the text content is empty, judging that the current call state scene type is a silent scene;
the acquiring of the robot response scheme corresponding to the call state scene type includes:
acquiring user state confirmation voice corresponding to the silent scene;
the controlling the voice robot to respond according to the robot response scheme includes:
and controlling the voice robot to play the user state confirmation voice.
In one embodiment, the method further comprises:
acquiring a call state of the voice robot;
the analyzing the text content and determining the current call state scene type comprises the following steps:
and if the text content is not empty and the call state of the voice robot is the broadcast state, judging that the current call state scene type is an abnormal interruption scene.
In one embodiment, the obtaining a robot response scheme corresponding to the call state scene type includes:
performing preset tag detection processing on the voice played by the voice robot in a broadcasting state;
judging whether the played voice is allowed to be interrupted or not according to the tag detection result;
if the played voice is allowed to be interrupted, acquiring a broadcast stopping scheme corresponding to an abnormal interruption scene;
and if the played voice is not allowed to be interrupted, acquiring a play maintaining scheme corresponding to the abnormal interruption scene.
In one embodiment, the performing preset tag detection processing on the voice played by the voice robot in the broadcast state includes:
determining a voice file in which the played voice is located;
detecting whether the voice file carries the continuous playing label or not;
the judging whether the played voice is allowed to be interrupted or not according to the tag detection result comprises the following steps:
if the fact that the continuous playing tag is carried is detected, judging that the played voice is not allowed to be interrupted;
and if the continuous playing tag is not detected to be carried, judging that the played voice is allowed to be interrupted.
In one embodiment, the determining that the played voice is allowed to be interrupted if the continuous playing tag is not detected to be carried includes:
if the fact that the continuous playing tag is carried is not detected, determining a time node to which the voice robot plays in the played voice currently;
detecting whether the voice content under the time node carries an interrupt prohibition tag or not;
if the interruption prohibition tag is not carried, judging that the played voice is allowed to be interrupted;
the method further comprises the following steps:
and if the interruption prohibition tag is carried, judging that the played voice is not allowed to be interrupted.
In one embodiment, the parsing the text content to determine the current call state scene type includes:
if the voice robot is not in the broadcasting state and the text content is not empty, performing repeated content analysis on the text content;
if the text content is analyzed to have continuous and repeated content in a preset time period, judging that the current call state scene type is a repeated scene;
the controlling the voice robot to respond according to the robot response scheme includes:
and controlling the voice robot to play active interrupting voices corresponding to the repeated scenes.
In one embodiment, the determining the current call state scene type according to the voice acquisition result includes:
if the voice signal of the user disappears and the voice signal of the user is detected again within a preset time length, judging that the current call state scene type is a long sentence waiting scene;
the acquiring of the robot response scheme corresponding to the call state scene type includes:
splicing the voice signal collected before the redetection and the voice signal collected after the redetection;
converting the spliced voice signal into text content;
performing semantic recognition on the text content, and acquiring response information corresponding to a semantic recognition result;
and converting the response information into response voice.
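The long-sentence-waiting handling above can be sketched minimally: when the user's voice pauses and resumes within the preset window, the segments before and after the pause are spliced and treated as one utterance. The `splice_segments` helper, the millisecond pause measure, and the 800 ms window are illustrative assumptions, not values from the application.

```python
# Sketch of the long sentence waiting scene: if the user's voice signal
# disappears and reappears within a preset time length, the segments
# collected before and after the pause are spliced into one utterance
# for recognition. The pause window value is an illustrative assumption.

def splice_segments(first, second, pause_ms, max_pause_ms=800):
    """Return one spliced signal when the pause fits the waiting window,
    otherwise None to signal that the segments are separate utterances."""
    if pause_ms <= max_pause_ms:
        return first + second  # treat as one long sentence
    return None
```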
In one embodiment, the determining the current call state scene type according to the voice acquisition result includes:
detecting the signal intensity of the collected voice signal; if the signal intensity is lower than a preset threshold value, judging that the current call state scene type is an inaudible scene;
alternatively,
detecting the continuity of the acquired voice signal;
and if the continuity detection result indicates that the signal is discontinuous, judging that the current call state scene type is an inaudible scene.
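A minimal sketch of the two inaudible-scene checks above (signal strength below a preset threshold, or a discontinuous signal); the threshold value, the zero-run continuity test, and the amplitude-list representation of the signal are illustrative assumptions.

```python
# Sketch of the inaudible scene determination: the scene is flagged when
# the average signal strength is below a preset threshold, or when the
# signal is discontinuous (here approximated by a long run of zero
# samples). Threshold and gap values are illustrative assumptions.

def is_inaudible(samples, threshold=0.1, max_gap=3):
    """samples: sequence of amplitude values, e.g. in [-1, 1]."""
    if not samples:
        return True
    mean_level = sum(abs(s) for s in samples) / len(samples)
    if mean_level < threshold:
        return True                    # too weak to hear clearly
    gap = longest_gap = 0
    for s in samples:
        gap = gap + 1 if s == 0 else 0
        longest_gap = max(longest_gap, gap)
    return longest_gap > max_gap       # broken-up (discontinuous) signal
```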
An apparatus to control voice robot response, the apparatus comprising:
the voice acquisition module is used for acquiring voice of the user terminal in the voice call process between the voice robot and the user terminal;
the call state scene recognition module is used for determining the current call state scene type according to the voice acquisition result; the call state scene type is used for representing the states of the user corresponding to the user terminal and the voice robot in the voice call;
the robot response module is used for acquiring a robot response scheme corresponding to the call state scene type and controlling the voice robot to respond according to the robot response scheme.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
in the voice communication process between the voice robot and the user terminal, voice collection is carried out on the user terminal;
determining the type of the current call state scene according to the voice acquisition result; the call state scene type is used for representing the states of a user corresponding to the user terminal and the voice robot in voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
in the voice communication process between the voice robot and the user terminal, voice collection is carried out on the user terminal;
determining the type of the current call state scene according to the voice acquisition result; the call state scene type is used for representing the states of a user corresponding to the user terminal and the voice robot in voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
According to the method, the device, the computer equipment and the storage medium for controlling the response of the voice robot, voice collection is carried out on the user terminal in the voice communication process between the voice robot and the user terminal. And determining the current call state scene type according to the voice acquisition result, wherein the call state scene type is used for representing the states of the user corresponding to the user terminal and the voice robot in the voice call. Therefore, the voice robot can be controlled to respond according to the robot response scheme corresponding to the call state scene type. Therefore, the robot is controlled to flexibly respond to different communication states, and the accuracy of response control on the voice robot is improved.
Drawings
FIG. 1 is a diagram of an application environment for a method of controlling voice robot response in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a method for controlling voice robot response in one embodiment;
FIG. 3 is a flow diagram illustrating a method for controlling voice robot response in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating a method for controlling voice robot response in one embodiment;
FIG. 5 is a schematic flow chart diagram illustrating a method for controlling voice robot response in one embodiment;
FIG. 6 is a flow diagram illustrating a method for controlling voice robot response in one embodiment;
FIG. 7 is a block diagram of an apparatus for controlling the response of a voice robot in one embodiment;
FIG. 8 is a block diagram showing the construction of an apparatus for controlling a voice robot response according to another embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for controlling the voice robot response provided by the application can be applied to the application environment shown in fig. 1. Wherein the call platform 102 communicates with the user terminal 104 over a network. The intelligent robot in the call platform 102 can make a voice call with the user terminal. The user terminal 104 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The call platform 102 may be implemented as a stand-alone server or as a server cluster of multiple servers. The voice robot is an intelligent calling and answering module in a calling platform and can automatically carry out voice conversation with a user in voice communication. The call platform 102 may be an outbound platform that actively initiates a call to the user terminal, or may be a platform that receives a call initiated by the user terminal, which is not limited in this respect.
The call platform 102 performs voice collection on the user terminal 104 during a voice call between the voice robot and the user terminal 104. The call platform 102 may determine the current call state scene type according to the voice collection result. The call platform 102 may obtain a robot response scheme corresponding to the call state scene type, and control the voice robot to respond according to the robot response scheme. If the voice robot generates a response voice, the response voice may be transmitted to the user terminal 104.
It should be noted that fig. 1 is only a schematic illustration, and in other embodiments, the voice robot may also be a stand-alone computer device (for example, a humanoid simulation robot with voice call capability), and is not limited to one intelligent module in the call platform, and communication may be performed between the voice robot itself and the user terminal. Then, the method of controlling the voice robot response in the embodiments of the present application may be performed by the voice robot itself.
In one embodiment, as shown in fig. 2, a method for controlling voice robot response is provided, which is illustrated by applying the method to the call platform in fig. 1, and includes the following steps:
step 202, during the voice communication process between the voice robot and the user terminal, voice collection is performed on the user terminal.
The voice robot is an artificial intelligent robot which is in a calling platform and can autonomously communicate with a user in a user terminal.
Specifically, the voice robot may establish a voice call with the user terminal, and in the voice call process, the call platform may perform voice acquisition on the user terminal to acquire voice data of a user at the user terminal side.
In one embodiment, the call platform may be an outbound platform, and the voice robot in the outbound platform may actively initiate a call to the user terminal to establish a voice call with the user terminal. In the voice communication process, the voice robot can carry out voice collection on the user terminal so as to collect voice data of a user at the user terminal side. In other embodiments, the voice acquisition device in the call platform may also perform voice acquisition on the user terminal.
In one embodiment, the call platform may also be a platform that receives calls initiated by user terminals. Namely, the user terminal actively initiates a call request to the call platform to establish a voice call with the voice robot which answers in the call platform. It can be understood that the voice robot in this embodiment is equivalent to an artificial intelligence customer service with a voice call function.
And step 204, determining the current call state scene type according to the voice acquisition result.
The call state scene type refers to a type of a call state scene.
The call state refers to a state of the user corresponding to the user terminal and the voice robot in the voice call, for example, the state of the user in the call, or the state of the voice robot, which is different from the call state represented by the traditional signal quality.
In an embodiment, the call state scene type may include at least one of a silence scene, an abnormal interruption scene, a repeated scene, a long sentence waiting scene, and an inaudible scene. For example, the repeated description of the scene indicates that the user is in a repeated description state during the call.
In one embodiment, the call platform may determine the current call state scene type directly from the voice data in the voice acquisition result. Specifically, the call platform may determine the current call state scene type according to the voice waveform in the voice data.
In one embodiment, the call platform may also perform text conversion on the voice capture result and identify the current call state scene type.
And step 206, acquiring a robot response scheme corresponding to the call state scene type.
The robot response scheme refers to a scheme in which the robot responds to the voice collected from the user terminal during the voice call.
In one embodiment, the robot response scheme may include at least one of a stop play scheme, a maintain play scheme, an active break scheme, and a scheme to generate a response voice, among others.
Specifically, a corresponding robot response scheme is preset in the call platform for different call state scene types. The call platform may look up a robot response scenario corresponding to the determined call state scenario type.
And step 208, controlling the voice robot to respond according to the robot response scheme.
Specifically, the call platform may control the voice robot to respond during the voice call between the voice robot and the user terminal according to the robot response scheme.
It can be understood that different robot response schemes exist according to different call state scene types, and therefore the voice robot can be controlled to carry out different flexible responses.
According to the method for controlling the voice robot response, voice collection is carried out on the user terminal in the voice communication process between the voice robot and the user terminal. And determining the current call state scene type according to the voice acquisition result, wherein the call state scene type is used for representing the states of the user corresponding to the user terminal and the voice robot in the voice call. Therefore, the voice robot can be controlled to respond according to the robot response scheme corresponding to the call state scene type. Therefore, the robot is controlled to flexibly respond to different communication states, and the accuracy of response control on the voice robot is improved. Furthermore, the naturalness, the authenticity and the accuracy of the human-computer interaction are improved.
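The flow above can be sketched as a small dispatch loop: classify the call state scene from the collection result, then look up the preset response scheme. The scene labels, the `RESPONSE_SCHEMES` table, and the fallback scheme are illustrative assumptions, not the application's implementation.

```python
# Sketch of the claimed control flow: classify the call state scene,
# look up the preset robot response scheme, and respond accordingly.
# Scene names and the scheme table are illustrative assumptions.

RESPONSE_SCHEMES = {
    "silent": "play_user_state_confirmation",
    "abnormal_interruption": "tag_based_stop_or_keep_playing",
    "repeated": "play_active_interrupting_voice",
    "long_sentence_waiting": "splice_and_answer",
    "inaudible": "ask_user_to_repeat",
}

def classify_scene(text, robot_broadcasting):
    """Classify the current call state scene from the recognized text."""
    if text == "":
        return "silent"                # user said nothing
    if robot_broadcasting:
        return "abnormal_interruption" # user spoke over the broadcast
    return "normal"

def control_response(text, robot_broadcasting):
    scene = classify_scene(text, robot_broadcasting)
    # Ordinary utterances fall back to generating a response voice.
    return RESPONSE_SCHEMES.get(scene, "generate_response_voice")
```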
In one embodiment, the voice acquisition for the user terminal includes: and starting to record voice data from the detection of the initial voice signal of the user terminal until the voice signal of the user terminal is not detected within the continuous preset time length, and then stopping to obtain the recorded voice data.
The initial voice signal refers to the first voice signal of the user terminal detected after silence detection begins.
Specifically, in the Voice call process, the call platform may perform Voice acquisition on the user terminal through Voice endpoint Detection (VAD). During the collection, the recording of the voice data may be started from the beginning of silence detection when an initial voice signal of the user terminal is detected (i.e., when the beginning of speaking of the user is detected), and then the recording is stopped after the voice signal of the user terminal is not detected within a continuous preset time period, so as to obtain the recorded voice data.
For example, from the beginning of silence detection, recording starts when the user is detected to speak; then, once no voice signal has been detected for a continuous 20ms, the utterance is considered to be over, recording stops, and a piece of voice data is obtained.
In this embodiment, a single utterance spoken by the user can be collected accurately, the computational burden of analyzing the full stream of collected speech is avoided, and voice data that actually carries information can be collected accurately in this way.
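The endpoint-detection collection described above can be sketched frame by frame: recording starts at the first voiced frame and stops after a continuous run of silent frames. The frame-list representation and the `silence_frames` hangover are illustrative assumptions (the 20ms figure in the text is itself only an example).

```python
# Sketch of voice endpoint detection (VAD) recording: start recording
# when the initial voice signal is detected, stop once no voice has
# been seen for a continuous preset number of frames, and return the
# recorded utterance without the trailing silence.

def record_utterance(frames, is_voiced, silence_frames=2):
    """frames: iterable of audio frames; is_voiced: frame -> bool."""
    recorded = []
    started = False
    silent_run = 0
    for frame in frames:
        if not started:
            if is_voiced(frame):       # initial voice signal detected
                started = True
                recorded.append(frame)
            continue
        recorded.append(frame)
        if is_voiced(frame):
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= silence_frames:   # continuous silence: stop
                return recorded[:-silence_frames]
    return recorded
```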
In one embodiment, the step 204 of determining the current call state scene type according to the voice collecting result includes: converting the collected voice data into text content; and analyzing the text content and determining the type of the current call state scene.
Specifically, the call platform may perform voice recognition processing on the collected voice data through the voice recognition device to convert the collected voice data into text content. The calling platform can send the converted text content to the central control device through the voice recognition device, the central control device analyzes the text content, and the type of the current conversation state scene is judged according to the analysis result.
In the embodiment, the voice data is converted into the text content to perform the call state scene analysis, so that compared with the method of directly performing the scene analysis according to the voice, the analysis difficulty is reduced, the scene analysis efficiency is improved, and the analysis resources of the system are saved.
As shown in fig. 3, in one embodiment, a method for controlling voice robot response is provided, which specifically comprises the following steps:
step 302, during the voice communication between the voice robot and the user terminal, voice collection is performed on the user terminal.
Step 304, converting the collected voice data into text content, and analyzing the text content.
Step 306, if the text content is empty, it is determined that the current call state scene type is a silent scene.
The silence scene refers to a silence state in which the user is not speaking during the call. The user state confirmation voice is a voice for inquiring the state where the user is currently located.
Specifically, after the collected voice data is converted into text content, the call platform may detect whether the text content is empty through the central control device, that is, detect whether there is text information in the text content, and when the text content is empty, it indicates that the user at the user terminal side does not speak during the process of collecting the voice data, so that it may be determined that the current call state scene type is a silent scene.
Step 308, obtaining a user state confirmation voice corresponding to the silent scene; and the user state confirmation voice is used for confirming whether the user is in the answering state or not.
And step 310, controlling the voice robot to play the user state confirmation voice.
The call platform sets user state confirmation voice aiming at the silent scene in advance, and can acquire the user state confirmation voice corresponding to the silent scene and control the voice robot to play the user state confirmation voice so as to inquire whether the user is in an answering state currently.
For example, in a silent scene, the voice robot can be controlled to play the user state confirmation voice "Are you still listening?".
It can be understood that a unified user state confirmation voice can be set in the call platform in advance for all silent scenes. In other embodiments, the call platform may further determine, after identifying a silent scene, the timing at which the silent scene occurred, and play different user state confirmation voices for different timings. For example, when the silent scene occurs after the voice robot plays an active query voice, a voice asking whether the user is still listening may be played, such as "Are you still listening?". When the silent scene occurs after the voice robot plays an answering voice (that is, after the voice robot answers a user question), a voice asking whether the user has other questions, or whether the question was answered successfully, may be played, such as "Do you have any other questions?" or "Did that answer your question?".
In the above embodiment, if it is detected that the current state is in the silent scene, the voice robot is controlled to play the user state confirmation voice to inquire the current state of the user, so that the accuracy and the authenticity of the human-computer interaction are improved.
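The timing-dependent choice of user state confirmation voice described above can be sketched as a lookup on the robot's last action; the action labels and prompt strings below are illustrative assumptions, not wording from the application.

```python
# Sketch of choosing a user state confirmation voice by when the silent
# scene occurred. The last_robot_action values and prompts are
# illustrative assumptions.

def confirmation_voice(last_robot_action):
    if last_robot_action == "active_query":
        # Silence right after the robot asked the user a question.
        return "Are you still listening?"
    if last_robot_action == "answer":
        # Silence after the robot answered a user question.
        return "Do you have any other questions?"
    return "Hello, can you hear me?"   # generic fallback prompt
```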
In one embodiment, the method further comprises: acquiring the call state of the voice robot. In this embodiment, parsing the text content and determining the current call state scene type includes: if the text content is not empty and the call state of the voice robot is the broadcast state, judging that the current call state scene type is an abnormal interruption scene. The call state of the voice robot refers to the state the voice robot is in during the call. The broadcast state means that the voice robot is playing voice to the user terminal. The abnormal interruption scene refers to a scene in which the voice robot's normal broadcast is interrupted by the user's voice.
Specifically, if the text content is not empty and the call state of the voice robot is a broadcast state, it indicates that the user speaks when the voice robot is playing voice, and it indicates that the user wants to interrupt the broadcast of the voice robot, so that it can be determined that the current call state scene type is an abnormal interrupt scene.
It can be understood that, for an abnormal interruption scene, the call platform can judge whether interruption is allowed or not according to the content of the current broadcast of the voice robot, so as to flexibly control the voice robot to perform corresponding response.
In this embodiment, the current call state scene type can be conveniently judged to be an abnormal interruption scene from the fact that the text content is not empty while the voice robot is in the broadcast state, without complex scene analysis processing.
As shown in fig. 4, in one embodiment, a method for controlling voice robot response is provided, which specifically includes the following steps:
and 402, carrying out voice acquisition on the user terminal in the voice communication process between the voice robot and the user terminal.
Step 404, converting the collected voice data into text content, and analyzing the text content.
If the text content is empty, step 406 is executed. If the text content is not empty and the call state of the voice robot is the broadcast state, step 412 is executed.
Step 406, determining that the current call state scene type is a silence scene.
Step 408, obtaining the user state confirmation voice corresponding to the silence scene.
Step 410, controlling the voice robot to play the user state confirmation voice.
Step 412, determining the current call state scene type as an abnormal interruption scene.
Step 414, performing preset tag detection processing on the voice played by the voice robot in the broadcast state.
Step 416, according to the tag detection result, determine whether the played voice is allowed to be interrupted. If the played voice is allowed to be interrupted, step 418 is performed, and if the played voice is not allowed to be interrupted, step 420 is performed.
Specifically, after the current call state scene type is determined to be an abnormal interruption scene, the call platform may perform preset tag detection processing on the voice played when the voice robot is in the broadcast state, and determine whether the played voice is allowed to be interrupted according to a tag detection result.
It can be understood that one convention is to determine that the played voice is not allowed to be interrupted when the preset tag is detected, and allowed to be interrupted when it is not. The opposite convention is also possible: the played voice is allowed to be interrupted when the preset tag is detected, and not allowed when it is not. Neither convention is limiting here.
Step 418, acquiring a broadcast stopping scheme corresponding to the abnormal interruption scene, and controlling the voice robot to stop the current broadcast according to the broadcast stopping scheme.
Step 420, acquiring a play maintaining scheme corresponding to the abnormal interruption scene, and controlling the voice robot to continue the current broadcast.
Further, if the played voice is allowed to be interrupted, the call platform may acquire the broadcast stopping scheme corresponding to the abnormal interruption scene and control the voice robot to stop the current broadcast accordingly. It can be understood that the call platform can suspend the current broadcast according to the broadcast stopping scheme and resume it after the abnormal interruption scene has been handled, or end the current broadcast altogether according to the scheme.
If the played voice is not allowed to be interrupted, the call platform can acquire a play maintaining scheme corresponding to the abnormal interruption scene, so that the voice robot is controlled to continue the current broadcast without interruption according to the play maintaining scheme.
In the above embodiment, by detecting the tag carried in the voice played by the voice robot, whether the voice is allowed to be interrupted can be determined accurately and quickly, so that the voice robot's response can be accurately controlled.
In one embodiment, the preset tag detection processing performed on the voice played by the voice robot in the broadcast state includes: determining the voice file in which the played voice is located, and detecting whether the voice file carries the continuous playing tag. In this embodiment, determining whether the played voice is allowed to be interrupted according to the tag detection result includes: if the continuous playing tag is detected, determining that the played voice is not allowed to be interrupted; if the continuous playing tag is not detected, determining that the played voice is allowed to be interrupted.
The continuous play tag is a tag for indicating that the voice needs to be continuously played.
Specifically, in the call platform, a continuous playing tag is added in advance, at the granularity of the entire voice file, to key voice files to be played by the voice robot. The call platform can determine the voice file in which the played voice is located and detect whether that file carries the continuous playing tag. If the tag is detected, the played voice needs to be played continuously and is not allowed to be interrupted; that is, the voice content of the entire voice file must not be interrupted while playing. If the tag is not detected, the played voice is not specified as requiring continuous playing, and interruption is therefore allowed.
In the embodiment, the continuous playing tag is added to the key voice file which is not allowed to be interrupted, so that whether the currently played voice file is allowed to be interrupted or not can be quickly judged, and the voice robot can be accurately controlled to respond.
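A minimal sketch of this file-level check, assuming (purely for illustration) that the tagged files are tracked as a simple set of file IDs — the patent does not specify how the call platform stores tags:

```python
# Hypothetical storage: IDs of voice files carrying the continuous
# playing tag. The actual storage mechanism is an assumption.
CONTINUOUS_PLAY_TAGGED_FILES = {"opening_notice.wav", "legal_terms.wav"}


def is_interruption_allowed(voice_file_id: str) -> bool:
    """A file carrying the continuous playing tag must finish playing,
    so interruption is allowed only for untagged files."""
    return voice_file_id not in CONTINUOUS_PLAY_TAGGED_FILES
```

The absence of the tag, rather than its presence, is what permits interruption, matching the convention adopted in this embodiment.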
As shown in fig. 5, in one embodiment, a method for controlling voice robot response is provided, which specifically includes the following steps:
step 502, during the voice communication process between the voice robot and the user terminal, voice collection is performed on the user terminal.
Step 504, converting the collected voice data into text content, and analyzing the text content.
If the text content is empty, go to step 506. If the text content is not empty and the voice robot is in the broadcast state, go to step 512.
Step 506, determining the current call state scene type as a silence scene.
Step 508, obtaining the user state confirmation voice corresponding to the silence scene.
And step 510, controlling the voice robot to play the user state confirmation voice.
And step 512, judging the current call state scene type as an abnormal interruption scene.
Step 514, detecting whether the voice file in which the played voice is located carries a continuous playing tag.
If the continuous playing tag is detected, it is determined that the played voice is not allowed to be interrupted, and step 516 is executed. If the continuous playing tag is not detected, go to step 518.
Step 516, acquiring a play maintaining scheme corresponding to the abnormal interruption scene, and controlling the voice robot to continue the current broadcast.
Step 518, determining the time node to which the voice robot has currently played in the played voice.
Step 520, detecting whether the voice content at the time node carries the interruption prohibition tag. If the tag is carried, it is determined that the played voice is not allowed to be interrupted, and step 516 is executed. If the tag is not carried, it is determined that the played voice is allowed to be interrupted, and step 522 is executed.
Step 522, acquiring a broadcast stopping scheme corresponding to the abnormal interruption scene, and controlling the voice robot to stop the current broadcast according to the broadcast stopping scheme.
The interruption prohibition tag is a tag added, at time-node granularity, to voice content that is not allowed to be interrupted; it indicates that the voice content at that time node must not be interrupted.
Specifically, an interruption prohibition tag is added in advance in the call platform to key voice content within the voice files to be played by the voice robot, so that the key content cannot be abnormally interrupted while playing. When the call platform does not detect a continuous playing tag on the voice file in which the played voice is located, it can further perform second-level tag detection: determine the time node to which the voice robot has currently played, locate the voice content at that time node in the played voice, and detect whether that content carries the interruption prohibition tag. If the tag is not carried, it is determined that the played voice is allowed to be interrupted.
It can be understood that if the interruption prohibition tag is carried, the played voice is judged not allowed to be interrupted; that is, the voice content at the current time node must not be interrupted. If the voice content at a subsequent time node does not carry the tag, the broadcast can be stopped once the protected content at the current node has finished playing.
It should be noted that the interruption prohibition tag is added to local voice content within the entire voice file, so interruption is controlled at the fine granularity of time nodes rather than being prohibited throughout the playing of the whole file. This makes control of the voice robot's response more accurate and flexible, avoids the poor user experience of being unable to interrupt the entire voice, and improves the accuracy of the control response.
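The two-level detection described above — a file-level continuous playing tag checked first, then a time-node-level interruption prohibition tag — can be sketched as follows. The data layout (a flag plus a list of protected time spans) is an assumption for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PlayedVoiceFile:
    # First level: the whole file carries the continuous playing tag.
    continuous_play: bool = False
    # Second level: (start_second, end_second) spans whose content
    # carries the interruption prohibition tag.
    no_interrupt_spans: List[Tuple[float, float]] = field(default_factory=list)


def is_interruption_allowed(f: PlayedVoiceFile, current_second: float) -> bool:
    """Apply file-level, then time-node-level tag detection."""
    if f.continuous_play:
        return False  # the entire file is protected
    for start, end in f.no_interrupt_spans:
        if start <= current_second < end:
            return False  # the content at the current time node is protected
    return True
```

A broadcast stopping scheme would be applied only when this check returns `True`; otherwise the play maintaining scheme keeps the current broadcast running.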
In one embodiment, parsing the text content to determine the current call state scene type comprises: if the voice robot is not in the broadcasting state and the text content is not empty, performing repeated content analysis on the text content; and if the text content obtained through analysis has continuous and repeated content in the preset time period, judging that the current call state scene type is a repeated scene. In this embodiment, according to the robot response scheme, controlling the voice robot to respond includes: and controlling the voice robot to play active interrupting voice corresponding to the repeated scenes.
Repeated content analysis refers to analyzing whether the text content contains repeated content. The active interrupting voice is voice played by the voice robot in order to interrupt the user's speech.
Specifically, if the voice robot is not in the broadcast state and the text content is not empty, the user is speaking on their own, and repeated content analysis can be performed on the text content. If the analyzed text content contains continuously repeated content within the preset time period, it is determined that the current call state scene type is a repeated scene. The call platform stores active interrupting voice in advance and can control the voice robot to play the active interrupting voice corresponding to the repeated scene so as to interrupt the user's repetition.
For example, in an outbound debt-collection scenario in which the voice robot urges the user to repay, the user may keep repeating that they have no money; the voice robot can then be controlled to play preset voice information to interrupt the user.
In the embodiment, repeated content analysis is performed on the text content, so that the repeated scenes can be quickly and accurately determined, and the accuracy and efficiency of robot response are improved.
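One plausible (non-authoritative) way to detect "continuous and repeated content in a preset time period" is a sliding-window count over timestamped utterances; the window size and repeat threshold below are illustrative assumptions:

```python
def is_repeated_scene(utterances, window_seconds=30.0, min_repeats=3):
    """utterances: list of (timestamp_seconds, text), oldest first.

    Returns True when the latest utterance has occurred at least
    min_repeats times within the sliding time window, i.e. the user
    keeps repeating the same content."""
    if not utterances:
        return False
    latest_ts, latest_text = utterances[-1]
    repeats = sum(
        1
        for ts, text in utterances
        if latest_ts - ts <= window_seconds and text == latest_text
    )
    return repeats >= min_repeats
```

In a real system the equality test would likely be replaced by a fuzzy or semantic similarity check, since ASR output for repeated speech rarely matches exactly.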
As shown in fig. 6, in one embodiment, a method for controlling voice robot response is provided, which specifically includes the following steps:
Step 602, starting to record voice data when an initial voice signal from the user terminal is detected, and stopping when no voice signal from the user terminal is detected for a continuous preset duration.
Step 604, converting the collected voice data into text content, parsing the text content, and obtaining the current call state of the voice robot. If the text content is empty, go to step 606; if the text content is not empty and the voice robot is in the broadcast state, go to step 610. If the voice robot is not in the broadcast state and the text content is not empty, go to step 622.
Step 606, determining the current call state scene type as a silent scene.
Step 608, obtaining the user state confirmation voice corresponding to the silent scene; and controlling the voice robot to play the user state confirmation voice.
And step 610, judging that the current call state scene type is an abnormal interruption scene, and determining a voice file where the played voice is located.
Step 612, detecting whether the voice file carries a continuous playing label.
If the continuous playing tag is detected, it is determined that the played voice is not allowed to be interrupted, and step 620 is executed. If the continuous playing tag is not detected, step 614 is executed.
Step 614, determining the time node to which the voice robot has currently played in the played voice.
Step 616, detecting whether the voice content under the time node carries the interrupt prohibition tag.
If the interruption prohibition tag is not carried, it is determined that the played voice is allowed to be interrupted, and step 618 is executed. If the interruption prohibition tag is carried, it is determined that the played voice is not allowed to be interrupted, and step 620 is executed.
Step 618, acquiring a broadcast stopping scheme corresponding to the abnormal interruption scene to control the voice robot to stop the current broadcast.
Step 620, acquiring a play maintaining scheme corresponding to the abnormal interruption scene to control the voice robot to continue the current broadcast.
Step 622, performing repeated content analysis on the text content; if the text content obtained through analysis has continuous and repeated content in a preset time period, judging that the current conversation state scene type is a repeated scene; and controlling the voice robot to play active interrupting voice corresponding to the repeated scenes.
In one embodiment, determining the current call state scene type according to the voice acquisition result includes: if the user's voice signal disappears but is detected again within the preset duration, determining that the current call state scene type is a long sentence waiting scene. In this embodiment, obtaining a robot response scheme corresponding to the call state scene type includes: splicing the voice signal collected before the re-detection with the voice signal collected after the re-detection; converting the spliced voice signal into text content; performing semantic recognition on the text content and acquiring response information corresponding to the semantic recognition result; and converting the response information into response voice.
A long sentence is an expression that contains pauses and consists of multiple clauses needed to convey a complete meaning. The long sentence waiting scene refers to a scene in which the voice robot needs to wait for the user to finish expressing the long sentence.
Specifically, if the user's voice signal disappears but is detected again within the preset duration, the user has merely paused after saying part of a sentence and has not finished speaking. It can then be determined that the current call state scene type is a long sentence waiting scene. After the user is detected to have finished speaking, the voice signal collected before the re-detection (the user's speech before the pause) and the voice signal collected after the re-detection (the user's speech after the pause) can be spliced into a long sentence carrying the user's complete expression. The call platform can convert the spliced voice signal into text content, perform semantic recognition on it, acquire response information corresponding to the semantic recognition result, and convert the response information into response voice. The generated response voice is the robot response scheme corresponding to the long sentence waiting scene, and the call platform can control the voice robot to play it.
In this embodiment, the long sentence waiting scene can be accurately detected, avoiding an incorrect response triggered before the user has finished speaking; the voice signals collected before and after the re-detection are spliced and the response is generated from the spliced content, which improves response accuracy.
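A minimal sketch of the pause test and the splicing step, assuming raw PCM segments of identical format (the threshold value is an illustrative assumption, not from this disclosure):

```python
def is_long_sentence_wait(pause_seconds: float,
                          preset_duration: float = 1.5) -> bool:
    """The user's voice reappeared within the preset duration, so the
    pause is treated as a mid-sentence break, not the end of speech."""
    return pause_seconds <= preset_duration


def splice_voice(before_pause: bytes, after_pause: bytes) -> bytes:
    """Raw PCM segments of the same sample rate and width can simply be
    concatenated to reconstruct the complete long sentence."""
    return before_pause + after_pause
```

The spliced bytes would then be fed to ASR and semantic recognition as a single utterance, as the embodiment describes.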
In one embodiment, determining the current call state scene type according to the voice acquisition result includes: detecting the signal intensity of the collected voice signal; and if the signal intensity is lower than a preset threshold value, judging that the current call state scene type is an inaudible scene.
The inaudible scene refers to a scene in which the content of the voice call is inaudible.
Specifically, the call platform can detect the signal strength of the collected voice signal, and if the signal strength is lower than a preset threshold value, it indicates that the voice signal is weak and the call quality is poor, and then it can be determined that the current call state scene type is an inaudible scene.
In one embodiment, determining the current call state scene type according to the voice acquisition result includes: detecting the continuity of the collected voice signal; and if the continuity detection result indicates that the signal is discontinuous, determining that the current call state scene type is an inaudible scene.
A discontinuous signal refers to a signal that is intermittent or repeatedly broken.
Specifically, the call platform may detect the continuity of the acquired voice signal; if the continuity detection result indicates that the signal is discontinuous, the call quality is poor, and it is determined that the current call state scene type is an inaudible scene.
It can be understood that, for an inaudible scene, the call platform can control the voice robot to play preset voice. For example, the robot can play a prompt such as "The signal seems poor; could you move somewhere else to answer the call?", or play "I will call you back later" and end the call.
In the embodiment, the inaudible scene can be accurately identified according to the voice signal, so that the robot response can be accurately controlled.
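Both inaudible-scene tests can be sketched together: a signal-strength check via RMS level and a continuity check over per-frame gap flags. The threshold and the assumption that dropped frames are reported upstream by the transport layer are illustrative, not from this disclosure:

```python
def rms(samples):
    """Root-mean-square level of decoded PCM sample values."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def is_inaudible_scene(samples, rms_threshold=500.0, frame_gaps=None):
    """samples: decoded 16-bit PCM sample values; frame_gaps: optional
    per-frame booleans marking dropped frames (assumed to be reported
    by the transport layer)."""
    if rms(samples) < rms_threshold:
        return True   # signal too weak: below the preset strength threshold
    if frame_gaps and any(frame_gaps):
        return True   # signal discontinuous: at least one dropped frame
    return False
```

Either condition alone suffices to classify the scene as inaudible, mirroring the two alternative embodiments above.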
It should be understood that, although the steps in the flowcharts of the embodiments of the present application are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an apparatus for controlling a voice robot response, comprising: a voice acquisition module 702, a call state scene recognition module 704, and a robot response module 706, wherein:
and the voice acquisition module 702 is configured to perform voice acquisition on the user terminal in a voice call process between the voice robot and the user terminal.
A call state scene recognition module 704, configured to determine the current call state scene type according to the voice acquisition result; the call state scene type is used to represent the state of the voice call between the user and the voice robot.
A robot response module 706, configured to obtain a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme.
In an embodiment, the voice collecting module 702 is further configured to start recording voice data when an initial voice signal from the user terminal is detected, and stop recording when no voice signal from the user terminal is detected for a continuous preset duration, so as to obtain the recorded voice data.
In one embodiment, the call state scene recognition module 704 is further configured to convert the collected voice data into text content; and analyzing the text content and determining the type of the current call state scene.
In one embodiment, the call state scene recognition module 704 is further configured to determine that the current call state scene type is a silent scene if the text content is empty; the robot response module 706 is further configured to obtain a user state confirmation voice corresponding to the silence scene; and controlling the voice robot to play the user state confirmation voice.
In one embodiment, the call state scene recognition module 704 is further configured to obtain a call state of the voice robot; and if the text content is not empty and the call state of the voice robot is the broadcast state, judging that the current call state scene type is an abnormal interruption scene.
As shown in fig. 8, in one embodiment, the robotic response module 706 includes:
the tag detection module 706a is configured to perform preset tag detection processing on the voice played by the voice robot in the broadcast state; and judging whether the played voice is allowed to be interrupted or not according to the label detection result.
A response scheme obtaining module 706b, configured to obtain a broadcast stop scheme corresponding to the abnormal interruption scene if the played voice is allowed to be interrupted; and if the played voice is not allowed to be interrupted, acquiring a play maintaining scheme corresponding to the abnormal interruption scene.
In one embodiment, the tag detection module 706a is further configured to determine a voice file in which the played voice is located; detecting whether the voice file carries the continuous playing label or not; if the fact that the continuous playing tag is carried is detected, judging that the played voice is not allowed to be interrupted; and if the continuous playing tag is not detected to be carried, judging that the played voice is allowed to be interrupted.
In one embodiment, the tag detection module 706a is further configured to: if the continuous playing tag is not detected, determine the time node to which the voice robot has currently played in the played voice; detect whether the voice content at that time node carries the interruption prohibition tag; if the interruption prohibition tag is not carried, determine that the played voice is allowed to be interrupted; and if the interruption prohibition tag is carried, determine that the played voice is not allowed to be interrupted.
In one embodiment, the call state scene recognition module 704 is further configured to perform repeated content analysis on the text content if the voice robot is not in an on-air state and the text content is not empty; if the text content is analyzed to have continuous and repeated content in a preset time period, judging that the current call state scene type is a repeated scene; the robot response module 706 is further configured to control the voice robot to play an active interrupting voice corresponding to the repeated scene.
In one embodiment, the call state scene recognition module 704 is further configured to determine that the current call state scene type is a long sentence waiting scene if the voice signal of the user disappears but the voice signal of the user is detected again within a preset time duration; the robot response module 706 is further configured to splice the voice signal collected before the re-detection and the voice signal collected after the re-detection; converting the spliced voice signal into text content; performing semantic recognition on the text content, and acquiring response information corresponding to a semantic recognition result; and converting the response information into response voice.
In one embodiment, the call state scene recognition module 704 is further configured to detect the signal strength of the collected voice signal and, if the signal strength is lower than a preset threshold, determine that the current call state scene type is an inaudible scene; or to detect the continuity of the acquired voice signal and, if the continuity detection result indicates that the signal is discontinuous, determine that the current call state scene type is an inaudible scene.
For specific limitations of the apparatus for controlling the voice robot response, reference may be made to the above limitations of the method for controlling the voice robot response, which are not described herein again. The modules in the device for controlling the voice robot response can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server of a call platform, the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the voice played by the voice robot. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of controlling voice robot response.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: in the voice communication process between the voice robot and the user terminal, voice collection is carried out on the user terminal; determining the type of the current call state scene according to the voice acquisition result; the call state scene type is used for representing the states of a user corresponding to the user terminal and the voice robot in voice call; acquiring a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme.
In one embodiment, performing voice collection on the user terminal includes: starting to record voice data when an initial voice signal from the user terminal is detected, and stopping recording when no voice signal from the user terminal is detected for a continuous preset duration, so as to obtain the recorded voice data.
In one embodiment, the determining the current call state scene type according to the voice acquisition result includes: converting the collected voice data into text content; and analyzing the text content and determining the type of the current call state scene.
In one embodiment, the parsing the text content to determine the current call state scene type includes: if the text content is empty, judging that the current conversation state scene type is a silent scene; the acquiring of the robot response scheme corresponding to the call state scene type includes: acquiring user state confirmation voice corresponding to the silent scene; the controlling the voice robot to respond according to the robot response scheme includes: and controlling the voice robot to play the user state confirmation voice.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a call state of the voice robot; the analyzing the text content and determining the current call state scene type comprises the following steps: and if the text content is not empty and the call state of the voice robot is the broadcast state, judging that the current call state scene type is an abnormal interruption scene.
In one embodiment, the acquiring a robot response scheme corresponding to the call state scene type includes: carrying out preset label detection processing on the voice played by the voice robot in a broadcasting state; judging whether the played voice is allowed to be interrupted or not according to the label detection result; if the played voice is allowed to be interrupted, acquiring a broadcast stopping scheme corresponding to an abnormal interruption scene; and if the played voice is not allowed to be interrupted, acquiring a play maintaining scheme corresponding to the abnormal interruption scene.
In one embodiment, the performing preset tag detection processing on the voice played by the voice robot in the broadcast state includes: determining a voice file in which the played voice is located; detecting whether the voice file carries the continuous playing label or not; the judging whether the played voice is allowed to be interrupted or not according to the label detection result comprises the following steps: if the fact that the continuous playing tag is carried is detected, judging that the played voice is not allowed to be interrupted; and if the continuous playing tag is not detected to be carried, judging that the played voice is allowed to be interrupted.
In an embodiment, determining that the played voice is allowed to be interrupted if the continuous playing tag is not detected includes: if the continuous playing tag is not detected, determining the time node to which the voice robot has currently played in the played voice; detecting whether the voice content at that time node carries the interruption prohibition tag; and if the interruption prohibition tag is not carried, determining that the played voice is allowed to be interrupted. In this embodiment, the computer program further implements the following step when executed by the processor: if the interruption prohibition tag is carried, determining that the played voice is not allowed to be interrupted.
In one embodiment, the parsing the text content to determine the current call state scene type includes: if the voice robot is not in a broadcasting state and the text content is not empty, performing repeated content analysis on the text content; if the text content is analyzed to have continuous and repeated content in a preset time period, judging that the current call state scene type is a repeated scene; the controlling the voice robot to respond according to the robot response scheme includes: and controlling the voice robot to play active interrupting voices corresponding to the repeated scenes.
In one embodiment, the determining the current call state scene type according to the voice acquisition result includes: if the voice signal of the user disappears and the voice signal of the user is detected again within a preset time length, judging that the current call state scene type is a long sentence waiting scene; the acquiring of the robot response scheme corresponding to the call state scene type includes: splicing the voice signal collected before the redetection and the voice signal collected after the redetection; converting the spliced voice signal into text content; performing semantic recognition on the text content, and acquiring response information corresponding to a semantic recognition result; and converting the response information into response voice.
In one embodiment, the determining the current call state scene type according to the voice acquisition result includes: detecting the signal intensity of the collected voice signal; if the signal intensity is lower than a preset threshold value, judging that the current call state scene type is an inaudible scene; or, detecting the continuity of the collected voice signal; and if the continuity detection result indicates that the signal is discontinuous, judging that the current call state scene type is an inaudible scene.
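Both branches of the inaudible-scene check (low signal intensity, or a discontinuous signal) can be sketched over raw PCM samples as follows. The RMS measure, the frame size, and the thresholds are assumptions chosen for illustration:

```python
from typing import Sequence

def rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude as a simple signal-intensity measure."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_inaudible(samples: Sequence[int],
                 strength_threshold: float = 500.0,
                 frame: int = 160,
                 gap_ratio: float = 0.5) -> bool:
    # Branch 1: overall signal intensity below the preset threshold.
    if rms(samples) < strength_threshold:
        return True
    # Branch 2: continuity check - too large a share of near-silent frames
    # means the signal keeps dropping out.
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    silent = sum(1 for f in frames if rms(f) < strength_threshold)
    return silent / len(frames) > gap_ratio
```

In the inaudible scene the robot would then play a prompt asking the user to repeat themselves or move to a quieter environment.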
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (14)

1. A method of controlling voice robot response, the method comprising:
in the voice communication process between the voice robot and the user terminal, voice collection is carried out on the user terminal;
determining the type of the current call state scene according to the voice acquisition result; the call state scene type is used for representing the states of a user corresponding to the user terminal and the voice robot in voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
2. The method of claim 1, wherein the performing voice acquisition for the user terminal comprises:
and starting to record voice data upon detecting an initial voice signal of the user terminal, and stopping when no voice signal of the user terminal is detected for a continuous preset duration, to obtain the recorded voice data.
3. The method of claim 1, wherein determining the current call state scene type according to the voice capture result comprises:
converting the collected voice data into text content;
and analyzing the text content and determining the type of the current call state scene.
4. The method of claim 3, wherein parsing the text content to determine a current call state scene type comprises:
if the text content is empty, judging that the current conversation state scene type is a silent scene;
the acquiring of the robot response scheme corresponding to the call state scene type includes:
acquiring user state confirmation voice corresponding to the silent scene;
the controlling the voice robot to respond according to the robot response scheme includes:
and controlling the voice robot to play the user state confirmation voice.
5. The method of claim 3, further comprising:
acquiring a call state of the voice robot;
the analyzing the text content and determining the current call state scene type comprises the following steps:
and if the text content is not empty and the call state of the voice robot is the broadcast state, judging that the current call state scene type is an abnormal interruption scene.
6. The method of claim 5, wherein obtaining a robot response scenario corresponding to the call state scenario type comprises:
carrying out preset label detection processing on the voice played by the voice robot in a broadcasting state;
judging whether the played voice is allowed to be interrupted or not according to the label detection result;
if the played voice is allowed to be interrupted, acquiring a broadcast stopping scheme corresponding to an abnormal interruption scene;
and if the played voice is not allowed to be interrupted, acquiring a play maintaining scheme corresponding to the abnormal interruption scene.
7. The method according to claim 6, wherein the performing of the preset tag detection processing on the voice played by the voice robot in the broadcasting state includes:
determining a voice file in which the played voice is located;
detecting whether the voice file carries the continuous playing label or not;
the judging whether the played voice is allowed to be interrupted or not according to the label detection result comprises the following steps:
if the fact that the continuous playing tag is carried is detected, judging that the played voice is not allowed to be interrupted;
and if the continuous playing tag is not detected to be carried, judging that the played voice is allowed to be interrupted.
8. The method of claim 7, wherein determining that the played voice is allowed to be interrupted if the continuous play tag is not detected, comprises:
if the fact that the continuous playing tag is carried is not detected, determining a time node to which the voice robot plays in the played voice currently;
detecting whether the voice content under the time node carries an interrupt prohibition tag or not;
if the interruption prohibition tag is not carried, judging that the played voice is allowed to be interrupted;
the method further comprises the following steps:
and if the interruption prohibition tag is carried, judging that the played voice is not allowed to be interrupted.
9. The method of claim 3, wherein parsing the text content to determine a current call state scene type comprises:
if the voice robot is not in the broadcasting state and the text content is not empty, performing repeated content analysis on the text content;
if the text content is analyzed to have continuous and repeated content in a preset time period, judging that the current call state scene type is a repeated scene;
the controlling the voice robot to respond according to the robot response scheme includes:
and controlling the voice robot to play active interrupting voices corresponding to the repeated scenes.
10. The method of claim 1, wherein determining the current call state scene type according to the voice capture result comprises:
if the voice signal of the user disappears and the voice signal of the user is detected again within a preset time length, judging that the current call state scene type is a long sentence waiting scene;
the acquiring of the robot response scheme corresponding to the call state scene type includes:
splicing the voice signal collected before the redetection and the voice signal collected after the redetection;
converting the spliced voice signal into text content;
performing semantic recognition on the text content, and acquiring response information corresponding to a semantic recognition result;
and converting the response information into response voice.
11. The method according to any one of claims 1 to 10, wherein the determining a current call state scene type according to the voice acquisition result comprises:
detecting the signal intensity of the collected voice signal; if the signal intensity is lower than a preset threshold value, judging that the current call state scene type is an inaudible scene;
alternatively,
detecting the continuity of the acquired voice signal;
and if the continuity detection result indicates that the signal is discontinuous, judging that the current call state scene type is an inaudible scene.
12. An apparatus for controlling voice robotic response, the apparatus comprising:
the voice acquisition module is used for acquiring voice of the user terminal in the voice call process between the voice robot and the user terminal;
the call state scene recognition module is used for determining the current call state scene type according to the voice acquisition result; the call state scene type is used for representing the states of the user corresponding to the user terminal and the voice robot in the voice call;
the robot response module is used for acquiring a robot response scheme corresponding to the conversation state scene type; and controlling the voice robot to respond according to the robot response scheme.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.
CN202011130332.3A 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response Active CN112489642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011130332.3A CN112489642B (en) 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011130332.3A CN112489642B (en) 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response

Publications (2)

Publication Number Publication Date
CN112489642A true CN112489642A (en) 2021-03-12
CN112489642B CN112489642B (en) 2024-05-03

Family

ID=74927119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011130332.3A Active CN112489642B (en) 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response

Country Status (1)

Country Link
CN (1) CN112489642B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019007245A1 (en) * 2017-07-04 2019-01-10 阿里巴巴集团控股有限公司 Processing method, control method and recognition method, and apparatus and electronic device therefor
CN110427460A (en) * 2019-08-06 2019-11-08 北京百度网讯科技有限公司 Method and device for interactive information
CN111105782A (en) * 2019-11-27 2020-05-05 深圳追一科技有限公司 Session interaction processing method and device, computer equipment and storage medium
WO2020103070A1 (en) * 2018-11-22 2020-05-28 深圳市欢太科技有限公司 Method and apparatus for processing application program, and electronic device
CN111402866A (en) * 2020-03-23 2020-07-10 北京声智科技有限公司 Semantic recognition method and device and electronic equipment
CN111726461A (en) * 2020-06-29 2020-09-29 深圳前海微众银行股份有限公司 Telephone conversation method, device, equipment and computer readable storage medium
CN111752523A (en) * 2020-05-13 2020-10-09 深圳追一科技有限公司 Human-computer interaction method and device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN112489642B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN110661927B (en) Voice interaction method and device, computer equipment and storage medium
JP5124573B2 (en) Detect answering machine using voice recognition
CN111540349B (en) Voice breaking method and device
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
KR102345616B1 (en) Methods and devices for managing hold states
CN112201222B (en) Voice interaction method, device, equipment and storage medium based on voice call
CN111752523A (en) Human-computer interaction method and device, computer equipment and storage medium
CN113779208A (en) Method and device for man-machine conversation
CN112637431A (en) Voice interaction method and device and computer readable storage medium
CN107680592A (en) A kind of mobile terminal sound recognition methods and mobile terminal and storage medium
CN112087726A (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
JP2019047170A (en) Call center system, call center device, interaction method, and program therefor, comprising customer complaint detection function
CN112489642A (en) Method, apparatus, device and storage medium for controlling voice robot response
CN110602334A (en) Intelligent outbound method and system based on man-machine cooperation
CN114863929B (en) Voice interaction method, device, system, computer equipment and storage medium
CN111916072A (en) Question-answering method and device based on voice recognition, computer equipment and storage medium
CN114724587A (en) Voice response method and device
CN116975242A (en) Voice broadcast interrupt processing method, device, equipment and storage medium
CN112365888A (en) Intention recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant