CN112242135A - Voice data processing method and intelligent customer service device - Google Patents


Info

Publication number
CN112242135A
Authority
CN
China
Prior art keywords
user
emotion
speaking
content
preset
Prior art date 2019-07-18
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910650265.9A
Other languages
Chinese (zh)
Inventor
陈孝良 (Chen Xiaoliang)
祖拓 (Zu Tuo)
王江 (Wang Jiang)
冯大航 (Feng Dahang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date 2019-07-18
Publication date 2021-01-19
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201910650265.9A priority Critical patent/CN112242135A/en
Publication of CN112242135A publication Critical patent/CN112242135A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/005 - Language recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - characterised by the analysis technique
    • G10L 25/30 - using neural networks
    • G10L 25/48 - specially adapted for particular use
    • G10L 25/51 - for comparison or discrimination
    • G10L 25/63 - for estimating an emotional state
    • G10L 25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice data processing method and an intelligent customer service device, wherein the method comprises the following steps: the intelligent customer service device collects, in real time, audio signals sent by user equipment while playing preset voice content to the user equipment; detects, in the audio signal, audio information indicating the user behavior type; interrupts the playing of the preset voice content if the audio information is determined to indicate that the user has a question; and reduces, within a preset time, the volume at which the preset voice content is played if the audio information is determined to indicate that the user is speaking. In this scheme, when the intelligent customer service device detects that the user is speaking while voice content is playing, it reduces the playback volume or interrupts the voice playing according to the behavior type of the user. The speech content of the user is collected and recognized, subsequent services are provided for the user, and the user experience is improved.

Description

Voice data processing method and intelligent customer service device
Technical Field
The invention relates to the technical field of voice data processing, in particular to a voice data processing method and an intelligent customer service device.
Background
With the continuous development of science and technology, artificial intelligence is being applied ever more widely. The intelligent customer service device is a common application of artificial intelligence for serving users.
An intelligent customer service device generally serves users by playing voice, and currently does so as follows: it introduces content such as services and activities to the user, identifies the user's questions from the audio signals sent by the user equipment, and answers those questions. However, current intelligent customer service devices cannot interrupt their own voice playing while introducing services and activities to the user or while answering user questions. In other words, while the device is playing voice, even if the user has a new question or does not want to hear the currently played content, the device still plays the current voice content to the end and only then re-identifies the audio signal of the user, which greatly degrades the user experience.
Disclosure of Invention
In view of this, embodiments of the present invention provide a voice data processing method and an intelligent customer service device, so as to solve problems such as the poor user experience of existing intelligent customer service devices.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
the first aspect of the embodiment of the invention discloses a method for processing voice data, which is suitable for an intelligent customer service device, and comprises the following steps:
the method comprises the steps that in the process that an intelligent customer service device plays preset voice contents to user equipment, the intelligent customer service device collects audio signals sent by the user equipment in real time;
detecting audio information in the audio signal indicating a user behavior type, wherein the user behavior type is that the user has a question or that the user is speaking;
if it is determined that the audio information indicates that the user has a question, interrupting the playing of the preset voice content;
and if it is determined that the audio information indicates that the user is speaking, reducing, within a preset time, the volume at which the preset voice content is played.
Preferably, if the audio information indicates that the user has a question, after the playing of the preset voice content is interrupted, the method further includes:
inquiring the user about the user's question and collecting an audio signal sent by the user equipment;
performing voice recognition and emotion recognition by using the audio signal, and determining speaking content of the user and an emotion tag for indicating the speaking emotion of the user;
and answering the questions of the user, switching to manual customer service or ending the call according to the speaking content and the emotion label.
Preferably, after determining that the audio information is used to indicate that the user is speaking, the method further includes:
after the preset time, if audio information indicating that the user is speaking is still detected in the audio signal, interrupting the playing of the preset voice content;
performing voice recognition and emotion recognition by using the audio signal, and determining speaking content of the user and an emotion tag for indicating the speaking emotion of the user;
and answering the questions of the user, switching to manual customer service or ending the call according to the speaking content and the emotion label.
Preferably, the answering the question of the user, switching to manual customer service, or ending the call according to the speaking content and the emotion label includes:
if the speaking content and/or the emotion label conform to a preset reply rule, asking the user about the question and replying to the question of the user;
if the speaking content and/or the emotion label accord with a preset switching rule, switching to manual customer service for the user;
if the speaking content and/or the emotion label accord with a preset hang-up rule, ending the conversation with the user equipment;
wherein a reply rule, switching rule, or hang-up rule matched according to the speaking content is executed with higher priority than one matched according to the emotion label.
Preferably, the voice recognition and emotion recognition by using the audio signal, and the determination of the speaking content of the user and the emotion label for indicating the speaking emotion of the user, include:
and simultaneously inputting the audio signal into a preset voice recognition model and a preset emotion recognition model for voice recognition and emotion recognition, and determining the speaking content of the user and an emotion label for indicating the speaking emotion of the user, wherein the voice recognition model and the emotion recognition model are obtained by training a neural network model based on audio sample data.
Preferably, before the performing voice recognition and emotion recognition using the audio signal and determining the speaking content of the user and an emotion label indicating the speaking emotion of the user, the method further includes:
and determining the age of the user based on the user information of the user, and selecting an emotion recognition model corresponding to the age, wherein emotion recognition models corresponding to different age groups are preset.
A second aspect of an embodiment of the present invention discloses an intelligent customer service device, including:
the acquisition unit is used for collecting audio signals sent by the user equipment in real time in the process that the intelligent customer service device plays preset voice content to the user equipment;
the determining unit is used for detecting audio information which is used for indicating a user behavior type in the audio signal, wherein the user behavior type is that a user has a question or the user is speaking;
the first interruption unit is used for interrupting the playing of the preset voice content if the audio information is determined to indicate that the user has a question;
and the adjusting unit is used for reducing, within a preset time, the volume at which the preset voice content is played if the audio information is determined to indicate that the user is speaking.
Preferably, the intelligent customer service device further comprises:
a second interruption unit, configured to, after the preset time, interrupt playing of the preset voice content if it is detected that audio information indicating that a user is speaking exists in the audio signal;
the recognition unit is used for carrying out voice recognition and emotion recognition by utilizing the audio signal, and determining speaking content of the user and an emotion label for indicating the speaking emotion of the user;
and the processing unit is used for answering the questions of the user, switching to manual customer service or ending the call according to the speaking content and the emotion label.
Preferably, the determining unit is specifically configured to: and simultaneously inputting the audio signal into a preset voice recognition model and a preset emotion recognition model for voice recognition and emotion recognition, and determining the speaking content of the user and an emotion label for indicating the speaking emotion of the user, wherein the voice recognition model and the emotion recognition model are obtained by training a neural network model based on audio sample data.
Preferably, the processing unit includes:
the reply module is used for asking the user about the question and replying to the question of the user if the speaking content and/or the emotion label conform to a preset reply rule;
the switching module is used for switching to the manual customer service for the user if the speaking content and/or the emotion label accord with a preset switching rule;
and the hang-up module is used for ending the call with the user if the speaking content and/or the emotion label conform to a preset hang-up rule;
wherein a reply rule, switching rule, or hang-up rule matched according to the speaking content is executed with higher priority than one matched according to the emotion label.
In the voice data processing method and intelligent customer service device provided by the above embodiments of the present invention, the method is: the intelligent customer service device collects, in real time, audio signals sent by user equipment while playing preset voice content to the user equipment; detects, in the audio signal, audio information indicating the user behavior type; interrupts the playing of the preset voice content if the audio information is determined to indicate that the user has a question; and reduces, within a preset time, the volume at which the preset voice content is played if the audio information is determined to indicate that the user is speaking. In this scheme, when the intelligent customer service device detects that the user is speaking while voice content is playing, it reduces the playback volume or interrupts the voice playing according to the behavior type of the user. The speech content of the user is collected and recognized, follow-up services such as answering questions, switching to manual customer service, or ending the call are provided for the user, and the user experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a voice data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a voice data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another voice data processing method according to an embodiment of the present invention;
Fig. 4 is a block diagram of an intelligent customer service device according to an embodiment of the present invention;
Fig. 5 is a block diagram of another intelligent customer service device according to an embodiment of the present invention;
Fig. 6 is a block diagram of another intelligent customer service device according to an embodiment of the present invention;
Fig. 7 is a block diagram of another intelligent customer service device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As can be seen from the background, current intelligent customer service devices cannot interrupt their own voice playing while introducing services and activities to the user or while answering user questions. While the device is playing voice, even if the user has a new question or does not want to hear the currently played content, the device still plays the current voice content in full and only then re-identifies the audio signal of the user, which greatly degrades the user experience.
Therefore, embodiments of the present invention provide a voice data processing method and an intelligent customer service device: during voice playing, when the user is detected to be speaking, the intelligent customer service device reduces the playback volume or interrupts the voice playing. The speech content of the user is collected and recognized, and subsequent services are provided for the user, improving the user experience.
Referring to fig. 1, a flowchart of a method for processing voice data, which is applicable to an intelligent customer service device and provided by an embodiment of the present invention, is shown, where the method includes the following steps:
step S101: the method comprises the steps that in the process that the intelligent customer service device plays preset voice contents to user equipment, the intelligent customer service device collects audio signals sent by the user equipment in real time.
In the specific implementation of step S101, when the user communicates with the intelligent customer service device through user equipment, the intelligent customer service device may play preset voice content to the user equipment. For example, a bank's intelligent customer service device, when communicating with a customer, introduces related products released by the bank by playing voice content. While playing the voice content, the intelligent customer service device collects the audio signal sent by the user equipment in real time.
Step S102: and the intelligent customer service device detects audio information which is used for indicating the behavior type of the user in the audio signal.
It should be noted that, when a user communicates with the intelligent customer service device through user equipment, the user equipment collects an audio signal and sends it to the intelligent customer service device. The user behavior type is determined according to the audio information in the audio signal, the user behavior type being that the user has a question or that the user is speaking.
In the specific implementation of step S102, voice activity detection (VAD) and tone judgment are performed on the audio signal simultaneously to determine whether the user is speaking or has a question. The tone of the audio signal is judged using a preset tone judgment model.
If it is determined that the audio information indicates that the user has a question, the playing of the preset voice content is interrupted, the user is asked about the question, and the audio signal sent by the user equipment is collected. For example: when interrogative tone words (interjections spoken with a questioning intonation, such as "Oh?") are detected in the user's audio signal, the currently played voice content is interrupted, the user is asked whether help is needed, and the audio signal collected by the user equipment after the inquiry is gathered. Voice recognition and emotion recognition are then performed using the audio signal, and the speaking content of the user and an emotion label indicating the speaking emotion of the user are determined. According to the speaking content and the emotion label, the question of the user is answered, the call is switched to manual customer service, or the call is ended.
If the VAD determines that the user is speaking, then, to further confirm this, the audio signal is used as the input of a preset VAD model to determine whether audio information indicating that the user is speaking exists in the audio signal; if the VAD model determines that such audio information exists, it is finally determined that the user is speaking.
It should be noted that the VAD model is obtained in advance by training a neural network model on audio sample data, and the tone judgment model is obtained in advance by training a neural network model on tone-word sample data.
Preferably, the VAD algorithm and the VAD model are used simultaneously to detect audio information in the audio signal indicating that the user is speaking, and only when both the VAD algorithm and the VAD model determine that such audio information exists is it finally determined that the user is speaking.
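As a minimal sketch of this two-stage check, the code below pairs a classic frame-energy VAD decision with a stand-in for the trained neural VAD model; the frame size, energy threshold, and single-logistic-unit "model" are illustrative assumptions, not the patent's actual implementation.

```python
import numpy as np

FRAME_SIZE = 320          # 20 ms frames at an assumed 16 kHz sample rate
ENERGY_THRESHOLD = 1e-3   # illustrative energy gate; tuned on real audio in practice

def energy_vad(frame: np.ndarray) -> bool:
    """Stand-in for the 'VAD algorithm': a simple frame-energy decision."""
    return float(np.mean(frame ** 2)) > ENERGY_THRESHOLD

class NeuralVAD:
    """Stand-in for the neural network VAD model trained on audio sample data.
    Here a single logistic unit over the raw frame; a real model would be a
    trained network with learned weights."""

    def __init__(self, weights: np.ndarray, bias: float = 0.0):
        self.weights = weights
        self.bias = bias

    def is_speech(self, frame: np.ndarray) -> bool:
        score = 1.0 / (1.0 + np.exp(-(float(frame @ self.weights) + self.bias)))
        return score > 0.5

def user_is_speaking(frame: np.ndarray, model: NeuralVAD) -> bool:
    """Finally determine that the user is speaking only when both the VAD
    algorithm and the VAD model agree, as in the preferred scheme above."""
    return energy_vad(frame) and model.is_speech(frame)
```

The conjunction mirrors the preferred scheme: the cheap energy gate filters out silence quickly, and the learned model confirms the decision, reducing false positives from background noise.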
Step S103: if it is determined that the audio information indicates that the user has a question, the playing of the preset voice content is interrupted.
Step S104: if it is determined that the audio information indicates that the user is speaking, the volume at which the preset voice content is played is reduced within a preset time.
It should be noted that, when the intelligent customer service device plays voice, the playback volume is usually high to ensure that the user hears the played content clearly. When the user has a question and needs to ask it, if the intelligent customer service device keeps playing the voice at high volume, the user experience is seriously affected.
In the specific implementation of step S104, when the intelligent customer service device determines that the user is speaking, it reduces the voice playback volume within a preset time to preserve the user experience.
Preferably, after the preset time, if audio information indicating that the user is speaking is still detected in the audio signal, the playing of the preset voice content is interrupted.
In a specific implementation, after the preset time, if audio information indicating that the user is speaking is still detected in the audio signal (that is, the user is still speaking after the volume has been reduced for the preset time), the playing of the preset voice content is interrupted.
Preferably, when it is determined that the user is speaking, the way in which the intelligent customer service device adjusts the voice playing includes, but is not limited to, the following three cases (a sketch combining them follows the list):
the first condition is as follows: and the intelligent customer service device reduces the volume of playing the preset voice content when the user speaks. Namely, when the user speaks, the intelligent voice reduces the volume of the played voice and keeps the volume of the played voice at a preset value in the whole process.
Case two: and the intelligent customer service device interrupts the preset voice content which is played. Namely, when the user speaks, the preset voice content being played is interrupted.
Case three: the volume at which the preset voice content is played is reduced within the preset time, and if the user has not stopped speaking after the preset time, the playing of the preset voice content is interrupted. For example: within 1 second of detecting that the user is speaking, the playback volume is first reduced; if the user is still speaking after 1 second, the playing of the current voice content is interrupted.
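A sketch combining the three cases, assuming a hypothetical `player` handle that exposes `set_volume()` and `stop()`; the volume levels and the 1-second preset time are the illustrative figures from the example above.

```python
import time

class PlaybackController:
    """Illustrative controller for the three cases above. `player` is a
    hypothetical playback handle with set_volume(level) and stop()."""

    def __init__(self, player, normal_volume=1.0, reduced_volume=0.3, preset_time=1.0):
        self.player = player
        self.normal_volume = normal_volume
        self.reduced_volume = reduced_volume  # volume kept at a preset value (case one)
        self.preset_time = preset_time        # e.g. 1 second, as in the example
        self._reduced_since = None            # when the volume was first reduced

    def on_user_speaking(self):
        now = time.monotonic()
        if self._reduced_since is None:
            # Cases one and three: first reduce the playback volume.
            self.player.set_volume(self.reduced_volume)
            self._reduced_since = now
        elif now - self._reduced_since >= self.preset_time:
            # Case three: the user is still speaking after the preset time,
            # so interrupt the playing of the preset voice content.
            self.player.stop()

    def on_user_question(self):
        # Case two: a question is detected, interrupt playback immediately.
        self.player.stop()

    def on_user_silent(self):
        # The user stopped speaking before the preset time: restore the volume.
        self.player.set_volume(self.normal_volume)
        self._reduced_since = None
```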
Preferably, after step S104 is executed, voice recognition and emotion recognition are performed using the audio signal, and the speaking content of the user and an emotion label indicating the speaking emotion of the user are determined. According to the speaking content and the emotion label, the question of the user is answered, the call is switched to manual customer service, or the call is ended.
In a further implementation, when the speaking content and the emotion label conform to a preset push rule, interface operation information including an interface operation website is pushed to the user equipment. For example: when the collected audio signal shows that the user is impatient or complains that the voice operation is slow, the operation the user wants to perform is determined according to the audio information. If the user equipment has a built-in operation interface, for example an app with an operation interface or a dedicated counter terminal, the operation interface is pushed to the user equipment directly. If the user equipment has no built-in operation interface, the website of the operation interface is pushed to the user equipment, and when the user clicks the link, the user equipment switches to the corresponding operation interface.
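A minimal sketch of this push rule, assuming a hypothetical `user_device` handle with a `has_builtin_interface` flag and `show_interface()`/`send_link()` methods:

```python
def push_interface_operation(user_device, operation_url: str) -> None:
    """Push interface operation information per the rule above. The device
    handle and its attributes are assumptions for illustration."""
    if user_device.has_builtin_interface:
        # Built-in operation interface (e.g. an app or a counter terminal):
        # push the operation interface directly.
        user_device.show_interface(operation_url)
    else:
        # No built-in interface: push the website; clicking it switches the
        # device to the corresponding operation interface.
        user_device.send_link(operation_url)
```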
In the voice recognition and emotion recognition process, the intelligent customer service device uploads the audio signal to a cloud server, where a voice recognition model and an emotion recognition model preset in the cloud server perform voice recognition and emotion recognition on the audio signal simultaneously, determining the speaking content of the user and an emotion label indicating the speaking emotion of the user. The voice recognition model and the emotion recognition model are obtained by training neural network models on audio sample data.
It should be noted that, because people of different ages express emotions differently, emotion recognition models corresponding to different age groups are preset. For example, emotion recognition models are set separately for six categories of users: young male, young female, middle-aged male, middle-aged female, elderly male, and elderly female. Before emotion recognition is performed on the audio signal of the user, the age of the user is determined according to the user information, and the emotion recognition model corresponding to that age is selected for emotion recognition.
Further, it should be noted that the division of emotion recognition models includes, but is not limited to, the above six categories.
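The sketch below illustrates selecting the age-appropriate emotion recognition model and then running speech recognition and emotion recognition on the same audio signal simultaneously; the age cut-offs, the (age band, gender) registry keys, and the `predict()` interface are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

class DummyModel:
    """Placeholder for a trained model exposing predict(audio) -> result."""
    def __init__(self, name):
        self.name = name
    def predict(self, audio):
        return f"{self.name}-result"

# One emotion model per (age band, gender) pair, mirroring the six categories above.
EMOTION_MODELS = {
    (band, gender): DummyModel(f"emotion-{band}-{gender}")
    for band in ("young", "middle-aged", "elderly")
    for gender in ("male", "female")
}

def select_emotion_model(age: int, gender: str) -> DummyModel:
    # Assumed age cut-offs, for illustration only.
    band = "young" if age < 30 else "middle-aged" if age < 60 else "elderly"
    return EMOTION_MODELS[(band, gender)]

def recognize(audio, asr_model, emotion_model):
    """Run the speech recognition model and the emotion recognition model on
    the same audio signal at the same time, returning
    (speaking content, emotion label)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text = pool.submit(asr_model.predict, audio)
        emotion = pool.submit(emotion_model.predict, audio)
        return text.result(), emotion.result()

# Example: pick the model for a 45-year-old female user, then recognize.
content, label = recognize(b"...", DummyModel("asr"), select_emotion_model(45, "female"))
```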
In a further implementation, the intelligent customer service device performs a corresponding operation according to the speaking content and the emotion label, including but not limited to: answering the question of the user, switching to manual customer service, or ending the call. The details are as follows:
and if the speaking content and/or the emotion label accord with a preset reply rule, inquiring the user about the user question and replying the user question. For example: when detecting that the user says "what you are saying" in a flat mood, the smart customer service device asks the user what questions are not clear, and replies to the questions asked by the user.
If the speaking content and/or the emotion label conform to a preset switching rule, the call is switched to manual customer service for the user. For example: when it is detected that the user says "I need to be switched to manual customer service", the call is switched to manual customer service for the user.
If the speaking content and/or the emotion label conform to a preset hang-up rule, the call with the user is ended. For example: when it is detected that the user says "I am not interested" in an angry tone, the intelligent customer service device ends the call with the user.
It should be noted that different types of emotion labels are preset, and when emotion recognition is performed on the audio signal, the emotion label indicating the speaking emotion of the user is determined from the audio signal.
It should be further noted that a reply rule, switching rule, or hang-up rule matched according to the speaking content is executed with higher priority than one matched according to the emotion label. For example: if the speaking content conforms to a reply rule and the emotion label conforms to a switching rule, the reply rule is executed, that is, the user is asked about the question and the question of the user is replied to. As another example: if the speaking content conforms to a hang-up rule and the emotion label conforms to a switching rule, the hang-up rule is executed, that is, the call with the user is ended.
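A sketch of this dispatch, under the assumption that each rule is a pair of content keywords and emotion labels; any match on the speaking content outranks any match on the emotion label, per the priority note above.

```python
def decide_action(speaking_content: str, emotion_label: str, rules: dict) -> str:
    """rules maps an action ('reply', 'switch', 'hangup') to a pair
    (content_keywords, emotion_labels). Content matches take priority
    over emotion-label matches."""
    # First pass: rules matched on the speaking content (higher priority).
    for action, (keywords, _) in rules.items():
        if any(k in speaking_content for k in keywords):
            return action
    # Second pass: fall back to rules matched on the emotion label.
    for action, (_, labels) in rules.items():
        if emotion_label in labels:
            return action
    return "reply"  # assumed default: keep serving the user

rules = {
    "reply": (["what are you saying"], ["calm"]),
    "switch": (["manual customer service"], []),
    "hangup": (["not interested"], ["angry"]),
}
# The content ("not interested") matches the hang-up rule, so the hang-up
# rule is executed even though the emotion label also matches a rule.
print(decide_action("i am not interested", "angry", rules))  # -> "hangup"
```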
In the embodiment of the present invention, during the playing of voice content, when the intelligent customer service device detects that the user is speaking, it detects the audio information in the audio signal indicating the user behavior type. According to the behavior type of the user, it either interrupts the playing of the preset voice content, asks the user about the need, and collects the audio signal, or reduces the playback volume within a preset time and interrupts the playing of the preset voice content if the user is still speaking after the preset time. From the collected audio signal of the user, the speaking content and speaking emotion of the user are recognized, and the corresponding operation is performed according to them, improving the user experience.
To better explain the steps shown in fig. 1, the voice data processing methods shown in fig. 2 and fig. 3 are described by way of example.
Referring to fig. 2, a schematic flow chart of a processing method of voice data provided by an embodiment of the present invention is shown, including the following steps:
step S201: the intelligent customer service device collects audio signals of the user side.
Step S202: based on the collected audio signal, the intelligent customer service device detects whether the user is speaking using the VAD algorithm; if so, step S203 is executed; if not, the process returns to step S201.
Step S203: based on the collected audio signal, the intelligent customer service device further confirms whether the user is speaking using the neural network VAD model; if so, it interrupts voice playing or reduces the playback volume and executes step S204; otherwise, the process returns to step S201.
Step S204: the intelligent customer service device performs voice recognition and emotion recognition on the audio signal and, according to the voice recognition result and the emotion recognition result, switches to manual customer service, answers the question, or hangs up.
Referring to fig. 3, a flow chart of a processing method of voice data provided by the embodiment of the invention is shown, which includes the following steps:
step S301: the intelligent customer service device collects audio signals of the user side.
Step S302: the intelligent customer service device uses the VAD algorithm and the neural network VAD model simultaneously to determine whether the user is speaking. If both the VAD algorithm and the neural network VAD model determine that the user is speaking, voice playing is interrupted or the playback volume is reduced, and step S303 is executed. If the VAD algorithm and/or the neural network VAD model determine that the user is not speaking, the process returns to step S301.
Step S303: the intelligent customer service device performs voice recognition and emotion recognition on the audio signal and, according to the voice recognition result and the emotion recognition result, switches to manual customer service, answers the question, or hangs up.
It should be noted that, for the execution principle of each step in fig. 2 and fig. 3, reference may be made to the content corresponding to each step in fig. 1 in the embodiment of the present invention, and details are not repeated here.
In the embodiment of the present invention, during the playing of voice content, when the intelligent customer service device detects that the user is speaking, the volume at which the preset voice content is played is reduced within a preset time, and if the user is still speaking after the preset time, the playing of the preset voice content is interrupted. From the collected audio signal of the user, the speaking content and speaking emotion of the user are recognized, and the corresponding operation is performed according to them, improving the user experience.
Corresponding to the voice data processing method provided in the foregoing embodiments of the present invention, referring to fig. 4, an embodiment of the present invention further provides a structural block diagram of an intelligent customer service device, which includes: an acquisition unit 401, a determining unit 402, a first interrupting unit 403, and an adjusting unit 404;
the acquisition unit 401 is configured to collect, in real time, an audio signal sent by the user equipment when the intelligent customer service device plays the preset voice content to the user equipment.
A determining unit 402, configured to detect audio information in the audio signal, where the audio information is used to indicate a user behavior type, where the user behavior type is that a user has a question or that the user is speaking. The process of determining the user behavior type is described in the above embodiment of the present invention, and reference is made to the corresponding contents in step S102 in fig. 1.
A first interrupting unit 403, configured to interrupt playing of the preset voice content if it is determined that the audio information is used to indicate that the user has a question.
An adjusting unit 404, configured to reduce the volume at which the preset voice content is played within a preset time if it is determined that the audio information indicates that the user is speaking.
In the embodiment of the present invention, during the playing of voice content, when the intelligent customer service device detects that the user is speaking, it detects the audio information in the audio signal indicating the user behavior type. According to the behavior type of the user, the playing of the preset voice content is interrupted, or the playback volume is reduced within a preset time and the playing of the preset voice content is interrupted if the user is still speaking after the preset time, improving the user experience.
Preferably, referring to fig. 5 in conjunction with fig. 4, which shows a structural block diagram of an intelligent customer service device provided in an embodiment of the present invention, the intelligent customer service device further includes:
a second interruption unit 405, configured to, after the preset time, interrupt playing of the preset voice content if it is detected that audio information indicating that the user is speaking exists in the audio signal.
In a specific implementation, after the adjusting unit 404 reduces the volume of playing the preset voice content within a preset time, the second interrupting unit 405 is executed.
A recognition unit 406, configured to perform speech recognition and emotion recognition by using the audio signal, and determine speaking content of the user and an emotion tag indicating a speaking emotion of the user.
Preferably, in a specific implementation, the identifying unit 406 is further configured to determine an age of the user based on the user information of the user, and select an emotion recognition model corresponding to the age, where emotion recognition models corresponding to different age groups are preset.
And the processing unit 407 is configured to answer the question of the user, switch to manual customer service, or end the call according to the speaking content and the emotion label.
Correspondingly, the determining unit 402 is specifically configured to: and simultaneously inputting the audio signal into a preset voice recognition model and a preset emotion recognition model for voice recognition and emotion recognition, and determining the speaking content of the user and an emotion label for indicating the speaking emotion of the user, wherein the voice recognition model and the emotion recognition model are obtained by training a neural network model based on audio sample data.
In the embodiment of the present invention, the speaking content and speaking emotion of the user are recognized from the collected audio signal of the user, and the corresponding operation is subsequently performed according to the speaking content and speaking emotion, improving the user experience.
Preferably, referring to fig. 6 in combination with fig. 5, which shows a structural block diagram of an intelligent customer service device provided in an embodiment of the present invention, after the first interrupting unit 403 is executed, the intelligent customer service device further includes:
an inquiry unit 408, configured to inquire the user about the question of the user, and collect an audio signal sent by the user equipment. The recognition unit 406 and the processing unit 407 are executed.
Preferably, referring to fig. 7 in conjunction with fig. 5, which shows a structural block diagram of an intelligent customer service device provided in an embodiment of the present invention, the processing unit 407 includes:
a replying module 4071, configured to ask the user a question and reply to the question if the speaking content and/or the emotion tag meet a preset replying rule.
A switching module 4072, configured to switch to an artificial customer service for the user if the speaking content and/or the emotion tag meet a preset switching rule.
A hang-up module 4073, configured to end the call with the user if the speaking content and/or the emotion label conform to a preset hang-up rule.
Wherein a reply rule, switching rule, or hang-up rule matched according to the speaking content is executed with higher priority than one matched according to the emotion label.
In the embodiment of the present invention, the intelligent customer service device collects and recognizes the speaking content and speaking emotion of the user, and provides follow-up services such as answering questions, switching to manual customer service, or ending the call according to the speaking content and speaking emotion, improving the user experience.
In summary, embodiments of the present invention provide a voice data processing method and an intelligent customer service device, the method comprising: the intelligent customer service device collects, in real time, audio signals sent by user equipment while playing preset voice content to the user equipment; detects, in the audio signal, audio information indicating the user behavior type; interrupts the playing of the preset voice content if the audio information is determined to indicate that the user has a question; and reduces, within a preset time, the volume at which the preset voice content is played if the audio information is determined to indicate that the user is speaking. In this scheme, when the intelligent customer service device detects that the user is speaking during voice playing, it reduces the playback volume or interrupts the voice playing according to the behavior type of the user. The speech content of the user is collected and recognized, follow-up services such as answering questions, switching to manual customer service, or ending the call are provided for the user, and the user experience is improved.
The embodiments in this specification are described in a progressive manner; the embodiments may refer to one another for identical or similar parts, and each embodiment focuses on its differences from the others. In particular, the device embodiments are substantially similar to the method embodiments and are therefore described relatively simply; for relevant points, refer to the description of the method embodiments. The device embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for processing voice data is suitable for an intelligent customer service device, and comprises the following steps:
the method comprises the steps that in the process that an intelligent customer service device plays preset voice contents to user equipment, the intelligent customer service device collects audio signals sent by the user equipment in real time;
detecting audio information in the audio signal indicating a user behavior type, wherein the user behavior type is that the user has a question or that the user is speaking;
if it is determined that the audio information indicates that the user has a question, interrupting the playing of the preset voice content;
and if it is determined that the audio information indicates that the user is speaking, reducing, within a preset time, the volume at which the preset voice content is played.
2. The method according to claim 1, wherein if the audio information indicates that the user has a question, after the playing of the preset voice content is interrupted, the method further comprises:
inquiring the user about the user's question and collecting an audio signal sent by the user equipment;
performing voice recognition and emotion recognition by using the audio signal, and determining speaking content of the user and an emotion tag for indicating the speaking emotion of the user;
and answering the questions of the user, switching to manual customer service or ending the call according to the speaking content and the emotion label.
3. The method of claim 1, wherein after determining that the audio information indicates that the user is speaking, further comprising:
after the preset time, if audio information indicating that the user is speaking is still detected in the audio signal, interrupting the playing of the preset voice content;
performing voice recognition and emotion recognition by using the audio signal, and determining speaking content of the user and an emotion tag for indicating the speaking emotion of the user;
and answering the questions of the user, switching to manual customer service or ending the call according to the speaking content and the emotion label.
4. The method of claim 2 or 3, wherein the answering the question of the user, switching to manual customer service, or ending the call according to the speaking content and the emotion label comprises:
if the speaking content and/or the emotion label conform to a preset reply rule, asking the user about the question and replying to the question of the user;
if the speaking content and/or the emotion label accord with a preset switching rule, switching to manual customer service for the user;
if the speaking content and/or the emotion label accord with a preset hang-up rule, ending the conversation with the user equipment;
wherein a reply rule, switching rule, or hang-up rule matched according to the speaking content is executed with higher priority than one matched according to the emotion label.
5. The method of claim 2 or 3, wherein the performing voice recognition and emotion recognition using the audio signal and determining the speaking content of the user and an emotion label indicating the speaking emotion of the user comprises:
and simultaneously inputting the audio signal into a preset voice recognition model and a preset emotion recognition model for voice recognition and emotion recognition, and determining the speaking content of the user and an emotion label for indicating the speaking emotion of the user, wherein the voice recognition model and the emotion recognition model are obtained by training a neural network model based on audio sample data.
6. The method of claim 5, wherein before the performing voice recognition and emotion recognition using the audio signal and determining the speaking content of the user and an emotion label indicating the speaking emotion of the user, the method further comprises:
and determining the age of the user based on the user information of the user, and selecting an emotion recognition model corresponding to the age, wherein emotion recognition models corresponding to different age groups are preset.
7. An intelligent customer service device, comprising:
the acquisition unit is used for collecting audio signals sent by the user equipment in real time in the process that the intelligent customer service device plays preset voice content to the user equipment;
the determining unit is used for detecting audio information which is used for indicating a user behavior type in the audio signal, wherein the user behavior type is that a user has a question or the user is speaking;
the first interruption unit is used for interrupting the playing of the preset voice content if the audio information is determined to indicate that the user has a question;
and the adjusting unit is used for reducing, within a preset time, the volume at which the preset voice content is played if the audio information is determined to indicate that the user is speaking.
8. The intelligent customer service device of claim 7 further comprising:
a second interruption unit, configured to, after the preset time, interrupt playing of the preset voice content if it is detected that audio information indicating that a user is speaking exists in the audio signal;
the recognition unit is used for carrying out voice recognition and emotion recognition by utilizing the audio signal, and determining speaking content of the user and an emotion label for indicating the speaking emotion of the user;
and the processing unit is used for answering the questions of the user, switching to manual customer service or ending the call according to the speaking content and the emotion label.
9. The intelligent customer service device according to claim 8, wherein the determining unit is specifically configured to: and simultaneously inputting the audio signal into a preset voice recognition model and a preset emotion recognition model for voice recognition and emotion recognition, and determining the speaking content of the user and an emotion label for indicating the speaking emotion of the user, wherein the voice recognition model and the emotion recognition model are obtained by training a neural network model based on audio sample data.
10. The intelligent customer service device of claim 8 wherein the processing unit comprises:
the reply module is used for asking the user about the question and replying to the question of the user if the speaking content and/or the emotion label conform to a preset reply rule;
the switching module is used for switching to the manual customer service for the user if the speaking content and/or the emotion label accord with a preset switching rule;
the hang-up module is used for ending the conversation with the user if the speaking content and/or the emotion label accord with a preset hang-up rule;
wherein a reply rule, switching rule, or hang-up rule matched according to the speaking content is executed with higher priority than one matched according to the emotion label.
CN201910650265.9A 2019-07-18 2019-07-18 Voice data processing method and intelligent customer service device Pending CN112242135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910650265.9A CN112242135A (en) 2019-07-18 2019-07-18 Voice data processing method and intelligent customer service device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910650265.9A CN112242135A (en) 2019-07-18 2019-07-18 Voice data processing method and intelligent customer service device

Publications (1)

Publication Number Publication Date
CN112242135A true CN112242135A (en) 2021-01-19

Family

ID=74168179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910650265.9A Pending CN112242135A (en) 2019-07-18 2019-07-18 Voice data processing method and intelligent customer service device

Country Status (1)

Country Link
CN (1) CN112242135A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096644A (en) * 2021-03-31 2021-07-09 闽江学院 Telephone voice processing system
CN113488024A (en) * 2021-05-31 2021-10-08 杭州摸象大数据科技有限公司 Semantic recognition-based telephone interruption recognition method and system
WO2023065633A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Abnormal semantic truncation detection method and apparatus, and device and medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100463706B1 (en) * 2004-04-27 2004-12-29 주식회사 엠포컴 A system and a method for analyzing human emotion based on voice recognition through wire or wireless network
US6882973B1 (en) * 1999-11-27 2005-04-19 International Business Machines Corporation Speech recognition system with barge-in capability
CN103269405A (en) * 2013-05-23 2013-08-28 深圳市中兴移动通信有限公司 Method and device for hinting friendlily
CN203912042U (en) * 2014-06-12 2014-10-29 国家电网公司 Automatic tone tuning customer service telephone
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
CN105100356A (en) * 2015-07-07 2015-11-25 上海斐讯数据通信技术有限公司 Automatic volume adjustment method and system
CN105895101A (en) * 2016-06-08 2016-08-24 国网上海市电力公司 Speech processing equipment and processing method for power intelligent auxiliary service system
CN107580272A (en) * 2017-07-17 2018-01-12 成都华科威电子科技有限公司 A kind of vehicle audio broadcast sound volume Automatic adjustment method
CN107657017A (en) * 2017-09-26 2018-02-02 百度在线网络技术(北京)有限公司 Method and apparatus for providing voice service
CN108900726A (en) * 2018-06-28 2018-11-27 北京首汽智行科技有限公司 Artificial customer service forwarding method based on speech robot people
CN108961887A (en) * 2018-07-24 2018-12-07 广东小天才科技有限公司 Voice search control method and family education equipment
CN109040449A (en) * 2018-08-06 2018-12-18 维沃移动通信有限公司 A kind of volume adjusting method and terminal device
CN109509471A (en) * 2018-12-28 2019-03-22 浙江百应科技有限公司 A method of the dialogue of intelligent sound robot is interrupted based on vad algorithm
CN109767791A (en) * 2019-03-21 2019-05-17 中国—东盟信息港股份有限公司 A kind of voice mood identification and application system conversed for call center
CN110021308A (en) * 2019-05-16 2019-07-16 北京百度网讯科技有限公司 Voice mood recognition methods, device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN112365894B (en) AI-based composite voice interaction method and device and computer equipment
CN112242135A (en) Voice data processing method and intelligent customer service device
CN108874895B (en) Interactive information pushing method and device, computer equipment and storage medium
CN109065052B (en) Voice robot
WO2016194740A1 (en) Speech recognition device, speech recognition system, terminal used in said speech recognition system, and method for generating speaker identification model
CN112313930B (en) Method and apparatus for managing maintenance
CN110705309B (en) Service quality evaluation method and system
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
CN115083434B (en) Emotion recognition method and device, computer equipment and storage medium
CN108074571A (en) Sound control method, system and the storage medium of augmented reality equipment
CN110335596A (en) Products Show method, apparatus, equipment and storage medium based on speech recognition
CN114297365B (en) Intelligent customer service system and method based on Internet
CN113840040B (en) Man-machine cooperation outbound method, device, equipment and storage medium
CN111768781A (en) Voice interruption processing method and device
CN109271503A (en) Intelligent answer method, apparatus, equipment and storage medium
CN113505272A (en) Behavior habit based control method and device, electronic equipment and storage medium
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN117253478A (en) Voice interaction method and related device
CN110489519B (en) Session method based on session prediction model and related products
CN111510563A (en) Intelligent outbound method and device, storage medium and electronic equipment
CN110086941A (en) Speech playing method, device and terminal device
CN112087726B (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
CN110047473B (en) Man-machine cooperative interaction method and system
CN114067842B (en) Customer satisfaction degree identification method and device, storage medium and electronic equipment
CN110765242A (en) Method, device and system for providing customer service information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination