CN112201277A - Voice response method, device and equipment and computer readable storage medium - Google Patents


Info

Publication number: CN112201277A (application CN202011052933.7A; granted publication CN112201277B)
Authority: CN (China)
Prior art keywords: voice, user, intonation, response, content
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 申亚坤
Original and current assignee: Bank of China Ltd (the listed assignees may be inaccurate)
Events: application filed by Bank of China Ltd; priority to CN202011052933.7A; publication of CN112201277A; application granted; publication of CN112201277B

Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L15/063: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search


Abstract

The application provides a voice response method, apparatus, device, and computer-readable storage medium. The method comprises: acquiring a user voice; determining the intonation type corresponding to the user voice from the voice features and voice content of the user voice; generating a response voice corresponding to the user voice based on that intonation type and the voice content; and finally broadcasting the response voice. Because the broadcast response voice is derived from both the intonation type and the content of the user voice, the response differs whenever the intonation type differs, achieving a personalized response to the user voice and thereby improving the user experience. In addition, because the intonation type is determined from two dimensions, the voice features and the voice content of the user voice, it is determined with higher accuracy, which in turn improves the accuracy of the broadcast response voice.

Description

Voice response method, device and equipment and computer readable storage medium
Technical Field
The present application relates to the field of voice processing, and in particular, to a method and an apparatus for voice response, an electronic device, and a computer-readable storage medium.
Background
In many service scenarios, intelligent voice response devices are deployed for voice interaction with users. However, many current devices respond in a single mode, for example with one uniform intonation, so they cannot personalize their responses to different user voices and cannot improve the user's service experience.
Disclosure of Invention
The application provides a voice response method and apparatus, an electronic device, and a computer-readable storage medium, aiming to enable a voice response device to respond to a user voice in a personalized way.
In order to achieve the above object, the present application provides the following technical solutions:
a method of voice response comprising:
acquiring user voice;
determining a tone type corresponding to the user voice according to the voice characteristics and the voice content of the user voice;
generating response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and broadcasting the response voice.
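As a minimal illustration only, the four steps above can be sketched in Python; every helper, threshold, type name, and reply string here is a hypothetical stand-in, not the patent's implementation:

```python
# Hypothetical end-to-end sketch of the four steps: acquire, classify,
# generate, broadcast. All values below are invented for illustration.

def classify_intonation(features, content):
    # Determine the intonation type from voice features and content (stub rule).
    return "cheerful" if features.get("pitch_var", 0.0) > 0.5 else "mild"

def generate_response(intonation, content):
    # The response depends on both the intonation type and the content.
    text = "Happy to help!" if intonation == "cheerful" else "How may I assist you?"
    return {"intonation": intonation, "text": text}

def voice_response(features, content):
    intonation = classify_intonation(features, content)  # determine intonation type
    response = generate_response(intonation, content)    # generate response voice
    return response  # the final step would broadcast this response voice

print(voice_response({"pitch_var": 0.8}, "hello"))
```

Different intonation types yield different responses even for the same content, which is the personalization the claim describes.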
Optionally, in the foregoing method, the intonation types include at least two specified intonation types, any one of which is preset according to the voice features and voice content of historical user voices;
the voice features include at least a pitch feature and an amplitude feature.
Optionally, the determining, according to the voice feature of the user voice and the voice content, a tone type corresponding to the user voice includes:
inputting the user voice into a pre-trained Bayes classification model, and enabling the Bayes classification model to determine the intonation type corresponding to the user voice according to the voice characteristics of the user voice;
recognizing and obtaining the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained voice classification model; enabling the voice classification model to determine the intonation type corresponding to the user voice according to the voice content of the user voice;
respectively acquiring the intonation types corresponding to the user voice output by the Bayesian classification model and the voice classification model;
and if the intonation type output by the Bayes classification model and the intonation type output by the voice classification model are the same intonation type, taking the same intonation type as the intonation type corresponding to the voice of the user.
The above method, optionally, further includes:
and if the intonation types output by the Bayes classification model and the intonation types output by the voice classification model are different intonation types, determining the intonation type corresponding to the user voice as a preset default intonation type.
Optionally, in the method, the bayesian classification model is obtained by training according to a voice training sample, where the voice training sample carries the voice feature;
the Bayesian classification model determines the intonation type corresponding to the user voice as follows: and the Bayesian classification model calculates the probability that the user voice belongs to each intonation type respectively according to the voice characteristics of the user voice, and determines the intonation type corresponding to the maximum probability value as the intonation type corresponding to the user voice.
Optionally, in the foregoing method, the speech classification model is a GA-BP neural network model, and the GA-BP neural network model is a model obtained by optimizing an initial BP neural network model;
the number of input layer nodes of the initial BP neural network model is determined according to the voice content length of a voice training sample, the number of output layer nodes is determined according to the intonation type, and the number of hidden layer nodes is determined based on a trial and error method;
the optimization of the initial BP neural network model is as follows: and training and learning the initial weight and the threshold of each layer of the input layer, the hidden layer and the output layer of the initial BP neural network model according to preset sample data and a genetic algorithm, and determining the optimal initial weight and threshold of each layer to obtain the optimized BP neural network model.
Optionally, in the method, the generating, based on the intonation type corresponding to the user voice and the voice content, a response voice corresponding to the user voice includes:
determining responsive voice content based on the voice content;
and generating the response voice with the voice content being the response voice content and the tone type being the tone type corresponding to the user voice.
An apparatus for voice response, comprising:
an acquisition unit configured to acquire a user voice;
the determining unit is used for determining the tone type corresponding to the user voice according to the voice characteristics and the voice content of the user voice;
a generating unit, configured to generate a response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and the broadcasting unit is used for broadcasting the response voice.
A voice response apparatus comprising: a processor and a memory for storing a program; the processor is used for running the program to realize the voice response method.
A computer-readable storage medium having stored thereon instructions which, when executed on a computer, cause the computer to perform the above-described method of voice response.
The method and apparatus of the application comprise: acquiring a user voice, determining the intonation type corresponding to the user voice from its voice features and voice content, generating a response voice based on that intonation type and the voice content, and finally broadcasting the response voice. Because the broadcast response voice is derived from both the intonation type and the content of the user voice, the response differs whenever the intonation type differs, achieving a personalized response to the user voice and thereby improving the user experience.
In addition, because the intonation type is determined from two dimensions, the voice features and the voice content of the user voice, it is determined with higher accuracy, which in turn improves the accuracy of the broadcast response voice.
Drawings
To illustrate the embodiments of the present application or the prior-art solutions more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for voice response provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for determining a type of intonation corresponding to a user's voice according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a voice response apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a voice response apparatus according to an embodiment of the present application.
Detailed Description
In many settings, intelligent voice broadcasting devices are used for voice interaction with users. However, many current intelligent voice response devices attend only to the content of the user voice and not to the intonation in which it is spoken, so they generally respond in one uniform intonation. As a result, they cannot personalize their responses to different user voices and cannot improve the user's service experience.
Therefore, the embodiment of the present application provides a voice response method, which aims to respond to a user by combining a user voice and a voice content of the user voice, so as to implement a personalized response according to different user voices.
In the present application, the speech content of the user speech refers to the speech text content corresponding to the user speech.
To make the above objects, features, and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
The execution subject of this embodiment is an intelligent voice broadcasting device with a voice-processing function, such as an intelligent voice robot.
Fig. 1 is a method for responding to a voice according to an embodiment of the present application, and the method may include the following steps:
and S101, acquiring the voice of the user.
The user voice is the speech uttered by a user; while running, the intelligent voice broadcasting device collects any user voice within its pickup range.
S102, determining the tone type corresponding to the user voice according to the voice characteristics and the voice content of the user voice.
In this embodiment, the voice features are information describing the mood and emotion of the user voice, and include pitch features, amplitude features, timbre features, and the like.
The intonation types include at least two specified intonation types, preset according to the voice features and voice content of historical user voices; that is, a specified intonation type is defined from the mood and emotion information and the content of historical user voices. Specified types may include, for example, a cheerful, playful interactive intonation type and a mild, formal interactive intonation type. The cheerful, playful type may correspond to voices whose pitch or amplitude varies widely and whose content correlates weakly with the service inquiry; the mild, formal type may correspond to voices whose pitch or amplitude varies little and whose content correlates strongly with the service inquiry.
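The two example intonation types above (wide pitch or amplitude variation plus weak content relevance versus small variation plus strong relevance) can be sketched as a toy rule; the 20 Hz deviation threshold and the 0.5 relevance cutoff are invented for illustration:

```python
import statistics

def intonation_type(pitch_track_hz, relevance):
    # Toy rule for the two example intonation types: wide pitch swings plus
    # weak relevance to the service inquiry -> cheerful, playful type; small
    # swings plus strong relevance -> mild, formal type. Thresholds are
    # illustrative assumptions, not values from the patent.
    variation = statistics.pstdev(pitch_track_hz)
    if variation > 20 and relevance < 0.5:
        return "cheerful"
    if variation <= 20 and relevance >= 0.5:
        return "mild"
    return "default"

print(intonation_type([180, 240, 150, 260], 0.2))  # large swings, weak relevance
```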
The specific embodiment of this step can refer to the flowchart shown in fig. 2.
S103, generating response voice corresponding to the user voice based on the tone type and the voice content corresponding to the user voice.
The specific implementation mode of the step comprises a step A1 and a step A2:
step a1, based on the speech content of the user's speech, determines the responsive speech content.
The response voice content corresponding to the user voice is determined from its voice content; for example, it may be determined from keywords contained in the voice content.
Of course, in this step the response voice content may also be determined from both the voice content and the intonation type of the user voice. That is, the content of the response is related not only to the content of the user voice but also to its intonation type, so that even for identical voice content, different intonation types may yield different response content, giving the step better personalization.
Step a2, generating the response voice with the voice content being the response voice content and the tone type being the tone type corresponding to the user voice.
The tone type of the response voice is the same as that of the user voice, so that the personalized effect of the response voice can be enhanced.
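Steps A1 and A2 can be sketched as below; the keyword table and reply strings are made-up examples, not from the patent:

```python
# Illustrative sketch of steps A1/A2: pick response content from keywords in
# the recognized voice content, then pair it with the user's intonation type.
KEYWORD_REPLIES = {
    "balance": "Your balance inquiry is being processed.",
    "transfer": "Let's set up your transfer.",
}

def build_response(content, intonation):
    # Step A1: determine the response voice content from keywords.
    text = next((reply for kw, reply in KEYWORD_REPLIES.items() if kw in content),
                "How can I help you?")
    # Step A2: give the response the same intonation type as the user voice.
    return {"text": text, "intonation": intonation}

print(build_response("check my balance please", "mild"))
```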
And S104, broadcasting response voice.
For example, the intelligent voice broadcasting device calls a preset voice broadcaster to broadcast the response voice.
The method provided by this embodiment comprises: acquiring a user voice, determining the intonation type corresponding to the user voice from its voice features and voice content, generating a response voice based on that intonation type and the voice content, and finally broadcasting the response voice. Because the broadcast response voice is derived from both the intonation type and the content of the user voice, the response differs whenever the intonation type differs, achieving a personalized response to the user voice and thereby improving the user experience.
In addition, because the intonation type is determined from two dimensions, the voice features and the voice content of the user voice, it is determined with higher accuracy, which in turn improves the accuracy of the broadcast response voice.
Fig. 2 is a specific implementation manner of determining, by S102 according to the speech feature and the speech content of the user speech, the intonation type corresponding to the user speech, in the above embodiment, which may include the following steps:
s201, inputting the user voice into a pre-trained Bayes classification model, and enabling the Bayes classification model to determine the intonation type corresponding to the user voice according to the voice characteristics of the user voice.
In this step, the Bayesian classification model is trained on voice training samples, each carrying voice features; existing techniques may be used for the training procedure itself.
The pre-trained Bayesian classification model extracts the voice features of the user voice and determines the corresponding intonation type based on those features.
Specifically, the Bayesian classification model calculates, from the voice features of the user voice, the probability that the user voice belongs to each specified intonation type, and determines the specified intonation type with the maximum probability as the intonation type corresponding to the user voice.
For example, let X denote the set of all voice features of the user voice and Y1 denote the first intonation type; the probability that the user voice belongs to the first intonation type is then calculated by substituting all the voice features of the user voice into the probability formula.
Wherein, the probability formula is:

$$P(Y_1 \mid X) = \frac{P(Y_1)\,\prod_{i=1}^{n} P(A_i \mid Y_1)}{\prod_{i=1}^{n} P(A_i)}$$

where $P(Y_1 \mid X)$ is the probability that the user voice belongs to the first intonation type $Y_1$ given its feature set $X$; $A_i$ is the $i$-th feature in the feature set $X$ of the user voice; $n$ is the number of features in $X$; $P(Y_1)$ is the probability that an intonation type is the first intonation type $Y_1$; $P(A_i \mid Y_1)$ is the probability that a voice of intonation type $Y_1$ has feature $A_i$; and $P(A_i)$ is the probability that any voice has feature $A_i$.

$P(Y_1)$, $P(A_i \mid Y_1)$, and $P(A_i)$ are estimated in advance from feature sets $X$ whose intonation types are known; the larger the number of such feature sets, the more accurate their intonation labels and the estimated probabilities.
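Assuming all probabilities have been pre-estimated, the formula can be evaluated per intonation type and the maximum taken, as the description specifies. The feature names and probability values below are toy inventions; since the denominator is shared across types, only the ranking of the scores matters:

```python
import math

def intonation_scores(features, priors, likelihoods, evidence):
    # Score each intonation type Y as P(Y) * prod_i P(A_i|Y) / prod_i P(A_i),
    # per the formula above; the type with the maximum score is selected.
    den = math.prod(evidence[f] for f in features)
    return {y: priors[y] * math.prod(likelihoods[y][f] for f in features) / den
            for y in priors}

# Toy pre-estimated probabilities over two binary features.
priors = {"cheerful": 0.5, "mild": 0.5}
likelihoods = {
    "cheerful": {"high_pitch_var": 0.8, "weak_relevance": 0.7},
    "mild":     {"high_pitch_var": 0.2, "weak_relevance": 0.3},
}
evidence = {"high_pitch_var": 0.5, "weak_relevance": 0.5}

scores = intonation_scores(["high_pitch_var", "weak_relevance"],
                           priors, likelihoods, evidence)
print(max(scores, key=scores.get))
```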
S202, recognizing and obtaining the voice content corresponding to the user voice.
The voice content of the user voice can be obtained by adopting the existing voice recognition method.
S203, inputting the voice content of the user voice into a pre-trained voice classification model, and enabling the voice classification model to determine the tone type corresponding to the user voice according to the voice content of the user voice.
The Bayesian classification model determines the intonation type corresponding to the user voice from the voice features of the user voice, whereas the voice classification model determines it from the voice content of the user voice.
Optionally, the speech classification model is a GA-BP neural network model. And the GA-BP neural network model is obtained by optimizing the initial BP neural network model. The trained voice classification model can obtain the tone type corresponding to the input voice content.
The number of input layer nodes of the initial BP neural network model is determined according to the voice content length of a voice training sample, the number of output layer nodes is determined according to the tone type, and the number of hidden layer nodes is determined based on a trial and error method. The voice training sample is voice content of the historical user voice carrying tone types.
Optimizing the initial BP neural network model as follows: training and learning the initial weight and the threshold of each of the input layer, the hidden layer and the output layer of the initial BP neural network model according to preset sample data and a genetic algorithm, determining the optimal initial weight and threshold of each layer, and obtaining the optimized BP neural network model. For a specific optimization process, reference may be made to the prior art.
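A minimal sketch of the genetic-algorithm step, assuming a tiny 2-3-1 network without biases and invented toy data; in real GA-BP training the selected initial weights would then be fine-tuned with backpropagation:

```python
import math
import random

random.seed(0)  # deterministic for illustration

# Toy data: 2-feature content vectors mapped to one of two intonation classes.
DATA = [([0.9, 0.1], 1.0), ([0.8, 0.2], 1.0), ([0.1, 0.9], 0.0), ([0.2, 0.8], 0.0)]
N_IN, N_HID = 2, 3                 # hidden-node count would be set by trial and error
N_W = N_IN * N_HID + N_HID         # input->hidden plus hidden->output weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x):
    h = [sigmoid(sum(w[i * N_IN + j] * x[j] for j in range(N_IN)))
         for i in range(N_HID)]
    return sigmoid(sum(w[N_IN * N_HID + i] * h[i] for i in range(N_HID)))

def fitness(w):
    # Negative squared error on the toy data; higher is better.
    return -sum((forward(w, x) - y) ** 2 for x, y in DATA)

def evolve(pop_size=20, gens=30, mut=0.3):
    pop = [[random.uniform(-2, 2) for _ in range(N_W)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]               # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_W)
            child = a[:cut] + b[cut:]                  # one-point crossover
            children.append([g + random.gauss(0, mut) for g in child])  # mutation
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(round(-fitness(best), 4))  # error of the GA-selected initial weights
```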
And S204, obtaining the intonation types corresponding to the user voice output by the Bayesian classification model and the voice classification model, respectively.
S205, judging whether the intonation types output by the Bayesian classification model and the voice classification model are the same. If they are the same, S206 is executed; if not, S207 is executed.
And S206, taking the same tone type as the tone type corresponding to the voice of the user.
If the Bayesian classification model and the voice classification model output the same intonation type for the user voice, the probability that this shared intonation type is the correct one for the user voice is very high.
And S207, determining the tone type corresponding to the voice of the user as a preset default tone type.
For example, the default intonation type may be preset as the mild, formal interactive intonation type; when the intonation types output by the Bayesian classification model and the voice classification model differ, the intonation type corresponding to the user voice is determined as this default type.
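Steps S204 to S207 reduce to a simple agreement check with a preset default; the type names here are illustrative:

```python
DEFAULT_INTONATION = "mild"  # the preset default (flat, formal) type

def resolve_intonation(bayes_type, content_type, default=DEFAULT_INTONATION):
    # S204-S207: keep the intonation type only when both models agree;
    # otherwise fall back to the preset default intonation type.
    return bayes_type if bayes_type == content_type else default

print(resolve_intonation("cheerful", "cheerful"))
print(resolve_intonation("cheerful", "mild"))
```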
In the method provided by this embodiment, the Bayesian classification model determines the intonation type corresponding to the user voice from its voice features, and the voice classification model determines it from its voice content, which amounts to determining the intonation type of the user voice from two different dimensions.
Fig. 3 is a schematic structural diagram of a voice response apparatus according to an embodiment of the present application, including: a processor 301 and a memory 302, the memory for storing a program and the processor for executing the program to implement the method of voice response provided herein.
The intelligent voice response device can be placed at service points to provide automatic voice response services to users. For example, deployed at a business outlet, it can improve the user's service experience by offering both cheerful, casual interaction and formal, business-like interaction.
For example, when the user voice is of a cheerful intonation type, the user likely wants informal, playful interaction with the intelligent voice device; when the user voice is of a mild intonation type, the user likely wants formal, business-like interaction.
Correspondingly, the intonation types of user voices are specified in advance as a cheerful type and a mild type, and the intelligent voice response device is preconfigured so that it responds in a cheerful, playful intonation when the user voice is determined to be of the cheerful type, and in a mild intonation when the user voice is of the mild type. By providing these two different interaction modes, the device improves the user's service experience.
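The two preconfigured interaction modes described above might be represented as a simple lookup that falls back to the mild, formal mode; all names and fields here are hypothetical:

```python
# Hypothetical configuration for the two pre-specified interaction modes.
RESPONSE_MODES = {
    "cheerful": {"voice_style": "cheerful", "register": "informal"},
    "mild":     {"voice_style": "mild", "register": "formal"},
}

def response_mode(intonation_type):
    # Fall back to the mild, formal mode for unrecognized types.
    return RESPONSE_MODES.get(intonation_type, RESPONSE_MODES["mild"])

print(response_mode("cheerful")["register"])
```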
Fig. 4 is a schematic structural diagram of a voice response apparatus according to an embodiment of the present application, including:
an obtaining unit 401, configured to obtain a user voice;
a determining unit 402, configured to determine, according to a voice feature and a voice content of the user voice, a tone type corresponding to the user voice;
a generating unit 403, configured to generate a response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and the broadcasting unit 404 is used for broadcasting the response voice.
The intonation types include at least two specified intonation types, any one of which is preset according to the voice features and voice content of historical user voices; the voice features include at least a pitch feature and an amplitude feature.
The specific implementation manner of determining the intonation type corresponding to the user voice by the determining unit 402 according to the voice feature and the voice content of the user voice is as follows:
inputting the user voice into a pre-trained Bayes classification model, and enabling the Bayes classification model to determine the intonation type corresponding to the user voice according to the voice characteristics of the user voice;
recognizing and obtaining the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained voice classification model; enabling the voice classification model to determine the intonation type corresponding to the user voice according to the voice content of the user voice;
respectively acquiring the intonation types corresponding to the user voice output by the Bayesian classification model and the voice classification model;
if the intonation type output by the Bayes classification model and the intonation type output by the voice classification model are the same intonation type, taking the same intonation type as the intonation type corresponding to the voice of the user;
and if the intonation types output by the Bayes classification model and the intonation types output by the voice classification model are different intonation types, determining the intonation type corresponding to the user voice as a preset default intonation type.
Optionally, the bayesian classification model is obtained by training according to a voice training sample, where the voice training sample carries the voice feature; the Bayesian classification model determines the intonation type corresponding to the user voice as follows: and the Bayesian classification model calculates the probability that the user voice belongs to each intonation type respectively according to the voice characteristics of the user voice, and determines the intonation type corresponding to the maximum probability value as the intonation type corresponding to the user voice.
Optionally, the speech classification model is a GA-BP neural network model, and the GA-BP neural network model is a model obtained by optimizing an initial BP neural network model;
the number of input layer nodes of the initial BP neural network model is determined according to the voice content length of a voice training sample, the number of output layer nodes is determined according to the intonation type, and the number of hidden layer nodes is determined based on a trial and error method;
the optimization of the initial BP neural network model is as follows: and training and learning the initial weight and the threshold of each layer of the input layer, the hidden layer and the output layer of the initial BP neural network model according to preset sample data and a genetic algorithm, and determining the optimal initial weight and threshold of each layer to obtain the optimized BP neural network model.
Optionally, the specific implementation manner of generating, by the generating unit 403, the response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content is as follows:
determining responsive voice content based on the voice content;
and generating the response voice with the voice content as response voice content and the tone type as the tone type corresponding to the voice of the user.
The device provided by the embodiment of the present application acquires the user voice, determines the intonation type corresponding to the user voice according to the voice features and voice content of the user voice, generates the response voice corresponding to the user voice based on that intonation type and the voice content, and finally broadcasts the response voice. Because the broadcast response voice is derived from both the intonation type and the voice content of the user voice, user voices with different intonation types receive different responses; the response is thus personalized to the user voice, which improves the user experience.
In addition, the intonation type corresponding to the user voice is determined from two dimensions, the voice features and the voice content of the user voice, so the determined intonation type is more accurate, which in turn improves the accuracy of the broadcast response voice.
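The two-dimension check can be sketched as a simple agreement rule between the acoustic (Bayesian) prediction and the content-based prediction, with the fallback-to-default behavior described in claims 3 and 4. The intonation-type names and the default label here are assumed for illustration.

```python
# Sketch of combining the two model outputs (claims 3 and 4): keep the
# prediction only when both models agree, otherwise use a preset default.
DEFAULT_INTONATION = "default"  # assumed name for the preset default type

def resolve_intonation(acoustic_pred: str, content_pred: str) -> str:
    """Return the agreed intonation type, or the default on disagreement."""
    if acoustic_pred == content_pred:
        return acoustic_pred
    return DEFAULT_INTONATION

agreed = resolve_intonation("excited", "excited")
fallback = resolve_intonation("excited", "calm")
```

Requiring agreement between the acoustic and content dimensions is what gives the determined intonation type its higher accuracy.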
The present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the voice response method of the present application, that is, to perform the following steps:
acquiring user voice;
determining an intonation type corresponding to the user voice according to the voice features and the voice content of the user voice;
generating response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and broadcasting the response voice.
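The four steps above can be sketched end to end. Every helper below is a stub invented for illustration; none of these function names or return shapes come from the application.

```python
# Stand-in sketch of the four steps of the voice response method.
def acquire_user_voice():
    return {"audio": b"", "text": "check my balance"}       # step 1: acquire

def determine_intonation(voice):
    return "neutral"                                        # step 2: classify (stub)

def generate_response(voice, intonation):                   # step 3: generate
    return {"text": "Here is your balance.", "intonation": intonation}

def broadcast(response):                                    # step 4: broadcast (stub)
    return f"[{response['intonation']}] {response['text']}"

voice = acquire_user_voice()
intonation = determine_intonation(voice)
response = generate_response(voice, intonation)
announcement = broadcast(response)
```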
If the functions described in the methods of the embodiments of the present application are implemented in the form of software functional units and sold or used as independent products, they may be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of the present application that contributes to the prior art, or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include: USB flash drives, removable hard disks, Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of voice response, comprising:
acquiring user voice;
determining an intonation type corresponding to the user voice according to the voice features and the voice content of the user voice;
generating response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and broadcasting the response voice.
2. The method according to claim 1, wherein the intonation types include at least two specified intonation types, each preset according to the voice features and voice content of historical user voices;
the voice features include at least a tone feature and a pitch feature.
3. The method according to claim 2, wherein determining the intonation type corresponding to the user voice according to the voice features and the voice content of the user voice comprises:
inputting the user voice into a pre-trained Bayesian classification model, so that the Bayesian classification model determines the intonation type corresponding to the user voice according to the voice features of the user voice;
recognizing the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained speech classification model, so that the speech classification model determines the intonation type corresponding to the user voice according to the voice content of the user voice;
acquiring the intonation types corresponding to the user voice output by the Bayesian classification model and by the speech classification model, respectively;
and if the intonation type output by the Bayesian classification model and the intonation type output by the speech classification model are the same intonation type, taking that intonation type as the intonation type corresponding to the user voice.
4. The method of claim 3, further comprising:
and if the intonation type output by the Bayesian classification model and the intonation type output by the speech classification model are different intonation types, determining the intonation type corresponding to the user voice to be a preset default intonation type.
5. The method according to claim 3, wherein the Bayesian classification model is trained on voice training samples carrying the voice features;
the Bayesian classification model determines the intonation type corresponding to the user voice as follows: it calculates, from the voice features of the user voice, the probability that the user voice belongs to each intonation type, and determines the intonation type with the maximum probability as the intonation type corresponding to the user voice.
6. The method according to claim 3, wherein the speech classification model is a GA-BP neural network model, obtained by optimizing an initial BP neural network model;
the number of input-layer nodes of the initial BP neural network model is determined by the voice content length of the voice training samples, the number of output-layer nodes is determined by the number of intonation types, and the number of hidden-layer nodes is determined by trial and error;
the initial BP neural network model is optimized as follows: the initial weights and thresholds of its input, hidden, and output layers are trained with a genetic algorithm on preset sample data, and the optimal initial weights and thresholds of each layer are determined to obtain the optimized BP neural network model.
7. The method according to claim 1, wherein generating the response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content comprises:
determining the response voice content based on the voice content;
and generating the response voice whose content is the response voice content and whose intonation type is the intonation type corresponding to the user voice.
8. An apparatus for voice response, comprising:
an acquisition unit configured to acquire a user voice;
a determining unit, configured to determine an intonation type corresponding to the user voice according to the voice features and the voice content of the user voice;
a generating unit, configured to generate a response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and the broadcasting unit is used for broadcasting the response voice.
9. A voice response apparatus, characterized by comprising: a processor and a memory for storing a program; the processor is configured to execute the program to implement the method of voice response according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of voice response according to any one of claims 1 to 7.
CN202011052933.7A 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium Active CN112201277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052933.7A CN112201277B (en) 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112201277A true CN112201277A (en) 2021-01-08
CN112201277B CN112201277B (en) 2024-03-22

Family

ID=74008030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052933.7A Active CN112201277B (en) 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112201277B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334583A (en) * 2018-01-26 2018-07-27 上海智臻智能网络科技股份有限公司 Affective interaction method and device, computer readable storage medium, computer equipment
CN109447354A (en) * 2018-10-31 2019-03-08 中国银行股份有限公司 A kind of intelligent bank note distribution method and device based on GA-BP neural network
KR20190088126A (en) * 2018-01-05 2019-07-26 서울대학교산학협력단 Artificial intelligence speech synthesis method and apparatus in foreign language
CN110110169A (en) * 2018-01-26 2019-08-09 上海智臻智能网络科技股份有限公司 Man-machine interaction method and human-computer interaction device
CN110379445A (en) * 2019-06-20 2019-10-25 深圳壹账通智能科技有限公司 Method for processing business, device, equipment and storage medium based on mood analysis
CN111368538A (en) * 2020-02-29 2020-07-03 平安科技(深圳)有限公司 Voice interaction method, system, terminal and computer readable storage medium
CN111414754A (en) * 2020-03-19 2020-07-14 中国建设银行股份有限公司 Emotion analysis method and device of event, server and storage medium

Also Published As

Publication number Publication date
CN112201277B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN108428447B (en) Voice intention recognition method and device
CN111428010B (en) Man-machine intelligent question-answering method and device
CN111931513A (en) Text intention identification method and device
CN110019742B (en) Method and device for processing information
CN111858854B (en) Question-answer matching method and relevant device based on historical dialogue information
CN111191450A (en) Corpus cleaning method, corpus entry device and computer-readable storage medium
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN112686051B (en) Semantic recognition model training method, recognition method, electronic device and storage medium
CN111583906A (en) Role recognition method, device and terminal for voice conversation
CN112632242A (en) Intelligent conversation method and device and electronic equipment
CN110457454A (en) A kind of dialogue method, server, conversational system and storage medium
CN111639162A (en) Information interaction method and device, electronic equipment and storage medium
CN111625636B (en) Method, device, equipment and medium for rejecting man-machine conversation
CN115640398A (en) Comment generation model training method, comment generation device and storage medium
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN110209768A (en) The problem of automatic question answering treating method and apparatus
WO2022141142A1 (en) Method and system for determining target audio and video
CN109271637B (en) Semantic understanding method and device
CN112201277B (en) Voice response method, device, equipment and computer readable storage medium
CN113035179B (en) Voice recognition method, device, equipment and computer readable storage medium
CN111984769B (en) Information processing method and device of response system
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
CN111723198A (en) Text emotion recognition method and device and storage medium
CN118173093B (en) Speech dialogue method and system based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant