CN113963699A - Intelligent voice interaction method for financial equipment - Google Patents


Publication number: CN113963699A
Authority: CN (China)
Prior art keywords: loudness, signal, audio signal, equipment, sound
Legal status: Withdrawn
Application number: CN202111283365.6A
Other languages: Chinese (zh)
Inventors: 田立刚, 张云峰, 张海华, 魏巍, 杨孟超
Assignee: Cashway Technology Co Ltd
Application filed by Cashway Technology Co Ltd
Priority to CN202111283365.6A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L2015/223: Execution procedure of a spoken command
    • G10L2015/225: Feedback of the input speech


Abstract

The invention discloses an intelligent voice interaction method for financial equipment, which comprises the following steps. Signal acquisition and separation: an audio signal is collected and separated into a voice signal and a noise signal by a separation algorithm. Synthesis of a voice signal: voice recognition and semantic understanding are performed on the voice signal, the best answer text is found, and the answer text is synthesized into an answer voice signal. The playing audio signal is determined according to formula 1, wherein f(n) is the playing audio signal; s3(n) is the sound signal the user is predicted to hear, whose parameters other than amplitude are the same as those of the answer voice signal; d1(n) is the noise signal; and n is the sample index of the discretely analyzed audio signal. The loudness of the played sound is determined as the sum of the basic sound loudness and the loudness attenuation amount, and the equipment is set according to the loudness of the played sound to realize volume adjustment.

Description

Intelligent voice interaction method for financial equipment
Technical Field
The invention relates to the technical field of financial self-service terminals, in particular to an intelligent voice interaction method for financial equipment.
Background
Intelligent voice interaction is a new generation of interaction mode based on voice input: the user obtains a feedback result simply by speaking. The biggest problem of voice interaction is that it is not accurate enough. First, environmental influences lower the accuracy of voice recognition; second, an intention can be expressed in many ways, which cannot all be covered; finally, voice interaction is an open domain, and many unexpected situations need to be handled. There are also scenarios for which voice interaction is considered unsuitable, such as meetings or when family members are sleeping.
With the wide application of financial self-service equipment and customer service robots, a shortcoming has emerged: the volume of existing equipment is constant during the interaction process, and in a complex environment the ambient sound can impair what the user hears, which affects customer satisfaction to a certain extent.
Disclosure of Invention
The invention aims to provide an intelligent voice interaction method for financial equipment that addresses the technical defect of the prior art that the playing sound is constant.
The technical scheme adopted for realizing the purpose of the invention is as follows:
an intelligent voice interaction method for financial equipment is characterized by comprising the following steps:
(1) acquisition of a playing audio signal:
signal acquisition and separation: collecting audio signals, and separating the audio signals into voice signals and noise signals by adopting a separation algorithm;
synthesis of a speech signal: carrying out voice recognition on the voice signals, carrying out semantic understanding, finding out the best answer text, and synthesizing the answer text into answer voice signals;
determining the playing audio signal according to formula 1:

s3(n) = Σ (m = 0 to n) f(m) · d1(n − m)    (formula 1, a discrete convolution)

wherein f(n) is the playing audio signal; s3(n) is the sound signal the user is predicted to hear, whose parameters other than amplitude are the same as those of the answer voice signal; d1(n) is the noise signal; n is the sample index of the discretely analyzed audio signal; and m is an integer from 0 to n;
according to the noise signal d1(n) and the sound signal s3(n) that the user is predicted to hear, the playing audio signal f(n) is obtained by deconvolution; because the environmental noise is superimposed on the played signal to reproduce s3(n), a noise reduction function is achieved;
(2) acquisition of loudness of played sound
Determining the loudness of the played sound as the sum of the loudness of the basic sound and the loudness attenuation quantity;
(3) the information content played by the equipment is determined by playing the audio signal, and the volume played by the equipment is determined by playing the sound loudness, so that intelligent voice interaction is realized.
Preferably, the audio signal is separated by an ICA blind source separation algorithm.
Preferably, the step of determining the distance r from the speaker to the user is as follows:
judging whether a living body is present in front of the equipment through an infrared sensor, and if so, measuring the distance between the user and the equipment through an ultrasonic sensor;
collecting audio signals through a microphone array to obtain a relative angle between a user and equipment;
and obtaining the distance r from the loudspeaker to the user according to the distance from the ultrasonic sensor to the user, the relative angle between the user and the equipment and the relative distances among the ultrasonic sensor, the microphone array and the loudspeaker.
Preferably, the equipment starts audio signal acquisition after being awakened; the audio signal is divided into frames, and when silence lasting longer than a set time threshold is detected, a pause is judged to have occurred and the audio signal is separated; the awakening mode comprises awakening by a wake-up word or awakening triggered by the infrared sensor.
Preferably, after the device is awakened, the device acquires a first audio signal, and the noise signal obtained by separation is the noise signal used when the audio signal is determined to be played each time in the voice interaction; in one voice interaction, when the situation that the position change of a user exceeds a set distance threshold or the noise loudness of a service environment exceeds a set loudness threshold is detected, separating the newly obtained audio signals, and obtaining a noise signal again to serve as the noise signal used when the audio signal is determined to be played next time in the voice interaction.
Preferably, the basic sound loudness is a fixed known value, and the loudness attenuation amount is calculated by formula 2:

[formula 2: equation image not reproduced in the source]

where r is the distance from the loudspeaker to the user.
Preferably, every time the change of the position of the user is detected to exceed the set distance threshold, the loudness attenuation amount is recalculated, the equipment is set according to the loudness of the new playing sound, and the real-time adjustment of the volume is realized.
Preferably, the maximum value of the loudness of the played sound is twice the basic sound loudness.
The invention has the beneficial effects that:
1. The invention provides a method by which equipment automatically adjusts its playing volume for different noise conditions and different user positions, improving customer satisfaction during voice exchanges with intelligent equipment.
2. The audio signal collected each time is separated into a voice signal and a noise signal; since each collected audio signal differs, the voice and noise signals differ as well. This realizes per-user adjustment of the audio signal, so that every user hears the most comfortable, most suitable sound.
3. When the customer is not communicating with the device, the signal measured directly by the device's microphone array is a noise signal. When the customer communicates with the device, the microphone array collects an audio signal mixed with noise: y1(n) = s1(n) + d1(n), where y1(n) is the collected audio signal, s1(n) is the voice signal, and d1(n) is the noise signal. The mixed audio signal is first denoised: the ICA blind source separation algorithm separates it into the voice signal s1(n) and the noise signal d1(n), and this noise reduction improves the accuracy of converting the voice signal into text information.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
An intelligent voice interaction method for financial equipment comprises the following steps:
(1) acquisition of a playing audio signal:
signal acquisition and separation: collecting audio signals, and separating the audio signals into voice signals and noise signals by adopting a separation algorithm;
synthesis of a speech signal: carrying out voice recognition on the voice signals, carrying out semantic understanding, finding out the best answer text, and synthesizing the answer text into answer voice signals;
the audio signal is a regular sound wave frequency and amplitude variation information carrier with voice, music and sound effects. The answer speech signal synthesized by the device from the answer text is also an audio signal, also called a sound wave, which has three important parameters: frequency, amplitude and phase, which also determine the characteristics of the audio signal. In the prior art, a device directly synthesizes answer texts into answer voice signals, and the answer voice signals are played through a loudspeaker, wherein the frequency, the amplitude and the phase of the answer voice signals are fixed and are values initially set by the device, so that the sound waves are the same no matter what environment a user is on site, and in different noise environments, although the audio signals played by the device are the same, the audio signals heard by the user are different. In signal processing, useful call signals and useless call noise are utilized, and the noise signals are also utilized by the application.
In view of this, in the technical scheme designed by the invention, the collected and separated noise signal is used as a known quantity in the calculation. The words to be spoken by the equipment are produced by speech synthesis, giving the synthesized voice signal s2(n). The signal s3(n) has the same frequency and phase as s2(n), and its amplitude is calculated from formula 2 and the basic loudness; s3(n) is the sound signal the user is predicted to hear. In formula 1, s3(n) and d1(n) are known quantities, so f(n) can be determined by deconvolution; the superposition of the noise on f(n) then reproduces s3(n). The amplitude is related to loudness, which is obtained below in the acquisition of the loudness of the played sound.
The noise signal is the known quantity used in the formula 1 convolution for the current interaction; when the equipment is awakened again, the noise signal is collected anew, ensuring that the noise signal used as the known quantity reflects the actual conditions at that time.
Determining the playback audio signal according to equation 1:
s3(n) = Σ (m = 0 to n) f(m) · d1(n − m)    (formula 1)

Where f(n) is the playing audio signal; s3(n) is the sound signal the user is predicted to hear, whose parameters other than amplitude are the same as those of the answer voice signal; d1(n) is the noise signal; n is the sample index of the discretely analyzed audio signal; and m is an integer from 0 to n.
The conventional method finds the best answer text, synthesizes it into a speech signal, and has the device play it to the user directly. In this scheme, after the answer text is synthesized into the answer voice signal, formula 1 is solved in reverse for f(n), which achieves the noise reduction effect. The expansion of formula 1 is detailed below:

s3(0) = f(0) · d1(0)
s3(1) = f(0) · d1(1) + f(1) · d1(0)
s3(2) = f(0) · d1(2) + f(1) · d1(1) + f(2) · d1(0)

By parity of reasoning, we obtain

f(n) = [s3(n) − Σ (m = 0 to n−1) f(m) · d1(n − m)] / d1(0)
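Treating formula 1 as the discrete convolution s3(n) = Σ f(m)·d1(n−m), the back-substitution above can be sketched in Python with NumPy. This is an illustrative reconstruction, not the patent's own code; the function name and the test values are invented for the example.

```python
import numpy as np

def deconvolve_playback(s3, d1):
    """Solve s3(n) = sum_{m=0}^{n} f(m) * d1(n - m) for f by
    back-substitution: f(0) = s3(0)/d1(0), and each later f(n)
    subtracts the already-known terms of the convolution sum.
    Requires d1[0] != 0."""
    f = np.zeros(len(s3))
    for n in range(len(s3)):
        known = sum(f[m] * d1[n - m] for m in range(n))
        f[n] = (s3[n] - known) / d1[0]
    return f

# Round trip: convolve a known playing signal with the noise,
# then recover it from the first len(f_true) output samples.
f_true = np.array([1.0, 2.0, -1.0, 0.5])
d1 = np.array([1.0, 0.3, -0.2, 0.1])
s3 = np.convolve(f_true, d1)[:4]
print(deconvolve_playback(s3, d1))  # recovers [1.0, 2.0, -1.0, 0.5]
```

Note that this direct recursion amplifies error when d1(0) is small, which is one reason practical systems prefer frequency-domain deconvolution.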
(2) Acquisition of loudness of played sound
And determining the loudness of the playing sound as the sum of the loudness of the basic sound and the loudness attenuation amount.
The basic sound loudness is the loudness at which the user hears the sound comfortably, and is a fixed known value. Sound pressure is measured in decibels: 1 dB is about the faintest sound the human ear can detect; below 20 dB is generally considered quiet, and below 15 dB nearly silent; 20-40 dB is roughly the level of a whisper; 40-60 dB corresponds to normal conversational speech. Since financial equipment is generally installed in a bank hall, the loudness at which the user hears the sound comfortably is set to 50 dB.
The loudness attenuation is calculated as follows:
[formula 2: equation image not reproduced in the source]

Where r is the distance from the loudspeaker to the user.
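Since the image of formula 2 is not reproduced, the following sketch assumes a standard free-field spherical-spreading attenuation of 20·log10(r/r0) dB purely for illustration; the reference distance r0 and the function name are invented, not from the patent. The played loudness is the 50 dB basic loudness plus this attenuation, capped at twice the basic loudness as the text specifies.

```python
import math

BASE_LOUDNESS_DB = 50.0      # comfortable basic loudness for a bank hall (from the text)
REFERENCE_DISTANCE_M = 1.0   # hypothetical reference distance r0 (not given in the source)

def playback_loudness(r, base_db=BASE_LOUDNESS_DB, r0=REFERENCE_DISTANCE_M):
    """Played loudness = basic loudness + loudness attenuation amount.
    The attenuation model 20*log10(r/r0) is an assumed stand-in for the
    patent's formula 2; the result is capped at twice the basic loudness."""
    attenuation = 20.0 * math.log10(max(r, r0) / r0)
    return min(base_db + attenuation, 2.0 * base_db)

print(playback_loudness(1.0))   # 50.0 at the reference distance
print(playback_loudness(10.0))  # 70.0 ten times farther away
```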
(3) The information content played by the equipment is determined by playing the audio signal, and the volume played by the equipment is determined by playing the sound loudness, so that intelligent voice interaction is realized.
In this embodiment, an FFT noise reduction algorithm and an ICA blind source separation algorithm are adopted to separate the audio signals.
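ICA blind source separation needs multiple microphone channels, but the single-channel FFT noise reduction mentioned here can be sketched as magnitude spectral subtraction. The frame length and function name are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def spectral_subtraction(y, noise_est, n_fft=256):
    """FFT-based noise reduction: subtract the estimated noise magnitude
    spectrum from each frame of the noisy signal y, keep the noisy phase,
    and resynthesize. Negative magnitudes are floored at zero."""
    noise_mag = np.abs(np.fft.rfft(noise_est[:n_fft]))
    out = np.zeros(len(y))
    for start in range(0, len(y) - n_fft + 1, n_fft):
        spec = np.fft.rfft(y[start:start + n_fft])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[start:start + n_fft] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=n_fft)
    return out

# If the input is exactly the estimated noise, subtraction removes it entirely.
noise = np.cos(np.arange(256) * 0.37)
cleaned = spectral_subtraction(np.tile(noise, 4), noise)
print(np.max(np.abs(cleaned)))  # ~0.0
```

Real deployments use overlapping windows and a smoothed noise estimate; this non-overlapping version keeps the idea visible in a few lines.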
The distance r from the loudspeaker to the user is determined as follows:
collecting audio signals through a microphone array to obtain a relative angle between a user and equipment;
judging whether a living body is present in front of the equipment through an infrared sensor, and if so, measuring the distance between the user and the equipment through an ultrasonic sensor;
And obtaining the distance r from the loudspeaker to the user according to the distance from the ultrasonic sensor to the user, the relative angle between the user and the equipment, and the relative positions of the ultrasonic sensor, the microphone array and the loudspeaker. The relative distance from the sensor to the loudspeaker is a fixed known value.
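The geometry can be sketched as follows, assuming a planar layout with the ultrasonic sensor at the origin and the device front along the y axis; this coordinate convention and the function name are assumptions for illustration, since the text only states which quantities r is derived from.

```python
import math

def speaker_to_user_distance(d_ultra, angle_deg, speaker_offset):
    """Estimate the loudspeaker-to-user distance r from:
    d_ultra        -- ultrasonic-sensor-to-user distance,
    angle_deg      -- bearing of the user from the microphone array
                      (0 degrees = straight ahead of the device),
    speaker_offset -- fixed, known (x, y) mounting offset of the
                      loudspeaker relative to the ultrasonic sensor."""
    theta = math.radians(angle_deg)
    # User position in the sensor's coordinate frame.
    ux, uy = d_ultra * math.sin(theta), d_ultra * math.cos(theta)
    sx, sy = speaker_offset
    return math.hypot(ux - sx, uy - sy)

# User 2 m straight ahead; loudspeaker mounted 0.5 m in front of the sensor.
print(speaker_to_user_distance(2.0, 0.0, (0.0, 0.5)))  # 1.5
```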
After the equipment is awakened, audio signal acquisition is started; the audio signal is divided into frames, and when silence lasting longer than a set time threshold is detected, a pause is judged to have occurred and the audio signal is separated. The awakening mode comprises awakening by a wake-up word or awakening triggered by the infrared sensor.
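The framing and pause test can be sketched with a simple frame-energy rule. The 20 ms frame, energy threshold, and 500 ms pause threshold are illustrative values; the text only specifies that a pause is declared when silence exceeds a set time threshold.

```python
import numpy as np

def detect_pause(samples, rate, frame_ms=20, energy_thresh=1e-4, pause_ms=500):
    """Split the signal into frames and report whether the trailing
    run of low-energy frames exceeds the pause threshold, which is
    the cue to stop capturing and hand the buffer to separation."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies = [float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
                for i in range(n_frames)]
    trailing = 0  # consecutive silent frames at the end of the buffer
    for e in reversed(energies):
        if e >= energy_thresh:
            break
        trailing += 1
    return trailing * frame_ms >= pause_ms

speech = np.full(16000, 0.1)   # 1 s of "speech" at 16 kHz
silence = np.zeros(16000)      # followed by 1 s of silence
print(detect_pause(np.concatenate([speech, silence]), 16000))  # True
```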
Generally, after the device is awakened, a first audio signal is acquired, and the noise signal obtained by separation is the noise signal used when the audio signal is determined to be played each time in the voice interaction.
However, in practical applications the environment around the user can change, which changes the noise signal. The noise signal therefore needs to be determined again, after which the noise-free audio signal to be played back is obtained by the usual subtraction method, ensuring that the audio signal the person hears is clean. Accordingly, in one voice interaction, when the change in the user's position exceeds the set distance threshold or the noise loudness of the service environment exceeds the set loudness threshold, the newly collected audio signal is separated and a noise signal is obtained anew, to serve as the noise signal used the next time the playing audio signal is determined in this voice interaction.
Furthermore, in order to ensure that the volume at each moment is most appropriate, every time the change of the position of the user is detected to exceed the set distance threshold, the loudness attenuation amount is recalculated, and the device is set according to the loudness of the new playing sound, so that the volume adjustment is realized.
The maximum value of the loudness of the played sound is set to be twice the loudness of the basic sound.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (8)

1. An intelligent voice interaction method for financial equipment is characterized by comprising the following steps:
(1) acquisition of a playing audio signal:
signal acquisition and separation: collecting audio signals, and separating the audio signals into voice signals and noise signals by adopting a separation algorithm;
synthesis of a speech signal: carrying out voice recognition on the voice signals, carrying out semantic understanding, finding out the best answer text, and synthesizing the answer text into answer voice signals;
determining the playback audio signal according to equation 1:
s3(n) = Σ (m = 0 to n) f(m) · d1(n − m)    (formula 1, a discrete convolution)

wherein f(n) is the playing audio signal; s3(n) is the sound signal the user is predicted to hear, whose parameters other than amplitude are the same as those of the answer voice signal; d1(n) is the noise signal; n is the sample index of the discretely analyzed audio signal; and m is an integer from 0 to n;
according to the noise signal d1(n) and the sound signal s3(n) that the user is predicted to hear, the playing audio signal f(n) is obtained by deconvolution; because the environmental noise is superimposed on the played signal to reproduce s3(n), a noise reduction function is achieved;
(2) acquisition of loudness of played sound
Determining the loudness of the played sound as the sum of the loudness of the basic sound and the loudness attenuation quantity;
(3) the information content played by the equipment is determined by playing the audio signal, and the volume played by the equipment is determined by playing the sound loudness, so that intelligent voice interaction is realized.
2. The intelligent voice interaction method for financial equipment as claimed in claim 1, wherein the audio signal is separated by ICA blind source separation algorithm.
3. The intelligent voice interaction method for financial equipment as claimed in claim 1, wherein the distance r from the speaker to the user is determined as follows:
judging whether a living body is present in front of the equipment through an infrared sensor, and if so, measuring the distance between the user and the equipment through an ultrasonic sensor;
collecting audio signals through a microphone array to obtain a relative angle between a user and equipment;
and obtaining the distance r from the loudspeaker to the user according to the distance from the ultrasonic sensor to the user, the relative angle between the user and the equipment and the relative distances among the ultrasonic sensor, the microphone array and the loudspeaker.
4. The intelligent voice interaction method for financial equipment as claimed in claim 1, wherein the equipment starts audio signal acquisition after being awakened; the audio signal is divided into frames, and when silence lasting longer than a set time threshold is detected, a pause is judged to have occurred and the audio signal is separated; the awakening mode comprises awakening by a wake-up word or awakening triggered by the infrared sensor.
5. The intelligent voice interaction method for financial equipment as claimed in claim 4,
after the equipment is awakened, acquiring a first audio signal, wherein a noise signal obtained by separation is a noise signal used when the audio signal is determined to be played each time in the voice interaction; in one voice interaction, when the situation that the position change of a user exceeds a set distance threshold or the noise loudness of a service environment exceeds a set loudness threshold is detected, separating the newly obtained audio signals, and obtaining a noise signal again to serve as the noise signal used when the audio signal is determined to be played next time in the voice interaction.
6. The intelligent voice interaction method for financial equipment according to claim 1, wherein the loudness of the fundamental sound is a fixed known value, and the loudness attenuation is calculated as follows:
[formula 2: equation image not reproduced in the source]

where r is the distance from the loudspeaker to the user.
7. The intelligent voice interaction method for financial devices as claimed in claim 6, wherein the loudness attenuation is recalculated whenever a change in the user position is detected to exceed a set distance threshold, and the device is set according to the loudness of a new playing sound, so as to achieve real-time adjustment of the volume.
8. The financial device intelligent voice interaction method of claim 1, wherein the maximum value of the loudness of the played sound is set to twice the loudness of the basic sound.
CN202111283365.6A 2021-11-01 2021-11-01 Intelligent voice interaction method for financial equipment Withdrawn CN113963699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111283365.6A CN113963699A (en) 2021-11-01 2021-11-01 Intelligent voice interaction method for financial equipment


Publications (1)

Publication Number Publication Date
CN113963699A true CN113963699A (en) 2022-01-21

Family

ID=79468672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111283365.6A Withdrawn CN113963699A (en) 2021-11-01 2021-11-01 Intelligent voice interaction method for financial equipment

Country Status (1)

Country Link
CN (1) CN113963699A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117294985A (en) * 2023-10-27 2023-12-26 深圳市迪斯声学有限公司 TWS Bluetooth headset control method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220121