CN112185393A - Voice recognition processing method for power supply intelligent client - Google Patents
- Publication number
- CN112185393A (application CN202011059062.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- voice signal
- mode
- voice recognition
- client
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/50—Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
- H04M3/51—Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
Abstract
The invention provides a voice recognition processing method for a power supply intelligent client. A client's voice signal is answered through an intelligent power supply seat; the voice signal is preprocessed; feature extraction is performed on the preprocessed signal to obtain its voice features; and the extracted features are recognized through a pre-constructed voice recognition model to obtain and output a voice recognition result in text format. By implementing the invention, effective preprocessing of the voice signal improves recognition accuracy; the intelligent interrupt processing mechanism raises the degree of intelligence and enables intelligent interrupt responses, improving the client's experience; and training the voice recognition model further improves recognition accuracy.
Description
Technical Field
The invention relates to the technical field of power supply intelligent clients, in particular to a voice recognition processing method for a power supply intelligent client.
Background
For customer service work, intelligent voice is one of the trends of future development. Although many power supply enterprises are actively building intelligent customer service systems, most existing voice navigation systems still have shortcomings: a low degree of intelligence, limited voice recognition performance, complex service flows, poor integrity, and poor serviceability.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice recognition processing method for a power supply intelligent client that improves the voice recognition effect.
To solve the above technical problem, one aspect of the present invention provides a voice recognition processing method for a power supply intelligent client, which comprises the following steps:
step S10, answering the voice signal of the client through the intelligent power supply seat;
step S11, preprocessing the voice signal of the client, wherein the preprocessing comprises endpoint detection, noise elimination and intelligent interruption processing;
step S12, extracting the characteristics of the preprocessed voice signal to obtain the voice characteristics in the voice signal;
and step S13, recognizing the extracted voice features through a pre-constructed voice recognition model, obtaining a voice recognition result in a text format, and outputting the voice recognition result.
Preferably, the step S11 further includes:
step S110, converting the voice data into a data format suitable for acoustic model processing; the format is WAV or PCM audio, mono, 16-bit, at a 16000 Hz sampling rate;
step S111, identifying end points in the voice signal to determine the speaking start point and the speaking end point of the user;
step S112, eliminating background noise, carrying out scene analysis, firstly carrying out scene type judgment on the voice signal, and after determining the scene type, carrying out noise elimination according to the preset noise elimination modes corresponding to different scenes;
and step S113, responding to the latest voice signal of the user in real time, and interrupting the current task or continuing the current task according to the set interrupt processing mode.
Preferably, the step S111 further includes:
when a received voice signal triggers voice recognition, judging from the user's login information whether that user's speaking habits are stored; if so, adjusting the detection parameters according to those habits and then performing endpoint detection; if not, performing endpoint detection with the general-purpose detection parameters.
Preferably, in the step S111, the endpoint in the speech signal is identified by a time domain feature method and a frequency domain feature method.
Preferably, the step S113 specifically includes:
presetting different interrupt processing mode types for different users, wherein the interrupt processing mode types comprise: an interruptible mode and a non-interruptible mode;
when voice recognition is carried out in real time, if a new voice signal is received, judging the type of the current interrupt processing mode;
if the current mode is the interruptible mode, interrupting the response of the current voice recognition when a new voice instruction is received;
and if the current mode is the non-interruptible mode, after the user triggers the voice recognition processing, a new voice signal instruction is accepted only once execution of the current instruction has finished.
Preferably, the step S12 further includes:
and performing feature extraction on the preprocessed voice signal by adopting a linear predictive coding technology to obtain the voice feature in the voice signal.
Preferably, further comprising: the method comprises the steps of pre-constructing a speech recognition model, wherein the speech recognition model comprises an acoustic model, a dictionary and a language model, and specifically comprises the following steps:
training voice characteristics corresponding to voice sample data in a voice database to obtain mapping from the voice characteristics to phonemes to form an acoustic model;
training texts in a text database to obtain a language model, and obtaining mapping between words and sentences to form the language model;
and constructing a mapping relation between the voice and the characters according to the acoustic model and the language model to form a dictionary.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a voice recognition processing method for a power supply intelligent client: first, the client's voice signal is answered through the intelligent power supply seat; the voice signal is then preprocessed, the preprocessing including endpoint detection, noise elimination and intelligent interruption processing; feature extraction is performed on the preprocessed signal to obtain its voice features; and the extracted features are recognized through a pre-constructed voice recognition model to obtain and output a voice recognition result in text format. Because the voice signal is effectively preprocessed, recognition accuracy is improved; the intelligent interrupt processing mechanism raises the degree of intelligence and enables intelligent interrupt responses, improving the client's experience;
in addition, by training the speech recognition model, the accuracy of speech recognition can be improved.
Drawings
FIG. 1 is a schematic flowchart of an embodiment of a speech recognition processing method for a power supply intelligent client according to the present invention;
FIG. 2 is a more detailed flowchart of step S11 in FIG. 1;
FIG. 3 is a schematic diagram of an embodiment of a speech recognition processing system for a power supply intelligent client according to the present invention;
FIG. 4 is a schematic structural diagram of the preprocessing unit in FIG. 3;
FIG. 5 is a schematic structural diagram of the intelligent interruption unit in FIG. 4;
FIG. 6 is a schematic structural diagram of the speech recognition model construction unit in FIG. 3.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
To enable those skilled in the art to more clearly understand the objects, technical solutions and advantages of the present invention, the following further description is provided in conjunction with the accompanying drawings and embodiments.
Referring to fig. 1, a schematic flow chart of an embodiment of a speech recognition processing method for a power supply smart client according to the present invention is shown; referring to fig. 2 together, in this embodiment, the speech recognition processing method for a power supply smart client includes the following steps:
step S10, answering the voice signal of the client through the intelligent power supply seat;
step S11, preprocessing the voice signal of the client, wherein the preprocessing comprises endpoint detection, noise elimination and intelligent interruption processing;
in a specific example, the step S11 further includes:
step S110, converting the voice data into a data format suitable for acoustic model processing; the format is WAV or PCM audio, mono, 16-bit, at a 16000 Hz sampling rate;
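As an illustration of this conversion step, the sketch below downmixes two float channels in [-1, 1] to mono 16-bit PCM bytes. The function name and the assumption of float input are invented for illustration; a real pipeline would also resample the audio to 16 kHz, which is omitted here.

```python
import struct

TARGET_RATE = 16000  # 16 kHz mono 16-bit, as described in step S110

def to_mono_pcm16(samples_left, samples_right):
    """Downmix two float channels in [-1, 1] to mono 16-bit PCM bytes."""
    pcm = bytearray()
    for l, r in zip(samples_left, samples_right):
        mono = (l + r) / 2.0
        # clamp to [-1, 1] and scale to the signed 16-bit range
        val = max(-1.0, min(1.0, mono))
        pcm += struct.pack("<h", int(val * 32767))
    return bytes(pcm)
```

Each output sample occupies two little-endian bytes, matching the 16-bit PCM layout the acoustic model expects.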
step S111, identifying endpoints in the voice signal to determine the user's speaking start point and end point; once the user starts speaking, the voice flows to the downstream recognition engine (i.e., the voice recognition processing unit) until the end of the user's speech is detected. Silence detection within the voice recognition determines whether the user has finished speaking, so the recognition engine can begin recognizing while the user is still talking.
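The end-of-speech behaviour just described — speech is considered started once audio energy appears, and finished after sustained silence — can be sketched with a simple energy-based detector. The class name, threshold, and silence count below are illustrative assumptions, not the patent's prescribed method.

```python
class SilenceEndpointer:
    """Minimal energy-based end-of-speech detector (illustrative sketch).

    Speech is considered started once a frame's mean energy exceeds the
    threshold, and finished after `max_silence` consecutive quiet frames.
    """
    def __init__(self, threshold=0.01, max_silence=20):
        self.threshold = threshold
        self.max_silence = max_silence
        self.started = False
        self.quiet = 0

    def feed(self, frame):
        """Feed one frame of float samples; return True when the endpoint is reached."""
        energy = sum(s * s for s in frame) / len(frame)
        if energy > self.threshold:
            self.started = True
            self.quiet = 0
        elif self.started:
            self.quiet += 1
        return self.started and self.quiet >= self.max_silence
```

In a streaming setup, frames would be forwarded to the recognition engine as soon as `started` becomes true, well before `feed` signals the endpoint.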
Step S112, eliminating background noise based on scene analysis: first judging the scene type of the voice signal and, once the scene type is determined, performing noise elimination according to the preset noise elimination mode corresponding to that scene. Setting a dedicated noise elimination strategy for each scene type improves the noise elimination effect and safeguards the accuracy of subsequent voice recognition. A different noise reduction mode can be learned for each scene: various noise reduction algorithms are applied to the scene, the best-suited mode is identified, and thereafter each scene is denoised with its corresponding mode.
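One possible shape for this scene-dependent dispatch is a table mapping scene type to a denoising routine. The scene labels are hypothetical (the patent leaves the actual set unspecified), and the crude amplitude gate stands in for a real spectral-subtraction or Wiener-filter algorithm.

```python
def spectral_gate(frame, floor=0.02):
    # crude gate: zero out samples below the floor — a stand-in for a
    # real per-scene noise reduction algorithm
    return [s if abs(s) >= floor else 0.0 for s in frame]

def aggressive_gate(frame):
    # noisier scenes get a higher floor
    return spectral_gate(frame, floor=0.1)

# hypothetical scene labels mapped to their learned denoising mode
SCENE_DENOISERS = {
    "quiet_office": spectral_gate,
    "street": aggressive_gate,
}

def denoise(frame, scene):
    """Dispatch to the noise elimination mode preset for the given scene."""
    return SCENE_DENOISERS.get(scene, spectral_gate)(frame)
```

Unknown scenes fall back to the default routine, mirroring the idea that every scene type has a preset elimination mode.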
And step S113, responding in real time to the user's latest voice signal and, according to the configured interrupt processing mode, either interrupting the current task or continuing it. Intelligent interruption means the user can state a request at any time while the self-service voice prompt is still playing, without waiting for playback to finish; the system automatically detects this, immediately stops playing the prompt, and responds to the user's voice instruction. This makes human-computer interaction more efficient, faster and more natural, and enhances the customer experience.
Preferably, the step S111 further includes:
when a received voice signal triggers voice recognition, judging from the user's login information whether that user's speaking habits are stored; if so, adjusting the detection parameters according to those habits and then performing endpoint detection; if not, performing endpoint detection with the general-purpose detection parameters. Configuring the endpoint detection strategy around each user's speaking habits effectively improves the accuracy and efficiency of endpoint detection.
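A minimal sketch of this per-user parameter selection follows; the parameter names and the stored habit record are hypothetical, invented only to show the fallback logic.

```python
# general-purpose detection parameters used when no habits are stored
DEFAULT_PARAMS = {"energy_thr": 0.01, "trailing_silence_ms": 600}

# hypothetical stored speaking habits, keyed by login id
USER_HABITS = {
    "user42": {"trailing_silence_ms": 900},  # a slow, pause-heavy speaker
}

def detection_params(user_id):
    """Return endpoint-detection parameters adjusted for the user, if known."""
    params = dict(DEFAULT_PARAMS)
    params.update(USER_HABITS.get(user_id, {}))
    return params
```

A logged-in user with stored habits gets adjusted parameters; anyone else gets the general-purpose defaults.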
Preferably, in the step S111, the endpoints in the speech signal are identified by a time-domain feature method or a frequency-domain feature method. Endpoint detection determines the beginning and end of speech within a signal segment that contains speech. Effective endpoint detection not only reduces processing time but also removes noise interference from the silent portions. Two families of methods are currently in use: time-domain feature methods and frequency-domain feature methods. The time-domain method detects endpoints from the speech volume and the zero-crossing rate; its computation is light, but it may misjudge breathy (unvoiced) sounds, and different volume calculations can yield different detection results. The frequency-domain method detects speech from spectral variation and entropy, at a higher computational cost.
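The time-domain method's two cues — short-time energy (volume) and zero-crossing rate — can be computed per frame as below. The thresholds are illustrative; the dual test reflects that voiced speech has high energy while unvoiced speech has low energy but a high zero-crossing rate.

```python
def frame_features(frame):
    """Short-time energy and zero-crossing rate for one frame of float samples."""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    ) / (len(frame) - 1)
    return energy, zcr

def is_speech(frame, energy_thr=0.01, zcr_thr=0.4):
    # voiced speech: high energy; unvoiced speech: high zero-crossing rate
    energy, zcr = frame_features(frame)
    return energy > energy_thr or zcr > zcr_thr
```

Frames classified as speech delimit the start and end points; as the text notes, the choice of energy measure directly shifts where those boundaries fall.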
Preferably, the step S113 specifically includes:
presetting different interrupt processing mode types for different users, wherein the interrupt processing mode types comprise: an interruptible mode and a non-interruptible mode;
when voice recognition is carried out in real time, if a new voice signal is received, judging the type of the current interrupt processing mode;
if the current mode is the interruptible mode, interrupting the response of the current voice recognition when a new voice instruction is received;
and if the current mode is the non-interruptible mode, after the user triggers the voice recognition processing, a new voice signal instruction is accepted only once execution of the current instruction has finished.
It can be understood that when voice recognition runs in real time and a new voice signal arrives — the user interrupting with a fresh instruction — the platform responds promptly: the user can state a request at any time during prompt playback without waiting for it to finish, and the system automatically stops the prompt and responds to the instruction, making human-computer interaction more efficient, faster and more natural and enhancing the customer experience. Concretely, an interruptible mode and a non-interruptible mode can be provided. In the interruptible mode, receiving a new voice instruction interrupts the current voice recognition response; in the non-interruptible mode, after the user triggers voice recognition processing, a new voice signal instruction is accepted only once the current instruction has finished executing. The mode can be configured flexibly by each user according to personal habits.
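The per-user barge-in policy described above can be sketched as a small state machine; the class, mode names, and returned action strings are invented for illustration.

```python
from enum import Enum

class InterruptMode(Enum):
    INTERRUPTIBLE = "interruptible"
    NON_INTERRUPTIBLE = "non_interruptible"

class VoiceSession:
    """Sketch of the interruptible / non-interruptible dispatch."""
    def __init__(self, mode):
        self.mode = mode
        self.busy = False  # a prompt or instruction is currently playing

    def on_new_voice(self):
        """Return the action taken for a freshly received voice signal."""
        if not self.busy:
            self.busy = True
            return "accepted"
        if self.mode is InterruptMode.INTERRUPTIBLE:
            return "interrupt-current-task"   # stop the prompt, handle new signal
        return "deferred"                     # queue until the current task ends
```

An interruptible session cuts the playing prompt short; a non-interruptible one defers the new signal until the running instruction finishes.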
It is understood that in various embodiments, the preprocessing may also include filtering, A/D conversion, pre-emphasis, and the like.
Step S12, extracting the characteristics of the preprocessed voice signal to obtain the voice characteristics in the voice signal;
preferably, the step S12 further includes:
and performing feature extraction on the preprocessed voice signal using linear predictive coding (LPC) to obtain the voice features. The basic idea of linear predictive coding is that adjacent samples of a speech signal are correlated, so current and future sample values can be predicted by a linear combination of past samples; the linear prediction coefficients are uniquely determined by minimizing the mean-square error between the predicted signal and the actual signal.
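A self-contained sketch of LPC coefficient estimation via the autocorrelation method and the Levinson-Durbin recursion follows; this is one standard way to obtain the coefficients, though the patent does not prescribe a specific algorithm. It assumes the frame has nonzero energy.

```python
def lpc_coefficients(signal, order):
    """LPC coefficients via autocorrelation + Levinson-Durbin recursion."""
    n = len(signal)
    # autocorrelation lags 0..order
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy (must be nonzero)
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1 - k * k)
    # a[0] = 1; the predictor is x[n] ≈ -sum(a[j] * x[n - j], j = 1..order)
    return a
```

For a purely geometric signal x[n] = 0.9^n, a first-order predictor recovers a coefficient of about -0.9, i.e. x[n] ≈ 0.9·x[n-1].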
Preferably, further comprising: the method comprises the steps of pre-constructing a speech recognition model, wherein the speech recognition model comprises an acoustic model, a dictionary and a language model, and specifically comprises the following steps:
training voice characteristics corresponding to voice sample data in a voice database to obtain mapping from the voice characteristics to phonemes to form an acoustic model; wherein the acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accents, etc.;
training the texts in a text database to obtain the mapping between words and sentences, forming the language model; the language model is a knowledge representation of a set of word sequences and can be obtained by language-model training with the SRILM toolkit;
and constructing a mapping relation between the voice and the characters according to the acoustic model and the language model to form a dictionary.
In general, the acoustic model decodes the acoustic features of speech into phonemes or words, and the language model then decodes those words into a complete sentence.
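That decoding chain — acoustic model to phonemes, dictionary to words, language model scoring the word sequence — can be illustrated with toy lookup tables. Every table entry below is invented purely for illustration; real systems use statistical models, not exact lookups.

```python
# toy tables: feature frame -> phoneme, phoneme string -> word,
# and a bigram language model over words (all values invented)
ACOUSTIC = {(0.1,): "n", (0.2,): "i", (0.3,): "h", (0.4,): "ao"}
DICTIONARY = {"n i": "你", "h ao": "好"}
BIGRAM = {("<s>", "你"): 0.5, ("你", "好"): 0.9}

def decode(frames):
    """Frames -> phonemes -> words (greedy longest match) -> scored sentence."""
    phonemes = [ACOUSTIC[f] for f in frames]
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):   # try the longest span first
            key = " ".join(phonemes[i:j])
            if key in DICTIONARY:
                words.append(DICTIONARY[key])
                i = j
                break
        else:
            i += 1                              # skip an unmatchable phoneme
    score, prev = 1.0, "<s>"
    for w in words:
        score *= BIGRAM.get((prev, w), 0.01)    # small floor for unseen bigrams
        prev = w
    return "".join(words), score
```

In a real recognizer the three mappings are probabilistic and searched jointly, but the division of labour is the same as in this sketch.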
And step S13, recognizing the extracted voice features through a pre-constructed voice recognition model, obtaining a voice recognition result in a text format, and outputting the voice recognition result. And matching and comparing the input voice characteristics with the acoustic model during recognition to obtain the optimal recognition result.
Referring to FIG. 3, a schematic diagram of an embodiment of a speech recognition processing system for a power supply intelligent client according to the method of the present invention is shown; refer also to FIGS. 4 and 5. In this embodiment, the speech recognition processing system 1 for a power supply intelligent client comprises:
the voice input unit 10 is used for receiving voice signals of a customer through a power supply intelligent seat;
a preprocessing unit 11, configured to preprocess the client's voice signal, wherein the preprocessing comprises endpoint detection, noise elimination and intelligent interruption processing;
a feature extraction unit 12, configured to perform feature extraction on the preprocessed voice signal to obtain a voice feature in the voice signal;
a speech recognition model construction unit 13 configured to construct a speech recognition model in advance, where the speech recognition model includes an acoustic model, a dictionary, and a language model;
and the voice recognition processing unit 14 is used for recognizing the extracted voice features through a pre-constructed voice recognition model, obtaining a voice recognition result in a text format and outputting the voice recognition result.
In a specific example, the preprocessing unit 11 further includes:
a channel conversion unit 110, configured to convert the voice data into a data format suitable for acoustic model processing; the format is WAV or PCM audio, mono, 16-bit, at a 16000 Hz sampling rate;
an endpoint detection unit 111, configured to identify an endpoint in the speech signal to determine a user speaking start point and a user speaking end point;
the noise reduction unit 112 is configured to eliminate background noise, perform scene analysis, perform scene type judgment on the speech signal, and perform noise elimination according to a preset noise elimination manner corresponding to different scenes after determining the scene type;
and the intelligent interrupt unit 113 is used for responding to the latest voice signal of the user in real time and interrupting the current task or continuing the current task according to the set interrupt processing mode.
Specifically, in one example, the endpoint detection unit 111 further includes:
when a voice signal is received to trigger voice recognition, judging whether a speaking habit of a corresponding user is stored or not according to user login information, if so, performing endpoint detection on detection parameters after corresponding adjustment according to the speaking habit of the user; and if not, carrying out endpoint detection based on the universal detection parameters.
Preferably, the endpoint detection unit 111 further identifies the endpoints in the speech signal by a time-domain feature method or a frequency-domain feature method.
In a specific example, the intelligent interruption unit 113 specifically includes:
an interrupt mode type setting unit 1130, configured to preset different interrupt processing mode types for different users, where the interrupt processing mode types include: an interruptible mode and a non-interruptible mode;
an interrupt mode processing unit 1131, configured to determine a current interrupt processing mode type if a new voice signal is received when performing voice recognition in real time;
if the current mode is the interruptible mode, interrupting the response of the current voice recognition when a new voice instruction is received;
and if the current mode is the non-interruptible mode, after the user triggers the voice recognition processing, a new voice signal instruction is accepted only once execution of the current instruction has finished.
Preferably, the feature extraction unit is further configured to perform feature extraction on the preprocessed voice signal by using a linear predictive coding technique, so as to obtain a voice feature in the voice signal.
Specifically, in one example, in the speech recognition model construction unit 13, the acoustic model is used for establishing a pronunciation template for each pronunciation, and the acoustic model is a knowledge representation of the difference between acoustics, phonetics, environmental variables, speaker gender and accent; the language model is a knowledge representation of a set of word sequences, which is a mapping of words to words, words to sentences; the dictionary is constructed with a mapping relationship between speech and text.
Preferably, the speech recognition model construction unit 13 includes:
a voice database 130 for storing a plurality of voice sample data;
the feature pre-extraction unit 131 is configured to perform feature extraction on voice sample data in the voice database to obtain voice features corresponding to the voice sample data;
and an acoustic model training unit 132, configured to train the speech features, obtain a mapping from the speech features to phonemes, and form an acoustic model.
Preferably, the speech recognition model construction unit 13 includes:
a text database 133 storing a plurality of text sample data;
and the language model training unit 134 is configured to train the texts in the text database to obtain mappings between words and sentences, so as to form a language model.
Preferably, the speech recognition model building unit 13 further comprises:
the dictionary building unit 135 is configured to build a mapping relationship between the speech and the text according to the acoustic model and the language model to form a dictionary.
For more details, reference may be made to the foregoing description of fig. 1 and fig. 2, which is not repeated herein.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a voice recognition processing method for a power supply intelligent client: first, the client's voice signal is answered through the intelligent power supply seat; the voice signal is then preprocessed, the preprocessing including endpoint detection, noise elimination and intelligent interruption processing; feature extraction is performed on the preprocessed signal to obtain its voice features; and the extracted features are recognized through a pre-constructed voice recognition model to obtain and output a voice recognition result in text format. Because the voice signal is effectively preprocessed, recognition accuracy is improved; the intelligent interrupt processing mechanism raises the degree of intelligence and enables intelligent interrupt responses, improving the client's experience;
in addition, by training the speech recognition model, the accuracy of speech recognition can be improved.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (7)
1. A speech recognition processing method for a power supply intelligent client is characterized by comprising the following steps:
step S10, answering the voice signal of the client through the power supply intelligent agent;
step S11, preprocessing the voice signal of the client, wherein the preprocessing comprises endpoint detection, noise elimination and intelligent interruption processing;
step S12, extracting the characteristics of the preprocessed voice signal to obtain the voice characteristics in the voice signal;
and step S13, recognizing the extracted voice features through a pre-constructed voice recognition model, obtaining a voice recognition result in a text format, and outputting the voice recognition result.
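The four steps of claim 1 can be sketched as a small end-to-end pipeline. Everything below is a stand-in for illustration: the silence-trimming "preprocessing", the per-frame energy "feature", and the nearest-energy "model" are assumptions, not the patented components.

```python
def preprocess(signal):
    """Step S11 stand-in: trim leading/trailing silence (crude endpoint detection)."""
    voiced = [i for i, s in enumerate(signal) if abs(s) > 0.01]
    return signal[voiced[0]:voiced[-1] + 1] if voiced else []

def extract_features(signal, frame_len=4):
    """Step S12 stand-in: frame the signal and compute one energy value per frame."""
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal), frame_len)]

def recognize(features, model):
    """Step S13 stand-in: map each frame feature to the closest word in a toy model."""
    return " ".join(min(model, key=lambda w: abs(model[w] - f)) for f in features)

def pipeline(signal, model):
    """Claim 1 end to end: preprocess, extract features, recognize, output text."""
    return recognize(extract_features(preprocess(signal)), model)
```

With a toy model `{"hi": 0.0, "loud": 1.0}`, a trimmed four-sample frame of amplitude 0.5 has energy 1.0 and maps to `"loud"`.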
2. The method of claim 1, wherein the step S11 further comprises:
step S110, converting the voice data into a data format suitable for acoustic model processing, the data format being mono, 16-bit, 16000 Hz sampling-rate WAV or PCM audio;
step S111, identifying end points in the voice signal to determine the speaking start point and the speaking end point of the user;
step S112, eliminating background noise through scene analysis: first judging the scene type of the voice signal, and, once the scene type is determined, performing noise elimination according to the preset noise elimination mode corresponding to that scene;
and step S113, responding to the latest voice signal of the user in real time, and interrupting the current task or continuing the current task according to the set interrupt processing mode.
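Step S110's target format (mono, 16-bit, 16000 Hz WAV/PCM) can be reached with three small conversions. This is a minimal sketch using only the Python standard library; a real system would apply a low-pass filter before resampling and use a proper resampler rather than nearest-neighbour picking.

```python
import struct

TARGET_RATE = 16000  # Hz, the sampling rate required by step S110

def to_mono(stereo_pairs):
    """Down-mix stereo to mono by averaging each left/right sample pair."""
    return [(left + right) / 2.0 for left, right in stereo_pairs]

def resample(samples, src_rate, dst_rate=TARGET_RATE):
    """Nearest-neighbour resampling; a real system would low-pass filter first."""
    n_out = int(len(samples) * dst_rate / src_rate)
    return [samples[int(i * src_rate / dst_rate)] for i in range(n_out)]

def to_pcm16(samples):
    """Quantize floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))
```

The resulting byte string is raw PCM; wrapping it in a WAV container only adds a fixed header carrying the same channel, depth and rate parameters.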
3. The method of claim 2, wherein the step S111 further comprises:
when a received voice signal triggers voice recognition, judging from the user login information whether a speaking habit of the corresponding user is stored; if so, performing endpoint detection with the detection parameters adjusted according to the user's speaking habit; if not, performing endpoint detection with the universal detection parameters.
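The per-user parameter selection of claim 3 amounts to a lookup with a universal fallback. In this sketch the habit fields, parameter names, and adjustment factors are all assumptions made for illustration.

```python
# Universal fallback parameters (claim 3); the names and values are assumptions.
DEFAULT_PARAMS = {"energy_threshold": 0.02, "min_silence_ms": 500}

# Hypothetical per-user habit store, keyed by login identity.
USER_HABITS = {
    "user_42": {"speaks_softly": True, "long_pauses": True},
}

def detection_params(user_id):
    """Return endpoint-detection parameters, adjusted when a speaking habit is stored."""
    habit = USER_HABITS.get(user_id)
    params = dict(DEFAULT_PARAMS)
    if habit is None:                      # no stored habit: universal parameters
        return params
    if habit.get("speaks_softly"):
        params["energy_threshold"] /= 2    # quieter speech needs a lower threshold
    if habit.get("long_pauses"):
        params["min_silence_ms"] *= 2      # tolerate longer pauses before ending
    return params
```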
4. The method of claim 3, wherein in step S111, the endpoints in the speech signal are identified by a time domain feature method and a frequency domain feature method.
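A common concrete choice for claim 4 is short-time energy as the time-domain feature and per-frame spectral energy above a cutoff (via an FFT) as the frequency-domain feature. The thresholds and the 1 kHz cutoff below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def detect_endpoints(signal, rate=16000, frame_ms=20,
                     energy_thresh=1e-3, hf_thresh=1e-4):
    """Return (start_sample, end_sample) of the speech region, or None.

    Time-domain feature: short-time energy per frame.
    Frequency-domain feature: mean spectral energy above 1 kHz per frame.
    A frame counts as speech if either feature exceeds its threshold.
    """
    frame_len = rate * frame_ms // 1000
    speech = []
    for i in range(len(signal) // frame_len):
        frame = np.asarray(signal[i * frame_len:(i + 1) * frame_len], dtype=float)
        energy = float(np.mean(frame ** 2))                 # time domain
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / rate)
        hf_energy = float(spectrum[freqs > 1000.0].mean())  # frequency domain
        speech.append(energy > energy_thresh or hf_energy > hf_thresh)
    if not any(speech):
        return None
    start = speech.index(True)
    end = len(speech) - 1 - speech[::-1].index(True)
    return start * frame_len, (end + 1) * frame_len
```

Combining both features makes the detector more robust than either alone: low-energy fricatives still show high-frequency spectral energy, while low-frequency hum fails the frequency-domain test.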
5. The method according to claim 2, wherein the step S113 specifically includes:
presetting different interrupt processing mode types for different users, wherein the interrupt processing mode types comprise: an interruptible mode and a non-interruptible mode;
when voice recognition is carried out in real time, if a new voice signal is received, judging the type of the current interrupt processing mode;
if the current mode is the interruptible mode, interrupting the response of the current voice recognition when a new voice instruction is received;
and if the current mode is the non-interruptible mode, after the user triggers the voice recognition processing, receiving a new voice signal instruction only after execution of the current instruction is finished.
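The branching of claim 5 can be modeled as a small state machine. The class shape and the queueing of deferred signals in the non-interruptible mode are assumptions for illustration; the patent only specifies the two modes and their behavior on a new signal.

```python
INTERRUPTIBLE, NON_INTERRUPTIBLE = "interruptible", "non-interruptible"

class VoiceSession:
    """Minimal sketch of the two interrupt processing modes of claim 5."""

    def __init__(self, mode):
        self.mode = mode
        self.current_task = None   # the voice instruction being responded to
        self.pending = []          # signals deferred in non-interruptible mode

    def on_voice_signal(self, signal):
        if self.current_task is None:
            self.current_task = signal        # idle: start handling immediately
        elif self.mode == INTERRUPTIBLE:
            self.current_task = signal        # interrupt the current response
        else:
            self.pending.append(signal)       # defer until execution finishes

    def finish_task(self):
        """Called when the current instruction's execution is finished."""
        self.current_task = self.pending.pop(0) if self.pending else None
```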
6. The method of claim 5, wherein the step S12 further comprises:
and performing feature extraction on the preprocessed voice signal by adopting a linear predictive coding technology to obtain the voice feature in the voice signal.
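Claim 6's linear predictive coding can be realized with the autocorrelation method solved by the Levinson-Durbin recursion, a standard LPC formulation; the patent does not specify the variant or the model order, so both are assumptions here.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients a[1..order] of the model x[n] ~ -sum_j a[j] * x[n-j],
    computed by the autocorrelation method with the Levinson-Durbin recursion."""
    x = np.asarray(frame, dtype=float)
    # Autocorrelation sequence r[0..order].
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err              # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):       # update previous coefficients
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)        # shrink the residual energy
    return a[1:]
```

For a purely exponential signal x[n] = 0.9^n (an AR(1) process with no noise), the order-1 coefficient comes out close to -0.9, i.e. the recursion recovers the generating filter.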
7. The method of any of claims 1 to 6, further comprising: the method comprises the steps of pre-constructing a speech recognition model, wherein the speech recognition model comprises an acoustic model, a dictionary and a language model, and specifically comprises the following steps:
training on the voice features corresponding to the voice sample data in a voice database to obtain a mapping from voice features to phonemes, forming the acoustic model;
training on the texts in a text database to obtain the mappings between words and between words and sentences, forming the language model;
and constructing a mapping relation between the voice and the characters according to the acoustic model and the language model to form a dictionary.
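The language-model half of claim 7 can be illustrated with a toy bigram model over a handful of sentences; a production system would train on a large text corpus with smoothing, and the dictionary would then map each word to its phoneme sequence so the acoustic model's phoneme outputs can be joined to the language model's words. The sentence examples below are invented for illustration.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Toy bigram language model: P(w2 | w1) by relative frequency, with
    sentence-boundary markers. Real systems use large corpora and smoothing."""
    bigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            bigrams[(w1, w2)] += 1  # count each adjacent word pair
            contexts[w1] += 1       # count each left context
    def prob(w1, w2):
        return bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0
    return prob
```

Trained on `["check my bill", "check my meter"]`, the model assigns P(my | check) = 1.0 and P(bill | my) = 0.5, which is exactly the word-to-word mapping the claim describes at toy scale.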
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011059062.1A CN112185393A (en) | 2020-09-30 | 2020-09-30 | Voice recognition processing method for power supply intelligent client |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112185393A true CN112185393A (en) | 2021-01-05 |
Family
ID=73947098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011059062.1A Pending CN112185393A (en) | 2020-09-30 | 2020-09-30 | Voice recognition processing method for power supply intelligent client |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112185393A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109509471A (en) * | 2018-12-28 | 2019-03-22 | 浙江百应科技有限公司 | A method of the dialogue of intelligent sound robot is interrupted based on vad algorithm |
CN109859774A (en) * | 2019-01-02 | 2019-06-07 | 珠海格力电器股份有限公司 | Speech ciphering equipment and its end-point detection sensitivity adjustment method, device and storage medium |
CN110299152A (en) * | 2019-06-28 | 2019-10-01 | 北京猎户星空科技有限公司 | Interactive output control method, device, electronic equipment and storage medium |
CN110517697A (en) * | 2019-08-20 | 2019-11-29 | 中信银行股份有限公司 | Prompt tone intelligence cutting-off device for interactive voice response |
CN111540349A (en) * | 2020-03-27 | 2020-08-14 | 北京捷通华声科技股份有限公司 | Voice interruption method and device |
Similar Documents
Publication | Title
---|---
CN107437415B | Intelligent voice interaction method and system
WO2021128741A1 | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
US9916826B1 | Targeted detection of regions in speech processing data streams
WO2017084360A1 | Method and system for speech recognition
CN111429899A | Speech response processing method, device, equipment and medium based on artificial intelligence
CN109545197B | Voice instruction identification method and device and intelligent terminal
US7177810B2 | Method and apparatus for performing prosody-based endpointing of a speech signal
CN110364178B | Voice processing method and device, storage medium and electronic equipment
CN112185392A | Voice recognition processing system for power supply intelligent client
CN112614514B | Effective voice fragment detection method, related equipment and readable storage medium
CN112825248A | Voice processing method, model training method, interface display method and equipment
CN109215634A | A kind of method and its system of more word voice control on-off systems
CN112185385A | Intelligent client processing method and system for power supply field
CN110767240B | Equipment control method, equipment, storage medium and device for identifying child accent
CN112071310A | Speech recognition method and apparatus, electronic device, and storage medium
CN114385800A | Voice conversation method and device
CN114708856A | Voice processing method and related equipment thereof
CN110853669B | Audio identification method, device and equipment
CN115512687B | Voice sentence-breaking method and device, storage medium and electronic equipment
JP3721948B2 | Voice start edge detection method, voice section detection method in voice recognition apparatus, and voice recognition apparatus
Hirschberg et al. | Generalizing prosodic prediction of speech recognition errors
CN112185393A | Voice recognition processing method for power supply intelligent client
KR20050049207A | Dialogue-type continuous speech recognition system and using it endpoint detection method of speech
CN112185365A | Power supply intelligent client processing method and system
CN115331670A | Off-line voice remote controller for household appliances
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||