CN115862611A - Power private network voice input method and system based on intelligent voice interaction - Google Patents
Power private network voice input method and system based on intelligent voice interaction
- Publication number
- CN115862611A (application CN202211415612.8A)
- Authority
- CN
- China
- Prior art keywords
- voice
- voice input
- private network
- input
- power private
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Telephonic Communication Services (AREA)
Abstract
The invention belongs to the technical field of voice input, and particularly relates to a power private network voice input method and system based on intelligent voice interaction. In the method, after a voice input instruction collected by a voice response terminal is received, a voice input platform is triggered to obtain the voice data collected by the voice response terminal; the obtained voice data are denoised and subjected to audio preprocessing and then fed into a pre-trained voice recognition model for recognition; and the voice input result produced by the recognition model is loaded into an input window of the voice input platform. By recognizing and entering, through the voice recognition model, the voice data acquired in response to the voice input instruction, and by investigating and analyzing the application scenarios of a voice input method for the power private network, the method realizes voice input within the power private network and improves the efficiency with which the voice input platform processes power private network voice data during interaction between agents and customers.
Description
Technical Field
The invention belongs to the technical field of voice input, and particularly relates to a power private network voice input method and system based on intelligent voice interaction.
Background
With the continuous improvement of computer software and hardware, artificial intelligence has become popular worldwide. With the development of intelligent voice technology and cloud computing, many accurate and efficient voice input methods, such as the iFlytek voice input method, have appeared on the market, and voice input is faster and more efficient than traditional keyboard-based input. However, the input method used inside the State Grid is still keyboard-based, and there is no dedicated in-house voice input method; practical application inside the State Grid therefore faces the following problems:
(1) Existing input methods are oriented toward the general market; entering the many proper nouns used within the State Grid is cumbersome, and many such terms suffer from low input efficiency and frequent errors;
(2) Many State Grid business scenarios run on a local area network; the input methods currently used there are keyboard-based, and the voice input methods on the market, which rely on external-network cloud services, cannot be used in this environment, so input efficiency is low and working efficiency suffers;
(3) In power business scenarios, many computer and hardware terminals are operated in high-risk environments; with a voice input method, "just speak" input can be achieved, i.e. entries are made simply by speaking, which reduces the potential risk introduced by manual recording operations.
With the rapid development of artificial intelligence voice technology, intelligent voice technology has reached a practical level in scenarios such as human-computer interaction, voice transcription, and voice input. Combining intelligent voice technology with power services offers a feasible way to solve the above problems, so research on, and application of, a voice input method based on intelligent voice technology in the power private network is urgently needed to meet the practical application requirements of the power private network.
Disclosure of Invention
In view of this, by researching key technologies including deep learning, voice recognition based on a deep full-sequence convolutional neural network, voice preprocessing, and adaptation, and by combining them with a cloud platform supported by big data, the invention provides a power private network voice input method and system based on intelligent voice interaction.
One aspect of the invention provides a power private network voice input method based on intelligent voice interaction, applied to a voice input platform and comprising the following steps:
receiving a voice input instruction collected by a voice response terminal, wherein the voice input instruction comprises input request identification information;
triggering the voice input platform to acquire the voice data collected by the voice response terminal according to the voice input instruction;
denoising and audio-preprocessing the acquired voice data and then feeding the result into a pre-trained voice recognition model for recognition processing, wherein the voice recognition model is trained on a constructed corpus containing a power private network professional lexicon;
and obtaining a voice input result from the recognition processing of the voice recognition model, and loading the voice input result into an input window of the voice input platform.
Optionally, denoising the acquired voice data includes:
extracting factors effective for noise processing from the voice data, cutting the voice data based on these factors, and separating out the silent parts to obtain the denoised voice data.
Optionally, performing audio preprocessing on the acquired voice data includes:
performing pre-filtering, pre-emphasis, windowing and endpoint detection on the denoised agent-customer interaction voice data acquired by the voice response terminal, and subjecting the voice data to speaker-role separation and scene segmentation.
Optionally, the pre-filtering employs anti-aliasing filtering to filter out signal components or noise above half the sampling frequency, limiting the voice signal bandwidth to a preset range;
pre-emphasis boosts the high-frequency content of the pre-filtered voice signal; windowing divides the signal into frames; and endpoint detection, based on energy and zero-crossing rate, finds the start and end points of the speech within the background noise.
Optionally, framing of the speech signal is performed by weighting with a movable window of finite length, where the frame rate is 33-100 frames per second, overlapping segmentation is used, the ratio of frame shift to frame length is 0-0.5, and the window type is a rectangular, Hann, Hamming, or Blackman window.
Optionally, the recognition processing of the speech recognition model includes feature extraction and speech matching on the input speech data, where feature extraction extracts speech feature parameters, comprising time-domain and frequency-domain parameters, from the original input speech signal.
Optionally, the speech recognition model is trained on a constructed corpus containing a power private network professional lexicon, and the corpus is established as follows:
creating a power private network corpus according to categories of power professional terms, where the corpus material is drawn from forums and is first screened with a greedy algorithm and then screened a second time against a triphone model;
and training a continuous speech recognition model on the written-language corpus using HMMs and the maximum-similarity criterion.
Optionally, training the continuous speech recognition model according to the maximum-similarity criterion includes training monophone models with an embedded training algorithm and connecting them according to the textual context to obtain all physical triphone models.
In another aspect, the invention further provides a power private network voice input system based on intelligent voice interaction, configured to execute the above power private network voice input method based on intelligent voice interaction, the system comprising:
the instruction receiving module, used for receiving a voice input instruction collected by the voice response terminal, wherein the voice input instruction comprises input request identification information;
the voice data acquisition module is used for triggering the voice input platform to acquire voice data acquired by the voice response terminal according to the voice input instruction;
the voice data processing module is used for denoising and audio preprocessing the acquired voice data and then inputting a pre-trained voice recognition model for recognition processing;
and the result input loading module is used for obtaining a voice input result through the recognition processing of the voice recognition model and loading the voice input result in an input window of the voice input platform.
Optionally, the system further comprises a corpus construction module, configured to create a power private network corpus including a power private network professional lexicon according to the categories of power professional terms.
Another aspect of the invention provides a computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the power private network voice input method based on intelligent voice interaction.
The invention also comprises a storage medium storing a computer program which, when executed by a processor, implements the steps of the intelligent voice interaction-based power private network voice input method.
The technical scheme provided by the invention has the following beneficial effects:
according to the technical scheme provided by the invention, after a voice input instruction collected by a voice response terminal is received, a voice input platform is triggered to obtain voice data collected by the voice response terminal, the obtained voice data is denoised and subjected to audio preprocessing, a pre-trained voice recognition model is input for recognition processing, a voice input result is obtained through the recognition processing of the voice recognition model, and the voice input result is loaded on an input window of the voice input platform. The voice data acquired after the voice input instruction is responded is recognized and input through the voice recognition model, the application scene of the voice input method of the power private network is investigated and analyzed, the first-level corpus is constructed through the audio preprocessing technology, the voice characteristic parameter extraction technology and the language model processing technology, the professional lexicon of the voice input method of the power private network is further constructed, the voice recognition technology acoustic model and the language model are constructed, the application of the voice input method in the power private network is achieved, and the processing efficiency of the voice data of the power private network when a voice input platform interacts with seats and clients is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of a voice input method for a private power network based on intelligent voice interaction according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a private power network voice input method based on intelligent voice interaction in an embodiment of the present invention.
Fig. 3 is a system block diagram of a private power network voice input system based on intelligent voice interaction according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a power private network voice input method and system based on intelligent voice interaction by researching key technologies including deep learning, voice recognition based on a deep full-sequence convolutional neural network, voice preprocessing, and adaptation, and by combining them with a cloud platform supported by big data.
To facilitate understanding, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a private power network voice input method based on intelligent voice interaction according to an exemplary embodiment of the present application. Referring to fig. 1, the embodiment of the present application provides a power private network voice input method based on intelligent voice interaction, which is applied to a voice input platform; the method comprises the following steps:
S10: receiving a voice input instruction collected by a voice response terminal, wherein the voice input instruction comprises input request identification information;
S20: triggering the voice input platform to acquire the voice data collected by the voice response terminal according to the voice input instruction;
S30: denoising and audio-preprocessing the acquired voice data, and feeding the result into a pre-trained voice recognition model for recognition processing;
S40: obtaining a voice input result from the recognition processing of the voice recognition model, and loading the voice input result into an input window of the voice input platform.
In step S30, the speech recognition model is trained based on the constructed corpus containing the electric power private network professional lexicon.
In this embodiment, the speech recognition technology is Automatic Speech Recognition (ASR), which aims to convert the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes or character sequences. The main technical process of speech recognition in this embodiment is divided into several parts: denoising, preprocessing, feature extraction, model matching, and language model training.
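As a minimal, non-authoritative sketch of how these stages chain together, the following Python skeleton strings hypothetical denoising, preprocessing, feature-extraction, acoustic-matching and language-model stages into a single call; every name is an illustrative placeholder rather than an interface defined by this application, and each stage is illustrated in isolation in the sections below.

```python
from typing import Callable, Sequence

def asr_pipeline(
    samples: Sequence[float],
    denoise: Callable,            # noise / silence removal
    preprocess: Callable,         # pre-filtering, pre-emphasis, windowing, endpoint detection
    extract_features: Callable,   # time-domain / frequency-domain parameters (e.g. MFCC)
    acoustic_match: Callable,     # acoustic model scoring (e.g. HMM or neural network)
    language_rescore: Callable,   # language model (e.g. trigram) selects the final text
) -> str:
    """Run raw samples through the five stages named above and return text."""
    cleaned = denoise(samples)
    frames = preprocess(cleaned)
    feats = extract_features(frames)
    hypotheses = acoustic_match(feats)
    return language_rescore(hypotheses)
```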
Referring to fig. 1 and fig. 2, denoising the acquired voice data includes:
extracting factors effective for noise processing from the voice data, cutting the voice data based on these factors, and separating out the silent parts to obtain the denoised voice data.
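As a minimal sketch of the silence-separation idea, assuming the voice data is a NumPy array of samples and using short-time energy as a crude stand-in for the "effective factors" mentioned above (the actual combined denoising method may be more elaborate):

```python
import numpy as np

def trim_silence(signal: np.ndarray, sr: int,
                 frame_ms: int = 25, energy_ratio: float = 0.05) -> np.ndarray:
    """Cut the signal into fixed-length frames and keep only those whose
    short-time energy exceeds a fraction of the peak energy (drop silence).
    Assumes the signal is at least one frame long."""
    frame_len = max(1, int(sr * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    keep = energy > energy_ratio * energy.max()
    return frames[keep].reshape(-1)
```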
For performance analysis, this embodiment builds a speaker-independent, isolated-word, small-vocabulary speech recognition system using the combined denoising method, and then compares the recognition rates of different systems under several signal-to-noise-ratio conditions to verify the effectiveness of the combined denoising method.
In some embodiments, audio preprocessing of the acquired speech data includes:
performing pre-filtering, pre-emphasis, windowing and endpoint detection on the denoised agent-customer interaction voice data acquired by the voice response terminal, and subjecting the voice data to speaker-role separation and scene segmentation.
In this embodiment, the preprocessing techniques include pre-filtering, pre-emphasis, windowing, and endpoint detection. Preprocessing mainly covers sampling of the speech signal, anti-aliasing band-pass filtering to remove individual pronunciation differences and the noise introduced by devices and environments, and the selection of recognition primitives and endpoint detection for speech recognition.
In this embodiment, pre-filtering uses anti-aliasing filtering to remove signal components or noise above half the sampling frequency, limiting the bandwidth of the speech signal to a preset range. Pre-filtering is therefore also called anti-aliasing filtering: if the sampling rate does not satisfy the sampling theorem, spectral aliasing occurs and the high-frequency components of the signal are distorted. In essence, the pre-filter is a band-pass filter.
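A small illustration of the anti-aliasing step, assuming SciPy is available and showing it as a simple low-pass stage; the cutoff of 0.45·fs is an arbitrary margin below the Nyquist frequency fs/2, not a value prescribed by this application, and a band-pass variant would additionally remove low-frequency hum.

```python
import numpy as np
from scipy.signal import butter, lfilter

def anti_alias_filter(signal: np.ndarray, fs: int, order: int = 8) -> np.ndarray:
    """Low-pass the signal just below fs/2 so that components above half
    the sampling frequency cannot alias into the band of interest."""
    cutoff = 0.45 * fs                      # margin below the Nyquist frequency
    b, a = butter(order, cutoff, btype="low", fs=fs)
    return lfilter(b, a, signal)
```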
Pre-emphasis boosts the high-frequency content of the pre-filtered speech signal. Because of lip and nasal radiation, the speech spectrum falls off by about 6 dB per octave above roughly 800 Hz, i.e. the higher the frequency, the smaller the signal amplitude; as a result, spectral analysis cannot maintain the same signal-to-noise ratio across the whole band from low to high frequencies. For this reason, the speech signal is usually pre-emphasized to raise its high-frequency components.
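The pre-emphasis itself reduces to a first-order difference; the coefficient 0.97 below is a commonly used value assumed for illustration, not one specified by this application.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """y[n] = x[n] - alpha * x[n-1]: lifts the high-frequency part of the
    spectrum to offset the fall-off described above."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```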
Windowing and framing divide the speech signal into frames. In this embodiment, framing is performed by weighting the signal with a movable window of finite length; the frame rate is 33 to 100 frames per second, overlapping segmentation is used, the ratio of frame shift to frame length is 0 to 0.5, and the window type is a rectangular, Hann, Hamming or Blackman window.
In this embodiment, windowing and framing are carried out after the pre-emphasis digital filtering. Because the speech signal is short-time stationary (it can be regarded as approximately unchanged within 10-30 ms), it can be divided into short segments, i.e. frames, for processing; framing is implemented by weighting the signal with a movable window of finite length. Typically the frame rate is about 33 to 100 frames per second, depending on the situation. The usual framing method is overlapping segmentation, where consecutive frames overlap and the offset between them is the frame shift; the ratio of frame shift to frame length is generally 0-0.5. Common window types are the rectangular, Hann, Hamming and Blackman windows.
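A sketch of overlapping framing with a Hamming window; the 25 ms frame length and the frame-shift/frame-length ratio of 0.4 are assumed values chosen from within the ranges given above (at 16 kHz this yields 100 frames per second).

```python
import numpy as np

def frame_signal(signal: np.ndarray, sr: int,
                 frame_ms: float = 25.0, shift_ratio: float = 0.4) -> np.ndarray:
    """Split the signal into overlapping frames and weight each frame with a
    movable Hamming window; returns an array of shape (n_frames, frame_len)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = max(1, int(frame_len * shift_ratio))      # frame shift in samples
    if len(signal) < frame_len:                     # pad very short signals
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop: i * hop + frame_len] * window
                     for i in range(n_frames)])
```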
Endpoint detection means finding the start and end points of speech within background noise, a basic problem in the field of speech processing. In isolated-word recognition in particular, locating the signal range of each word is important, and determining the start and end points of the speech signal greatly reduces the amount of computation required by the system. A double-threshold decision is generally adopted: endpoint detection based on energy and zero-crossing rate.
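A crude double-threshold sketch over the frames produced above: a frame is treated as speech if either its short-time energy or its zero-crossing rate exceeds a threshold, and the speech segment runs from the first to the last such frame; both thresholds are illustrative assumptions.

```python
import numpy as np

def detect_endpoints(frames: np.ndarray,
                     energy_ratio: float = 0.1, zcr_thresh: float = 0.25):
    """Return (start_frame, end_frame) of the detected speech region,
    or None if no frame passes either threshold."""
    energy = (frames ** 2).sum(axis=1)
    sign_changes = np.abs(np.diff(np.sign(frames), axis=1)) > 0
    zcr = sign_changes.mean(axis=1)                 # zero-crossing rate per frame
    active = (energy > energy_ratio * energy.max()) | (zcr > zcr_thresh)
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return None
    return int(idx[0]), int(idx[-1])
```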
In this embodiment, the recognition processing of the speech recognition model includes feature extraction and speech matching on the input speech data, where feature extraction extracts speech feature parameters, comprising time-domain and frequency-domain parameters, from the original input speech signal.
Feature parameter extraction derives, from the original speech signal, feature parameters that reflect the essence of the speech and forms them into a vector sequence. Time-domain parameters include the short-time average energy, the short-time average zero-crossing rate, the pitch period, and so on; the short-time average zero-crossing rate and short-time average energy are commonly used for detecting speech endpoints, while the pitch period is used to distinguish unvoiced from voiced sounds and the tones of Chinese characters. Frequency-domain parameters include the first three formants (amplitude, frequency, bandwidth) of the short-time spectrum (filter-bank average spectrum, DFT spectrum) and cepstral coefficients such as LPCC (cepstrum based on linear predictive coding) and MFCC (cepstrum based on Mel-frequency warping).
Different parameters can be extracted with methods such as the Fast Fourier Transform (FFT), average zero-crossing rate analysis (RBZ), and Gaussian mixture models (GMM).
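As one concrete example of the frequency-domain parameters, the sketch below computes MFCCs and their first-order deltas with librosa; the 16 kHz sample rate and 13 coefficients are assumed defaults rather than values fixed by this application.

```python
import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return an array of shape (2 * n_mfcc, n_frames): MFCCs stacked with
    their first-order deltas, one column per analysis frame."""
    y, sr = librosa.load(path, sr=16000)            # assumed 16 kHz sampling
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)             # first-order dynamic features
    return np.vstack([mfcc, delta])
```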
In this embodiment, the speech recognition model is trained on a constructed corpus containing a power private network professional lexicon, and the corpus is established as follows:
creating a power private network corpus according to categories of power professional terms, where the corpus material is drawn from forums and is first screened with a greedy algorithm and then screened a second time against a triphone model (a sketch of the greedy screening step follows after this list);
and training a continuous speech recognition model on the written-language corpus using HMMs and the maximum-similarity criterion.
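A minimal sketch of the greedy screening step referenced above. For simplicity it approximates triphone coverage with character trigrams; a real pipeline would expand each sentence to phone-level triphones through a pronunciation lexicon before scoring.

```python
def greedy_select(sentences, n_select):
    """Repeatedly pick the sentence that adds the most unseen trigrams,
    so that a small sub-corpus covers as many contexts as possible."""
    def trigrams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}

    covered, chosen, pool = set(), [], list(sentences)
    for _ in range(min(n_select, len(pool))):
        best = max(pool, key=lambda s: len(trigrams(s) - covered))
        gain = trigrams(best) - covered
        if not gain:                    # nothing new left to cover
            break
        chosen.append(best)
        covered |= gain
        pool.remove(best)
    return chosen
```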
In this embodiment, training the continuous speech recognition model according to the maximum-similarity criterion includes training monophone models with an embedded training algorithm and connecting them according to the textual context to obtain all physical triphone models. For the language model, a statistical trigram language model can be adopted.
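Embedded monophone training and the expansion to context-dependent triphones are normally carried out with a dedicated speech toolkit; as a small, self-contained illustration of the trigram language model mentioned at the end of the paragraph, the sketch below builds an unsmoothed count-based model (a deployed system would add smoothing such as back-off).

```python
from collections import Counter

def train_trigram_lm(token_lists):
    """Estimate P(w3 | w1, w2) from raw trigram / bigram counts over
    tokenized sentences; returns a probability lookup function."""
    tri, bi = Counter(), Counter()
    for tokens in token_lists:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1

    def prob(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    return prob
```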
According to the power private network voice input method based on intelligent voice interaction described above, the voice data acquired in response to the voice input instruction are recognized and entered through the voice recognition model. By investigating and analyzing the application scenarios of a power private network voice input method, building a first-level corpus with audio preprocessing, voice feature parameter extraction and language model processing technologies, further building a professional lexicon for the power private network voice input method, and constructing the acoustic model and language model of the voice recognition technology, the method realizes voice input within the power private network and improves the efficiency with which the voice input platform processes power private network voice data during interaction between agents and customers.
As shown in fig. 3, in an embodiment of the present invention, an electric power private network voice input system based on intelligent voice interaction includes an instruction receiving module 100, a voice data acquiring module 200, a voice data processing module 300, and a result input loading module 400.
The instruction receiving module 100 is configured to receive a voice input instruction collected by a voice response terminal, where the voice input instruction includes input request identification information.
The voice data obtaining module 200 is configured to trigger the voice input platform to obtain the voice data collected by the voice response terminal according to the voice input instruction.
The voice data processing module 300 is configured to denoise the acquired voice data and perform audio preprocessing, and then input a pre-trained voice recognition model for recognition processing.
The result input loading module 400 is configured to obtain a voice input result through the voice recognition model recognition processing, and load the voice input result in an input window of the voice input platform.
The electric power private network voice input system based on intelligent voice interaction further comprises a corpus construction module used for creating an electric power private network corpus containing electric power private network professional word banks according to the category of electric power professional terms.
In this embodiment, the power private network voice input system based on intelligent voice interaction executes the steps of the power private network voice input method based on intelligent voice interaction described above, so its operation is not described again in detail.
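A minimal sketch of how the four modules of fig. 3 could be composed in code; the constructor arguments and the `fetch_audio` / `load` calls on the terminal and input window are illustrative assumptions, not interfaces defined by this application.

```python
from typing import Callable, List

class PowerVoiceInputSystem:
    """Instruction receiving, voice data acquisition, voice data processing
    and result input loading modules composed into one object."""

    def __init__(self, denoise: Callable, preprocess: Callable, recognize: Callable):
        self._denoise = denoise
        self._preprocess = preprocess
        self._recognize = recognize          # pre-trained recognition model

    def receive_instruction(self, instruction: dict) -> str:
        # Instruction receiving module: extract the input request identification.
        return instruction["request_id"]

    def acquire_voice_data(self, terminal, request_id: str) -> List[float]:
        # Voice data acquisition module: fetch the audio the terminal collected.
        return terminal.fetch_audio(request_id)

    def process_voice_data(self, audio: List[float]) -> str:
        # Voice data processing module: denoise, preprocess, then recognize.
        return self._recognize(self._preprocess(self._denoise(audio)))

    def load_result(self, input_window, text: str) -> None:
        # Result input loading module: place the recognized text in the window.
        input_window.load(text)
```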
In an embodiment of the present invention, there is further provided a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the above method embodiments when executing the computer program, the method including the following steps:
receiving a voice input instruction collected by a voice response terminal, wherein the voice input instruction comprises input request identification information;
triggering the voice input platform to acquire the voice data collected by the voice response terminal according to the voice input instruction;
denoising and audio-preprocessing the acquired voice data and then feeding the result into a pre-trained voice recognition model for recognition processing, wherein the voice recognition model is trained on a constructed corpus containing a power private network professional lexicon;
and obtaining a voice input result from the recognition processing of the voice recognition model, and loading the voice input result into an input window of the voice input platform.
There is further provided in an embodiment of the present invention a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps in the above-described method embodiments, the method comprising the steps of:
receiving a voice input instruction collected by a voice response terminal, wherein the voice input instruction comprises input request identification information;
triggering the voice input platform to acquire the voice data collected by the voice response terminal according to the voice input instruction;
denoising and audio-preprocessing the acquired voice data and then feeding the result into a pre-trained voice recognition model for recognition processing, wherein the voice recognition model is trained on a constructed corpus containing a power private network professional lexicon;
and obtaining a voice input result from the recognition processing of the voice recognition model, and loading the voice input result into an input window of the voice input platform.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory.
In addition, part of the present application may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium, and/or transmitted via a data stream on a broadcast or other signal-bearing medium, and/or stored within a working memory of a computer device operating in accordance with the program instructions.
In summary, according to the technical scheme provided by the invention, after a voice input instruction collected by a voice response terminal is received, the voice input platform is triggered to obtain the voice data collected by the voice response terminal; the obtained voice data are denoised and audio-preprocessed and then fed into a pre-trained voice recognition model for recognition; and the voice input result produced by the recognition is loaded into an input window of the voice input platform. The voice data acquired in response to the voice input instruction are recognized and entered through the voice recognition model. By investigating and analyzing the application scenarios of a power private network voice input method, building a first-level corpus with audio preprocessing, voice feature parameter extraction and language model processing technologies, further building a professional lexicon for the power private network voice input method, and constructing the acoustic model and language model of the voice recognition technology, the invention realizes voice input within the power private network and improves the efficiency with which the voice input platform processes power private network voice data during interaction between agents and customers.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A power private network voice input method based on intelligent voice interaction is characterized by being applied to a voice input platform and comprising the following steps:
receiving a voice input instruction collected by a voice response terminal, wherein the voice input instruction comprises input request identification information;
triggering the voice input platform to acquire the voice data collected by the voice response terminal according to the voice input instruction;
denoising and audio-preprocessing the acquired voice data and then feeding the result into a pre-trained voice recognition model for recognition processing, wherein the voice recognition model is trained on a constructed corpus containing a power private network professional lexicon;
and obtaining a voice input result from the recognition processing of the voice recognition model, and loading the voice input result into an input window of the voice input platform.
2. The intelligent voice interaction-based power private network voice input method according to claim 1, wherein denoising the acquired voice data comprises:
extracting factors effective for noise processing from the voice data, cutting the voice data based on these factors, and separating out the silent parts to obtain the denoised voice data.
3. The intelligent voice interaction-based power private network voice input method according to claim 2, wherein performing the audio preprocessing on the acquired voice data comprises:
performing pre-filtering, pre-emphasis, windowing and endpoint detection on the denoised agent-customer interaction voice data acquired by the voice response terminal, and subjecting the voice data to speaker-role separation and scene segmentation.
4. The power private network voice input method based on intelligent voice interaction according to claim 3, wherein the pre-filtering adopts anti-aliasing filtering to filter out signal components or noise above half the sampling frequency, so as to limit the voice signal bandwidth to a preset range;
pre-emphasis boosts the high-frequency content of the pre-filtered voice signal; windowing divides the signal into frames; and endpoint detection, based on energy and zero-crossing rate, finds the start and end points of the speech within the background noise.
5. The power private network voice input method based on intelligent voice interaction according to claim 4, wherein the framing of the voice signal is performed by weighting with a movable window of finite length, wherein the frame rate is 33-100 frames per second, overlapping segmentation is adopted, the ratio of frame shift to frame length is 0-0.5, and the window types are a rectangular window, a Hann window, a Hamming window and a Blackman window.
6. The method as claimed in claim 1, wherein the recognizing process of the voice recognition model includes feature extraction and voice matching of the input voice data, wherein the feature extraction is to extract voice feature parameters from the input original voice signal, and the voice feature parameters include time domain parameters and frequency domain parameters.
7. The intelligent voice interaction-based power private network voice input method according to claim 6, wherein the voice recognition model is trained on a constructed corpus containing a power private network professional lexicon, and the corpus is established as follows:
creating a power private network corpus according to categories of power professional terms, wherein the corpus material is drawn from forums and is first screened with a greedy algorithm and then screened a second time against a triphone model;
and training a continuous speech recognition model on the written-language corpus using HMMs and the maximum-similarity criterion.
8. The intelligent voice interaction-based power private network voice input method according to claim 2, wherein training the continuous voice recognition model according to the maximum-similarity criterion comprises: training monophone models with an embedded training algorithm and connecting them according to the textual context to obtain all physical triphone models.
9. An intelligent voice interaction-based power private network voice input system, which is used for executing the intelligent voice interaction-based power private network voice input method of any one of claims 1 to 8, and comprises:
the instruction receiving module is used for receiving a voice input instruction collected by the voice response terminal, wherein the voice input instruction comprises input request identification information;
the voice data acquisition module is used for triggering the voice input platform to acquire voice data acquired by the voice response terminal according to the voice input instruction;
the voice data processing module is used for denoising and audio preprocessing the acquired voice data and then inputting a pre-trained voice recognition model for recognition processing;
and the result input loading module is used for obtaining a voice input result through the recognition processing of the voice recognition model and loading the voice input result in an input window of the voice input platform.
10. The intelligent voice interaction-based power private network voice input system according to claim 9, further comprising a corpus construction module for creating a power private network corpus including a power private network professional lexicon according to the category of the power professional term.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211415612.8A CN115862611A (en) | 2022-11-11 | 2022-11-11 | Power private network voice input method and system based on intelligent voice interaction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211415612.8A CN115862611A (en) | 2022-11-11 | 2022-11-11 | Power private network voice input method and system based on intelligent voice interaction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115862611A (en) | 2023-03-28
Family
ID=85663217
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211415612.8A (CN115862611A, pending) | Power private network voice input method and system based on intelligent voice interaction | 2022-11-11 | 2022-11-11
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115862611A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |