CN116229988A

CN116229988A - Voiceprint recognition and authentication method, system and device for personnel of power dispatching system

Info

Publication number: CN116229988A
Application number: CN202310297752.8A
Authority: CN
Inventors: 张雄威; 衷宇清; 崔兆阳; 凌健文; 徐武华; 蒋盛智; 彭丽文; 周上; 罗慕尧; 骆雅菲; 刘晨辉; 孔嘉麟; 陈文文; 张思敏; 周菲; 吴若迪; 冯雅雯
Original assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2023-03-23
Filing date: 2023-03-23
Publication date: 2023-06-06

Abstract

The invention provides a voiceprint recognition and authentication method, a voiceprint recognition and authentication system and a voiceprint recognition and authentication device for personnel of a power dispatching system, wherein the method comprises the following steps: the user sends an operation request and a voice signal to the power dispatching system; removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal; constructing a voiceprint recognition model; the power dispatching system matches the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model; and allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful. The invention can accurately recognize the user voice under the condition of being interfered by current and noise.

Description

Voiceprint recognition and authentication method, system and device for personnel of power dispatching system

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to the technical field of voiceprint recognition, and particularly relates to a voiceprint recognition and authentication method, system and device for power dispatching system personnel.

Background

As artificial intelligence matures, there is a growing need for an intelligent dispatching voice processing platform that recognizes, analyzes and diagnoses dispatching voice information and helps dispatchers respond promptly, judge accurately and analyze efficiently.

Time-frequency analysis is a common approach in the field of acoustic signal processing. However, the acoustic signal of an operating dispatcher is inevitably affected by current interference, noise and the like, so the signals monitored at different times vary and have broadband, non-stationary characteristics. Their time-frequency characteristics are correspondingly complex, and it is difficult to analyze the acoustic signal directly to distinguish the different working states of the dispatcher. How to improve the accuracy of identifying the dispatcher's working state is a problem to be solved.

Disclosure of Invention

The invention aims to provide a voiceprint recognition and authentication method, system and device for power dispatching system personnel.

A voiceprint recognition and authentication method for power dispatching system personnel comprises the following steps:

the user sends an operation request and a voice signal to the power dispatching system;

removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal;

constructing a voiceprint recognition model;

the power dispatching system matches the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;

and allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.

Removing the components which do not belong to the user voice from the user voice signal to obtain a pure user voice signal specifically comprises the following steps:

acquiring a first voice signal from an input end of the power dispatching system, wherein the first voice signal comprises the voices of both the calling party and the called party;

adding a sidetone cancellation circuit to the transmission line of the power dispatching system and acquiring a second voice signal from the microphone end of the calling party, wherein the strength of the calling party's voice in the second voice signal is far greater than that of the called party's voice;

and performing voice signal intensity analysis and signal comparison on the first voice signal and the second voice signal, and separating out the voice signal of the calling party to obtain a pure user voice signal.

After the pure user voice signal is obtained, the method further comprises the step of preprocessing the pure user voice signal, and specifically comprises the following steps:

framing the clean user speech signal, and multiplying the speech signal s(n) by a window function w(n) to form a windowed speech signal:

s_w(n) = s(n) · w(n);

reducing the slope at both ends of the time window by adopting a Hamming window, the expression of the Hamming window being as follows:

different values of a produce different Hamming windows;

boosting the high-frequency components of the pure user voice signal by pre-emphasis, and suppressing the low-frequency components of the pure user voice signal by de-emphasis;

and performing endpoint detection on the pure user voice signal.

Endpoint detection of clean user speech signals includes:

detecting unvoiced sound by using a short-time zero-crossing rate detection algorithm combining short-time energy and zero-crossing rate detection, and detecting voiced sound by using short-time energy;

and selecting a corresponding unvoiced sound model and a corresponding voiced sound model according to the voiced sound and the unvoiced sound of the voice signal to detect the endpoint of the pure user voice signal.

Selecting corresponding unvoiced models and voiced models according to voiced and unvoiced sounds of the voice signal to perform endpoint detection of the clean user voice signal comprises:

for unvoiced sound, the corresponding excitation model is simulated as random white noise, using a sequence with zero mean and unit variance whose distribution is white over time and amplitude;

when voiced sound, intermittent pulse waves are generated, and the mathematical expression is as follows:

in the above formula, N1 is the time of the rising part of the oblique triangular wave, and N2 is the time of the falling part thereof;

after the speech signal is framed, the energy of the nth frame of the speech signal, x_n(m), can be expressed as:

the short-time zero-crossing rate is the number of times that the waveform of the voice signal in one frame of voice passes through the horizontal axis, namely the zero level, and can be expressed as:

where sgn () is a sign function, the number of zero crossings is evaluated by looking at whether a sign change on the waveform has occurred between the current sampled signal and the last sampled signal.
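The three expressions referred to above (the voiced excitation pulse, the short-time energy and the short-time zero-crossing rate) are not reproduced in this text. As a reading aid only, the block below gives the standard textbook forms that match the symbols defined here (N1, N2, x_n(m), sgn); the frame length N and the exact summation limits are assumptions, and the original drawings may use a different normalization.

```latex
% Voiced excitation: oblique triangular (Rosenberg-type) pulse, assumed form
g(n) =
\begin{cases}
\tfrac{1}{2}\left[1 - \cos\left(\pi n / N_1\right)\right], & 0 \le n \le N_1 \\
\cos\left(\pi (n - N_1) / (2 N_2)\right), & N_1 \le n \le N_1 + N_2 \\
0, & \text{otherwise}
\end{cases}

% Short-time energy of the n-th frame x_n(m), frame length N (assumed)
E_n = \sum_{m=0}^{N-1} x_n^2(m)

% Short-time zero-crossing rate of the n-th frame
Z_n = \frac{1}{2} \sum_{m=1}^{N-1}
      \left| \operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)] \right|
```

In this form the pulse rises from 0 to 1 over N1 samples and falls back to 0 over the next N2 samples, and each sign change between consecutive samples contributes exactly one count to Z_n.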

The voiceprint recognition model is formed by connecting a convolutional neural network (CNN) and a long short-term memory (LSTM) network in series.

Before the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against the voiceprint information recorded in advance, the method further comprises training the voiceprint recognition model, specifically:

dividing the preprocessed voice signals into a training set and a testing set;

inputting the training set into a voiceprint recognition model;

outputting a matching result of the voice signal by the voiceprint recognition model, if the matching is successful, outputting a user identity, and if the matching is unsuccessful, outputting no personnel information;

and iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.

The power dispatching system matches the received user voice signal with the voice signal pre-recorded by the personnel with the operation authority by using a trained voiceprint recognition model, which comprises the following steps:

extracting the user voice signal, and generating a corresponding WAV file by using a PCM code;

the power dispatching system forwards the corresponding WAV file to the voiceprint recognition model;

taking out a voice signal which is recorded in advance by a person with the operation authority in the power dispatching system and an extracted user voice signal to perform signal matching;

and judging the user operation authority according to the matching result.

A power dispatching system personnel voiceprint recognition authentication system, comprising:

the receiving module is used for receiving an operation request and a voice signal sent by a user to the power dispatching system;

the first data processing module is used for constructing a voiceprint recognition model;

the second data processing module is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;

and the result output module is used for allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.

A power dispatching system personnel voiceprint recognition and authentication device is connected with the power dispatching system personnel voiceprint recognition and authentication system through a data transmission path, so that the device executes the power dispatching system personnel voiceprint recognition and authentication method, the device comprising:

the data acquisition unit is used for acquiring an operation request and a voice signal sent by a user to the power dispatching system;

the model building unit is used for building a voiceprint recognition model;

the judging unit is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;

and the output unit is used for outputting the judging result of the judging unit, allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.

According to the invention, the user sends an operation request and a voice signal to the power dispatching system; removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal; constructing a voiceprint recognition model; the power dispatching system matches the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model; and allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful. Telephone voice signal extraction can be carried out from the input end and the microphone end of the dispatching telephone at the same time, voices which do not belong to a calling party are removed through voice comparison of the telephone input end and the microphone end, the purification precision of user voice signals is improved, the processed user voice signals can enable a voiceprint recognition model to judge user voice information more accurately, work of a dispatcher is reduced, and dispatching efficiency is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of the method for obtaining clean user speech signals according to the present invention;

FIG. 3 is a flow chart of the voiceprint recognition model training of the present invention;

FIG. 4 is a flowchart illustrating the operation of the voiceprint recognition model of the present invention;

FIG. 5 is a short-term processing diagram of a speech signal according to the present invention;

FIG. 6 shows the normalized time-domain and frequency-domain signals of the Hamming window according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that all directional indicators (such as up, down, left, right, front, rear and so on) in the embodiments of the present invention are merely used to explain the relative positional relationship, movement, etc. between the components in a particular posture (as shown in the drawings), and if the particular posture is changed, the directional indicator is changed accordingly.

Furthermore, the description of "first," "second," etc. in this disclosure is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions contradict each other or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.

Neural-network-based voice recognition is easily disturbed by external environmental noise and by other speakers' voices, which makes the recognition result inaccurate. The method removes this interference to obtain a pure voice signal of the target speaker, which improves the recognition accuracy of the voiceprint recognition model. In addition, the features extracted by a single convolutional network model are limited, which also makes the recognition result inaccurate; the voiceprint recognition model is therefore formed by combining a convolutional neural network with a long short-term memory network to improve recognition accuracy.

Example 1

S100, a user sends an operation request and a voice signal to the power dispatching system;

S200, removing components which do not belong to the user voice from the user voice signal to obtain a pure user voice signal;

S300, constructing a voiceprint recognition model;

S400, the power dispatching system matches the received user voice signal with the voice signal recorded in advance by personnel with the operation authority by using a trained voiceprint recognition model;

S500, allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.

S200, removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal, wherein the method specifically comprises the following steps:

S201, a first voice signal is acquired from an input end of the power dispatching system, wherein the first voice signal comprises the voices of both the calling party and the called party;

S202, a sidetone cancellation circuit is added to the transmission line of the power dispatching system, and a second voice signal is acquired from the microphone end of the calling party, wherein the strength of the calling party's voice in the second voice signal is far greater than that of the called party's voice;

S203, performing voice signal intensity analysis and signal comparison on the first voice signal and the second voice signal, and separating out the voice signal of the calling party to obtain a pure user voice signal.

After the clean user voice signal is obtained in S200, S210 preprocesses the clean user voice signal, specifically:

S211, framing the clean user voice signal, and multiplying the voice signal s(n) by a window function w(n) to form a windowed voice signal:

s_w(n) = s(n) · w(n);

The analysis and processing of a speech signal must be carried out on a short-time basis: the speech signal is divided into segments, each called a frame and typically 10 ms to 30 ms long, and the characteristic parameters of each segment are analyzed. For the speech signal as a whole, what is analyzed is the time series formed by the characteristic parameters of each frame.

For speech signal processing, 33 to 100 frames per second are typically taken (10 ms to 30 ms per frame). Although contiguous segmentation may be used, overlapping segmentation is more common, as shown in FIG. 5; its basic purpose is to make the transition from frame to frame smooth and to maintain continuity. The overlapping portion of the previous frame and the next frame is referred to as the frame shift. The ratio of frame shift to frame length is typically between 0 and 1/2.
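The overlapping segmentation described above can be sketched as follows. This is a minimal illustration, not the implementation used by the invention: the function name `frame_signal`, the 8 kHz sampling rate, the 25 ms frame length and the half-frame overlap are all assumptions chosen to fall inside the ranges stated in the text. Following the text, `overlap` is the portion shared by adjacent frames, so consecutive frames start `frame_len - overlap` samples apart.

```python
import numpy as np

def frame_signal(signal, frame_len, overlap):
    """Split a 1-D signal into overlapping frames.

    `overlap` is the number of samples shared by adjacent frames.
    Trailing samples that do not fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    step = frame_len - overlap  # distance between frame starts
    if step <= 0:
        raise ValueError("overlap must be smaller than frame_len")
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // step
    # Row i holds samples [i*step, i*step + frame_len)
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx]

fs = 8000                      # assumed sampling rate in Hz
frame_len = fs * 25 // 1000    # 25 ms frame -> 200 samples
overlap = frame_len // 2       # overlap/frame length = 1/2
frames = frame_signal(np.arange(1000), frame_len, overlap)
```

With these assumed values, a 1000-sample input yields 9 frames of 200 samples, each starting 100 samples after the previous one.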

S212, reducing the gradients at both ends of the time window by adopting a Hamming window, the expression of the Hamming window being as follows:

different values of a produce different Hamming windows;

A good window function meets two criteria. In the time domain, because the speech waveform is multiplied by the window function, the slope at both ends of the time window should be small, so that the window edges change gradually and transition smoothly to zero; the truncated speech waveform then decays slowly to zero, which reduces the truncation effect on the speech frame. In the frequency domain, a wider 3 dB bandwidth and a smaller maximum sidelobe are required.

Comparing the Hamming window with the rectangular window, the main lobe of the Hamming window is twice as wide as that of the rectangular window, that is, its bandwidth is doubled, and its out-of-band attenuation is also doubled. The rectangular window has better smoothness, but loses high-frequency content, which results in loss of waveform details. On balance, the Hamming window is more suitable than the rectangular window.

Out-of-band attenuation: the ratio of the signal amplitude at a frequency outside the passband (e.g., one octave or one decade above the corner frequency) to the signal amplitude within the passband.
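The Hamming window expression of S212 is not reproduced in this text. The sketch below assumes the common generalized form w(n) = (1 - a) - a·cos(2πn/(N - 1)) for 0 ≤ n ≤ N - 1, in which a = 0.46 gives the classic Hamming window; the function name `generalized_hamming` and the 200-sample frame are illustrative assumptions.

```python
import numpy as np

def generalized_hamming(N, a=0.46):
    """Assumed generalized Hamming window: (1 - a) - a*cos(2*pi*n/(N - 1))."""
    n = np.arange(N)
    return (1.0 - a) - a * np.cos(2.0 * np.pi * n / (N - 1))

# Window one frame: s_w(n) = s(n) * w(n), element-wise multiplication
frame = np.ones(200)            # placeholder frame of 200 samples
w = generalized_hamming(200)
windowed = frame * w
```

Under this assumption, a = 0.46 reproduces `numpy.hamming` and a = 0.5 reproduces `numpy.hanning`, which is one way to see how different values of a produce different windows.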

S213, the pre-emphasis is adopted to boost the high-frequency component of the pure user voice signal and the de-emphasis is adopted to suppress the low-frequency component of the pure user voice signal;

The energy of a speech signal is concentrated in the low-frequency band and drops markedly in the high-frequency band, whereas the power spectral density of the noise output by the frequency discriminator increases with the square of the frequency. The signal-to-noise ratio of the audio signal is therefore large at the low-frequency end and markedly smaller at the high-frequency end. For this reason, pre-emphasis (boosting the high-frequency components of the signal before processing) and de-emphasis (attenuating the corresponding high-frequency components after processing) can be adopted.
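The text does not specify the filters used for pre-emphasis and de-emphasis. A minimal sketch, assuming the common first-order pair y(n) = x(n) - α·x(n-1) and its inverse x(n) = y(n) + α·x(n-1), with an assumed coefficient α = 0.97; the function names are hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def de_emphasis(y, alpha=0.97):
    """Inverse filter: x[n] = y[n] + alpha * x[n-1]."""
    y = np.asarray(y, dtype=float)
    x = np.empty_like(y)
    prev = 0.0
    for i, value in enumerate(y):
        prev = value + alpha * prev
        x[i] = prev
    return x

t = np.arange(400) / 8000.0                 # assumed 8 kHz sampling
signal = np.sin(2 * np.pi * 300 * t)
emphasized = pre_emphasis(signal)
restored = de_emphasis(emphasized)          # recovers the original samples
```

Because the second filter is the exact inverse of the first, applying both in sequence returns the original signal up to floating-point error.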

The frequency domain analysis can also be carried out on the voice signal, specifically:

Because a speech wave is a non-stationary process, the standard Fourier transform, which applies to periodic, transient or stationary random signals, cannot directly represent the speech signal; the spectrum of the speech signal should instead be obtained with the short-time Fourier transform. The corresponding spectrum is called the short-time spectrum.

Performing a discrete-time Fourier transform on the nth frame of the speech signal, x_n(m), gives the short-time Fourier transform:

Because the time-bandwidth product of a signal is constant, the main lobe width of W(e^{jω}) is inversely proportional to the window width: the larger N is, the narrower the main lobe of W(e^{jω}). N must therefore take a suitable value to balance signal loss against framing.
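The short-time Fourier transform expression itself is missing from this text. For reference, the standard definition consistent with the symbols used here is given below; treating x_n(m) as the already windowed frame of length N is an assumption.

```latex
% Short-time Fourier transform of the n-th windowed frame (assumed standard form)
X_n(e^{j\omega}) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j\omega m}

% Sampling at \omega = 2\pi k / N gives the DFT of the frame
X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}, \qquad 0 \le k \le N-1
```

Since the frame is the product of the signal and the window w(m), its spectrum is the signal spectrum convolved with W(e^{jω}), which is why the main lobe width of the window limits the frequency resolution.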

S214, performing end point detection on the clean user voice signal.

S214 endpoint detection of the clean user speech signal includes:

The endpoint detection may be developed based on a number of different methods, such as a dual-threshold method, an autocorrelation method, a spectral entropy method, a scaling method, and a logarithmic spectral distance method.

1. Double-threshold method: short-time energy detection distinguishes voiced sounds from silence well. Unvoiced sounds, however, have low energy and fall below the energy threshold, so short-time energy detection misjudges them as silence; short-time zero-crossing rate detection can distinguish silence from unvoiced sounds. Combining the two makes it possible to detect speech segments and silent segments.
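The idea of combining the two measures can be sketched as below. This is a simplified frame classifier, not the full double-threshold algorithm: the function name `detect_speech_frames`, the three threshold parameters and the toy frames are assumptions made for illustration. High-energy frames are taken as voiced; low-energy frames are still kept as unvoiced speech when their zero-crossing count is high and their energy is above a small floor.

```python
import numpy as np

def detect_speech_frames(frames, energy_thr, zcr_thr, floor_thr):
    """Return a boolean mask: True where a frame is judged to contain speech."""
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)                 # short-time energy
    signs = np.where(frames >= 0, 1, -1)                 # sgn with sgn(0) = +1
    zcr = np.sum(np.abs(np.diff(signs, axis=1)), axis=1) // 2
    voiced = energy > energy_thr
    unvoiced = (~voiced) & (zcr > zcr_thr) & (energy > floor_thr)
    return voiced | unvoiced

silence = np.zeros(8)
voiced_frame = np.array([1.0] * 4 + [-1.0] * 4)          # high energy, 1 crossing
unvoiced_frame = 0.2 * np.array([1, -1, 1, -1, 1, -1, 1, -1])  # low energy, 7 crossings
mask = detect_speech_frames(
    [silence, voiced_frame, unvoiced_frame],
    energy_thr=1.0, zcr_thr=3, floor_thr=0.1,
)
```

In this toy input the silent frame is rejected, while both the high-energy frame and the low-energy, high-zero-crossing frame are kept. In practice the thresholds would be estimated from leading noise-only frames.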

2. Autocorrelation method: the short-time autocorrelation function R_n(k) of the speech signal x_n(m) can be expressed as:

where K is the maximum number of delay points.

If the speech sequence is periodic, its autocorrelation function is also periodic with the same period. For a voiced signal, the autocorrelation function can therefore be used to find the pitch period of the speech waveform sequence. The peak amplitudes of the autocorrelation functions of a noise signal and of noisy speech differ greatly, so a suitable threshold is set according to the noise level to judge whether a speech signal is present and to determine the endpoints of the speech signal.
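The autocorrelation expression is not reproduced in this text. The sketch below assumes the usual short-time definition R_n(k) = Σ x_n(m)·x_n(m+k), summed over m = 0 … N-1-k for lags 0 ≤ k ≤ K; the function name, the 8 kHz sampling rate and the synthetic 200 Hz test tone are illustrative assumptions.

```python
import numpy as np

def short_time_autocorr(frame, max_lag):
    """Assumed form: R(k) = sum_{m=0}^{N-1-k} x(m) * x(m+k), for k = 0..max_lag."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    return np.array([np.dot(frame[:N - k], frame[k:]) for k in range(max_lag + 1)])

fs = 8000                                   # assumed sampling rate in Hz
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 200 * t)         # periodic frame, period = 40 samples
r = short_time_autocorr(frame, max_lag=100)

# Pitch period estimate: strongest peak within an assumed lag search range
lag = 20 + int(np.argmax(r[20:61]))
```

For this periodic frame the autocorrelation peaks again at the 40-sample period, which is how the pitch period of a voiced segment can be read off; for a noise-only frame the peaks away from lag 0 are much smaller, which is what the threshold test exploits.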

3. Log spectral distance method: let the noisy speech signal be x(n), let x_i(m) be the i-th frame obtained after windowing and framing, and let the frame length be N. Applying the fast Fourier transform (FFT) to x_i(m) gives:

taking the modulus of the spectrum X_i(k) and then the logarithm gives:

Because the energy spectra of the noise signal and the noisy speech signal differ significantly (the energy spectrum of the noise signal is much lower than that of the noisy speech signal), the endpoints of the speech signal can be determined from the log spectral difference between two frames.
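The FFT and log-spectrum expressions are missing from this text. A minimal sketch of the comparison, assuming the log spectrum is 20·log10|X_i(k)| and the distance between two frames is the root-mean-square difference of their log spectra; the function names, the small constant `eps` that guards against log(0), and the synthetic frames are assumptions.

```python
import numpy as np

def log_spectrum(frame, eps=1e-10):
    """Log-magnitude spectrum in dB of one frame (one-sided FFT)."""
    spectrum = np.fft.rfft(np.asarray(frame, dtype=float))
    return 20.0 * np.log10(np.abs(spectrum) + eps)

def log_spectral_distance(frame_a, frame_b):
    """RMS difference between the log spectra of two frames, in dB."""
    diff = log_spectrum(frame_a) - log_spectrum(frame_b)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(256)                    # noise-only frame
tone = np.sin(2 * np.pi * 10 * np.arange(256) / 256)
noisy_speech = noise + tone                                # stand-in for speech
d_same = log_spectral_distance(noise, noise)
d_speech = log_spectral_distance(noise, noisy_speech)
```

A frame whose distance from a reference noise frame exceeds a chosen threshold would be treated as speech; the threshold itself is not given in the text and has to be tuned.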

By combining the short-time zero-crossing rate, endpoint detection and speech energy spectrum judgment, and by comparing the voice signals extracted by the two different methods, the voice signal of the calling party in the power dispatching system can be extracted effectively. The extracted signal is then used to train the deep neural network for voiceprint recognition and to authenticate the identity of the calling party.

The architecture of a fully connected neural network is divided into an input layer (Input Layer), hidden layers (Hidden Layers) and an output layer, where there may be multiple hidden layers. A deep neural network is a neural network architecture with multiple hidden layers.

Processing audio signals using a fully connected neural network may have 3 significant drawbacks:

expanding the speech signal into vectors may lose part of the spatial information;

too many parameters can lead to inefficiency and difficulty in training;

a large number of parameters may lead to overfitting.

In general, the first convolutional layer is responsible for capturing lower-level features, and the other convolutional layers are responsible for extracting higher-level features.

Gradient explosion or gradient vanishing can occur as the network is deepened, so that the gradient cannot be propagated back to earlier layers during optimization and the optimal solution cannot be approached. A residual convolutional neural network handles both problems better. The Softmax function is a generalization of the basic logistic function. It maps a K-dimensional vector nonlinearly to another K-dimensional vector in which each element is output as a probability and all elements sum to 1, which satisfies the requirements of a probability distribution. The invention adopts the Softmax function as the loss function to train the voiceprint recognition deep neural network.
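The Softmax mapping described above can be sketched as follows. Pairing it with a cross-entropy term is an assumption: that is the usual way a "Softmax loss" is computed, but the text does not spell out the exact loss, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional score vector to K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / np.sum(e)

def cross_entropy(z, target_index):
    """Assumed training loss: negative log-probability of the true speaker."""
    return float(-np.log(softmax(z)[target_index]))

scores = [1.0, 2.0, 3.0]           # hypothetical network outputs for 3 enrolled speakers
p = softmax(scores)
loss = cross_entropy(scores, target_index=2)
```

The largest score receives the largest probability, and the loss shrinks as the probability assigned to the correct speaker approaches 1.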

Before S400, in which the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against the voiceprint information recorded in advance, the method further comprises S310, training the voiceprint recognition model, specifically:

S311, dividing the preprocessed voice signals into a training set and a testing set;

S312, inputting the training set into the voiceprint recognition model;

S313, the voiceprint recognition model outputs a matching result for the voice signal: if the matching is successful, the user identity is output, and if the matching fails, no personnel information is output;

S314, iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.

Step S400, in which the power dispatching system matches the received user voice signal with the voice signal recorded in advance by personnel with the operation authority by using a trained voiceprint recognition model, comprises the following steps:

S401, extracting the user voice signal, and generating a corresponding WAV file by using PCM coding;

S402, the power dispatching system forwards the corresponding WAV file to the voiceprint recognition model;

S403, retrieving the voice signal recorded in advance by a person with the operation authority in the power dispatching system and matching it against the extracted user voice signal;

S404, judging the user operation authority according to the matching result.
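Step S401 can be illustrated with Python's standard `wave` module. This is a sketch under assumptions the text does not state: 16-bit mono linear PCM, an 8 kHz sampling rate, input samples scaled to [-1, 1], and a hypothetical helper name `write_pcm_wav`. An in-memory buffer stands in for the file that would be forwarded in S402.

```python
import io
import wave

import numpy as np

def write_pcm_wav(target, samples, sample_rate=8000):
    """Write float samples in [-1, 1] as a 16-bit mono PCM WAV file.

    `target` may be a file path or a writable binary file object.
    """
    clipped = np.clip(np.asarray(samples, dtype=float), -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")     # little-endian signed 16-bit
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)                   # mono
        wav.setsampwidth(2)                   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# One second of a 440 Hz test tone written to an in-memory buffer
buffer = io.BytesIO()
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
write_pcm_wav(buffer, tone)
```

Passing a file path instead of the buffer writes the same data to disk.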

Example 2

Example 3

the model building unit is used for building a voiceprint recognition model;

The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The voiceprint recognition and authentication method for the power dispatching system personnel is characterized by comprising the following steps of:

constructing a voiceprint recognition model;

2. The method for identifying and authenticating voice prints of personnel in a power dispatching system according to claim 1, wherein the step of removing the component which does not belong to the user voice in the user voice signal to obtain a pure user voice signal comprises the following steps:

and carrying out voice signal intensity analysis and signal comparison on the first voice signal and the second voice signal by using the short-time zero-crossing rate, the end point detection and the voice energy spectrum, and separating the voice signal of the calling person to obtain a pure user voice signal.

3. The method for identifying and authenticating voice prints of personnel in a power dispatching system according to claim 1, wherein after the pure user voice signal is obtained, the method further comprises preprocessing the pure user voice signal, specifically:

s_w(n) = s(n) · w(n);

different values of a produce different Hamming windows;

and performing endpoint detection on the pure user voice signal.

4. The method for voice print recognition and authentication of power dispatching system personnel according to claim 2, wherein the endpoint detection of the clean user voice signal comprises:

5. The method for identifying and authenticating voice prints of personnel in a power dispatching system according to claim 4, wherein the selecting corresponding unvoiced models and voiced models for endpoint detection of clean user voice signals according to voiced and unvoiced voice signals comprises:

6. The method for identifying and authenticating voiceprint of personnel in a power dispatching system according to claim 1, wherein the voiceprint identification model is formed by serially connecting a convolutional neural network CNN and a long-short-term memory network LSTM.

7. The method for identifying and authenticating voiceprint of personnel of a power dispatching system according to claim 1, wherein before the power dispatching system matches the received user voice signal with the voiceprint information recorded in advance by using a trained voiceprint identification model, the method further comprises training the voiceprint identification model, specifically:

dividing the preprocessed voice signals into a training set and a testing set;

inputting the training set into a voiceprint recognition model;

8. The method for voice print recognition and authentication of personnel in a power dispatching system according to claim 1, wherein the power dispatching system matches the received user voice signal with a voice signal pre-recorded by the personnel with the operation authority by using a trained voice print recognition model, comprising:

and judging the user operation authority according to the matching result.

9. A power dispatching system personnel voiceprint recognition authentication system, comprising:

10. A power dispatching system personnel voiceprint recognition and authentication device connected with a power dispatching system personnel voiceprint recognition and authentication system through a data transmission path, so that the power dispatching system personnel voiceprint recognition and authentication device executes the power dispatching system personnel voiceprint recognition and authentication method according to claims 1-8, which is characterized by comprising the following steps:

the model building unit is used for building a voiceprint recognition model;