CN116229988A - Voiceprint recognition and authentication method, system and device for personnel of power dispatching system - Google Patents
Voiceprint recognition and authentication method, system and device for personnel of power dispatching system
- Publication number
- CN116229988A (application CN202310297752.8A)
- Authority
- CN
- China
- Prior art keywords
- voice signal
- user
- power dispatching
- dispatching system
- personnel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S40/00—Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
- Y04S40/20—Information technology specific aspects, e.g. CAD, simulation, modelling, system security
Abstract
The invention provides a voiceprint recognition and authentication method, system and device for personnel of a power dispatching system. The method comprises the following steps: a user sends an operation request and a voice signal to the power dispatching system; components that do not belong to the user's voice are removed from the user voice signal to obtain a clean user voice signal; a voiceprint recognition model is constructed; the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against voice signals pre-recorded by personnel holding the operation authority; the user is allowed to operate if the matching succeeds and refused otherwise. The invention can accurately recognize the user's voice even under interference from current and noise.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to voiceprint recognition, and specifically to a voiceprint recognition and authentication method, system and device for power dispatching system personnel.
Background
As artificial intelligence matures, an intelligent dispatching voice processing platform is increasingly needed to identify, analyze and diagnose the various kinds of dispatching voice information and to assist dispatchers in making the most timely responses, the most accurate judgments and the most efficient analyses.
Time-frequency analysis is a common approach in acoustic signal processing. However, the acoustic signals of an operating dispatcher are inevitably affected by current and noise interference, so the signals monitored at different times vary and exhibit broadband non-stationary characteristics. Their time-frequency features are therefore complex, and it is difficult to distinguish the dispatcher's different working states by direct analysis. Improving the accuracy of identifying the dispatcher's working state is a problem to be solved.
Disclosure of Invention
The invention aims to provide a voiceprint recognition and authentication method, system and device for power dispatching system personnel.
A voiceprint recognition and authentication method for power dispatching system personnel comprises the following steps:
the user sends an operation request and a voice signal to the power dispatching system;
removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal;
constructing a voiceprint recognition model;
the power dispatching system matches the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
Removing the components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal, which comprises the following specific steps:
the method comprises the steps that a first voice signal is obtained from an input end of a power dispatching system, wherein the first voice signal comprises voices of a calling person and a called person;
a sidetone cancellation circuit is added to the transmission line of the power dispatching system, and a second voice signal is acquired from the microphone end of the caller, wherein the strength of the caller's voice signal in the second voice signal is far greater than that of the called party;
and carrying out voice signal intensity analysis and signal comparison on the first voice signal and the second voice signal, and separating out the voice signal of the calling person to obtain a pure user voice signal.
After the pure user voice signal is obtained, the method further comprises the step of preprocessing the pure user voice signal, and specifically comprises the following steps:
framing the clean user speech signal and multiplying the speech signal s(n) by a window function w(n) to form a windowed speech signal:
s_w(n) = s(n) · w(n);
the slope at the two ends of the time window is reduced by adopting a Hamming window, whose (generalized) expression is:
w(n) = a − (1 − a)·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1;
different values of a will produce different Hamming windows (a = 0.54 gives the standard Hamming window);
pre-emphasis is adopted to boost the high-frequency components of the clean user voice signal, and de-emphasis is adopted to attenuate the boosted high-frequency components again after processing;
and performing endpoint detection on the pure user voice signal.
Endpoint detection of clean user speech signals includes:
detecting unvoiced sounds with a short-time zero-crossing-rate detection algorithm that combines short-time energy and zero-crossing-rate detection, and detecting voiced sounds with short-time energy;
and selecting a corresponding unvoiced sound model and a corresponding voiced sound model according to the voiced sound and the unvoiced sound of the voice signal to detect the endpoint of the pure user voice signal.
Selecting corresponding unvoiced models and voiced models according to voiced and unvoiced sounds of the voice signal to perform endpoint detection of the clean user voice signal comprises:
when unvoiced, the corresponding unvoiced excitation model is simulated as random white noise, using a sequence with zero mean and unit variance that is white in both time and amplitude;
when voiced, an intermittent pulse wave is generated; the usual slanted-triangular pulse takes the form:
g(n) = (1/2)[1 − cos(πn/N1)] for 0 ≤ n ≤ N1; g(n) = cos(π(n − N1)/(2N2)) for N1 < n ≤ N1 + N2; g(n) = 0 otherwise;
in the above formula, N1 is the duration of the rising part of the slanted triangular wave and N2 the duration of its falling part;
after the speech signal is framed, the energy of the nth frame of the speech signal x_n(m) can be expressed as:
E_n = Σ_{m=0}^{N−1} x_n(m)²;
the short-time zero-crossing rate is the number of times the waveform of the voice signal within one frame crosses the horizontal axis, i.e. the zero level, and can be expressed as:
Z_n = (1/2) Σ_{m=0}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn(·) is the sign function; the number of zero crossings is obtained by checking whether the sign changes between the current sample and the previous sample.
The voiceprint recognition model is formed by connecting a convolutional neural network (CNN) and a long short-term memory (LSTM) network in series.
Before the power dispatching system matches the received user voice signal against the pre-recorded voiceprint information using the trained voiceprint recognition model, the method further comprises training the voiceprint recognition model, specifically:
dividing the preprocessed voice signals into a training set and a testing set;
inputting the training set into a voiceprint recognition model;
outputting a matching result of the voice signal by the voiceprint recognition model, if the matching is successful, outputting a user identity, and if the matching is unsuccessful, outputting no personnel information;
and iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.
The power dispatching system matches the received user voice signal with the voice signal pre-recorded by the personnel with the operation authority by using a trained voiceprint recognition model, which comprises the following steps:
extracting the user voice signal, and generating a corresponding WAV file by using a PCM code;
the power dispatching system forwards the corresponding WAV file to the voiceprint recognition model;
taking out a voice signal which is recorded in advance by a person with the operation authority in the power dispatching system and an extracted user voice signal to perform signal matching;
and judging the user operation authority according to the matching result.
A power dispatching system personnel voiceprint recognition authentication system, comprising:
the receiving module is used for receiving an operation request and a voice signal sent by a user to the power dispatching system;
the first data processing module is used for constructing a voiceprint recognition model;
the second data processing module is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and the result output module is used for allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
The power dispatching system personnel voiceprint recognition and authentication device is connected with a power dispatching system personnel voiceprint recognition and authentication system through a data transmission path, so that the power dispatching system personnel voiceprint recognition and authentication device executes the power dispatching system personnel voiceprint recognition and authentication method, which comprises the following steps:
the data acquisition unit is used for acquiring an operation request and a voice signal sent by a user to the power dispatching system;
the model building unit is used for building a voiceprint recognition model;
the judging unit is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and the output unit is used for outputting the judging result of the judging unit, allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
According to the invention, the user sends an operation request and a voice signal to the power dispatching system; components that do not belong to the user's voice are removed from the user voice signal to obtain a clean user voice signal; a voiceprint recognition model is constructed; the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against voice signals pre-recorded by personnel holding the operation authority; the user is allowed to operate if the matching succeeds and refused otherwise. Telephone voice signals can be extracted simultaneously from the input end and the microphone end of the dispatching telephone, and voices that do not belong to the caller are removed by comparing the two, improving the purity of the extracted user voice signal; the processed signal lets the voiceprint recognition model judge the user's voice information more accurately, reducing the dispatcher's workload and improving dispatching efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the method for obtaining clean user speech signals according to the present invention;
FIG. 3 is a flow chart of the voiceprint recognition model training of the present invention;
FIG. 4 is a flowchart illustrating the operation of the voiceprint recognition model of the present invention;
FIG. 5 is a short-term processing diagram of a speech signal according to the present invention;
FIG. 6 shows the normalized Hamming window time-domain and frequency-domain signals according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that all directional indicators (such as up, down, left, right, front and rear) in the embodiments of the present invention are merely used to explain the relative positional relationship, movement, etc. between the components in a particular posture (as shown in the drawings); if the particular posture changes, the directional indicator changes accordingly.
Furthermore, the descriptions "first", "second", etc. in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may be combined with each other, but only insofar as the combination can be realized by a person skilled in the art; when technical solutions contradict each other or cannot be realized, their combination should be considered absent and outside the scope of protection claimed by the present invention.
Voice recognition methods based on neural networks are easily disturbed by environmental noise and by other people's voices, which makes the recognition result inaccurate; the present method eliminates such interference to obtain a clean target voice signal, improving the recognition accuracy of the voiceprint recognition model. Moreover, the features extracted by a single convolutional network model are limited and lead to inaccurate results, so the voiceprint recognition model here combines a convolutional neural network with a long short-term memory network, which greatly improves voice recognition accuracy.
Example 1
A voiceprint recognition and authentication method for power dispatching system personnel comprises the following steps:
s100, a user sends an operation request and a voice signal to a power dispatching system;
s200, removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal;
s300, constructing a voiceprint recognition model;
s400, the power dispatching system matches the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
s500, allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
S200, removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal, wherein the method specifically comprises the following steps:
s201, a first voice signal is acquired from an input end of a power dispatching system, wherein the first voice signal comprises voices of a calling person and a called person;
s202, a sidetone cancellation circuit is added to the transmission line of the power dispatching system, and a second voice signal is acquired from the microphone end of the caller, wherein the strength of the caller's voice signal in the second voice signal is far greater than that of the called party;
s203, performing voice signal intensity analysis and signal comparison on the first voice signal and the second voice signal, and separating the voice signal of the calling party to obtain a pure user voice signal.
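As an illustrative, non-limiting sketch of S203, the intensity comparison can be done frame by frame: frames in which the microphone-side (caller-dominant) signal is not markedly weaker than the line-input signal are kept as caller speech. The 400-sample frames, 200-sample hop and 6 dB dominance margin below are assumptions for illustration, not values fixed by this disclosure.

```python
import numpy as np

def frame_energy(x, frame_len=400, hop=200):
    """Short-time energy per frame (assumes a 1-D float signal)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i*hop : i*hop + frame_len] ** 2) for i in range(n)])

def separate_caller(line_sig, mic_sig, margin_db=6.0, frame_len=400, hop=200):
    """Keep the microphone-side frames whose energy is within margin_db of the
    line-input energy (caller speaking); zero out the remaining frames.
    Both inputs are assumed to be time-aligned float arrays of equal length."""
    e_line = frame_energy(line_sig, frame_len, hop)
    e_mic = frame_energy(mic_sig, frame_len, hop)
    out = np.zeros_like(mic_sig)
    for i, (el, em) in enumerate(zip(e_line, e_mic)):
        if 10 * np.log10(em + 1e-12) >= 10 * np.log10(el + 1e-12) - margin_db:
            out[i*hop : i*hop + frame_len] = mic_sig[i*hop : i*hop + frame_len]
    return out
```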
After the clean user voice signal is obtained in S200, step S210 preprocesses it, specifically:
s211, framing the clean user speech signal and multiplying the speech signal s(n) by a window function w(n) to form a windowed speech signal:
s_w(n) = s(n) · w(n);
the analysis and processing of the speech signal must be based on short-time basis, the speech signal being divided into segments to analyze its characteristic parameters. Each of these is called a frame, which is typically 10ms to 30ms long. For the overall speech signal, a time series of characteristic parameters consisting of characteristic parameters of each frame is analyzed.
For speech signal processing, typically 33 to 100 frames per second are taken (10 ms to 30 ms per frame). Although contiguous segmentation may be used, the more common method is overlapping segmentation, as shown in FIG. 5; its basic purpose is to make the transition from frame to frame smooth and maintain continuity. The overlapping portion of consecutive frames is referred to as the frame shift, and the ratio of frame shift to frame length is typically between 0 and 1/2.
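A minimal overlapping-segmentation sketch in Python, following the convention above (the overlap between consecutive frames is the frame shift); the 16 kHz sampling rate, 25 ms frame length and 10 ms overlap are assumed values:

```python
import numpy as np

def enframe(s, fs=16000, frame_ms=25, overlap_ms=10):
    """Split a signal into overlapping frames. The step between frame starts
    is the frame length minus the overlap, so consecutive frames share
    overlap_ms of signal."""
    frame_len = int(fs * frame_ms / 1000)
    step = frame_len - int(fs * overlap_ms / 1000)
    n = 1 + max(0, (len(s) - frame_len) // step)
    return np.stack([s[i*step : i*step + frame_len] for i in range(n)])
```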
S212, the gradients at the two ends of the time window are reduced by adopting a Hamming window, whose (generalized) expression is:
w(n) = a − (1 − a)·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1;
different values of a will produce different Hamming windows (a = 0.54 gives the standard Hamming window).
one good window function criterion is: the time domain is that the voice waveform is multiplied by the window function, so that the gradient at two ends of the time window needs to be reduced, and the two ends of the edge of the window do not cause abrupt change and smoothly transition to zero, so that the cut-out voice waveform can be slowly reduced to zero, and the cutting-off effect of voice frames is reduced; a wider 3dB bandwidth and a smaller sideband maximum are required in the frequency domain.
Comparing the Hamming window with the rectangular window: the main lobe of the Hamming window is twice as wide as that of the rectangular window (double the bandwidth), and its out-of-band attenuation is also far greater. The Hamming window is smoother, though it attenuates high-frequency content and so loses some waveform detail. On balance, the Hamming window is more suitable than the rectangular window.
Out-of-band attenuation: the ratio of the signal amplitude at a frequency outside the passband (for example, at twice or at ten times the corner frequency) to the signal amplitude within the passband.
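A short sketch of the generalized Hamming window described above; a = 0.54 reproduces NumPy's built-in Hamming window, while a = 1 degenerates to the rectangular window:

```python
import numpy as np

def generalized_hamming(N, a=0.54):
    # w(n) = a - (1 - a) * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
    n = np.arange(N)
    return a - (1 - a) * np.cos(2 * np.pi * n / (N - 1))

w = generalized_hamming(400)
assert np.allclose(w, np.hamming(400))  # a = 0.54 matches np.hamming exactly
```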
S213, pre-emphasis is adopted to boost the high-frequency components of the clean user voice signal, and de-emphasis is adopted to attenuate the boosted high-frequency components again after processing;
The energy of a speech signal is large in the low-frequency band and drops markedly in the high-frequency band, while the power spectral density of the noise output by the frequency discriminator grows with the square of frequency; the signal-to-noise ratio of the audio signal is therefore large at the low-frequency end and markedly smaller at the high-frequency end. Hence pre-emphasis (boosting the high-frequency components of the signal to be processed) and de-emphasis (attenuating the corresponding high-frequency components after processing) can be applied.
The frequency domain analysis can also be carried out on the voice signal, specifically:
because speech waves are a non-stationary process, standard fourier transforms applied to periodic, transient or stationary random signals cannot directly represent the speech signal, but rather the spectrum of the speech signal should be processed using short-time fourier transforms. The corresponding spectrum is called the short-term spectrum.
Performing a discrete-time Fourier transform on the nth frame of the speech signal x_n(m) gives the short-time Fourier transform:
X_n(e^{jω}) = Σ_{m=0}^{N−1} x_n(m) e^{−jωm}.
The time-bandwidth product of a signal is a constant, and the main-lobe width of the window spectrum W(e^{jω}) is inversely proportional to the window length N: the larger N is, the narrower the main lobe of W(e^{jω}). N therefore needs a suitable value to strike a balance between signal loss and the framing requirement.
S214, performing end point detection on the clean user voice signal.
S214 endpoint detection of the clean user speech signal includes:
detecting unvoiced sounds with a short-time zero-crossing-rate detection algorithm that combines short-time energy and zero-crossing-rate detection, and detecting voiced sounds with short-time energy;
and selecting a corresponding unvoiced sound model and a corresponding voiced sound model according to the voiced sound and the unvoiced sound of the voice signal to detect the endpoint of the pure user voice signal.
Selecting corresponding unvoiced models and voiced models according to voiced and unvoiced sounds of the voice signal to perform endpoint detection of the clean user voice signal comprises:
when unvoiced, the corresponding unvoiced excitation model is simulated as random white noise, using a sequence with zero mean and unit variance that is white in both time and amplitude;
when voiced, an intermittent pulse wave is generated; the usual slanted-triangular pulse takes the form:
g(n) = (1/2)[1 − cos(πn/N1)] for 0 ≤ n ≤ N1; g(n) = cos(π(n − N1)/(2N2)) for N1 < n ≤ N1 + N2; g(n) = 0 otherwise;
in the above formula, N1 is the duration of the rising part of the slanted triangular wave and N2 the duration of its falling part.
After the speech signal is framed, the energy of the nth frame of the speech signal x_n(m) can be expressed as:
E_n = Σ_{m=0}^{N−1} x_n(m)²;
the short-time zero-crossing rate is the number of times the waveform of the voice signal within one frame crosses the horizontal axis, i.e. the zero level, and can be expressed as:
Z_n = (1/2) Σ_{m=0}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn(·) is the sign function; the number of zero crossings is obtained by checking whether the sign changes between the current sample and the previous sample.
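The two quantities translate directly into code; a minimal sketch:

```python
import numpy as np

def short_time_energy(frame):
    """E_n = sum over m of x_n(m)^2."""
    return np.sum(np.asarray(frame, dtype=float) ** 2)

def zero_crossing_rate(frame):
    """Z_n = (1/2) * sum over m of |sgn(x_n(m)) - sgn(x_n(m-1))|."""
    signs = np.sign(frame)
    signs[signs == 0] = 1            # treat zero samples as positive
    return 0.5 * np.sum(np.abs(np.diff(signs)))
```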
Endpoint detection may be based on a number of different methods, such as the double-threshold method, the autocorrelation method, the spectral entropy method, the scaling method and the log-spectral-distance method.
1. Double-threshold method: short-time energy detection distinguishes voiced sounds from silence fairly well, but unvoiced sounds, whose energy is small, fall below the energy threshold and are misjudged as silence; short-time zero-crossing detection, in turn, can distinguish silence from unvoiced speech. Combining the two detects both the speech segments and the silence segments.
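A sketch of the double-threshold decision per frame; the threshold fractions below are illustrative assumptions (in practice they are tuned to the noise level):

```python
import numpy as np

def double_threshold_vad(frames):
    """Mark each frame as speech/non-speech: high energy implies voiced
    speech; moderate energy with a high zero-crossing rate implies
    unvoiced speech; everything else is treated as silence."""
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)
    signs = np.sign(frames)
    signs[signs == 0] = 1
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    e_hi, e_lo = 0.25 * energy.max(), 0.05 * energy.max()   # assumed fractions
    z_thr = zcr.mean()                                      # assumed threshold
    return (energy > e_hi) | ((energy > e_lo) & (zcr > z_thr))
```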
2. Autocorrelation method: the short-time autocorrelation function R_n(k) of the speech signal x_n(m) can be expressed as:
R_n(k) = Σ_{m=0}^{N−1−k} x_n(m) x_n(m+k), 0 ≤ k ≤ K,
where K is the maximum number of delay points.
If a speech sequence is periodic, its autocorrelation function is a periodic function with the same period, so for a voiced signal the autocorrelation function can be used to find the pitch period of the waveform sequence. The autocorrelation functions of a noise signal and of noisy speech differ greatly in peak amplitude; setting a suitable threshold according to the noise level determines whether a speech signal is present and thus locates the endpoints of the voice signal.
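A sketch of the short-time autocorrelation and of pitch-period estimation from its peak; the sampling rate and the 60–400 Hz pitch search range are assumptions, and the frame must be longer than the maximum lag:

```python
import numpy as np

def short_time_autocorr(frame, K):
    """R_n(k) = sum_{m=0}^{N-1-k} x_n(m) * x_n(m+k), for k = 0..K."""
    x = np.asarray(frame, dtype=float)
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) for k in range(K + 1)])

def pitch_period(frame, fs=16000, fmin=60, fmax=400):
    """Pitch period (in samples) of a voiced frame: the autocorrelation
    peak inside the plausible lag range [fs/fmax, fs/fmin]."""
    r = short_time_autocorr(frame, int(fs / fmin))
    lo = int(fs / fmax)
    return lo + int(np.argmax(r[lo:]))
```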
3. Log-spectral-distance method: let the noisy speech signal be x(n), let x_i(m) be the ith frame obtained after windowing and framing, and let the frame length be N. Applying the FFT (fast Fourier transform) to x_i(m) gives:
X_i(k) = Σ_{m=0}^{N−1} x_i(m) e^{−j2πmk/N}, k = 0, 1, …, N − 1.
Taking the modulus of the spectrum X_i(k) and then the logarithm gives the log spectrum:
X̂_i(k) = 20·lg|X_i(k)| (dB).
because the energy spectra of the noise signal and the noise-containing speech signal differ significantly (the noise signal energy spectrum is much lower than the noise-containing speech signal energy spectrum), the end point of the speech signal can be determined by the logarithmic spectral difference between the two frames of signals.
By combining the short-time zero-crossing rate, endpoint detection and energy-spectrum judgment, and comparing the voice signals extracted by the two different methods, the caller's voice signal in the power dispatching system can be effectively extracted and then used for training the deep-learning voiceprint recognition network and for voiceprint-based identity judgment and authentication of the caller.
The voiceprint recognition model is formed by connecting a convolutional neural network (CNN) and a long short-term memory (LSTM) network in series.
The architecture of a fully connected neural network is divided into an input layer, hidden layers and an output layer, and there may be several hidden layers; a deep neural network is a neural network architecture with multiple hidden layers.
Processing audio signals with a fully connected neural network has three significant drawbacks:
expanding the speech signal into vectors may lose part of the spatial information;
too many parameters can lead to inefficiency and difficulty in training;
a large number of parameters may lead to overfitting.
In general, the first convolutional layer is responsible for capturing lower-level features, and the other convolutional layers are responsible for extracting higher-level features.
As a network deepens, gradients may explode or vanish, so that during optimization the network cannot propagate the gradient back to earlier layers and cannot approach the optimal solution; a residual convolutional neural network alleviates both gradient explosion and gradient vanishing. The Softmax function is a modification of the basic logistic function: it nonlinearly maps a K-dimensional vector to another K-dimensional vector whose elements are output as probabilities summing to 1, satisfying the probability requirement. The invention adopts the Softmax function as the loss function to train the voiceprint recognition deep neural network.
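The disclosure fixes only the CNN-plus-LSTM series structure and the Softmax-based loss; the layer sizes, kernel widths and number of enrolled speakers in the PyTorch sketch below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CnnLstmVoiceprint(nn.Module):
    """CNN front-end for local spectro-temporal features, followed by an
    LSTM over the time axis; the final linear layer produces per-speaker
    logits (Softmax is applied inside the cross-entropy loss)."""
    def __init__(self, n_mels=40, n_speakers=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # halves time and frequency axes
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                            hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_speakers)

    def forward(self, x):                     # x: (batch, 1, time, n_mels)
        f = self.cnn(x)                       # (batch, 32, time/4, n_mels/4)
        f = f.permute(0, 2, 1, 3).flatten(2)  # (batch, time/4, 32 * n_mels/4)
        _, (h, _) = self.lstm(f)
        return self.fc(h[-1])                 # per-speaker logits
```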
Before step S400, in which the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against the pre-recorded voiceprint information, the method further includes S310, training the voiceprint recognition model, specifically:
s311, dividing the preprocessed voice signals into a training set and a testing set;
s312, inputting the training set into a voiceprint recognition model;
s313, the voiceprint recognition model outputs the matching result of the voice signal: if the matching succeeds, the user identity is output; if the matching fails, no personnel information is output;
s314, iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.
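A sketch of S311–S314 as a training loop; the error threshold, learning rate and epoch cap are assumed hyper-parameters, and CrossEntropyLoss applies Softmax internally, matching the Softmax-based loss described above:

```python
import torch
import torch.nn as nn

def train_voiceprint(model, train_loader, test_loader,
                     max_err=0.05, max_epochs=50, lr=1e-3):
    """Train iteratively until the test error rate drops below max_err."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        model.train()
        for feats, labels in train_loader:       # S312: feed the training set
            opt.zero_grad()
            loss_fn(model(feats), labels).backward()
            opt.step()
        model.eval()                              # S313/S314: check error rate
        wrong = total = 0
        with torch.no_grad():
            for feats, labels in test_loader:
                wrong += (model(feats).argmax(1) != labels).sum().item()
                total += labels.numel()
        if wrong / total < max_err:
            break
    return model
```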
Step S400, in which the power dispatching system matches the received user voice signal against the voice signals pre-recorded by personnel holding the operation authority using the trained voiceprint recognition model, comprises the following steps:
s401, extracting the user voice signal, and generating a corresponding WAV file by using a PCM code;
s402, the power dispatching system forwards the corresponding WAV file to a voiceprint recognition model;
s403, taking out a voice signal which is recorded in advance by a person with the operation authority in the power dispatching system and an extracted user voice signal for signal matching;
s404, judging the user operation authority according to the matching result.
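For S401, a PCM-coded WAV file can be produced with Python's standard wave module; the mono 16-bit format and the 8 kHz telephone sampling rate are assumptions:

```python
import wave
import numpy as np

def pcm_to_wav(samples, path, fs=8000):
    """Write a float signal in [-1, 1] as a mono 16-bit PCM WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit samples
        wf.setframerate(fs)
        wf.writeframes(pcm.tobytes())
```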
Example 2
A power dispatching system personnel voiceprint recognition authentication system, comprising:
the receiving module is used for receiving an operation request and a voice signal sent by a user to the power dispatching system;
the first data processing module is used for constructing a voiceprint recognition model;
the second data processing module is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and the result output module is used for allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
Example 3
The power dispatching system personnel voiceprint recognition and authentication device is connected with a power dispatching system personnel voiceprint recognition and authentication system through a data transmission path, so that the power dispatching system personnel voiceprint recognition and authentication device executes the power dispatching system personnel voiceprint recognition and authentication method, which comprises the following steps:
the data acquisition unit is used for acquiring an operation request and a voice signal sent by a user to the power dispatching system;
the model building unit is used for building a voiceprint recognition model;
the judging unit is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and the output unit is used for outputting the judging result of the judging unit, allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
According to the invention, the user sends an operation request and a voice signal to the power dispatching system; components that do not belong to the user's voice are removed from the user voice signal to obtain a clean user voice signal; a voiceprint recognition model is constructed; the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against voice signals pre-recorded by personnel holding the operation authority; the user is allowed to operate if the matching succeeds and refused otherwise. Telephone voice signals can be extracted simultaneously from the input end and the microphone end of the dispatching telephone, and voices that do not belong to the caller are removed by comparing the two, improving the purity of the extracted user voice signal; the processed signal lets the voiceprint recognition model judge the user's voice information more accurately, reducing the dispatcher's workload and improving dispatching efficiency.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. The voiceprint recognition and authentication method for the power dispatching system personnel is characterized by comprising the following steps of:
the user sends an operation request and a voice signal to the power dispatching system;
removing components which do not belong to the user voice in the user voice signal to obtain a pure user voice signal;
constructing a voiceprint recognition model;
the power dispatching system matches the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
2. The method for identifying and authenticating voice prints of personnel in a power dispatching system according to claim 1, wherein the step of removing the component which does not belong to the user voice in the user voice signal to obtain a pure user voice signal comprises the following steps:
the method comprises the steps that a first voice signal is obtained from an input end of a power dispatching system, wherein the first voice signal comprises voices of a calling person and a called person;
a sidetone cancellation circuit is added to the transmission line of the power dispatching system, and a second voice signal is acquired from the microphone end of the caller, wherein the strength of the caller's voice signal in the second voice signal is far greater than that of the called party;
and carrying out voice signal intensity analysis and signal comparison on the first voice signal and the second voice signal by using the short-time zero-crossing rate, the end point detection and the voice energy spectrum, and separating the voice signal of the calling person to obtain a pure user voice signal.
3. The method for identifying and authenticating voice prints of personnel in a power dispatching system according to claim 1, wherein after the pure user voice signal is obtained, the method further comprises preprocessing the pure user voice signal, specifically:
framing the clean user speech signal and multiplying the speech signal s(n) by a window function w(n) to form a windowed speech signal:
s_w(n) = s(n) · w(n);
the slope at the two ends of the time window is reduced by adopting a Hamming window, whose (generalized) expression is:
w(n) = a − (1 − a)·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1;
different values of a will produce different Hamming windows;
pre-emphasis is adopted to boost the high-frequency components of the clean user voice signal, and de-emphasis is adopted to attenuate the boosted high-frequency components again after processing;
and performing endpoint detection on the pure user voice signal.
4. The method for voiceprint recognition and authentication of power dispatching system personnel according to claim 2, wherein the endpoint detection of the clean user voice signal comprises:
detecting unvoiced sounds with a short-time zero-crossing-rate detection algorithm that combines short-time energy and zero-crossing-rate detection, and detecting voiced sounds with short-time energy;
and selecting a corresponding unvoiced sound model and a corresponding voiced sound model according to the voiced sound and the unvoiced sound of the voice signal to detect the endpoint of the pure user voice signal.
5. The method for identifying and authenticating voice prints of personnel in a power dispatching system according to claim 4, wherein the selecting corresponding unvoiced models and voiced models for endpoint detection of clean user voice signals according to voiced and unvoiced voice signals comprises:
when unvoiced, the corresponding unvoiced excitation model is simulated as random white noise, using a sequence with zero mean and unit variance that is white in both time and amplitude;
when voiced, an intermittent pulse wave is generated; the usual slanted-triangular pulse takes the form:
g(n) = (1/2)[1 − cos(πn/N1)] for 0 ≤ n ≤ N1; g(n) = cos(π(n − N1)/(2N2)) for N1 < n ≤ N1 + N2; g(n) = 0 otherwise;
in the above formula, N1 is the duration of the rising part of the slanted triangular wave and N2 the duration of its falling part;
after the speech signal is framed, the energy of the nth frame of the speech signal x_n(m) can be expressed as:
E_n = Σ_{m=0}^{N−1} x_n(m)²;
the short-time zero-crossing rate is the number of times the waveform of the voice signal within one frame crosses the horizontal axis, i.e. the zero level, and can be expressed as:
Z_n = (1/2) Σ_{m=0}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn(·) is the sign function; the number of zero crossings is obtained by checking whether the sign changes between the current sample and the previous sample.
6. The method for voiceprint recognition and authentication of power dispatching system personnel according to claim 1, wherein the voiceprint recognition model is formed by connecting a convolutional neural network (CNN) and a long short-term memory (LSTM) network in series.
7. The method for identifying and authenticating voiceprint of personnel of a power dispatching system according to claim 1, wherein before the power dispatching system matches the received user voice signal with the voiceprint information recorded in advance by using a trained voiceprint identification model, the method further comprises training the voiceprint identification model, specifically:
dividing the preprocessed voice signals into a training set and a testing set;
inputting the training set into a voiceprint recognition model;
outputting a matching result of the voice signal by the voiceprint recognition model, if the matching is successful, outputting a user identity, and if the matching is unsuccessful, outputting no personnel information;
and iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.
8. The method for voiceprint recognition and authentication of power dispatching system personnel according to claim 1, wherein the power dispatching system matches the received user voice signal with a voice signal pre-recorded by the personnel with the operation authority by using a trained voiceprint recognition model, comprising:
extracting the user voice signal, and generating a corresponding WAV file by using a PCM code;
the power dispatching system forwards the corresponding WAV file to the voiceprint recognition model;
taking out a voice signal which is recorded in advance by a person with the operation authority in the power dispatching system and an extracted user voice signal to perform signal matching;
and judging the user operation authority according to the matching result.
9. A power dispatching system personnel voiceprint recognition authentication system, comprising:
the receiving module is used for receiving an operation request and a voice signal sent by a user to the power dispatching system;
the first data processing module is used for constructing a voiceprint recognition model;
the second data processing module is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and the result output module is used for allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
10. A power dispatching system personnel voiceprint recognition and authentication device, connected with a power dispatching system personnel voiceprint recognition and authentication system through a data transmission path so that the device executes the power dispatching system personnel voiceprint recognition and authentication method according to any one of claims 1 to 8, characterized by comprising:
the data acquisition unit is used for acquiring an operation request and a voice signal sent by a user to the power dispatching system;
the model building unit is used for building a voiceprint recognition model;
the judging unit is used for matching the received user voice signal with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and the output unit is used for outputting the judging result of the judging unit, allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310297752.8A CN116229988A (en) | 2023-03-23 | 2023-03-23 | Voiceprint recognition and authentication method, system and device for personnel of power dispatching system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116229988A (en) | 2023-06-06 |
Family
ID=86587418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310297752.8A Pending CN116229988A (en) | 2023-03-23 | 2023-03-23 | Voiceprint recognition and authentication method, system and device for personnel of power dispatching system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116229988A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |