CN116312561A - Method, system and device for voiceprint recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system

Publication number
CN116312561A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310297886.XA
Other languages
Chinese (zh)
Inventor
崔兆阳
衷宇清
张雄威
凌健文
徐武华
蒋盛智
彭丽文
周上
罗慕尧
骆雅菲
刘晨辉
孔嘉麟
陈文文
张思敏
周菲
吴若迪
冯雅雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S40/00 Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
    • Y04S40/20 Information technology specific aspects, e.g. CAD, simulation, modelling, system security

Abstract

The invention provides a method, a system and a device for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel. The method comprises the following steps: a calling user sends an operation request and a voice signal to a dispatcher by telephone; the calling user's voice signal is separated from the mixed voice of the calling user and the dispatcher; noise reduction is performed on the calling user's voice signal; voice enhancement is performed on the calling user's voice signal; the power dispatching system uses a trained voiceprint recognition model to match the calling user's voice signal against the voice signals recorded in advance by personnel holding the operation authority; if the matching succeeds, the calling user is allowed to operate, and if the matching fails, the calling user is not allowed to operate. The invention can accurately recognize the user's voice even under interference from electrical current and noise.

Description

Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to the technical field of voiceprint recognition, and particularly relates to a method, a system and a device for voiceprint recognition, authentication, noise reduction and voice enhancement of personnel in a power dispatching system.
Background
In power dispatching systems, telephone dispatching is a common and basic mode of operation. When a dispatching instruction is received from a calling party by telephone, authenticating the calling party's identity and verifying its authority is a core problem for improving the security and reliability of the dispatching system.
Voiceprint recognition based on the voice signal of the dispatch applicant, who acts as the calling party, is one feasible way to authenticate that party.
Carrying this out involves a series of processes: telephone voice extraction, voice signal preprocessing, deep neural network training based on voice sample signals, and identity judgment and authentication based on the calling party's actual dispatching voice signal.
One factor that strongly affects the success rate and reliability of voiceprint recognition for power dispatching system personnel is the quality of, and the interference in, the voice signal extracted from the dispatching telephone.
When power system dispatching personnel talk over the dispatching telephone, they inevitably encounter both noise from the working environment and noise from the telephone channel. How to suppress these two kinds of noise effectively and perform targeted speech enhancement is therefore a key issue for improving system performance.
Disclosure of Invention
The invention aims to provide a method, a system and a device for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel.
A method for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel comprises the following steps:
a calling user sends an operation request and a voice signal to a dispatcher by telephone;
separating the calling user's voice signal from the mixed voice of the calling user and the dispatcher;
performing noise reduction on the calling user's voice signal;
performing voice enhancement on the calling user's voice signal;
the power dispatching system uses a trained voiceprint recognition model to match the calling user's voice signal against the voice signals recorded in advance by personnel holding the operation authority;
if the matching succeeds, allowing the calling user to operate, and if the matching fails, not allowing the calling user to operate.
Separating the calling user's voice signal from the mixed voice of the calling user and the dispatcher comprises:
acquiring a first voice signal from the telephone terminal;
adding a sidetone cancellation circuit to the transmission line of the power dispatching system, and acquiring a second voice signal from the telephone receiver end;
performing voice signal strength analysis and signal comparison on the first voice signal and the second voice signal using the short-time zero-crossing rate, endpoint detection and the voice energy spectrum, and separating out the calling user's voice signal;
after separation, four voice signals affected by different noises are obtained:
[Equation images BDA0004143901660000021 to BDA0004143901660000024: expressions for the four separated voice signals, not reproduced in this text]
Noise reduction of the calling user's voice signal comprises:
performing noise reduction on the calling user's voice signal by a correlation-feature method:
assuming that the calling user's voice signal is uncorrelated with the calling user's environmental noise and with the telephone transmission channel noise, autocorrelation processing is applied to the noisy signal to obtain an autocorrelation frame sequence of the voice signal without noise:
[Equation image BDA0004143901660000025: relation between $R_y(\tau)$ and $R_s(\tau)$, not reproduced in this text]
where $s(t)$ is the clean speech signal, $n(t)$ is the noise signal, $w(t)$ is the window function applied to achieve short-time stationarity, and $R_y(\tau)$ and $R_s(\tau)$ are the autocorrelation functions of the calling user's voice signal with and without noise, respectively;
performing noise reduction on the voice signal by Wiener filtering:
the output $s'(t)$ of the noisy speech signal after the Wiener filter is required to minimize the mean-square error $E[|s'(t)-s(t)|^2]$. Wiener filtering presupposes a short-time stationary speech signal, which gives the following formula for the Wiener filter:
$$H(\omega)=\frac{P_s(\omega)}{P_s(\omega)+P_n(\omega)}$$
in the above formula, $H(\omega)$ is the frequency-domain response of the Wiener filter, and $P_s(\omega)$ and $P_n(\omega)$ are the signal power spectrum and the noise power spectrum, respectively;
$$S_O(\omega)=H(\omega)\cdot Y(\omega)$$
in the above formula, $S_O(\omega)$ is the output signal spectrum of the Wiener filter and $Y(\omega)$ is the spectrum of the calling user's noisy telephone speech signal.
Voice enhancement of the calling user's voice signal comprises:
the cepstral mean normalization (CMN) method is used to remove the noise component in the cepstrum of telephone voice signals carrying non-additive noise; the enhanced speech cepstrum obtained by CMN processing is expressed as:
$$\hat{C}_s(t)=C_{sn}(t)-\bar{C}_{sn}$$
where $\hat{C}_s(t)$ is the cepstrum of the enhanced speech, $C_{sn}(t)$ is the cepstrum of the noisy speech, $C_s(t)$ is the cepstrum of the clean speech, and $\bar{C}_{sn}$ is the cepstral mean of the speech segments collected from the calling user.
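The CMN step can be illustrated with a minimal sketch. It assumes that CMN subtracts, coefficient by coefficient, the mean cepstrum taken over the caller's collected frames; the function name `cepstral_mean_normalize` and the toy data are hypothetical and not taken from the patent.

```python
def cepstral_mean_normalize(cepstra):
    """Subtract the per-coefficient mean over all frames (assumed form of CMN)."""
    n_frames = len(cepstra)
    n_coeffs = len(cepstra[0])
    # Mean of each cepstral coefficient over the collected speech frames.
    mean = [sum(frame[k] for frame in cepstra) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - mean[k] for k in range(n_coeffs)] for frame in cepstra]


# Toy data: 3 frames x 2 coefficients. A constant channel term is added in the
# cepstral domain, which is how convolutional (non-additive) distortion appears.
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
channel = [0.5, -1.5]
noisy = [[c + h for c, h in zip(frame, channel)] for frame in clean]

enhanced = cepstral_mean_normalize(noisy)
reference = cepstral_mean_normalize(clean)
```

In this toy case the constant channel term disappears entirely, so `enhanced` matches the mean-normalized clean cepstrum. Real channel distortion varies over time, so the removal is only approximate.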
Performing voice signal strength analysis and signal comparison on the first voice signal and the second voice signal using the short-time zero-crossing rate, endpoint detection and the voice energy spectrum, and separating out the calling user's voice signal, comprises:
detecting unvoiced sound with a detection algorithm that combines short-time energy and the short-time zero-crossing rate, and detecting voiced sound with short-time energy;
selecting the corresponding unvoiced model or voiced model according to whether the voice signal is unvoiced or voiced, and performing voice signal endpoint detection to obtain the calling user's voice signal;
wherein selecting the corresponding unvoiced model or voiced model according to whether the voice signal is unvoiced or voiced, and performing voice signal endpoint detection to obtain the calling user's voice signal, comprises:
for unvoiced sound, the excitation is modeled as random white noise, using a sequence with zero mean and unit variance that is white in both time and amplitude;
for voiced sound, the excitation is an intermittent pulse wave whose mathematical expression is:
$$g(n)=\begin{cases}\dfrac{1}{2}\left[1-\cos\left(\dfrac{\pi n}{N_1}\right)\right], & 0\le n\le N_1\\[4pt] \cos\left(\dfrac{\pi (n-N_1)}{2N_2}\right), & N_1\le n\le N_1+N_2\\[4pt] 0, & \text{otherwise}\end{cases}$$
in the above formula, $N_1$ is the duration of the rising part of the skewed triangular wave and $N_2$ is the duration of its falling part;
after the speech signal is divided into frames, the energy of the $n$-th frame of the speech signal $x_n(m)$ can be expressed as:
$$E_n=\sum_{m=0}^{N-1}x_n^2(m)$$
the short-time zero-crossing rate is the number of times the speech waveform within one frame crosses the horizontal axis, i.e. the zero level, and can be expressed as:
$$Z_n=\frac{1}{2}\sum_{m=1}^{N-1}\left|\operatorname{sgn}[x_n(m)]-\operatorname{sgn}[x_n(m-1)]\right|$$
where $\operatorname{sgn}(\cdot)$ is the sign function, and zero crossings are counted by checking whether the sign of the waveform changes between the current sample and the previous sample;
performing energy spectrum estimation on the calling user's voice signal:
after the speech signal is divided into frames, the energy of the $n$-th frame of the speech signal $x_n(m)$ is expressed as:
$$E_n=\sum_{m=0}^{N-1}x_n^2(m)$$
extracting the calling user's voice signal by the autocorrelation method:
the short-time autocorrelation function $R_n(k)$ of the speech signal $x_n(m)$ can be expressed as:
$$R_n(k)=\sum_{m=0}^{N-1-k}x_n(m)\,x_n(m+k),\quad 0\le k\le K$$
where $N$ is the frame length and $K$ is the maximum delay in samples; assuming the speech sequence is periodic, its autocorrelation function is a periodic function with the same period; for a voiced signal, the autocorrelation function can be used to obtain the pitch period of the speech waveform sequence; the peak amplitudes of the autocorrelation functions of a noise signal and of noisy speech differ greatly, so a threshold is set according to the noise level and the endpoint of the voice signal is determined.
The voiceprint recognition model is formed by connecting a convolutional neural network (CNN) and a long short-term memory (LSTM) network in series.
Before the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against the voiceprint information recorded in advance, the method further comprises training the voiceprint recognition model, specifically:
dividing the preprocessed voice signals into a training set and a test set;
inputting the training set into the voiceprint recognition model;
the voiceprint recognition model outputting a judgment result for the voice signal;
iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.
The power dispatching system using the trained voiceprint recognition model to match the received user voice signal against the voice signals recorded in advance by personnel holding the operation authority comprises the following steps:
performing a fast Fourier transform on the noise-reduced calling user voice signal to obtain the spectral features corresponding to each sound source signal;
filtering the spectral features with a Mel filter and then taking the logarithm to obtain the Mel-frequency log energy spectrum corresponding to the calling user's telephone voice signal;
applying a discrete cosine transform to the Mel-frequency log energy spectrum to obtain the Mel coefficient spectrum corresponding to the calling user's voice signal;
performing voiceprint recognition based on the corresponding Mel coefficient spectrum, judging the calling user's identity and authenticating it.
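The feature extraction chain above (Fourier transform, Mel filtering, logarithm, discrete cosine transform) can be sketched for a single frame as follows. This is a minimal illustration, not the patent's implementation: the sampling rate, filter count, number of coefficients, the common 2595/700 Mel formula and all function names are assumptions, and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (an FFT would be used in practice)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_cepstrum(frame, sample_rate=8000, n_filters=12, n_ceps=8):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Mel-frequency log energy spectrum (small floor avoids log(0)).
    log_energy = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
    # DCT-II of the log energies gives the Mel cepstral coefficients.
    return [
        sum(e * math.cos(math.pi * q * (j + 0.5) / n_filters) for j, e in enumerate(log_energy))
        for q in range(n_ceps)
    ]


# One 128-sample frame of a 500 Hz tone sampled at 8 kHz (telephone-band rate).
frame = [math.sin(2 * math.pi * 500 * t / 8000) for t in range(128)]
coeffs = mel_cepstrum(frame)
```

In a full pipeline these per-frame coefficient vectors would be stacked over time and passed to the recognition model. A windowing step before the transform is common practice but is omitted here for brevity.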
A system for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel, comprising:
a receiving module, configured to receive an operation request and a voice signal sent by a calling user to a dispatcher by telephone;
a first data processing module, configured to separate the calling user's voice signal from the mixed voice of the calling user and the dispatcher;
a second data processing module, configured to match, using a trained voiceprint recognition model, the calling user's voice signal against the voice signals recorded in advance by personnel holding the operation authority;
a result output module, configured to allow the user to operate if the matching succeeds and not to allow the user to operate if the matching fails.
A device for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel, connected to the power dispatching system personnel voiceprint recognition and authentication system through a data transmission path so that the device executes the method for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel, comprising:
a data acquisition unit, configured to receive an operation request and a voice signal sent by a calling user to a dispatcher by telephone;
a data processing unit, configured to separate the calling user's voice signal from the mixed voice of the calling user and the dispatcher;
a judging unit, configured to match, using a trained voiceprint recognition model, the calling user's voice signal against the voice signals recorded in advance by personnel holding the operation authority;
an output unit, configured to allow the user to operate if the matching succeeds and not to allow the user to operate if the matching fails.
According to the invention, a calling user sends an operation request and a voice signal to a dispatcher by telephone; the calling user's voice signal is separated from the mixed voice of the calling user and the dispatcher; noise reduction and then voice enhancement are performed on the calling user's voice signal; the power dispatching system uses a trained voiceprint recognition model to match the calling user's voice signal against the voice signals recorded in advance by personnel holding the operation authority; if the matching succeeds, the calling user is allowed to operate, and if the matching fails, the calling user is not allowed to operate. Telephone voice signals can be extracted from the input end and the microphone end of the dispatching telephone at the same time, and voices that do not belong to the calling party are removed by comparing the voice at the telephone input end with the voice at the microphone end, which improves the purity of the extracted user voice signal. The processed user voice signal allows the voiceprint recognition model to judge the user's voice information more accurately, which reduces the dispatcher's workload and improves dispatching efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the method for obtaining clean user speech signals according to the present invention;
FIG. 3 is a flow chart of the voiceprint recognition model training of the present invention;
FIG. 4 is a flowchart illustrating the operation of the voiceprint recognition model of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that all directional indications (such as up, down, left, right, front and rear) in the embodiments of the present invention are used only to explain the relative positional relationship, movement and the like between components in a particular posture (as shown in the drawings); if that posture changes, the directional indication changes accordingly.
Furthermore, descriptions such as "first" and "second" in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.
Dispatching voice is the most direct way for dispatchers to issue dispatching commands and the most common carrier of dispatching information. As the level of artificial intelligence rises, an intelligent dispatching voice processing platform is increasingly needed to recognize, analyze and diagnose the various kinds of dispatching voice information and to help dispatchers respond promptly, judge accurately and analyze efficiently. Time-frequency analysis is a common approach in the field of acoustic signal processing. However, the acoustic signals of working dispatchers are inevitably affected by electrical current, noise interference and the like, so the signals monitored at different times vary and exhibit broadband non-stationary characteristics; their time-frequency characteristics are complex, and it is difficult to distinguish the dispatchers' different working states by analyzing the signals directly. How to improve the accuracy of identifying the dispatchers' working state is therefore a problem to be solved.
Voice recognition methods based on neural networks are easily disturbed by external environmental noise and by other people's voices, which makes the recognition results inaccurate. The present method removes the interference of external environmental noise and other voices to obtain a clean target voice signal, which improves the recognition accuracy of the voiceprint recognition model. In addition, the features extracted by a single convolutional network model are limited, which leads to inaccurate recognition results; the voiceprint recognition model here combines a convolutional neural network with a long short-term memory network, which greatly improves voice recognition accuracy.
Example 1
A method for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel comprises the following steps:
S100, a calling user sends an operation request and a voice signal to a dispatcher by telephone;
S200, the calling user's voice signal is separated from the mixed voice of the calling user and the dispatcher;
S300, noise reduction is performed on the calling user's voice signal;
S400, voice enhancement is performed on the calling user's voice signal;
S500, the power dispatching system uses a trained voiceprint recognition model to match the calling user's voice signal against the voice signals recorded in advance by personnel holding the operation authority;
S600, if the matching succeeds, the calling user is allowed to operate, and if the matching fails, the calling user is not allowed to operate.
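The matching and decision steps S500 and S600 can be sketched as follows. The sketch assumes, purely for illustration, that the trained model turns each utterance into a fixed-length embedding vector and that matching is done by cosine similarity against enrolled embeddings with a fixed threshold; the text does not specify the matching rule, and the names `authorize`, `enrolled` and the threshold 0.8 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def authorize(caller_embedding, enrolled, threshold=0.8):
    """Return (allowed, best_id): match against enrolled voiceprints, then decide."""
    best_id, best_score = None, -1.0
    for person_id, reference in enrolled.items():
        score = cosine_similarity(caller_embedding, reference)
        if score > best_score:
            best_id, best_score = person_id, score
    if best_score >= threshold:
        return True, best_id      # matching succeeded: operation allowed
    return False, None            # matching failed: operation refused


# Hypothetical enrolled embeddings of two people holding the operation authority.
enrolled = {"operator_a": [1.0, 0.0, 0.0], "operator_b": [0.0, 1.0, 0.0]}
accepted = authorize([0.9, 0.1, 0.0], enrolled)
rejected = authorize([1.0, 1.0, 1.0], enrolled)
```

The first query lies close to `operator_a` and is accepted; the second is equally far from both enrolled voiceprints and is refused. In a deployment the threshold would have to be tuned against measured false-accept and false-reject rates.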
S200, separating the calling user's voice signal from the mixed voice of the calling user and the dispatcher, comprises the following steps:
S201, a first voice signal is acquired from the telephone terminal;
S202, a sidetone cancellation circuit is added to the transmission line of the power dispatching system, and a second voice signal is acquired from the telephone receiver end;
S203, voice signal strength analysis and signal comparison are performed on the first voice signal and the second voice signal using the short-time zero-crossing rate, endpoint detection and the voice energy spectrum, and the calling user's voice signal is separated out;
after separation, four voice signals affected by different noises are obtained:
[Equation images BDA0004143901660000061 to BDA0004143901660000064: expressions for the four separated voice signals, not reproduced in this text]
Taking the noise characteristics of the system into account, the signals can be amplified appropriately according to the strength of their energy spectra, so that the caller and callee signal strengths at the different acquisition ends are close to each other. The telephone noise can then be extracted by cancelling the telephone-end caller signal against the receiver-end caller signal, and likewise by cancelling the telephone-end callee signal against the receiver-end callee signal. After the telephone noise extracted in this way is removed, the noise during silent periods of the current call, $n_{\text{caller ambient noise + telephone transmission channel noise}}$, can be obtained. Comparing this noise with the expression for the caller signal at the telephone end, the noise influence can be decomposed as:
[Equation image BDA0004143901660000065: decomposition of the noise influence, not reproduced in this text]
The method can greatly improve the noise suppression characteristics of the voice signals collected by telephone and improve the accuracy of the subsequent voiceprint recognition of the calling party.
Endpoint detection can be implemented with several different methods, such as the double-threshold method, the autocorrelation method, the spectral entropy method, the scaling method and the log spectral distance method.
Double-threshold method: short-time energy detection distinguishes voiced sound from silence well. Unvoiced sound, however, has low energy and is misjudged as silence because it falls below the energy threshold in short-time energy detection; short-time zero-crossing rate detection can in turn distinguish silence from unvoiced sound. Combining the two makes it possible to detect speech segments and silent segments.
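The idea of the double-threshold method can be sketched with a simplified frame classifier. The sketch assumes a frame counts as speech when its short-time energy exceeds an energy threshold (voiced) or, failing that, its zero-crossing count exceeds a second threshold (unvoiced); the threshold values, frame length and the synthetic test signal are hypothetical, and practical detectors add hangover and smoothing logic that is omitted here.

```python
import math


def split_frames(signal, frame_len, hop):
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    return sum(x * x for x in frame)


def sgn(x):
    return 1 if x >= 0 else -1


def zero_crossing_rate(frame):
    # Half the summed sign differences equals the number of sign changes.
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1])) for m in range(1, len(frame))) / 2


def detect_speech(signal, frame_len=80, hop=80, energy_thr=1.0, zcr_thr=20):
    """Flag each frame as speech (True) or silence (False)."""
    flags = []
    for frame in split_frames(signal, frame_len, hop):
        voiced = short_time_energy(frame) > energy_thr
        unvoiced = zero_crossing_rate(frame) > zcr_thr
        flags.append(voiced or unvoiced)
    return flags


# Synthetic call: silence, a voiced-like tone, an unvoiced-like low-energy
# rapidly alternating segment, then silence again (160 samples each).
silence = [0.0] * 160
voiced = [math.sin(2 * math.pi * n / 40) for n in range(160)]
unvoiced = [0.05 * (-1) ** n for n in range(160)]
flags = detect_speech(silence + voiced + unvoiced + silence)
```

The unvoiced-like frames fall below the energy threshold and are recovered only through the zero-crossing test, which is the case the double-threshold method is meant to handle.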
Autocorrelation method: the short-time autocorrelation function $R_n(k)$ of the speech signal $x_n(m)$ can be expressed as:
$$R_n(k)=\sum_{m=0}^{N-1-k}x_n(m)\,x_n(m+k),\quad 0\le k\le K$$
where $N$ is the frame length and $K$ is the maximum delay in samples.
Assuming the speech sequence is periodic, its autocorrelation function is also a periodic function with the same period. For a voiced signal, the autocorrelation function can be used to find the pitch period of the speech waveform sequence. The peak amplitudes of the autocorrelation functions of a noise signal and of noisy speech differ greatly, so an appropriate threshold is set according to the noise level to judge whether a speech signal is present and to determine the endpoint of the speech signal.
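A minimal sketch of the autocorrelation method follows. It assumes the common unnormalized short-time autocorrelation summed over the overlapping part of the frame; the lag search range, the normalization by the zero-lag value, the threshold 0.5 and the function names are illustrative assumptions.

```python
import math
import random


def short_time_autocorr(frame, max_lag):
    """R(k) = sum over m of x(m) * x(m + k), for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k)) for k in range(max_lag + 1)]


def pitch_period(frame, min_lag, max_lag):
    """Lag of the largest autocorrelation peak inside the search range."""
    r = short_time_autocorr(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda k: r[k])


def is_voiced(frame, min_lag, max_lag, threshold=0.5):
    """Compare the normalized autocorrelation peak with a noise-dependent threshold."""
    r = short_time_autocorr(frame, max_lag)
    if r[0] <= 0:
        return False
    return max(r[min_lag:max_lag + 1]) / r[0] > threshold


# A periodic frame with a 20-sample period, and a white-noise frame.
tone = [math.sin(2 * math.pi * m / 20) for m in range(200)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

period = pitch_period(tone, min_lag=10, max_lag=60)
tone_is_voiced = is_voiced(tone, 10, 60)
noise_is_voiced = is_voiced(noise, 10, 60)
```

The periodic frame produces a strong peak at its own period, while white noise has no comparable peak away from lag zero, which is what the threshold test exploits.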
Log spectral distance method: let the noisy speech signal be $x(n)$, let $x_i(m)$ be the $i$-th frame obtained after windowing and framing, and let the frame length be $N$. Applying the FFT (fast Fourier transform) to $x_i(m)$ gives:
$$X_i(k)=\sum_{m=0}^{N-1}x_i(m)\,e^{-j2\pi km/N},\quad k=0,1,\dots,N-1$$
Taking the modulus of the spectrum $X_i(k)$ and then the logarithm gives:
$$\hat{X}_i(k)=20\log_{10}\left|X_i(k)\right|$$
Because the energy spectra of the noise signal and the noisy speech signal differ significantly (the energy spectrum of the noise signal is much lower than that of the noisy speech signal), the endpoint of the speech signal can be determined from the log spectral difference between the two frames.
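The log spectral distance test can be sketched as below. The sketch assumes the log spectrum is taken as 20·log10 of the DFT magnitude and that the distance between a frame and a noise reference frame is the root-mean-square difference of their log spectra; both choices, the function names and the test signals are assumptions made for illustration. A naive DFT replaces the FFT so that only the standard library is needed.

```python
import cmath
import math
import random


def log_spectrum(frame):
    """20*log10 of the DFT magnitude for bins 0..N/2 (assumed scaling)."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        xk = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        out.append(20.0 * math.log10(abs(xk) + 1e-10))  # floor avoids log10(0)
    return out


def log_spectral_distance(frame, noise_frame):
    """RMS difference between the log spectra of two frames."""
    a, b = log_spectrum(frame), log_spectrum(noise_frame)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Reference noise frame, and a "speech" frame made of the same noise plus a tone.
rng = random.Random(1)
noise = [0.01 * rng.gauss(0.0, 1.0) for _ in range(64)]
speech = [v + math.sin(2 * math.pi * 4 * m / 64) for m, v in enumerate(noise)]

d_noise = log_spectral_distance(noise, noise)
d_speech = log_spectral_distance(speech, noise)
```

A frame is then declared speech when its distance to the noise reference exceeds a threshold chosen from the noise level; the reference itself is typically estimated from leading frames assumed to contain no speech.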
By combining the short-time zero-crossing rate, endpoint detection and energy spectrum judgment, and comparing the voice signals extracted in the two different ways, the calling party's voice signal in the power dispatching system can be extracted effectively. It is then used for training the deep-learning neural network framework for voiceprint recognition and for voiceprint-based identity judgment and authentication of the calling party.
During processing, because of the telephone sidetone cancellation circuit, the voice signal obtained at the receiver interface of the telephone handset shows a clear difference in strength between the calling party and the called party; combined with the short-time zero-crossing rate and endpoint detection, the calling-party and called-party signals can be effectively segmented and extracted.
S300, noise reduction is carried out on the voice signal of the calling user, and specifically, the method comprises the following steps:
noise reduction is carried out on the voice signal of the calling user by adopting a relevant characteristic method:
assuming that the calling user voice signal is mutually incoherent with the calling user environment noise and the telephone transmission channel noise, carrying out autocorrelation processing on the noisy signal to obtain an autocorrelation frame sequence similar to the voice signal without noise:
Figure BDA0004143901660000074
where s(t) is the clean speech signal, n(t) is the noise signal, w(t) is the window function applied to achieve short-time stationarity, and R_y(τ) and R_s(τ) are the autocorrelation functions of the caller's noisy and noise-free speech signals, respectively;
performing noise reduction on the voice signal using Wiener filtering:
the output s'(t) of the noisy speech signal after the Wiener filter minimizes the mean-square error E[|s'(t) - s(t)|^2]. Wiener filtering presupposes a short-time stationary speech signal, and the filter is given by:
Figure BDA0004143901660000075
in the above formula, H(ω) is the frequency response of the Wiener filter, and P_s(ω) and P_n(ω) are the signal power spectrum and the noise power spectrum, respectively;
S_O(ω) = H(ω)·Y(ω)
where S_O(ω) is the output signal spectrum of the Wiener filter and Y(ω) is the spectrum of the caller's noisy telephone speech signal.
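The Wiener filtering step can be illustrated with a minimal sketch. The filter formula itself appears only as an image in the source, so the sketch assumes the standard frequency-domain form H(ω) = P_s(ω) / (P_s(ω) + P_n(ω)). The function names `wiener_gain` and `apply_wiener` and the toy spectra are hypothetical and not taken from the patent.

```python
def wiener_gain(p_s, p_n):
    """Per-bin gain, ASSUMING the standard form H = P_s / (P_s + P_n)."""
    gains = []
    for ps, pn in zip(p_s, p_n):
        total = ps + pn
        # Guard against empty bins where both power estimates are zero.
        gains.append(ps / total if total > 0.0 else 0.0)
    return gains


def apply_wiener(y_spec, p_s, p_n):
    """Compute S_O(w) = H(w) * Y(w) for a list of complex spectrum bins."""
    return [h * y for h, y in zip(wiener_gain(p_s, p_n), y_spec)]


# Toy example: three frequency bins of a noisy spectrum Y(w).
y = [1 + 1j, 2 + 0j, 0.5j]
p_s = [4.0, 1.0, 0.0]   # assumed speech power spectrum estimate
p_n = [1.0, 1.0, 1.0]   # assumed noise power spectrum estimate
s_o = apply_wiener(y, p_s, p_n)
```

Bins dominated by speech power are passed almost unchanged, while bins where only noise power is estimated are suppressed to zero. In practice the two power spectra would have to be estimated, for example from speech-free frames, which this sketch does not cover.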
In the voiceprint recognition system for the power dispatching telephone, noise is introduced by the dispatcher's working environment, by the dispatching telephone's transmission channel and by the telephone itself. As a result, the caller's voice signal collected for voiceprint recognition is degraded and carries considerably more background and interference noise than the voice signals used when training on large-scale speech samples, which greatly lowers the voiceprint recognition rate.
To improve the voiceprint recognition success rate, the background interference in the caller's voice signal, the interference of the telephone transmission channel and the interference introduced by the telephone itself must be reduced as far as possible.
The available noise reduction and speech enhancement methods are as follows:
Active noise reduction: this method relies on the superposition principle of sound waves, removing noise through mutual cancellation. A sound with the same spectrum as the noise but with opposite phase is generated and added, which cancels the noise. The difficulty is that the noise overlaps the spectrum of the speech signal, so it is hard to find a sound with exactly opposite phase and to carry out the cancellation.
Feature extraction methods for speaker identification can be grouped as follows. Methods without noise compensation are classified by: high-level versus low-level features, type of transform, speech-production versus auditory-system basis, type of feature extraction technique, time variability, and speech processing technique. Noise-compensating feature extraction methods are divided into noise-masking features, feature normalization methods and feature compensation methods.
Non-negative matrix factorization (NMF) with sparsity constraints: this approach applies non-negative matrix factorization to magnitude-spectrum or Mel-spectrum features, here using the Mel spectrum as the data matrix. Existing sparsity-constrained NMF algorithms use fixed noise and speech dictionaries, so denoising performance degrades when the noise in the noisy speech does not match the noise dictionary.
Spectral subtraction combined with the ideal binary mask (IBM): the speech to be enhanced is masked first, and the noise is then removed by spectral subtraction.
Noise reduction aims to separate the ambient audio, the telephone channel signal and the telephone interference from the speaker's voice, yielding a cleaner caller voice signal.
After noise reduction, the collected caller voice is matched by voiceprint recognition against the pre-recorded speaker audio.
S400, performing voice enhancement on the calling user's voice signal, comprising:
using cepstral mean normalization (CMN) to remove the noise components in the cepstrum of a telephone voice signal carrying non-additive noise; the enhanced speech cepstrum obtained by CMN is expressed as:
Figure BDA0004143901660000081
where
Figure BDA0004143901660000082
is the cepstrum of the enhanced speech, C_sn(t) is the cepstrum of the noisy speech, C_s(t) is the cepstrum of the clean speech, and
Figure BDA0004143901660000083
is the cepstral mean of the speech segment collected from the caller;
Homomorphic filtering: additive noise can be handled by linear processing, whereas non-additive noise can be handled by homomorphic filtering. Because cepstral representations are widely used in speech signal processing, noise reduction can be built on cepstral processing. After a convolved signal passes through the homomorphic system, the convolution becomes a sum of complex cepstra, so multiplicative (convolutional) noise can be separated. Pitch parameters are then extracted from the complex cepstrum and the corresponding formants are obtained by spectral analysis, from which the noise-reduced voice signal can be recovered. Noise components in the cepstrum of the caller's telephone voice signal carrying non-additive noise can be removed by cepstral mean normalization (CMN), which improves voice quality.
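The effect of CMN can be shown with a small sketch. It assumes the common formulation in which the mean of each cepstral coefficient over the collected speech segment is subtracted from every frame; the source gives its expression only as an image, so this form is an assumption. The function name `cepstral_mean_normalize` and the toy cepstra are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames (assumed CMN form).

    frames: list of cepstral vectors, one per frame, all the same length.
    Returns (normalized_frames, mean_vector).
    """
    n_frames = len(frames)
    n_coef = len(frames[0])
    mean = [sum(f[k] for f in frames) / n_frames for k in range(n_coef)]
    normalized = [[f[k] - mean[k] for k in range(n_coef)] for f in frames]
    return normalized, mean


# A fixed channel adds a constant offset to every cepstral frame, because
# convolution in time becomes addition in the cepstral domain.
clean = [[1.0, 2.0], [3.0, 6.0]]
channel = [5.0, -3.0]                       # hypothetical channel cepstrum
noisy = [[c + h for c, h in zip(f, channel)] for f in clean]

norm_clean, _ = cepstral_mean_normalize(clean)
norm_noisy, _ = cepstral_mean_normalize(noisy)
```

Both inputs normalize to the same frames, which is why a stationary telephone channel drops out after CMN. Noise that changes within the segment is not removed by this step.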
S203, performing voice signal strength analysis and signal comparison on the first and second voice signals using the short-time zero-crossing rate, endpoint detection and the voice energy spectrum, and separating out the calling user's voice signal, comprises:
S2031, detecting unvoiced sound with a detection algorithm that combines short-time energy with the short-time zero-crossing rate, and detecting voiced sound with short-time energy;
S2032, selecting the corresponding unvoiced or voiced model according to whether the voice signal is unvoiced or voiced, and performing voice signal endpoint detection to obtain the calling user's voice signal.
In S2032, selecting the model and performing endpoint detection comprises:
for unvoiced sound, the excitation is modelled as random white noise, that is, a sequence with zero mean and unit variance that is white in time and amplitude;
for voiced sound, the excitation is an intermittent pulse train whose mathematical expression is:
Figure BDA0004143901660000091
in the above formula, N1 is the duration of the rising part of the oblique triangular wave and N2 is the duration of its falling part;
after the speech signal is divided into frames, the energy of the n-th frame x_n(m) can be expressed as:
Figure BDA0004143901660000092
the short-time zero-crossing rate is the number of times the speech waveform within one frame crosses the horizontal axis (zero level), and can be expressed as:
Figure BDA0004143901660000093
where sgn() is the sign function; zero crossings are counted by checking whether the sign of the waveform changes between the current sample and the previous sample;
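Short-time energy and zero-crossing rate can be combined into a simple endpoint detector, sketched below. The sketch assumes the usual definitions (energy as the sum of squared samples in a frame, zero-crossing count as half the summed absolute sign differences), since the patent's own formulas appear only as images. The thresholds, frame sizes and the function name `detect_endpoints` are hypothetical.

```python
def frame_signal(x, frame_len, hop):
    """Split a sample list into frames of frame_len samples, hop apart."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(v * v for v in frame)


def sgn(v):
    """Sign function with sgn(0) taken as +1."""
    return 1 if v >= 0 else -1


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in one frame."""
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1]))
               for m in range(1, len(frame))) // 2


def detect_endpoints(x, frame_len, hop, energy_thr, zcr_thr):
    """Return (first, last) index of frames flagged as speech, or None.

    A frame is flagged when its energy exceeds energy_thr (voiced-like) or
    its zero-crossing count exceeds zcr_thr (unvoiced-like).
    """
    active = [i for i, f in enumerate(frame_signal(x, frame_len, hop))
              if short_time_energy(f) > energy_thr
              or zero_crossing_count(f) > zcr_thr]
    return (active[0], active[-1]) if active else None


# Toy signal: silence, a burst of alternating samples, then silence.
signal = [0.0] * 8 + [1.0, -1.0] * 4 + [0.0] * 8
endpoints = detect_endpoints(signal, frame_len=4, hop=4,
                             energy_thr=1.0, zcr_thr=2)
```

With non-overlapping frames of four samples, the burst occupies frames 2 and 3, so those are returned as the start and end frames. Real telephone speech would need overlapping frames and thresholds tuned to the measured noise level.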
energy spectrum estimation is performed on the calling user's voice signal:
after the speech signal is divided into frames, the energy of the n-th frame x_n(m) is expressed as:
Figure BDA0004143901660000094
the calling user's voice signal is extracted by an autocorrelation method:
the short-time autocorrelation function R_n(k) of the speech signal x_n(m) can be expressed as:
Figure BDA0004143901660000095
where K is the maximum lag in samples. If the speech sequence is periodic, its autocorrelation function is periodic with the same period, so for a voiced signal the autocorrelation function can be used to obtain the pitch period of the speech waveform. The peak amplitude of the autocorrelation function of a noise signal differs greatly from that of noisy speech, so a threshold set according to the noise level can be used to determine the endpoint between noise and speech.
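The pitch-period estimate described above can be sketched as follows. The sketch assumes the usual short-time autocorrelation R_n(k) = sum over m of x_n(m)·x_n(m+k) for lags 0 to K, because the source formula is only available as an image. The function names, the lag search range and the synthetic test frame are hypothetical.

```python
import math


def short_time_autocorr(frame, max_lag):
    """R(k) = sum_m frame[m] * frame[m + k] for k = 0 .. max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def pitch_period(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the largest autocorrelation value."""
    r = short_time_autocorr(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda k: r[k])


# Synthetic "voiced" frame: a sinusoid with a period of 20 samples.
frame = [math.sin(2.0 * math.pi * m / 20.0) for m in range(200)]
r = short_time_autocorr(frame, 30)
period = pitch_period(frame, min_lag=5, max_lag=30)
```

The search starts at a minimum lag so that the trivial peak at lag 0 is skipped. Comparing the peak value r[period] with r[0] gives a normalized periodicity measure that could be thresholded to tell voiced speech from noise, in the spirit of the threshold mentioned in the text.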
The voiceprint recognition model consists of a convolutional neural network (CNN) connected in series with a long short-term memory (LSTM) network.
Before step S500, in which the power dispatching system uses the trained voiceprint recognition model to match the received user voice signal against the pre-recorded voiceprint information, the method further includes step S410 of training the voiceprint recognition model, specifically:
S411, dividing the preprocessed voice signals into a training set and a test set;
S412, inputting the training set into the voiceprint recognition model;
S413, the voiceprint recognition model outputting a judgment result for the voice signal;
S414, iteratively training the voiceprint recognition model until the error rate is below a preset value.
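The control flow of steps S411 to S414 (split, train, evaluate, repeat until the error rate falls below a preset value) can be sketched as below. To keep the sketch self-contained, a single-layer perceptron stands in for the CNN and LSTM model of the patent; it is not the patent's model. The feature vectors, labels, ratios and thresholds are all hypothetical.

```python
import random


def split_dataset(samples, test_ratio=0.25, seed=0):
    """S411: shuffle and split into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def predict(w, b, x):
    """S413: stand-in model decision (1 = authorized speaker, 0 = other)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def error_rate(w, b, data):
    """Fraction of samples whose predicted label differs from the true one."""
    return sum(predict(w, b, x) != y for x, y in data) / len(data)


def train_until(train, max_error=0.05, max_epochs=100, lr=0.1):
    """S412/S414: iterate until the training error rate is below max_error."""
    w, b = [0.0] * len(train[0][0]), 0.0
    for _ in range(max_epochs):
        for x, y in train:
            err = y - predict(w, b, x)           # perceptron update rule
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
        if error_rate(w, b, train) < max_error:
            break
    return w, b


# Hypothetical two-dimensional features for two well-separated classes.
data = [([2.0 + 0.1 * i, 2.0], 1) for i in range(8)]
data += [([-2.0 - 0.1 * i, -2.0], 0) for i in range(8)]
train_set, test_set = split_dataset(data)
weights, bias = train_until(train_set)
test_error = error_rate(weights, bias, test_set)
```

The `max_epochs` cap is an added safeguard so that training stops even if the preset error rate is never reached; the patent text does not mention such a limit.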
S500, the power dispatching system using the trained voiceprint recognition model to match the received user voice signal against the voice signal pre-recorded by personnel with operation authority, comprises:
S501, performing a fast Fourier transform on the noise-reduced calling user voice signal to obtain the spectral features corresponding to each sound source signal;
Because speech is a non-stationary process, the standard Fourier transform, which suits periodic, transient or stationary random signals, cannot represent it directly; the spectrum should instead be computed with the short-time Fourier transform. The resulting spectrum is called the short-time spectrum.
S502, passing the spectral features through a Mel filter bank and taking the logarithm to obtain the Mel-frequency log energy spectrum of the calling user's telephone voice signal;
S503, applying a discrete cosine transform to the Mel-frequency log energy spectrum to obtain the Mel coefficient spectrum of the calling user's voice signal;
S504, performing voiceprint recognition based on the Mel coefficient spectrum, and judging and authenticating the calling user's identity.
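Steps S501 to S503 correspond to the usual Mel-frequency cepstral coefficient pipeline, sketched below for a single frame. To stay dependency-free, the sketch uses a direct DFT instead of an FFT, and it assumes the common Mel formula mel = 2595·log10(1 + f/700), triangular filters and a DCT-II. The filter count, coefficient count, sample rate and all function names are hypothetical choices, not values from the patent.

```python
import cmath
import math


def power_spectrum(frame):
    """S501: one-sided power spectrum via a direct DFT (stand-in for FFT)."""
    n = len(frame)
    return [abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n)
                    for m in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    freqs = [mel_to_hz(mel_max * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in freqs]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * n_bins
        for b in range(lo, mid):              # rising edge
            filt[b] = (b - lo) / (mid - lo)
        for b in range(mid, hi):              # falling edge
            filt[b] = (hi - b) / (hi - mid)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, n_filters=8, n_coeffs=5):
    """S502 + S503: Mel filtering, logarithm, then DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small floor avoids log(0) for filters that capture no energy.
    log_e = [math.log(sum(w * p for w, p in zip(f, spec)) + 1e-10)
             for f in bank]
    return [sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for i in range(n_coeffs)]


# One 64-sample frame of a 1 kHz tone sampled at 8 kHz (telephone band).
tone = [math.sin(2.0 * math.pi * 1000.0 * m / 8000.0) for m in range(64)]
coeffs = mfcc(tone, sample_rate=8000)
```

The frame is assumed to have been cut out, and windowed if desired, by an earlier framing step. The resulting coefficient vectors, one per frame, would be the input to the recognition model in S504.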
Example 2
A voiceprint recognition, authentication, noise reduction and voice enhancement system for power dispatching system personnel, comprising:
a receiving module for receiving the operation request and the voice signal sent by a calling user to a dispatcher by telephone;
a first data processing module for separating the calling user's voice signal from the mixed voice of the calling user and the dispatcher;
a second data processing module for matching, by the power dispatching system using a trained voiceprint recognition model, the calling user's voice signal against the voice signal pre-recorded by personnel with operation authority;
and a result output module for allowing the user to operate if the match succeeds and denying operation if the match fails.
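The decision made by the second data processing module and the result output module can be sketched as a score-and-threshold comparison. The patent does not state how the match is scored, so the cosine similarity, the 0.8 threshold, the embedding vectors and the function name `authorize` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def authorize(caller_vec, enrolled, threshold=0.8):
    """Return (person_id, True) for the best enrolled match, else (None, False).

    enrolled maps a person identifier to that person's pre-recorded
    voiceprint vector; threshold is a hypothetical acceptance level.
    """
    best_id, best_score = None, -1.0
    for person_id, vec in enrolled.items():
        score = cosine_similarity(caller_vec, vec)
        if score > best_score:
            best_id, best_score = person_id, score
    if best_score >= threshold:
        return best_id, True      # match succeeded: operation allowed
    return None, False            # match failed: operation denied


# Hypothetical enrolled voiceprints of two authorized operators.
enrolled = {"operator_a": [1.0, 0.0], "operator_b": [0.0, 1.0]}
accepted = authorize([0.9, 0.1], enrolled)
rejected = authorize([1.0, 1.0], enrolled)
```

Rejecting when the best score is below the threshold matters here: a caller who resembles no enrolled person should be denied rather than assigned to the nearest operator.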
Example 3
A voiceprint recognition, authentication, noise reduction and voice enhancement device for power dispatching system personnel, connected through a data transmission path to a power dispatching system personnel voiceprint recognition and authentication system so that the device executes the power dispatching system personnel voiceprint recognition, authentication, noise reduction and voice enhancement method, the device comprising:
a data acquisition unit for receiving the operation request and the voice signal sent by a calling user to a dispatcher by telephone;
a data processing unit for separating the calling user's voice signal from the mixed voice of the calling user and the dispatcher;
a judging unit for matching, by the power dispatching system using a trained voiceprint recognition model, the calling user's voice signal against the voice signal pre-recorded by personnel with operation authority;
and an output unit for allowing the user to operate if the match succeeds and denying operation if the match fails.
According to the invention, a calling user sends an operation request and a voice signal to a dispatcher by telephone; the calling user's voice signal is separated from the mixed voice of the calling user and the dispatcher; the calling user's voice signal is noise-reduced and then enhanced; the power dispatching system uses a trained voiceprint recognition model to match the calling user's voice signal against the voice signal pre-recorded by personnel with operation authority; if the match succeeds the calling user is allowed to operate, and if it fails the calling user is not. Telephone voice signals can be extracted from the input end and the microphone end of the dispatching telephone simultaneously, and voice that does not belong to the caller is removed by comparing the two. This improves the purity of the extracted user voice signal, lets the voiceprint recognition model judge the user's voice more accurately, reduces the dispatcher's workload and improves dispatching efficiency.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A voiceprint recognition, authentication, noise reduction and voice enhancement method for power dispatching system personnel, characterized by comprising the following steps:
the calling user sends operation request and voice signal to dispatcher through telephone;
separating a calling user voice signal from mixed voices of a calling user and a dispatcher;
noise reduction is carried out on the voice signal of the calling subscriber;
performing voice enhancement on a voice signal of a calling user;
the power dispatching system matches the voice signal of the calling user with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model;
and if the matching is successful, allowing the calling user to operate, and if the matching is unsuccessful, not allowing the calling user to operate.
2. The method of claim 1, wherein the step of separating the caller's voice signal from the mixed voice of the caller and the dispatcher comprises:
acquiring a first voice signal from the telephone terminal;
adding a sidetone cancellation circuit to the transmission line of the power dispatching system, and acquiring a second voice signal from the telephone receiver end;
performing voice signal strength analysis and signal comparison on the first and second voice signals using the short-time zero-crossing rate, endpoint detection and the voice energy spectrum, and separating out the calling user's voice signal;
after separation, four voice signals affected by different noise are obtained:
Figure FDA0004143901630000011
Figure FDA0004143901630000012
Figure FDA0004143901630000013
Figure FDA0004143901630000014
3. The method for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel according to claim 1, wherein performing noise reduction on the calling user's voice signal comprises:
performing noise reduction on the calling user's voice signal using a correlation-feature method:
assuming that the calling user's voice signal is uncorrelated with both the ambient noise at the caller's side and the telephone transmission channel noise, autocorrelation processing is applied to the noisy signal to obtain an autocorrelation frame sequence that approximates that of the noise-free voice signal:
Figure FDA0004143901630000015
where s(t) is the clean speech signal, n(t) is the noise signal, w(t) is the window function applied to achieve short-time stationarity, and R_y(τ) and R_s(τ) are the autocorrelation functions of the caller's noisy and noise-free speech signals, respectively;
performing noise reduction on the voice signal using Wiener filtering:
the output s'(t) of the noisy speech signal after the Wiener filter minimizes the mean-square error E[|s'(t) - s(t)|^2]. Wiener filtering presupposes a short-time stationary speech signal, and the filter is given by:
Figure FDA0004143901630000021
in the above formula, H(ω) is the frequency response of the Wiener filter, and P_s(ω) and P_n(ω) are the signal power spectrum and the noise power spectrum, respectively;
S_O(ω) = H(ω)·Y(ω)
where S_O(ω) is the output signal spectrum of the Wiener filter and Y(ω) is the spectrum of the caller's noisy telephone speech signal.
4. The method for voice print recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel according to claim 1, wherein the voice enhancement of the calling user voice signal comprises:
using cepstral mean normalization (CMN) to remove the noise components in the cepstrum of a telephone voice signal carrying non-additive noise; the enhanced speech cepstrum obtained by CMN is expressed as:
Figure FDA0004143901630000022
where
Figure FDA0004143901630000023
is the cepstrum of the enhanced speech, C_sn(t) is the cepstrum of the noisy speech, C_s(t) is the cepstrum of the clean speech, and
Figure FDA0004143901630000024
is the cepstral mean of the speech segment collected from the caller.
5. The method for voice print recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel according to claim 2, wherein the steps of performing voice signal strength analysis and signal comparison on the first voice signal and the second voice signal by using short-time zero-crossing rate, end point detection and voice energy spectrum, and separating the voice signal of the calling party include:
detecting unvoiced sound with a detection algorithm that combines short-time energy with the short-time zero-crossing rate, and detecting voiced sound with short-time energy;
selecting the corresponding unvoiced or voiced model according to whether the voice signal is unvoiced or voiced, and performing voice signal endpoint detection to obtain the calling user's voice signal;
wherein selecting the model and performing endpoint detection comprises:
for unvoiced sound, the excitation is modelled as random white noise, that is, a sequence with zero mean and unit variance that is white in time and amplitude;
for voiced sound, the excitation is an intermittent pulse train whose mathematical expression is:
Figure FDA0004143901630000025
in the above formula, N1 is the duration of the rising part of the oblique triangular wave and N2 is the duration of its falling part;
after the speech signal is divided into frames, the energy of the n-th frame x_n(m) can be expressed as:
Figure FDA0004143901630000031
the short-time zero-crossing rate is the number of times the speech waveform within one frame crosses the horizontal axis (zero level), and can be expressed as:
Figure FDA0004143901630000032
where sgn() is the sign function; zero crossings are counted by checking whether the sign of the waveform changes between the current sample and the previous sample;
energy spectrum estimation is performed on the calling user's voice signal:
after the speech signal is divided into frames, the energy of the n-th frame x_n(m) is expressed as:
Figure FDA0004143901630000033
the calling user's voice signal is extracted by an autocorrelation method:
the short-time autocorrelation function R_n(k) of the speech signal x_n(m) can be expressed as:
Figure FDA0004143901630000034
where K is the maximum lag in samples. If the speech sequence is periodic, its autocorrelation function is periodic with the same period, so for a voiced signal the autocorrelation function can be used to obtain the pitch period of the speech waveform. The peak amplitude of the autocorrelation function of a noise signal differs greatly from that of noisy speech, so a threshold set according to the noise level can be used to determine the endpoint between noise and speech.
6. The method for voiceprint recognition, authentication, noise reduction and voice enhancement of power dispatching system personnel according to claim 1, wherein the voiceprint recognition model consists of a convolutional neural network (CNN) connected in series with a long short-term memory (LSTM) network.
7. The method for voice print recognition, authentication, noise reduction and voice enhancement of personnel in a power dispatching system according to claim 1, wherein before the power dispatching system matches the received user voice signal with the voice print information recorded in advance by using a trained voice print recognition model, the method further comprises training the voice print recognition model, specifically comprises the following steps:
dividing the preprocessed voice signals into a training set and a testing set;
inputting the training set into a voiceprint recognition model;
outputting a judging result of the voice signal by the voiceprint recognition model;
and iteratively training the voiceprint recognition model until the error rate is smaller than a preset value.
8. The method for voice print recognition, authentication, noise reduction and voice enhancement of personnel in a power dispatching system according to claim 1, wherein the step of matching the received user voice signal with a voice signal pre-recorded by the personnel with the authority of the operation by using a trained voice print recognition model comprises the following steps:
performing fast Fourier transform on the noise-reduced calling user voice signals to obtain frequency spectrum characteristics corresponding to each sound source signal;
filtering the spectrum characteristics by a Mel filter and then taking the logarithm to obtain a Mel frequency logarithm energy spectrum corresponding to the telephone voice signal of the calling user;
discrete cosine transforming the mel frequency logarithmic energy spectrum to obtain a mel coefficient spectrum corresponding to the voice signal of the calling user;
and carrying out voiceprint recognition processing based on the corresponding Mel coefficient spectrum, judging the identity of the calling user and authenticating.
9. A power dispatching system personnel voiceprint recognition authentication noise reduction and voice enhancement system, comprising:
the receiving module is used for receiving an operation request and a voice signal sent by a calling user to a dispatcher through a telephone;
the first data processing module is used for separating calling user voice signals from mixed voices of the calling user and the dispatcher;
the second data processing module is used for matching the voice signal of the calling user with the voice signal which is recorded in advance by a person with the operation authority by using a trained voiceprint recognition model by the power dispatching system;
and the result output module is used for allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
10. A voiceprint recognition, authentication, noise reduction and voice enhancement device for power dispatching system personnel, characterized in that the device is connected through a data transmission path to a power dispatching system personnel voiceprint recognition and authentication system so that the device executes the power dispatching system personnel voiceprint recognition, authentication, noise reduction and voice enhancement method of any one of claims 1-8, the device comprising:
the data acquisition unit is used for receiving an operation request and a voice signal sent by a calling user to a dispatcher through a telephone;
the data processing unit is used for separating calling user voice signals from mixed voices of the calling user and the dispatcher;
the judging unit is used for matching the voice signal of the calling user with the voice signal which is recorded in advance by the personnel with the operation authority by using a trained voiceprint recognition model by the power dispatching system;
and the output unit is used for allowing the user to operate if the matching is successful, and not allowing the user to operate if the matching is unsuccessful.
CN202310297886.XA 2023-03-23 2023-03-23 Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system Pending CN116312561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310297886.XA CN116312561A (en) 2023-03-23 2023-03-23 Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310297886.XA CN116312561A (en) 2023-03-23 2023-03-23 Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system

Publications (1)

Publication Number Publication Date
CN116312561A true CN116312561A (en) 2023-06-23

Family

ID=86834007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310297886.XA Pending CN116312561A (en) 2023-03-23 2023-03-23 Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system

Country Status (1)

Country Link
CN (1) CN116312561A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757646A (en) * 2023-08-15 2023-09-15 成都市青羊大数据有限责任公司 Comprehensive management system for teaching
CN116757646B (en) * 2023-08-15 2023-11-10 成都市青羊大数据有限责任公司 Comprehensive management system for teaching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination