CN110415722A - Speech signal processing method, storage medium, computer program and electronic device - Google Patents
Speech signal processing method, storage medium, computer program and electronic device
- Publication number: CN110415722A (application number CN201910674339.2A)
- Authority: CN (China)
- Prior art keywords: transformed spectrum data, frequency spectrum, speech signal
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The embodiments of the present invention provide a speech signal processing method, a computer-readable storage medium, a computer program, and an electronic device. The speech signal processing method includes: obtaining, according to a time-frequency analysis method, first transformed spectrum data of an original speech signal and second transformed spectrum data of an auxiliary speech signal derived from the original speech signal; obtaining transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data; and determining group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data. For an original speech signal, time-frequency analysis can thus be organically combined with the modified group delay (MGD) method, so that the extracted speech features retain both the dynamic resolution and the logarithmic frequency axis of the time-frequency analysis and fuse amplitude information and phase information, giving better detection performance.
Description
Technical field
The embodiments of the present invention relate to speech processing technology, and in particular to a speech signal processing method, a computer-readable storage medium, a computer program, and an electronic device.
Background
Speaker verification is a technology that verifies a person's identity from the sound of his or her speech. It is widely used in many fields; however, it is vulnerable to malicious attack. The main attack techniques currently fall into four categories: impersonation attacks, voice conversion attacks, speech synthesis attacks, and replay attacks. In an impersonation attack, the attacker imitates the voice of the target speaker in an attempt to enter the system. In a voice conversion attack, the attacker converts his or her own voice into a voice similar to the target speaker's by means of a computer algorithm and then carries out the attack. In a speech synthesis attack, the system is attacked directly with a computer-synthesized voice of the target speaker. In a replay attack, the attacker records the target speaker's speech in advance and then plays the recording back through a reproduction device (such as a loudspeaker) to attack the system.
Among these four attacks, the replay attack is very simple to carry out and requires no professional knowledge. At the same time, a large body of existing literature shows that replay attacks have a considerable impact on the security of speaker verification and are a problem in urgent need of a solution.
In order to detect replay attacks accurately and efficiently from captured speech signals, the speech signals need to be analyzed, and the feature data that best reflects replay needs to be extracted.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a speech signal processing scheme that extracts both spectral amplitude feature information and phase feature information from a speech signal, so that speech-related detection can be carried out accurately.
According to a first aspect of the embodiments of the present invention, a speech signal processing method is provided, comprising: obtaining, according to a time-frequency analysis method, first transformed spectrum data of an original speech signal and second transformed spectrum data of an auxiliary speech signal derived from the original speech signal; obtaining transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data; and determining group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data.
Optionally, the obtaining, according to the time-frequency analysis method, of the first transformed spectrum data of the original speech signal and the second transformed spectrum data of the auxiliary speech signal derived from the original speech signal comprises: obtaining the first transformed spectrum data of the original speech signal by a constant-Q transform (CQT) method; and obtaining the auxiliary speech signal of the original speech signal, and obtaining the second transformed spectrum data of the auxiliary speech signal by the constant-Q transform (CQT) method.
Optionally, the obtaining of the transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data comprises:
for the first transformed spectrum data X(f, t) and the corresponding second transformed spectrum data Y(f, t), calculating the transformed-spectrum correction data Y'(f, t) by the following formula:
Y'(f, t) = Y(f, t) − t × T × X(f, t)
where f is the frequency index, t is the time (frame) index, and T is the interval between the start times of two adjacent frames.
Optionally, the determining of the group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data comprises:
calculating the group delay feature spectrum data τx(f, t) of the original speech signal by the following formulas:
τm(f, t) = (XR(f, t) × Y'R(f, t) + XI(f, t) × Y'I(f, t)) / |S(f, t)|^(2γ)
τx(f, t) = sign(τm(f, t)) × |τm(f, t)|^α
where X(f, t) is the first transformed spectrum data of the original speech signal x(n), Y'(f, t) is the transformed-spectrum correction data obtained by correcting the second transformed spectrum data of the auxiliary speech signal n·x(n), XR(f, t) and XI(f, t) are the real and imaginary parts of X(f, t), Y'R(f, t) and Y'I(f, t) are the real and imaginary parts of Y'(f, t), S(f, t) is the cepstrally smoothed first transformed spectrum data obtained from X(f, t) by cepstral smoothing, and α and γ are hyperparameters to be tuned when extracting the features.
Optionally, the method further comprises: according to the group delay feature data, modeling the group delay feature data of genuine speech and the group delay feature data of replayed speech with a model, and carrying out replay attack detection.
According to a second aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of any of the foregoing speech signal processing methods.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, comprising: a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element communicate with one another through the communication bus; and the memory is used to store at least one executable instruction, the executable instruction causing the processor to perform operations corresponding to any of the foregoing speech signal processing methods.
According to a fourth aspect of the embodiments of the present invention, a computer program is provided, comprising computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of any of the foregoing speech signal processing methods.
Through the foregoing speech signal processing method, for an original speech signal, time-frequency analysis is organically combined with the modified group delay (MGD) method, and an MGD spectrum based on the time-frequency analysis is extracted for replay detection. The extracted speech features thus retain both the dynamic resolution and the logarithmic frequency axis of the time-frequency analysis and fuse amplitude information and phase information, and therefore give better detection performance.
Further, since the spectrum of the speech signal obtained through the foregoing processing retains both the dynamic resolution and the logarithmic frequency axis of the time-frequency analysis and fuses amplitude information and phase information, using the group delay feature data to model genuine speech and replayed speech allows replay attack detection to be carried out more accurately.
Brief description of the drawings
Fig. 1 is a flowchart showing a speech signal processing method according to some embodiments of the invention;
Fig. 2 is a waveform diagram comparing the auxiliary speech signal processing according to an embodiment of the present invention with conventional auxiliary speech signal processing;
Fig. 3 is a structural schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In this application, "multiple" means two or more, and "at least one" means one, two or more. For any component, data or structure mentioned in this application, unless it is explicitly limited to one, it should be understood as one or more.
Fig. 1 is a flowchart showing a speech signal processing method according to some embodiments of the invention.
Referring to Fig. 1, in step S110, first transformed spectrum data of an original speech signal and second transformed spectrum data of an auxiliary speech signal derived from the original speech signal are obtained according to a time-frequency analysis method.
Here, the original speech signal is, for example, a speech signal of a certain duration acquired by a speech capture device. In this step, a time-frequency analysis method such as a Fourier-transform method including the short-time Fourier transform (STFT), a wavelet transform method, or the constant-Q transform (CQT) method may be used to obtain the first transformed spectrum data from the original speech signal.
The constant-Q transform (Constant Q Transform, CQT) is an important time-frequency analysis tool that is widely used in music analysis. Recent studies suggest that CQT also performs well in the field of replay detection. CQT is very similar to the short-time Fourier transform (Short Time Fourier Transform, STFT); the difference is that the frequency axis of the spectrum generated by CQT is logarithmic, whereas the frequency axis generated by STFT is linear. In addition, when CQT extracts different frequency components, the length of the analysis window also changes: the window is long in the low-frequency part and short in the high-frequency part. CQT therefore has better frequency resolution than STFT at low frequencies and better time resolution at high frequencies, and compared with spectra generated by Fourier transforms (such as STFT or FFT), it is more suitable for the replay detection task.
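For reference, the logarithmic spacing and varying window length described above follow from the standard definition of the constant-Q transform (this is general CQT background, not specific to the present embodiments): the center frequency of the k-th bin is
f_k = f_min × 2^(k / b)
where b is the number of bins per octave, and the quality factor Q = f_k / Δf_k is the same for every bin, so the analysis window length N_k = Q × f_s / f_k shrinks as the center frequency rises.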
On the other hand, a plain CQT spectrum can usually only make use of the amplitude information of the spectrum and cannot make use of the phase information, so it cannot fully exploit the spectral characteristics of the speech signal for comprehensive and accurate speech feature extraction. For this reason, the group delay method, which reflects the phase characteristics of a speech signal, is considered for extracting the phase features.
The group delay (Group Delay function, GD) is an important concept in signal processing. It represents the time delay of the amplitude envelope of each sinusoidal component of a signal as it passes through the device under measurement, and it is defined as the negative derivative of phase with respect to frequency:
τ(ω) = −dθ(ω)/dω
where ω denotes frequency and θ(ω) denotes the phase at frequency ω. The group delay can also be extracted directly from the spectrum, as follows:
τx(ω) = (XR(ω) × YR(ω) + XI(ω) × YI(ω)) / |X(ω)|²
where X(ω) is the spectrum of the signal x(n), Y(ω) is the spectrum of the auxiliary signal n·x(n), XR(ω) and XI(ω) are the real and imaginary parts of X(ω), and YR(ω) and YI(ω) are the real and imaginary parts of Y(ω). The group delay avoids the phase wrapping problem and is therefore a popular representation of signal phase.
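As a quick illustration of the definition (a standard textbook example, not taken from the present embodiments): for a pure delay x(n) = δ(n − d), the spectrum is X(ω) = e^(−jωd), so θ(ω) = −ωd and τ(ω) = −dθ(ω)/dω = d; the group delay is exactly the delay of d samples at every frequency.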
The modified group delay (Modified Group Delay function, MGD) improves on the traditional group delay method: the spectral energy is smoothed to avoid the near-infinite values caused by low-energy points, and hyperparameters are added to control the dynamic range, which is beneficial for model training. The MGD extraction formulas are as follows:
τm(ω) = (XR(ω) × YR(ω) + XI(ω) × YI(ω)) / |S(ω)|^(2γ)
τ(ω) = sign(τm(ω)) × |τm(ω)|^α
where S(ω) is obtained from X(ω) by cepstral smoothing, sign(·) denotes the sign function, and α and γ are the two hyperparameters of MGD, which need to be tuned for different tasks.
The MGD spectrum uses the FFT as its standard transform, so its resolution is the same in every frequency band. Studies have shown that CQT is more suitable for the replay detection task than FFT. However, a plain CQT spectrum can generally only use amplitude information and cannot use phase information, whereas MGD is a technique that can fuse phase information and amplitude information at the same time. Therefore, according to a preferred embodiment of the present invention, the CQT method and the MGD method are combined in the processing of the speech signal, and an MGD spectrum based on CQT is extracted for replay detection.
Correspondingly, according to an optional embodiment of the present invention, in step S110 the first transformed spectrum data of the original speech signal is obtained by the constant-Q transform (CQT) method; the auxiliary speech signal of the original speech signal is then obtained, and the second transformed spectrum data of the auxiliary speech signal is obtained by the CQT method.
Here, the CQT spectrum data of the original speech signal, referred to here as the first transformed spectrum data, can be calculated by any applicable CQT method. In addition, similarly to the processing in the group delay algorithm, an auxiliary speech signal is first generated from the original speech signal, in order to correct the CQT spectrum of the original speech signal. For example, the auxiliary speech signal is generated for the original speech signal, CQT processing is then applied to the auxiliary speech signal, and the CQT spectrum data of the auxiliary speech signal, referred to here as the second transformed spectrum data, is obtained. In this way, the first transformed spectrum data and the second transformed spectrum data that are subsequently used to calculate the group delay features are obtained.
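A minimal sketch of this step in Python, assuming librosa is used for the CQT (the present embodiments do not name a library, and the hop length, fmin, number of bins and bins per octave below are illustrative choices rather than values from the original):

```python
import numpy as np
import librosa

def cqt_pair(x, sr, hop_length=256, fmin=32.7, n_bins=84, bins_per_octave=12):
    """First/second transformed spectrum data: CQT of x(n) and of the auxiliary signal n*x(n)."""
    n = np.arange(len(x), dtype=np.float64)
    aux = n * x                                   # auxiliary speech signal n*x(n), built from the whole signal
    X = librosa.cqt(x, sr=sr, hop_length=hop_length, fmin=fmin,
                    n_bins=n_bins, bins_per_octave=bins_per_octave)
    Y = librosa.cqt(aux, sr=sr, hop_length=hop_length, fmin=fmin,
                    n_bins=n_bins, bins_per_octave=bins_per_octave)
    return X, Y                                   # complex arrays of shape (n_bins, n_frames)
```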
In step S120, the transformed-spectrum correction data of the original speech signal is obtained based on the first transformed spectrum data and the second transformed spectrum data.
For a speech signal that has been divided into frames, the relationship between the i-th frame signal x(i)(n) and the complete signal x(n) can be expressed as:
x(i)(n) = x(n + i·T)
where T is the frame shift, i.e., the interval between the start times of two adjacent frames.
In the traditional MGD method, the original speech signal is first divided into frames, and the auxiliary speech signal is then calculated for each frame:
y(i)(n) = n × x(i)(n)
Then the spectrum Y is calculated from the auxiliary speech signal y(i)(n) of each frame.
In contrast, in the speech signal processing method proposed by the embodiment of the present invention, the auxiliary speech signal n·x(n) is first obtained from the whole original speech signal x(n), and the auxiliary speech signal n·x(n) is then divided into frames. In this case,
y(i)(n) = (n + i·T) × x(n + i·T) = (n + i·T) × x(i)(n)
Then the CQT spectrum of the auxiliary speech signal n·x(n) is calculated.
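Expanding the last expression shows where the correction applied in the next step comes from (a short consistency check, spelled out here for clarity):
y(i)(n) = (n + i·T) × x(i)(n) = n × x(i)(n) + i·T × x(i)(n)
so, by linearity of the transform, the spectrum of the globally built auxiliary signal in frame i equals the spectrum of the traditional per-frame auxiliary signal n·x(i)(n) plus i·T times the spectrum of x(i)(n); subtracting t × T × X(f, t), with the frame index t playing the role of i, therefore recovers the per-frame quantity.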
It can be seen that the auxiliary speech signals y(i)(n) obtained by the two framing approaches have identical values in the first frame (where i = 0) but can differ in the subsequent frames, as shown in Fig. 2. The upper part of Fig. 2 shows the waveform of the original speech signal, and the lower part shows a comparison between the per-frame waveforms 210 obtained by the processing according to the present invention and the per-frame waveforms 220 obtained by the processing of the traditional MGD method.
Because the processing of the speech signal processing method of the embodiment of the present invention differs in this way, a spectrum correction is needed. Specifically, a correction can be calculated based on the first transformed spectrum data obtained from the original speech signal and the second transformed spectrum data obtained from the auxiliary speech signal, thereby obtaining the transformed-spectrum correction data of the original speech signal.
For example, for each first transformed spectrum data X(f, t) and the corresponding second transformed spectrum data Y(f, t), the transformed-spectrum correction data Y'(f, t) can be calculated by the following formula:
Y'(f, t) = Y(f, t) − t × T × X(f, t)
where f is the frequency index, t is the time (frame) index, and T is the frame shift, i.e., the interval between the start times of two adjacent frames.
The difference between Y'(f, t) and Y(f, t) is that Y(f, t) is computed by building the auxiliary speech signal n·x(n) from the whole signal, dividing it into frames and then computing the CQT spectrum, whereas Y'(f, t) is numerically identical to the result obtained by first dividing into frames, then applying n·x(n) to the local signal within each frame, and then computing the spectrum.
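Continuing the earlier sketch, the correction can be applied column by column over the frames (it is assumed here that the frame shift T equals the CQT hop length in samples):

```python
def correct_spectrum(X, Y, hop_length):
    """Y'(f, t) = Y(f, t) - t * T * X(f, t) for every frequency bin f and frame index t."""
    t = np.arange(X.shape[1])                      # time (frame) index t
    return Y - t[np.newaxis, :] * hop_length * X   # broadcast t*T across all frequency bins
```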
In step S130, the group delay feature data of the original speech signal is determined based on the first transformed spectrum data and the transformed-spectrum correction data.
For example, the CQT-based first transformed spectrum data and transformed-spectrum correction data can be used in place of the FFT spectra in the traditional modified group delay (MGD) method to calculate the group delay feature data of the original speech signal.
Specifically, the group delay feature data of the original speech signal is calculated by the following formulas:
τm(f, t) = (XR(f, t) × Y'R(f, t) + XI(f, t) × Y'I(f, t)) / |S(f, t)|^(2γ)
τx(f, t) = sign(τm(f, t)) × |τm(f, t)|^α
where X(f, t) is the first transformed spectrum data of the original speech signal x(n), Y'(f, t) is the transformed-spectrum correction data obtained by correcting the second transformed spectrum data of the auxiliary speech signal n·x(n), XR(f, t) and XI(f, t) are the real and imaginary parts of X(f, t), Y'R(f, t) and Y'I(f, t) are the real and imaginary parts of Y'(f, t), S(f, t) is the cepstrally smoothed first transformed spectrum data obtained from X(f, t) by cepstral smoothing, and α and γ are hyperparameters to be tuned when extracting the features.
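A sketch of this step under the formulas above; the cepstral smoothing implementation (low-quefrency liftering of log|X|) and the values of alpha, gamma and the lifter length are illustrative assumptions rather than values taken from the original:

```python
def cepstral_smooth(X, lifter=30):
    """Cepstrally smooth |X(f, t)| by keeping only the low-quefrency part of log|X| along frequency."""
    log_mag = np.log(np.abs(X) + 1e-10)
    cep = np.fft.rfft(log_mag, axis=0)             # cepstrum along the frequency axis
    cep[lifter:, :] = 0.0                          # low-quefrency liftering
    return np.exp(np.fft.irfft(cep, n=X.shape[0], axis=0))

def mgd_features(X, Y_corr, alpha=0.4, gamma=0.9, lifter=30):
    """CQT-based modified group delay features tau_x(f, t)."""
    S = cepstral_smooth(X, lifter)
    tau_m = (X.real * Y_corr.real + X.imag * Y_corr.imag) / (np.abs(S) ** (2 * gamma) + 1e-10)
    return np.sign(tau_m) * np.abs(tau_m) ** alpha
```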
Through the foregoing processing, for an original speech signal, time-frequency analysis is organically combined with the MGD method, and an MGD spectrum based on the time-frequency analysis is extracted for replay detection. The extracted speech features thus retain both the dynamic resolution and the logarithmic frequency axis of the time-frequency analysis and fuse amplitude information and phase information, and therefore give better detection performance.
According to an optional embodiment of the present invention, the speech signal processing method further includes: according to the group delay feature data, modeling the group delay feature data of genuine speech and the group delay feature data of replayed speech with a model, and carrying out replay attack detection.
Since the spectrum of the speech signal obtained by the processing of steps S110 to S130 retains both the dynamic resolution and the logarithmic frequency axis of the time-frequency analysis and fuses amplitude information and phase information, using the group delay feature data to model genuine speech and replayed speech allows replay attack detection to be carried out more accurately.
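The present embodiments do not specify the model used for detection; as one common choice, a pair of Gaussian mixture models (one for genuine speech, one for replayed speech) scored by a log-likelihood ratio could look like the following sketch (scikit-learn is an assumption; the feature matrices are frame-level MGD vectors, e.g. mgd_features(...).T):

```python
from sklearn.mixture import GaussianMixture

def train_detector(genuine_feats, replay_feats, n_components=64):
    """Fit one GMM per class on frame-level group delay feature vectors (rows = frames)."""
    gmm_genuine = GaussianMixture(n_components=n_components, covariance_type='diag').fit(genuine_feats)
    gmm_replay = GaussianMixture(n_components=n_components, covariance_type='diag').fit(replay_feats)
    return gmm_genuine, gmm_replay

def replay_score(utterance_feats, gmm_genuine, gmm_replay):
    """Positive score suggests genuine speech; negative suggests a replay attack."""
    return gmm_genuine.score(utterance_feats) - gmm_replay.score(utterance_feats)
```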
The embodiment of the present invention also provides a computer-readable storage medium storing instructions for executing the steps of any of the foregoing speech signal processing methods.
In addition, the embodiment of the present invention also provides a computer program product comprising at least one executable instruction which, when executed by a processor, implements any of the foregoing speech signal processing methods.
The embodiment of the invention also provides an electronic device. Fig. 3 is a structural schematic diagram of an electronic device 300 according to an embodiment of the present invention. The electronic device 300 may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Fig. 3 shows the structure of an electronic device 300 suitable for implementing the embodiments of the present invention. As shown in Fig. 3, the electronic device 300 may include a memory and a processor. Specifically, the electronic device 300 includes one or more processors, a communication element, and the like. The one or more processors are, for example, one or more central processing units (CPU) 301 and/or one or more graphics processors (GPU) 313; a processor can perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 302 or executable instructions loaded from a storage section 308 into a random access memory (RAM) 303. The communication element includes a communication component 312 and/or a communication interface 309. The communication component 312 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 309 includes the communication interface of a network card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.
The processor can communicate with the read-only memory 302 and/or the random access memory 303 to execute the executable instructions, is connected to the communication component 312 through a communication bus 304, and communicates with other target devices through the communication component 312, thereby completing operations corresponding to any speech signal processing method provided by the embodiments of the present invention, for example: obtaining, according to a time-frequency analysis method, first transformed spectrum data of an original speech signal and second transformed spectrum data of an auxiliary speech signal derived from the original speech signal; obtaining transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data; and determining group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data.
In addition, the RAM 303 may also store various programs and data required for the operation of the device. The CPU 301 or GPU 313, the ROM 302 and the RAM 303 are connected to one another through the communication bus 304. When the RAM 303 is present, the ROM 302 is an optional module. The RAM 303 stores executable instructions, or executable instructions are written into the ROM 302 at runtime, and the executable instructions cause the processor to perform the operations corresponding to the above method. An input/output (I/O) interface 305 is also connected to the communication bus 304. The communication component 312 may be integrated, or may be provided with multiple sub-modules (for example, multiple IB network cards) linked to the communication bus.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), a loudspeaker and the like; a storage section 308 including a hard disk and the like; and the communication interface 309 of a network card including a LAN card, a modem and the like. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read from it can be installed into the storage section 308 as needed.
It should be noted that the architecture shown in Fig. 3 is only one optional implementation. In practice, the number and types of the components in Fig. 3 may be selected, deleted, added or replaced according to actual needs; different functional components may also be arranged separately or integrated, for example the GPU and the CPU may be arranged separately or the GPU may be integrated on the CPU, and the communication element may be arranged separately or integrated on the CPU or GPU, and so on. These alternative implementations all fall within the protection scope of the present disclosure.
In particular, according to an embodiment of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present invention includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program including program code for executing the method shown in the flowchart. The program code may include instructions corresponding to the method steps provided by the embodiments of the present invention, for example: executable code for obtaining, according to a time-frequency analysis method, first transformed spectrum data of an original speech signal and second transformed spectrum data of an auxiliary speech signal derived from the original speech signal; executable code for obtaining transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data; and executable code for determining group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data.
In such an embodiment, the computer program can be downloaded and installed from a network through the communication element, and/or installed from the removable medium 311. When the computer program is executed by the central processing unit (CPU) 301, the functions defined in the method of the embodiment of the present invention described above are performed.
The electronic device of the embodiment of the present invention can be used to implement the corresponding speech signal processing method in the above embodiments, and each unit in the electronic device can be used to perform each step in the above method embodiments. For example, the speech signal processing method outlined above can be implemented by the processor of the electronic device calling the related instructions stored in the memory; for the sake of brevity, the details are not repeated here.
It should be noted that, according to the needs of implementation, each component or step described in this application can be split into more components or steps, and two or more components or steps, or partial operations of components or steps, can be combined into new components or steps, in order to achieve the purpose of the embodiments of the present invention.
The disclosed methods and apparatuses, electronic devices and storage media may be implemented in many ways. For example, the methods and apparatuses, electronic devices and storage media of the embodiments of the present invention may be implemented by software, hardware, firmware, or any combination of software, hardware and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the embodiments of the present invention are not limited to the order described above unless otherwise specified. In addition, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, these programs including machine-readable instructions for implementing the methods according to the embodiments of the present invention. Thus, the present disclosure also covers a recording medium storing a program for executing the methods according to the embodiments of the present invention.
The description of the embodiments of the present invention has been given for the purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the forms disclosed; many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical applications of the disclosure, and to enable those skilled in the art to understand that the disclosure can be designed into various embodiments with various modifications suited to particular applications.
Claims (8)
1. A speech signal processing method, comprising:
obtaining, according to a time-frequency analysis method, first transformed spectrum data of an original speech signal and second transformed spectrum data of an auxiliary speech signal derived from the original speech signal;
obtaining transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data; and
determining group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data.
2. The method according to claim 1, wherein the obtaining, according to the time-frequency analysis method, of the first transformed spectrum data of the original speech signal and the second transformed spectrum data of the auxiliary speech signal derived from the original speech signal comprises:
obtaining the first transformed spectrum data of the original speech signal by a constant-Q transform (CQT) method; and
obtaining the auxiliary speech signal of the original speech signal, and obtaining the second transformed spectrum data of the auxiliary speech signal by the constant-Q transform (CQT) method.
3. The method according to claim 2, wherein the obtaining of the transformed-spectrum correction data of the original speech signal based on the first transformed spectrum data and the second transformed spectrum data comprises:
for the first transformed spectrum data X(f, t) and the corresponding second transformed spectrum data Y(f, t), calculating the transformed-spectrum correction data Y'(f, t) by the following formula:
Y'(f, t) = Y(f, t) − t × T × X(f, t)
where f is the frequency index, t is the time index, and T is the interval between the start times of two adjacent frames.
4. The method according to claim 3, wherein the determining of the group delay feature data of the original speech signal based on the first transformed spectrum data and the transformed-spectrum correction data comprises:
calculating the group delay feature spectrum data τx(f, t) of the original speech signal by the following formulas:
τm(f, t) = (XR(f, t) × Y'R(f, t) + XI(f, t) × Y'I(f, t)) / |S(f, t)|^(2γ)
τx(f, t) = sign(τm(f, t)) × |τm(f, t)|^α
where X(f, t) is the first transformed spectrum data of the original speech signal x(n), Y'(f, t) is the transformed-spectrum correction data obtained by correcting the second transformed spectrum data of the auxiliary speech signal n·x(n), XR(f, t) and XI(f, t) are the real and imaginary parts of X(f, t), Y'R(f, t) and Y'I(f, t) are the real and imaginary parts of Y'(f, t), S(f, t) is the cepstrally smoothed first transformed spectrum data obtained from X(f, t) by cepstral smoothing, and α and γ are hyperparameters to be tuned when extracting the features.
5. The method according to any one of claims 1 to 4, further comprising:
according to the group delay feature data, modeling the group delay feature data of genuine speech and the group delay feature data of replayed speech with a model, and carrying out replay attack detection.
6. A computer-readable storage medium on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of the speech signal processing method according to any one of claims 1 to 5.
7. An electronic device, comprising: a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element communicate with one another through the communication bus; and
the memory is used to store at least one executable instruction, the executable instruction causing the processor to perform operations corresponding to the speech signal processing method according to any one of claims 1 to 5.
8. A computer program comprising computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of the speech signal processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910674339.2A CN110415722B (en) | 2019-07-25 | 2019-07-25 | Speech signal processing method, storage medium, computer program, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910674339.2A CN110415722B (en) | 2019-07-25 | 2019-07-25 | Speech signal processing method, storage medium, computer program, and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110415722A (en) | 2019-11-05 |
CN110415722B CN110415722B (en) | 2021-10-08 |
Family
ID=68362974
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910674339.2A Active CN110415722B (en) | 2019-07-25 | 2019-07-25 | Speech signal processing method, storage medium, computer program, and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110415722B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250857A (en) * | 2016-08-04 | 2016-12-21 | 深圳先进技术研究院 | A kind of identity recognition device and method |
CN107924686A (en) * | 2015-09-16 | 2018-04-17 | 株式会社东芝 | Voice processing apparatus, method of speech processing and voice processing program |
CN109243487A (en) * | 2018-11-30 | 2019-01-18 | 宁波大学 | A kind of voice playback detection method normalizing normal Q cepstrum feature |
CN109389992A (en) * | 2018-10-18 | 2019-02-26 | 天津大学 | A kind of speech-emotion recognition method based on amplitude and phase information |
- 2019-07-25: Application CN201910674339.2A filed in China; subsequently granted as CN110415722B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107924686A (en) * | 2015-09-16 | 2018-04-17 | 株式会社东芝 | Voice processing apparatus, method of speech processing and voice processing program |
CN106250857A (en) * | 2016-08-04 | 2016-12-21 | 深圳先进技术研究院 | A kind of identity recognition device and method |
CN109389992A (en) * | 2018-10-18 | 2019-02-26 | 天津大学 | A kind of speech-emotion recognition method based on amplitude and phase information |
CN109243487A (en) * | 2018-11-30 | 2019-01-18 | 宁波大学 | A kind of voice playback detection method normalizing normal Q cepstrum feature |
Non-Patent Citations (4)
Title |
---|
H. A. Patil: "A survey on replay attack detection for automatic speaker verification system", APSIPA *
Xiaohai Tian: "Detecting synthetic speech using long term magnitude and phase information", ChinaSIP *
Zhu Chunlei: "Research on optimized adaptive non-parallel training voice conversion algorithms", China Master's Theses Full-text Database, Information Science and Technology *
Cai Chao: "Research and application of automatic language identification", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111402856A (en) * | 2020-03-23 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Voice processing method and device, readable medium and electronic equipment |
CN111402856B (en) * | 2020-03-23 | 2023-04-14 | 北京字节跳动网络技术有限公司 | Voice processing method and device, readable medium and electronic equipment |
WO2022052965A1 (en) * | 2020-09-10 | 2022-03-17 | 达闼机器人有限公司 | Voice replay attack detection method, apparatus, medium, device and program product |
CN114639387A (en) * | 2022-03-07 | 2022-06-17 | 哈尔滨理工大学 | Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram |
Also Published As
Publication number | Publication date |
---|---|
CN110415722B (en) | 2021-10-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CB03 | Change of inventor or designer information | Inventors after: Zheng Fang; Xu Mingxing; Jin Panshi; Cheng Xingliang; Yang Jie. Inventors before: Zheng Fang; Xu Mingxing; Cheng Xingliang |