CN110415722A - Audio signal processing method, storage medium, computer program and electronic equipment - Google Patents

Audio signal processing method, storage medium, computer program and electronic equipment Download PDF

Info

Publication number
CN110415722A
CN110415722A
Authority
CN
China
Prior art keywords
frequency spectrum
transform spectrum data
speech signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910674339.2A
Other languages
Chinese (zh)
Other versions
CN110415722B (en)
Inventor
郑方 (Zheng Fang)
徐明星 (Xu Mingxing)
程星亮 (Cheng Xingliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING D-EAR TECHNOLOGIES Co Ltd
Original Assignee
BEIJING D-EAR TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING D-EAR TECHNOLOGIES Co Ltd filed Critical BEIJING D-EAR TECHNOLOGIES Co Ltd
Priority to CN201910674339.2A priority Critical patent/CN110415722B/en
Publication of CN110415722A publication Critical patent/CN110415722A/en
Application granted granted Critical
Publication of CN110415722B publication Critical patent/CN110415722B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present invention provide a speech signal processing method, a computer-readable storage medium, a computer program, and an electronic device. The speech signal processing method includes: obtaining, according to a time-frequency analysis method, first transform spectrum data of an original speech signal and second transform spectrum data of an auxiliary speech signal derived from the original speech signal; obtaining transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data; and determining group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data. For an original speech signal, the time-frequency analysis method is organically combined with the modified group delay (MGD) method, so that the extracted speech features retain both the dynamic resolution of the time-frequency analysis and its logarithmic frequency axis while fusing magnitude information and phase information, which yields better detection performance.

Description

Audio signal processing method, storage medium, computer program and electronic equipment
Technical field
Embodiments of the present invention relate to speech processing technology, and in particular to a speech signal processing method, a computer-readable storage medium, a computer program, and an electronic device.
Background art
Speaker verification refers to verifying a person's identity from his or her voice. The technology is widely used in many fields; however, it is vulnerable to malicious attacks. The main attack techniques currently fall into four categories: impersonation attacks, voice conversion attacks, speech synthesis attacks, and replay attacks. In an impersonation attack, the attacker imitates the target speaker's voice in an attempt to enter the system. In a voice conversion attack, the attacker uses a computer algorithm to convert his or her own voice into a voice similar to the target speaker's and then attacks. In a speech synthesis attack, the system is attacked directly with a computer-synthesized voice of the target speaker. In a replay attack, the attacker records the target speaker's speech in advance and then plays the recording back through a reproduction device (such as a loudspeaker) to attack the system.
Among these four attacks, the replay attack is very simple to carry out and requires no professional knowledge. Meanwhile, a large body of literature shows that replay attacks have a considerable impact on the security of speaker verification, making them an urgent problem to be solved.
In order to detect replay attacks accurately and efficiently from a captured speech signal, the speech signal needs to be analyzed to extract the feature data that best reflects replay.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a speech signal processing scheme that extracts spectral magnitude feature information and phase feature information from a speech signal, so that speech-related detection can be carried out accurately.
According to a first aspect of the embodiments of the present invention, a speech signal processing method is provided, including: obtaining, according to a time-frequency analysis method, first transform spectrum data of an original speech signal and second transform spectrum data of an auxiliary speech signal derived from the original speech signal; obtaining transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data; and determining group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data.
Optionally, obtaining, according to the time-frequency analysis method, the first transform spectrum data of the original speech signal and the second transform spectrum data of the auxiliary speech signal derived from the original speech signal includes: obtaining the first transform spectrum data of the original speech signal by a constant Q transform (CQT) method; and obtaining the auxiliary speech signal of the original speech signal and obtaining the second transform spectrum data of the auxiliary speech signal by the constant Q transform (CQT) method.
Optionally, obtaining the transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data includes:
for the first transform spectrum data X(f, t) and the corresponding second transform spectrum data Y(f, t), calculating the transform spectrum correction data Y′(f, t) by the following formula:
Y′(f, t) = Y(f, t) - t × T × X(f, t)
where f is the frequency index, t is the time index, and T is the interval between the start times of two adjacent frames.
Optionally, determining the group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data includes:
calculating group delay feature spectrum data τx(f, t) of the original speech signal by the following formula:
τx(f, t) = sign(τ(f, t)) × |τ(f, t)|^α, with τ(f, t) = (XR(f, t) × Y′R(f, t) + XI(f, t) × Y′I(f, t)) / |S(f, t)|^(2γ)
where X(f, t) is the first transform spectrum data of the original speech signal x(n), Y′(f, t) is the transform spectrum correction data obtained by correcting the second transform spectrum data of the auxiliary speech signal nx(n), XR(f, t) and XI(f, t) are the real and imaginary parts of X(f, t), Y′R(f, t) and Y′I(f, t) are the real and imaginary parts of Y′(f, t), S(f, t) is the cepstrally smoothed first transform spectrum data obtained by cepstrally smoothing X(f, t), and α and γ are hyper-parameters to be tuned when extracting the features.
Optionally, the method further includes: according to the group delay feature data, modeling the group delay feature data of normal speech and the group delay feature data of replayed speech with a model, and performing replay attack detection.
According to a second aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which computer program instructions are stored, where the program instructions, when executed by a processor, implement the steps of any of the foregoing speech signal processing methods.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, including: a processor, a memory, a communication element and a communication bus, the processor, the memory and the communication element communicating with one another through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to any of the foregoing speech signal processing methods.
According to a fourth aspect of the embodiments of the present invention, a computer program including computer program instructions is provided, where the program instructions, when executed by a processor, implement the steps of any of the foregoing speech signal processing methods.
Through the processing of the foregoing speech signal processing method, for an original speech signal, the time-frequency analysis method is organically combined with the modified group delay (MGD) method, and an MGD spectrum based on the time-frequency analysis is extracted for replay detection. The speech features obtained in this way retain both the dynamic resolution of the time-frequency analysis and its logarithmic frequency axis while fusing magnitude information and phase information, and therefore give better detection performance.
Furthermore, since the spectrum of the speech signal obtained through the foregoing processing retains both the dynamic resolution of the time-frequency analysis and its logarithmic frequency axis and fuses magnitude information and phase information, using the group delay feature data to model normal speech and replayed speech allows replay attack detection to be carried out more accurately.
Detailed description of the invention
Fig. 1 is a flowchart showing a speech signal processing method according to some embodiments of the present invention;
Fig. 2 is a waveform diagram of speech signals comparing the auxiliary speech signal processing according to an embodiment of the present invention with traditional auxiliary speech signal processing;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Specific embodiment
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In this application, "multiple" means two or more, and "at least one" means one, two or more. Any component, data or structure mentioned in this application should, unless explicitly limited to one, be understood as one or more.
Fig. 1 is a flowchart showing a speech signal processing method according to some embodiments of the present invention.
Referring to Fig. 1, in step S110, first transform spectrum data of an original speech signal and second transform spectrum data of an auxiliary speech signal derived from the original speech signal are obtained respectively according to a time-frequency analysis method.
Here, the original speech signal is, for example, a speech signal of a certain duration captured by a speech acquisition device. In this step, a time-frequency analysis method such as a Fourier transform method including the short-time Fourier transform (STFT), a wavelet transform method, or a constant Q transform (CQT) method can be used to obtain the first transform spectrum data of the original speech signal.
The constant Q transform (Constant Q Transform, CQT) is an important time-frequency analysis tool that is widely used in music analysis. Recent studies suggest that CQT also works well in the field of replay detection. CQT is very similar to the short-time Fourier transform (Short Time Fourier Transform, STFT); the difference is that the frequency axis of the spectrum generated by CQT is logarithmic, whereas the frequency axis generated by STFT is linear. In addition, when CQT extracts components of different frequencies, the length of the analysis window also changes: the window is long for low-frequency components and short for high-frequency components. Therefore, compared with STFT, CQT has better frequency resolution in the low-frequency part and better time resolution in the high-frequency part. Compared with spectra generated by Fourier-transform-based methods (such as STFT or FFT), CQT is better suited to the replay detection task.
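For illustration only (this is not part of the patented method), the CQT spectrum of a speech signal can be computed with an off-the-shelf tool such as librosa; the file name, sampling rate, hop length, number of bins and bins per octave below are assumptions chosen for the sketch:

```python
# Minimal sketch: computing a complex CQT spectrum for a speech signal.
# librosa and every parameter value here are illustrative assumptions,
# not values taken from the patent.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
# Complex CQT: log-scaled frequency axis; analysis windows are longer for
# low-frequency bins and shorter for high-frequency bins.
X = librosa.cqt(y, sr=sr, hop_length=256,
                fmin=librosa.note_to_hz("C1"),
                n_bins=96, bins_per_octave=12)
print(X.shape)   # (n_bins, n_frames), complex-valued
```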
On the other hand, a plain CQT spectrum can usually only make use of the magnitude information of the spectrum, not the phase information, and therefore cannot make full use of the spectral characteristics of the speech signal for comprehensive and accurate speech feature extraction. For this reason, the group delay method, which reflects the phase characteristics of a speech signal, is considered for extracting phase features.
The group delay function (Group Delay function, GD) is an important concept in signal processing. It represents the time delay of the amplitude envelope of each sinusoidal component of a signal as it passes through the measured device, and is defined as the negative derivative of phase with respect to frequency:
τ(ω) = -dθ(ω)/dω
where ω denotes frequency and θ(ω) denotes the phase at frequency ω. The group delay can also be computed directly from the spectrum, as follows:
τx(ω) = (XR(ω) × YR(ω) + XI(ω) × YI(ω)) / |X(ω)|^2
where X(ω) is the spectrum of the signal x(n), Y(ω) is the spectrum of the auxiliary signal nx(n), and XR(ω) and XI(ω) are the real and imaginary parts of X(ω), respectively. The group delay avoids the phase wrapping problem and is therefore a popular representation of signal phase.
The modified group delay function (Modified Group Delay function, MGD) improves on the traditional group delay method: the spectral energy is smoothed to avoid the infinities caused by points of very low energy, and hyper-parameters are added to control its dynamic range, which facilitates model training. The MGD extraction formulas are as follows:
τ(ω) = (XR(ω) × YR(ω) + XI(ω) × YI(ω)) / |S(ω)|^(2γ)
MGD(ω) = sign(τ(ω)) × |τ(ω)|^α
where S(ω) is obtained by cepstral smoothing of X(ω), sign(·) denotes the sign of τ(ω), and α and γ are the two hyper-parameters of MGD that need to be tuned for different tasks.
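A minimal NumPy sketch of this classical, FFT-based MGD computation for a single frame is shown below; the FFT size, cepstral lifter length and the values of α and γ are assumptions made for the example:

```python
# Sketch of the traditional modified group delay (MGD) for one frame.
# n_fft, lifter, alpha and gamma are illustrative assumptions.
import numpy as np

def mgd_frame(x, alpha=0.4, gamma=0.9, n_fft=512, lifter=30):
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # spectrum of the frame x(n)
    Y = np.fft.rfft(n * x, n_fft)    # spectrum of the auxiliary signal n*x(n)

    # Cepstrally smoothed magnitude spectrum S(w): log-magnitude -> cepstrum,
    # keep only low-quefrency coefficients, transform back and exponentiate.
    log_mag = np.log(np.abs(X) + 1e-10)
    cep = np.fft.irfft(log_mag)
    cep[lifter:-lifter] = 0.0
    S = np.exp(np.fft.rfft(cep).real)

    # Smoothed group delay, then sign-preserving compression by alpha.
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha
```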
The MGD spectrum uses the FFT as its standard transform, so its resolution is the same in every frequency band. Studies show that, compared with FFT, CQT is better suited to the replay detection task. However, a plain CQT spectrum can generally only use magnitude information, not phase information. MGD, on the other hand, is a technique that can fuse phase information and magnitude information at the same time.
Therefore, according to a preferred embodiment of the present invention, in the processing of the speech signal, the CQT method is combined with the MGD method, and an MGD spectrum based on CQT is extracted for replay detection.
Accordingly, according to an optional embodiment of the present invention, in step S110 the first transform spectrum data of the original speech signal is obtained by the constant Q transform (CQT) method; the auxiliary speech signal of the original speech signal is then obtained, and the second transform spectrum data of the auxiliary speech signal is obtained by the CQT method.
Here, the CQT spectrum data of the original speech signal, referred to here as the first transform spectrum data, can be calculated by any applicable CQT method. In addition, similarly to the processing in the group delay algorithm, an auxiliary speech signal is first generated from the original speech signal, so as to correct the CQT spectrum of the original speech signal. For example, for the original speech signal, its auxiliary speech signal is generated; CQT processing is then performed on the auxiliary speech signal to obtain its CQT spectrum data, referred to here as the second transform spectrum data. In this way, the first transform spectrum data and the second transform spectrum data that are subsequently used in the group delay feature calculation are obtained.
In step S120, the transform spectrum correction data of the original speech signal is obtained based on the first transform spectrum data and the second transform spectrum data.
For a speech signal that has been divided into frames, the relationship between the i-th frame signal x(i)(n) and the complete signal x(n) can be expressed as follows:
x(i)(n) = x(n + i*T)
where T is the frame shift, i.e. the interval between the start times of two adjacent frames.
In the traditional MGD method, the original speech signal is first divided into frames, and the aforementioned auxiliary speech signal is then computed for each frame:
y(i)(n) = n·x(i)(n)
The spectrum Y is then computed for the auxiliary speech signal y(i)(n) of each frame.
In contrast, in the speech signal processing method proposed by the embodiment of the present invention, the auxiliary speech signal nx(n) is first computed for the whole original speech signal x(n), and the auxiliary speech signal nx(n) is then divided into frames, in which case
y(i)(n) = (n + i*T)·x(n + i*T) = (n + i*T)·x(i)(n)
The CQT spectrum of the auxiliary speech signal nx(n) is then computed.
It can be seen that the values of the auxiliary speech signals y(i)(n) obtained by the two framing procedures are identical in the first frame (where i = 0) and may differ in subsequent frames, as shown in Fig. 2. The upper part of Fig. 2 shows the waveform of the original speech signal, and the lower part shows a comparison of the per-frame waveform 210 obtained by the processing according to the present invention with the per-frame waveform 220 obtained by the traditional MGD processing.
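The difference between the two framing orders can be checked numerically with a few lines of NumPy (the frame shift and frame length below are arbitrary example values):

```python
# Illustrative check: multiplying the whole signal by its sample index n and
# then framing differs from framing first and multiplying each frame by the
# frame-local index, except in the very first frame (i = 0).
import numpy as np

x = np.random.randn(1000)
T, frame_len = 160, 320                      # hypothetical frame shift and length (samples)
aux_global = np.arange(len(x)) * x           # n*x(n) computed on the whole signal

for i in range(3):
    seg = slice(i * T, i * T + frame_len)
    frame_global = aux_global[seg]                    # (n + i*T) * x_i(n)
    frame_local = np.arange(frame_len) * x[seg]       # traditional MGD: n * x_i(n)
    print(i, np.allclose(frame_global, frame_local))  # True only for i == 0
```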
Because the processing in the speech signal processing method of the embodiment of the present invention differs in this way, a spectrum correction is required. Specifically, the first transform spectrum data obtained from the original speech signal can be used to correct the second transform spectrum data obtained from the auxiliary speech signal, thereby obtaining the transform spectrum correction data of the original speech signal.
For example, for each first transform spectrum data X(f, t) and the corresponding second transform spectrum data Y(f, t), the transform spectrum correction data Y′(f, t) can be calculated by the following formula:
Y′(f, t) = Y(f, t) - t × T × X(f, t)
where f is the frequency index, t is the time index, and T is the frame shift, i.e. the interval between the start times of two adjacent frames.
The difference between Y′(f, t) and Y(f, t) is as follows: when computing Y(f, t), the auxiliary speech signal nx(n) is computed from the whole signal, nx(n) is then divided into frames, and the CQT spectrum is computed; Y′(f, t), on the other hand, is numerically identical to the result obtained by framing first, forming nx(n) from the local signal within each frame, and then computing the spectrum.
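Under the assumption that X and Y are complex CQT spectra of matching shape (number of bins by number of frames) and that T equals the hop length in samples, the correction can be written as a single broadcasted operation, as in the sketch below:

```python
# Sketch of the spectrum correction Y'(f, t) = Y(f, t) - t*T*X(f, t).
# X, Y: complex arrays of shape (n_bins, n_frames); hop_length plays the
# role of T (assumed interval between adjacent frame start times).
import numpy as np

def correct_spectrum(X, Y, hop_length):
    t = np.arange(X.shape[1])                  # frame (time) index
    return Y - (t * hop_length)[None, :] * X   # broadcast over frequency bins
```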
In step S130, the group delay feature data of the original speech signal is determined based on the first transform spectrum data and the transform spectrum correction data.
For example, the CQT-based first transform spectrum data and transform spectrum correction data can replace the FFT spectra in the traditional modified group delay (MGD) method in order to compute the group delay feature data of the original speech signal.
Specifically, the group delay feature data of the original speech signal is calculated by the following formula:
τx(f, t) = sign(τ(f, t)) × |τ(f, t)|^α, with τ(f, t) = (XR(f, t) × Y′R(f, t) + XI(f, t) × Y′I(f, t)) / |S(f, t)|^(2γ)
where X(f, t) is the first transform spectrum data of the original speech signal x(n), Y′(f, t) is the transform spectrum correction data obtained by correcting the second transform spectrum data of the auxiliary speech signal nx(n), XR(f, t) and XI(f, t) are the real and imaginary parts of X(f, t), Y′R(f, t) and Y′I(f, t) are the real and imaginary parts of Y′(f, t), S(f, t) is the cepstrally smoothed first transform spectrum data obtained by cepstrally smoothing X(f, t), and α and γ are hyper-parameters to be tuned when extracting the features.
Through the foregoing processing, for the original speech signal, the time-frequency analysis method is organically combined with the MGD method, and an MGD spectrum based on the time-frequency analysis is extracted for replay detection. The speech features obtained in this way retain both the dynamic resolution of the time-frequency analysis and its logarithmic frequency axis while fusing magnitude information and phase information, and therefore give better detection performance.
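An end-to-end sketch of this CQT-based group delay feature is given below. It reuses the assumptions made earlier (librosa CQT, illustrative hop length and hyper-parameters); in particular, performing the cepstral smoothing along the CQT bin axis is an assumption, since the description does not spell out how S(f, t) is computed:

```python
# Sketch of the CQT-based MGD feature: CQT of x(n) and of n*x(n), spectrum
# correction, cepstral smoothing of |X|, then the MGD formula. All parameter
# values and the smoothing details are illustrative assumptions.
import numpy as np
import librosa

def cqt_mgd(x, sr=16000, hop=256, n_bins=96, alpha=0.4, gamma=0.9, lifter=30):
    n = np.arange(len(x))
    X = librosa.cqt(x, sr=sr, hop_length=hop, n_bins=n_bins)      # first transform spectrum data
    Y = librosa.cqt(n * x, sr=sr, hop_length=hop, n_bins=n_bins)  # second transform spectrum data

    # Correction for generating n*x(n) on the whole signal rather than per frame
    # (frame-start offsets approximated by t*hop; windowing effects ignored).
    t = np.arange(X.shape[1])
    Yc = Y - (t * hop)[None, :] * X

    # Cepstrally smoothed magnitude of X along the (log-)frequency axis.
    log_mag = np.log(np.abs(X) + 1e-10)
    cep = np.fft.irfft(log_mag, axis=0)
    cep[lifter:-lifter, :] = 0.0
    S = np.exp(np.fft.rfft(cep, axis=0).real)

    tau = (X.real * Yc.real + X.imag * Yc.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha    # shape (n_bins, n_frames)
```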
According to an optional embodiment of the present invention, the speech signal processing method further includes: according to the group delay feature data, modeling the group delay feature data of normal speech and the group delay feature data of replayed speech with a model, and performing replay attack detection.
Since the spectrum of the speech signal obtained through the processing of steps S110 to S130 retains both the dynamic resolution of the time-frequency analysis and its logarithmic frequency axis and fuses magnitude information and phase information, using this group delay feature data to model normal speech and replayed speech allows replay attack detection to be carried out more accurately.
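The description leaves the choice of model open. One common, assumed choice for this kind of detection is a pair of Gaussian mixture models (GMMs), one trained on the group delay features of genuine speech and one on those of replayed speech, with the log-likelihood ratio used as the score; a scikit-learn based sketch under that assumption follows:

```python
# Hedged sketch of the detection step with two GMMs; the patent only says a
# "model" is used, so GMMs, scikit-learn and all parameters are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(genuine_feats, replay_feats, n_components=64):
    # Each argument: array of shape (total_frames, n_bins), frames stacked over utterances.
    gmm_genuine = GaussianMixture(n_components=n_components, covariance_type="diag").fit(genuine_feats)
    gmm_replay = GaussianMixture(n_components=n_components, covariance_type="diag").fit(replay_feats)
    return gmm_genuine, gmm_replay

def score_utterance(feats, gmm_genuine, gmm_replay):
    # Positive score suggests genuine speech, negative suggests a replayed recording.
    return gmm_genuine.score(feats) - gmm_replay.score(feats)
```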
An embodiment of the present invention also provides a computer-readable storage medium storing a program for executing the steps of any of the foregoing speech signal processing methods.
In addition, an embodiment of the present invention also provides a computer program product including at least one executable instruction which, when executed by a processor, implements any of the foregoing speech signal processing methods.
An embodiment of the present invention also provides an electronic device. Fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present invention. The electronic device 300 may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. As shown in Fig. 3, the electronic device 300 may include a memory and a processor. Specifically, the electronic device 300 includes one or more processors, a communication element, and the like; the one or more processors are, for example, one or more central processing units (CPUs) 301 and/or one or more graphics processors (GPUs) 313, and the processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 302 or loaded from a storage section 308 into a random access memory (RAM) 303. The communication element includes a communication component 312 and/or a communication interface 309. The communication component 312 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 309 includes the communication interface of a network card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.
The processor can communicate with the read-only memory 302 and/or the random access memory 303 to execute the executable instructions, is connected to the communication component 312 through the communication bus 304, and communicates with other target devices through the communication component 312, thereby completing the operations corresponding to any speech signal processing method provided by the embodiments of the present invention, for example: obtaining, according to a time-frequency analysis method, first transform spectrum data of an original speech signal and second transform spectrum data of an auxiliary speech signal derived from the original speech signal; obtaining transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data; and determining group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data.
In addition, the RAM 303 may also store various programs and data required for the operation of the device. The CPU 301 or GPU 313, the ROM 302 and the RAM 303 are connected to one another through the communication bus 304. When a RAM 303 is present, the ROM 302 is an optional module. The RAM 303 stores executable instructions, or executable instructions are written into the ROM 302 at run time, and the executable instructions cause the processor to perform the operations corresponding to the above-described method. An input/output (I/O) interface 305 is also connected to the communication bus 304. The communication component 312 may be integrated, or may be provided as multiple sub-modules (for example, multiple IB network cards) linked on the communication bus.
The I/O interface 305 is connected to the following components: an input section 306 including a keyboard, a mouse and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, and a loudspeaker and the like; a storage section 308 including a hard disk and the like; and a communication interface 309 of a network card including a LAN card, a modem and the like. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.
It should be noted that the architecture shown in Fig. 3 is only one optional implementation. In practice, the number and types of the components in Fig. 3 can be selected, deleted, added or replaced according to actual needs; for different functional components, separate or integrated arrangements and other implementations can also be used, for example the GPU and the CPU can be arranged separately or the GPU can be integrated on the CPU, and the communication element can be arranged separately or integrated on the CPU or GPU, and so on. These alternative implementations all fall within the scope of protection of the present disclosure.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present invention includes a computer program product including a computer program tangibly embodied on a machine-readable medium, the computer program including program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present invention, for example: executable code for obtaining, according to a time-frequency analysis method, first transform spectrum data of an original speech signal and second transform spectrum data of an auxiliary speech signal derived from the original speech signal; executable code for obtaining transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data; and executable code for determining group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data.
In such an embodiment, the computer program can be downloaded and installed from a network through the communication element, and/or installed from the removable medium 311. When the computer program is executed by the central processing unit (CPU) 301, the functions defined in the above-described method of the embodiment of the present invention are executed.
The electronic device of the embodiment of the present invention can be used to implement the corresponding speech signal processing method in the above embodiments, and each device in the electronic device can be used to execute each step in the above method embodiments. For example, the speech signal processing method outlined above can be implemented by the processor of the electronic device calling the relevant instructions stored in the memory; for the sake of brevity, details are not repeated here.
It may be noted that the components/steps described in this application can be split into more components/steps according to implementation needs, and two or more components/steps or partial operations of components/steps can also be combined into new components/steps to achieve the purpose of the embodiments of the present invention.
The disclosed methods and apparatuses, electronic devices and storage media may be implemented in many ways. For example, the methods and apparatuses, electronic devices and storage media of the embodiments of the present invention can be implemented by software, hardware, firmware or any combination of software, hardware and firmware. The above order of the steps of the method is only for illustration, and the steps of the method of the embodiments of the present invention are not limited to the order described above unless otherwise specifically stated. In addition, in some embodiments, the present disclosure can also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the embodiments of the present invention. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the embodiments of the present invention.
The description of the embodiments of the present invention has been given for the purposes of illustration and description and is not exhaustive or intended to limit the disclosure to the forms disclosed; many modifications and variations are obvious to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical applications of the disclosure, and to enable those skilled in the art to understand the disclosure so as to design various embodiments, with various modifications, suited to particular uses.

Claims (8)

1. A speech signal processing method, comprising:
obtaining, according to a time-frequency analysis method, first transform spectrum data of an original speech signal and second transform spectrum data of an auxiliary speech signal derived from the original speech signal;
obtaining transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data; and
determining group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data.
2. The method according to claim 1, wherein obtaining, according to the time-frequency analysis method, the first transform spectrum data of the original speech signal and the second transform spectrum data of the auxiliary speech signal derived from the original speech signal comprises:
obtaining the first transform spectrum data of the original speech signal by a constant Q transform (CQT) method; and
obtaining the auxiliary speech signal of the original speech signal, and obtaining the second transform spectrum data of the auxiliary speech signal by the constant Q transform (CQT) method.
3. The method according to claim 2, wherein obtaining the transform spectrum correction data of the original speech signal based on the first transform spectrum data and the second transform spectrum data comprises:
for the first transform spectrum data X(f, t) and the corresponding second transform spectrum data Y(f, t), calculating the transform spectrum correction data Y′(f, t) by the following formula:
Y′(f, t) = Y(f, t) - t × T × X(f, t)
where f is the frequency index, t is the time index, and T is the interval between the start times of two adjacent frames.
4. The method according to claim 3, wherein determining the group delay feature data of the original speech signal based on the first transform spectrum data and the transform spectrum correction data comprises:
calculating group delay feature spectrum data τx(f, t) of the original speech signal by the following formula:
τx(f, t) = sign(τ(f, t)) × |τ(f, t)|^α, with τ(f, t) = (XR(f, t) × Y′R(f, t) + XI(f, t) × Y′I(f, t)) / |S(f, t)|^(2γ)
where X(f, t) is the first transform spectrum data of the original speech signal x(n), Y′(f, t) is the transform spectrum correction data obtained by correcting the second transform spectrum data of the auxiliary speech signal nx(n), XR(f, t) and XI(f, t) are the real and imaginary parts of X(f, t), Y′R(f, t) and Y′I(f, t) are the real and imaginary parts of Y′(f, t), S(f, t) is the cepstrally smoothed first transform spectrum data obtained by cepstrally smoothing X(f, t), and α and γ are hyper-parameters to be tuned when extracting the features.
5. The method according to any one of claims 1 to 4, wherein the method further comprises:
according to the group delay feature data, modeling the group delay feature data of normal speech and the group delay feature data of replayed speech with a model, and performing replay attack detection.
6. A computer-readable storage medium on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of the speech signal processing method of any one of claims 1 to 5.
7. An electronic device, comprising: a processor, a memory, a communication element and a communication bus, the processor, the memory and the communication element communicating with one another through the communication bus;
the memory being configured to store at least one executable instruction, the executable instruction causing the processor to perform the operations corresponding to the speech signal processing method of any one of claims 1 to 5.
8. A computer program comprising computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of the speech signal processing method of any one of claims 1 to 5.
CN201910674339.2A 2019-07-25 2019-07-25 Speech signal processing method, storage medium, computer program, and electronic device Active CN110415722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910674339.2A CN110415722B (en) 2019-07-25 2019-07-25 Speech signal processing method, storage medium, computer program, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910674339.2A CN110415722B (en) 2019-07-25 2019-07-25 Speech signal processing method, storage medium, computer program, and electronic device

Publications (2)

Publication Number Publication Date
CN110415722A true CN110415722A (en) 2019-11-05
CN110415722B CN110415722B (en) 2021-10-08

Family

ID=68362974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910674339.2A Active CN110415722B (en) 2019-07-25 2019-07-25 Speech signal processing method, storage medium, computer program, and electronic device

Country Status (1)

Country Link
CN (1) CN110415722B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402856A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Voice processing method and device, readable medium and electronic equipment
WO2022052965A1 (en) * 2020-09-10 2022-03-17 达闼机器人有限公司 Voice replay attack detection method, apparatus, medium, device and program product
CN114639387A (en) * 2022-03-07 2022-06-17 哈尔滨理工大学 Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250857A (en) * 2016-08-04 2016-12-21 深圳先进技术研究院 A kind of identity recognition device and method
CN107924686A (en) * 2015-09-16 2018-04-17 株式会社东芝 Voice processing apparatus, method of speech processing and voice processing program
CN109243487A (en) * 2018-11-30 2019-01-18 宁波大学 A kind of voice playback detection method normalizing normal Q cepstrum feature
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 A kind of speech-emotion recognition method based on amplitude and phase information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107924686A (en) * 2015-09-16 2018-04-17 株式会社东芝 Voice processing apparatus, method of speech processing and voice processing program
CN106250857A (en) * 2016-08-04 2016-12-21 深圳先进技术研究院 A kind of identity recognition device and method
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 A kind of speech-emotion recognition method based on amplitude and phase information
CN109243487A (en) * 2018-11-30 2019-01-18 宁波大学 A kind of voice playback detection method normalizing normal Q cepstrum feature

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
H. A. Patil, "A survey on replay attack detection for automatic speaker verification system", APSIPA *
Xiaohai Tian, "Detecting synthetic speech using long term magnitude and phase information", ChinaSIP *
Zhu Chunlei, "Research on optimized adaptive non-parallel training voice conversion algorithms", China Masters' Theses Full-text Database, Information Science and Technology *
Cai Chao, "Research and application of automatic language identification", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402856A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Voice processing method and device, readable medium and electronic equipment
CN111402856B (en) * 2020-03-23 2023-04-14 北京字节跳动网络技术有限公司 Voice processing method and device, readable medium and electronic equipment
WO2022052965A1 (en) * 2020-09-10 2022-03-17 达闼机器人有限公司 Voice replay attack detection method, apparatus, medium, device and program product
CN114639387A (en) * 2022-03-07 2022-06-17 哈尔滨理工大学 Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram

Also Published As

Publication number Publication date
CN110415722B (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN106486131B (en) A kind of method and device of speech de-noising
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN103999076B (en) System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
CN110415722A (en) Audio signal processing method, storage medium, computer program and electronic equipment
WO2019210796A1 (en) Speech recognition method and apparatus, storage medium, and electronic device
CN110503971A (en) Time-frequency mask neural network based estimation and Wave beam forming for speech processes
AU2017404565A1 (en) Electronic device, method and system of identity verification and computer readable storage medium
CN109036436A (en) A kind of voice print database method for building up, method for recognizing sound-groove, apparatus and system
JP2019510248A (en) Voiceprint identification method, apparatus and background server
CN113436643B (en) Training and application method, device and equipment of voice enhancement model and storage medium
CN108694954A (en) A kind of Sex, Age recognition methods, device, equipment and readable storage medium storing program for executing
CN107833581A (en) A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
US20150046156A1 (en) System and Method for Anomaly Detection and Extraction
CN108922515A (en) Speech model training method, audio recognition method, device, equipment and medium
EP3404584A1 (en) Multi-view vector processing method and multi-view vector processing device
CN113314147B (en) Training method and device of audio processing model, audio processing method and device
CN113921022B (en) Audio signal separation method, device, storage medium and electronic equipment
CN110109058A (en) A kind of planar array deconvolution identification of sound source method
US11393443B2 (en) Apparatuses and methods for creating noise environment noisy data and eliminating noise
CN111402922B (en) Audio signal classification method, device, equipment and storage medium based on small samples
CN109584888A (en) Whistle recognition methods based on machine learning
Wu et al. Audio watermarking algorithm with a synchronization mechanism based on spectrum distribution
Tian et al. Spoofing detection under noisy conditions: a preliminary investigation and an initial database
CN105989837A (en) Audio matching method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zheng Fang

Inventor after: Xu Mingxing

Inventor after: Jin Panshi

Inventor after: Cheng Xingliang

Inventor after: Yang Jie

Inventor before: Zheng Fang

Inventor before: Xu Mingxing

Inventor before: Cheng Xingliang