CN110363148A - Method for face and voiceprint feature fusion verification - Google Patents

Method for face and voiceprint feature fusion verification

Info

Publication number
CN110363148A
CN110363148A (application CN201910641594.7A)
Authority
CN
China
Prior art keywords
frequency
image
face
voiceprint
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910641594.7A
Other languages
Chinese (zh)
Inventor
胡增
江大白
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Applied Technology Co Ltd
Original Assignee
China Applied Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Applied Technology Co Ltd filed Critical China Applied Technology Co Ltd
Priority to CN201910641594.7A priority Critical patent/CN110363148A/en
Publication of CN110363148A publication Critical patent/CN110363148A/en
Pending legal-status Critical Current

Classifications

    • G06F18/253 Fusion techniques of extracted features (pattern recognition)
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T3/60 Rotation of whole images or parts thereof
    • G06T5/10 Image enhancement or restoration using non-spatial domain filtering
    • G06V40/168 Feature extraction; face representation (human faces)
    • G06V40/172 Classification, e.g. identification (human faces)
    • G10L17/00 Speaker identification or verification techniques
    • G10L25/24 Speech or voice analysis characterised by the extracted parameters being the cepstrum
    • G10L25/45 Speech or voice analysis characterised by the type of analysis window
    • G06F2218/02 Pattern recognition for signal processing: preprocessing
    • G06F2218/08 Pattern recognition for signal processing: feature extraction
    • G06F2218/12 Pattern recognition for signal processing: classification; matching
    • G06T2207/20052 Transform domain processing: discrete cosine transform [DCT]
    • G06T2207/20056 Transform domain processing: discrete and fast Fourier transform [DFT, FFT]
    • G06T2207/30201 Subject of image: face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method for face and voiceprint feature fusion verification, comprising the following steps: parsing an input audio file into a time-domain signal of the sound; converting the time-domain signal into a frequency-domain signal by windowing, framing, and the short-time Fourier transform; converting the frequency scale, via a log spectrum, into a scale on which the human ear perceives frequency changes linearly; separating, by cepstrum analysis using a DCT, the DC component from the sinusoidal components of the converted frequency-domain signal; extracting a sound-spectrum feature vector and converting the vector into an image; and fusing the image with a two-dimensional face image. Because the fusion is performed at the low-level feature layer, the proposed method needs only a single verification and avoids the problem of application-layer joint verification, in which a false detection in either modality causes the whole verification to fail; the user experience is thereby improved.

Description

Method for face and voiceprint feature fusion verification
Technical field
The present invention relates to the field of biometric recognition, and in particular to a method for face and voiceprint feature fusion verification.
Background technique
In recent years, with the continuous development and maturation of deep learning and computer vision technology, a number of identity authentication technologies based on computer vision have developed rapidly within biometric recognition. Face recognition in particular, being contactless and fast, is widely used in all kinds of services that need to verify identity. Voiceprint recognition, as one kind of biometric recognition, identifies a person from the acoustic characteristics of the speaker's voice. The identification is independent of accent and of language, and can be used both for speaker identification and for speaker verification.
Although a variety of biometric technologies have been developed and gradually applied in everyday life and production, the way they are used remains conspicuously single-modal. A single verification mode always suffers from missed detections and false detections, so a current focus of research and application is to combine several biometric technologies to reach higher security and accuracy, for example face recognition plus voiceprint recognition. These methods, however, merely combine two recognition technologies at the application layer: face recognition plus voiceprint recognition simply means that voiceprint recognition is verified after face recognition has passed. They do not fuse the face features and the voiceprint features at the low-level feature layer, where a single verification would suffice; instead, a false detection in either modality causes the whole verification to fail, degrading the user experience.
Summary of the invention
In view of the defects of the prior art, the technical problem to be solved by the present invention is to provide a method for face and voiceprint feature fusion verification.
The technical solution adopted by the present invention for the above purpose is a method for face and voiceprint feature fusion verification, comprising the following steps:
parsing an input audio file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal by windowing, framing, and the short-time Fourier transform;
converting the frequency scale of the frequency-domain signal, via a log spectrum, into a scale on which the human ear perceives frequency changes linearly;
separating, by cepstrum analysis using a DCT, the DC component from the sinusoidal components of the converted frequency-domain signal;
extracting a sound-spectrum feature vector and converting the vector into an image;
fusing the image with a two-dimensional face image.
Converting the time-domain signal into a frequency-domain signal by windowing, framing, and the short-time Fourier transform specifically comprises:
selecting a time-frequency localized window function h(t) and computing, by the short-time Fourier transform, the power spectrum at each moment; the formula of the short-time Fourier transform is:
STFT(t, ω) = ∫ f(τ) h(τ - t) e^(-jωτ) dτ
where f(τ) is the time-domain signal of the input audio, τ is the integration variable, t is the moment in question, and ω is the angular frequency.
The window function is a Hamming window.
Converting, via a log spectrum, into a scale on which the human ear perceives frequency changes linearly specifically comprises:
converting the frequency scale into a logarithmic (mel) frequency scale by the following formula, so that the ear's sensitivity to frequency becomes a linear relationship:
mel(f) = 2595 * log10(1 + f/700)
where mel(f) is the log-frequency and f is the frequency obtained after the short-time Fourier transform.
Separating, by cepstrum analysis using a DCT, the DC component from the sinusoidal components of the converted frequency-domain signal specifically comprises:
mfcc(u) = c(u) * Σ[i=0 to N-1] mel(i) * cos((2i + 1) * u * π / (2N)), u = 0, 1, ..., N-1
where
c(0) = sqrt(1/N) and c(u) = sqrt(2/N) for u ≥ 1
and mfcc(u) is the cepstrum, mel(i) is the log-frequency, N is the number of frequency bins, and u is the cepstral frequency bin.
Extracting the sound-spectrum feature vector and converting the vector into an image specifically comprises:
linearly mapping the range of the output vector,
mfcc ∈ [min, max]
to the pixel range of an image,
pixel ∈ [0, 255]
thereby obtaining the cepstrogram of the sound, whose horizontal axis is time and whose vertical axis is frequency; here mfcc denotes the cepstrum, min and max denote the minimum and maximum of mfcc, and pixel denotes the pixel value after conversion to an image.
Fusing the image with the two-dimensional face image specifically comprises:
rotating the cepstrogram clockwise by 90 degrees; if the horizontal-axis length of the image to be spliced differs from that of the rotated cepstrogram, scaling the two-dimensional face image so that the two horizontal-axis lengths agree, and then splicing the two images together.
The present invention has the following advantages and beneficial effects: the proposed method of face and voiceprint feature fusion verification fuses the two modalities at the low-level feature layer, so that only a single verification is required, and it avoids the problem of application-layer joint verification in which a false detection in either modality causes the whole verification to fail, improving the user experience.
Detailed description of the invention
Fig. 1 is the time-domain signal of a sound;
Fig. 2 is the frequency-domain signal of a sound;
Fig. 3 is the Hamming window function of the present invention;
Fig. 4 is the cepstrogram of a sound according to the present invention;
Fig. 5 is the VGG16 network structure of the present invention;
Fig. 6 is the flow chart of the method of the present invention.
Specific embodiment
The present invention is described in further detail with reference to the accompanying drawings and embodiments.
The flow of the method for face and voiceprint feature fusion verification proposed by the present invention is shown in Fig. 6: the input audio file is parsed into a time-domain signal of the sound; the time-domain signal is converted into a frequency-domain signal by windowing, framing, and the short-time Fourier transform; the frequency scale of the frequency-domain signal is converted, via a log spectrum, into a scale on which the human ear perceives frequency changes linearly; by cepstrum analysis using a DCT, the DC component is separated from the sinusoidal components of the converted frequency-domain signal; a sound-spectrum feature vector is extracted and converted into an image; and the image is fused with a two-dimensional face image.
Fig. 1 shows the time-domain signal of a sound and Fig. 2 its frequency-domain signal. A speech signal is a one-dimensional time-domain signal, and the way its frequency content changes is hard to see directly. If it is transformed to the frequency domain by the Fourier transform, the frequency distribution of the signal becomes visible, but the time-domain information is lost and one cannot see how the frequency distribution changes over time. To solve this problem, we use the short-time Fourier transform (STFT) to determine the frequency and phase of the local sine waves of a time-varying signal. The specific method is to select a time-frequency localized window function, assuming the analysis window h(t) to be smooth over a short time interval, so that f(t)h(t) is a stationary signal within each finite time window, and then to compute the power spectrum at each moment. As the window function we select the Hamming window, a cosine window that reflects well how the energy at a given moment decays over time. The formula of the short-time Fourier transform is:
STFT(t, ω) = ∫ f(τ) h(τ - t) e^(-jωτ) dτ
where f(τ) is the time-domain signal of the input audio, τ is the integration variable, t is the moment in question, and ω is the angular frequency.
The Hamming window function is
w(n) = 0.54 - 0.46 * cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n is the discrete sample index of the window and N is the total number of samples, as shown in Fig. 3.
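As an illustration only (not part of the original disclosure), a minimal sketch of this windowed power-spectrum computation in Python with NumPy and SciPy might look as follows; the frame length and hop size are assumptions:

    import numpy as np
    from scipy.signal.windows import hamming

    def stft_power(signal, frame_len=400, hop=160):
        """Power spectrum at each moment via a Hamming-windowed STFT.

        signal is assumed to be a 1-D NumPy array of audio samples.
        """
        window = hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
        n_frames = 1 + (len(signal) - frame_len) // hop
        # Each frame is f(tau)*h(tau - t) for one moment t.
        frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                           for i in range(n_frames)])
        spectra = np.fft.rfft(frames, axis=1)
        return np.abs(spectra) ** 2  # shape: (n_frames, frame_len//2 + 1)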
The unit of frequency is the hertz (Hz), and the range the human ear can hear is 20-20000 Hz, but the ear does not perceive this scale linearly. The ordinary frequency scale is therefore converted into a logarithmic (mel) frequency scale, with the mapping shown in the following formula:
mel(f) = 2595 * log10(1 + f/700)
where mel(f) is the log-frequency and f is the frequency obtained after the short-time Fourier transform. Under this formula the ear's sensitivity to frequency becomes a linear relationship: on this scale, if the frequencies of two speech segments differ by a factor of two, the pitch perceived by the human ear also differs by roughly a factor of two.
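For instance, a direct transcription of this mapping in Python (a sketch, not part of the original text):

    import numpy as np

    def hz_to_mel(f):
        # mel(f) = 2595 * log10(1 + f / 700)
        return 2595.0 * np.log10(1.0 + f / 700.0)

    # Example: hz_to_mel(700.0) = 2595 * log10(2), about 781.2 mel.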
On the basis of the log spectrum, a DCT is used to separate the DC component from the sinusoidal components of the converted frequency-domain signal; the final result is called the cepstrum:
mfcc(u) = c(u) * Σ[i=0 to N-1] mel(i) * cos((2i + 1) * u * π / (2N)), u = 0, 1, ..., N-1
where
c(0) = sqrt(1/N) and c(u) = sqrt(2/N) for u ≥ 1
and mfcc(u) is the cepstrum, mel(i) is the log-frequency, N is the number of frequency bins, u is the cepstral frequency bin, and u = 0 corresponds to the DC component.
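A minimal sketch of this separation step: SciPy's orthonormal DCT-II applies exactly the c(u) scaling given above, so one call per frame suffices (the log-mel input is assumed to come from the preceding steps):

    from scipy.fftpack import dct

    def cepstrum(log_mel_frame):
        # DCT-II with norm='ortho' uses c(0)=sqrt(1/N) and
        # c(u)=sqrt(2/N) for u>0, so index 0 of the output is the
        # DC component and higher indices are sinusoidal components.
        return dct(log_mel_frame, type=2, norm='ortho')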
Since the output of the cepstrum is a vector, it cannot be displayed as a picture directly and must be converted into an image matrix. The range of the output vector,
mfcc ∈ [min, max]
is linearly mapped to the pixel range of an image,
pixel ∈ [0, 255]
This yields the cepstrogram of the sound, shown in Fig. 4, in which the horizontal axis is time and the vertical axis is frequency; the brighter a point, the larger its value (the greater the energy). Here mfcc denotes the cepstrum, min and max denote the minimum and maximum of mfcc, and pixel denotes the pixel value after conversion to an image.
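The linear rescaling to [0, 255] can be sketched as follows, assuming mfcc is the two-dimensional matrix of cepstra over all frames:

    import numpy as np

    def to_image(mfcc):
        # Map mfcc in [min, max] linearly onto pixel in [0, 255].
        lo, hi = mfcc.min(), mfcc.max()
        pixel = (mfcc - lo) / (hi - lo) * 255.0
        return pixel.astype(np.uint8)  # 8-bit grayscale cepstrogram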
In general the frequency bins we extract are fixed, i.e. the length of the vertical axis is fixed. Fig. 4 is rotated clockwise by 90 degrees and then spliced to the face picture. If the horizontal-axis length of the face picture differs from that of the rotated cepstrogram, the face picture is scaled so that the two horizontal-axis lengths agree. The picture obtained after splicing is the fused feature. Finally, a recognition model is trained with a convolutional neural network. A typical CNN classification model can be abbreviated as two steps:
z = CNN(x)
p = softmax(zW)
where x is the input picture and p is the output probability for each class. When training it as a classification problem, the inputs are x and the corresponding one-hot label p; but in use we do not need the whole model, only the z = CNN(x) part, which is responsible for converting a picture into a vector of fixed length. With this transformation model (an encoder), a new fused face-voice feature can be encoded in any scenario, and verification reduces to comparing these encoding vectors, no longer depending on the original classification model. This completes the algorithm for biometric recognition using the fused feature.
The following is an embodiment of the present invention:
We recognize the fused feature with a convolutional neural network; specifically, we use the VGG16 network, whose structure is shown in Fig. 5. It has 16 layers in total (not counting the pooling and softmax layers); all convolution kernels are of size 3*3, all pooling is 2*2 max pooling with stride 2, and the convolution-layer depths are successively 64->128->256->512->512. VGG16 is one kind of neural network; the training data may also vary, both in which corpus they are drawn from and in quantity.
The training data consist of 1000 face pictures extracted from ImageNet (50 people, 20 pictures each) and 1000 human sound clips extracted from AudioSet (likewise 50 people, 20 clips each). The 50 face identities are paired at random with the 50 voice identities, and the face pictures and sound clips of each pair are then combined at random, finally yielding 50 face-voice classes with 20 samples each. After the features are fused as described above, they are fed into the VGG16 network to build and train a 50-class model, yielding the CNN-based feature encoder.
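A minimal sketch of such a 50-class VGG16 classifier in Keras; the 224*224 input resolution, the replication of the grayscale fused image to three channels, and the optimizer are assumptions not specified above:

    from tensorflow import keras

    # VGG16 backbone: 3*3 convolutions throughout, 2*2 max pooling with
    # stride 2, convolution depths 64->128->256->512->512, followed by
    # a 50-way softmax head for the 50 face-voice classes.
    backbone = keras.applications.VGG16(include_top=False, weights=None,
                                        input_shape=(224, 224, 3),
                                        pooling='avg')
    model = keras.Sequential([backbone,
                              keras.layers.Dense(50, activation='softmax')])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # After training, the backbone alone serves as the encoder z = CNN(x).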

Claims (7)

1. A method for face and voiceprint feature fusion verification, characterized by comprising the following steps:
parsing an input audio file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal by windowing, framing, and the short-time Fourier transform;
converting the frequency scale of the frequency-domain signal, via a log spectrum, into a scale on which the human ear perceives frequency changes linearly;
separating, by cepstrum analysis using a DCT, the DC component from the sinusoidal components of the converted frequency-domain signal;
extracting a sound-spectrum feature vector and converting the vector into an image;
fusing the image with a two-dimensional face image.
2. The method for face and voiceprint feature fusion verification according to claim 1, characterized in that converting the time-domain signal into a frequency-domain signal by windowing, framing, and the short-time Fourier transform specifically comprises:
selecting a time-frequency localized window function h(t) and computing, by the short-time Fourier transform, the power spectrum at each moment; the formula of the short-time Fourier transform is:
STFT(t, ω) = ∫ f(τ) h(τ - t) e^(-jωτ) dτ
where f(τ) is the time-domain signal of the input audio, τ is the integration variable, t is the moment in question, and ω is the angular frequency.
3. The method for face and voiceprint feature fusion verification according to claim 2, characterized in that the window function is a Hamming window.
4. The method for face and voiceprint feature fusion verification according to claim 1, characterized in that converting, via a log spectrum, into a scale on which the human ear perceives frequency changes linearly specifically comprises:
converting the frequency scale into a logarithmic (mel) frequency scale by the following formula, so that the ear's sensitivity to frequency becomes a linear relationship:
mel(f) = 2595 * log10(1 + f/700)
where mel(f) is the log-frequency and f is the frequency obtained after the short-time Fourier transform.
5. The method for face and voiceprint feature fusion verification according to claim 1, characterized in that separating, by cepstrum analysis using a DCT, the DC component from the sinusoidal components of the converted frequency-domain signal specifically comprises:
mfcc(u) = c(u) * Σ[i=0 to N-1] mel(i) * cos((2i + 1) * u * π / (2N)), u = 0, 1, ..., N-1
where
c(0) = sqrt(1/N) and c(u) = sqrt(2/N) for u ≥ 1
and mfcc(u) is the cepstrum, mel(i) is the log-frequency, N is the number of frequency bins, and u is the cepstral frequency bin.
6. The method for face and voiceprint feature fusion verification according to claim 1, characterized in that extracting the sound-spectrum feature vector and converting the vector into an image specifically comprises:
linearly mapping the range of the output vector,
mfcc ∈ [min, max]
to the pixel range of an image,
pixel ∈ [0, 255]
thereby obtaining the cepstrogram of the sound, whose horizontal axis is time and whose vertical axis is frequency; wherein mfcc denotes the cepstrum, min and max denote the minimum and maximum of mfcc, and pixel denotes the pixel value after conversion to an image.
7. The method for face and voiceprint feature fusion verification according to claim 1, characterized in that fusing the image with the two-dimensional face image specifically comprises:
rotating the cepstrogram clockwise by 90 degrees; if the horizontal-axis length of the image to be spliced differs from that of the rotated cepstrogram, scaling the two-dimensional face image so that the two horizontal-axis lengths agree, and then splicing the two images together.
CN201910641594.7A 2019-07-16 2019-07-16 Method for face and voiceprint feature fusion verification Pending CN110363148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910641594.7A CN110363148A (en) 2019-07-16 2019-07-16 Method for face and voiceprint feature fusion verification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910641594.7A CN110363148A (en) 2019-07-16 2019-07-16 Method for face and voiceprint feature fusion verification

Publications (1)

Publication Number Publication Date
CN110363148A true CN110363148A (en) 2019-10-22

Family

ID=68219964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910641594.7A Pending CN110363148A (en) Method for face and voiceprint feature fusion verification

Country Status (1)

Country Link
CN (1) CN110363148A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835507A (en) * 2015-03-30 2015-08-12 渤海大学 Serial-parallel combined multi-mode emotion information fusion and identification method
CN105469253A (en) * 2015-11-19 2016-04-06 桂林航天工业学院 Handset NFC safety payment method based on integrated voiceprint and face characteristic encryption
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN107274887A (en) * 2017-05-09 2017-10-20 重庆邮电大学 Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN108124488A (en) * 2017-12-12 2018-06-05 福建联迪商用设备有限公司 A kind of payment authentication method and terminal based on face and vocal print
CN108399395A (en) * 2018-03-13 2018-08-14 成都数智凌云科技有限公司 The compound identity identifying method of voice and face based on end-to-end deep neural network
CN108847251A (en) * 2018-07-04 2018-11-20 武汉斗鱼网络科技有限公司 A kind of voice De-weight method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN109910818A (en) * 2019-02-15 2019-06-21 东华大学 A kind of VATS Vehicle Anti-Theft System based on human body multiple features fusion identification

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709004A (en) * 2020-08-19 2020-09-25 北京远鉴信息技术有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111709004B (en) * 2020-08-19 2020-11-13 北京远鉴信息技术有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111814128A (en) * 2020-09-01 2020-10-23 北京远鉴信息技术有限公司 Identity authentication method, device, equipment and storage medium based on fusion characteristics
CN111814128B (en) * 2020-09-01 2020-12-11 北京远鉴信息技术有限公司 Identity authentication method, device, equipment and storage medium based on fusion characteristics
CN112133311A (en) * 2020-09-18 2020-12-25 科大讯飞股份有限公司 Speaker recognition method, related device and readable storage medium
CN113114417A (en) * 2021-03-30 2021-07-13 深圳市冠标科技发展有限公司 Audio transmission method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110363148A (en) Method for face and voiceprint feature fusion verification
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
US20190096400A1 (en) Method and apparatus for providing voice service
CN112489635A (en) Multi-mode emotion recognition method based on attention enhancement mechanism
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN112099628A (en) VR interaction method and device based on artificial intelligence, computer equipment and medium
CN1971621A (en) Generating method of cartoon face driven by voice and text together
CN113658583B (en) Ear voice conversion method, system and device based on generation countermeasure network
CN111930900B (en) Standard pronunciation generating method and related device
CN109065073A (en) Speech-emotion recognition method based on depth S VM network model
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN109947971A (en) Image search method, device, electronic equipment and storage medium
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
Uttam et al. Hush-Hush Speak: Speech Reconstruction Using Silent Videos.
CN114360492A (en) Audio synthesis method and device, computer equipment and storage medium
CN105741853A (en) Digital speech perception hash method based on formant frequency
Deschamps-Berger et al. Exploring attention mechanisms for multimodal emotion recognition in an emergency call center corpus
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN116682463A (en) Multi-mode emotion recognition method and system
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191022)