CN113270112A - Electronic camouflage voice automatic distinguishing and restoring method and system - Google Patents


Info

Publication number
CN113270112A
Authority
CN
China
Prior art keywords
voice
camouflage
restored
factor
detection
Prior art date
Legal status (assumed; not a legal conclusion, as Google has not performed a legal analysis)
Pending
Application number
CN202110476114.3A
Other languages
Chinese (zh)
Inventor
孙蒙
李嘉康
孙雅茹
张雄伟
张星昱
曹铁勇
郑琳琳
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (assumed; not a legal conclusion, as Google has not performed a legal analysis)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202110476114.3A
Publication of CN113270112A


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/08 - Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a method and system for automatically distinguishing and restoring electronic camouflage voice, comprising the following steps: acquiring the voice audio to be detected; traversing all values in the theoretical range of the camouflage factor and, for each value, pre-restoring the voice audio to be detected, obtaining a pre-restored voice for every camouflage factor; extracting audio features from all pre-restored voices with an automatic speaker verification method; inputting all audio features into a pre-trained universal camouflage voice detection model and outputting a detection score for each pre-restored voice; and comparing all detection scores, outputting the camouflage factor whose score is closest to the score of non-camouflaged voice as the correct camouflage factor, and taking the corresponding pre-restored voice as the correctly restored voice. Advantages: the method improves the precision with which the degree of camouflage is identified, and thereby the accuracy of identifying the speaker behind a camouflaged voice; it also addresses the poor accuracy of existing camouflage-degree identification methods in real, noisy scenes.

Description

Electronic camouflage voice automatic distinguishing and restoring method and system
Technical Field
The invention relates to a method and system for automatically distinguishing and restoring electronic camouflage voice, and belongs to the technical field of voice signal processing.
Background
The development of voice technology has brought great convenience, but the emergence of electronic voice camouflage poses a serious challenge to speaker recognition. Various voice changers and voice-changing software on the market can camouflage a voice so that neither the human ear nor voiceprint recognition products can identify the speaker. This seriously undermines voice inspection, forensic identification, recognition and monitoring, and gives criminals an opening for voice-based crime.
To counter electronic voice camouflage, the camouflaged voice must be restored before the speaker's identity can be identified. Prior work abstracts and simplifies the restoration of electronically camouflaged voice into the estimation of the camouflage factor [Y. Wang, H. Wu, and J. Huang, "Verification of hidden speaker behind transformation disguised voices," Digital Signal Processing, vol. 45, pp. 84-95, 2015], estimating the camouflage factor with a fundamental-frequency-ratio method and then restoring the camouflaged voice. However, that method places overly strict requirements on voice quality, and its performance on noisy camouflaged voice under real conditions is not ideal.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method and system for automatically distinguishing and restoring electronic camouflage voice, thereby improving the accuracy of camouflage-degree identification.
To solve the above technical problems, the present invention provides an automatic electronic camouflage voice distinguishing and restoring method, comprising:
acquiring the voice audio to be detected;
traversing all values in the theoretical range of the camouflage factor and, for each value, pre-restoring the voice audio to be detected, obtaining a pre-restored voice for every camouflage factor;
extracting audio features from all pre-restored voices with an automatic speaker verification method;
inputting the audio features of all pre-restored voices into a pre-trained universal camouflage voice detection model and outputting a detection score for each pre-restored voice;
and comparing all detection scores, outputting the camouflage factor whose score is closest to the score of non-camouflaged voice as the correct camouflage factor, and taking the corresponding pre-restored voice as the correctly restored voice.
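The steps above amount to a single search loop over candidate camouflage factors. The following is a minimal, hypothetical sketch, not the patent's implementation: `pre_restore`, `extract_features`, and `detection_score` are toy stand-ins for the inverse camouflage transform, the speaker-verification front end, and the detection model, and a simple gain change plays the role of the "camouflage" so the loop's logic stays visible.

```python
import numpy as np

def pre_restore(y, r):
    # toy inverse camouflage: undo a hypothetical gain camouflage y = x * 2**(r/12)
    return y / (2.0 ** (r / 12.0))

def extract_features(y):
    # toy stand-in for the ASV feature extractor g(.)
    return np.array([y.mean(), y.std()])

def detection_score(feat, model):
    # toy detection model: negative distance to the non-camouflaged reference,
    # so a higher score means "closer to undisguised"
    return -np.linalg.norm(feat - model["real_feat"])

def distinguish_and_restore(y, model, R=11):
    best_r, best_score, best_voice = None, -np.inf, None
    for r in range(-R, R + 1):          # traverse the theoretical factor range
        x_hat = pre_restore(y, r)       # pre-restored voice for this factor
        score = detection_score(extract_features(x_hat), model)
        if score > best_score:          # keep the score closest to undisguised
            best_r, best_score, best_voice = r, score, x_hat
    return best_r, best_voice

# toy demonstration: camouflage a signal with factor 4, then recover it
x = np.sin(np.linspace(0, 10, 1000))
model = {"real_feat": extract_features(x)}
y = x * 2.0 ** (4 / 12.0)               # toy "camouflaged" input
best_r, restored = distinguish_and_restore(y, model)
```

With these stand-ins the loop recovers the factor 4 exactly, because only the correct inverse maps the features back onto the reference.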
Further, pre-restoring the voice audio to be detected value by value comprises:
fixing one camouflage factor;
and modifying the voice spectrum according to that camouflage factor to obtain a pre-restored voice.
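As a hedged illustration of "modifying the voice spectrum according to the camouflage factor", the sketch below assumes a pitch-shift-style camouflage that moves spectral content from bin k to bin k * 2**(r/12), and inverts it by resampling the frequency axis; the patent's actual pre-restoration transform may differ.

```python
import numpy as np

def warp_spectrum(mag, r):
    """Undo an assumed pitch-shift camouflage by rescaling the frequency axis.

    A camouflage factor r (in semitone-like units) is assumed to have moved
    spectral content from bin k to bin k * 2**(-r/12)'s inverse, i.e. reading
    each restored bin from position bin * 2**(r/12) maps the content back.
    """
    bins = np.arange(len(mag), dtype=float)
    src = bins * 2.0 ** (r / 12.0)      # where each restored bin reads from
    return np.interp(src, bins, mag, right=0.0)

# demo: a single spectral peak moved from bin 100 to bin 200 by r = 12
disguised = np.zeros(512)
disguised[200] = 1.0
restored = warp_spectrum(disguised, 12)
```

After warping, the peak sits back at bin 100, the position it would occupy in the non-camouflaged spectrum under this assumed model.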
Further, the training process of the pre-trained universal camouflage voice detection model comprises:
acquiring a training data set containing real voice and electronic camouflage voice;
extracting Mel-frequency cepstral coefficients from the training data set;
extracting, with a speaker verification method based on a Gaussian mixture model and on Mel-cepstrum-based speaker feature vectors, audio features of the real voice and of electronic camouflage voice with different camouflage factors;
and taking these real-voice and camouflage-voice audio features as input data, taking the camouflage factor of each electronic camouflage voice as the target value, establishing a machine-learning regression model between input and target, and determining the final universal camouflage voice detection model by estimating the model parameters that minimize the regression error.
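The regression step can be illustrated with ordinary least squares standing in for the unspecified regressor (the patent leaves the concrete model open; SVR is mentioned later in the description). The feature data below are entirely synthetic and exist only to show the shape of the fit from audio features to camouflage factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in data: 200 utterances with 8-dim features whose first
# dimension carries (slightly noisy) camouflage-factor information
factors = rng.integers(-11, 12, size=200).astype(float)
X = rng.normal(size=(200, 8))
X[:, 0] = factors + 0.01 * rng.normal(size=200)

# least-squares regression from features to camouflage factor:
# minimize ||Xb w - factors||^2 over the weight vector w
Xb = np.hstack([X, np.ones((200, 1))])           # append a bias column
w, *_ = np.linalg.lstsq(Xb, factors, rcond=None)

pred = Xb @ w                                     # model's factor estimates
mean_abs_error = float(np.abs(pred - factors).mean())
```

Because the synthetic features encode the factor almost directly, the fitted model recovers it with a small mean absolute error; a real system would of course face far noisier features.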
Further, comparing all the detection scores comprises:
inputting the audio features extracted from the pre-restored voice obtained with each camouflage factor into the universal camouflage voice detection model to obtain detection scores, comparing these scores with the non-camouflage score of the model, and outputting the camouflage factor with the most similar score, together with its pre-restored voice, as the result.
An automatic electronic camouflage voice distinguishing and restoring system comprises:
an acquisition module for acquiring the voice audio to be detected;
a traversal module for traversing all values in the theoretical range of the camouflage factor and, for each value, pre-restoring the voice audio to be detected, obtaining a pre-restored voice for every camouflage factor;
an extraction module for extracting audio features from all pre-restored voices with an automatic speaker verification method;
an output module for inputting the audio features of all pre-restored voices into a pre-trained universal camouflage voice detection model and outputting the detection score of each pre-restored voice;
and a comparison module for comparing all detection scores, outputting the camouflage factor whose score is closest to the score of non-camouflaged voice as the correct camouflage factor, and taking the corresponding pre-restored voice as the correctly restored voice.
Further, the traversal module comprises a pre-restoration module for fixing one camouflage factor and modifying the voice spectrum according to that factor to obtain a pre-restored voice.
Further, the output module comprises a model training module for: acquiring a training data set containing real voice and electronic camouflage voice; extracting Mel-frequency cepstral coefficients from the training data set; extracting, with a speaker verification method based on a Gaussian mixture model and Mel-cepstrum-based speaker feature vectors, audio features of the real voice and of electronic camouflage voice with different camouflage factors; and taking these audio features as input data, taking the camouflage factor of each electronic camouflage voice as the target value, establishing a machine-learning regression model between input and target, and determining the final universal camouflage voice detection model by estimating the model parameters that minimize the regression error.
Further, the comparison module compares all the detection scores by inputting the audio features extracted from the pre-restored voice obtained with each camouflage factor into the universal camouflage voice detection model to obtain detection scores, comparing these scores with the non-camouflage score of the model, and outputting the camouflage factor with the most similar score, together with its pre-restored voice, as the result.
The invention achieves the following beneficial effects:
the identification precision of the disguised degree is improved, so that the accuracy of identifying the identity of the disguised voice speaker is improved; the discrimination and the reduction of the disguised voice are finished by means of the noise robustness of the automatic speaker recognition and a universal disguised voice detection model; the method solves the problems that the accuracy is not ideal and suspect voice data is needed in a real noisy scene in the existing camouflage degree identification method.
Drawings
FIG. 1 is a schematic diagram of an electronic camouflage voice reduction method of the present invention;
FIG. 2 is a schematic diagram of a speaker verification method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an automatic distinguishing and restoring method for electronic camouflage voice includes the following steps:
step 1: first, a disguised speech detection model is trained using a training data set containing real speech and electronically disguised speech for detecting the type of pre-restored speech. In the process of training the disguised voice detection model, firstly, extracting Mel Frequency Cepstral Coefficients (MFCC) of training voice data, and respectively training voice Models of real voice and electronic disguised voice, as shown in FIG. 2, extracting audio information by using a speaker confirmation method based on (Gaussian Mixture Models, generalized Background Models, GMM-UBM) and i-vector, extracting disguised information in audio by using the method, and expressing the disguised information in the audio by using a function g; and after the i-vector of the training voice is obtained, the training voice is respectively input into the camouflage voice detection model for classification training, so that the camouflage voice detection model can judge whether the voice to be detected is real voice or electronic camouflage voice and the camouflage degree.
Step 2: for input disguised voice data y to be detected, traversing a theoretical value range-R to R of a disguising factor R, wherein R can be 11, utilizing the theoretical value of the disguising factor to pre-restore the frequency spectrum characteristics of the voice to be detected one by one, and if a disguising function is recorded as f, recording the process as f-1The voice obtained by pre-reducing each camouflage factor is f-1(y)。
And step 3: pre-reduction is carried out to obtain the characteristics g (f) of the audio frequency-1(y)) is input to the masquerade voice detection model to obtain a detection score.
Step 4: All detection scores are compared; the camouflage factor whose detection score is closest to the score of non-camouflaged voice is taken as the correct camouflage factor, and the corresponding pre-restored voice as the correctly restored voice.
In one embodiment, the degree of camouflage may be expressed numerically as a camouflage value. The closer the detection score is to the real-voice score (the smaller the camouflage value), the lower the degree and likelihood of camouflage; conversely, the farther the score is from the real-voice score (the larger the camouflage value), the greater the degree and likelihood of camouflage. The pre-restored voice with the smallest camouflage value is taken as the correctly restored voice, and its camouflage factor as the correct camouflage factor.
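A toy numeric example of this camouflage-value comparison (all scores below are invented purely for illustration):

```python
# invented detection scores for four candidate camouflage factors
real_score = 0.95                                  # score of non-camouflaged voice
scores = {-4: 0.31, 0: 0.62, 4: 0.93, 8: 0.55}     # factor -> detection score

# camouflage value: distance of each score from the non-camouflaged reference
camouflage_values = {r: abs(s - real_score) for r, s in scores.items()}

# the factor with the smallest camouflage value is taken as correct
best_factor = min(camouflage_values, key=camouflage_values.get)
```

Here factor 4 wins because its score 0.93 lies closest to the reference 0.95.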
In the above embodiment, the g function may also be implemented with deep learning models such as the d-vector or x-vector; the camouflage voice detection model may be implemented with a Support Vector Machine (SVM), Support Vector Regression (SVR), and the like.
To illustrate the effect of the invention, electronic camouflage voice was generated with the SoundStretch audio processing program. SoundStretch can apply three operations to a WAV audio file: Rate (changing speed and pitch together), Pitch (changing pitch without changing speed), and Tempo (changing speed without changing pitch). Since Rate processing causes little interference either to a speaker verification system or to human-ear recognition, only Pitch and Tempo processing are considered here as camouflage measures. When the degree of camouflage is too small, the camouflage effect is insignificant; when it is too large, the semantic content becomes indistinguishable; in either case the threat to speaker verification and human listening is small. Accordingly, 18 camouflage settings are considered, with camouflage factors +3 to +11 and -3 to -11.
VoxCeleb1 is an audio-visual data set consisting of short clips of human speech extracted from interview videos uploaded to YouTube, containing real noise occurring at irregular times. The speakers cover different ages, genders and accents, and the recording scenes are varied, including red-carpet interviews, outdoor venues and indoor studios; the utterances are entirely real English speech. Here, 100 speakers were randomly selected from the data set, with 11 utterances per speaker, of which 10 were used for enrollment and 1 for testing. The test voices were camouflaged to different degrees with the SoundStretch program, yielding 18 groups of frequency-domain electronic camouflage voices with camouflage factors +3 to +11 and -3 to -11, and 18 groups of time-domain electronic camouflage voices with the same factors; each group contains 1 camouflaged test voice from each of the 100 speakers.
The authenticity of speech (real voice vs. electronic camouflage voice) was judged on the Rate camouflage data set of VoxCeleb1 using an SVM classifier; the experimental results are shown in Table 1.
TABLE 1. Results of SVM real/camouflage discrimination on the Rate camouflage data set of VoxCeleb1
The fundamental-frequency-ratio method was used to estimate the camouflage factor on the Rate camouflage data set of VoxCeleb1; the experimental results are shown in Table 2, where:
(1) s is the true camouflage factor;
(2) s'_mean(s) is the mean of the camouflage factors estimated for each group of data;
(3) the mean error is E_mean(s) = |s'_mean(s) - s|;
(4) E_mean(s)/s is the mean error rate;
(5) Var(s) is the variance of each group of experimental data.
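A small worked example of these metrics, using invented per-utterance estimates for a group whose true factor is s = 5:

```python
import numpy as np

s = 5.0                                           # true camouflage factor
estimates = np.array([4.8, 5.3, 4.9, 5.2, 5.0])   # hypothetical estimates s'

s_mean = estimates.mean()          # s'_mean(s): mean estimated factor
E_mean = abs(s_mean - s)           # mean error E_mean(s) = |s'_mean(s) - s|
error_rate = E_mean / s            # mean error rate E_mean(s) / s
variance = estimates.var()         # Var(s): variance of the group
```

For these invented numbers s'_mean = 5.04, so the mean error is 0.04 and the mean error rate is 0.8%.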
TABLE 2. Results of estimating the camouflage factor with the fundamental-frequency ratio on the Rate camouflage data set of VoxCeleb1
The GMM-UBM system was used to run camouflage factor estimation experiments on the Rate and Tempo camouflage data sets of VoxCeleb1; the results are shown in Tables 3 and 4. The mean error rate of the camouflage factors estimated by the automatic speaker verification system was 12.26% with a mean variance of 5.05 on the Rate data set, and 14.13% with a mean variance of 6.44 on the Tempo data set.
TABLE 3. Results of estimating the camouflage factor with ASV on the Rate camouflage data set of VoxCeleb1
Correspondingly, the invention also provides an automatic electronic camouflage voice distinguishing and restoring system, comprising:
an acquisition module for acquiring the voice audio to be detected;
a traversal module for traversing all values in the theoretical range of the camouflage factor and, for each value, pre-restoring the voice audio to be detected, obtaining a pre-restored voice for every camouflage factor;
an extraction module for extracting audio features from all pre-restored voices with an automatic speaker verification method;
an output module for inputting the audio features of all pre-restored voices into a pre-trained universal camouflage voice detection model and outputting the detection score of each pre-restored voice;
and a comparison module for comparing all detection scores, outputting the camouflage factor whose score is closest to the score of non-camouflaged voice as the correct camouflage factor, and taking the corresponding pre-restored voice as the correctly restored voice.
The traversal module comprises a pre-restoration module for fixing one camouflage factor and modifying the voice spectrum according to that factor to obtain a pre-restored voice.
The output module comprises a model training module for: acquiring a training data set containing real voice and electronic camouflage voice; extracting Mel-frequency cepstral coefficients from the training data set; extracting, with a speaker verification method based on a Gaussian mixture model and Mel-cepstrum-based speaker feature vectors, audio features of the real voice and of electronic camouflage voice with different camouflage factors; and taking these audio features as input data, taking the camouflage factor of each electronic camouflage voice as the target value, establishing a machine-learning regression model between input and target, and determining the final universal camouflage voice detection model by estimating the model parameters that minimize the regression error.
The comparison module compares all the detection scores by inputting the audio features extracted from the pre-restored voice obtained with each camouflage factor into the universal camouflage voice detection model to obtain detection scores, comparing these scores with the non-camouflage score of the model, and outputting the camouflage factor with the most similar score, together with its pre-restored voice, as the result.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. An automatic electronic camouflage voice distinguishing and restoring method, characterized by comprising:
acquiring the voice audio to be detected;
traversing all values in the theoretical range of the camouflage factor and, for each value, pre-restoring the voice audio to be detected, obtaining a pre-restored voice for every camouflage factor;
extracting audio features from all pre-restored voices with an automatic speaker verification method;
inputting the audio features of all pre-restored voices into a pre-trained universal camouflage voice detection model and outputting a detection score for each pre-restored voice;
and comparing all detection scores, outputting the camouflage factor whose score is closest to the score of non-camouflaged voice as the correct camouflage factor, and taking the corresponding pre-restored voice as the correctly restored voice.
2. The automatic electronic camouflage voice distinguishing and restoring method according to claim 1, characterized in that pre-restoring the voice audio to be detected value by value comprises:
fixing one camouflage factor;
and modifying the voice spectrum according to that camouflage factor to obtain a pre-restored voice.
3. The automatic electronic camouflage voice distinguishing and restoring method according to claim 1, characterized in that the training process of the pre-trained universal camouflage voice detection model comprises:
acquiring a training data set containing real voice and electronic camouflage voice;
extracting Mel-frequency cepstral coefficients from the training data set;
extracting, with a speaker verification method based on a Gaussian mixture model and Mel-cepstrum-based speaker feature vectors, audio features of the real voice and of electronic camouflage voice with different camouflage factors;
and taking these audio features as input data, taking the camouflage factor of each electronic camouflage voice as the target value, establishing a machine-learning regression model between input and target, and determining the final universal camouflage voice detection model by estimating the model parameters that minimize the regression error.
4. The automatic electronic camouflage voice distinguishing and restoring method according to claim 1, characterized in that comparing all the detection scores comprises:
inputting the audio features extracted from the pre-restored voice obtained with each camouflage factor into the universal camouflage voice detection model to obtain detection scores, comparing these scores with the non-camouflage score of the model, and outputting the camouflage factor with the most similar score, together with its pre-restored voice, as the result.
5. An automatic electronic camouflage voice distinguishing and restoring system, comprising:
an acquisition module, configured to acquire the voice audio information to be detected;
a traversal module, configured to traverse all values in the theoretical range of the camouflage factor and, for each value, perform pre-restoration on the voice audio information to be detected, obtaining the pre-restored voices corresponding to all camouflage factors;
an extraction module, configured to extract audio features from all the pre-restored voices using an automatic speaker verification method;
an output module, configured to input the audio features of all the pre-restored voices into a pre-trained universal camouflage voice detection model for detection and to output the detection score of each pre-restored voice; and
a comparison module, configured to compare all the detection scores, output the camouflage factor whose score is closest to that of undisguised voice as the correct camouflage factor, and take the corresponding pre-restored voice as the correctly restored voice.
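The five modules chain into one traverse-restore-score-compare loop. A self-contained sketch under stated assumptions: `pre_restore`, `extract_features` and `scorer` are placeholder stand-ins (simple arithmetic, not real spectral or MFCC processing) so the control flow of the claimed system can run end to end:

```python
import numpy as np

def distinguish_and_restore(audio, scorer, factor_grid, genuine_score):
    """End-to-end sketch of the claimed system: traverse the factor grid
    (traversal module), pre-restore with each candidate, extract features
    (extraction module), score each restoration (output module), and keep
    the candidate scoring closest to genuine speech (comparison module)."""
    def pre_restore(x, a):        # placeholder spectrum modification
        return x / a
    def extract_features(x):      # placeholder MFCC-style features
        return np.array([x.mean(), x.std()])
    candidates = {a: pre_restore(audio, a) for a in factor_grid}
    scores = {a: scorer(extract_features(y)) for a, y in candidates.items()}
    best = min(scores, key=lambda a: abs(scores[a] - genuine_score))
    return best, candidates[best]

# Toy run: a "voice" whose feature mean 1.0 marks genuine speech,
# disguised by scaling with factor 2; the traversal recovers that factor.
audio = np.full(16, 2.0)
factor, restored = distinguish_and_restore(
    audio, scorer=lambda f: f[0], factor_grid=[1.0, 2.0, 4.0],
    genuine_score=1.0)
print(factor)  # 2.0
```

Swapping the placeholders for a real spectral warp, MFCC front end, and GMM-based scorer recovers the architecture described in claims 2 through 4.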
6. The automatic electronic camouflage voice distinguishing and restoring system according to claim 5, wherein the traversal module comprises a pre-restoring module configured to fix one camouflage factor and modify the voice spectrum according to that factor to obtain the pre-restored voice.
7. The automatic electronic camouflage voice distinguishing and restoring system according to claim 5, wherein the output module comprises a model training module configured to: acquire a training data set containing real voice and electronic camouflage voice; extract Mel cepstral coefficients from the training data set; extract real-voice training audio features and electronic-camouflage-voice audio features under different camouflage factors from the real voice and electronic camouflage voice of the training data set, using a speaker verification method based on a Gaussian mixture model and speaker feature vectors derived from the Mel cepstral coefficients; and take these features as input data, take the camouflage factor of each electronic camouflage voice as the target value, establish a machine learning regression model between the input data and the target values, and determine the final universal camouflage voice detection model by estimating the model parameters that minimize the regression error.
8. The system according to claim 5, wherein the comparison module is configured to: input the audio features extracted from the pre-restored voice obtained under each camouflage factor into the universal camouflage voice detection model to obtain detection scores, compare the detection scores with the undisguised-voice score of the model, and take the camouflage factor whose score is closest to the undisguised score, together with its pre-restored voice, as the output result.
CN202110476114.3A 2021-04-29 2021-04-29 Electronic camouflage voice automatic distinguishing and restoring method and system Pending CN113270112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110476114.3A CN113270112A (en) 2021-04-29 2021-04-29 Electronic camouflage voice automatic distinguishing and restoring method and system

Publications (1)

Publication Number Publication Date
CN113270112A true CN113270112A (en) 2021-08-17

Family

ID=77230110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110476114.3A Pending CN113270112A (en) 2021-04-29 2021-04-29 Electronic camouflage voice automatic distinguishing and restoring method and system

Country Status (1)

Country Link
CN (1) CN113270112A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103730121A (en) * 2013-12-24 2014-04-16 中山大学 Method and device for recognizing disguised sounds
CN104464724A (en) * 2014-12-08 2015-03-25 南京邮电大学 Speaker recognition method for deliberately pretended voices

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiongwei et al., "Research Status and Prospects of Speech Spoofing Detection Methods", Journal of Data Acquisition and Processing (数据采集与处理) *
ZHENG Linlin et al., "Robust Restoration of Electronically Disguised Voice Based on i-vector", Journal of Data Acquisition and Processing (数据采集与处理), pages 880-891 *
ZHENG Linlin et al., "A Survey of Voice Disguise Methods and Their Countermeasures", Network and Information Security (网络与信息安全) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013323A (en) * 2022-12-27 2023-04-25 浙江大学 Active evidence obtaining method oriented to voice conversion
CN115831127A (en) * 2023-01-09 2023-03-21 浙江大学 Voiceprint reconstruction model construction method and device based on voice conversion and storage medium
CN115831127B (en) * 2023-01-09 2023-05-05 浙江大学 Voiceprint reconstruction model construction method and device based on voice conversion and storage medium

Similar Documents

Publication Publication Date Title
CN109473123B (en) Voice activity detection method and device
US11869261B2 (en) Robust audio identification with interference cancellation
Nagrani et al. VoxCeleb: a large-scale speaker identification dataset
Muckenhirn et al. Long-term spectral statistics for voice presentation attack detection
CN102394062B (en) Method and system for automatically identifying voice recording equipment source
Liu et al. Simultaneous utilization of spectral magnitude and phase information to extract supervectors for speaker verification anti-spoofing
CN109584884A (en) A kind of speech identity feature extractor, classifier training method and relevant device
CN108986824A (en) A kind of voice playback detection method
Marcheret et al. Detecting audio-visual synchrony using deep neural networks.
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN113270112A (en) Electronic camouflage voice automatic distinguishing and restoring method and system
Yoon et al. A new replay attack against automatic speaker verification systems
CN111816185A (en) Method and device for identifying speaker in mixed voice
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
Muckenhirn et al. Presentation attack detection using long-term spectral statistics for trustworthy speaker verification
US20200082830A1 (en) Speaker recognition
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
Weng et al. The SYSU system for the Interspeech 2015 automatic speaker verification spoofing and countermeasures challenge
CN113920560A (en) Method, device and equipment for identifying identity of multi-modal speaker
Shi et al. Speech emotion recognition based on data mining technology
CN114303186A (en) System and method for adapting human speaker embedding in speech synthesis
CN111091810A (en) VR game character expression control method based on voice information and storage medium
Liu et al. Identification of fake stereo audio
CN112489692A (en) Voice endpoint detection method and device
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination