CN115083422A - Voice traceability evidence obtaining method and device, equipment and storage medium - Google Patents

Voice traceability evidence obtaining method and device, equipment and storage medium

Info

Publication number
CN115083422A
Authority
CN
China
Prior art keywords
level
algorithm
voice
fingerprint
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210859678.XA
Other languages
Chinese (zh)
Other versions
CN115083422B (en)
Inventor
陶建华
晏鑫蕊
易江燕
张震
李鹏
石瑾
王立强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, National Computer Network and Information Security Management Center filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210859678.XA priority Critical patent/CN115083422B/en
Publication of CN115083422A publication Critical patent/CN115083422A/en
Application granted granted Critical
Publication of CN115083422B publication Critical patent/CN115083422B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 — Training, enrolment or model building
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present disclosure relates to a voice source tracing and forensics method, apparatus, device and storage medium. The method comprises: extracting at least two different acoustic features of the speech to be tested; fusing the extracted acoustic features to obtain a first fused acoustic feature; based on a pre-trained voice tracing forensics model, extracting frame-level algorithm fingerprint features from the first fused acoustic feature, performing pooled averaging on the frame-level algorithm fingerprint features, computing segment-level algorithm fingerprint features from the weighted average vector and weighted standard deviation vector obtained by the pooled averaging, and predicting the generation algorithm of the speech to be tested from the segment-level algorithm fingerprint features; and taking the predicted generation algorithm of the speech to be tested as the voice tracing and forensics result. By extracting algorithm fingerprints, the authenticity of the audio can be judged and, further, the source can be traced and evidenced to obtain the generation source of fake audio.

Description

Voice traceability evidence obtaining method and device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech source tracing forensics.
Background
At present, speech generation technology is maturing rapidly and the quality of generated speech keeps improving; under specific conditions, generated speech is comparable to real speech. If generated speech is misused and spread, it can threaten media credibility and social stability, so fake generated speech is highly harmful to society. In practical application scenarios such as public security or the courts, not only is the authenticity of the audio itself of concern; if the audio is fake, its generation source also needs to be known. Therefore, many real-world scenarios raise requirements for evidence about audio authenticity, but a model that merely detects generated audio cannot meet the requirement of identifying the generation source of fake audio.
Disclosure of Invention
In order to solve the technical problems or at least partially solve the technical problems, embodiments of the present disclosure provide a method and an apparatus for voice source tracing and forensics, a device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a voice source tracing forensics method, including:
extracting at least two different acoustic characteristics of the voice to be tested;
fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice source tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to a feature weighted average vector and a weighted standard deviation vector obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
and taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
In a possible implementation manner, the voice traceability evidence obtaining model comprises a frame-level algorithm fingerprint extractor, an attention statistics pooling layer and a segment-level algorithm fingerprint extractor which are connected in sequence, the frame-level algorithm fingerprint extractor comprises a plurality of self-attention layers which are connected with each other, the attention statistics pooling layer comprises an attention model and an attention statistics pooling network layer which are connected in sequence, and the segment-level algorithm fingerprint extractor comprises a full-connection layer and a classification layer which are connected in sequence.
In one possible implementation, the voice traceability forensic model is trained by the following steps:
extracting at least two different acoustic features of the known speech;
fusing at least two different acoustic features of the extracted known voice to obtain a second fused acoustic feature;
inputting the second fusion acoustic feature into an algorithm fingerprint extractor at the frame level of the voice traceability evidence-obtaining model before training, and outputting the algorithm fingerprint feature at the frame level of the known voice;
inputting the algorithm fingerprint characteristics of the frame level of the known voice into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level;
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into an algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level of the known voice and a corresponding generation algorithm;
taking the algorithm output by the algorithm fingerprint extractor at the segment level as a prediction result of a generation algorithm of the known voice;
calculating a loss function value through a preset loss function based on the predicted generation algorithm of the known speech and its actual generation algorithm, and adjusting the weight parameters of the frame-level algorithm fingerprint extractor, the attention statistics pooling layer and the segment-level algorithm fingerprint extractor according to the loss function value until the loss function value meets a preset condition,
wherein the loss function is:
L = α·L_softmax + β·L_triplet

L_softmax = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y(i, j)·log p(i, j), with p(i, j) = exp(l(i, j)) / Σ_{k=1}^{C} exp(l(i, k))

L_triplet = max( d(q, p) − d(q, n) + margin, 0 )

wherein N is the number of training samples, p(i, j) is the predicted probability of the ith sample for the jth class, l(i, j) is the model's output for the ith sample on the jth class, y(i, j) equals 1 if the ith sample belongs to the jth class and 0 otherwise, C is the number of classes, q is the current input feature vector, p is a feature vector of the same class as q, n is a feature vector of a different class from q, margin is a constant greater than 0, d(·) is a distance function, d(q, p) is the distance between q and p, d(q, n) is the distance between q and n, and α and β are both constants between 0 and 1.
In one possible implementation, the inputting the frame-level algorithmic fingerprint features of the known speech to the attention statistics pooling layer and outputting a weighted average vector and a weighted standard deviation vector of the frame-level algorithmic fingerprint features comprises:
based on the attention model, obtaining a normalized weight score according to the algorithm fingerprint characteristics of the frame level of the known voice;
and calculating a weighted average vector and a weighted standard deviation vector from the weight scores and the frame-level algorithm fingerprint features of the known speech.
In one possible embodiment, the segment-level algorithm fingerprint extractor includes a plurality of fully-connected layers and a classification layer, and the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level are input into the segment-level algorithm fingerprint extractor, and the segment-level algorithm fingerprint features of the known speech and the corresponding generating algorithm thereof are output, including:
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level into a plurality of full connection layers of the algorithm fingerprint extractor at the segment level to obtain the algorithm fingerprint features at the segment level of the known voice;
inputting the algorithm fingerprint features of the known voice segment level into the classification layer of the algorithm fingerprint extractor of the segment level to obtain a generation algorithm corresponding to the algorithm fingerprint features of the segment level.
In one possible implementation, before the inputting the segment-level algorithmic fingerprint features of the known speech into the classification layer of the segment-level algorithmic fingerprint extractor, the method further comprises:
whitening, length-normalizing, and reducing the dimensionality of the segment-level algorithm fingerprint features of the known speech by using linear discriminant analysis, so that the processed segment-level algorithm fingerprint features are input into the classification layer of the segment-level algorithm fingerprint extractor.
In one possible embodiment, after the training of the speech traceability forensic model is completed, the method further includes:
for a given test data set, calculating the accuracy, the precision, the recall and an F-value of the voice tracing forensics model, wherein the accuracy is the ratio of the number of correctly classified samples to the total number of samples in the test data set, the precision is the proportion of the samples predicted to be positive that are actually positive, the recall is the proportion of the actual positive samples that are correctly predicted, and the F-value is the harmonic mean of the precision and the recall;
judging whether the accuracy, the precision, the recall rate and the F-value respectively meet preset requirements:
when the accuracy, the precision, the recall rate and the F-value all meet preset requirements, the voice source tracing and forensics model is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features.
In a second aspect, an embodiment of the present disclosure provides a voice traceability forensics apparatus, including:
the extraction module is used for extracting at least two different acoustic characteristics of the voice to be tested;
the fusion module is used for fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
the prediction module is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
and the evidence obtaining module is used for taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the voice source tracing evidence obtaining method when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the above-mentioned voice traceability evidence-obtaining method.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure at least has part or all of the following advantages:
the voice source tracing and forensics method comprises: extracting at least two different acoustic features of the speech to be tested; fusing the extracted acoustic features to obtain a first fused acoustic feature; based on a pre-trained voice tracing forensics model, extracting frame-level algorithm fingerprint features from the first fused acoustic feature, performing pooled averaging on the frame-level algorithm fingerprint features, computing segment-level algorithm fingerprint features from the weighted average vector and weighted standard deviation vector obtained by the pooled averaging, and predicting the generation algorithm of the speech to be tested from the segment-level algorithm fingerprint features; and taking the predicted generation algorithm of the speech to be tested as the voice tracing and forensics result. By extracting algorithm fingerprints, the authenticity of the audio can be judged, the source can further be traced and evidenced to obtain the generation source of fake audio, and the demand of real scenes for evidence about fake audio is met.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly introduced below; it is obvious that other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
FIG. 1 schematically illustrates a flow chart of a voice traceability forensics method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a structural diagram of a speech traceability forensics model according to an embodiment of the present disclosure;
fig. 3 schematically illustrates a block diagram of a voice traceability forensics apparatus according to an embodiment of the present disclosure; and
fig. 4 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The overall idea of the present disclosure is as follows: in the training stage, the collected generated-speech data set is preprocessed and labeled according to its generation algorithm or source, and the acoustic features of this training set are input into the voice tracing forensics model for training, yielding a trained voice tracing forensics model; in the testing stage, the acoustic features extracted from the generated-speech test set are input into the tracing model, which then determines the generation algorithm or source type, wherein the generated-speech data set comes from speech produced by speech synthesis or by voice conversion.
Referring to fig. 1, an embodiment of the present disclosure provides a voice traceability forensics method, including the following steps:
s1, extracting at least two different acoustic characteristics of the voice to be tested;
in practical application, the at least two different acoustic features are selected from mel cepstral coefficients, linear frequency cepstral coefficients, linear prediction coefficients, constant-Q transform cepstral coefficients, the log spectrum, the magnitude spectrum, line spectrum pair parameters and phoneme duration.
In practical applications, the at least two different acoustic features may be mel cepstral coefficients and line spectrum pairs. The mel cepstral coefficients are extracted as follows: the speech is first pre-emphasized, framed and windowed; a fast Fourier transform (FFT) is applied to obtain the corresponding spectrum; the spectrum is passed through a mel filter bank and the logarithm is taken; and finally a discrete cosine transform is applied to obtain the mel cepstral coefficient acoustic features. The line spectrum pairs are obtained by line spectrum pair analysis, a method that represents the spectral characteristics of the speech signal by the distribution density of p discrete frequencies ω and θ. An illustrative sketch covering this extraction and the fusion of the following step S2 is given below.
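As an illustrative sketch only (not the reference implementation of this disclosure), the two acoustic features could be extracted and fused with librosa as follows; the frame parameters are assumptions, and per-frame linear prediction coefficients are used here as a stand-in for line spectrum pairs, which librosa does not compute directly:

```python
import librosa
import numpy as np

def extract_fused_features(wav_path, sr=16000, n_mfcc=20, lpc_order=12,
                           frame_length=400, hop_length=160):
    """Extract MFCCs and per-frame LPC coefficients and fuse them (illustrative only)."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                 # pre-emphasis

    # Mel cepstral coefficients: framing/windowing, FFT, mel filter bank, log, DCT
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=hop_length)   # (n_mfcc, T)

    # Per-frame linear prediction coefficients as a stand-in for line spectrum pairs
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    lpc = np.stack([librosa.lpc(f * np.hanning(frame_length), order=lpc_order)[1:]
                    for f in frames.T], axis=1)                     # (lpc_order, T')

    # Fuse by concatenating along the feature axis over the common number of frames
    t = min(mfcc.shape[1], lpc.shape[1])
    return np.concatenate([mfcc[:, :t], lpc[:, :t]], axis=0).T     # (T, n_mfcc + lpc_order)
```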
S2, fusing at least two different acoustic features of the extracted voice to be tested to obtain a first fused acoustic feature;
s3, extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
in practical applications, algorithm fingerprint refers to a method or feature for distinguishing algorithms, and a speech generation algorithm refers to an algorithm for speech synthesis and speech conversion.
Referring to fig. 2, the voice traceability evidence obtaining model includes a frame-level algorithm fingerprint extractor, an attention statistics pooling layer and a segment-level algorithm fingerprint extractor, which are connected in sequence, the frame-level algorithm fingerprint extractor includes a plurality of self-attention layers connected to each other, the attention statistics pooling layer includes an attention model and an attention statistics pooling network layer, which are connected in sequence, and the segment-level algorithm fingerprint extractor includes a full-connection layer and a classification layer, which are connected in sequence.
Taking mel cepstral coefficients and line spectrum pairs as the at least two different acoustic features, the voice tracing forensics method of the present disclosure is explained as follows.
First, referring to FIG. 2, the mel cepstral coefficients and the line spectrum pairs are fused to obtain the fused feature x, which is input into the frame-level algorithm fingerprint extractor. The frame-level algorithm fingerprint extractor comprises five self-attention layers; after operating on the generated speech frames it yields the frame-level algorithm fingerprint features h_t.
Then, the attention statistics pooling layer, which comprises an attention model and an attention statistics pooling network layer, normalizes the frame-level features with the attention model to obtain the normalized weight scores w_t. The weight scores w_t and the frame-level algorithm fingerprint features h_t are input into the attention statistics pooling network layer, which computes the weighted average vector μ and the weighted standard deviation vector σ of the frame-level algorithm fingerprint features.
Finally, the weighted average vector μ and the weighted standard deviation vector σ of the frame-level algorithm fingerprint features are input into the segment-level algorithm fingerprint extractor, which comprises a plurality of fully connected layers connected to each other; the last fully connected layer outputs the segment-level algorithm fingerprint embedding e.
The output layer connected to the last fully connected layer is a softmax layer, each output node of which corresponds to an algorithm ID.
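A minimal PyTorch sketch of the model in FIG. 2 is given below; the layer widths, number of attention heads, number of fully connected layers, and number of algorithm classes are assumptions for illustration and are not fixed by this disclosure:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention model + statistics pooling: weighted mean and weighted std over frames."""
    def __init__(self, dim):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                                   # h: (batch, frames, dim)
        w = torch.softmax(self.attention(h), dim=1)         # normalized weight scores w_t
        mu = torch.sum(w * h, dim=1)                        # weighted average vector
        sigma = torch.sqrt((torch.sum(w * h * h, dim=1) - mu ** 2).clamp(min=1e-9))
        return torch.cat([mu, sigma], dim=1)                # (batch, 2 * dim)

class VoiceTracingModel(nn.Module):
    def __init__(self, feat_dim=32, model_dim=256, n_heads=4, n_layers=5,
                 embed_dim=192, n_algorithms=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        # Frame-level algorithm fingerprint extractor: stacked self-attention layers
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=n_heads, batch_first=True)
        self.frame_extractor = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pooling = AttentiveStatsPooling(model_dim)
        # Segment-level algorithm fingerprint extractor: fully connected layers + classifier
        self.segment_extractor = nn.Sequential(nn.Linear(2 * model_dim, embed_dim),
                                               nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.classifier = nn.Linear(embed_dim, n_algorithms)  # softmax layer, one node per algorithm ID

    def forward(self, x):                                    # x: (batch, frames, feat_dim) fused features
        h = self.frame_extractor(self.proj(x))               # frame-level fingerprint features
        stats = self.pooling(h)                               # weighted mean and std
        embedding = self.segment_extractor(stats)             # segment-level algorithm fingerprint
        return embedding, self.classifier(embedding)          # logits over generation algorithms
```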
The algorithm ID can be defined at different levels. It can distinguish among speech synthesis application vendors (such as Baidu, Aliyun, iFlytek, Sogou, Biaobei Technology and other vendors), or among categories of speech generation methods (such as audio tampering, waveform splicing, manual imitation, speech synthesis and voice conversion). Among these, speech synthesis and voice conversion algorithms are the most common, including speech synthesis and voice conversion algorithms based on the STRAIGHT, WORLD, LPCNet, WaveNet, WaveRNN, HiFi-GAN, PWG, MelGAN, StyleGAN and similar vocoders. All of the above types can be traced and evidenced by extracting the model's algorithm fingerprint.
In practical application, extracting the frame-level algorithm fingerprint features from the first fused acoustic feature, performing pooled averaging on the frame-level algorithm fingerprint features, calculating the segment-level algorithm fingerprint features from the weighted average vector and weighted standard deviation vector obtained by the pooled averaging, and predicting the generation algorithm of the speech to be tested based on the segment-level algorithm fingerprint features comprises the following steps:
inputting the first fusion acoustic feature into a frame-level algorithm fingerprint extractor of a pre-trained voice traceability evidence-obtaining model, and outputting a frame-level algorithm fingerprint feature;
inputting the algorithm fingerprint characteristics of the frame level into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level;
and inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into the algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level and a corresponding generation algorithm thereof.
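Continuing the sketches above, steps S3–S4 could then be run as follows; the checkpoint file name is hypothetical:

```python
model = VoiceTracingModel()
model.load_state_dict(torch.load("tracing_model.pt"))    # hypothetical trained checkpoint
model.eval()

features = torch.from_numpy(extract_fused_features("test.wav")).float().unsqueeze(0)
with torch.no_grad():
    embedding, logits = model(features)
algorithm_id = logits.argmax(dim=1).item()                # predicted generation algorithm ID
```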
And S4, taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
In practical applications, tracing and forensics refer to verifying the generation source or generation method of the generated speech.
In this embodiment, in step S3, the speech source tracing evidence obtaining model is obtained by training:
extracting at least two different acoustic features of the known voice, wherein the known voice is a plurality of groups of generated voice;
fusing at least two different acoustic features of the extracted known voice to obtain a second fused acoustic feature;
inputting the second fusion acoustic feature into an algorithm fingerprint extractor at the frame level of the voice traceability evidence-obtaining model before training, and outputting the algorithm fingerprint feature at the frame level of the known voice;
inputting the algorithm fingerprint characteristics of the frame level of the known voice into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level;
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into an algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level of the known voice and a corresponding generation algorithm;
taking the algorithm output by the algorithm fingerprint extractor at the segment level as a prediction result of a generation algorithm of the known voice;
calculating a loss function value through a preset loss function based on the predicted generation algorithm of the known speech and its actual generation algorithm, and adjusting the weight parameters of the frame-level algorithm fingerprint extractor, the attention statistics pooling layer and the segment-level algorithm fingerprint extractor according to the loss function value until the loss function value meets a preset condition,
wherein the loss function is:
L = α·L_softmax + β·L_triplet

L_softmax = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y(i, j)·log p(i, j), with p(i, j) = exp(l(i, j)) / Σ_{k=1}^{C} exp(l(i, k))

L_triplet = max( d(q, p) − d(q, n) + margin, 0 )

wherein N is the number of training samples, p(i, j) is the predicted probability of the ith sample for the jth class, l(i, j) is the model's output for the ith sample on the jth class, y(i, j) equals 1 if the ith sample belongs to the jth class and 0 otherwise, C is the number of classes, q is the current input feature vector, p is a feature vector of the same class as q, n is a feature vector of a different class from q, margin is a constant greater than 0, d(·) is a distance function, d(q, p) is the distance between q and p, d(q, n) is the distance between q and n, and α and β are both constants between 0 and 1.
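A sketch of this combined loss in PyTorch is given below; the values of α, β and margin are placeholders, and the construction of anchor/positive/negative triplets is left to the data pipeline:

```python
import torch
import torch.nn.functional as F

def tracing_loss(logits, labels, q, p, n, alpha=0.5, beta=0.5, margin=0.2):
    """L = alpha * L_softmax + beta * L_triplet."""
    # Softmax cross-entropy over the C generation-algorithm classes
    l_softmax = F.cross_entropy(logits, labels)
    # Triplet loss: pull same-algorithm embeddings together, push different ones apart
    d_qp = torch.norm(q - p, dim=1)                       # d(q, p)
    d_qn = torch.norm(q - n, dim=1)                       # d(q, n)
    l_triplet = torch.clamp(d_qp - d_qn + margin, min=0).mean()
    return alpha * l_softmax + beta * l_triplet
```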
In the training process of the voice tracing forensics model, the model is trained for a total of 100 epochs with an adaptive moment estimation (Adam) optimizer; the initial learning rate is set to 0.001 and decays linearly, and the batch size is 256.
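Under the hyperparameters stated above, a training loop could be set up as follows, continuing the sketches above; the train_dataset object yielding triplet batches is assumed:

```python
from torch.utils.data import DataLoader
import torch

epochs = 100
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)      # adaptive moment estimation
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)    # linear decay of the learning rate
loader = DataLoader(train_dataset, batch_size=256, shuffle=True)  # train_dataset is assumed

for epoch in range(epochs):
    for anchor_x, pos_x, neg_x, labels in loader:               # triplet batches are assumed
        q, logits = model(anchor_x)
        p, _ = model(pos_x)
        n, _ = model(neg_x)
        loss = tracing_loss(logits, labels, q, p, n)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```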
In this embodiment, the inputting the frame-level algorithmic fingerprint features of the known speech to the attention statistics pooling layer and outputting the weighted average vector and the weighted standard deviation vector of the frame-level algorithmic fingerprint features includes:
based on the attention model, obtaining a normalized weight score according to the algorithm fingerprint characteristics of the frame level of the known voice;
and calculating a weighted average vector and a weighted standard deviation vector from the weight scores and the frame-level algorithm fingerprint features of the known speech.
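For illustration, in the common formulation of attentive statistics pooling (consistent with the description above, although the exact scoring function is not specified here), with frame-level fingerprint features h_t and normalized weight scores w_t for frames t = 1, …, T:

μ = Σ_{t=1}^{T} w_t·h_t

σ = sqrt( Σ_{t=1}^{T} w_t·(h_t ⊙ h_t) − μ ⊙ μ )

where ⊙ denotes element-wise multiplication, μ is the weighted average vector and σ is the weighted standard deviation vector.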
In this embodiment, the segment-level algorithm fingerprint extractor includes a plurality of full-link layers and a classification layer, and inputs the weighted average vector and the weighted standard deviation vector of the frame-level algorithm fingerprint features into the segment-level algorithm fingerprint extractor, and outputs the segment-level algorithm fingerprint features of the known speech and the corresponding generation algorithm thereof, including:
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level into a plurality of full connection layers of the algorithm fingerprint extractor at the segment level to obtain the algorithm fingerprint features at the segment level of the known voice;
inputting the algorithm fingerprint features of the known voice segment level into the classification layer of the algorithm fingerprint extractor of the segment level to obtain a generation algorithm corresponding to the algorithm fingerprint features of the segment level.
In this embodiment, before the inputting the segment-level algorithmic fingerprint features of the known speech into the classification layer of the segment-level algorithmic fingerprint extractor, the method further comprises:
The segment-level algorithm fingerprint features of the known speech are whitened, length-normalized, and reduced in dimensionality using linear discriminant analysis, and the processed segment-level algorithm fingerprint features are then input into the classification layer of the segment-level algorithm fingerprint extractor. Here, the processed segment-level algorithm fingerprint features may be stored to facilitate extending to new algorithm fingerprint types without retraining a new algorithm fingerprint extractor, so that the classes learned by the voice tracing forensics model include not only the labeled generation algorithm or generation source classes but also unknown source classes to be accommodated.
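A sketch of this back-end processing with scikit-learn is shown below; whitening is implemented here with PCA, which is one common choice, and the number of retained dimensions is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_backend(segment_embeddings, labels, n_components=None):
    """Whiten, length-normalize, and reduce dimensionality of segment-level fingerprints."""
    whitener = PCA(whiten=True).fit(segment_embeddings)
    x = whitener.transform(segment_embeddings)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)           # length normalization
    lda = LinearDiscriminantAnalysis(n_components=n_components).fit(x, labels)
    return whitener, lda, lda.transform(x)                      # processed features for the classifier
```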
In this embodiment, after the training of the speech source tracing forensic model is completed, the method further includes:
for a given test data set, calculating the accuracy, the precision, the recall and an F-value of the voice tracing forensics model, wherein the accuracy is the ratio of the number of correctly classified samples to the total number of samples in the test data set, the precision is the proportion of the samples predicted to be positive that are actually positive, the recall is the proportion of the actual positive samples that are correctly predicted, and the F-value is the harmonic mean of the precision and the recall;
judging whether the accuracy, the precision, the recall rate and the F-value respectively meet preset requirements:
when the accuracy, the precision, the recall rate and the F-value all meet preset requirements, the voice source tracing and forensics model is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features.
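For a multi-class test set these four metrics can be computed with scikit-learn as sketched below; macro averaging is an assumption, since the averaging scheme is not specified here:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),                    # correct / total
        "precision": precision_score(y_true, y_pred, average="macro"),  # of predicted positives, how many are correct
        "recall":    recall_score(y_true, y_pred, average="macro"),     # of actual positives, how many are found
        "f_score":   f1_score(y_true, y_pred, average="macro"),         # harmonic mean of precision and recall
    }
```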
By using algorithm fingerprints, the present disclosure can not only judge the authenticity of audio but also further trace its source and obtain evidence, identifying the generation source of fake audio. This meets the demand for evidence about fake audio in many real-world scenarios such as judicial forensics, and can also promote further development of the audio detection field.
The method and the device realize tracing and forensics of generated speech by associating model characteristics with algorithm fingerprints and then classifying and distinguishing the algorithms that generated the speech.
The disclosed method can use the fingerprint of the generation algorithm to trace the source of generated speech and obtain evidence, filling the current gap in algorithm-level tracing in the audio field, with high accuracy, strong robustness and a degree of generality. With the development of deep learning, speech produced by speech synthesis and voice conversion has become increasingly lifelike. The method goes beyond the current real-versus-fake judgment task to further trace the source and obtain evidence. The tracing results can reach a useful level of accuracy, and the forensic analysis can in turn be exploited by voice anti-spoofing models, promoting their technical progress and development. Combining detection with tracing and forensics also improves the accuracy of the overall task. The judicial, news and entertainment fields currently have a strong demand for voice tracing and forensics technology, and spoofing-detection and tracing techniques face the challenges of real scenes. In real environments there are many influencing factors such as environmental noise, channels and speakers; the disclosed method can overcome these environmental influences while maintaining good tracing accuracy and high robustness. Research on generated speech is currently receiving wide attention in both industry and academia, and the generation algorithms of researchers and companies are diverse. The disclosed method provides a general capability for tracing generation algorithms, achieves high coverage of generated-speech algorithms, can adapt quickly to more unknown deep-forgery technologies in the future, and has good generality.
The present disclosure emphasizes tracing research at the level of the generation algorithm; it can attribute not only the generation algorithm but also the generation technology of different companies, providing tracing and forensics of generated speech based on algorithm fingerprints.
Referring to fig. 3, an embodiment of the present disclosure provides a voice traceability forensics apparatus, including:
an extraction module 11, configured to extract at least two different acoustic features of a voice to be tested;
the fusion module 12 is configured to fuse at least two different acoustic features of the extracted speech to be tested to obtain a first fused acoustic feature;
the prediction module 13 is configured to extract frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained speech tracing forensics model, perform pooling averaging on the frame-level algorithm fingerprint features, calculate segment-level algorithm fingerprint features according to a feature weighted average vector and a weighted standard deviation vector obtained by the pooling averaging, and predict a generation algorithm of the speech to be tested based on the segment-level algorithm fingerprint features;
and the evidence obtaining module 14 is used for taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
In the second embodiment, any plurality of the extraction module 11, the fusion module 12, the prediction module 13, and the forensics module 14 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. At least one of the extraction module 11, the fusion module 12, the prediction module 13, and the forensics module 14 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the extraction module 11, the fusion module 12, the prediction module 13 and the forensics module 14 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
Referring to fig. 4, an electronic device provided by a fourth exemplary embodiment of the present disclosure includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110 is configured to implement the following voice source tracing evidence obtaining method when executing the program stored in the memory 1130:
extracting at least two different acoustic characteristics of the voice to be tested;
fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested by the segment-level algorithm fingerprint features;
and taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
The communication bus 1140 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The Memory 1130 may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory 1130 may also be at least one memory device located remotely from the processor 1110.
The Processor 1110 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer readable storage medium stores thereon a computer program, which when executed by a processor implements the voice traceability evidence-obtaining method as described above.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer readable storage medium carries one or more programs which, when executed, implement the voice traceability forensics method according to the embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A voice traceability evidence obtaining method is characterized by comprising the following steps:
extracting at least two different acoustic characteristics of the voice to be tested;
fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested by the segment-level algorithm fingerprint features;
and taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
2. The method according to claim 1, wherein the speech provenance forensics model comprises a frame-level algorithmic fingerprint extractor, an attention statistics pooling layer and a segment-level algorithmic fingerprint extractor which are connected in sequence, wherein the frame-level algorithmic fingerprint extractor comprises a plurality of self-attention layers which are connected with each other, the attention statistics pooling layer comprises an attention model and an attention statistics pooling network layer which are connected in sequence, and the segment-level algorithmic fingerprint extractor comprises a full-connection layer and a classification layer which are connected in sequence.
3. The method of claim 2, wherein the speech traceability forensic model is trained by:
extracting at least two different acoustic features of the known speech;
fusing at least two different acoustic features of the extracted known voice to obtain a second fused acoustic feature;
inputting the second fusion acoustic feature into an algorithm fingerprint extractor at the frame level of the voice traceability evidence-obtaining model before training, and outputting the algorithm fingerprint feature at the frame level of the known voice;
inputting the algorithm fingerprint characteristics of the known voice at the frame level into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics at the frame level;
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into an algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level of the known voice and a corresponding generation algorithm;
taking the algorithm output by the algorithm fingerprint extractor at the segment level as a prediction result of a generation algorithm of the known voice;
calculating a loss function value through a preset loss function based on a prediction result and an actual generation algorithm of a known voice generation algorithm, adjusting weight parameters of an algorithm fingerprint extractor at a frame level, an attention statistics pooling layer and an algorithm fingerprint extractor at a segment level according to the loss function value until the loss function value meets a preset condition,
wherein the loss function is:
L = α·L_softmax + β·L_triplet

L_softmax = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y(i, j)·log p(i, j), with p(i, j) = exp(l(i, j)) / Σ_{k=1}^{C} exp(l(i, k))

L_triplet = max( d(q, p) − d(q, n) + margin, 0 )

wherein N is the number of training samples, p(i, j) is the predicted probability of the ith sample for the jth class, l(i, j) is the model's output for the ith sample on the jth class, y(i, j) equals 1 if the ith sample belongs to the jth class and 0 otherwise, C is the number of classes, q is the current input feature vector, p is a feature vector of the same class as q, n is a feature vector of a different class from q, margin is a constant greater than 0, d(·) is a distance function, d(q, p) is the distance between q and p, d(q, n) is the distance between q and n, and α and β are both constants between 0 and 1.
4. The method of claim 3, wherein inputting the frame-level algorithmic fingerprint features of the known speech to the attention statistics pooling layer and outputting a weighted average vector and a weighted standard deviation vector of the frame-level algorithmic fingerprint features comprises:
based on the attention model, obtaining a normalized weight score according to the algorithm fingerprint characteristics of the frame level of the known voice;
and calculating to obtain a weighted average vector and a weighted standard deviation vector according to the weight fraction and the algorithm fingerprint characteristics of the frame level of the known voice.
5. The method of claim 3, wherein the segment-level algorithmic fingerprint extractor comprises a plurality of fully connected layers and a classification layer, and the weighted average vector and the weighted standard deviation vector of the frame-level algorithmic fingerprint features are input into the segment-level algorithmic fingerprint extractor, and the segment-level algorithmic fingerprint features of the known speech and the corresponding generating algorithm are output, comprising:
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level into a plurality of full connection layers of the algorithm fingerprint extractor at the segment level to obtain the algorithm fingerprint features at the segment level of the known voice;
inputting the algorithm fingerprint features of the known voice segment level into the classification layer of the algorithm fingerprint extractor of the segment level to obtain a generation algorithm corresponding to the algorithm fingerprint features of the segment level.
6. The method of claim 5, wherein prior to the inputting the segment-level algorithmic fingerprint features of the known speech into the classification layer of the segment-level algorithmic fingerprint extractor, the method further comprises:
whitening, length normalization and processing for reducing dimensionality of the algorithm fingerprint characteristics of the known speech at the segment level by using a linear discriminant analysis method so as to input the processed algorithm fingerprint characteristics at the segment level into a classification layer of an algorithm fingerprint extractor at the segment level.
7. The method of claim 3, wherein after the training of the speech traceability forensic model is completed, the method further comprises:
for a given test data set, calculating the accuracy, the precision, the recall rate and an F-value of the voice traceability evidence-obtaining model, wherein the accuracy is the ratio of the number of correctly classified samples to the total number of samples in the test data set, the precision is the number of actual positive samples in samples predicted to be positive in the test data set, the recall rate is the number of correctly predicted samples in actual positive samples of the test data set, and the F-value is the harmonic average value of the precision and the recall rate;
judging whether the accuracy, the precision, the recall rate and the F-value respectively meet preset requirements:
when the accuracy, the precision, the recall rate and the F-value all meet preset requirements, the voice traceability evidence obtaining model is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features.
8. A voice traceability evidence obtaining device is characterized by comprising:
the extraction module is used for extracting at least two different acoustic characteristics of the voice to be tested;
the fusion module is used for fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
the prediction module is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
and the evidence obtaining module is used for taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the voice traceability evidence-obtaining method of any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the voice traceability forensics method of any one of claims 1 to 7.
CN202210859678.XA 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium Active CN115083422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210859678.XA CN115083422B (en) 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210859678.XA CN115083422B (en) 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115083422A true CN115083422A (en) 2022-09-20
CN115083422B CN115083422B (en) 2022-11-15

Family

ID=83242760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210859678.XA Active CN115083422B (en) 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115083422B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153337A (en) * 2023-04-20 2023-05-23 北京中电慧声科技有限公司 Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN118016051A (en) * 2024-04-07 2024-05-10 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device
CN118298809A (en) * 2024-04-10 2024-07-05 中国人民解放军陆军工程大学 Open world fake voice attribution method and system based on soft comparison fake learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN111276131A (en) * 2020-01-22 2020-06-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN114420100A (en) * 2022-03-30 2022-04-29 中国科学院自动化研究所 Voice detection method and device, electronic equipment and storage medium
CN114420135A (en) * 2021-12-10 2022-04-29 江苏清微智能科技有限公司 Attention mechanism-based voiceprint recognition method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN111276131A (en) * 2020-01-22 2020-06-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN114420135A (en) * 2021-12-10 2022-04-29 江苏清微智能科技有限公司 Attention mechanism-based voiceprint recognition method and device
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN114420100A (en) * 2022-03-30 2022-04-29 中国科学院自动化研究所 Voice detection method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153337A (en) * 2023-04-20 2023-05-23 北京中电慧声科技有限公司 Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN118016051A (en) * 2024-04-07 2024-05-10 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device
CN118016051B (en) * 2024-04-07 2024-07-19 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device
CN118298809A (en) * 2024-04-10 2024-07-05 中国人民解放军陆军工程大学 Open world fake voice attribution method and system based on soft comparison fake learning

Also Published As

Publication number Publication date
CN115083422B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN115083422B (en) Voice traceability evidence obtaining method and device, equipment and storage medium
Kong et al. Weakly labelled audioset tagging with attention neural networks
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN107610707A (en) A kind of method for recognizing sound-groove and device
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN109408660B (en) Music automatic classification method based on audio features
US20110093427A1 (en) System and method for tagging signals of interest in time variant data
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
CN111192601A (en) Music labeling method and device, electronic equipment and medium
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN110120230A (en) A kind of acoustic events detection method and device
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
Gomez-Alanis et al. Adversarial transformation of spoofing attacks for voice biometrics
CN112885330A (en) Language identification method and system based on low-resource audio
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
CN115394318A (en) Audio detection method and device
Mandalapu et al. Multilingual voice impersonation dataset and evaluation
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition
Tsai et al. Bird species identification based on timbre and pitch features
Patil et al. Content-based audio classification and retrieval: A novel approach
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant