CN115083422A - Voice traceability evidence obtaining method and device, equipment and storage medium - Google Patents

Voice traceability evidence obtaining method and device, equipment and storage medium

Info

Publication number
CN115083422A
Authority
CN
China
Prior art keywords
level
algorithm
voice
fingerprint
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210859678.XA
Other languages
Chinese (zh)
Other versions
CN115083422B (en)
Inventor
陶建华
晏鑫蕊
易江燕
张震
李鹏
石瑾
王立强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, National Computer Network and Information Security Management Center filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210859678.XA priority Critical patent/CN115083422B/en
Publication of CN115083422A publication Critical patent/CN115083422A/en
Application granted granted Critical
Publication of CN115083422B publication Critical patent/CN115083422B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 — Training, enrolment or model building
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present disclosure relates to a voice source tracing and forensics method, apparatus, device and storage medium. The method comprises: extracting at least two different acoustic features of the speech to be tested; fusing the extracted acoustic features to obtain a first fused acoustic feature; based on a pre-trained voice tracing forensics model, extracting frame-level algorithm fingerprint features from the first fused acoustic feature, performing pooled averaging on the frame-level algorithm fingerprint features, computing segment-level algorithm fingerprint features from the weighted average vector and weighted standard deviation vector obtained by the pooled averaging, and predicting the generation algorithm of the speech to be tested from the segment-level algorithm fingerprint features; and taking the predicted generation algorithm of the speech to be tested as the voice tracing and forensics result. By extracting algorithm fingerprints, the authenticity of the audio can be judged and, further, the source can be traced and evidenced to obtain the generation source of fake audio.

Description

Voice traceability evidence obtaining method and device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech source tracing forensics.
Background
At present, speech generation technology is maturing rapidly and the quality of generated speech keeps improving; under specific conditions, generated speech is comparable to real speech. If generated speech is misused and spread, it can threaten media credibility and social stability, so fake generated speech is highly harmful to society. In practical application scenarios such as public security or the courts, not only is the authenticity of the audio itself of concern; if the audio is fake, its generation source also needs to be known. Therefore, many real-world scenarios raise requirements for evidence about audio authenticity, but a model that merely detects generated audio cannot meet the requirement of identifying the generation source of fake audio.
Disclosure of Invention
In order to solve the technical problems or at least partially solve the technical problems, embodiments of the present disclosure provide a method and an apparatus for voice source tracing and forensics, a device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a voice source tracing forensics method, including:
extracting at least two different acoustic characteristics of the voice to be tested;
fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice source tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to a feature weighted average vector and a weighted standard deviation vector obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
and taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
In a possible implementation manner, the voice traceability evidence obtaining model comprises a frame-level algorithm fingerprint extractor, an attention statistics pooling layer and a segment-level algorithm fingerprint extractor which are connected in sequence, the frame-level algorithm fingerprint extractor comprises a plurality of self-attention layers which are connected with each other, the attention statistics pooling layer comprises an attention model and an attention statistics pooling network layer which are connected in sequence, and the segment-level algorithm fingerprint extractor comprises a full-connection layer and a classification layer which are connected in sequence.
In one possible implementation, the voice traceability forensic model is trained by the following steps:
extracting at least two different acoustic features of the known speech;
fusing at least two different acoustic features of the extracted known voice to obtain a second fused acoustic feature;
inputting the second fusion acoustic feature into an algorithm fingerprint extractor at the frame level of the voice traceability evidence-obtaining model before training, and outputting the algorithm fingerprint feature at the frame level of the known voice;
inputting the algorithm fingerprint characteristics of the frame level of the known voice into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level;
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into an algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level of the known voice and a corresponding generation algorithm;
taking the algorithm output by the algorithm fingerprint extractor at the segment level as a prediction result of a generation algorithm of the known voice;
calculating a loss function value through a preset loss function based on the predicted generation algorithm of the known speech and its actual generation algorithm, and adjusting the weight parameters of the frame-level algorithm fingerprint extractor, the attention statistics pooling layer and the segment-level algorithm fingerprint extractor according to the loss function value until the loss function value meets a preset condition,
wherein the loss function is:
L = α·L_softmax + β·L_triplet

L_softmax = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y(i, j)·log p(i, j), with p(i, j) = exp(l(i, j)) / Σ_{k=1}^{C} exp(l(i, k))

L_triplet = max( d(q, p) − d(q, n) + margin, 0 )

wherein N is the number of training samples, p(i, j) is the predicted probability of the ith sample for the jth class, l(i, j) is the model's output for the ith sample on the jth class, y(i, j) equals 1 if the ith sample belongs to the jth class and 0 otherwise, C is the number of classes, q is the current input feature vector, p is a feature vector of the same class as q, n is a feature vector of a different class from q, margin is a constant greater than 0, d(·) is a distance function, d(q, p) is the distance between q and p, d(q, n) is the distance between q and n, and α and β are both constants between 0 and 1.
In one possible implementation, the inputting the frame-level algorithmic fingerprint features of the known speech to the attention statistics pooling layer and outputting a weighted average vector and a weighted standard deviation vector of the frame-level algorithmic fingerprint features comprises:
based on the attention model, obtaining a normalized weight score according to the algorithm fingerprint characteristics of the frame level of the known voice;
and calculating a weighted average vector and a weighted standard deviation vector from the weight scores and the frame-level algorithm fingerprint features of the known speech.
In one possible embodiment, the segment-level algorithm fingerprint extractor includes a plurality of fully-connected layers and a classification layer, and the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level are input into the segment-level algorithm fingerprint extractor, and the segment-level algorithm fingerprint features of the known speech and the corresponding generating algorithm thereof are output, including:
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level into a plurality of full connection layers of the algorithm fingerprint extractor at the segment level to obtain the algorithm fingerprint features at the segment level of the known voice;
inputting the algorithm fingerprint features of the known voice segment level into the classification layer of the algorithm fingerprint extractor of the segment level to obtain a generation algorithm corresponding to the algorithm fingerprint features of the segment level.
In one possible implementation, before the inputting the segment-level algorithmic fingerprint features of the known speech into the classification layer of the segment-level algorithmic fingerprint extractor, the method further comprises:
whitening, length-normalizing, and reducing the dimensionality of the segment-level algorithm fingerprint features of the known speech by using linear discriminant analysis, so that the processed segment-level algorithm fingerprint features are input into the classification layer of the segment-level algorithm fingerprint extractor.
In one possible embodiment, after the training of the speech traceability forensic model is completed, the method further includes:
for a given test data set, calculating the accuracy, the precision, the recall and an F-value of the voice tracing forensics model, wherein the accuracy is the ratio of the number of correctly classified samples to the total number of samples in the test data set, the precision is the proportion of the samples predicted to be positive that are actually positive, the recall is the proportion of the actual positive samples that are correctly predicted, and the F-value is the harmonic mean of the precision and the recall;
judging whether the accuracy, the precision, the recall rate and the F-value respectively meet preset requirements:
when the accuracy, the precision, the recall rate and the F-value all meet preset requirements, the voice source tracing and forensics model is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features.
In a second aspect, an embodiment of the present disclosure provides a voice traceability forensics apparatus, including:
the extraction module is used for extracting at least two different acoustic characteristics of the voice to be tested;
the fusion module is used for fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
the prediction module is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
and the evidence obtaining module is used for taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the voice source tracing evidence obtaining method when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the above-mentioned voice traceability evidence-obtaining method.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure at least has part or all of the following advantages:
the voice source tracing and forensics method comprises: extracting at least two different acoustic features of the speech to be tested; fusing the extracted acoustic features to obtain a first fused acoustic feature; based on a pre-trained voice tracing forensics model, extracting frame-level algorithm fingerprint features from the first fused acoustic feature, performing pooled averaging on the frame-level algorithm fingerprint features, computing segment-level algorithm fingerprint features from the weighted average vector and weighted standard deviation vector obtained by the pooled averaging, and predicting the generation algorithm of the speech to be tested from the segment-level algorithm fingerprint features; and taking the predicted generation algorithm of the speech to be tested as the voice tracing and forensics result. By extracting algorithm fingerprints, the authenticity of the audio can be judged, the source can further be traced and evidenced to obtain the generation source of fake audio, and the demand of real scenes for evidence about fake audio is met.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly introduced below; it is obvious that other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
FIG. 1 schematically illustrates a flow chart of a voice traceability forensics method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a structural diagram of a speech traceability forensics model according to an embodiment of the present disclosure;
fig. 3 schematically illustrates a block diagram of a voice traceability forensics apparatus according to an embodiment of the present disclosure; and
fig. 4 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The overall idea of the present disclosure is as follows: in the training stage, the collected generated-speech data set is preprocessed and labeled according to its generation algorithm or source, and the acoustic features of this training set are input into the voice tracing forensics model for training, yielding a trained voice tracing forensics model; in the testing stage, the acoustic features extracted from the generated-speech test set are input into the tracing model, which then determines the generation algorithm or source type, wherein the generated-speech data set comes from speech produced by speech synthesis or by voice conversion.
Referring to fig. 1, an embodiment of the present disclosure provides a voice traceability forensics method, including the following steps:
s1, extracting at least two different acoustic characteristics of the voice to be tested;
in practical application, the at least two different acoustic features are selected from mel cepstral coefficients, linear frequency cepstral coefficients, linear prediction coefficients, constant-Q transform cepstral coefficients, the log spectrum, the magnitude spectrum, line spectrum pair parameters and phoneme duration.
In practical applications, the at least two different acoustic features may be mel cepstral coefficients and line spectrum pairs. The mel cepstral coefficients are extracted as follows: the speech is first pre-emphasized, framed and windowed; a fast Fourier transform (FFT) is applied to obtain the corresponding spectrum; the spectrum is passed through a mel filter bank and the logarithm is taken; and finally a discrete cosine transform is applied to obtain the mel cepstral coefficient acoustic features. The line spectrum pairs are obtained by line spectrum pair analysis, a method that represents the spectral characteristics of the speech signal by the distribution density of p discrete frequencies ω and θ. An illustrative sketch covering this extraction and the fusion of the following step S2 is given below.
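As an illustrative sketch only (not the reference implementation of this disclosure), the two acoustic features could be extracted and fused with librosa as follows; the frame parameters are assumptions, and per-frame linear prediction coefficients are used here as a stand-in for line spectrum pairs, which librosa does not compute directly:

```python
import librosa
import numpy as np

def extract_fused_features(wav_path, sr=16000, n_mfcc=20, lpc_order=12,
                           frame_length=400, hop_length=160):
    """Extract MFCCs and per-frame LPC coefficients and fuse them (illustrative only)."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                 # pre-emphasis

    # Mel cepstral coefficients: framing/windowing, FFT, mel filter bank, log, DCT
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=hop_length)   # (n_mfcc, T)

    # Per-frame linear prediction coefficients as a stand-in for line spectrum pairs
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    lpc = np.stack([librosa.lpc(f * np.hanning(frame_length), order=lpc_order)[1:]
                    for f in frames.T], axis=1)                     # (lpc_order, T')

    # Fuse by concatenating along the feature axis over the common number of frames
    t = min(mfcc.shape[1], lpc.shape[1])
    return np.concatenate([mfcc[:, :t], lpc[:, :t]], axis=0).T     # (T, n_mfcc + lpc_order)
```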
S2, fusing at least two different acoustic features of the extracted voice to be tested to obtain a first fused acoustic feature;
s3, extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
in practical applications, algorithm fingerprint refers to a method or feature for distinguishing algorithms, and a speech generation algorithm refers to an algorithm for speech synthesis and speech conversion.
Referring to fig. 2, the voice traceability evidence obtaining model includes a frame-level algorithm fingerprint extractor, an attention statistics pooling layer and a segment-level algorithm fingerprint extractor, which are connected in sequence, the frame-level algorithm fingerprint extractor includes a plurality of self-attention layers connected to each other, the attention statistics pooling layer includes an attention model and an attention statistics pooling network layer, which are connected in sequence, and the segment-level algorithm fingerprint extractor includes a full-connection layer and a classification layer, which are connected in sequence.
Taking mel cepstral coefficients and line spectrum pairs as the at least two different acoustic features, the voice tracing forensics method of the present disclosure is explained as follows.
First, referring to FIG. 2, the mel cepstral coefficients and the line spectrum pairs are fused to obtain the fused feature x, which is input into the frame-level algorithm fingerprint extractor. The frame-level algorithm fingerprint extractor comprises five self-attention layers; after operating on the generated speech frames it yields the frame-level algorithm fingerprint features h_t.
Then, the attention statistics pooling layer, which comprises an attention model and an attention statistics pooling network layer, normalizes the frame-level features with the attention model to obtain the normalized weight scores w_t. The weight scores w_t and the frame-level algorithm fingerprint features h_t are input into the attention statistics pooling network layer, which computes the weighted average vector μ and the weighted standard deviation vector σ of the frame-level algorithm fingerprint features.
Finally, the weighted average vector μ and the weighted standard deviation vector σ of the frame-level algorithm fingerprint features are input into the segment-level algorithm fingerprint extractor, which comprises a plurality of fully connected layers connected to each other; the last fully connected layer outputs the segment-level algorithm fingerprint embedding e.
The output layer connected to the last fully connected layer is a softmax layer, each output node of which corresponds to an algorithm ID.
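A minimal PyTorch sketch of the model in FIG. 2 is given below; the layer widths, number of attention heads, number of fully connected layers, and number of algorithm classes are assumptions for illustration and are not fixed by this disclosure:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention model + statistics pooling: weighted mean and weighted std over frames."""
    def __init__(self, dim):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                                   # h: (batch, frames, dim)
        w = torch.softmax(self.attention(h), dim=1)         # normalized weight scores w_t
        mu = torch.sum(w * h, dim=1)                        # weighted average vector
        sigma = torch.sqrt((torch.sum(w * h * h, dim=1) - mu ** 2).clamp(min=1e-9))
        return torch.cat([mu, sigma], dim=1)                # (batch, 2 * dim)

class VoiceTracingModel(nn.Module):
    def __init__(self, feat_dim=32, model_dim=256, n_heads=4, n_layers=5,
                 embed_dim=192, n_algorithms=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        # Frame-level algorithm fingerprint extractor: stacked self-attention layers
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=n_heads, batch_first=True)
        self.frame_extractor = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pooling = AttentiveStatsPooling(model_dim)
        # Segment-level algorithm fingerprint extractor: fully connected layers + classifier
        self.segment_extractor = nn.Sequential(nn.Linear(2 * model_dim, embed_dim),
                                               nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.classifier = nn.Linear(embed_dim, n_algorithms)  # softmax layer, one node per algorithm ID

    def forward(self, x):                                    # x: (batch, frames, feat_dim) fused features
        h = self.frame_extractor(self.proj(x))               # frame-level fingerprint features
        stats = self.pooling(h)                               # weighted mean and std
        embedding = self.segment_extractor(stats)             # segment-level algorithm fingerprint
        return embedding, self.classifier(embedding)          # logits over generation algorithms
```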
The algorithm ID can be defined at different levels. It can distinguish among speech synthesis application vendors (such as Baidu, Aliyun, iFlytek, Sogou, Biaobei Technology and other vendors), or among categories of speech generation methods (such as audio tampering, waveform splicing, manual imitation, speech synthesis and voice conversion). Among these, speech synthesis and voice conversion algorithms are the most common, including speech synthesis and voice conversion algorithms based on the STRAIGHT, WORLD, LPCNet, WaveNet, WaveRNN, HiFi-GAN, PWG, MelGAN, StyleGAN and similar vocoders. All of the above types can be traced and evidenced by extracting the model's algorithm fingerprint.
In practical application, extracting the frame-level algorithm fingerprint features from the first fused acoustic feature, performing pooled averaging on the frame-level algorithm fingerprint features, calculating the segment-level algorithm fingerprint features from the weighted average vector and weighted standard deviation vector obtained by the pooled averaging, and predicting the generation algorithm of the speech to be tested based on the segment-level algorithm fingerprint features comprises the following steps:
inputting the first fusion acoustic feature into a frame-level algorithm fingerprint extractor of a pre-trained voice traceability evidence-obtaining model, and outputting a frame-level algorithm fingerprint feature;
inputting the algorithm fingerprint characteristics of the frame level into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level;
and inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into the algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level and a corresponding generation algorithm thereof.
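Continuing the sketches above, steps S3–S4 could then be run as follows; the checkpoint file name is hypothetical:

```python
model = VoiceTracingModel()
model.load_state_dict(torch.load("tracing_model.pt"))    # hypothetical trained checkpoint
model.eval()

features = torch.from_numpy(extract_fused_features("test.wav")).float().unsqueeze(0)
with torch.no_grad():
    embedding, logits = model(features)
algorithm_id = logits.argmax(dim=1).item()                # predicted generation algorithm ID
```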
And S4, taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
In practical applications, tracing and forensics refer to verifying the generation source or generation method of the generated speech.
In this embodiment, in step S3, the speech source tracing evidence obtaining model is obtained by training:
extracting at least two different acoustic features of the known voice, wherein the known voice is a plurality of groups of generated voice;
fusing at least two different acoustic features of the extracted known voice to obtain a second fused acoustic feature;
inputting the second fusion acoustic feature into an algorithm fingerprint extractor at the frame level of the voice traceability evidence-obtaining model before training, and outputting the algorithm fingerprint feature at the frame level of the known voice;
inputting the algorithm fingerprint characteristics of the frame level of the known voice into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level;
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into an algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level of the known voice and a corresponding generation algorithm;
taking the algorithm output by the algorithm fingerprint extractor at the segment level as a prediction result of a generation algorithm of the known voice;
calculating a loss function value through a preset loss function based on the predicted generation algorithm of the known speech and its actual generation algorithm, and adjusting the weight parameters of the frame-level algorithm fingerprint extractor, the attention statistics pooling layer and the segment-level algorithm fingerprint extractor according to the loss function value until the loss function value meets a preset condition,
wherein the loss function is:
L = α·L_softmax + β·L_triplet

L_softmax = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y(i, j)·log p(i, j), with p(i, j) = exp(l(i, j)) / Σ_{k=1}^{C} exp(l(i, k))

L_triplet = max( d(q, p) − d(q, n) + margin, 0 )

wherein N is the number of training samples, p(i, j) is the predicted probability of the ith sample for the jth class, l(i, j) is the model's output for the ith sample on the jth class, y(i, j) equals 1 if the ith sample belongs to the jth class and 0 otherwise, C is the number of classes, q is the current input feature vector, p is a feature vector of the same class as q, n is a feature vector of a different class from q, margin is a constant greater than 0, d(·) is a distance function, d(q, p) is the distance between q and p, d(q, n) is the distance between q and n, and α and β are both constants between 0 and 1.
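A sketch of this combined loss in PyTorch is given below; the values of α, β and margin are placeholders, and the construction of anchor/positive/negative triplets is left to the data pipeline:

```python
import torch
import torch.nn.functional as F

def tracing_loss(logits, labels, q, p, n, alpha=0.5, beta=0.5, margin=0.2):
    """L = alpha * L_softmax + beta * L_triplet."""
    # Softmax cross-entropy over the C generation-algorithm classes
    l_softmax = F.cross_entropy(logits, labels)
    # Triplet loss: pull same-algorithm embeddings together, push different ones apart
    d_qp = torch.norm(q - p, dim=1)                       # d(q, p)
    d_qn = torch.norm(q - n, dim=1)                       # d(q, n)
    l_triplet = torch.clamp(d_qp - d_qn + margin, min=0).mean()
    return alpha * l_softmax + beta * l_triplet
```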
In the training process of the voice tracing forensics model, the model is trained for a total of 100 epochs with an adaptive moment estimation (Adam) optimizer; the initial learning rate is set to 0.001 and decays linearly, and the batch size is 256.
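Under the hyperparameters stated above, a training loop could be set up as follows, continuing the sketches above; the train_dataset object yielding triplet batches is assumed:

```python
from torch.utils.data import DataLoader
import torch

epochs = 100
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)      # adaptive moment estimation
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)    # linear decay of the learning rate
loader = DataLoader(train_dataset, batch_size=256, shuffle=True)  # train_dataset is assumed

for epoch in range(epochs):
    for anchor_x, pos_x, neg_x, labels in loader:               # triplet batches are assumed
        q, logits = model(anchor_x)
        p, _ = model(pos_x)
        n, _ = model(neg_x)
        loss = tracing_loss(logits, labels, q, p, n)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```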
In this embodiment, the inputting the frame-level algorithmic fingerprint features of the known speech to the attention statistics pooling layer and outputting the weighted average vector and the weighted standard deviation vector of the frame-level algorithmic fingerprint features includes:
based on the attention model, obtaining a normalized weight score according to the algorithm fingerprint characteristics of the frame level of the known voice;
and calculating a weighted average vector and a weighted standard deviation vector from the weight scores and the frame-level algorithm fingerprint features of the known speech.
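For illustration, in the common formulation of attentive statistics pooling (consistent with the description above, although the exact scoring function is not specified here), with frame-level fingerprint features h_t and normalized weight scores w_t for frames t = 1, …, T:

μ = Σ_{t=1}^{T} w_t·h_t

σ = sqrt( Σ_{t=1}^{T} w_t·(h_t ⊙ h_t) − μ ⊙ μ )

where ⊙ denotes element-wise multiplication, μ is the weighted average vector and σ is the weighted standard deviation vector.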
In this embodiment, the segment-level algorithm fingerprint extractor includes a plurality of full-link layers and a classification layer, and inputs the weighted average vector and the weighted standard deviation vector of the frame-level algorithm fingerprint features into the segment-level algorithm fingerprint extractor, and outputs the segment-level algorithm fingerprint features of the known speech and the corresponding generation algorithm thereof, including:
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level into a plurality of full connection layers of the algorithm fingerprint extractor at the segment level to obtain the algorithm fingerprint features at the segment level of the known voice;
inputting the algorithm fingerprint features of the known voice segment level into the classification layer of the algorithm fingerprint extractor of the segment level to obtain a generation algorithm corresponding to the algorithm fingerprint features of the segment level.
In this embodiment, before the inputting the segment-level algorithmic fingerprint features of the known speech into the classification layer of the segment-level algorithmic fingerprint extractor, the method further comprises:
The segment-level algorithm fingerprint features of the known speech are whitened, length-normalized, and reduced in dimensionality using linear discriminant analysis, and the processed segment-level algorithm fingerprint features are then input into the classification layer of the segment-level algorithm fingerprint extractor. Here, the processed segment-level algorithm fingerprint features may be stored to facilitate extending to new algorithm fingerprint types without retraining a new algorithm fingerprint extractor, so that the classes learned by the voice tracing forensics model include not only the labeled generation algorithm or generation source classes but also unknown source classes to be accommodated.
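A sketch of this back-end processing with scikit-learn is shown below; whitening is implemented here with PCA, which is one common choice, and the number of retained dimensions is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_backend(segment_embeddings, labels, n_components=None):
    """Whiten, length-normalize, and reduce dimensionality of segment-level fingerprints."""
    whitener = PCA(whiten=True).fit(segment_embeddings)
    x = whitener.transform(segment_embeddings)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)           # length normalization
    lda = LinearDiscriminantAnalysis(n_components=n_components).fit(x, labels)
    return whitener, lda, lda.transform(x)                      # processed features for the classifier
```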
In this embodiment, after the training of the speech source tracing forensic model is completed, the method further includes:
for a given test data set, calculating the accuracy, the precision, the recall and an F-value of the voice tracing forensics model, wherein the accuracy is the ratio of the number of correctly classified samples to the total number of samples in the test data set, the precision is the proportion of the samples predicted to be positive that are actually positive, the recall is the proportion of the actual positive samples that are correctly predicted, and the F-value is the harmonic mean of the precision and the recall;
judging whether the accuracy, the precision, the recall rate and the F-value respectively meet preset requirements:
when the accuracy, the precision, the recall rate and the F-value all meet preset requirements, the voice source tracing and forensics model is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features.
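For a multi-class test set these four metrics can be computed with scikit-learn as sketched below; macro averaging is an assumption, since the averaging scheme is not specified here:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),                    # correct / total
        "precision": precision_score(y_true, y_pred, average="macro"),  # of predicted positives, how many are correct
        "recall":    recall_score(y_true, y_pred, average="macro"),     # of actual positives, how many are found
        "f_score":   f1_score(y_true, y_pred, average="macro"),         # harmonic mean of precision and recall
    }
```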
By using algorithm fingerprints, the present disclosure can not only judge the authenticity of audio but also further trace its source and obtain evidence, identifying the generation source of fake audio. This meets the demand for evidence about fake audio in many real-world scenarios such as judicial forensics, and can also promote further development of the audio detection field.
The method and the device realize tracing and forensics of generated speech by associating model characteristics with algorithm fingerprints and then classifying and distinguishing the algorithms that generated the speech.
The disclosed method can use the fingerprint of the generation algorithm to trace the source of generated speech and obtain evidence, filling the current gap in algorithm-level tracing in the audio field, with high accuracy, strong robustness and a degree of generality. With the development of deep learning, speech produced by speech synthesis and voice conversion has become increasingly lifelike. The method goes beyond the current real-versus-fake judgment task to further trace the source and obtain evidence. The tracing results can reach a useful level of accuracy, and the forensic analysis can in turn be exploited by voice anti-spoofing models, promoting their technical progress and development. Combining detection with tracing and forensics also improves the accuracy of the overall task. The judicial, news and entertainment fields currently have a strong demand for voice tracing and forensics technology, and spoofing-detection and tracing techniques face the challenges of real scenes. In real environments there are many influencing factors such as environmental noise, channels and speakers; the disclosed method can overcome these environmental influences while maintaining good tracing accuracy and high robustness. Research on generated speech is currently receiving wide attention in both industry and academia, and the generation algorithms of researchers and companies are diverse. The disclosed method provides a general capability for tracing generation algorithms, achieves high coverage of generated-speech algorithms, can adapt quickly to more unknown deep-forgery technologies in the future, and has good generality.
The present disclosure emphasizes tracing research at the level of the generation algorithm; it can attribute not only the generation algorithm but also the generation technology of different companies, providing tracing and forensics of generated speech based on algorithm fingerprints.
Referring to fig. 3, an embodiment of the present disclosure provides a voice traceability forensics apparatus, including:
an extraction module 11, configured to extract at least two different acoustic features of a voice to be tested;
the fusion module 12 is configured to fuse at least two different acoustic features of the extracted speech to be tested to obtain a first fused acoustic feature;
the prediction module 13 is configured to extract frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained speech tracing forensics model, perform pooling averaging on the frame-level algorithm fingerprint features, calculate segment-level algorithm fingerprint features according to a feature weighted average vector and a weighted standard deviation vector obtained by the pooling averaging, and predict a generation algorithm of the speech to be tested based on the segment-level algorithm fingerprint features;
and the evidence obtaining module 14 is used for taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
In the second embodiment, any plurality of the extraction module 11, the fusion module 12, the prediction module 13, and the forensics module 14 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. At least one of the extraction module 11, the fusion module 12, the prediction module 13, and the forensics module 14 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the extraction module 11, the fusion module 12, the prediction module 13 and the forensics module 14 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
Referring to fig. 4, an electronic device provided by a fourth exemplary embodiment of the present disclosure includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110 is configured to implement the following voice source tracing evidence obtaining method when executing the program stored in the memory 1130:
extracting at least two different acoustic characteristics of the voice to be tested;
fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested by the segment-level algorithm fingerprint features;
and taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
The communication bus 1140 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The Memory 1130 may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory 1130 may also be at least one memory device located remotely from the processor 1110.
The Processor 1110 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer readable storage medium stores thereon a computer program, which when executed by a processor implements the voice traceability evidence-obtaining method as described above.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer readable storage medium carries one or more programs which, when executed, implement the voice traceability forensics method according to the embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A voice traceability evidence obtaining method is characterized by comprising the following steps:
extracting at least two different acoustic characteristics of the voice to be tested;
fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested by the segment-level algorithm fingerprint features;
and taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
2. The method according to claim 1, wherein the speech provenance forensics model comprises a frame-level algorithmic fingerprint extractor, an attention statistics pooling layer and a segment-level algorithmic fingerprint extractor which are connected in sequence, wherein the frame-level algorithmic fingerprint extractor comprises a plurality of self-attention layers which are connected with each other, the attention statistics pooling layer comprises an attention model and an attention statistics pooling network layer which are connected in sequence, and the segment-level algorithmic fingerprint extractor comprises a full-connection layer and a classification layer which are connected in sequence.
3. The method of claim 2, wherein the speech traceability forensic model is trained by:
extracting at least two different acoustic features of the known speech;
fusing at least two different acoustic features of the extracted known voice to obtain a second fused acoustic feature;
inputting the second fusion acoustic feature into an algorithm fingerprint extractor at the frame level of the voice traceability evidence-obtaining model before training, and outputting the algorithm fingerprint feature at the frame level of the known voice;
inputting the algorithm fingerprint characteristics of the known voice at the frame level into an attention statistics pooling layer, and outputting a weighted average vector and a weighted standard deviation vector of the algorithm fingerprint characteristics at the frame level;
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint characteristics of the frame level into an algorithm fingerprint extractor of the segment level, and outputting the algorithm fingerprint characteristics of the segment level of the known voice and a corresponding generation algorithm;
taking the algorithm output by the algorithm fingerprint extractor at the segment level as a prediction result of a generation algorithm of the known voice;
calculating a loss function value through a preset loss function based on a prediction result and an actual generation algorithm of a known voice generation algorithm, adjusting weight parameters of an algorithm fingerprint extractor at a frame level, an attention statistics pooling layer and an algorithm fingerprint extractor at a segment level according to the loss function value until the loss function value meets a preset condition,
wherein the loss function is:
L = α·L_softmax + β·L_triplet

L_softmax = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y(i, j)·log p(i, j), with p(i, j) = exp(l(i, j)) / Σ_{k=1}^{C} exp(l(i, k))

L_triplet = max( d(q, p) − d(q, n) + margin, 0 )

wherein N is the number of training samples, p(i, j) is the predicted probability of the ith sample for the jth class, l(i, j) is the model's output for the ith sample on the jth class, y(i, j) equals 1 if the ith sample belongs to the jth class and 0 otherwise, C is the number of classes, q is the current input feature vector, p is a feature vector of the same class as q, n is a feature vector of a different class from q, margin is a constant greater than 0, d(·) is a distance function, d(q, p) is the distance between q and p, d(q, n) is the distance between q and n, and α and β are both constants between 0 and 1.
4. The method of claim 3, wherein inputting the frame-level algorithmic fingerprint features of the known speech to the attention statistics pooling layer and outputting a weighted average vector and a weighted standard deviation vector of the frame-level algorithmic fingerprint features comprises:
based on the attention model, obtaining a normalized weight score according to the algorithm fingerprint characteristics of the frame level of the known voice;
and calculating to obtain a weighted average vector and a weighted standard deviation vector according to the weight fraction and the algorithm fingerprint characteristics of the frame level of the known voice.
5. The method of claim 3, wherein the segment-level algorithmic fingerprint extractor comprises a plurality of fully connected layers and a classification layer, and the weighted average vector and the weighted standard deviation vector of the frame-level algorithmic fingerprint features are input into the segment-level algorithmic fingerprint extractor, and the segment-level algorithmic fingerprint features of the known speech and the corresponding generating algorithm are output, comprising:
inputting the weighted average vector and the weighted standard deviation vector of the algorithm fingerprint features at the frame level into a plurality of full connection layers of the algorithm fingerprint extractor at the segment level to obtain the algorithm fingerprint features at the segment level of the known voice;
inputting the algorithm fingerprint features of the known voice segment level into the classification layer of the algorithm fingerprint extractor of the segment level to obtain a generation algorithm corresponding to the algorithm fingerprint features of the segment level.
6. The method of claim 5, wherein prior to the inputting the segment-level algorithmic fingerprint features of the known speech into the classification layer of the segment-level algorithmic fingerprint extractor, the method further comprises:
whitening, length normalization and processing for reducing dimensionality of the algorithm fingerprint characteristics of the known speech at the segment level by using a linear discriminant analysis method so as to input the processed algorithm fingerprint characteristics at the segment level into a classification layer of an algorithm fingerprint extractor at the segment level.
7. The method of claim 3, wherein after the training of the speech traceability forensic model is completed, the method further comprises:
for a given test data set, calculating the accuracy, the precision, the recall rate and an F-value of the voice traceability evidence-obtaining model, wherein the accuracy is the ratio of the number of correctly classified samples to the total number of samples in the test data set, the precision is the number of actual positive samples in samples predicted to be positive in the test data set, the recall rate is the number of correctly predicted samples in actual positive samples of the test data set, and the F-value is the harmonic average value of the precision and the recall rate;
judging whether the accuracy, the precision, the recall rate and the F-value respectively meet preset requirements:
when the accuracy, the precision, the recall rate and the F-value all meet preset requirements, the voice traceability evidence obtaining model is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features.
8. A voice traceability evidence obtaining device is characterized by comprising:
the extraction module is used for extracting at least two different acoustic characteristics of the voice to be tested;
the fusion module is used for fusing at least two different acoustic characteristics of the extracted voice to be tested to obtain a first fused acoustic characteristic;
the prediction module is used for extracting frame-level algorithm fingerprint features from the first fusion acoustic features based on a pre-trained voice tracing evidence obtaining model, performing pooling averaging on the frame-level algorithm fingerprint features, calculating segment-level algorithm fingerprint features according to feature weighted average vectors and weighted standard deviation vectors obtained by the pooling averaging, and predicting a generation algorithm of the voice to be tested based on the segment-level algorithm fingerprint features;
and the evidence obtaining module is used for taking the predicted generation algorithm of the voice to be tested as a voice source tracing evidence obtaining result.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the voice traceability evidence-obtaining method of any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the voice traceability forensics method of any one of claims 1 to 7.
CN202210859678.XA 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium Active CN115083422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210859678.XA CN115083422B (en) 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210859678.XA CN115083422B (en) 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115083422A true CN115083422A (en) 2022-09-20
CN115083422B CN115083422B (en) 2022-11-15

Family

ID=83242760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210859678.XA Active CN115083422B (en) 2022-07-21 2022-07-21 Voice traceability evidence obtaining method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115083422B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153337A (en) * 2023-04-20 2023-05-23 北京中电慧声科技有限公司 Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN118016051A (en) * 2024-04-07 2024-05-10 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device
CN118298809A (en) * 2024-04-10 2024-07-05 中国人民解放军陆军工程大学 Open world fake voice attribution method and system based on soft comparison fake learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN111276131A (en) * 2020-01-22 2020-06-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN114420100A (en) * 2022-03-30 2022-04-29 中国科学院自动化研究所 Voice detection method and device, electronic equipment and storage medium
CN114420135A (en) * 2021-12-10 2022-04-29 江苏清微智能科技有限公司 Attention mechanism-based voiceprint recognition method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN111276131A (en) * 2020-01-22 2020-06-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN114420135A (en) * 2021-12-10 2022-04-29 江苏清微智能科技有限公司 Attention mechanism-based voiceprint recognition method and device
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN114420100A (en) * 2022-03-30 2022-04-29 中国科学院自动化研究所 Voice detection method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153337A (en) * 2023-04-20 2023-05-23 北京中电慧声科技有限公司 Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN118016051A (en) * 2024-04-07 2024-05-10 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device
CN118016051B (en) * 2024-04-07 2024-07-19 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device
CN118298809A (en) * 2024-04-10 2024-07-05 中国人民解放军陆军工程大学 Open world fake voice attribution method and system based on soft comparison fake learning

Also Published As

Publication number Publication date
CN115083422B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN115083422B (en) Voice traceability evidence obtaining method and device, equipment and storage medium
Kong et al. Weakly labelled audioset tagging with attention neural networks
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN107610707A (en) A kind of method for recognizing sound-groove and device
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN109408660B (en) Music automatic classification method based on audio features
US20110093427A1 (en) System and method for tagging signals of interest in time variant data
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
CN111192601A (en) Music labeling method and device, electronic equipment and medium
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN110120230A (en) A kind of acoustic events detection method and device
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
Gomez-Alanis et al. Adversarial transformation of spoofing attacks for voice biometrics
CN112885330A (en) Language identification method and system based on low-resource audio
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
CN115394318A (en) Audio detection method and device
Mandalapu et al. Multilingual voice impersonation dataset and evaluation
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition
Tsai et al. Bird species identification based on timbre and pitch features
Patil et al. Content-based audio classification and retrieval: A novel approach
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant