CN113808579A - Detection method and device for generated voice, electronic equipment and storage medium - Google Patents

Detection method and device for generated voice, electronic equipment and storage medium

Info

Publication number
CN113808579A
Authority
CN
China
Prior art keywords
voice
training
acoustic feature
speech
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111383856.8A
Other languages
Chinese (zh)
Other versions
CN113808579B (en)
Inventor
易江燕
陶建华
傅睿博
聂帅
梁山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111383856.8A priority Critical patent/CN113808579B/en
Publication of CN113808579A publication Critical patent/CN113808579A/en
Application granted granted Critical
Publication of CN113808579B publication Critical patent/CN113808579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for comparison or discrimination for measuring the quality of voice signals

Abstract

The present disclosure relates to a method, apparatus, electronic device, and storage medium for detecting generated speech. The method comprises: acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected; inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature; extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively; splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature; and splicing the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result.

Description

Detection method and device for generated voice, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for detecting generated speech, an electronic device, and a storage medium.
Background
With the rapid development of deep learning, speech synthesis technology has matured rapidly and can generate speech comparable to that of real people, so it is widely applied in fields such as human-computer interaction, smart home, entertainment, and education. However, misuse of speech synthesis technology can harm individuals and society, so techniques for detecting generated speech are an urgent need. In the prior art, generated speech is usually detected from acoustic features or phoneme duration features, but the accuracy of these methods is not high enough, and the detection models used in the detection process do not generalize well.
In the course of implementing the disclosed concept, the inventors found at least the following technical problems in the related art: the accuracy of detecting generated speech is not high enough, and the detection model used in the detection process does not generalize well.
Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for detecting generated speech, so as to address at least the problems in the prior art that the accuracy of detecting generated speech is not high enough and that the detection model used in the detection process does not generalize well.
The object of the present disclosure is achieved by the following technical solutions:
in a first aspect, an embodiment of the present disclosure provides a detection method for generated speech, including: acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected; inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature; extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively; splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature; and performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, wherein the speech detection result comprises: the speech to be detected is real speech, or the speech to be detected is generated speech.
In an exemplary embodiment, before inputting the first fusion feature into the prosodic rhythm prediction model and outputting the prosodic rhythm feature, the method includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature of each training voice in the training voice data set; inputting the third acoustic feature into the voice recognition model, and outputting a training voice text sequence corresponding to the third acoustic feature; extracting training voice word vectors and training voice sound vectors of the training voice text sequence through the word embedding model and the voice embedding model, respectively; performing the splicing processing on the training voice word vectors and the training voice sound vectors to obtain a third fusion feature, and performing first labeling processing on the third fusion feature; and training the prosodic rhythm prediction model on the third fusion feature after the first labeling processing by using a stochastic gradient descent algorithm.
In one exemplary embodiment, the prosodic rhythm prediction model includes a multi-layer self-attention network, wherein each layer of the self-attention network comprises a plurality of self-attention heads.
In an exemplary embodiment, before inputting the second fusion feature into the speech detection model and outputting the speech detection result, the method further includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature and a fourth acoustic feature of each training voice in the training voice data set; inputting the third acoustic feature into the voice recognition model, and outputting a training voice text sequence corresponding to the third acoustic feature; extracting training voice word vectors and training voice sound vectors of the training voice text sequence through the word embedding model and the voice embedding model, respectively; performing the splicing processing on the training voice word vectors and the training voice sound vectors to obtain a third fusion feature, inputting the third fusion feature into the prosodic rhythm prediction model, and outputting a training voice prosodic rhythm feature; performing the splicing processing on the training voice prosodic rhythm feature and the fourth acoustic feature to obtain a fourth fusion feature, and performing second labeling processing on the fourth fusion feature; and training the speech detection model on the fourth fusion feature after the second labeling processing by using a stochastic gradient descent algorithm.
In an exemplary embodiment, the speech detection model includes a plurality of time-delay neural network layers, a plurality of residual network layers, and a fully connected layer.
In an exemplary embodiment, before extracting the word vector and the sound vector of the text sequence through the word embedding model and the voice embedding model, respectively, the method includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature of each training voice in the training voice data set; inputting the third acoustic feature into the voice recognition model, and outputting a training voice text sequence corresponding to the third acoustic feature; and performing third labeling processing and fourth labeling processing on the training voice text sequence, respectively, training the word embedding model with the training voice text sequence subjected to the third labeling processing, and training the voice embedding model with the training voice text sequence subjected to the fourth labeling processing.
In an exemplary embodiment, before inputting the first acoustic feature into the speech recognition model and outputting the text sequence corresponding to the first acoustic feature, the method includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature of each training voice in the training voice data set; and performing fifth labeling processing on the third acoustic feature, and training the voice recognition model with the third acoustic feature subjected to the fifth labeling processing.
In a second aspect, an embodiment of the present disclosure provides a detection apparatus for generated speech, including: a first extraction module, configured to acquire a voice to be detected and extract a first acoustic feature and a second acoustic feature of the voice to be detected; a first model module, configured to input the first acoustic feature into a speech recognition model and output a text sequence corresponding to the first acoustic feature; a second extraction module, configured to extract a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively; a second model module, configured to splice the word vector and the sound vector to obtain a first fusion feature, input the first fusion feature into a prosodic rhythm prediction model, and output a prosodic rhythm feature; and a third model module, configured to perform the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, input the second fusion feature into a speech detection model, and output a speech detection result, wherein the speech detection result comprises: the speech to be detected is real speech, or the speech to be detected is generated speech.
In a third aspect, embodiments of the present disclosure provide an electronic device. The electronic device comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the method for detecting generated speech described above when executing the program stored in the memory.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the method for detecting generated speech described above.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least some or all of the following advantages: acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected; inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature; extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively; splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature; and performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, wherein the speech detection result comprises: the speech to be detected is real speech, or the speech to be detected is generated speech. Real speech and generated speech differ in their prosodic rhythm distributions, and the second fusion feature examined by the speech detection model contains the prosodic rhythm feature; moreover, a speech detection model trained on the second fusion feature, which fuses the second acoustic feature with the prosodic rhythm feature, retains good predictive ability in other variable domains. The above technical means can therefore solve the problems in the prior art that the accuracy of detecting generated speech is not high enough and that the detection model used in the detection process does not generalize well, thereby improving both the accuracy of detecting generated speech and the generalization of the detection model used in the detection process.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 schematically shows a block diagram of the hardware structure of a computer terminal for a detection method for generated speech according to an embodiment of the present disclosure;
Fig. 2 schematically illustrates a flow chart of a detection method for generated speech according to an embodiment of the present disclosure;
Fig. 3 schematically illustrates a block diagram of a detection apparatus for generated speech according to an embodiment of the present disclosure;
Fig. 4 schematically shows a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided by the embodiments of the present disclosure may be executed on a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, fig. 1 schematically shows a block diagram of the hardware structure of a computer terminal for a detection method for generated speech according to an embodiment of the present disclosure. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1), where the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MPU) or a programmable logic device (PLD), and a memory 104 for storing data. Optionally, the computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the computer terminal; for example, the computer terminal may include more or fewer components than shown in fig. 1, or have an equivalent or different configuration from that shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the detection method for generating a voice in the embodiment of the present disclosure, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In an embodiment of the present disclosure, a detection method for generated speech is provided. Fig. 2 schematically illustrates a flowchart of the detection method for generated speech according to an embodiment of the present disclosure. As shown in fig. 2, the flow includes the following steps:
step S202, acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected;
step S204, inputting the first acoustic feature into a voice recognition model, and outputting a text sequence corresponding to the first acoustic feature;
step S206, extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively;
step S208, splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature;
step S210, performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, where the speech detection result includes: the speech to be detected is real speech, or the speech to be detected is generated speech.
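To make the flow of steps S202 to S210 concrete, the following minimal Python sketch strings the five steps together. The dictionary keys, wrapper objects, and the implicit alignment between token-level and frame-level features are assumptions standing in for the trained models described below; they are not part of the original disclosure.

import torch

def detect_generated_speech(waveform, sample_rate, models, extractors):
    # Hypothetical end-to-end inference for steps S202-S210; `models` and
    # `extractors` are assumed wrappers around the trained speech recognition,
    # embedding, prosodic rhythm prediction and speech detection models.

    # step S202: extract the first (e.g. MFCC) and second (e.g. LFCC/LPC) acoustic features
    first_feat = extractors["first"](waveform, sample_rate)      # (frames, d1)
    second_feat = extractors["second"](waveform, sample_rate)    # (frames, d2)

    # step S204: the speech recognition model maps the first acoustic feature to a text sequence
    text_sequence = models["speech_recognizer"](first_feat)

    # step S206: the word embedding model and voice embedding model give word / sound vectors
    word_vec = models["word_embedder"](text_sequence)            # (tokens, dw)
    sound_vec = models["voice_embedder"](text_sequence)          # (tokens, ds)

    # step S208: splice the two vectors and predict the prosodic rhythm feature
    first_fusion = torch.cat([word_vec, sound_vec], dim=-1)
    prosody_feat = models["prosody_predictor"](first_fusion)

    # step S210: splice the prosodic rhythm feature with the second acoustic feature and classify
    # (assumes prosody_feat has been aligned to the frame rate of second_feat)
    second_fusion = torch.cat([second_feat, prosody_feat], dim=-1)
    logits = models["detector"](second_fusion)                   # logits for the two classes
    return "real speech" if logits.argmax().item() == 0 else "generated speech"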
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least some or all of the following advantages: acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected; inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature; extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively; splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature; and performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, wherein the speech detection result comprises: the speech to be detected is real speech, or the speech to be detected is generated speech. Real speech and generated speech differ in their prosodic rhythm distributions, and the second fusion feature examined by the speech detection model contains the prosodic rhythm feature; moreover, a speech detection model trained on the second fusion feature, which fuses the second acoustic feature with the prosodic rhythm feature, retains good predictive ability in other variable domains. The above technical means can therefore solve the problems in the prior art that the accuracy of detecting generated speech is not high enough and that the detection model used in the detection process does not generalize well, thereby improving both the accuracy of detecting generated speech and the generalization of the detection model used in the detection process.
The first acoustic feature in the present disclosure may be Mel-frequency cepstral coefficients (MFCC) or FBank features, and the second acoustic feature may be linear frequency cepstral coefficients (LFCC) and linear prediction coefficients (LPC).
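As an illustration of one way such features could be computed (this specific recipe, including frame parameters and the simplified LFCC derivation, is an assumption and not part of the disclosure), the sketch below uses librosa for the MFCC first acoustic feature and derives LFCC-style coefficients and LPC coefficients for the second acoustic feature:

import numpy as np
import librosa
from scipy.fftpack import dct

def extract_first_feature(waveform, sr, n_mfcc=20):
    # First acoustic feature: MFCC (an FBank feature could instead be taken
    # from librosa.feature.melspectrogram)
    return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc).T        # (frames, n_mfcc)

def extract_lfcc(waveform, sr, n_fft=512, hop_length=160, n_lfcc=20):
    # One component of the second acoustic feature: LFCC, i.e. a DCT of the
    # log power spectrum on a linear frequency axis (simplified; a linear
    # triangular filterbank is often applied before the DCT)
    power = np.abs(librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)) ** 2
    log_power = np.log(power + 1e-10)
    return dct(log_power, type=2, axis=0, norm="ortho")[:n_lfcc].T          # (frames, n_lfcc)

def extract_lpc(waveform, order=16):
    # Other component of the second acoustic feature: LPC coefficients of the utterance
    return librosa.lpc(waveform, order=order)                               # (order + 1,)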
In step S208, before inputting the first fusion feature into the prosodic rhythm prediction model and outputting the prosodic rhythm feature, the method includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature of each training voice in the training voice data set; inputting the third acoustic feature into the voice recognition model, and outputting a training voice text sequence corresponding to the third acoustic feature; extracting training voice word vectors and training voice sound vectors of the training voice text sequence through the word embedding model and the voice embedding model, respectively; performing the splicing processing on the training voice word vectors and the training voice sound vectors to obtain a third fusion feature, and performing first labeling processing on the third fusion feature; and training the prosodic rhythm prediction model on the third fusion feature after the first labeling processing by using a stochastic gradient descent algorithm.
It should be noted that the third acoustic feature is of the same type as the first acoustic feature, and the fourth acoustic feature is of the same type as the second acoustic feature; the different names only distinguish whether the extracted acoustic feature comes from the voice to be detected or from a training voice in the training voice data set. Inputting the third acoustic feature into the voice recognition model and outputting the training voice text sequence corresponding to the third acoustic feature can also be understood as directly converting the third acoustic feature into the training voice text sequence through speech recognition technology. Similarly, in the above embodiment, inputting the first acoustic feature into the speech recognition model and outputting the text sequence corresponding to the first acoustic feature can be understood as directly converting the first acoustic feature into the text sequence through speech recognition technology. Performing the first labeling processing on the third fusion feature means labeling the third fusion feature with its corresponding prosodic rhythm label, where the prosodic rhythm label is the prosodic rhythm feature of the training voice. The prosodic rhythm prediction model is trained on the third fusion feature after the first labeling processing by using a stochastic gradient descent algorithm, so that the trained prosodic rhythm prediction model learns and stores the correspondence between the third fusion feature and the prosodic rhythm feature of the training voice.
The training voice word vector is the word vector of the training voice extracted by the word embedding model, and the training voice sound vector is the sound vector of the training voice extracted by the voice embedding model.
The prosodic rhythm prediction model includes a multi-layer self-attention network, wherein each layer of the self-attention network comprises a plurality of self-attention heads.
Optionally, the prosodic rhythm prediction model includes a 3-layer self-attention network, wherein each layer of the self-attention network comprises 8 self-attention heads. The prosodic rhythm prediction model is thus a self-attention encoder model.
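A minimal PyTorch sketch of such a 3-layer, 8-head self-attention encoder is given below. The layer width, feed-forward size, and the linear projections from the first fusion feature to the prosodic rhythm feature are assumptions, since the disclosure only fixes the layer and head counts.

import torch
import torch.nn as nn

class ProsodyRhythmPredictor(nn.Module):
    # Assumed realization: 3 self-attention layers, 8 heads each.
    def __init__(self, fusion_dim, prosody_dim, d_model=256, n_layers=3, n_heads=8):
        super().__init__()
        self.input_proj = nn.Linear(fusion_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.output_proj = nn.Linear(d_model, prosody_dim)   # prosodic rhythm feature

    def forward(self, first_fusion):            # (batch, seq_len, fusion_dim)
        hidden = self.encoder(self.input_proj(first_fusion))
        return self.output_proj(hidden)         # (batch, seq_len, prosody_dim)

# usage sketch with assumed dimensions
model = ProsodyRhythmPredictor(fusion_dim=512, prosody_dim=8)
out = model(torch.randn(4, 50, 512))            # -> (4, 50, 8)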
In step S210, before inputting the second fusion feature into the speech detection model and outputting the speech detection result, the method further includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature and a fourth acoustic feature of each training voice in the training voice data set; inputting the third acoustic feature into the voice recognition model, and outputting a training voice text sequence corresponding to the third acoustic feature; extracting training voice word vectors and training voice sound vectors of the training voice text sequence through the word embedding model and the voice embedding model, respectively; performing the splicing processing on the training voice word vectors and the training voice sound vectors to obtain a third fusion feature, inputting the third fusion feature into the prosodic rhythm prediction model, and outputting a training voice prosodic rhythm feature; performing the splicing processing on the training voice prosodic rhythm feature and the fourth acoustic feature to obtain a fourth fusion feature, and performing second labeling processing on the fourth fusion feature; and training the speech detection model on the fourth fusion feature after the second labeling processing by using a stochastic gradient descent algorithm.
Performing the second labeling processing on the fourth fusion feature means labeling the fourth fusion feature with a label indicating that the training voice is real voice or a label indicating that the training voice is generated voice. The speech detection model is trained on the fourth fusion feature after the second labeling processing by using a stochastic gradient descent algorithm, so that the trained speech detection model learns and stores the correspondence between the training voice and the training voice detection result, where the training voice detection result includes: the training voice is real voice, or the training voice is generated voice.
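A minimal sketch of this supervised training step follows, assuming the fourth fusion features have already been computed as a float tensor and the labels are 0 for real voice and 1 for generated voice; the batch size, learning rate, and epoch count are assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_detector(detector, fourth_fusion, labels, epochs=10, lr=0.01):
    # fourth_fusion: float tensor (num_samples, feature_dim); labels: long tensor (num_samples,)
    loader = DataLoader(TensorDataset(fourth_fusion, labels), batch_size=32, shuffle=True)
    optimizer = torch.optim.SGD(detector.parameters(), lr=lr)   # stochastic gradient descent
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, target in loader:
            optimizer.zero_grad()
            loss = criterion(detector(feats), target)           # real vs. generated
            loss.backward()
            optimizer.step()
    return detector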
The speech detection model includes a plurality of time-delay neural network layers, a plurality of residual network layers, and a fully connected layer.
Optionally, the speech detection model includes 2 time-delay neural network layers, 6 residual network layers, and 1 fully connected layer. The speech detection model can therefore be viewed as a combination of a residual network and a time-delay neural network. The activation function of the speech detection model is ReLU.
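The following PyTorch sketch shows one way such a classifier with 2 time-delay layers, 6 residual blocks, and 1 fully connected layer could be laid out; the channel widths, kernel sizes, dilations, and time pooling are assumptions, since the disclosure only fixes the layer counts and the ReLU activation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)              # residual connection

class SpeechDetectionModel(nn.Module):
    # Assumed layout: 2 TDNN (dilated Conv1d) layers, 6 residual blocks, 1 FC layer.
    def __init__(self, feature_dim, channels=256, num_classes=2):
        super().__init__()
        self.tdnn = nn.Sequential(             # time-delay layers as dilated 1-D convolutions
            nn.Conv1d(feature_dim, channels, kernel_size=5, dilation=1, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2), nn.ReLU())
        self.res_blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(6)])
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, second_fusion):          # (batch, frames, feature_dim)
        x = self.tdnn(second_fusion.transpose(1, 2))
        x = self.res_blocks(x).mean(dim=-1)    # average pooling over time frames
        return self.fc(x)                      # logits: real vs. generated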
In step S206, before extracting the word vector and the sound vector of the text sequence through the word embedding model and the voice embedding model, respectively, the method includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature of each training voice in the training voice data set; inputting the third acoustic feature into the voice recognition model, and outputting a training voice text sequence corresponding to the third acoustic feature; and performing third labeling processing and fourth labeling processing on the training voice text sequence, respectively, training the word embedding model with the training voice text sequence subjected to the third labeling processing, and training the voice embedding model with the training voice text sequence subjected to the fourth labeling processing.
Performing the third labeling processing and the fourth labeling processing on the training voice text sequence means labeling the training voice text sequence with its corresponding training voice word vector and labeling the training voice text sequence with its corresponding training voice sound vector, respectively. The word embedding model is trained with the training voice text sequence subjected to the third labeling processing, so that the trained word embedding model learns and stores the correspondence between the training voice text sequence and the training voice word vector. The voice embedding model is trained with the training voice text sequence subjected to the fourth labeling processing, so that the trained voice embedding model learns and stores the correspondence between the training voice text sequence and the training voice sound vector.
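One common way to obtain such word vectors and sound (pronunciation) vectors, offered here only as an assumed illustration because the disclosure does not specify the embedding algorithm, is to train two skip-gram models: one on the recognized word sequences and one on their phonetic transcriptions.

from gensim.models import Word2Vec

def train_embedding_models(word_sequences, phone_sequences, dim=128):
    # word_sequences: recognized text, e.g. [["hello", "world"], ...]
    # phone_sequences: corresponding phonetic transcriptions, e.g. [["HH", "AH", "L", "OW"], ...]
    word_model = Word2Vec(sentences=word_sequences, vector_size=dim,
                          window=5, min_count=1, sg=1)    # word embedding model
    phone_model = Word2Vec(sentences=phone_sequences, vector_size=dim,
                           window=5, min_count=1, sg=1)   # voice (sound) embedding model
    return word_model, phone_model

# word vector / sound vector lookup for a recognized token, e.g.
# word_model.wv["hello"], phone_model.wv["HH"]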
In step S204, before inputting the first acoustic feature into the speech recognition model and outputting the text sequence corresponding to the first acoustic feature, the method includes: acquiring a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extracting a third acoustic feature of each training voice in the training voice data set; and performing fifth labeling processing on the third acoustic feature, and training the voice recognition model with the third acoustic feature subjected to the fifth labeling processing.
Performing the fifth labeling processing on the third acoustic feature means labeling the third acoustic feature with its training voice text sequence. The voice recognition model is trained with the third acoustic feature subjected to the fifth labeling processing, so that the trained voice recognition model learns and stores the correspondence between the third acoustic feature and the training voice text sequence.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although the former is in many cases the better implementation. Based on this understanding, the technical solutions of the present disclosure, or the portions contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present disclosure.
In this embodiment, a detection apparatus for generated speech is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 schematically illustrates a block diagram of a detection apparatus for generated speech according to an alternative embodiment of the present disclosure. As shown in fig. 3, the apparatus includes:
a first extraction module 302, configured to obtain a to-be-detected voice, and extract a first acoustic feature and a second acoustic feature of the to-be-detected voice;
a first model module 304, configured to input the first acoustic feature into a speech recognition model, and output a text sequence corresponding to the first acoustic feature;
a second extraction module 306, configured to extract a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively;
a second model module 308, configured to splice the word vector and the sound vector to obtain a first fusion feature, input the first fusion feature into a prosodic rhythm prediction model, and output a prosodic rhythm feature;
a third model module 310, configured to perform the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, input the second fusion feature into a speech detection model, and output a speech detection result, where the speech detection result includes: the speech to be detected is real speech, or the speech to be detected is generated speech.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least some or all of the following advantages: acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected; inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature; extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively; splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature; and performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, wherein the speech detection result comprises: the speech to be detected is real speech, or the speech to be detected is generated speech. Real speech and generated speech differ in their prosodic rhythm distributions, and the second fusion feature examined by the speech detection model contains the prosodic rhythm feature; moreover, a speech detection model trained on the second fusion feature, which fuses the second acoustic feature with the prosodic rhythm feature, retains good predictive ability in other variable domains. The above technical means can therefore solve the problems in the prior art that the accuracy of detecting generated speech is not high enough and that the detection model used in the detection process does not generalize well, thereby improving both the accuracy of detecting generated speech and the generalization of the detection model used in the detection process.
The first acoustic feature may be Mel-frequency cepstral coefficients (MFCC) or FBank features, and the second acoustic feature may be linear frequency cepstral coefficients (LFCC) and linear prediction coefficients (LPC). Optionally, the second model module 308 is further configured to: acquire a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extract a third acoustic feature of each training voice in the training voice data set; input the third acoustic feature into the voice recognition model, and output a training voice text sequence corresponding to the third acoustic feature; extract training voice word vectors and training voice sound vectors of the training voice text sequence through the word embedding model and the voice embedding model, respectively; perform the splicing processing on the training voice word vectors and the training voice sound vectors to obtain a third fusion feature, and perform first labeling processing on the third fusion feature; and train the prosodic rhythm prediction model on the third fusion feature after the first labeling processing by using a stochastic gradient descent algorithm.
It should be noted that the third acoustic feature is of the same type as the first acoustic feature, and the fourth acoustic feature is of the same type as the second acoustic feature; the different names only distinguish whether the extracted acoustic feature comes from the voice to be detected or from a training voice in the training voice data set. Inputting the third acoustic feature into the voice recognition model and outputting the training voice text sequence corresponding to the third acoustic feature can also be understood as directly converting the third acoustic feature into the training voice text sequence through speech recognition technology. Similarly, in the above embodiment, inputting the first acoustic feature into the speech recognition model and outputting the text sequence corresponding to the first acoustic feature can be understood as directly converting the first acoustic feature into the text sequence through speech recognition technology. Performing the first labeling processing on the third fusion feature means labeling the third fusion feature with its corresponding prosodic rhythm label, where the prosodic rhythm label is the prosodic rhythm feature of the training voice. The prosodic rhythm prediction model is trained on the third fusion feature after the first labeling processing by using a stochastic gradient descent algorithm, so that the trained prosodic rhythm prediction model learns and stores the correspondence between the third fusion feature and the prosodic rhythm feature of the training voice.
The prosodic rhythm prediction model includes a multi-layer self-attention network, wherein each layer of the self-attention network comprises a plurality of self-attention heads.
Optionally, the prosodic rhythm prediction model includes a 3-layer self-attention network, wherein each layer of the self-attention network comprises 8 self-attention heads. The prosodic rhythm prediction model is thus a self-attention encoder model.
Optionally, the third model module 310 is further configured to: acquire a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extract a third acoustic feature and a fourth acoustic feature of each training voice in the training voice data set; input the third acoustic feature into the voice recognition model, and output a training voice text sequence corresponding to the third acoustic feature; extract training voice word vectors and training voice sound vectors of the training voice text sequence through the word embedding model and the voice embedding model, respectively; perform the splicing processing on the training voice word vectors and the training voice sound vectors to obtain a third fusion feature, input the third fusion feature into the prosodic rhythm prediction model, and output a training voice prosodic rhythm feature; perform the splicing processing on the training voice prosodic rhythm feature and the fourth acoustic feature to obtain a fourth fusion feature, and perform second labeling processing on the fourth fusion feature; and train the speech detection model on the fourth fusion feature after the second labeling processing by using a stochastic gradient descent algorithm.
Performing the second labeling processing on the fourth fusion feature means labeling the fourth fusion feature with a label indicating that the training voice is real voice or a label indicating that the training voice is generated voice. The speech detection model is trained on the fourth fusion feature after the second labeling processing by using a stochastic gradient descent algorithm, so that the trained speech detection model learns and stores the correspondence between the training voice and the training voice detection result, where the training voice detection result includes: the training voice is real voice, or the training voice is generated voice.
The speech detection model includes a plurality of time-delay neural network layers, a plurality of residual network layers, and a fully connected layer.
Optionally, the speech detection model includes 2 time-delay neural network layers, 6 residual network layers, and 1 fully connected layer. The speech detection model can therefore be viewed as a combination of a residual network and a time-delay neural network. The activation function of the speech detection model is ReLU.
Optionally, the second extraction module is further configured to: acquire a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extract a third acoustic feature of each training voice in the training voice data set; input the third acoustic feature into the voice recognition model, and output a training voice text sequence corresponding to the third acoustic feature; and perform third labeling processing and fourth labeling processing on the training voice text sequence, respectively, train the word embedding model with the training voice text sequence subjected to the third labeling processing, and train the voice embedding model with the training voice text sequence subjected to the fourth labeling processing.
Performing the third labeling processing and the fourth labeling processing on the training voice text sequence means labeling the training voice text sequence with its corresponding training voice word vector and labeling the training voice text sequence with its corresponding training voice sound vector, respectively. The word embedding model is trained with the training voice text sequence subjected to the third labeling processing, so that the trained word embedding model learns and stores the correspondence between the training voice text sequence and the training voice word vector. The voice embedding model is trained with the training voice text sequence subjected to the fourth labeling processing, so that the trained voice embedding model learns and stores the correspondence between the training voice text sequence and the training voice sound vector.
Optionally, the first model module 304 is further configured to: acquire a training voice data set, wherein the training voice data set comprises a plurality of training voices, and the training voices are real voices or generated voices; extract a third acoustic feature of each training voice in the training voice data set; and perform fifth labeling processing on the third acoustic feature, and train the voice recognition model with the third acoustic feature subjected to the fifth labeling processing.
Performing the fifth labeling processing on the third acoustic feature means labeling the third acoustic feature with its training voice text sequence. The voice recognition model is trained with the third acoustic feature subjected to the fifth labeling processing, so that the trained voice recognition model learns and stores the correspondence between the third acoustic feature and the training voice text sequence.
It should be noted that the above modules may be implemented by software or by hardware. For the latter, the following implementations are possible but not limiting: the modules are all located in the same processor; alternatively, the modules are located in different processors in any combination.
Embodiments of the present disclosure provide an electronic device.
Fig. 4 schematically shows a block diagram of an electronic device provided in an embodiment of the present disclosure.
Referring to fig. 4, an electronic device 400 provided in an embodiment of the present disclosure includes a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with each other through the communication bus 404; the memory 403 is used for storing a computer program; and the processor 401, when executing the program stored in the memory, is configured to implement the steps in any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected;
s2, inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature;
s3, extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively;
s4, splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature;
s5, performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, where the speech detection result includes: the speech to be detected is real speech, or the speech to be detected is generated speech.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected;
s2, inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature;
s3, extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively;
s4, splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature;
s5, performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, where the speech detection result includes: the speech to be detected is real speech, or the speech to be detected is generated speech.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present disclosure described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A detection method for generated speech, comprising:
acquiring a voice to be detected, and extracting a first acoustic feature and a second acoustic feature of the voice to be detected;
inputting the first acoustic feature into a speech recognition model, and outputting a text sequence corresponding to the first acoustic feature;
extracting a word vector and a sound vector of the text sequence through a word embedding model and a voice embedding model, respectively;
splicing the word vector and the sound vector to obtain a first fusion feature, inputting the first fusion feature into a prosodic rhythm prediction model, and outputting a prosodic rhythm feature;
and performing the splicing processing on the second acoustic feature and the prosodic rhythm feature to obtain a second fusion feature, inputting the second fusion feature into a speech detection model, and outputting a speech detection result, wherein the speech detection result comprises: the speech to be detected is real speech, or the speech to be detected is generated speech.
2. The method of claim 1, wherein before inputting the first fusion feature into the prosodic rhythm prediction model and outputting the prosodic rhythm feature, the method comprises:
acquiring a training speech data set, wherein the training speech data set comprises a plurality of training speech samples, and each training speech sample is real speech or generated speech;
extracting a third acoustic feature of each training speech sample in the training speech data set;
inputting the third acoustic feature into the speech recognition model, and outputting a training speech text sequence corresponding to the third acoustic feature;
extracting a training speech word vector and a training speech phonetic vector of the training speech text sequence through the word embedding model and the phonetic embedding model, respectively;
concatenating the training speech word vector and the training speech phonetic vector to obtain a third fused feature, and performing first labeling processing on the third fused feature;
and training the prosodic rhythm prediction model on the third fused feature after the first labeling processing by using a stochastic gradient descent algorithm.
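As a non-limiting sketch of the training step in claim 2, the loop below fits the prosodic rhythm prediction model with stochastic gradient descent. The frame-level prosody labels produced by the first labeling processing, the cross-entropy loss, and the hyper-parameters are assumptions of this example.

    import torch
    from torch import nn

    def train_prosody_model(prosody_model, third_fused_feats, prosody_labels,
                            epochs=10, lr=0.01):
        # third_fused_feats: list of (T, D) tensors (word + phonetic vectors);
        # prosody_labels:    list of (T,) long tensors from the first labeling.
        optimizer = torch.optim.SGD(prosody_model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for feat, label in zip(third_fused_feats, prosody_labels):
                optimizer.zero_grad()
                pred = prosody_model(feat)        # (T, num_prosody_classes)
                loss = criterion(pred, label)
                loss.backward()
                optimizer.step()
        return prosody_model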
3. The method of claim 1 or 2, wherein the prosodic rhythm prediction model comprises a multi-layer self-attention network, and each layer of the self-attention network comprises a plurality of self-attention heads.
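One possible reading of claim 3, sketched with standard Transformer encoder layers; the layer count, head count, feature dimension, and number of prosody classes are illustrative assumptions only.

    import torch
    from torch import nn

    class ProsodyRhythmPredictor(nn.Module):
        """Multi-layer self-attention network; each layer has several heads."""
        def __init__(self, input_dim=512, num_layers=4, num_heads=8,
                     num_prosody_classes=5):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.classifier = nn.Linear(input_dim, num_prosody_classes)

        def forward(self, fused):                        # (batch, T, input_dim)
            return self.classifier(self.encoder(fused))  # (batch, T, classes)

    # Example: 2 sequences of 20 fused word/phonetic vectors.
    out = ProsodyRhythmPredictor()(torch.randn(2, 20, 512))   # -> (2, 20, 5)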
4. The method of claim 1, wherein before inputting the second fused feature into a speech detection model and outputting a speech detection result, the method further comprises:
acquiring a training speech data set, wherein the training speech data set comprises a plurality of training speech samples, and each training speech sample is real speech or generated speech;
extracting a third acoustic feature and a fourth acoustic feature of each training speech sample in the training speech data set;
inputting the third acoustic feature into the speech recognition model, and outputting a training speech text sequence corresponding to the third acoustic feature;
extracting a training speech word vector and a training speech phonetic vector of the training speech text sequence through the word embedding model and the phonetic embedding model, respectively;
concatenating the training speech word vector and the training speech phonetic vector to obtain a third fused feature, inputting the third fused feature into the prosodic rhythm prediction model, and outputting a training speech prosodic rhythm feature;
concatenating the training speech prosodic rhythm feature and the fourth acoustic feature to obtain a fourth fused feature, and performing second labeling processing on the fourth fused feature;
and training the speech detection model on the fourth fused feature after the second labeling processing by using a stochastic gradient descent algorithm.
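A minimal sketch of the claim-4 training step, assuming the second labeling processing assigns each training sample a binary label (0 = real speech, 1 = generated speech) and that the detector is trained with SGD and cross-entropy; these choices are illustrative, not stated by the patent.

    import torch
    from torch import nn

    def train_detection_model(detection_model, fourth_fused_feats, labels,
                              epochs=10, lr=0.01):
        # fourth_fused_feats: list of (T, D) tensors (prosodic rhythm feature
        #                     concatenated with the fourth acoustic feature);
        # labels:             list of scalar long tensors, 0 = real, 1 = generated.
        optimizer = torch.optim.SGD(detection_model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for feat, label in zip(fourth_fused_feats, labels):
                optimizer.zero_grad()
                logits = detection_model(feat.unsqueeze(0))   # (1, 2)
                loss = criterion(logits, label.view(1))
                loss.backward()
                optimizer.step()
        return detection_model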
5. The method of claim 1 or 4, wherein the speech detection model comprises a plurality of time-delay neural network layers, a plurality of residual network layers, and a fully connected layer.
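The architecture named in claim 5 could look roughly like the sketch below, in which the time-delay neural network (TDNN) layers are modeled as dilated 1-D convolutions; kernel sizes, dilations, channel counts, and the pooling step are all assumptions of this example.

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        """Simple 1-D convolutional residual layer (illustrative)."""
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(x + self.conv(x))

    class SpeechDetectionModel(nn.Module):
        """TDNN layers -> residual layers -> fully connected layer -> 2 scores."""
        def __init__(self, input_dim=256, channels=128, num_classes=2):
            super().__init__()
            self.tdnn = nn.Sequential(
                nn.Conv1d(input_dim, channels, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3, dilation=3), nn.ReLU(),
            )
            self.res_layers = nn.Sequential(ResidualBlock(channels),
                                            ResidualBlock(channels))
            self.fc = nn.Linear(channels, num_classes)

        def forward(self, fused):                  # (batch, T, input_dim)
            x = self.tdnn(fused.transpose(1, 2))   # -> (batch, channels, T')
            x = self.res_layers(x).mean(dim=2)     # temporal average pooling
            return self.fc(x)                      # (batch, 2): [real, generated]

    scores = SpeechDetectionModel()(torch.randn(1, 100, 256))   # -> (1, 2)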
6. The method of claim 1, wherein before extracting the word vector and the phonetic vector of the text sequence through the word embedding model and the phonetic embedding model, respectively, the method further comprises:
acquiring a training speech data set, wherein the training speech data set comprises a plurality of training speech samples, and each training speech sample is real speech or generated speech;
extracting a third acoustic feature of each training speech sample in the training speech data set;
inputting the third acoustic feature into the speech recognition model, and outputting a training speech text sequence corresponding to the third acoustic feature;
and performing third labeling processing and fourth labeling processing on the training speech text sequence, respectively, training the word embedding model with the training speech text sequence after the third labeling processing, and training the phonetic embedding model with the training speech text sequence after the fourth labeling processing.
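Claim 6 does not state how the two embedding models are fitted, so the sketch below simply trains an embedding table against per-token labels with a linear head and SGD; the proxy task, vocabulary size, and dimensions are assumptions. The same routine would be run twice, once with the third labeling (word embedding model) and once with the fourth labeling (phonetic embedding model).

    import torch
    from torch import nn

    def train_embedding_model(vocab_size, emb_dim, token_seqs, label_seqs,
                              num_labels, epochs=5, lr=0.05):
        # token_seqs: list of (T,) long tensors over the text-sequence vocabulary;
        # label_seqs: list of (T,) long tensors from the third or fourth labeling.
        embedding = nn.Embedding(vocab_size, emb_dim)
        head = nn.Linear(emb_dim, num_labels)
        optimizer = torch.optim.SGD(list(embedding.parameters()) +
                                    list(head.parameters()), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for tokens, labels in zip(token_seqs, label_seqs):
                optimizer.zero_grad()
                loss = criterion(head(embedding(tokens)), labels)
                loss.backward()
                optimizer.step()
        return embedding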
7. The method of claim 1, wherein before inputting the first acoustic feature into the speech recognition model and outputting the text sequence corresponding to the first acoustic feature, the method further comprises:
acquiring a training speech data set, wherein the training speech data set comprises a plurality of training speech samples, and each training speech sample is real speech or generated speech;
extracting a third acoustic feature of each training speech sample in the training speech data set;
and performing fifth labeling processing on the third acoustic feature, and training the speech recognition model with the third acoustic feature after the fifth labeling processing.
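Claim 7 leaves the speech recognition training objective unspecified; purely for illustration, the sketch below assumes the fifth labeling processing pairs each acoustic feature sequence with its token transcript and trains with a CTC loss and SGD. All interfaces and shapes are hypothetical.

    import torch
    from torch import nn

    def train_asr_model(asr_model, acoustic_feats, transcripts, epochs=5, lr=0.01):
        # acoustic_feats: list of (T, D) third-acoustic-feature tensors;
        # transcripts:    list of (L,) long tensors of token ids (blank id = 0).
        optimizer = torch.optim.SGD(asr_model.parameters(), lr=lr)
        ctc = nn.CTCLoss(blank=0)
        for _ in range(epochs):
            for feat, text in zip(acoustic_feats, transcripts):
                optimizer.zero_grad()
                log_probs = asr_model(feat)          # (T, vocab), log-softmax output
                loss = ctc(log_probs.unsqueeze(1),   # (T, batch=1, vocab)
                           text.unsqueeze(0),        # (1, L)
                           torch.tensor([log_probs.size(0)]),
                           torch.tensor([text.size(0)]))
                loss.backward()
                optimizer.step()
        return asr_model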
8. An apparatus for detecting generated speech, comprising:
a first extraction module, configured to acquire speech to be detected and extract a first acoustic feature and a second acoustic feature of the speech to be detected;
a first model module, configured to input the first acoustic feature into a speech recognition model and output a text sequence corresponding to the first acoustic feature;
a second extraction module, configured to extract a word vector and a phonetic vector of the text sequence through a word embedding model and a phonetic embedding model, respectively;
a second model module, configured to concatenate the word vector and the phonetic vector to obtain a first fused feature, input the first fused feature into a prosodic rhythm prediction model, and output a prosodic rhythm feature;
and a third model module, configured to concatenate the second acoustic feature and the prosodic rhythm feature to obtain a second fused feature, input the second fused feature into a speech detection model, and output a speech detection result, wherein the speech detection result indicates that the speech to be detected is real speech or that the speech to be detected is generated speech.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method of any one of claims 1 to 7 when executing the computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
CN202111383856.8A 2021-11-22 2021-11-22 Detection method and device for generated voice, electronic equipment and storage medium Active CN113808579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111383856.8A CN113808579B (en) 2021-11-22 2021-11-22 Detection method and device for generated voice, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113808579A true CN113808579A (en) 2021-12-17
CN113808579B CN113808579B (en) 2022-03-08

Family

ID=78937493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111383856.8A Active CN113808579B (en) 2021-11-22 2021-11-22 Detection method and device for generated voice, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113808579B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110046958A1 (en) * 2009-08-21 2011-02-24 Sony Corporation Method and apparatus for extracting prosodic feature of speech signal
CN105551481A (en) * 2015-12-21 2016-05-04 百度在线网络技术(北京)有限公司 Rhythm marking method of voice data and apparatus thereof
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
US20200394998A1 (en) * 2018-08-02 2020-12-17 Neosapience, Inc. Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN110299131A (en) * 2019-08-01 2019-10-01 苏州奇梦者网络科技有限公司 A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion
CN110782918A (en) * 2019-10-12 2020-02-11 腾讯科技(深圳)有限公司 Voice rhythm evaluation method and device based on artificial intelligence
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112464655A (en) * 2020-11-27 2021-03-09 绍兴达道生涯教育信息咨询有限公司 Word vector representation method, device and medium combining Chinese characters and pinyin
CN113488073A (en) * 2021-07-06 2021-10-08 浙江工业大学 Multi-feature fusion based counterfeit voice detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG PENGYUAN et al.: "Chinese prosodic structure prediction based on a pre-trained language representation model", Journal of Tianjin University (Science and Technology) *
WANG HUAPENG et al.: "Speech emotion recognition with fused GFCC and prosodic features", Journal of Criminal Investigation Police University of China *

Also Published As

Publication number Publication date
CN113808579B (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN107657017B (en) Method and apparatus for providing voice service
CN111667814B (en) Multilingual speech synthesis method and device
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN108428446A (en) Audio recognition method and device
CN107943914A (en) Voice information processing method and device
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN111344717B (en) Interactive behavior prediction method, intelligent device and computer readable storage medium
CN109376363A (en) A kind of real-time voice interpretation method and device based on earphone
CN115238045B (en) Method, system and storage medium for extracting generation type event argument
CN113096634A (en) Speech synthesis method, apparatus, server and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN114637843A (en) Data processing method and device, electronic equipment and storage medium
CN113555007B (en) Voice splicing point detection method and storage medium
CN113178200B (en) Voice conversion method, device, server and storage medium
CN110059174A (en) Inquiry guidance method and device
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN113808579B (en) Detection method and device for generated voice, electronic equipment and storage medium
CN113724693B (en) Voice judging method and device, electronic equipment and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN114283786A (en) Speech recognition method, device and computer readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN115238066A (en) User intention recognition method, device, dialogue system, device and storage medium
CN112712793A (en) ASR (error correction) method based on pre-training model under voice interaction and related equipment
CN111798849A (en) Robot instruction identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant