WO2021135454A1 - Method, device and computer-readable storage medium for recognizing fake speech - Google Patents

Method, device and computer-readable storage medium for recognizing fake speech

Info

Publication number
WO2021135454A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature map
recognized
layer
convolutional layer
Prior art date
Application number
PCT/CN2020/118450
Other languages
English (en)
Chinese (zh)
Inventor
张超
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021135454A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium.
  • ASV (automatic speaker verification) or voiceprint systems themselves do not have the ability to recognize fake speech, and with the maturing of text-to-speech (TTS) speech synthesis technology, fake speech on the voice side is increasingly difficult to recognize, including playback of recordings made with high-quality recording equipment, the latest speech synthesis technology, and so on.
  • the inventor realized that existing implementations have at least the following problem: how to recognize fake speech is an urgent problem to be solved.
  • the purpose of the embodiments of this application is to propose a method, recognition device, computer device, and computer-readable storage medium for fake speech recognition, so as to close the security vulnerabilities that may exist in the prior art due to the lack of means for recognizing fake speech.
  • the embodiments of the present application provide a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium, and the following technical solutions are adopted:
  • an embodiment of the present application provides a method for recognizing fake speech, which may include:
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into the target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • the recognition method may further include:
  • the voice in the real voice data set is converted into voice signal feature maps of a first category by using STFT transformation, and the voice in the fake voice data set is converted into voice signal feature maps of a second category, so as to obtain a first voice signal feature map data set that includes both the first-category and second-category voice signal feature maps.
  • the recognition method may further include:
  • the training of the initial DenseNet network using the first speech signal feature map data set may include: adjusting the weight parameters of each layer of the initial DenseNet network based on the loss function until the loss function is less than a preset value, and locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  • the target DenseNet network model may sequentially include a first convolutional layer, a first channel expansion module, a first transition layer, a second channel expansion module, a second transition layer, a third channel expansion module, a first fully connected layer, and a second fully connected layer. The first convolutional layer, the first channel expansion module, the first transition layer, the second channel expansion module, the second transition layer, and the third channel expansion module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map. The first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and output the binary classification result according to the extracted features.
  • the first channel expansion module, the second channel expansion module, and the third channel expansion module may each include 4 upper structures and 4 lower structures. Each upper structure may sequentially include a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block; each lower structure may sequentially include a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  • the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1. The second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature map; the fourth convolutional layer is used to splice the four feature maps output by the third convolutional layers and input the result into the first SE block for processing; and the seventh convolutional layer is used to splice the four feature maps output by the sixth convolutional layers and input the result into the second SE block for processing. The first SE block is used to assign each feature map input from the fourth convolutional layer a corresponding weight according to its channel, and the second SE block is used to assign each feature map input from the seventh convolutional layer a corresponding weight according to its channel.
  • performing STFT conversion processing on the voice to be recognized may include:
  • framing and windowing operations are performed on the voice to be recognized, and then STFT conversion processing is performed.
  • an embodiment of the present application provides a fake voice recognition device, and the recognition device may include:
  • the first acquisition module is used to acquire the voice to be recognized
  • the first conversion module is configured to perform STFT conversion processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized;
  • the processing module is used to input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of the voice to be recognized as a real voice or a fake voice.
  • an embodiment of the present application provides a computer device, including a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the processor executes the computer-readable instructions, the following steps of the method for recognizing fake speech are implemented:
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • an embodiment of the present application provides a computer-readable storage medium having computer-readable instructions stored on the computer-readable storage medium.
  • when the computer-readable instructions are executed by a processor, the following steps of the method for recognizing fake speech are implemented:
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • this solution uses the target DenseNet network model for speech recognition. Based on the self-learning function of the neural network, it provides a highly accurate method of automatically identifying fake speech and reduces the security vulnerabilities of ASV or voiceprint systems.
  • FIG. 1 is a schematic diagram of an embodiment of a method for recognizing fake speech provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a target DenseNet network model provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of the first channel expansion module in the target DenseNet network model shown in FIG. 2;
  • FIG. 4 is a schematic structural diagram of the first SE block in the first channel expansion module shown in FIG. 3;
  • FIG. 5 is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of the present application;
  • FIG. 7A is a schematic diagram of an embodiment of a fake speech recognition device provided by an embodiment of the present application;
  • FIG. 7B is a schematic diagram of another embodiment of a fake speech recognition device provided by an embodiment of the present application;
  • FIG. 7C is a schematic diagram of another embodiment of a fake speech recognition device provided by an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the method for recognizing fake speech includes the following steps:
  • Step S101 Obtain a voice to be recognized.
  • the method for recognizing fake voices runs on an electronic device (for example, a server or terminal device) that performs the fake voice recognition, and the electronic device can collect the voice data to be recognized.
  • the acquired speech to be recognized can be stored in the blockchain.
  • Step S102 Perform STFT conversion processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized.
  • the electronic device may perform a short-time Fourier transform (STFT) on the voice to be recognized, so as to convert the voice data to be recognized into a feature map of the voice signal to be recognized.
  • the conversion process may sequentially include framing, Hamming windowing, and STFT conversion operations.
  • framing refers to segmenting the collected speech to be recognized in the time domain, and then dividing the segmented speech obtained in the previous step into multiple frames according to the preset duration of one frame.
  • Windowing uses a window function to process each frame of voice data to obtain a time segment; the truncated time segment is then extended periodically to obtain a virtually infinite signal, so that mathematical processing such as STFT transformation and correlation analysis can be performed on the signal.
  • the corresponding window function can be selected according to the waveform of the voice to be recognized.
  • the selection of the specific window function is not specifically limited here.
  • the electronic device further performs STFT transformation processing on each frame of voice data after the windowing operation, and converts the voice to be recognized in the time domain into a feature map of the voice signal to be recognized.
  • in the feature map, the horizontal axis represents the time dimension and the vertical axis represents the frequency dimension.
  • for example, taking 5 seconds as one segment, the voice to be recognized can be divided into 10 segments.
  • the common frame length is generally 20-50 milliseconds.
  • 25 milliseconds can be selected as the frame length, and each segmented voice can be divided into 200 frames.
  • the electronic device performs a windowing operation on each frame, and then performs STFT transformation on each frame of voice data after windowing, and converts it to obtain a feature map of the voice signal to be recognized.
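  • As an illustration of this framing, Hamming-windowing, and STFT pipeline, the following is a minimal Python sketch; the 16 kHz sample rate and the use of non-overlapping frames (which makes a 5-second segment yield exactly 200 frames of 25 milliseconds, matching the example above) are assumptions for the example, not requirements of this application.

    import numpy as np

    def voice_to_feature_map(signal, sr=16000, frame_ms=25):
        # Split the waveform into non-overlapping frames, apply a Hamming
        # window to each frame, and take the magnitude of the per-frame FFT,
        # i.e. a short-time Fourier transform.
        frame_len = int(sr * frame_ms / 1000)
        n_frames = len(signal) // frame_len
        frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
        spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1))
        # Transpose so that the vertical axis is frequency and the horizontal
        # axis is time, matching the feature map described above.
        return spec.T

    # A 5-second, 16 kHz segment becomes a (201, 200) feature map.
    feature_map = voice_to_feature_map(np.random.randn(5 * 16000))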
  • Step S103 input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of the voice to be recognized as real voice or fake voice.
  • the electronic device inputs the feature map of the voice signal to be recognized obtained in step S102 into the trained target DenseNet network model, so as to output the binary classification result of the voice signal to be recognized as a real voice or a fake voice.
  • the target DenseNet network model may be trained in advance by the electronic device, or it may be sent to the electronic device by another electronic device after the training is completed.
  • the target DenseNet network model may be an improved DenseNet network model.
  • the main improvement to the DenseNet network lies in replacing the dense block of the existing DenseNet network with a custom channel expansion block structure.
  • the improved target DenseNet network model can significantly reduce the number of parameters of the training model.
  • the parameter amount of the existing DenseNet network is 1.71×10^5 with a floating-point computation amount of 7.16×10^9, while the parameter amount of the improved target DenseNet network model is 8.2×10^4 with a floating-point computation amount of 3.53×10^9.
  • the existing DenseNet network is a common network structure at present, and this embodiment will not elaborate too much.
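  • For reference, parameter amounts like those quoted above can be checked for any concrete model with a small utility; this sketch assumes a PyTorch implementation of the networks, which this application does not prescribe.

    import torch.nn as nn

    def count_parameters(model: nn.Module) -> int:
        # Total number of trainable parameter elements, the quantity compared
        # above (1.71x10^5 for the original DenseNet vs 8.2x10^4 here).
        return sum(p.numel() for p in model.parameters() if p.requires_grad)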
  • the target DenseNet network model obtained by improving the existing DenseNet network can refer to Figure 2.
  • the target DenseNet network model can include:
  • a first convolutional layer, a first channel expansion module, a first transition layer, a second channel expansion module, a second transition layer, a third channel expansion module, a first fully connected layer (fully connected layer, FC), and a second fully connected layer.
  • the first convolutional layer is a convolutional layer whose kernel size can be 1 ⁇ 1.
  • the first transition layer and the second transition layer are each composed of a convolutional layer and a pooling layer.
  • the first convolutional layer, the first channel expansion module, the first transition layer, the second channel expansion module, the second transition layer, and the third channel expansion module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output The first feature map.
  • the first fully connected layer and the second fully connected layer can convert the input data into different categories, which can be used to further extract the features of the first feature map, and output the discrimination result of the two classifications according to the extracted features.
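  • To make this layer order concrete, the following is a minimal PyTorch sketch of the overall structure; the channel widths, the pooling type in the transition layers, and the global pooling before the fully connected layers are assumptions for the example, and the channel expansion module itself is passed in as a factory (a sketch of its internal structure appears after the SE block description below).

    import torch
    import torch.nn as nn

    class TransitionLayer(nn.Module):
        # A convolutional layer followed by a pooling layer, as described above.
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=1)
            self.pool = nn.AvgPool2d(kernel_size=2)  # pooling type assumed

        def forward(self, x):
            return self.pool(self.conv(x))

    class TargetDenseNetSkeleton(nn.Module):
        def __init__(self, make_block, in_channels=1, width=64, hidden=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, width, kernel_size=1),  # first conv layer
                make_block(width), TransitionLayer(width),     # module 1, transition 1
                make_block(width), TransitionLayer(width),     # module 2, transition 2
                make_block(width),                             # module 3
            )
            self.fc1 = nn.Linear(width, hidden)  # first fully connected layer
            self.fc2 = nn.Linear(hidden, 2)      # second FC: real vs fake

        def forward(self, x):
            f = self.features(x).mean(dim=(2, 3))  # global pooling (assumed)
            return self.fc2(torch.relu(self.fc1(f)))

    # Shape check with a trivial stand-in for the channel expansion module:
    model = TargetDenseNetSkeleton(lambda c: nn.Identity())
    logits = model(torch.randn(1, 1, 201, 200))  # -> shape (1, 2)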
  • the aforementioned first channel expansion module, second channel expansion module, and third channel expansion module have the same structure, each including 4 upper structures and 4 lower structures.
  • Figure 3 is a schematic diagram of the structure of the first channel expansion module, which may include: 4 upper layer structures, each including a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block (Squeeze-and-Excitation block); and 4 lower layer structures, each including a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  • the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer may all be 1 ⁇ 1 convolutional layers.
  • the second convolutional layer and the fifth convolutional layer can be used to perform 1×1 convolution on the received feature maps, reduce the number of channels of the input feature maps, and feed the output feature maps in parallel to the 4 third convolutional layers and the 4 sixth convolutional layers, respectively. For example, if the feature map input to the second convolutional layer has 64 channels, a feature map with 32 channels can be output after the 1×1 convolution of the second convolutional layer, and the output feature map can be fed in parallel to the 4 third convolutional layers of 3×3 kernel size.
  • the fourth convolutional layer and the seventh convolutional layer can respectively perform splicing operations (stacking by channel) on the feature maps output by the 4 third convolutional layers and the 4 sixth convolutional layers, and the feature maps output after the splicing operation are input to the first SE block and the second SE block for processing.
  • the first SE block and the second SE block have the same structure; the first SE block is taken as an example below.
  • FIG. 4 is a schematic diagram of the structure of the first SE block in an embodiment of the application.
  • the SE block can in turn include: a global pooling layer, a fully connected layer, an activation layer (ReLU), a fully connected layer, a sigmoid layer, and a scale layer.
  • where C represents the number of channels and r is a user-set parameter, which can be set to 16.
  • the brief flow of the first SE block processing can include: the feature maps of C channels output from the fourth convolutional layer are input into the first SE block, and C weights W corresponding to the C channels are calculated through the layers of the first SE block; after that, in the scale layer, the feature map of each channel of the original input is multiplied by its weight W to output the weighted feature maps.
  • the first SE block and the second SE block are used to learn feature weights according to the loss function, so that the weights of effective channels in the feature map are adjusted upward and the weights of invalid or less effective channels are reduced, making model training more effective.
  • in this way, the weight corresponding to each channel of the feature map is allocated and adjusted in the network. For example, if there are 64 channels in the network, in the prior art these channels all contribute equally to the network, that is, with the same weight; if an SE block is added, different weights can be assigned to achieve better results.
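  • The following is a minimal PyTorch sketch of the SE block flow just described, together with one upper structure of the channel expansion module; the 64-to-32 channel reduction and the 3×3 branch kernel size follow the example above, but the output channel count of the upper structure is an assumption. The lower structure (fifth, sixth, and seventh convolutional layers plus the second SE block) has the same shape.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        # Global pooling -> FC -> ReLU -> FC -> sigmoid -> scale, with r = 16.
        def __init__(self, channels, r=16):
            super().__init__()
            self.fc1 = nn.Linear(channels, channels // r)
            self.fc2 = nn.Linear(channels // r, channels)

        def forward(self, x):
            n, c, _, _ = x.shape
            w = x.mean(dim=(2, 3))          # global pooling: (N, C)
            w = torch.relu(self.fc1(w))     # fully connected + activation layer
            w = torch.sigmoid(self.fc2(w))  # C per-channel weights in (0, 1)
            return x * w.view(n, c, 1, 1)   # scale layer: reweight each channel

    class UpperStructure(nn.Module):
        # 1x1 conv halves the channels (64 -> 32 in the example above), four
        # parallel 3x3 convs, channel-wise splicing, 1x1 conv, then SE block.
        def __init__(self, channels):
            super().__init__()
            mid = channels // 2
            self.reduce = nn.Conv2d(channels, mid, kernel_size=1)      # 2nd conv
            self.branches = nn.ModuleList(
                [nn.Conv2d(mid, mid, kernel_size=3, padding=1) for _ in range(4)]
            )                                                          # 3rd convs
            self.merge = nn.Conv2d(4 * mid, channels, kernel_size=1)   # 4th conv
            self.se = SEBlock(channels)                                # 1st SE block

        def forward(self, x):
            y = self.reduce(x)
            y = torch.cat([b(y) for b in self.branches], dim=1)  # splice by channel
            return self.se(self.merge(y))

    # A 64-channel feature map passes through with its shape preserved:
    out = UpperStructure(64)(torch.randn(1, 64, 50, 50))  # -> (1, 64, 50, 50)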
  • the speech to be recognized and its binary classification discrimination result can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Blockchain, essentially a decentralized database, is a series of data blocks associated with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the fake speech recognition method provided in the embodiments of the present application can be applied in the fields of smart medical care, smart government affairs, smart education, or technology finance.
  • the fake voice recognition method can be used to identify and verify the collected voice to identify whether it is a real person's voice, so as to avoid system security loopholes caused by fake voice.
  • this solution uses the target DenseNet network model for speech recognition. Based on the self-learning function of the neural network, it provides a highly accurate method of automatically identifying fake speech and reduces the security vulnerabilities of ASV or voiceprint systems.
  • referring to FIG. 5, which is a schematic diagram of another embodiment of a method for recognizing fake speech according to an embodiment of the present application:
  • the above-mentioned electronic device may also perform the training process of the initial DenseNet network, and obtain the target DenseNet network model when the training result reaches the expected target.
  • the process of obtaining the target DenseNet network model by training may include:
  • S501 Acquire a real voice data set and a fake voice data set.
  • the electronic device can obtain the real voice data set and the fake voice data set from the external device.
  • the real voice data set can include real-person voice data directly collected under different conditions such as different ages, genders, regions, and emotions
  • the fake voice data set can include fake voice imitating a human generated by speech synthesis (text-to-speech) technology, voice-conversion fake voice (voice conversion technology, which uses a segment of the target person's voice to convert any non-target person's voice into the target person's voice), and fake voice constructed using part of a real person's voice.
  • S502 Use STFT transformation to convert the voice in the real voice data set into voice signal feature maps of the first category, and convert the voice in the fake voice data set into voice signal feature maps of the second category, so as to obtain a first voice signal feature map data set that includes the first-category and second-category voice signal feature maps.
  • specifically, the electronic device uses STFT transformation to convert the voice in the real voice data set obtained above into voice signal feature maps of the first category, and converts the voice in the fake voice data set into voice signal feature maps of the second category, to obtain a first voice signal feature map data set including both categories of feature maps.
  • STFT transformation is similar to the processing method of step S102 in the embodiment shown in FIG. 1, and will not be repeated here.
  • after obtaining the first-category and second-category voice signal feature maps, the electronic device also needs to respond to the user's labeling operation, set labels for the different categories of voice signal feature maps to generate a label file, and save it into the first voice signal feature map data set.
  • the label is set in a two-category form; for example, it can be set to 0 or 1, where 0 represents a first-category voice signal feature map and 1 represents a second-category voice signal feature map.
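  • A minimal sketch of assembling this labeled data set, reusing the voice_to_feature_map function from the STFT sketch above; fixed-length segments are assumed so that the feature maps stack into one array.

    import numpy as np

    def build_first_dataset(real_voices, fake_voices):
        # Convert each voice to an STFT feature map and attach two-category
        # labels: 0 for the first category (real), 1 for the second (fake).
        feature_maps, labels = [], []
        for voice in real_voices:
            feature_maps.append(voice_to_feature_map(voice))
            labels.append(0)
        for voice in fake_voices:
            feature_maps.append(voice_to_feature_map(voice))
            labels.append(1)
        return np.stack(feature_maps), np.array(labels)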
  • S503 Use the first speech signal feature map data set to train the initial DenseNet network, and adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function is less than a preset value, lock the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  • specifically, the electronic device uses the first speech signal feature map data set to train the initial DenseNet network and adjusts the weight parameters of each layer of the initial DenseNet network based on the loss function; once the loss function is less than the preset value, the weight parameters of each layer of the initial DenseNet network are locked to obtain the target DenseNet network model.
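  • The following is a minimal PyTorch training-loop sketch of this step; the optimizer, learning rate, and the concrete preset value are assumptions, since this application only specifies training until the loss function falls below a preset value and then locking the weights.

    import torch
    import torch.nn as nn

    def train_to_target(model, loader, preset_value=0.05, lr=1e-3, max_epochs=100):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()  # two classes: real (0) vs fake (1)
        for _ in range(max_epochs):
            running, count = 0.0, 0
            for feature_maps, labels in loader:  # feature_maps: (N, 1, F, T)
                optimizer.zero_grad()
                loss = criterion(model(feature_maps), labels)
                loss.backward()    # adjust the weight parameters of each layer
                optimizer.step()
                running += loss.item() * labels.size(0)
                count += labels.size(0)
            if running / count < preset_value:
                break              # loss below the preset value: stop training
        for p in model.parameters():
            p.requires_grad = False  # "lock" the weight parameters of each layer
        return model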
  • referring to FIG. 6, which is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of the present application.
  • the method for recognizing fake speech may include:
  • S601 Perform a masking operation on part of the frequency features of part of the voice signal feature maps in the first voice signal feature map data set, so as to convert the first voice signal feature map data set into a second voice signal feature map data set.
  • the electronic device may perform a mask operation on some features of a part of the voice signal feature maps in the first voice signal feature map data set.
  • a continuous part of the features of the voice signal feature map can be reset to 0.
  • for example, if the frequency dimension of the original voice signal feature map has 256 bins covering the range from 0 to 8000 Hz, then 30 of the 256 bins can be randomly selected and set to zero, so that part of the frequency information between 0 and 8000 Hz is erased, which increases the unknownness of the data for the model.
  • rather than using random dropout inside the network, this masking greatly improves the generalization performance of the network and further improves the accuracy of network recognition by about 30%.
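  • A minimal NumPy sketch of this frequency masking, following the description above of resetting a continuous band of frequency bins to zero (30 of 256 bins in the example); the choice of a contiguous band with a randomly placed start is one reading of the description and is an assumption here.

    import numpy as np

    def mask_frequency_band(feature_map, band_width=30, rng=None):
        # Zero out a contiguous band of frequency bins (rows) of an (F, T)
        # feature map, erasing part of the 0-8000 Hz information.
        rng = rng or np.random.default_rng()
        masked = feature_map.copy()
        start = rng.integers(0, feature_map.shape[0] - band_width + 1)
        masked[start : start + band_width, :] = 0.0
        return masked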
  • accordingly, step S503 (using the first speech signal feature map data set to train the initial DenseNet network, adjusting the weight parameters of each layer of the initial DenseNet network based on the loss function until the loss function is smaller than the preset value, and locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model) can include:
  • S602 Use the second speech signal feature map data set to train the initial DenseNet network, and adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function is less than the preset value, lock the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  • step S602 in this embodiment is similar to step S503 in the embodiment shown in FIG. 5, and will not be repeated here.
  • the feature masking method increases the unknownness of the data for the model, greatly improves the generalization performance of the network, and thereby improves the recognition ability of the target DenseNet network model for unknown fake speech.
  • FIG. 7A is a schematic structural diagram of a fake speech recognition device provided by an embodiment of the application, and the recognition device may include:
  • the first obtaining module 701 is configured to obtain a voice to be recognized
  • the first conversion module 702 is configured to perform STFT conversion processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized;
  • the processing module 703 is configured to input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of the voice to be recognized as a real voice or a fake voice.
  • FIG. 7B is another schematic structural diagram of a fake voice recognition device provided by an embodiment of the application, and the fake voice recognition device may further include:
  • the second obtaining module 704 is used to obtain a real voice data set and a fake voice data set
  • the second transformation module 705 is configured to use STFT transformation to convert the voice in the real voice data set into voice signal feature maps of the first category, and convert the voice in the fake voice data set into voice signal feature maps of the second category, so as to obtain a first voice signal feature map data set that may include the first-category and second-category voice signal feature maps;
  • the training module 706 is used to train the initial DenseNet network using the first speech signal feature map data set, and to adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function is less than the preset value, the weight parameters of each layer of the initial DenseNet network are locked to obtain the target DenseNet network model.
  • FIG. 7C is another schematic structural diagram of a fake voice recognition device provided by an embodiment of the application, and the fake voice recognition device may further include:
  • the editing module 707 is configured to perform a masking operation on part of the frequency features of part of the speech signal feature maps in the first speech signal feature map data set, so as to convert the first speech signal feature map data set into a second speech signal feature map data set;
  • the training module 706 is also specifically configured to train the initial DenseNet network by using the second speech signal feature map data set.
  • the target DenseNet network model may sequentially include a first convolutional layer, a first channel expansion module, a first transition layer, a second channel expansion module, a second transition layer, a third channel expansion module, a first fully connected layer, and a second fully connected layer. The first convolutional layer, the first channel expansion module, the first transition layer, the second channel expansion module, the second transition layer, and the third channel expansion module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map. The first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and output the binary classification result according to the extracted features.
  • the first channel expansion module, the second channel expansion module, and the third channel expansion module may each include 4 upper structures and 4 lower structures. Each upper structure may sequentially include a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block; each lower structure may sequentially include a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  • the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1. The second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature map; the fourth convolutional layer is used to splice the four feature maps output by the third convolutional layers and input the result into the first SE block for processing; and the seventh convolutional layer is used to splice the four feature maps output by the sixth convolutional layers and input the result into the second SE block for processing. The first SE block is used to assign each feature map input from the fourth convolutional layer a corresponding weight according to its channel, and the second SE block is used to assign each feature map input from the seventh convolutional layer a corresponding weight according to its channel.
  • the first conversion module 702 is specifically configured to perform STFT conversion processing after performing framing and windowing operations on the voice to be recognized.
  • FIG. 8 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 8 includes a memory 81, a processor 82, and a network interface 83 that are connected to each other in communication through a system bus. It should be pointed out that the figure only shows the computer device 8 with components 81-83, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 81 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 81 may be an internal storage unit of the computer device 8, such as a hard disk or a memory of the computer device 8.
  • the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk equipped on the computer device 8, a smart media card (SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device.
  • the memory 81 is generally used to store the operating system and various application software installed in the computer device 8, for example, to implement the steps of the method for recognizing fake speech as described below:
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • the memory 81 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 82 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 82 is generally used to control the overall operation of the computer device 8.
  • the processor 82 is configured to run the computer-readable instructions or processed data stored in the memory 81, for example, to run the computer-readable instructions of the fake voice recognition method in the embodiments shown in FIG. 1, 5, or 6.
  • the network interface 83 may include a wireless network interface or a wired network interface, and the network interface 83 is generally used to establish a communication connection between the computer device 8 and other electronic devices.
  • the processor 82 on the computer device 8 executes the computer-readable instructions of the method for recognizing fake speech in the embodiments shown in FIG. 1, 5, or 6, thereby providing a neural-network-based method for automatically identifying fake voice.
  • the present application also provides another implementation manner, that is, a computer-readable storage medium is provided, where the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the method for recognizing fake speech as described below :
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • this application can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multi-processor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
  • This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network.
  • program modules can be located in local and remote computer storage media including storage devices.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • the technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the recognition method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

According to embodiments, the present application relates to the field of artificial intelligence and provides a method for recognizing fake speech, comprising: obtaining the voice to be recognized; performing STFT conversion processing on the voice to be recognized, and converting the voice to be recognized into a feature map of the voice signal to be recognized; and inputting the feature map of the voice signal to be recognized into a target DenseNet network model, and outputting a binary classification result of the voice to be recognized as a real voice or a fake voice. The present application also relates to a fake voice recognition device, a computer device, and a storage medium. In addition, the present application relates to blockchain technology; the voice data obtained from a user and the two-class discrimination result can be stored in the blockchain. This solution uses a target DenseNet network model for speech recognition and, based on the self-learning function of neural networks, provides a highly accurate method for automatically recognizing fake speech, reducing the security vulnerabilities of ASV or voiceprint systems. The present application can be applied to fields such as smart medical care, smart government affairs, smart education, and technology finance.
PCT/CN2020/118450 2020-07-16 2020-09-28 Method, device and computer-readable storage medium for recognizing fake speech WO2021135454A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010688484.9 2020-07-16
CN202010688484.9A CN111933154B (zh) 2020-07-16 2020-07-16 Method, device and computer-readable storage medium for recognizing fake speech

Publications (1)

Publication Number Publication Date
WO2021135454A1 true WO2021135454A1 (fr) 2021-07-08

Family

ID=73313228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118450 WO2021135454A1 (fr) 2020-07-16 2020-09-28 Method, device and computer-readable storage medium for recognizing fake speech

Country Status (2)

Country Link
CN (1) CN111933154B (fr)
WO (1) WO2021135454A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220172739A1 (en) * 2020-12-02 2022-06-02 Google Llc Self-Supervised Speech Representations for Fake Audio Detection

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327621A (zh) * 2021-06-09 2021-08-31 携程旅游信息技术(上海)有限公司 Model training method, user recognition method, system, device and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767776A (zh) * 2019-01-14 2019-05-17 广东技术师范学院 Spoofed speech detection method based on dense neural network
US20190172476A1 (en) * 2017-12-04 2019-06-06 Apple Inc. Deep learning driven multi-channel filtering for speech enhancement
US20200005046A1 (en) * 2018-07-02 2020-01-02 Adobe Inc. Brand safety in video content

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108281158A (zh) * 2018-01-12 2018-07-13 平安科技(深圳)有限公司 Deep learning-based voice liveness detection method, server and storage medium
CN110767218A (zh) * 2019-10-31 2020-02-07 南京励智心理大数据产业研究院有限公司 End-to-end speech recognition method, system, apparatus and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190172476A1 (en) * 2017-12-04 2019-06-06 Apple Inc. Deep learning driven multi-channel filtering for speech enhancement
US20200005046A1 (en) * 2018-07-02 2020-01-02 Adobe Inc. Brand safety in video content
CN109767776A (zh) * 2019-01-14 2019-05-17 广东技术师范学院 Spoofed speech detection method based on dense neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220172739A1 (en) * 2020-12-02 2022-06-02 Google Llc Self-Supervised Speech Representations for Fake Audio Detection
US11756572B2 (en) * 2020-12-02 2023-09-12 Google Llc Self-supervised speech representations for fake audio detection

Also Published As

Publication number Publication date
CN111933154A (zh) 2020-11-13
CN111933154B (zh) 2024-02-13

Similar Documents

Publication Publication Date Title
WO2021208287A1 (fr) Procédé et appareil de détection d'activité vocale pour reconnaissance d'émotion, dispositif électronique et support de stockage
WO2020177380A1 (fr) Procédé, appareil et dispositif de détection d'empreinte vocale sur la base d'un texte court, et support d'enregistrement
CN110443692B (zh) 企业信贷审核方法、装置、设备及计算机可读存储介质
CN106887225B (zh) 基于卷积神经网络的声学特征提取方法、装置和终端设备
WO2020073665A1 (fr) Procédé et système permettant d'effectuer une reconnaissance d'émotion dans la voix à l'aide d'un spectre, et support d'informations
CN110276259A (zh) 唇语识别方法、装置、计算机设备及存储介质
CN107221320A (zh) 训练声学特征提取模型的方法、装置、设备和计算机存储介质
CN112562691A (zh) 一种声纹识别的方法、装置、计算机设备及存储介质
JP6756079B2 (ja) 人工知能に基づく三元組チェック方法、装置及びコンピュータプログラム
CN110633991A (zh) 风险识别方法、装置和电子设备
CN112328761B (zh) 一种意图标签设置方法、装置、计算机设备及存储介质
WO2021208728A1 (fr) Procédé et appareil de détection de point final de parole basée sur un réseau neuronal, dispositif et support
WO2021135454A1 (fr) Procédé, dispositif et support de stockage lisible par ordinateur pour reconnaissance de faux signal vocal
CN110222780A (zh) 物体检测方法、装置、设备和存储介质
CN112233698A (zh) 人物情绪识别方法、装置、终端设备及存储介质
CN111653274B (zh) 唤醒词识别的方法、装置及存储介质
Tiwari et al. Virtual home assistant for voice based controlling and scheduling with short speech speaker identification
CN107341464A (zh) 一种用于提供交友对象的方法、设备及系统
CN113314150A (zh) 基于语音数据的情绪识别方法、装置及存储介质
CN113450822B (zh) 语音增强方法、装置、设备及存储介质
US10446138B2 (en) System and method for assessing audio files for transcription services
WO2021128847A1 (fr) Procédé et appareil d'interaction de terminal, dispositif informatique et support de stockage
WO2023222071A1 (fr) Procédé et appareil de traitement de signal vocal, et dispositif et support
Shah et al. Speech recognition using spectrogram-based visual features
Bear et al. Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20908575

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20908575

Country of ref document: EP

Kind code of ref document: A1