CN110797031A - Voice change detection method, system, mobile terminal and storage medium - Google Patents


Info

Publication number
CN110797031A
CN110797031A (application CN201910888401.8A)
Authority
CN
China
Prior art keywords
voice
features
detected
cqt
cqcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910888401.8A
Other languages
Chinese (zh)
Inventor
陈文敏
肖龙源
李稀敏
蔡振华
刘晓葳
王静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN201910888401.8A
Publication of CN110797031A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 — Training, enrolment or model building
    • G10L17/18 — Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the technical field of automatic speaker verification and provides a voice inflexion detection method, system, mobile terminal and storage medium. The method comprises the following steps: acquiring sample voice data and performing feature extraction on it to obtain cqt voice features; optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features into a preset convolutional neural network for model training to obtain a voice detection model; and acquiring the voice to be detected, inputting it into the voice detection model for voice analysis, and performing inflexion judgment on the voice to be detected according to the analysis result. The method requires no manual feature selection: model training with a convolutional neural network improves the accuracy of subsequent inflexion detection on the voice to be detected, and extraction and optimization based on cqt features improve the resolution of the voice detection model.

Description

Voice change detection method, system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of automatic speaker verification, and particularly relates to a voice inflection detection method, a system, a mobile terminal and a storage medium.
Background
Automatic Speaker Verification (ASV) technology has matured into a low-cost, reliable method of identity verification and identification. However, like all biometric modalities, it can be attacked with fraudulent speech such as replayed speech, inflexion (voice-changed) speech and synthetic speech. The intent behind such speech is to impersonate other enrollees and breach the verification system in order to perform illegal operations; the inflexion detection step for the voice to be detected is therefore particularly important when ASV technology is used.
Existing voice inflexion detection methods require manual selection of sound-wave features and judge inflexion by sound-wave matching: based on the manually selected features, the voice to be detected is matched against preset sound waves to obtain an inflexion judgment. This manually selected, wave-matching approach makes voice inflexion detection inefficient and inaccurate.
Disclosure of Invention
The embodiment of the invention aims to provide a voice inflexion detection method, system, mobile terminal and storage medium, so as to solve the problems of low detection efficiency and poor detection accuracy caused by the sound-wave matching approach used for inflexion judgment in existing voice inflexion detection.
The embodiment of the invention is realized in such a way that a voice inflection detection method comprises the following steps:
acquiring sample voice data, and performing feature extraction on the sample voice data to obtain cqt voice features, wherein the sample voice data comprises positive sample data and negative sample data;
optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features to a preset convolutional neural network for model training to obtain a voice detection model;
and acquiring the voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and performing inflexion judgment on the voice to be detected according to the analysis result of the voice detection model.
Further, the step of optimizing the cqt voice features comprises:
converting the cqt voice features into a voice power spectrum, and taking the logarithm of the voice power spectrum;
and resampling the log power spectrum, and applying a discrete cosine transform after resampling, to obtain the cqcc voice features.
Further, the step of performing inflexion determination on the speech to be detected according to the analysis result of the speech detection model includes:
acquiring a probability value output by a softmax layer in the voice detection model;
when the probability value is judged to be larger than the probability threshold value, judging the voice to be detected to be inflexion voice;
and when the probability value is not larger than the probability threshold value, judging that the voice to be detected is non-inflexion voice.
Further, the step of inputting the cqcc speech features into a preset convolutional neural network for model training includes:
controlling the preset convolutional neural network to adopt a cross entropy loss function and updating network parameters by adopting an Adam algorithm;
and carrying out iteration for a preset number of times according to the cqcc voice characteristics to obtain the voice detection model.
Further, after the step of updating the network parameter by using the Adam algorithm, the method further includes:
and adding random inactivation operation into the preset convolutional neural network.
Furthermore, the preset convolutional neural network comprises three convolutional layers and two full-connection layers.
Another objective of an embodiment of the present invention is to provide a system for detecting a voice inflection, where the system includes:
the characteristic extraction module is used for acquiring sample voice data and extracting characteristics of the sample voice data to obtain cqt voice characteristics, wherein the sample voice data comprises positive sample data and negative sample data;
the model training module is used for optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features into a preset convolutional neural network for model training to obtain a voice detection model;
and the voice detection module is used for acquiring the voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and carrying out voice change judgment on the voice to be detected according to the analysis result of the voice detection model.
Further, the model training module is further configured to:
converting the cqt voice features into a voice power spectrum, and taking the logarithm of the voice power spectrum;
and resampling the log power spectrum, and applying a discrete cosine transform after resampling, to obtain the cqcc voice features.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned voice inflection detection method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the voice inflection detection method.
According to the embodiment of the invention, no manual feature selection is needed: model training with a convolutional neural network effectively improves the accuracy of subsequent inflexion detection on the voice to be detected, and extraction and optimization based on cqt features improve the resolution of the voice detection model, so that the model better distinguishes inflexion voice from normal voice while also reducing the amount of computation, improving the efficiency of voice inflexion detection.
Drawings
Fig. 1 is a flowchart of a voice inflection detection method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for detecting voice inflection provided in a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a system for detecting a voice inflection according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
The existing voice inflexion detection method suffers from low detection efficiency and poor accuracy because it relies on sound-wave matching with manually selected features. The present invention therefore performs model training with a convolutional neural network to improve the accuracy of subsequent inflexion detection on the voice to be detected, and improves the resolution of the voice detection model through extraction and optimization based on cqt features, so that the model better distinguishes inflexion voice from normal voice, reduces the amount of computation, and improves the efficiency of voice inflexion detection.
Example one
Please refer to fig. 1, which is a flowchart illustrating a voice inflection detection method according to a first embodiment of the present invention, including the steps of:
step S10, obtaining sample voice data, and performing feature extraction on the sample voice data to obtain cqt voice features;
the sample voice data includes positive sample data and negative sample data. Specifically, the positive sample data mainly consists of recordings of real human voices, and the negative sample data mainly consists of pitch-changed data, replayed-recording data, synthetic audio data, and the like;
preferably, the pitch-changed data can be collected through mainstream voice-changing apps, or the voice in an audio clip can be converted into the voice of a specific person through a conversion algorithm; the replayed-recording data can be acquired with recording devices; and the synthetic audio data can be generated through a speech-synthesis interface;
step S20, optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features to a preset convolutional neural network for model training to obtain a voice detection model;
specifically, in this step the cqt voice features are optimized into cqcc (Constant-Q Cepstral Coefficient) voice features, which improves the discriminative power of the features and thus the accuracy with which the subsequent model distinguishes inflexion from non-inflexion voice;
step S30, acquiring a voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and performing inflexion judgment on the voice to be detected according to the analysis result of the voice detection model;
the voice to be detected can be acquired with a sound pick-up device; in this step, the target cqcc features of the voice to be detected are extracted and input into the network of the voice detection model for analysis, yielding the voice analysis result;
This embodiment requires no manual feature selection: model training with a convolutional neural network effectively improves the accuracy of subsequent inflexion detection on the voice to be detected, and extraction and optimization based on cqt features improve the resolution of the voice detection model, so that the model better distinguishes inflexion voice from normal voice while also reducing the amount of computation, improving the efficiency of voice inflexion detection.
Example two
Please refer to fig. 2, which is a flowchart illustrating a voice inflection detection method according to a second embodiment of the present invention, including the steps of:
step S11, obtaining sample voice data, and performing feature extraction on the sample voice data to obtain cqt voice features;
preferably, because the energy of the human voice concentrates at low frequencies, and the cqt offers higher resolution at low frequencies and lower resolution at high frequencies, the cqt-based feature extraction in this step lets the subsequently trained model better distinguish inflexion voice from normal voice while reducing the amount of computation;
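The low-frequency resolution property described above comes from the geometrically spaced bin center frequencies of the constant-Q transform. A minimal numpy sketch, illustrative only; the f_min and bin-count values below are assumptions, not values specified by this patent:

```python
import numpy as np

def cqt_center_frequencies(f_min, n_bins, bins_per_octave=12):
    """Constant-Q bin centers: f_k = f_min * 2**(k / bins_per_octave).

    Unlike the uniformly spaced bins of an ordinary STFT, geometric
    spacing packs many more bins into the low frequencies where the
    energy of the human voice concentrates.
    """
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

# Hypothetical parameters: 7 octaves above ~32.7 Hz, 12 bins per octave.
freqs = cqt_center_frequencies(f_min=32.7, n_bins=84, bins_per_octave=12)

# Adjacent-bin spacing grows with frequency, i.e. resolution is
# highest at the low end and lowest at the high end.
low_spacing = freqs[1] - freqs[0]
high_spacing = freqs[-1] - freqs[-2]
```

In practice an off-the-shelf implementation (for example librosa's `cqt`) would compute the full transform; the sketch only shows why low frequencies get finer resolution.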
step S21, converting the cqt voice features into a voice power spectrum, and taking the logarithm of the voice power spectrum;
step S31, resampling the log power spectrum, and applying a discrete cosine transform after resampling, to obtain the cqcc voice features;
wherein the resampling is designed to bring the logarithmic values onto a uniform scale, and the discrete cosine transform is designed so that the speech information is concentrated in the low-frequency part;
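Steps S21 and S31 (power spectrum, logarithm, uniform resampling, DCT) can be sketched as follows. This is an illustrative numpy reconstruction in which the frame length, resampling size and coefficient count are assumed values, not ones given by the patent:

```python
import numpy as np

def cqcc_from_cqt(cqt_frame, n_resample=128, n_coeffs=20):
    """Turn one (complex) CQT frame into cqcc-style coefficients."""
    # Power spectrum of the CQT coefficients.
    power = np.abs(cqt_frame) ** 2
    # Logarithm, with a small floor to avoid log(0).
    log_power = np.log(power + 1e-10)
    # Uniform resampling: the geometrically spaced CQT bins are
    # re-interpolated onto a uniform scale so the DCT applies cleanly.
    src = np.linspace(0.0, 1.0, len(log_power))
    dst = np.linspace(0.0, 1.0, n_resample)
    resampled = np.interp(dst, src, log_power)
    # DCT-II, keeping the first n_coeffs coefficients; the transform
    # concentrates the information in the low-order terms.
    n = np.arange(n_resample)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1)
                   / (2 * n_resample))
    return basis @ resampled

# Hypothetical 84-bin CQT frame with random content, for demonstration.
rng = np.random.default_rng(0)
frame = rng.standard_normal(84) + 1j * rng.standard_normal(84)
cqcc = cqcc_from_cqt(frame)
```

A production implementation would use a library DCT (for example `scipy.fft.dct`) and a proper uniform-resampling scheme; the explicit DCT-II basis above just keeps the sketch self-contained.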
step S41, configuring the preset convolutional neural network to use a cross-entropy loss function, updating the network parameters with the Adam algorithm, and adding a random inactivation (dropout) operation to the preset convolutional neural network;
adding the dropout operation to the preset convolutional neural network effectively prevents overfitting during model construction and improves the stability of model building;
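Step S41 combines three standard ingredients: a cross-entropy loss, Adam parameter updates, and dropout (random inactivation). The sketch below shows each in plain numpy; it is illustrative only, and the hyperparameter values are the usual defaults rather than values stated in the patent:

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for a predicted distribution and a true class index."""
    return -np.log(probs[label] + 1e-12)

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def dropout(x, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the survivors; applied at training time only."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

loss = cross_entropy(np.array([0.7, 0.3]), 0)   # network gives 0.7 to the true class

w = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
g = np.array([0.5, -0.5, 1.0, 0.0])             # gradient of the loss w.r.t. w
w, m, v = adam_step(w, g, m, v, t=1)            # first step moves each weight by ~lr

h = dropout(np.ones(8), rate=0.5, rng=np.random.default_rng(0))
```

In a real training loop these pieces would come from a deep-learning framework (e.g. an Adam optimizer and dropout layers); the point here is only what each operation does to the numbers.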
step S51, performing iteration for preset times according to the cqcc voice characteristics to obtain the voice detection model;
the preset number of iterations can be set according to the user's requirements, for example 500, 1000 or 2000 iterations;
step S61, acquiring a voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and performing inflexion judgment on the voice to be detected according to the analysis result of the voice detection model;
when the voice analysis result is voice information, the detection result is played directly to the user;
when the voice analysis result is text, numerical or image information, it is sent to the target device corresponding to a preset display address, so that the detection result is displayed on that device and the user can conveniently check the inflexion detection result of the voice to be detected;
specifically, in this step, performing inflexion judgment on the voice to be detected according to the analysis result of the voice detection model includes:
acquiring the probability value output by the softmax layer of the voice detection model;
when the probability value is greater than a probability threshold, judging the voice to be detected to be inflexion voice;
and when the probability value is not greater than the probability threshold, judging the voice to be detected to be non-inflexion voice;
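The decision rule above amounts to thresholding one softmax output. A minimal numpy sketch, in which the class ordering, threshold and logits are assumed for illustration:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_inflexion_voice(logits, threshold=0.5, inflexion_class=1):
    """Judge the voice as inflexion voice when the softmax probability
    of the (assumed) inflexion class exceeds the threshold."""
    return softmax(logits)[inflexion_class] > threshold

# Hypothetical two-class network output: (non-inflexion, inflexion).
decision = is_inflexion_voice(np.array([0.2, 2.3]))
```

The threshold of 0.5 is only a placeholder; in practice it would be tuned on held-out data to trade off false acceptances against false rejections.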
This embodiment requires no manual feature selection: model training with a convolutional neural network effectively improves the accuracy of subsequent inflexion detection on the voice to be detected, and extraction and optimization based on cqt features improve the resolution of the voice detection model, so that the model better distinguishes inflexion voice from normal voice while also reducing the amount of computation, improving the efficiency of voice inflexion detection.
EXAMPLE III
Please refer to fig. 3, which is a schematic structural diagram of a voice inflexion detection system 100 according to a third embodiment of the present invention, comprising: a feature extraction module 10, a model training module 11 and a voice detection module 12, wherein:
the feature extraction module 10 is configured to obtain sample voice data, and perform feature extraction on the sample voice data to obtain cqt voice features, where the sample voice data includes positive sample data and negative sample data.
And the model training module 11 is configured to perform optimization processing on the cqt speech features to obtain cqcc speech features, and input the cqcc speech features to a preset convolutional neural network for model training to obtain a speech detection model, where the preset convolutional neural network includes three convolutional layers and two fully-connected layers.
Wherein the model training module 11 is further configured to: controlling the preset convolutional neural network to adopt a cross entropy loss function and updating network parameters by adopting an Adam algorithm; and carrying out iteration for a preset number of times according to the cqcc voice characteristics to obtain the voice detection model.
Furthermore, the model training module 11 is further configured to add a random inactivation (dropout) operation to the preset convolutional neural network.
Further, the model training module 11 is further configured to: convert the cqt voice features into a voice power spectrum and take the logarithm of the voice power spectrum; and resample the log power spectrum and apply a discrete cosine transform after resampling to obtain the cqcc voice features.
The voice detection module 12 is configured to acquire a voice to be detected, input the voice to be detected to the voice detection model for voice analysis, and perform inflexion judgment on the voice to be detected according to an analysis result of the voice detection model.
Wherein the voice detection module 12 is further configured to: acquiring a probability value output by a softmax layer in the voice detection model; when the probability value is judged to be larger than the probability threshold value, judging the voice to be detected to be inflexion voice; and when the probability value is not larger than the probability threshold value, judging that the voice to be detected is non-inflexion voice.
This embodiment requires no manual feature selection: model training with a convolutional neural network effectively improves the accuracy of subsequent inflexion detection on the voice to be detected, and extraction and optimization based on cqt features improve the resolution of the voice detection model, so that the model better distinguishes inflexion voice from normal voice while also reducing the amount of computation, improving the efficiency of voice inflexion detection.
Example four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above-mentioned voice change detection method.
The present embodiment also provides a storage medium storing the computer program used in the above-mentioned mobile terminal 101, which, when executed, implements the following steps:
acquiring sample voice data, and performing feature extraction on the sample voice data to obtain cqt voice features, wherein the sample voice data comprises positive sample data and negative sample data;
optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features to a preset convolutional neural network for model training to obtain a voice detection model;
and acquiring the voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and performing inflexion judgment on the voice to be detected according to the analysis result of the voice detection model. The storage medium may be, for example, a ROM/RAM, a magnetic disk or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 3 is not intended to limit the present invention and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components, and that the voice inflection detection method of fig. 1-2 may be implemented using more or fewer components than those shown in fig. 3, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) in the voice inflection detection system and that can perform a specific function, and all of them can be stored in a storage device (not shown) of the voice inflection detection system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for detecting a voice inflection, the method comprising:
acquiring sample voice data, and performing feature extraction on the sample voice data to obtain cqt voice features, wherein the sample voice data comprises positive sample data and negative sample data;
optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features to a preset convolutional neural network for model training to obtain a voice detection model;
and acquiring the voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and performing inflexion judgment on the voice to be detected according to the analysis result of the voice detection model.
2. The method of detecting inflections in speech of claim 1 wherein said step of optimizing said cqt speech feature comprises:
converting the cqt voice features into a voice power spectrum, and taking the logarithm of the voice power spectrum;
and resampling the log power spectrum, and applying a discrete cosine transform after resampling, to obtain the cqcc voice features.
3. The method according to claim 1, wherein the step of performing inflection decision on the speech to be detected according to the analysis result of the speech detection model comprises:
acquiring a probability value output by a softmax layer in the voice detection model;
when the probability value is judged to be larger than the probability threshold value, judging the voice to be detected to be inflexion voice;
and when the probability value is not larger than the probability threshold value, judging that the voice to be detected is non-inflexion voice.
4. The method of detecting inflexion in speech of claim 1, where the step of inputting the cqcc speech features into a predetermined convolutional neural network for model training comprises:
controlling the preset convolutional neural network to adopt a cross entropy loss function and updating network parameters by adopting an Adam algorithm;
and carrying out iteration for a preset number of times according to the cqcc voice characteristics to obtain the voice detection model.
5. The method of detecting inflections in speech according to claim 4, wherein after the step of employing the Adam algorithm for updating the network parameters, the method further comprises:
and adding random inactivation operation into the preset convolutional neural network.
6. The method of detecting inflections in speech of claim 1, wherein the predetermined convolutional neural network comprises three convolutional layers and two fully-connected layers.
7. A system for detecting a voice inflection, the system comprising:
the characteristic extraction module is used for acquiring sample voice data and extracting characteristics of the sample voice data to obtain cqt voice characteristics, wherein the sample voice data comprises positive sample data and negative sample data;
the model training module is used for optimizing the cqt voice features to obtain cqcc voice features, and inputting the cqcc voice features into a preset convolutional neural network for model training to obtain a voice detection model;
and the voice detection module is used for acquiring the voice to be detected, inputting the voice to be detected into the voice detection model for voice analysis, and carrying out voice change judgment on the voice to be detected according to the analysis result of the voice detection model.
8. The voice inflection detection system of claim 7 wherein the model training module is further configured to:
converting the cqt voice features into a voice power spectrum, and taking the logarithm of the voice power spectrum;
and resampling the log power spectrum, and applying a discrete cosine transform after resampling, to obtain the cqcc voice features.
9. A mobile terminal, characterized by comprising a storage device for storing a computer program and a processor for executing the computer program to cause the mobile terminal to perform the voice inflection detection method according to any one of claims 1 to 6.
10. A storage medium, characterized in that it stores a computer program for use in the mobile terminal according to claim 9, which computer program, when executed by a processor, implements the steps of the voice inflexion detection method according to any of claims 1-6.
CN201910888401.8A (priority and filing date 2019-09-19): Voice change detection method, system, mobile terminal and storage medium. Pending; published as CN110797031A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910888401.8A CN110797031A (en) 2019-09-19 2019-09-19 Voice change detection method, system, mobile terminal and storage medium


Publications (1)

Publication Number Publication Date
CN110797031A true CN110797031A (en) 2020-02-14

Family

ID=69438586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910888401.8A Pending CN110797031A (en) 2019-09-19 2019-09-19 Voice change detection method, system, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110797031A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739546A (en) * 2020-07-24 2020-10-02 深圳市声扬科技有限公司 Sound-changing voice reduction method and device, computer equipment and storage medium
CN111798828A (en) * 2020-05-29 2020-10-20 厦门快商通科技股份有限公司 Synthetic audio detection method, system, mobile terminal and storage medium
CN111951811A (en) * 2020-07-15 2020-11-17 珠海市杰理科技股份有限公司 Bluetooth headset control method and device, Bluetooth headset and preset information importing method
CN113257284A (en) * 2021-06-09 2021-08-13 北京世纪好未来教育科技有限公司 Voice activity detection model training method, voice activity detection method and related device
CN113436646A (en) * 2021-06-10 2021-09-24 杭州电子科技大学 Camouflage voice detection method adopting combined features and random forest
CN113646833A (en) * 2021-07-14 2021-11-12 东莞理工学院 Voice confrontation sample detection method, device, equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198574A (en) * 2017-12-29 2018-06-22 科大讯飞股份有限公司 Change of voice detection method and device
US20180254046A1 (en) * 2017-03-03 2018-09-06 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN109243487A (en) * 2018-11-30 2019-01-18 宁波大学 A kind of voice playback detection method normalizing normal Q cepstrum feature
CN109300479A (en) * 2018-10-31 2019-02-01 桂林电子科技大学 A kind of method for recognizing sound-groove of voice playback, device and storage medium
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN109754812A (en) * 2019-01-30 2019-05-14 华南理工大学 A kind of voiceprint authentication method of the anti-recording attack detecting based on convolutional neural networks
CN109872720A (en) * 2019-01-29 2019-06-11 广东技术师范学院 It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks
CN110232927A (en) * 2019-06-13 2019-09-13 苏州思必驰信息科技有限公司 Speaker verification's anti-spoofing method and apparatus

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180254046A1 (en) * 2017-03-03 2018-09-06 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN108198574A (en) * 2017-12-29 2018-06-22 科大讯飞股份有限公司 Change of voice detection method and device
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN109300479A (en) * 2018-10-31 2019-02-01 桂林电子科技大学 A kind of method for recognizing sound-groove of voice playback, device and storage medium
CN109243487A (en) * 2018-11-30 2019-01-18 宁波大学 A kind of voice playback detection method normalizing normal Q cepstrum feature
CN109872720A (en) * 2019-01-29 2019-06-11 广东技术师范学院 It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks
CN109754812A (en) * 2019-01-30 2019-05-14 华南理工大学 A kind of voiceprint authentication method of the anti-recording attack detecting based on convolutional neural networks
CN110232927A (en) * 2019-06-13 2019-09-13 苏州思必驰信息科技有限公司 Speaker verification's anti-spoofing method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
许庆勇: "《基于深度学习理论的纹身图像识别与检测研究》", 31 December 2018, 华中科技大学出版社 *
辛阳等: "《大数据技术原理与实践》", 31 January 2018, 北京邮电大学出版社 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798828A (en) * 2020-05-29 2020-10-20 厦门快商通科技股份有限公司 Synthetic audio detection method, system, mobile terminal and storage medium
CN111798828B (en) * 2020-05-29 2023-02-14 厦门快商通科技股份有限公司 Synthetic audio detection method, system, mobile terminal and storage medium
CN111951811A (en) * 2020-07-15 2020-11-17 珠海市杰理科技股份有限公司 Bluetooth headset control method and device, Bluetooth headset and preset information importing method
CN111739546A (en) * 2020-07-24 2020-10-02 深圳市声扬科技有限公司 Sound-changing voice reduction method and device, computer equipment and storage medium
CN113257284A (en) * 2021-06-09 2021-08-13 北京世纪好未来教育科技有限公司 Voice activity detection model training method, voice activity detection method and related device
CN113436646A (en) * 2021-06-10 2021-09-24 杭州电子科技大学 Camouflage voice detection method adopting combined features and random forest
CN113436646B (en) * 2021-06-10 2022-09-23 杭州电子科技大学 Camouflage voice detection method adopting combined features and random forest
CN113646833A (en) * 2021-07-14 2021-11-12 东莞理工学院 Voice confrontation sample detection method, device, equipment and computer readable storage medium
WO2023283823A1 (en) * 2021-07-14 2023-01-19 东莞理工学院 Speech adversarial sample testing method and apparatus, device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN110797031A (en) Voice change detection method, system, mobile terminal and storage medium
JP6393730B2 (en) Voice identification method and apparatus
US10178228B2 (en) Method and apparatus for classifying telephone dialing test audio based on artificial intelligence
CN110223680A (en) Method of speech processing, recognition methods and its device, system, electronic equipment
CN111161752A (en) Echo cancellation method and device
CN110600048B (en) Audio verification method and device, storage medium and electronic equipment
CN110880329A (en) Audio identification method and equipment and storage medium
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN110428835B (en) Voice equipment adjusting method and device, storage medium and voice equipment
CN110211599A (en) Using awakening method, device, storage medium and electronic equipment
US11282514B2 (en) Method and apparatus for recognizing voice
CN109065043A (en) A kind of order word recognition method and computer storage medium
CN111540342A (en) Energy threshold adjusting method, device, equipment and medium
CN109545226B (en) Voice recognition method, device and computer readable storage medium
CN108847251B (en) Voice duplicate removal method, device, server and storage medium
CN113886821A (en) Malicious process identification method and device based on twin network, electronic equipment and storage medium
JP6843701B2 (en) Parameter prediction device and parameter prediction method for acoustic signal processing
CN109377982A (en) A kind of efficient voice acquisition methods
CN110070891B (en) Song identification method and device and storage medium
CN117076941A (en) Optical cable bird damage monitoring method, system, electronic equipment and readable storage medium
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN116386669A (en) Machine running acoustic state monitoring method and system based on block automatic encoder
CN113793623B (en) Sound effect setting method, device, equipment and computer readable storage medium
CN113312619B (en) Malicious process detection method and device based on small sample learning, electronic equipment and storage medium
CN114420136A (en) Method and device for training voiceprint recognition model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200214