CN111009238B - Method, device and equipment for recognizing spliced voice - Google Patents
- Publication number
- CN111009238B (application CN202010002558.9A)
- Authority
- CN
- China
- Prior art keywords
- spliced
- voice data
- voice
- long
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a method, an apparatus, and a device for recognizing spliced voice. The method comprises the following steps: acquiring normal voice data of a user; cutting the normal voice data into a preset number of segments; splicing the segments in a shuffled order to obtain spliced voice data; constructing a binary classification model based on the normal voice data and the spliced voice data; training the binary classification model into a spliced-voice model using a long short-term memory (LSTM) network and a convolutional neural network (CNN); and recognizing spliced voice in voice data according to the trained model. In this way, the recognition of spliced voice can be realized, and the security of voice verification can be ensured.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for recognizing spliced speech.
Background
In many real-life scenarios, users often need to pass voice verification, for example, to log in to a software program or a terminal device. However, some attackers cut the voices of other users and splice the pieces into audio with specific content, then attempt to use this spliced voice to impersonate a real user during voice verification, so as to illegally obtain benefits or perform illegal operations.
The prior art cannot recognize such spliced voice, and thus cannot guarantee the security of voice verification.
Disclosure of Invention
In view of the above, the present invention aims to provide a method, an apparatus, and a device for recognizing spliced voice, which can recognize spliced voice and thereby ensure the security of voice verification.
According to one aspect of the present invention, there is provided a method for recognizing spliced voice, comprising:
acquiring normal voice data of a user;
cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled order to obtain spliced voice data;
constructing a binary classification model based on the normal voice data and the spliced voice data;
training the binary classification model into a spliced-voice model using a long short-term memory (LSTM) network and a convolutional neural network (CNN);
and recognizing spliced voice in voice data according to the trained binary classification model.
Wherein constructing the binary classification model based on the normal voice data and the spliced voice data comprises:
extracting linear predictive coding (LPC) features and pitch features from the normal voice data and the spliced voice data respectively, performing differential and normalization operations on the LPC features and the pitch features, and using the features after the differential and normalization operations as training inputs of the LSTM network and the CNN, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Wherein training the binary classification model into the spliced-voice model using the LSTM network and the CNN comprises:
extracting acoustic features for the binary classification model, inputting the extracted acoustic features into the LSTM network and the CNN, and training the binary classification model with the LSTM network and the CNN.
Wherein, after recognizing spliced voice in the voice data according to the trained binary classification model, the method further comprises:
updating the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model for a preset number of iterations using the updated LSTM network and CNN.
According to another aspect of the present invention, there is provided a recognition apparatus for spliced voice, comprising:
the system comprises an acquisition module, a splicing module, a construction module, a training module and an identification module;
the acquisition module is used for acquiring normal voice data of a user;
the splicing module is used for cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled order to obtain spliced voice data;
the construction module is used for constructing a binary classification model based on the normal voice data and the spliced voice data;
the training module is used for training the binary classification model into a spliced-voice model using a long short-term memory (LSTM) network and a convolutional neural network (CNN);
and the recognition module is used for recognizing spliced voice in voice data according to the trained binary classification model.
The construction module is specifically configured to:
extract LPC features and pitch features from the normal voice data and the spliced voice data respectively, perform differential and normalization operations on the LPC features and the pitch features, and use the features after the differential and normalization operations as training inputs of the LSTM network and the CNN, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
The training module is specifically configured to:
extract acoustic features for the binary classification model, input the extracted acoustic features into the LSTM network and the CNN, and train the binary classification model with the LSTM network and the CNN.
The recognition apparatus for spliced voice further comprises:
an updating module;
wherein the updating module is used for updating the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and for retraining the binary classification model for a preset number of iterations using the updated LSTM network and CNN.
According to still another aspect of the present invention, there is provided a recognition apparatus for spliced voice, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods for recognizing spliced voice described above.
According to a further aspect of the present invention, there is provided a computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements any of the methods for recognizing spliced voice described above.
It can be seen that, according to the above scheme, normal voice data of a user can be acquired and cut into a preset number of segments, and the segments can be spliced in a shuffled order to obtain spliced voice data; a binary classification model based on the normal voice data and the spliced voice data can be constructed and trained into a spliced-voice model using a long short-term memory network and a convolutional neural network; and spliced voice can then be recognized in voice data according to the trained model. In this way, recognition of spliced voice can be realized, and the security of voice verification can be ensured.
Further, according to the above scheme, LPC features and pitch features can be extracted from the normal voice data and the spliced voice data respectively, differential and normalization operations can be performed on these features, and the processed features can be used as training inputs of the LSTM network and the CNN to construct the binary classification model.
Furthermore, according to the above scheme, acoustic features can be extracted for the binary classification model, input into the LSTM network and the CNN, and used to train the binary classification model into a spliced-voice model.
Furthermore, according to the above scheme, the parameters of the LSTM network and the CNN can be updated through a cross-entropy loss function and an optimization algorithm, and the binary classification model can be retrained for a preset number of iterations using the updated networks, so that the accuracy of spliced-voice recognition can be improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the embodiments or the description of the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an embodiment of a method for recognizing a spliced voice according to the present invention;
FIG. 2 is a flow chart of another embodiment of a method for recognizing a spliced voice according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a spliced voice recognition apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of another embodiment of a spliced voice recognition apparatus according to the present invention;
fig. 5 is a schematic structural diagram of an embodiment of a spliced voice recognition device according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is specifically noted that the following examples are only for illustrating the present invention, but do not limit the scope of the present invention. Likewise, the following examples are only some, but not all, of the examples of the present invention, and all other examples, which a person of ordinary skill in the art would obtain without making any inventive effort, are within the scope of the present invention.
The invention provides a recognition method of spliced voice, which can realize the recognition of the spliced voice and further ensure the safety of voice verification.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a method for recognizing a spliced voice according to the present invention. It should be noted that, if there are substantially the same results, the method of the present invention is not limited to the flow sequence shown in fig. 1. As shown in fig. 1, the method comprises the steps of:
S101: acquiring normal voice data of a user.
In this embodiment, the user may be a single user or a plurality of users, and the present invention is not limited thereto.
In this embodiment, the normal voice data of a plurality of users may be acquired all at once, in several batches, or user by user; the present invention is not limited in this respect.
S102: cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled order to obtain spliced voice data.
In this embodiment, the normal voice data may be cut into 2 segments, 3 segments, or any other number of segments; the present invention is not limited in this respect.
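For illustration, a minimal Python sketch of S101–S102 is given below. The file name, sample rate, segment count, and the use of a uniform random permutation are assumptions made for the example; the patent only requires that the data be cut into a preset number of segments and rejoined out of order.

```python
import numpy as np
import librosa  # assumed here for audio loading

def make_spliced(y, num_segments=3, seed=0):
    """Cut an utterance into a preset number of segments and rejoin them in shuffled order."""
    rng = np.random.default_rng(seed)
    segments = np.array_split(y, num_segments)        # preset number of segments
    order = rng.permutation(num_segments)             # shuffled splicing order
    return np.concatenate([segments[i] for i in order])

# hypothetical input file; any normal utterance of the user would do
y, sr = librosa.load("normal_utterance.wav", sr=16000)
y_spliced = make_spliced(y, num_segments=3)           # one spliced training sample
```

Both the normal utterance `y` and the spliced version `y_spliced` then serve as labeled examples for the binary classification model built in S103.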
S103: constructing a binary classification model based on the normal voice data and the spliced voice data.
Wherein constructing the binary classification model based on the normal voice data and the spliced voice data may include:
respectively extracting LPC (Linear Predictive Coding) features and pitch features from the normal voice data and the spliced voice data, performing differential and normalization operations on the LPC features and the pitch features, and using the features after the differential and normalization operations as training inputs of an LSTM (Long Short-Term Memory) network and a CNN (Convolutional Neural Network), thereby constructing a binary classification model based on the normal voice data and the spliced voice data.
S104: training the binary classification model into a spliced-voice model using the LSTM network and the CNN.
Wherein training the binary classification model into the spliced-voice model using the LSTM network and the CNN may include:
extracting acoustic features for the binary classification model, inputting the extracted acoustic features into the LSTM network and the CNN, and training the binary classification model with them. This has the advantage that the extracted acoustic features make the characteristics of spliced voice more prominent, which can improve the accuracy of spliced-voice recognition.
In this embodiment, the LSTM network and the CNN may include two LSTM layers and two fully connected layers, three LSTM layers and three fully connected layers, or four LSTM layers and four fully connected layers.
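As a concrete reading of one of these layouts, the sketch below uses Keras with a convolutional front-end followed by two LSTM layers and two fully connected layers. The layer widths, kernel size, and the exact way the CNN and LSTM are combined are assumptions; the patent specifies only the layer counts above.

```python
import tensorflow as tf

def build_model(n_frames, n_features):
    """Binary classifier for normal vs. spliced voice: CNN front-end,
    two LSTM layers, two fully connected layers (one possible layout)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_frames, n_features)),
        tf.keras.layers.Conv1D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.LSTM(128, return_sequences=True),   # LSTM layer 1
        tf.keras.layers.LSTM(128),                          # LSTM layer 2
        tf.keras.layers.Dense(64, activation="relu"),       # fully connected layer 1
        tf.keras.layers.Dense(1, activation="sigmoid"),     # fully connected layer 2
    ])

model = build_model(n_frames=200, n_features=26)  # 26 = 13 features + 13 deltas
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The sigmoid output gives the probability that an input utterance is spliced, and `binary_crossentropy` matches the cross-entropy loss mentioned below in S105/S206.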
S105: recognizing spliced voice in voice data according to the trained binary classification model.
Wherein, after recognizing spliced voice in the voice data according to the trained binary classification model, the method may further include:
updating the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model for a preset number of iterations using the updated LSTM network and CNN, so that the accuracy of spliced-voice recognition can be improved.
It can be seen that, in this embodiment, normal voice data of a user can be acquired and cut into a preset number of segments, and the segments can be spliced in a shuffled order to obtain spliced voice data; a binary classification model based on the normal voice data and the spliced voice data can be constructed and trained into a spliced-voice model using an LSTM network and a CNN; and spliced voice can then be recognized in voice data according to the trained model. In this way, recognition of spliced voice can be realized, and the security of voice verification can be ensured.
Further, in this embodiment, LPC features and pitch features can be extracted from the normal voice data and the spliced voice data respectively, differential and normalization operations can be performed on these features, and the processed features can be used as training inputs of the LSTM network and the CNN to construct the binary classification model. The advantage of this is that the LSTM network and the CNN can retain contextual information of the audio, which facilitates the recognition of spliced voice.
Further, in this embodiment, acoustic features can be extracted for the binary classification model, input into the LSTM network and the CNN, and used to train the binary classification model into a spliced-voice model.
Referring to fig. 2, fig. 2 is a flowchart of another embodiment of a method for recognizing a spliced voice according to the present invention. In this embodiment, the method includes the steps of:
S201: acquiring normal voice data of a user.
As described in S101, a detailed description is omitted here.
S202: cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled order to obtain spliced voice data.
As described in S102, the description is omitted here.
S203: constructing a binary classification model based on the normal voice data and the spliced voice data.
As described in S103, a detailed description is omitted here.
S204: and training the spliced voice model by adopting a long-term memory network and a convolution neural network.
As described in S104, a detailed description is omitted here.
S205: recognizing spliced voice in voice data according to the trained binary classification model.
S206: updating the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model for a preset number of iterations using the updated LSTM network and CNN.
It can be seen that, in this embodiment, the parameters of the LSTM network and the CNN can be updated through a cross-entropy loss function and an optimization algorithm, and the binary classification model can be retrained for a preset number of iterations using the updated networks, so that the accuracy of spliced-voice recognition can be improved.
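Continuing the sketch, a training update under these assumptions (Adam standing in for the "optimization algorithm", a fixed epoch count standing in for the "preset number" of iterations, and random placeholder data in place of real features) might look like:

```python
import numpy as np

# placeholder batch: 32 utterances of 200 frames x 26 features; labels 0 = normal, 1 = spliced
X = np.random.rand(32, 200, 26).astype("float32")
y = np.random.randint(0, 2, size=(32,))

PRESET_ITERATIONS = 20  # the "preset number" of training iterations
model.fit(X, y, epochs=PRESET_ITERATIONS, batch_size=8, validation_split=0.2)

# score an unseen utterance: probability that it is spliced
p_spliced = model.predict(X[:1])[0, 0]
print(f"P(spliced) = {p_spliced:.3f}")
```

Here `model` is the network from the sketch after S104; in practice `X` and `y` would come from the LPC/pitch feature pipeline applied to the normal and spliced voice data.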
The invention also provides a recognition apparatus for spliced voice, which can realize recognition of spliced voice and thereby ensure the security of voice verification.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of a spliced voice recognition apparatus according to the present invention. In this embodiment, the spliced voice recognition apparatus 30 includes an acquisition module 31, a splicing module 32, a construction module 33, a training module 34, and a recognition module 35.
The acquiring module 31 is configured to acquire normal voice data of a user.
The splicing module 32 is configured to cut the normal voice data into a preset number of segments, and splice the segments in a shuffled order to obtain spliced voice data.
The construction module 33 is configured to construct a binary classification model based on the normal voice data and the spliced voice data.
The training module 34 is configured to train the binary classification model into a spliced-voice model using the LSTM network and the CNN.
The recognition module 35 is configured to recognize spliced voice in voice data according to the trained binary classification model.
Alternatively, the construction module 33 may be specifically configured to:
extract LPC features and pitch features from the normal voice data and the spliced voice data respectively, perform differential and normalization operations on the LPC features and the pitch features, and use the features after the differential and normalization operations as training inputs of the LSTM network and the CNN, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Optionally, the training module 34 may be specifically configured to:
extract acoustic features for the binary classification model, input the extracted acoustic features into the LSTM network and the CNN, and train the binary classification model with the LSTM network and the CNN.
Referring to fig. 4, fig. 4 is a schematic structural diagram of another embodiment of a spliced voice recognition apparatus according to the present invention. Unlike the previous embodiment, the spliced voice recognition apparatus 40 of this embodiment further includes an updating module 41.
The updating module 41 is configured to update the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and to retrain the binary classification model for a preset number of iterations using the updated LSTM network and CNN.
The unit modules of the spliced voice recognition apparatus 30/40 can execute the corresponding steps in the above method embodiments, so a detailed description of each module is omitted here.
The present invention further provides a recognition device for spliced voice, as shown in fig. 5, including: at least one processor 51; and a memory 52 communicatively coupled to the at least one processor 51; the memory 52 stores instructions executable by the at least one processor 51, and the instructions are executed by the at least one processor 51 to enable the at least one processor 51 to perform the above-described method for recognizing spliced speech.
Where the memory 52 and the processor 51 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 51 and the memory 52 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 51 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 51.
The processor 51 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory 52 may be used to store data used by the processor 51 in performing operations.
The present invention further provides a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
It can be seen that, according to the above scheme, normal voice data of a user can be acquired and cut into a preset number of segments, and the segments can be spliced in a shuffled order to obtain spliced voice data; a binary classification model based on the normal voice data and the spliced voice data can be constructed and trained into a spliced-voice model using a long short-term memory network and a convolutional neural network; and spliced voice can then be recognized in voice data according to the trained model. In this way, recognition of spliced voice can be realized, and the security of voice verification can be ensured.
Further, according to the above scheme, LPC features and pitch features can be extracted from the normal voice data and the spliced voice data respectively, differential and normalization operations can be performed on these features, and the processed features can be used as training inputs of the LSTM network and the CNN to construct the binary classification model.
Furthermore, according to the above scheme, acoustic features can be extracted for the binary classification model, input into the LSTM network and the CNN, and used to train the binary classification model into a spliced-voice model.
Furthermore, according to the above scheme, the parameters of the LSTM network and the CNN can be updated through a cross-entropy loss function and an optimization algorithm, and the binary classification model can be retrained for a preset number of iterations using the updated networks, so that the accuracy of spliced-voice recognition can be improved.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing description is only a partial embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.
Claims (8)
1. A method for recognizing spliced voice, comprising:
acquiring normal voice data of a user;
cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled order to obtain spliced voice data;
constructing a binary classification model based on the normal voice data and the spliced voice data, comprising:
extracting linear predictive coding (LPC) features and pitch features from the normal voice data and the spliced voice data respectively, performing differential and normalization operations on the LPC features and the pitch features, and using the features after the differential and normalization operations as training inputs of a long short-term memory (LSTM) network and a convolutional neural network (CNN), thereby constructing the binary classification model based on the normal voice data and the spliced voice data;
training the binary classification model into a spliced-voice model using the LSTM network and the CNN; and
recognizing spliced voice in voice data according to the trained binary classification model.
2. The method for recognizing spliced voice according to claim 1, wherein training the binary classification model into the spliced-voice model using the LSTM network and the CNN comprises:
extracting acoustic features for the binary classification model, inputting the extracted acoustic features into the LSTM network and the CNN, and training the binary classification model with the LSTM network and the CNN.
3. The method for recognizing spliced voice according to claim 1, further comprising, after recognizing spliced voice in the voice data according to the trained binary classification model:
updating the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model for a preset number of iterations using the updated LSTM network and CNN.
4. An apparatus for recognizing spliced voice, comprising:
an acquisition module, a splicing module, a construction module, a training module, and a recognition module;
the acquisition module is used for acquiring normal voice data of a user;
the splicing module is used for cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled order to obtain spliced voice data;
the construction module is used for constructing a classification model based on the normal voice data and the spliced voice data, and is specifically used for:
the method comprises the steps of respectively extracting linear prediction analysis characteristics and pitch characteristics of normal voice data and spliced voice data, performing differential operation and normalization operation on the linear prediction analysis characteristics and the pitch characteristics, and constructing a binary model based on the normal voice data and the spliced voice data by taking the linear prediction analysis characteristics and the pitch characteristics after the differential operation and the normalization operation as training inputs of a long-short-term memory network and a convolutional neural network;
the training module is used for training the binary classification model into a spliced-voice model using the LSTM network and the CNN; and
the recognition module is used for recognizing spliced voice in voice data according to the trained binary classification model.
5. The apparatus for recognizing spliced voice according to claim 4, wherein the training module is specifically configured to:
extract acoustic features for the binary classification model, input the extracted acoustic features into the LSTM network and the CNN, and train the binary classification model with the LSTM network and the CNN.
6. The apparatus for recognizing spliced voice according to claim 4, further comprising:
an updating module;
wherein the updating module is used for updating the parameters of the LSTM network and the CNN through a cross-entropy loss function and an optimization algorithm, and for retraining the binary classification model for a preset number of iterations using the updated LSTM network and CNN.
7. A spliced voice recognition apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for recognizing spliced voice according to any one of claims 1 to 3.
8. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for recognizing spliced voice according to any one of claims 1 to 3.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010002558.9A | 2020-01-02 | 2020-01-02 | Method, device and equipment for recognizing spliced voice |
Publications (2)
| Publication Number | Publication Date |
| --- | --- |
| CN111009238A | 2020-04-14 |
| CN111009238B | 2023-06-23 |
Family
ID=70120411
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010002558.9A (CN111009238B, active) | Method, device and equipment for recognizing spliced voice | 2020-01-02 | 2020-01-02 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN | CN111009238B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477235B (en) * | 2020-04-15 | 2023-05-05 | 厦门快商通科技股份有限公司 | Voiceprint acquisition method, voiceprint acquisition device and voiceprint acquisition equipment |
CN111599351A (en) * | 2020-04-30 | 2020-08-28 | 厦门快商通科技股份有限公司 | Voice recognition method, device and equipment |
CN111583946A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice signal enhancement method, device and equipment |
CN111583947A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice enhancement method, device and equipment |
CN113516969B (en) * | 2021-09-14 | 2021-12-14 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102456345A (en) * | 2010-10-19 | 2012-05-16 | 盛乐信息技术(上海)有限公司 | Concatenated speech detection system and method |
CN108288470A (en) * | 2017-01-10 | 2018-07-17 | 富士通株式会社 | Auth method based on vocal print and device |
CN109376264A (en) * | 2018-11-09 | 2019-02-22 | 广州势必可赢网络科技有限公司 | A kind of audio-frequency detection, device, equipment and computer readable storage medium |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10276166B2 (en) * | 2014-07-22 | 2019-04-30 | Nuance Communications, Inc. | Method and apparatus for detecting splicing attacks on a speaker verification system |
- 2020-01-02: Application CN202010002558.9A filed in China; granted as CN111009238B (status: active)
Also Published As
Publication number | Publication date |
---|---|
CN111009238A (en) | 2020-04-14 |
Legal Events
| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |