CN114596879B - False voice detection method and device, electronic equipment and storage medium - Google Patents

False voice detection method and device, electronic equipment and storage medium

Info

Publication number
CN114596879B
CN114596879B (application CN202210297859.8A)
Authority
CN
China
Prior art keywords
band
voice
full
features
sub
Prior art date
Legal status
Active
Application number
CN202210297859.8A
Other languages
Chinese (zh)
Other versions
CN114596879A (en)
Inventor
孟凡芹
郑榕
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202210297859.8A
Publication of CN114596879A
Application granted
Publication of CN114596879B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a false voice detection method and apparatus, an electronic device and a storage medium. The false voice detection method comprises the following steps: acquiring a voice to be detected; inputting the voice to be detected into an embedded feature extraction network layer of a voice detection model to determine a full-band voice feature and a plurality of sub-band voice features; inputting the full-band voice feature and the plurality of sub-band voice features into a combined attention network layer to determine a full-band local feature and a plurality of sub-band local features, the full-band local feature and the sub-band local features each being determined by feature extraction in at least one attention dimension; inputting the full-band local feature and the plurality of sub-band local features into a fusion attention network layer to determine a target voice fusion feature; and determining, based on the target voice fusion feature, whether the voice to be detected is false voice. Because the voice to be detected is input directly into the voice detection model and the full-band and sub-band voice features are extracted in different attention dimensions, the accuracy of false voice recognition can be improved.

Description

False voice detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech detection technologies, and in particular, to a method and an apparatus for detecting false speech, an electronic device, and a storage medium.
Background
False speech can be generated in many ways, such as playback, speech synthesis, speech conversion and splicing. Recording devices differ greatly from one another, speech synthesis and conversion methods are diverse, and the spectral influence of each generation mode is distributed over different frequency regions. These factors make it very hard to distinguish false speech from real speech, so the accuracy of current false speech detection is low.
At present, voiceprint feature data are generally extracted from false speech and real speech separately, typically as Mel cepstrum coefficient features, and fed into a network that is trained iteratively to finally obtain a binary classification model of false and real speech. However, this approach usually attends only to the information of a certain sub-band, or attends to all of the speech information without any emphasis, which reduces the accuracy of the final true/false decision. How to determine false speech quickly and accurately has therefore become a problem to be solved urgently.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device and a storage medium for detecting a false speech, so as to improve the accuracy of false speech recognition.
The embodiment of the application provides a false voice detection method, which comprises the following steps:
acquiring a voice to be detected;
inputting the voice to be detected into an embedded feature extraction network layer of a pre-trained voice detection model, and determining full-band voice features and a plurality of sub-band voice features;
inputting the full-band voice features and the plurality of sub-band voice features into a combined attention network layer of the voice detection model, and determining full-band local features and a plurality of sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension;
inputting the full-band local features and the sub-band local features into a fusion attention network layer of a voice detection model to determine target voice fusion features;
and determining whether the voice to be detected is false voice or not based on the target voice fusion characteristics.
In one possible implementation, the embedded feature extraction network layer includes a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit, and inputting the speech to be detected into the embedded feature extraction network layer of the pre-trained speech detection model to determine the full-band speech features and the plurality of sub-band speech features comprises the following steps:
inputting the voice to be detected to the full-band embedded feature extraction unit, and determining the full-band voice feature;
inputting the voice to be detected into the sub-band embedded feature extraction unit, dividing the voice to be detected into a plurality of sub-band regions according to frequency, and respectively determining the sub-band voice features corresponding to each sub-band region.
In one possible embodiment, the full-band local feature is determined by:
inputting the full-band voice features to a time attention unit, and performing feature learning on the full-band voice features on a time attention dimension to determine full-band first voice features;
inputting the full-band voice feature into a spectrum attention unit, and performing feature learning on the full-band voice feature on a spectrum attention dimension to determine a full-band second voice feature;
inputting the full-band voice features into a channel attention unit, and performing feature learning on the full-band voice features on a channel attention dimension to determine a full-band third voice feature;
determining the full-band local feature according to the full-band first voice feature, the full-band second voice feature and the full-band third voice feature;
wherein the combined attention network layer includes the temporal attention unit, the spectral attention unit, and the channel attention unit.
In a possible implementation manner, the inputting of the full-band local feature and the plurality of sub-band local features into the fusion attention network layer of the speech detection model to determine a target speech fusion feature includes:
performing feature fusion on the plurality of sub-band local features to determine a combined sub-band local feature;
and performing feature fusion on the combined sub-band local features and the full-band local features to determine target voice fusion features.
In one possible implementation, before inputting the full-band speech feature and the plurality of sub-band speech features into the combined attention network layer of the speech detection model and determining the full-band local feature and the plurality of sub-band local features, the detection method further includes:
inputting the full-band speech features and the plurality of sub-band speech features into a coding network layer of the speech detection model, and coding the full-band speech features and the plurality of sub-band speech features to obtain the coded full-band speech features and the plurality of coded sub-band speech features;
the inputting the full-band speech feature and the plurality of sub-band speech features into the combined attention network layer of the speech detection model, determining a full-band local feature and a plurality of sub-band local features, including:
and inputting the encoded full-band speech features and the encoded plurality of sub-band speech features into a combined attention network layer of the speech detection model, and determining full-band local features and a plurality of sub-band local features.
In a possible implementation manner, the determining whether the speech to be detected is false speech based on the target speech fusion feature includes:
carrying out full-connection processing on the target voice fusion characteristics to determine a false voice score of the voice to be detected;
judging whether the false voice score of the voice to be detected is larger than or equal to a preset false voice score;
if so, determining the voice to be detected as false voice;
if not, determining that the voice to be detected is real voice.
The embodiment of the present application further provides a detection apparatus for false voice, where the detection apparatus includes:
the acquisition module is used for acquiring the voice to be detected;
the feature extraction module is used for inputting the voice to be detected to an embedded feature extraction network layer of a pre-trained voice detection model and determining full-band voice features and a plurality of sub-band voice features;
a local feature determination module, configured to input the full-band speech feature and the multiple sub-band speech features into a combined attention network layer of the speech detection model, and determine a full-band local feature and multiple sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension;
the feature fusion module is used for inputting the full-band local features and the sub-band local features into a fusion attention network layer of a voice detection model and determining target voice fusion features;
and the judging module is used for determining whether the voice to be detected is false voice or not based on the target voice fusion characteristics.
In one possible implementation, for the embedded feature extraction network layer, the feature extraction module includes a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit; when the voice to be detected is input to the embedded feature extraction network layer of the pre-trained voice detection model to determine a full-band voice feature and a plurality of sub-band voice features, the feature extraction module is specifically configured to:
inputting the voice to be detected to the full-band embedded feature extraction unit to determine the full-band voice feature;
and inputting the voice to be detected to the sub-band embedded feature extraction unit, dividing the voice to be detected into a plurality of sub-band regions according to frequency, and respectively determining the sub-band voice features corresponding to each sub-band region.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of detecting false speech as described above.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method for detecting false speech.
The application provides a false voice detection method and apparatus, an electronic device and a storage medium. In the false voice detection method, the voice to be detected is input directly into the embedded feature extraction network layer for feature extraction, which avoids loss of voice features and improves their effectiveness; the resulting full-band voice feature and plurality of sub-band voice features are then input into the combined attention network layer, where the full-band and sub-band voice features are extracted in different attention dimensions, so that the accuracy of false voice recognition can be improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope, and that those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a method for detecting false speech according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a network structure of a speech detection model according to an embodiment of the present application;
FIG. 3 is a flow chart of another method for detecting false speech according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for detecting false speech according to an embodiment of the present application;
fig. 5 is a second schematic structural diagram of a false speech detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. It should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit its scope of protection, and that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application; the operations of a flowchart may be performed out of order, and steps with no logical dependency may be performed in reverse order or concurrently. Under the guidance of this application, one skilled in the art may add one or more other operations to a flowchart or remove one or more operations from it.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be obtained by a person skilled in the art without making any inventive step based on the embodiments of the present application, fall within the scope of protection of the present application.
So that one skilled in the art can use the contents of the present disclosure, the following embodiments are presented in conjunction with the specific application scenario of determining false speech. One skilled in the art can apply the general principles defined herein to other embodiments and application scenarios without departing from the spirit and scope of the present disclosure.
The method, the apparatus, the electronic device, or the computer-readable storage medium described in the embodiments of the present application may be applied to any scenario in which a determination on a false voice needs to be performed, and the embodiments of the present application do not limit a specific application scenario.
First, an application scenario to which the present application is applicable will be described. The method and the device can be applied to the technical field of false voice detection.
False speech can be generated in many ways, such as playback, speech synthesis, speech conversion and splicing. Recording devices differ greatly from one another, speech synthesis and conversion methods are diverse, and the spectral influence of each generation mode is distributed over different frequency regions. These factors make it very hard to distinguish false speech from real speech, so the accuracy of current false speech detection is low.
Research shows that, at the present stage, voiceprint feature data are generally extracted from false speech and real speech separately, typically as Mel cepstrum coefficient features, and fed into a network that is trained iteratively to finally obtain a binary classification model of false and real speech. However, this approach usually attends only to the information of a certain sub-band, or attends to all of the speech information without any emphasis, which reduces the accuracy of the final true/false decision. How to determine false speech quickly and accurately has therefore become a problem to be solved urgently.
Based on this, the embodiment of the application provides a method and a device for detecting false voices, an electronic device and a storage medium, and the accuracy of false voice recognition can be improved by directly inputting voices to be detected into a voice detection model and extracting voice features of full frequency bands and sub frequency bands on different attention dimensions.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for detecting false speech according to an embodiment of the present application. As shown in fig. 1, a detection method provided in an embodiment of the present application includes:
s101: and acquiring the voice to be detected.
In this step, a section of speech to be detected can be acquired from a recording device; the manner of acquiring the speech to be detected is not limited here.
Here, the speech to be detected is obtained after encoding-format processing: the original speech may be in any of several encoding formats, such as mp3, wav or flac, while the speech generally used by the model is in pcm format, so the original speech needs to be converted into pcm-format speech in preparation for inputting the speech to be detected into the speech detection model.
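As a hedged illustration of this preparation step (librosa and soundfile are stand-in tools chosen for this sketch, and the 16 kHz sampling rate, which yields the 0 to 8000 Hz band used in the embodiment below, is an assumption rather than a value stated here):

```python
# Minimal sketch: decode mp3/wav/flac/... and re-encode as 16-bit PCM WAV.
import librosa
import soundfile as sf

def to_pcm_wav(src_path: str, dst_path: str, sr: int = 16000) -> None:
    audio, _ = librosa.load(src_path, sr=sr, mono=True)  # decodes most common formats
    sf.write(dst_path, audio, sr, subtype="PCM_16")      # writes 16-bit PCM samples

to_pcm_wav("speech.mp3", "speech_pcm.wav")
```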
S102: and inputting the voice to be detected into an embedded feature extraction network layer of a pre-trained voice detection model, and determining full-band voice features and a plurality of sub-band voice features.
In the step, the voice to be detected is input to an embedded feature extraction network layer of a pre-trained voice detection model for feature extraction, and full-band voice features and a plurality of sub-band voice features are determined.
The full-band voice feature is obtained by extracting features over the full frequency range of the voice to be detected; the sub-band voice features are obtained by dividing the frequency range of the voice to be detected into a plurality of sub-band regions and extracting features for each sub-band region.
The voice to be detected is input directly into the embedded feature extraction network layer for feature extraction, rather than having its features extracted first and then input into the voice detection model; this avoids the problem that some features of the voice to be detected go missing during feature extraction and degrade detection performance.
Here, the speech detection model is used for false speech detection of the speech to be detected. The speech detection model comprises an embedded feature extraction network layer (containing a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit), a combined attention network layer (containing a time attention unit, a spectrum attention unit and a channel attention unit) and a fusion attention network layer, so that inputting the speech to be detected into the speech detection model can quickly and accurately output a false speech detection result.
Therefore, an end-to-end architecture is adopted, the characteristics of the voice to be detected do not need to be extracted independently, risks such as accidental loss of the characteristics are avoided, and the effectiveness of the characteristics is improved.
Further, the embedded feature extraction network layer comprises a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit, and inputting the speech to be detected into the embedded feature extraction network layer of the pre-trained speech detection model to determine the full-band speech features and the plurality of sub-band speech features comprises the following steps:
a: and inputting the voice to be detected to the full-band embedded feature extraction unit, and determining the full-band voice features.
Here, the embedded feature extraction network layer includes a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit, and the full-band embedded feature extraction unit performs speech feature extraction only on sequence information of speech to be detected in a full frequency range to determine a full-band speech feature.
The full-band voice feature represents voice feature information of the voice to be detected in the full-frequency range.
B: inputting the voice to be detected into the sub-band embedded feature extraction unit, dividing the voice to be detected into a plurality of sub-band regions according to frequency, and respectively determining the sub-band voice features corresponding to each sub-band region.
Here, the subband embedding feature extracting unit divides the speech to be detected into a plurality of subband regions according to frequency, performs feature extraction on the speech sequence information corresponding to each subband region, and determines a subband speech feature corresponding to each subband region.
Here, the frequency range may be divided according to expert experience or at preset intervals to obtain the plurality of sub-band regions.
The sub-band embedded feature extraction unit and the full-band embedded feature extraction unit each consist of two one-dimensional convolutional neural networks with a convolution kernel size of 32, one two-dimensional BatchNorm layer, and a ReLU activation function.
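A minimal PyTorch sketch of such a unit is given below; the channel count and strides are illustrative assumptions, and the BatchNorm is kept one-dimensional so the 1-D pipeline runs as written:

```python
import torch
import torch.nn as nn

class EmbeddedFeatureExtractor(nn.Module):
    """Sketch: two 1-D convolutions (kernel size 32), BatchNorm, ReLU."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=32, stride=4),
            nn.Conv1d(channels, channels, kernel_size=32, stride=4),
            nn.BatchNorm1d(channels),  # stands in for the "2-dimensional BatchNorm" above
            nn.ReLU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) raw speech, no hand-crafted features
        return self.net(waveform)

feats = EmbeddedFeatureExtractor()(torch.randn(2, 1, 16000))  # -> (2, 64, 991)
```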
In a specific embodiment, the effective audio of the speech to be detected, covering a frequency range of 8000 Hz, is copied into two parts that are input into the full-band embedded feature extraction unit and the sub-band embedded feature extraction unit respectively. The full-band embedded feature extraction unit directly extracts speech features from the sequence information of the speech to be detected over the whole 8000 Hz range to determine the full-band speech feature. The sub-band embedded feature extraction unit divides the 8000 Hz frequency range evenly into five sub-band regions, [0, 1600], [1600, 3200], [3200, 4800], [4800, 6400] and [6400, 8000] Hz, and then performs feature extraction on the five sub-band regions to determine the sub-band speech feature corresponding to each sub-band region, as sketched below.
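As a hedged sketch of the five-way split (it operates on a feature map whose last axis spans 0 to 8000 Hz; the (batch, channel, time, frequency) layout is an assumption for illustration):

```python
import torch

def split_subbands(features: torch.Tensor, num_bands: int = 5) -> list[torch.Tensor]:
    """Split a (batch, channel, time, freq) map into equal frequency slices."""
    return list(torch.chunk(features, num_bands, dim=-1))

full = torch.randn(2, 64, 100, 80)  # toy feature map; frequency axis covers 0-8000 Hz
bands = split_subbands(full)        # five (2, 64, 100, 16) slices: [0,1600], [1600,3200], ...
```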
In this way, extracting features with the sub-band embedded feature extraction unit allows the speech features of the speech to be detected at different frequencies to be further trained and learned, so that the local feature attributes of the speech at different frequencies receive attention and the detection accuracy of the speech detection model is improved.
S103: inputting the full-band voice features and the plurality of sub-band voice features into a combined attention network layer of the voice detection model, and determining full-band local features and a plurality of sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension.
In this step, the full-band speech features and the plurality of sub-band speech features are input to the combined attention network layer, feature extraction is performed on the full-band speech features and the plurality of sub-band speech features in the attention dimension, and full-band local features and the plurality of sub-band local features are determined respectively.
Here, the attention dimension includes a time attention dimension, a frequency attention dimension, a channel attention dimension, and the like.
Further, the full-band local features are determined by:
(1): and inputting the full-band voice features to a time attention unit, and performing feature learning on the full-band voice features on a time attention dimension to determine a full-band first voice feature.
And inputting the full-band voice features into a time attention unit, and performing feature learning on the full-band voice features in a time attention dimension to determine the first full-band voice features.
The full-band first speech feature only contains feature information of the full-band speech feature on time attention and does not contain feature information of other attention dimensions.
Wherein the combined attention network layer includes the temporal attention unit, the spectral attention unit, and the channel attention unit.
(2): and inputting the full-band voice feature into a spectrum attention unit, and performing feature learning on the full-band voice feature on a spectrum attention dimension to determine a full-band second voice feature.
And inputting the full-band voice features into a frequency attention unit, and performing feature learning on the full-band voice features in a frequency attention dimension to determine a full-band second voice feature.
The full-band second speech feature only contains feature information of the full-band speech feature on frequency attention and does not contain feature information of other attention dimensions.
(3): and inputting the full-band voice features into a channel attention unit, and performing feature learning on the full-band voice features in a channel attention dimension to determine a full-band third voice feature.
And inputting the full-band voice features into the channel attention unit, and performing feature learning on the full-band voice features in the channel attention dimension to determine the full-band third voice features.
And the full-band third speech feature only contains feature information of the full-band speech feature on the channel attention and does not contain feature information of other attention dimensions.
(4): and determining the full-band local features according to the full-band first voice features, the full-band second voice features and the full-band third voice features.
And performing feature fusion on the full-band first voice feature, the full-band second voice feature and the full-band third voice feature to determine a full-band local feature.
The full-band local features carry feature information of the full-band speech features in a time attention dimension, a frequency attention dimension and a channel attention dimension.
Here, the time attention unit, the spectral attention unit and the channel attention unit each learn high-level features of the speech from a different attention dimension. The network structure of the attention units may adopt a gated attention network (GaAN). Unlike the conventional multi-head attention mechanism, GaAN does not assign equal weight to every head; instead it introduces a self-attention mechanism that computes a different weight for each head, using a convolutional sub-network over a central node and its neighbours to generate the gate values.
In this way, the time attention unit, the spectrum attention unit and the channel attention unit learn the full-band and sub-band speech features in time, frequency and channel respectively, and while one attention dimension is being learned the influence of the other attention dimensions is shielded. The high-level features of the full-band and sub-band speech features are therefore extracted more accurately, and the local features of the speech to be detected receive attention from different attention dimensions.
Further, for each sub-band local feature, the sub-band local feature is determined by:
inputting the sub-band voice features to a time attention unit, and performing feature learning on the sub-band voice features on a time attention dimension to determine a sub-band first voice feature;
inputting the sub-band voice features into a spectrum attention unit, and performing feature learning on the sub-band voice features on a spectrum attention dimension to determine a second sub-band voice feature;
inputting the sub-band voice features into a channel attention unit, and performing feature learning on the sub-band voice features on a channel attention dimension to determine a third voice feature of a sub-band;
and determining the local feature of the sub-band according to the first voice feature of the sub-band, the second voice feature of the sub-band and the third voice feature of the sub-band.
For example, for each subband speech feature, the subband speech feature is input to the time attention unit to obtain a subband first speech feature a, the subband speech feature is input to the frequency attention unit to obtain a subband second speech feature b, the subband speech feature is input to the channel attention unit to obtain a subband third speech feature c, and the subband local feature a is determined by multiplying the subband first speech feature a, the subband second speech feature b, and the subband third speech feature c.
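A minimal sketch of this combination step follows; the three attention modules are assumed interfaces standing in for the patented units:

```python
import torch
import torch.nn as nn

class CombinedAttention(nn.Module):
    """Element-wise product a * b * c of the three attention outputs."""
    def __init__(self, time_attn: nn.Module, freq_attn: nn.Module, chan_attn: nn.Module):
        super().__init__()
        self.time_attn = time_attn  # feature learning in the time attention dimension
        self.freq_attn = freq_attn  # feature learning in the spectral attention dimension
        self.chan_attn = chan_attn  # feature learning in the channel attention dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.time_attn(x)
        b = self.freq_attn(x)
        c = self.chan_attn(x)
        return a * b * c  # local feature of the sub-band (or full band)
```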
For example, the speech feature input to the combined attention network layer is $x \in \mathbb{R}^{C \times T \times F}$, where $C$ denotes the channel dimension, $T$ the time dimension and $F$ the frequency dimension. When $x$ is input to the time attention unit, the spectral attention unit and the channel attention unit respectively, it is compressed and reshaped through the max pooling layer so that the weights are redistributed. For a sub-band speech feature, the feature input to the time attention unit of the combined attention network layer is $x^t_i$, which after compression and reshaping becomes $\tilde{x}^t \in \mathbb{R}^{m \times n}$ and produces the output $x^t_o$; the sub-band speech feature input to the spectral attention unit is $x^f_i$, which after compression and reshaping becomes $\tilde{x}^f$ and produces the output $x^f_o$; and the sub-band speech feature input to the channel attention unit is $x^c_i$, which after compression and reshaping becomes $\tilde{x}^c$ and produces the output $x^c_o$. Here $K$ is the number of sub-bands, $m$ and $n$ are the feature dimensions, the superscripts $t$, $f$ and $c$ denote the time, spectral and channel attention units, and the subscripts $i$ and $o$ denote input and output. For the full-band speech feature, the feature input to the time attention unit of the combined attention network layer is the compressed and reshaped feature $x$, with output $g^t_o$; the full-band speech feature input to the spectral attention unit is the compressed and reshaped $x$, with output $g^f_o$; and the full-band speech feature input to the channel attention unit is the compressed and reshaped $x$, with output $g^c_o$, where $g$ denotes a full-band speech feature.
In a specific embodiment, the full-band speech feature and the plurality of sub-band speech features are input into the time attention unit, the frequency attention unit and the channel attention unit of the combined attention network layer, so that local features of the speech receive attention from different attention dimensions in the time domain, frequency domain and channel domain of the full-band and sub-band speech features respectively.
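One plausible form of a single attention unit is sketched below, following the max-pooling compression and re-weighting described above; the gating layer and all sizes are assumptions, not the patented structure:

```python
import torch
import torch.nn as nn

class TimeAttention(nn.Module):
    """Sketch: re-weights a (batch, C, T, F) feature map along the time axis."""
    def __init__(self, time_steps: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(time_steps, time_steps), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        squeezed = x.amax(dim=(1, 3))         # max-pool away channel and frequency -> (batch, T)
        weights = self.gate(squeezed)         # redistributed per-time-step weights
        return x * weights[:, None, :, None]  # broadcast the weights back over C and F

y = TimeAttention(time_steps=100)(torch.randn(2, 64, 100, 16))  # same shape, re-weighted
```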
S104: and inputting the full-band local features and the sub-band local features into a fusion attention network layer of a voice detection model, and determining target voice fusion features.
In the step, the full-band local features and the multiple sub-band local features are input to a fusion attention network layer, feature fusion is carried out on the full-band local features and the multiple sub-band local features, and target voice fusion features are determined.
The fusion attention network layer performs feature fusion on the sub-band local features and the full-band local feature screened out by the combined attention network layer. In this way the model not only attends to local features of the speech to be detected in the different attention dimensions of time, frequency and channel, but also trains on the fused features of those attention dimensions, learning the specific information of the speech to be detected more comprehensively so that false speech and real speech can be distinguished better.
Further, the inputting of the full-band local feature and the plurality of sub-band local features into the fusion attention network layer of the speech detection model to determine the target speech fusion feature includes:
a: and performing feature fusion on the plurality of sub-band local features to determine a combined sub-band local feature.
Feature fusion is performed on the plurality of sub-band local features by feature addition to determine a combined sub-band local feature; the manner of feature fusion is not limited here.
The plurality of sub-band local features correspond to the plurality of sub-band speech features.
b: and performing feature fusion on the combined sub-band local features and the full-band local features to determine target voice fusion features.
Feature fusion is performed on the combined sub-band local feature and the full-band local feature by feature addition to determine the target speech fusion feature; again, the manner of feature fusion is not limited here.
For example, 10 subband local features a are subjected to feature fusion in a feature addition manner to obtain a combined subband local feature B, and the combined subband local feature B and a full-band local feature C are subjected to feature fusion in a feature addition manner to obtain a target voice fusion feature D.
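A minimal sketch of this additive fusion (the shapes of all local features are assumed compatible):

```python
import torch

def fuse_features(subband_locals: list[torch.Tensor], fullband_local: torch.Tensor) -> torch.Tensor:
    combined = torch.stack(subband_locals).sum(dim=0)  # combined sub-band local feature B
    return combined + fullband_local                   # target voice fusion feature D = B + C
```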
S105: and determining whether the voice to be detected is false voice or not based on the target voice fusion characteristics.
In the step, whether the voice to be detected is false voice is determined by using the target voice fusion characteristic.
(1): performing full-connection processing on the target voice fusion features to determine a false voice score of the voice to be detected.
Full-connection calculation processing is performed on the target voice fusion features to determine the false voice score of the voice to be detected.
(2): judging whether the false voice score of the voice to be detected is greater than or equal to a preset false voice score.
The preset false voice score can be set according to expert experience.
(3): if so, determining that the voice to be detected is false voice; if not, determining that the voice to be detected is real voice.
The false speech can be artificially synthesized speech, and the real speech is unprocessed speaker speech.
In a specific embodiment, the determined false voice score of the voice to be detected is compared with the preset false voice score: when the false voice score is greater than or equal to the preset false voice score, the voice to be detected is false voice; when it is less than the preset false voice score, the voice to be detected is real voice.
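A hedged sketch of this decision step is given below; the sigmoid output and the 0.5 default are illustrative assumptions, since the description only requires a score compared against a preset value:

```python
import torch
import torch.nn as nn

class FalseVoiceHead(nn.Module):
    """Full-connection scoring followed by a threshold comparison."""
    def __init__(self, feat_dim: int, preset_score: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)
        self.preset_score = preset_score

    def forward(self, fusion_feat: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        score = torch.sigmoid(self.fc(fusion_feat)).squeeze(-1)  # false voice score
        return score, score >= self.preset_score                 # True -> false voice, False -> real voice
```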
The application provides a false voice detection method, which comprises the following steps: acquiring a voice to be detected; inputting the voice to be detected into an embedded feature extraction network layer of the voice detection model, and determining full-band voice features and a plurality of sub-band voice features; inputting the full-band voice features and the plurality of sub-band voice features into a combined attention network layer, and determining full-band local features and the plurality of sub-band local features; the full-band local features and the sub-band local features are determined by feature extraction on at least one attention dimension; inputting the full-band local features and the multiple sub-band local features into a fusion attention network layer to determine target voice fusion features; and determining whether the voice to be detected is false voice or not based on the target voice fusion characteristics.
Therefore, the voice to be detected is directly input into the embedded feature extraction network layer for feature extraction, so that the voice feature loss is avoided, the effectiveness of the voice feature is improved, the full-frequency-band voice feature and the sub-frequency-band voice features are input into the combined attention network layer, the local features of the voice are concerned in different attention dimensions, and the accuracy of distinguishing the false voice is improved.
Referring to fig. 2, fig. 2 is a detailed flowchart illustrating the network structure of a speech detection model according to an embodiment of the present application. As shown in fig. 2, the speech to be detected is input into the full-band embedded feature extraction unit and the sub-band embedded feature extraction unit of the speech detection model for feature extraction, yielding a full-band speech feature and a plurality of sub-band speech features respectively. These features are input into the coding network layer for encoding processing. The encoded full-band speech feature and the encoded plurality of sub-band speech features are then input into the combined attention network layer, which determines the full-band local feature and the plurality of sub-band local features. The full-band local feature and the plurality of sub-band local features are input into the fusion attention network layer, where the plurality of sub-band local features are fused into a combined sub-band local feature, which is in turn fused with the full-band local feature to determine the target speech fusion feature. Finally, full-connection processing is performed on the target speech fusion feature, and whether the speech to be detected is false speech is output.
As shown in FIG. 2, the various network layers of the speech detection model are illustrated as follows:
The full-band embedded feature extraction unit directly extracts the full-band speech feature of the speech to be detected; the sub-band embedded feature extraction unit directly extracts speech features for the plurality of sub-band regions of the speech to be detected; the coding network layer performs dimensionality reduction and reshaping on its input features; the time attention unit, the frequency attention unit and the channel attention unit perform feature learning on the input features in the time, frequency and channel attention dimensions respectively; and the fusion attention network layer performs feature fusion on its input features.
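The data flow of fig. 2 can be summarized in a hedged wiring sketch; every sub-module below is a stand-in (such as the classes sketched earlier in this description), and only the connections between layers are illustrated:

```python
import torch
import torch.nn as nn

class SpeechDetectionModel(nn.Module):
    def __init__(self, full_extract, sub_extract, encoder, combined_attn, head):
        super().__init__()
        self.full_extract = full_extract    # full-band embedded feature extraction unit
        self.sub_extract = sub_extract      # sub-band embedded feature extraction unit
        self.encoder = encoder              # coding network layer
        self.combined_attn = combined_attn  # time + spectrum + channel attention units
        self.head = head                    # full connection + threshold decision

    def forward(self, waveform: torch.Tensor):
        full_local = self.combined_attn(self.encoder(self.full_extract(waveform)))
        sub_locals = [self.combined_attn(self.encoder(s)) for s in self.sub_extract(waveform)]
        fused = torch.stack(sub_locals).sum(dim=0) + full_local  # additive fusion attention layer
        return self.head(fused.flatten(start_dim=1))             # score and false/real decision
```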
Referring to fig. 3, fig. 3 is a flowchart of another false speech detection method according to an embodiment of the present application. As shown in fig. 3, a detection method provided in an embodiment of the present application includes:
s301: and acquiring the voice to be detected.
S302: and inputting the voice to be detected into an embedded feature extraction network layer of a pre-trained voice detection model, and determining full-band voice features and a plurality of sub-band voice features.
The descriptions of S301 to S302 may refer to the descriptions of S101 to S102, and the same technical effects can be achieved, which are not described in detail.
S303: and inputting the full-band speech features and the plurality of sub-band speech features into a coding network layer of the speech detection model, and coding the full-band speech features and the plurality of sub-band speech features to obtain the coded full-band speech features and the plurality of coded sub-band speech features.
In this step, the full-band speech feature and the plurality of sub-band speech features are input into the coding network layer of the speech detection model and encoded, so that when the encoded full-band speech feature and the encoded plurality of sub-band speech features are later input into each attention unit, they can be compressed and reshaped by the max pooling layer to redistribute the weights, allowing each attention unit to focus only on the information of its corresponding region.
The coding network layer mainly performs dimensionality reduction and reshaping on the input features in preparation for the combined attention network layer in the next step. It consists of four sub-modules, each comprising: a one-dimensional convolution with kernel size 32, a two-dimensional BatchNorm, a two-dimensional convolution with kernel size 64, a SeLU activation function, and a two-dimensional max pooling layer.
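A hedged sketch of one such sub-module follows; the original layer list mixes 1-D and 2-D operations, so the example stays two-dimensional to remain runnable, the kernel size 3 stands in for the listed 32/64 values, and the channel counts are assumptions:

```python
import torch
import torch.nn as nn

def encoder_submodule(in_ch: int, out_ch: int = 64) -> nn.Sequential:
    """One of the four stacked coding-network sub-modules (assumed 2-D layout)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=3, padding=1),   # stands in for the kernel-size-32 convolution
        nn.BatchNorm2d(32),                               # the 2-dimensional BatchNorm
        nn.Conv2d(32, out_ch, kernel_size=3, padding=1),  # stands in for the kernel-size-64 convolution
        nn.SELU(),                                        # SeLU activation
        nn.MaxPool2d(kernel_size=2),                      # 2-D max pooling for dimensionality reduction
    )

encoder = nn.Sequential(*[encoder_submodule(1 if i == 0 else 64) for i in range(4)])
out = encoder(torch.randn(2, 1, 128, 80))  # -> (2, 64, 8, 5) after four poolings
```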
S304: and inputting the encoded full-band voice features and the encoded plurality of sub-band voice features into a combined attention network layer of the voice detection model, and determining full-band local features and a plurality of sub-band local features.
S305: and inputting the full-band local features and the sub-band local features into a fusion attention network layer of a voice detection model, and determining target voice fusion features.
S306: and determining whether the voice to be detected is false voice or not based on the target voice fusion characteristics.
The descriptions of S304 to S306 may refer to the descriptions of S103 to S105, and the same technical effects can be achieved, which is not described in detail herein.
The detection method for false voice provided by the embodiment of the application comprises the following steps: acquiring a voice to be detected; inputting the voice to be detected into an embedded feature extraction network layer of a pre-trained voice detection model, and determining full-band voice features and a plurality of sub-band voice features; inputting the full-band speech features and the plurality of sub-band speech features into a coding network layer of the speech detection model, and coding them to obtain the coded full-band speech features and the coded plurality of sub-band speech features; inputting the encoded full-band voice features and the encoded plurality of sub-band voice features into a combined attention network layer of the voice detection model, and determining full-band local features and a plurality of sub-band local features; inputting the full-band local features and the sub-band local features into a fusion attention network layer of the voice detection model to determine target voice fusion features; and determining whether the voice to be detected is false voice based on the target voice fusion features.
Therefore, the voice to be detected is directly input into the embedded feature extraction network layer for feature extraction, so that the voice feature loss is avoided, the effectiveness of the voice feature is improved, the full-frequency-band voice feature and the sub-frequency-band voice features are input into the combined attention network layer, the local features of the voice are concerned in different attention dimensions, and the accuracy of distinguishing the false voice is improved.
Referring to fig. 4 and 5, fig. 4 is a schematic structural diagram of a false voice detection apparatus according to an embodiment of the present application, and fig. 5 is a second schematic structural diagram of a false voice detection apparatus according to an embodiment of the present application. As shown in fig. 4, the detection apparatus 400 includes:
an obtaining module 410, configured to obtain a voice to be detected;
a feature extraction module 420, configured to input the voice to be detected to an embedded feature extraction network layer of a pre-trained voice detection model, and determine a full-band voice feature and a plurality of sub-band voice features;
a local feature determining module 430, configured to input the full-band speech feature and the multiple sub-band speech features into a combined attention network layer of the speech detection model, and determine a full-band local feature and multiple sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension;
the feature fusion module 440 is configured to input the full-band local feature and the multiple sub-band local features to a fusion attention network layer of a speech detection model, and determine a target speech fusion feature;
the determining module 450 is configured to determine whether the speech to be detected is a false speech based on the target speech fusion feature.
Further, for the embedded feature extraction network layer, the feature extraction module 420 comprises a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit; when the voice to be detected is input to the embedded feature extraction network layer of the pre-trained voice detection model to determine a full-band voice feature and a plurality of sub-band voice features, the feature extraction module 420 is specifically configured to:
inputting the voice to be detected to the full-band embedded feature extraction unit, and determining the full-band voice feature;
inputting the voice to be detected into the sub-band embedded feature extraction unit, dividing the voice to be detected into a plurality of sub-band regions according to frequency, and respectively determining the sub-band voice features corresponding to each sub-band region.
Further, the local feature determination module 430 is configured to determine the full-band local feature by:
inputting the full-band voice features to a time attention unit, and performing feature learning on the full-band voice features on a time attention dimension to determine full-band first voice features;
inputting the full-band voice feature into a spectrum attention unit, and performing feature learning on the full-band voice feature on a spectrum attention dimension to determine a full-band second voice feature;
inputting the full-band voice features to a channel attention unit, and performing feature learning on the full-band voice features on channel attention dimensions to determine full-band third voice features;
determining the full-band local feature according to the full-band first voice feature, the full-band second voice feature and the full-band third voice feature;
wherein the combined attention network layer includes the temporal attention unit, the spectral attention unit, and the channel attention unit.
Further, when the feature fusion module 440 is configured to input the full-band local feature and the multiple sub-band local features into the fusion attention network layer of the speech detection model to determine the target speech fusion feature, the feature fusion module 440 is specifically configured to:
performing feature fusion on the plurality of sub-band local features to determine a combined sub-band local feature;
and performing feature fusion on the combined sub-band local features and the full-band local features to determine target voice fusion features.
Further, as shown in fig. 5, the detecting apparatus 400 further includes an encoding module 460, where the encoding module 460 is configured to:
and inputting the full-frequency-band voice features and the sub-frequency-band voice features into a coding network layer of the voice detection model, and coding the full-frequency-band voice features and the sub-frequency-band voice features to obtain the coded full-frequency-band voice features and the coded sub-frequency-band voice features.
Further, the local feature determination module 430 is further configured to:
and inputting the encoded full-band voice features and the encoded plurality of sub-band voice features into a combined attention network layer of the voice detection model, and determining full-band local features and a plurality of sub-band local features.
Further, when the determining module 450 is configured to determine whether the speech to be detected is a false speech based on the target speech fusion feature, the determining module 450 is specifically configured to:
carrying out full-connection processing on the target voice fusion characteristics to determine a false voice score of the voice to be detected;
judging whether the false voice score of the voice to be detected is larger than or equal to a preset false voice score;
if so, determining that the voice to be detected is false voice;
if not, determining that the voice to be detected is real voice.
The embodiment of the application provides a detection device for false voice, which comprises: the acquisition module is used for acquiring the voice to be detected; the feature extraction module is used for inputting the voice to be detected to an embedded feature extraction network layer of a pre-trained voice detection model and determining full-band voice features and a plurality of sub-band voice features; a local feature determination module, configured to input the full-band speech feature and the multiple sub-band speech features into a combined attention network layer of the speech detection model, and determine a full-band local feature and multiple sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension; the feature fusion module is used for inputting the full-band local features and the sub-band local features into a fusion attention network layer of a voice detection model to determine target voice fusion features; and the judging module is used for determining whether the voice to be detected is false voice or not based on the target voice fusion characteristics.
Therefore, the voice to be detected is directly input into the embedded feature extraction network layer for feature extraction, so that the loss of voice features is avoided, the effectiveness of the voice features is improved, full-band voice features and a plurality of sub-band voice features are input into the combined attention network layer, the local features of the voice are concerned in different attention dimensions, and the accuracy of distinguishing false voice is improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device 600 includes a processor 610, a memory 620, and a bus 630.
The memory 620 stores machine-readable instructions executable by the processor 610, when the electronic device 600 runs, the processor 610 communicates with the memory 620 through the bus 630, and when the machine-readable instructions are executed by the processor 610, the steps of the method for detecting false speech in the method embodiments shown in fig. 1 and fig. 3 may be executed.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a processor, the steps of the false voice detection method in the method embodiments shown in fig. 1 and fig. 3 may be performed.
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some communication interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that the above embodiments are merely specific implementations of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, or readily conceive of changes to them, or make equivalent substitutions of some of their technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application, and shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method for detecting false voice, the method comprising:
acquiring a voice to be detected;
inputting the voice to be detected into an embedded feature extraction network layer of a pre-trained voice detection model, and determining full-band voice features and a plurality of sub-band voice features;
inputting the full-band voice features and the plurality of sub-band voice features into a combined attention network layer of the voice detection model, and determining full-band local features and a plurality of sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension;
inputting the full-band local features and the plurality of sub-band local features into a fusion attention network layer of the voice detection model to determine a target voice fusion feature;
determining whether the voice to be detected is false voice based on the target voice fusion feature;
determining the full-band local features by:
inputting the full-band voice features to a time attention unit, and performing feature learning on the full-band voice features on a time attention dimension to determine full-band first voice features;
inputting the full-band voice feature into a spectrum attention unit, and performing feature learning on the full-band voice feature on a spectrum attention dimension to determine a full-band second voice feature;
inputting the full-band voice features to a channel attention unit, and performing feature learning on the full-band voice features on channel attention dimensions to determine full-band third voice features;
determining the full-band local feature according to the full-band first voice feature, the full-band second voice feature and the full-band third voice feature;
wherein the combined attention network layer includes the temporal attention unit, the spectral attention unit, and the channel attention unit.
2. The detection method according to claim 1, wherein the embedded feature extraction network layer comprises a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit; the inputting the voice to be detected into the embedded feature extraction network layer of the pre-trained voice detection model and determining the full-band voice features and the plurality of sub-band voice features comprises:
inputting the voice to be detected to the full-band embedded feature extraction unit to determine the full-band voice feature;
inputting the voice to be detected into the sub-band embedded feature extraction unit, dividing the voice to be detected into a plurality of sub-band regions according to frequency, and respectively determining the sub-band voice features corresponding to each sub-band region.
3. The detection method according to claim 1, wherein the inputting the full-band local features and the plurality of sub-band local features into the fusion attention network layer of the voice detection model to determine the target voice fusion feature comprises:
performing feature fusion on the plurality of sub-band local features to determine a combined sub-band local feature;
and performing feature fusion on the combined sub-band local features and the full-band local features to determine target voice fusion features.
4. The detection method according to claim 1, wherein before inputting the full-band voice features and the plurality of sub-band voice features into the combined attention network layer of the voice detection model and determining the full-band local features and the plurality of sub-band local features, the method further comprises:
inputting the full-band voice features and the plurality of sub-band voice features into a coding network layer of the voice detection model, and encoding them to obtain the encoded full-band voice features and the encoded plurality of sub-band voice features;
the inputting the full-band voice features and the plurality of sub-band voice features into the combined attention network layer of the voice detection model and determining the full-band local features and the plurality of sub-band local features comprises:
inputting the encoded full-band voice features and the encoded plurality of sub-band voice features into the combined attention network layer of the voice detection model, and determining the full-band local features and the plurality of sub-band local features.
5. The detection method according to claim 1, wherein the determining whether the speech to be detected is false speech based on the target speech fusion feature comprises:
carrying out full-connection processing on the target voice fusion characteristics to determine a false voice score of the voice to be detected;
judging whether the false voice score of the voice to be detected is greater than or equal to a preset false voice score;
if so, determining the voice to be detected as false voice;
if not, determining that the voice to be detected is real voice.
6. An apparatus for detecting false voice, the apparatus comprising:
an acquisition module, configured to acquire a voice to be detected;
a feature extraction module, configured to input the voice to be detected into an embedded feature extraction network layer of a pre-trained voice detection model and determine full-band voice features and a plurality of sub-band voice features;
a local feature determination module, configured to input the full-band voice features and the plurality of sub-band voice features into a combined attention network layer of the voice detection model and determine full-band local features and a plurality of sub-band local features; wherein the full-band local features and the sub-band local features are each determined by feature extraction in at least one attention dimension;
a feature fusion module, configured to input the full-band local features and the plurality of sub-band local features into a fusion attention network layer of the voice detection model and determine a target voice fusion feature;
a judging module, configured to determine whether the voice to be detected is false voice based on the target voice fusion feature;
the local feature determination module is further configured to determine the full-band local features by:
inputting the full-band voice features to a time attention unit, and performing feature learning on the full-band voice features on a time attention dimension to determine full-band first voice features;
inputting the full-band voice features into a spectrum attention unit, and performing feature learning on the full-band voice features on a spectrum attention dimension to determine full-band second voice features;
inputting the full-band voice features to a channel attention unit, and performing feature learning on the full-band voice features on channel attention dimensions to determine full-band third voice features;
determining the full-band local feature according to the full-band first voice feature, the full-band second voice feature and the full-band third voice feature;
wherein the combined attention network layer includes the temporal attention unit, the spectral attention unit, and the channel attention unit.
7. The detection apparatus according to claim 6, wherein the embedded feature extraction network layer comprises a full-band embedded feature extraction unit and a sub-band embedded feature extraction unit; when inputting the voice to be detected into the embedded feature extraction network layer of the pre-trained voice detection model and determining the full-band voice features and the plurality of sub-band voice features, the feature extraction module is specifically configured to:
inputting the voice to be detected to the full-band embedded feature extraction unit, and determining the full-band voice feature;
inputting the voice to be detected to the sub-band embedded feature extraction unit, dividing the voice to be detected into a plurality of sub-band regions according to frequency, and respectively determining the sub-band voice features corresponding to each sub-band region.
8. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, and the machine-readable instructions, when executed by the processor, performing the steps of the method for detecting false voice according to any one of claims 1 to 5.
9. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the method for detecting false voice according to any one of claims 1 to 5.
CN202210297859.8A 2022-03-25 2022-03-25 False voice detection method and device, electronic equipment and storage medium Active CN114596879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210297859.8A CN114596879B (en) 2022-03-25 2022-03-25 False voice detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210297859.8A CN114596879B (en) 2022-03-25 2022-03-25 False voice detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114596879A CN114596879A (en) 2022-06-07
CN114596879B true CN114596879B (en) 2022-12-30

Family

ID=81809954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210297859.8A Active CN114596879B (en) 2022-03-25 2022-03-25 False voice detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114596879B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937455B (en) * 2022-07-21 2022-10-11 中国科学院自动化研究所 Voice detection method and device, equipment and storage medium
CN115662441B (en) * 2022-12-29 2023-03-28 北京远鉴信息技术有限公司 Voice authentication method and device based on self-supervision learning and storage medium
CN116153336B (en) * 2023-04-19 2023-07-21 北京中电慧声科技有限公司 Synthetic voice detection method based on multi-domain information fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180080446A (en) * 2017-01-04 2018-07-12 삼성전자주식회사 Voice recognizing method and voice recognizing appratus

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103165136A (en) * 2011-12-15 2013-06-19 杜比实验室特许公司 Audio processing method and audio processing device
EP3156978A1 (en) * 2015-10-14 2017-04-19 Samsung Electronics Polska Sp. z o.o. A system and a method for secure speaker verification
CN108766464A (en) * 2018-06-06 2018-11-06 华中师范大学 Digital audio based on mains frequency fluctuation super vector distorts automatic testing method
CN110853668A (en) * 2019-09-06 2020-02-28 南京工程学院 Voice tampering detection method based on multi-feature fusion
WO2021051607A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Video data-based fraud detection method and apparatus, computer device, and storage medium
CN111261186A (en) * 2020-01-16 2020-06-09 南京理工大学 Audio sound source separation method based on improved self-attention mechanism and cross-frequency band characteristics
CN111341331A (en) * 2020-02-25 2020-06-26 厦门亿联网络技术股份有限公司 Voice enhancement method, device and medium based on local attention mechanism
CN111680591A (en) * 2020-05-28 2020-09-18 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
CN112466327A (en) * 2020-10-23 2021-03-09 北京百度网讯科技有限公司 Voice processing method and device and electronic equipment
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network
CN113257255A (en) * 2021-07-06 2021-08-13 北京远鉴信息技术有限公司 Method and device for identifying forged voice, electronic equipment and storage medium
CN113539232A (en) * 2021-07-10 2021-10-22 东南大学 Muslim class voice data set-based voice synthesis method
CN113870892A (en) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 Conference recording method, device, equipment and storage medium based on voice recognition
CN113921041A (en) * 2021-10-11 2022-01-11 山东省计算中心(国家超级计算济南中心) Recording equipment identification method and system based on packet convolution attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
One-Class Learning Towards Synthetic Voice; You Zhang; IEEE Signal Processing Letters; 2021-04-28; full text *

Also Published As

Publication number Publication date
CN114596879A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN114596879B (en) False voice detection method and device, electronic equipment and storage medium
Ittichaichareon et al. Speech recognition using MFCC
Hu et al. Pitch‐based gender identification with two‐stage classification
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
KR100888804B1 (en) Method and apparatus for determining sameness and detecting common frame of moving picture data
CN111986699B (en) Sound event detection method based on full convolution network
Ganapathy Multivariate autoregressive spectrogram modeling for noisy speech recognition
CN115083423B (en) Data processing method and device for voice authentication
JPWO2019244298A1 (en) Attribute identification device, attribute identification method, and program
CN115331656A (en) Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile
Thomas et al. Acoustic and data-driven features for robust speech activity detection
CN105283916B (en) Electronic watermark embedded device, electronic watermark embedding method and computer readable recording medium
Prabavathy et al. An enhanced musical instrument classification using deep convolutional neural network
Wang et al. Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
CN115223584B (en) Audio data processing method, device, equipment and storage medium
JP5772957B2 (en) Sound processing apparatus, sound processing system, video processing system, control method, and control program
Mankad et al. On the performance of empirical mode decomposition-based replay spoofing detection in speaker verification systems
Sangeetha et al. Analysis of machine learning algorithms for audio event classification using Mel-frequency cepstral coefficients
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
CN115116469A (en) Feature representation extraction method, feature representation extraction device, feature representation extraction apparatus, feature representation extraction medium, and program product
WO2022204612A1 (en) Harmonics based target speech extraction network
CN111292754A (en) Voice signal processing method, device and equipment
CN114678037B (en) Overlapped voice detection method and device, electronic equipment and storage medium
CN110689875A (en) Language identification method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant