CN108198574B - Sound change detection method and device - Google Patents

Sound change detection method and device

Info

Publication number
CN108198574B
Authority
CN
China
Prior art keywords
voice
detected
detection model
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711475093.3A
Other languages
Chinese (zh)
Other versions
CN108198574A (en)
Inventor
李晋
殷兵
柳林
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201711475093.3A priority Critical patent/CN108198574B/en
Publication of CN108198574A publication Critical patent/CN108198574A/en
Application granted granted Critical
Publication of CN108198574B publication Critical patent/CN108198574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

A sound change detection method and device: voice data to be detected, which is to be authenticated and matched with a target object, is acquired; the voiceprint feature information to be detected matched with the voice data and a voice forgery judgment result are determined with a preset sound change detection model; the similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object is determined to obtain a voiceprint similarity; and whether the voice data to be detected is artificially forged voice-changed voice data is determined according to the voice forgery judgment result and the voiceprint similarity. Because both the voiceprint feature information to be detected and the voice forgery judgment result are determined by the sound change detection model, the detection efficiency and the accuracy of the detection result are greatly improved.

Description

Sound change detection method and device
Technical Field
The invention belongs to the field of information processing, and particularly relates to a sound change detection method and device.
Background
With the development of modern voice signal processing technology, identity authentication based on voiceprint recognition is favored by more and more users. However, under massive-data interference, besides the unavoidable case of two naturally similar voices, artificially forged voice-changed speech can also occur, which seriously affects the accuracy of voiceprint recognition technology.
At present, artificially forged voice-changed speech is generally detected by empirical analysis: speech spectrograms are compared, the spectral differences between genuine and artificially forged voices are observed, and whether the target voice is forged is judged qualitatively. However, such manual observation and qualitative judgment requires feature-point comparison and spectrogram analysis of the voices to be detected one by one; it is time-consuming, not very accurate, and cannot be applied to large-scale voice detection.
Therefore, an efficient and accurate sound change detection scheme is urgently needed to meet the requirement of large-scale voice detection.
Disclosure of Invention
In view of this, the present invention provides a sound change detection method and apparatus to solve the technical problems of low efficiency and low accuracy in the existing voiceprint recognition technology.
In order to achieve the purpose, the invention provides the following technical scheme:
a method of detecting a change of voice, comprising:
acquiring voice data to be tested to be authenticated and matched with a target object;
determining the voiceprint characteristic information to be detected matched with the voice data to be detected and a voice forgery judgment result by using a preset sound change detection model;
determining the similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object to obtain the voiceprint similarity;
and determining whether the voice data to be detected is artificially forged voice-changing voice data or not according to the voice forgery judgment result and the voiceprint similarity.
A sound change detection apparatus comprising:
a to-be-tested data acquisition unit, which is used for acquiring to-be-tested voice data to be authenticated and matched with a target object;
the detection result determining unit is used for determining the voiceprint characteristic information to be detected matched with the voice data to be detected and the voice forgery judgment result by using a preset sound change detection model;
a voiceprint similarity determining unit, configured to determine similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object, so as to obtain voiceprint similarity;
and the forgery result determining unit is used for determining whether the voice data to be detected is artificially forged voice-changing voice data or not according to the voice forgery judgment result and the voiceprint similarity.
According to the above technical scheme, the sound change detection method and device provided by the invention use a preset sound change detection model to determine the voiceprint feature information to be detected and the voice forgery judgment result matched with the voice data to be detected, determine the similarity between that voiceprint feature information and the registered voiceprint feature information of the target object, and then determine, from the voice forgery judgment result and the voiceprint similarity together, whether the voice data to be detected is artificially forged voice-changed voice data. Compared with manually observing and qualitatively judging whether the voice data is artificially forged, this greatly improves the detection efficiency and the accuracy of the detection result.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a sound change detection method provided in an embodiment of the present application;
Fig. 2 is another flowchart of a sound change detection method provided in an embodiment of the present application;
Fig. 3 is yet another flowchart of a sound change detection method provided in an embodiment of the present application;
Fig. 4 is a flowchart of a training process of a sound change detection model provided in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a CNN model provided in the present application;
Fig. 6 is a schematic structural diagram of a sound change detection apparatus provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Before introducing the sound change detection method disclosed in the embodiment of the present application, a brief introduction is first made to a conceptual process of the sound change detection method disclosed in the embodiment of the present application, specifically as follows:
In the existing sound change detection scheme, speech spectrograms are compared mainly by empirical analysis: the spectral differences between genuine and artificially forged voices are observed and compared, and whether the target voice is forged is judged qualitatively. This detection mode relies on a large amount of hands-on work by detection personnel, places high demands on their skills, consumes considerable time, is unsuitable for large-scale voice detection, and has low detection accuracy.
In view of these problems, the present invention uses a preset sound change detection model to determine the voiceprint feature information to be detected, matched with the voice data to be authenticated against a target object, together with a voice forgery judgment result; determines the similarity between that voiceprint feature information and the registered voiceprint feature information of the target object to obtain a voiceprint similarity; and finally determines, from the voice forgery judgment result and the voiceprint similarity, whether the voice data to be detected is artificially forged voice-changed data. This realizes intelligent quantitative detection of voice data and greatly improves detection efficiency and accuracy.
Next, a method of detecting a change in sound disclosed in an embodiment of the present application will be described.
Referring to fig. 1, fig. 1 is a flowchart of a sound variation detection method according to an embodiment of the present application.
As shown in fig. 1, the method includes:
s100: and acquiring voice data to be tested to be authenticated and matched with the target object.
The target object generally refers to an object capable of uttering voice, such as a speaker or the like. A matching relation exists between the target object and the voice data of the target object, the target object matched with the voice data can be identified based on the matching relation, and identity identification of the target object is further achieved.
However, in practical applications, the acquired voice data to be detected may originate from a non-target object or may be forged. Therefore, to ensure the accuracy of target object identity recognition based on voice data, the voice data to be detected that is to be authenticated and matched with the target object must be detected, to determine whether it is artificially forged voice-changed data.
S110: and determining the voiceprint characteristic information to be detected matched with the voice data to be detected and a voice forgery judgment result by using a preset sound change detection model.
The sound change detection model is trained with training voice data labeled both with the class label of the voice generating object and with a label indicating whether the voice is forged. It therefore considers the classification of the voice generating object and the judgment of whether the voice is forged at the same time, detects the voice data to be detected comprehensively, and can effectively improve the accuracy of the detection result.
The detection result comprises the voiceprint feature information to be detected matched with the voice data to be detected and a voice forgery judgment result. The voiceprint feature information reflects the authenticity of the voice data from the angle of the voice generating object's class, while the voice forgery judgment result reflects it from the angle of whether the voice is forged.
In this embodiment, a preset sound change detection model is used to determine the voiceprint feature information to be detected and the voice forgery judgment result matched with the voice data to be detected, instead of manually comparing spectral differences between genuine and artificially forged voices and judging qualitatively. This improves the intelligence and automation of the detection process, greatly shortens the detection time, and improves the detection accuracy.
S120: and determining the similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object to obtain the voiceprint similarity.
The registered voiceprint feature information of the target object is the voiceprint feature information matched with the target object, can be used for representing the identity information of the target object, and is also the real voiceprint feature information of the target object. After determining the voiceprint feature information to be detected matched with the voice data to be detected, combining the registered voiceprint feature information of the target object, and determining the similarity between the voiceprint feature information and the registered voiceprint feature information, namely the voiceprint similarity. The larger the voiceprint similarity is, the higher the possibility that the voice data to be detected is the voice data matched with the target object is shown; the smaller the voiceprint similarity is, the lower the possibility that the voice data to be detected is the voice data matched with the target object is.
The registered voiceprint feature information of the target object may be obtained as follows: when the target object registers, the sound change detection model takes the real voice data input by the target object and outputs the matching voiceprint feature information, which is stored as the target object's registered voiceprint feature information.
S130: and determining whether the voice data to be detected is artificially forged voice-changing voice data or not according to the voice forgery judgment result and the voiceprint similarity.
The voice forgery determination result can be used for representing whether the voice data to be detected is forged or not, or representing the possibility that the voice data to be detected is forged or not. And determining whether the voice data to be detected is artificially forged voice-changing voice data or not by combining the voice forgery judgment result and the voiceprint similarity, so that a more accurate detection result can be obtained.
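To make the flow of steps S100 to S130 concrete, the following is a minimal Python sketch. Every name in it (detection_model, the normalized inner-product similarity, the 0.6 threshold, k = 0.5) is a hypothetical illustration under the formulas of the later embodiments, not the patent's reference implementation.

```python
import numpy as np

def similarity(c, c_enrolled):
    # S120: voiceprint similarity between the c-vector of the voice under
    # test and the target object's registered voiceprint; a normalized
    # inner product is assumed here.
    c = c / np.linalg.norm(c)
    c_enrolled = c_enrolled / np.linalg.norm(c_enrolled)
    return float(c @ c_enrolled)

def fuse(s1, s2, k=0.5):
    # S130: weighted fusion S = k*S1 + (1-k)*S2, per the later embodiment.
    return k * s1 + (1 - k) * s2

def detect_voice_change(speech, enrolled_voiceprint, detection_model,
                        threshold=0.6):
    # S110: the sound change detection model returns both the voiceprint
    # feature information c and the voice forgery judgment score s1.
    c, s1 = detection_model(speech)
    s2 = similarity(c, enrolled_voiceprint)
    # True means "not artificially forged": the identity check may pass.
    return fuse(s1, s2) > threshold
```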
In the sound change detection method provided by this embodiment, a preset sound change detection model is used to determine the voiceprint feature information to be detected and the voice forgery judgment result matched with the voice data to be detected; the similarity between that voiceprint feature information and the registered voiceprint feature information of the target object is then determined; and whether the voice data to be detected is artificially forged voice-changed data is decided from the voice forgery judgment result and the voiceprint similarity together. Compared with manually observing and qualitatively judging whether the voice data is artificially forged, this greatly improves the detection efficiency and the accuracy of the detection result.
Referring to fig. 2, fig. 2 is another flowchart of a sound change detection method according to an embodiment of the present application.
As shown in fig. 2, the method includes:
s200: and acquiring voice data to be tested to be authenticated and matched with the target object.
The step S200 is similar to the step S100 in the foregoing embodiment, and reference may be made to the foregoing embodiment for details, which are not repeated herein.
S210: and inputting the voice data to be detected into a preset sound change detection model.
The sound change detection model is obtained by training with training voice data labeled with the class label of the voice generating object and with a label indicating whether the voice is forged. When the voice data to be detected is detected with the sound change detection model, it must first be input into the model. The sound change detection model generally comprises an input layer, hidden layers and an output layer; inputting the voice data to be detected into the preset model specifically means inputting it into the model's input layer.
S220: acquiring a feature vector output by a public hidden layer of the acoustic change detection model, and determining the voiceprint feature information to be detected matched with the voice data to be detected according to the feature vector;
in this embodiment, the preset change detection model may include two output channels, where a first output channel outputs a class label of the speech generation object, a second output channel outputs whether the speech is forged, and a last hidden layer of the change detection model serves as a common hidden layer of the two output channels.
The sound change detection model can be trained in a multi-task (MultiTask) manner: the two output channels of the model correspond to two training tasks, and a model satisfying the requirements of both tasks is obtained by training the two tasks jointly, as sketched below.
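As a sketch of such joint training, the step below assumes a PyTorch model whose forward pass returns the common-hidden-layer vector and the logits of the two output channels (the CNN sketch after the Fig. 5 description uses the same interface); the unweighted sum of the two losses is an assumption.

```python
import torch.nn.functional as F

def multitask_step(model, optimizer, x, speaker_ids, forged_labels):
    # Channel 1 task: class label of the voice generating object.
    # Channel 2 task: forged-or-not label. Both are trained jointly.
    optimizer.zero_grad()
    h, speaker_logits, forged_logits = model(x)  # h: common hidden layer
    loss = (F.cross_entropy(speaker_logits, speaker_ids)
            + F.cross_entropy(forged_logits, forged_labels))
    loss.backward()
    optimizer.step()
    return loss.item()
```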
After the voice data to be detected is input into a preset sound change detection model, a public hidden layer of the sound change detection model outputs a corresponding feature vector, and the voiceprint feature information to be detected matched with the voice data to be detected can be determined according to the feature vector.
S230: acquiring the forged-or-not result output by the second output channel of the sound change detection model, and determining the voice forgery judgment result according to the forged-or-not result.
While the common hidden layer of the sound change detection model outputs the corresponding feature vector, the second output channel outputs the forged-or-not result matched with the voice data to be detected, and the voice forgery judgment result is determined according to it.
S240: determining the similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object to obtain the voiceprint similarity;
s250: and determining whether the voice data to be detected is artificially forged voice-changing voice data or not according to the voice forgery judgment result and the voiceprint similarity.
Steps S240 to S250 are similar to steps S120 to S130 in the foregoing embodiments, and reference may be made to the foregoing embodiments for details, which are not repeated herein.
In the sound change detection method described above, the common hidden layer of the sound change detection model is used to obtain the feature vector matched with the voice data to be detected, and the second output channel is used to obtain the forged-or-not result; from these, the voiceprint feature information to be detected and the voice forgery judgment result are determined. Combined with the registered voiceprint feature information of the target object, whether the voice data to be detected is artificially forged voice-changed data is then decided from the similarity between the two pieces of voiceprint feature information and the voice forgery judgment result. This realizes quantitative detection of the voice data to be detected, improves the detection efficiency, and further improves the accuracy of the detection result.
Referring to fig. 3, fig. 3 is a flowchart illustrating a sound variation detection method according to an embodiment of the present application.
As shown in fig. 3, the method includes:
s300: and acquiring voice data to be tested to be authenticated and matched with the target object.
The step S300 is similar to the step S100 or S200 in the foregoing embodiment, and reference may be made to the foregoing embodiment for details, which are not repeated herein.
S310: and segmenting the voice data to be detected to obtain a plurality of voice fragments to be detected.
For example, the voice data to be tested is divided into M voice fragments to be tested, wherein M is larger than 1.
In an example, when the voicing detection model is a Convolutional Neural Network (CNN) model, step S310 may include:
A1) Perform Fourier transform on the voice data to be detected to obtain transformed Fourier features.
A2) Window the transformed Fourier features to obtain a plurality of spectrogram segments as the voice segments to be detected.
Specifically, the dimension of the transformed fourier feature is recorded as d, the transformed fourier feature is windowed according to the window length l, and M voice segments to be detected are obtained, wherein the size of each voice segment to be detected is l × d.
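A sketch of this segmentation with assumed parameter values (400-sample frames, hop 160, window length l = 100 frames; d = frame_len // 2 + 1). The framing variant for DNN/LSTM models, described next, is the same operation with l = 1.

```python
import numpy as np

def split_into_segments(signal, frame_len=400, hop=160, win_len_l=100):
    # Short-time Fourier features: one d-dimensional magnitude spectrum
    # per frame of the input signal (a 1-D numpy array).
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    feats = np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, d)

    # Window the feature sequence into M spectrogram segments of size l x d.
    m = n_frames // win_len_l
    return [feats[j * win_len_l : (j + 1) * win_len_l] for j in range(m)]
```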
In another example, when the acoustic change detection model is a Deep Neural Network (DNN) model or a Long Short-Term Memory (LSTM) model, step S310 may include:
B1) Perform Fourier transform on the voice data to be detected to obtain transformed Fourier features;
B2) frame the transformed Fourier features to obtain a plurality of voice frames as the voice segments to be detected.
S320: and inputting each voice segment to be detected into a preset sound change detection model.
The voice change detection model is obtained by training the training voice data which is marked with the class label of the voice generation object and whether the voice forges the label.
S330: and acquiring a feature vector which is output by a public hidden layer of the acoustic change detection model and is matched with each voice segment to be detected.
In this embodiment, the preset change detection model includes two output channels, a first output channel outputs a category label of a speech generation object, a second output channel outputs whether a speech is forged, and a last hidden layer of the change detection model serves as a common hidden layer of the two output channels.
For example, after each voice segment to be detected is input into the preset sound change detection model, the common hidden layer of the model outputs a feature vector h_i, i ∈ [1, M], matched with each voice segment to be detected.
S340: and determining the voice print characteristic information to be detected matched with the voice data to be detected according to the characteristic vector matched with each voice fragment to be detected.
Since the voice data to be detected is composed of all the voice segments to be detected, the feature vectors h_i matched with the voice segments are also matched with the voice data to be detected. Further, the voiceprint feature information (c-vector) to be detected matched with the voice data to be detected can be determined from the feature vectors matched with the individual voice segments.
In one example, the matched voiceprint feature information (c-vector) of the voice data to be tested is calculated by using the following formula:
c = (1/N) · Σ_{i=1}^{N} h_i
where i is the index of the voice segment to be detected, N is the number of voice segments in the voice data to be detected, h_i is the feature vector matched with the i-th voice segment, and c is the voiceprint feature information (c-vector) matched with the voice data to be detected.
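In code, assuming the simple-average reading of the formula above, the c-vector is just the mean of the segment-level feature vectors:

```python
import numpy as np

def c_vector(h_list):
    # c = (1/N) * sum_i h_i over the N segment-level feature vectors h_i
    return np.mean(np.stack(h_list), axis=0)
```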
S350: and acquiring whether the voice corresponding to each voice segment to be detected output by a second output channel of the acoustic change detection model is forged or not.
While the common hidden layer of the sound change detection model outputs the feature vectors h_i matched with the individual voice segments to be detected, the second output channel outputs the forged-or-not result for the voice corresponding to each segment.
In one example, the forged-or-not result is a score s_i; a lower s_i indicates a higher probability that the voice segment is artificially forged.
In another example, the forged-or-not output of the sound change detection model comprises a "yes" label (TRUE) and a "no" label (FALSE), where the "yes" label (TRUE) indicates the voice is forged and the "no" label (FALSE) indicates it is not. Accordingly, the score s_i is the judgment score of the voice segment to be detected for the "no" label (FALSE).
S360: and determining a voice forgery judgment result corresponding to the voice data to be detected according to whether the voice corresponding to each voice segment to be detected is forged or not.
Because the voice data to be detected is composed of the voice segments to be detected, the forged-or-not results s_i corresponding to the voice segments necessarily correspond to the voice data to be detected. Further, the voice forgery judgment result S_1 corresponding to the voice data to be detected can be determined from the forged-or-not results s_i of the individual voice segments.
In one example, the following formula is used to calculate the voice forgery judgment result S_1 corresponding to the voice data to be detected:
S_1 = (1/N) · Σ_{i=1}^{N} s_i
where i is the index of the voice segment to be detected, N is the number of voice segments in the voice data to be detected, s_i is the forged-or-not result of the i-th voice segment, and S_1 is the voice forgery judgment result corresponding to the voice data to be detected.
S370: and determining the similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object to obtain the voiceprint similarity.
In one example, the voiceprint similarity is calculated using the following formula:
S_2 = c^T · c'
where c^T is the transpose of c, c' is the registered voiceprint feature information of the target object, and S_2 is the voiceprint similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object. Specifically, c and c' are both feature vectors.
S380: and determining whether the voice data to be detected is artificially forged voice-changing voice data or not according to the voice forgery judgment result and the voiceprint similarity.
In one example, the voice forgery judgment result S_1 is a voice forgery judgment score; the lower the score, the higher the possibility that the voice data is artificially forged. The voiceprint similarity S_2 is a voiceprint similarity score.
in this example, the step S380 includes:
C1) Perform weighted fusion of the voice forgery judgment score and the voiceprint similarity score, and take the result as the forgery similarity score of the voice data to be detected.
The weighted fusion mode may be weighted addition, weighted multiplication, or the like.
Specifically, taking weighted addition as an example, the following formula is used to calculate the forgery similarity score of the voice data to be detected:
S = k·S_1 + (1 - k)·S_2
where S_1 is the voice forgery judgment score, S_2 is the voiceprint similarity score, S is the forgery similarity score, and k is a weight coefficient, k ∈ [0, 1].
Optionally, when k = 0, S = S_2, that is, the voiceprint similarity score alone is used as the forgery similarity score of the voice data to be detected; when k = 1, S = S_1, that is, the voice forgery judgment score alone is used as the forgery similarity score of the voice data to be detected.
C2) Determine whether the voice data to be detected is artificially forged voice-changed voice data according to the relationship between the forgery similarity score and a preset forgery similarity threshold.
The forgery similarity threshold can be set empirically. For example, a set of voice data whose artificially-forged-or-not status is known is selected; a forgery similarity score is obtained for each item in the set through the above process; and the threshold is then set manually, based on experience, from those scores and the known forged-or-not results.
In an example, when the forgery similarity score is greater than the preset forgery similarity threshold, the voice data to be detected is determined not to be artificially forged voice-changed data and can pass the identity authentication of the target object; when the score is not greater than the threshold, the voice data to be detected is determined to be artificially forged voice-changed data and cannot pass the identity authentication of the target object.
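Putting steps S340 to S380 together, the sketch below follows the formulas above; k, the threshold, and the plain inner-product form of S_2 are assumptions:

```python
import numpy as np

def decide(s_list, c, c_enrolled, k=0.5, threshold=0.6):
    s1 = float(np.mean(s_list))   # S_1 = (1/N) * sum_i s_i
    s2 = float(c @ c_enrolled)    # S_2 = c^T c'
    s = k * s1 + (1 - k) * s2     # S = k*S_1 + (1-k)*S_2
    # Above the threshold: not artificially forged, authentication passes.
    return s > threshold
```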
In the sound change detection method described above, the voice data to be detected is segmented into a plurality of voice segments; the common hidden layer of the sound change detection model yields a feature vector matched with each segment together with the corresponding forged-or-not result; and, combined with the registered voiceprint feature information of the target object, the voiceprint similarity score and the voice forgery judgment score of the voice data are determined, from which it is decided whether the voice data is artificially forged voice-changed data. Quantitative detection of the voice data to be detected is thus achieved through precise numerical results, improving the accuracy of the detection result.
Referring to fig. 4, fig. 4 is a flowchart of a training process of a sound change detection model according to an embodiment of the present application.
The preset sound change detection model in the foregoing embodiments is obtained by training with training voice data labeled with the class label of the voice generating object and with a label indicating whether the voice is forged.
As shown in fig. 4, the training process includes:
s400: the method comprises the steps of obtaining an original training data set, wherein the original training data set comprises training voice data generated by a plurality of voice generation objects, and each training voice data is marked with a class label of the voice generation object and a label whether voice is forged or not.
The class labels of the speech generating objects are used to distinguish between different speech generating objects, in particular the number of class labels of a speech generating object is the same as the number of speech generating objects.
In an example, for a speech generating object, when any one of the category labels is 1, it indicates that the category label matches with the speech generating object, and at this time, the other category labels are all 0, indicating that the other category labels do not match with the speech generating object.
The voice forgery label is used to indicate whether the voice data is forged.
In one example, the voice falsification flags include a "yes" flag (TRUE) and a "no" Flag (FALSE), and for any training voice data, when the "yes" flag (TRUE) is 1 and the "no" Flag (FALSE) is 0, the training voice data is denoted as falsified voice data; when the yes label (TRUE) is 0 and the no label (FALSE) is 1, it indicates that the training voice data is not the falsified voice data.
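For illustration, the label layout described here could be built as follows; the [TRUE, FALSE] ordering of the forgery label is an assumption:

```python
import numpy as np

def make_labels(speaker_index, num_speakers, is_forged):
    class_label = np.zeros(num_speakers)
    class_label[speaker_index] = 1  # 1 = matches this voice generating object
    # Forgery label as [TRUE, FALSE]: [1, 0] = forged, [0, 1] = genuine.
    forgery_label = np.array([1, 0]) if is_forged else np.array([0, 1])
    return class_label, forgery_label
```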
S410: and training the sound variation detection model by using the original training data set to obtain the trained sound variation detection model.
The training process of the acoustic change detection model is realized through the above steps S400-S410.
However, the model obtained through the above training has a huge training-data parameter space and inevitably contains some noise objects or parameters, which affects the accuracy of the sound change detection model when detecting voice data and leaves the model with a huge parameter count and low computational efficiency. Therefore, in other embodiments, after step S410, the training process may further include:
S420: determining the classification accuracy of the trained sound change detection model for each speech generating object corresponding to the training data used in its training process.
In an example, the trained sound change detection model is taken as the pending sound change detection model, the original training data set is taken as the target training data set, and the classification accuracy of the pending model for each speech generating object in the target training data set is obtained.
Specifically, the classification accuracy of each speech generating object can be calculated using the following formula:
R_r = M_r / M_all
where R_r is the classification accuracy for speech generating object r, M_all is the number of training data segments corresponding to object r, and M_r is the number of those segments whose detection result is correct when they are detected with the trained sound change detection model.
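A small sketch of this per-object accuracy computation over the model's predicted object for each training segment:

```python
def classification_accuracy(true_ids, predicted_ids):
    # R_r = M_r / M_all for every speech generating object r.
    totals, correct = {}, {}
    for t, p in zip(true_ids, predicted_ids):
        totals[t] = totals.get(t, 0) + 1            # M_all for object t
        correct[t] = correct.get(t, 0) + (t == p)   # M_r for object t
    return {r: correct[r] / totals[r] for r in totals}
```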
S430: and determining the voice generation object with the classification accuracy lower than the set classification accuracy threshold value as a noise object in the previous training process.
For example, when the classification accuracy R_r of speech generating object r is below a set classification accuracy threshold R_th, the speech generating object r is a noise object.
S440: and performing iterative clipping and iterative training on the sound variation detection model according to the number of the noise objects in the previous round of training.
Specifically, when the number of noise objects is determined not to satisfy the set noise-object-number condition, the sound change detection model is iterated: the network parameters and class labels related to the noise objects in the previous round of training are deleted to obtain a parameter-adjusted sound change detection model, which is then trained further with the training data remaining after the data corresponding to the noise objects is removed. This repeats until, when the classification accuracy of each speech generating object corresponding to the training data of the n-th round is calculated with the model obtained after the n-th round of training, the number of noise objects whose classification accuracy is below the set threshold satisfies the set noise-object-number condition; the sound change detection model obtained after the n-th round of training is then taken as the final model.
In one example, the set noise-object-number condition may include: P < P_max, where P is the ratio of the number of noise objects in the previous round of training to the number of all speech generating objects in the target training data set, and P_max is a preset ratio threshold.
When P ≥ P_max, it is determined that the number of noise objects does not satisfy the set condition. The network parameters and class labels related to the noise objects in the previous round of training are deleted from the pending sound change detection model to obtain a parameter-adjusted model; the training data corresponding to the noise objects are deleted from the target training data set to obtain a new target training data set; the parameter-adjusted model is trained further with the new set, and the resulting model is taken as the new pending sound change detection model. The step of obtaining the classification accuracy of the pending model for each speech generating object in the target training data set is then executed again, until the number of noise objects whose classification accuracy (computed with the model obtained after the n-th round of training) is below the set threshold satisfies the condition.
When P < P_max, the pending sound change detection model (i.e., the model obtained after the n-th round of training) is taken as the final sound change detection model.
In other examples, setting the noise object number condition may further include: the number of noise objects in the previous training round is smaller than a preset number threshold, and other optional conditions are met.
Steps S420 to S440 are actually further optimization processes of the sound change detection model obtained in step S410, so as to implement clipping of the sound change detection model, eliminate negative effects of noise objects and parameters on the detection result, further simplify the model parameters, and improve the calculation efficiency and the accuracy of the detection result.
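The overall S420-S440 loop can be summarized by the structural sketch below. The five callables are hypothetical placeholders for the operations described above (computing per-object accuracy, counting objects, clipping parameters, removing data, retraining) and must be supplied by the caller:

```python
def iterative_clip_and_train(model, train_set, r_th, p_max,
                             per_object_accuracy, num_objects,
                             delete_object_params, remove_object_data, train):
    while True:
        acc = per_object_accuracy(model, train_set)       # S420
        noise = {r for r, a in acc.items() if a < r_th}   # S430: noise objects
        if len(noise) / num_objects(train_set) < p_max:   # P < P_max
            return model                                  # final model
        model = delete_object_params(model, noise)        # clip the network
        train_set = remove_object_data(train_set, noise)  # shrink the data
        model = train(model, train_set)                   # retrain (S440)
```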
In the training process of the sound change detection model provided by this embodiment, a multi-task training mode is adopted: based on both the class label of the voice generating object and the voice forgery label, the model is trained with the original training data set, which improves its robustness. Moreover, the trained model is subjected to network clipping and optimization training, which greatly reduces its scale, eliminates the negative influence of noise objects and parameters on the detection result, and further improves the detection efficiency and accuracy.
In an embodiment, the preset acoustic change detection model may adopt a Convolutional Neural Network (CNN) model, a Deep Neural Network (DNN) model, or a Long Short-Term Memory (LSTM) model. Taking the CNN model as an example of the acoustic change detection model, fig. 5 shows a structural schematic diagram of the CNN model provided by the embodiment of the present invention.
As shown in fig. 5, the CNN model includes an input layer (input layer), an output layer (output layer), a hidden layer (hidden layer), and a common layer (common layer), and the CNN model Label layer (Label) includes two parts, a class Label (Speaker Labels) of the speech generating object on the left side, and a Voice conversion Flag (true/false) Label on the right side. The CNN model further includes a convolutional layer (conv), an activation layer (relu), and a pooling layer (pooling), wherein the pooling layer (pooling) includes a maximum pooling layer (max-pool) and an average pooling layer (ave-pool).
In practical application, the acquired voice data to be detected, which is to be authenticated and matched with a target object, is input into the input layer of the trained CNN model; after the operations of the intermediate hidden layers such as the convolutional layer (conv), activation layer (relu) and pooling layer (pooling), the common layer outputs the feature vector matched with the voice data to be detected, and the output channel outputs the forged-or-not result. The forged-or-not result output by the output channel may be a judgment score for the "no" or "yes" label.
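A compact PyTorch sketch of this topology; the layer sizes and the single max-pool/ave-pool arrangement are illustrative assumptions, not the patent's exact configuration. Its forward pass returns the triple assumed in the multi-task training step sketched earlier.

```python
import torch.nn as nn

class VoiceChangeCNN(nn.Module):
    def __init__(self, num_speakers, hidden_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # ave-pool down to 64 x 1 x 1
        )
        self.common = nn.Linear(64, hidden_dim)                  # common layer
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)  # channel 1
        self.forged_head = nn.Linear(hidden_dim, 2)              # channel 2

    def forward(self, x):                      # x: (batch, 1, l, d) segments
        h = self.common(self.features(x).flatten(1))  # common-layer vector
        return h, self.speaker_head(h), self.forged_head(h)
```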
The embodiment of the invention also provides a sound change detection device, which is used for realizing the sound change detection method provided by the embodiment of the invention, and the content of the sound change detection device described below can be referred to in correspondence with the content of the sound change detection method described above.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a sound variation detecting apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the apparatus includes:
a to-be-tested data acquisition unit 100, configured to acquire to-be-tested voice data to be authenticated and matched with a target object;
a detection result determining unit 200, configured to determine, by using a preset acoustic change detection model, to-be-detected voiceprint feature information and a voice forgery determination result that are matched with the to-be-detected voice data;
a voiceprint similarity determining unit 300, configured to determine a similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object, so as to obtain a voiceprint similarity;
a falsification result determining unit 400, configured to determine whether the voice data to be detected is artificially falsified voice data according to the voice falsification determination result and the voiceprint similarity.
In an example, the sound change detection model is obtained by training with training voice data labeled with the class label of the voice generating object and with a label indicating whether the voice is forged.
In one example, the preset change detection model comprises two output channels, wherein the first output channel outputs a class label of a voice generation object, the second output channel outputs whether a voice is forged or not, and the last hidden layer of the change detection model serves as a common hidden layer of the two output channels;
the detection result determination unit includes:
the voice data input unit is used for inputting the voice data to be detected into a preset sound change detection model;
the voice print determination unit to be detected is used for acquiring a characteristic vector output by a public hidden layer of the acoustic change detection model and determining voice print characteristic information to be detected matched with the voice data to be detected according to the characteristic vector;
and the to-be-detected forgery determination unit is used for acquiring the forged-or-not result output by the second output channel of the sound change detection model and determining the voice forgery judgment result according to the forged-or-not result.
In another example, the data input unit under test includes:
the to-be-detected data segmentation unit is used for segmenting the to-be-detected voice data to obtain a plurality of to-be-detected voice fragments;
the voice detection device comprises a voice segment input unit to be detected, a voice detection unit and a voice detection unit, wherein the voice segment input unit to be detected is used for inputting each voice segment to be detected into a preset sound change detection model;
the to-be-tested voiceprint determination unit comprises:
the segment vector acquisition unit is used for acquiring a feature vector which is output by a public hidden layer of the acoustic change detection model and is matched with each voice segment to be detected;
the segment voiceprint determining unit is used for determining the voiceprint feature information to be detected matched with the voice data to be detected according to the feature vector matched with each voice segment to be detected;
the forgery determination unit to be measured includes:
the segment result acquiring unit is used for acquiring whether the voice corresponding to each voice segment to be detected output by the second output channel of the acoustic change detection model is forged or not;
and the fragment forgery determination unit is used for determining a voice forgery judgment result corresponding to the voice data to be detected according to whether the voice corresponding to each voice fragment to be detected is forged or not.
In another example, the data to be tested segmentation unit includes:
the characteristic transformation unit is used for carrying out Fourier transformation on the voice data to be tested to obtain transformed Fourier characteristics;
and the speech spectrum windowing unit is used for windowing the transformed Fourier features to obtain a plurality of speech spectrum fragments as the speech fragments to be detected.
In yet another example, the voice forgery determination result is a voice forgery determination score, and a lower voice forgery determination score indicates a higher possibility that the voice data is artificially forged; the voiceprint similarity is a voiceprint similarity score;
the forgery result determination unit includes:
the voice forgery judgment unit is used for judging whether the voice data to be detected is similar to the voice data to be detected or not;
and the voice-variant determining unit is used for determining whether the voice data to be detected is artificially forged voice-variant data or not according to the forged similarity score and the magnitude relation of a preset forged similarity threshold.
The sound change detection device provided by this embodiment determines, with a preset sound change detection model, the voiceprint feature information to be detected and the voice forgery judgment result matched with the voice data to be detected; determines the similarity between that voiceprint feature information and the registered voiceprint feature information of the target object; and then decides, from the voice forgery judgment result and the voiceprint similarity, whether the voice data to be detected is artificially forged voice-changed data. Compared with manually observing and qualitatively judging whether the voice data is artificially forged, this greatly improves the detection efficiency and the accuracy of the detection result.
In other embodiments, the sound change detection apparatus of the present invention further includes a model training unit including:
a training data acquisition unit, configured to acquire an original training data set, where the original training data set includes training speech data generated by a plurality of speech generating objects, and each training speech data is labeled with a category label of a speech generating object and whether a speech is forged or not;
and the acoustic change model training unit is used for training the acoustic change detection model by utilizing the original training data set to obtain the trained acoustic change detection model.
In an example, the model training unit further comprises:
the classification accuracy determining unit is used for determining the classification accuracy of the trained acoustic change detection model to each voice generation object corresponding to the training data used in the training process;
a noise object determination unit for determining a speech generation object whose classification accuracy is lower than a set classification accuracy threshold as a noise object in the previous round of training;
an iterative clipping training unit, which is used for, when the number of noise objects is determined not to satisfy the set noise-object-number condition, iterating the sound change detection model: deleting the network parameters and class labels related to the noise objects in the previous round of training to obtain a parameter-adjusted model, and continuing to train it with the training data remaining after the data corresponding to the noise objects is removed, until the classification accuracy of each speech generating object calculated with the model obtained after the n-th round of training shows that the number of noise objects below the set classification accuracy threshold satisfies the set condition, whereupon the model obtained after the n-th round of training is taken as the final sound change detection model.
The sound change detection device provided by this embodiment comprises a model training unit that, based on a multi-task training mode and on both the class label of the voice generating object and the voice forgery label, trains the sound change detection model with the original training data set, improving the model's robustness. It also performs network clipping and optimization training on the trained model, which greatly reduces the model's scale, eliminates the negative influence of noise objects and parameters on the detection result, and further improves the detection efficiency and the accuracy of the detection result.
Finally, it is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary hardware platform, and certainly can be implemented by hardware, but in many cases, the former is a better embodiment. With this understanding in mind, the technical solutions of the present application may be embodied in whole or in part in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific embodiments and the application range may be changed. In view of the above, the description should not be taken as limiting the application.

Claims (11)

1. A method of detecting a change in sound, comprising:
acquiring voice data to be tested to be authenticated and matched with a target object;
determining the voiceprint feature information to be detected matched with the voice data to be detected and a voice forgery judgment result by using a preset sound change detection model; wherein the sound change detection model is obtained by training with training voice data labeled with the class label of the voice generating object and with a label indicating whether the voice is forged;
determining the similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object to obtain the voiceprint similarity;
and determining whether the voice data to be detected is artificially forged voice-changing voice data or not according to the voice forgery judgment result and the voiceprint similarity.
2. The method according to claim 1, wherein the preset sound change detection model comprises two output channels, a first output channel outputting the class label of the voice generating object and a second output channel outputting whether the voice is forged, and the last hidden layer of the sound change detection model serves as a common hidden layer of the two output channels;
the determining, by using the preset sound change detection model, the voiceprint feature information to be detected that matches the voice data to be detected, and the voice forgery judgment result comprises the following steps:
inputting the voice data to be detected into the preset sound change detection model;
acquiring a feature vector output by the common hidden layer of the sound change detection model, and determining, according to the feature vector, the voiceprint feature information to be detected that matches the voice data to be detected;
and acquiring, from the second output channel of the sound change detection model, the output indicating whether the voice is forged, and determining the voice forgery judgment result according to that output.
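Continuing the earlier PyTorch sketch (whose names are illustrative assumptions), the inference path of claim 2 can be pictured as follows: the common-hidden-layer activations serve as the voiceprint feature, and the second output channel yields the forgery output.

```python
# Hedged sketch of claim 2's inference path, reusing the hypothetical
# SoundChangeDetectionModel sketched earlier. Treating softmax index 1 as
# the "forged" class is an assumed convention, not fixed by the claim.
import torch

def detect(model, segment_feats):
    model.eval()
    with torch.no_grad():
        _, forgery_logits, hidden = model(segment_feats)
        voiceprint = hidden  # feature vector from the common hidden layer
        forgery_score = torch.softmax(forgery_logits, dim=-1)[..., 1]
    return voiceprint, forgery_score
```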
3. The method according to claim 2, wherein the inputting the voice data to be detected into the preset sound change detection model comprises:
segmenting the voice data to be detected to obtain a plurality of voice segments to be detected;
and inputting each voice segment to be detected into the preset sound change detection model;
the acquiring the feature vector output by the common hidden layer of the sound change detection model and determining, according to the feature vector, the voiceprint feature information to be detected that matches the voice data to be detected comprises:
acquiring the feature vector, output by the common hidden layer of the sound change detection model, that matches each voice segment to be detected;
and determining, according to the feature vector matching each voice segment to be detected, the voiceprint feature information to be detected that matches the voice data to be detected;
the acquiring, from the second output channel of the sound change detection model, the output indicating whether the voice is forged, and determining the voice forgery judgment result according to that output comprises:
acquiring, from the second output channel of the sound change detection model, whether the voice corresponding to each voice segment to be detected is forged;
and determining, according to whether the voice corresponding to each voice segment to be detected is forged, the voice forgery judgment result corresponding to the voice data to be detected.
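Claim 3 leaves open how the per-segment outputs are combined; a minimal sketch follows, assuming mean pooling of the common-hidden-layer vectors into one voiceprint and averaging of the per-segment forgery scores, and reusing the hypothetical detect helper above.

```python
# Hedged sketch of claim 3's per-segment processing and aggregation. Mean
# pooling over frames and over segments is an assumption; the claim does not
# specify the aggregation rule.
import torch

def aggregate_segments(model, segments):
    voiceprints, scores = [], []
    for seg in segments:  # each seg: (frames, feat_dim) tensor
        vp, fs = detect(model, seg)         # per-segment outputs
        voiceprints.append(vp.mean(dim=0))  # pool over frames in the segment
        scores.append(fs.mean())
    voiceprint = torch.stack(voiceprints).mean(dim=0)  # utterance-level voiceprint
    forgery_judgment = torch.stack(scores).mean()      # utterance-level score
    return voiceprint, forgery_judgment
```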
4. The method according to claim 3, wherein the segmenting the voice data to be detected to obtain a plurality of voice segments to be detected comprises:
performing Fourier transform on the voice data to be detected to obtain transformed Fourier features;
and windowing the transformed Fourier features to obtain a plurality of spectrogram segments as the voice segments to be detected.
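A minimal sketch of claim 4's Fourier transform and windowing, assuming a magnitude spectrogram computed with a Hann window; the frame, hop, and segment sizes are illustrative assumptions.

```python
# Hedged sketch of claim 4: frame the waveform, take an FFT per frame to get a
# magnitude spectrogram, then slide a window along time to cut fixed-width
# spectrogram segments. All sizes are illustrative assumptions.
import numpy as np

def spectrogram_segments(wave, n_fft=512, hop=160, seg_frames=100, seg_hop=50):
    frames = [wave[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (num_frames, n_fft//2 + 1)
    return [spec[t:t + seg_frames]
            for t in range(0, spec.shape[0] - seg_frames + 1, seg_hop)]
```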
5. The method according to claim 1, wherein the voice forgery judgment result is a voice forgery judgment score, a lower voice forgery judgment score indicating a higher possibility that the voice data is artificially forged, and the voiceprint similarity is a voiceprint similarity score;
the determining, according to the voice forgery judgment result and the voiceprint similarity, whether the voice data to be detected is artificially forged voice-changing voice data comprises the following steps:
carrying out weighted fusion of the voice forgery judgment score and the voiceprint similarity score, and taking the result as a forgery similarity score of the voice data to be detected;
and determining, according to the magnitude relationship between the forgery similarity score and a preset forgery similarity threshold, whether the voice data to be detected is artificially forged voice-changing voice data.
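The fusion weight, the threshold, and the direction of the comparison are left open by claim 5; the following sketch assumes a convex weighting and, consistent with a lower forgery judgment score indicating likely forgery, flags voice data whose fused score falls below the threshold. All values are illustrative.

```python
# Hedged sketch of claim 5's weighted score fusion and threshold decision.
# The weight w, the threshold, and the "below threshold => forged" direction
# are assumptions, not fixed by the claim.
def is_forged(forgery_score: float, voiceprint_similarity: float,
              w: float = 0.5, threshold: float = 0.5) -> bool:
    fused = w * forgery_score + (1.0 - w) * voiceprint_similarity
    return fused < threshold
```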
6. The method according to claim 1, wherein the training process of the sound change detection model comprises:
acquiring an original training data set, wherein the original training data set comprises training voice data generated by a plurality of voice generating objects, each piece of training voice data being labeled with a class label of the voice generating object and a voice forgery label;
and training the sound change detection model by using the original training data set to obtain a trained sound change detection model.
7. The method according to claim 6, wherein the training process of the sound change detection model further comprises:
determining the classification accuracy of the trained sound change detection model for each voice generating object corresponding to the training data used in its training;
determining a voice generating object whose classification accuracy is lower than a set classification accuracy threshold as a noise object of the previous round of training;
and, when it is determined that the number of noise objects does not satisfy a set noise object number condition, clipping the sound change detection model by deleting the network parameters and class labels related to the noise objects of the previous round of training to obtain a parameter-adjusted sound change detection model, and continuing to train the parameter-adjusted sound change detection model using the training data that remains after the training data corresponding to the noise objects is removed from the previous round's training data; this is repeated until, when the classification accuracy of each voice generating object corresponding to the training data used in the Nth round of training is calculated with the sound change detection model obtained after the Nth round of training, the number of noise objects whose classification accuracy is lower than the set classification accuracy threshold satisfies the set noise object number condition, whereupon the sound change detection model obtained after the Nth round of training is taken as the final sound change detection model.
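The iterative clipping of claim 7 can be summarized as the following loop; train_round, per_class_accuracy, and remove_classes are hypothetical placeholder helpers, and the accuracy threshold and stopping condition values are illustrative.

```python
# Hedged sketch of claim 7's iterative clipping loop. The three helpers named
# below are hypothetical placeholders standing in for one training round,
# per-speaker accuracy measurement, and removal of output parameters/labels.
def iterative_clipping(model, data, acc_threshold=0.5, max_noise_objects=0):
    while True:
        model = train_round(model, data)              # one round of training
        accuracy = per_class_accuracy(model, data)    # {speaker_id: accuracy}
        noise = [spk for spk, acc in accuracy.items() if acc < acc_threshold]
        if len(noise) <= max_noise_objects:           # noise object number condition met
            return model                              # final sound change detection model
        model = remove_classes(model, noise)          # delete related parameters and labels
        data = [d for d in data if d.speaker not in noise]  # drop noise objects' data
```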
8. A sound change detection apparatus, characterized by comprising:
a to-be-detected data acquisition unit, configured to acquire voice data to be detected, which is to be authenticated as matching a target object;
a detection result determining unit, configured to determine, by using a preset sound change detection model, voiceprint feature information to be detected that matches the voice data to be detected, and a voice forgery judgment result; wherein the sound change detection model is obtained by training on training voice data labeled with both a class label of the voice generating object and a voice forgery label;
a voiceprint similarity determining unit, configured to determine similarity between the voiceprint feature information to be detected and the registered voiceprint feature information of the target object, so as to obtain voiceprint similarity;
and a forgery result determining unit, configured to determine, according to the voice forgery judgment result and the voiceprint similarity, whether the voice data to be detected is artificially forged voice-changing voice data.
9. The apparatus according to claim 8, wherein the preset sound change detection model comprises two output channels, a first output channel outputting the class label of the voice generating object and a second output channel outputting whether the voice is forged, and the last hidden layer of the sound change detection model serves as a common hidden layer of the two output channels;
the detection result determining unit comprises:
a voice data input unit, configured to input the voice data to be detected into the preset sound change detection model;
a to-be-detected voiceprint determining unit, configured to acquire a feature vector output by the common hidden layer of the sound change detection model and to determine, according to the feature vector, the voiceprint feature information to be detected that matches the voice data to be detected;
and a to-be-detected forgery determining unit, configured to acquire, from the second output channel of the sound change detection model, the output indicating whether the voice is forged, and to determine the voice forgery judgment result according to that output.
10. The apparatus according to claim 8, further comprising a model training unit, the model training unit comprising:
a training data acquisition unit, configured to acquire an original training data set, wherein the original training data set comprises training voice data generated by a plurality of voice generating objects, each piece of training voice data being labeled with a class label of the voice generating object and a voice forgery label;
and a sound change model training unit, configured to train the sound change detection model by using the original training data set to obtain a trained sound change detection model.
11. The apparatus according to claim 10, wherein the model training unit further comprises:
a classification accuracy determining unit, configured to determine the classification accuracy of the trained sound change detection model for each voice generating object corresponding to the training data used in its training;
a noise object determining unit, configured to determine a voice generating object whose classification accuracy is lower than a set classification accuracy threshold as a noise object of the previous round of training;
and an iterative clipping training unit, configured to: when it is determined that the number of noise objects does not satisfy a set noise object number condition, clip the sound change detection model by deleting the network parameters and class labels related to the noise objects of the previous round of training to obtain a parameter-adjusted sound change detection model, and continue to train the parameter-adjusted sound change detection model using the training data that remains after the training data corresponding to the noise objects is removed from the previous round's training data, repeating this until, when the classification accuracy of each voice generating object corresponding to the training data used in the Nth round of training is calculated with the sound change detection model obtained after the Nth round of training, the number of noise objects whose classification accuracy is lower than the set classification accuracy threshold satisfies the set noise object number condition, whereupon the sound change detection model obtained after the Nth round of training is taken as the final sound change detection model.
CN201711475093.3A 2017-12-29 2017-12-29 Sound change detection method and device Active CN108198574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711475093.3A CN108198574B (en) 2017-12-29 2017-12-29 Sound change detection method and device

Publications (2)

Publication Number Publication Date
CN108198574A CN108198574A (en) 2018-06-22
CN108198574B (en) 2020-12-08

Family

ID=62586173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711475093.3A Active CN108198574B (en) 2017-12-29 2017-12-29 Sound change detection method and device

Country Status (1)

Country Link
CN (1) CN108198574B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108881652B (en) * 2018-07-11 2021-02-26 北京大米科技有限公司 Echo detection method, storage medium and electronic device
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN109729070B (en) * 2018-11-28 2022-03-11 甘肃农业大学 Detection method of network heterogeneous concurrent steganography channel based on CNN and RNN fusion model
CN109769099B (en) 2019-01-15 2021-01-22 三星电子(中国)研发中心 Method and device for detecting abnormality of call person
CN110459242A (en) * 2019-08-21 2019-11-15 广州国音智能科技有限公司 Change of voice detection method, terminal and computer readable storage medium
CN110634475B (en) * 2019-09-17 2020-10-30 北京声智科技有限公司 Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
CN110797031A (en) * 2019-09-19 2020-02-14 厦门快商通科技股份有限公司 Voice change detection method, system, mobile terminal and storage medium
CN110728993A (en) * 2019-10-29 2020-01-24 维沃移动通信有限公司 Voice change identification method and electronic equipment
CN111816162B (en) * 2020-07-09 2022-08-23 腾讯科技(深圳)有限公司 Voice change information detection method, model training method and related device
CN111739547B (en) * 2020-07-24 2020-11-24 深圳市声扬科技有限公司 Voice matching method and device, computer equipment and storage medium
CN112735381B (en) * 2020-12-29 2022-09-27 四川虹微技术有限公司 Model updating method and device
CN113450806B (en) * 2021-05-18 2022-08-05 合肥讯飞数码科技有限公司 Training method of voice detection model, and related method, device and equipment
CN113409771B (en) * 2021-05-25 2022-08-12 合肥讯飞数码科技有限公司 Detection method for forged audio frequency, detection system and storage medium thereof
CN113257255B (en) * 2021-07-06 2021-09-21 北京远鉴信息技术有限公司 Method and device for identifying forged voice, electronic equipment and storage medium
CN115497481B (en) * 2022-11-17 2023-03-03 北京远鉴信息技术有限公司 False voice recognition method and device, electronic equipment and storage medium
CN117133295B (en) * 2023-10-24 2023-12-29 清华大学 Fake voice detection method, device and equipment based on brain-like perception and decision

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6799163B2 (en) * 2002-06-05 2004-09-28 Vas International, Inc. Biometric identification system
CN102402985A (en) * 2010-09-14 2012-04-04 盛乐信息技术(上海)有限公司 Voiceprint authentication system for improving voiceprint identification safety and method for realizing the same
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN106157959A (en) * 2015-03-31 2016-11-23 讯飞智元信息科技有限公司 Sound-groove model update method and system
CN106128465A (en) * 2016-06-23 2016-11-16 成都启英泰伦科技有限公司 A kind of Voiceprint Recognition System and method
CN106228980A (en) * 2016-07-21 2016-12-14 百度在线网络技术(北京)有限公司 Data processing method and device
CN106448685A (en) * 2016-10-09 2017-02-22 北京远鉴科技有限公司 System and method for identifying voice prints based on phoneme information
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN107481736A (en) * 2017-08-14 2017-12-15 广东工业大学 A kind of vocal print identification authentication system and its certification and optimization method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Attendance management system using hybrid face recognition techniques";Nazare Kanchan Jayant等;《2016 Conference on Advances in Signal Processing (CASP)》;20161130;全文 *

Also Published As

Publication number Publication date
CN108198574A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108198574B (en) Sound change detection method and device
CN107610707A (en) A kind of method for recognizing sound-groove and device
CN108231067A (en) Sound scenery recognition methods based on convolutional neural networks and random forest classification
Liu et al. Simultaneous utilization of spectral magnitude and phase information to extract supervectors for speaker verification anti-spoofing
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN110349597B (en) Voice detection method and device
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN107885723B (en) Conversation role distinguishing method and system
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN113284513B (en) Method and device for detecting false voice based on phoneme duration characteristics
CN111400540B (en) Singing voice detection method based on extrusion and excitation residual error network
CN113571067A (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN110070895A (en) A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition
JP5050698B2 (en) Voice processing apparatus and program
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN111128178A (en) Voice recognition method based on facial expression analysis
Iqbal et al. General-purpose audio tagging from noisy labels using convolutional neural networks.
Wu et al. The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
CN116705063A (en) Manifold measurement-based multi-model fusion voice fake identification method
JP4219539B2 (en) Acoustic classification device
CN113627327A (en) Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network
Gade et al. Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments
Wu et al. Audio-based expansion learning for aerial target recognition
CN111081261A (en) Text-independent voiceprint recognition method based on LDA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant