CN115512683B - Speech processing method, device, computer equipment and storage medium - Google Patents

Speech processing method, device, computer equipment and storage medium Download PDF

Info

Publication number
CN115512683B
CN115512683B (application CN202211150068.9A)
Authority
CN
China
Prior art keywords
voice
original
speech
target
countermeasure
Prior art date
Legal status
Active
Application number
CN202211150068.9A
Other languages
Chinese (zh)
Other versions
CN115512683A (en)
Inventor
刘巍巍
甘颖新
董晗
石丽雅
王欣
李梦仰
祁正伟
姜卫军
刘蔚
窦嵩玉
梁春晓
雷茵
李煜
辛艳
周敏
胡亚军
赵鹏
刘建中
Current Assignee
Pla 61623
Original Assignee
Pla 61623
Priority date
Filing date
Publication date
Application filed by Pla 61623 filed Critical Pla 61623
Priority to CN202211150068.9A priority Critical patent/CN115512683B/en
Publication of CN115512683A publication Critical patent/CN115512683A/en
Application granted granted Critical
Publication of CN115512683B publication Critical patent/CN115512683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L13/02 Methods for producing synthetic speech; speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G10L15/005 Language recognition
    • G10L2013/021 Overlap-add techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present application relates to a speech processing method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: extracting audio features from the original voice and the interference voice to obtain the audio features of each, wherein the interference voice is determined according to a preset voice frequency condition; performing phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain the countermeasure voice; performing power spectral density processing on the countermeasure voice to obtain the processed target countermeasure voice; and adjusting the volume of the original voice and the target countermeasure voice and performing track splicing on the two to obtain the countermeasure spoofing voice. By adopting the method, frequency variation of the anti-spoofing voice caused by environmental influence can be avoided, the limitation on transmission modes can be overcome, and the practicability of the anti-spoofing voice is improved.

Description

Speech processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technology, and in particular, to a voice processing method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of speech technology, speech information analysis within natural language processing is applied in more and more fields, and its accuracy keeps increasing. In daily application scenarios, in order to prevent a non-target voice recipient from acquiring the information exchanged during a call, the original voice to be transmitted needs to be encrypted, so that no one other than the target voice recipient can learn its content; such a voice encryption method is called a voice AI (Artificial Intelligence) anti-spoofing method.
The current voice AI anti-spoofing method generally adds audio at frequencies outside the range of human hearing to the original voice to generate anti-spoofing voice, and transmits the anti-spoofing voice over the communication link, so that the voice processor of a non-target voice recipient cannot accurately recognize the content of the voice communication and cannot extract effective information.
However, the anti-spoofing voice generated by the current voice AI anti-spoofing method undergoes frequency variation under environmental influence, and its transmission modes are limited, so its practicability is poor.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a speech processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can prevent the anti-spoofing voice from undergoing frequency variation under environmental influence, overcome the limitations on transmission modes, and improve the practicability of the anti-spoofing voice.
In a first aspect, the present application provides a method of speech processing. The method comprises the following steps:
Extracting audio features of original voice and interference voice to obtain the audio features of the original voice and the audio features of the interference voice; the interference voice is determined according to a preset voice frequency condition;
according to the audio characteristics of the original voice, carrying out phoneme stacking processing on the audio characteristics of the interference voice to obtain the countermeasure voice;
performing power spectrum density processing on the countermeasure voice to obtain processed target countermeasure voice;
and adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain countermeasure spoofing voice.
In one embodiment, before the extracting the audio features of the original voice and the interfering voice to obtain the audio features of the original voice and the audio features of the interfering voice, the method further includes:
acquiring an interference voice set, and carrying out voice frequency recognition on a plurality of candidate interference voices contained in the interference voice set to obtain voice frequency recognition results of the candidate interference voices;
Determining a target voice frequency recognition result meeting the voice frequency condition according to a preset voice frequency condition and each voice frequency recognition result, and taking a candidate interference voice corresponding to the target voice frequency recognition result as a target candidate interference voice;
and carrying out language recognition on the target candidate interference voice and the original voice, and determining the interference voice based on the languages of the target candidate interference voice and the original voice.
In one embodiment, the determining the interfering speech based on the target candidate interfering speech and the language of the original speech includes:
and comparing the languages of the target candidate interference voice and the original voice, and determining the target candidate interference voice as the interference voice if the languages of the target candidate interference voice are different from the languages of the original voice.
In one embodiment, the audio features of the original speech include a first phoneme feature and time information corresponding to the first phoneme feature, the audio features of the interfering speech include a second phoneme feature, and the performing phoneme stacking processing on the audio features of the interfering speech according to the audio features of the original speech to obtain the antagonistic speech includes:
Determining target second phoneme features matched with the first phoneme features in the second phoneme features contained in the audio features of the interference voice according to the first phoneme features;
constructing audio units according to the target second phoneme features matched with the first phoneme features and the time information corresponding to each first phoneme feature;
And generating countermeasure voices by the audio units based on the sequence of the time information corresponding to the first phoneme features contained in the audio units.
In one embodiment, the performing power spectral density processing on the countermeasure voice to obtain a processed target countermeasure voice includes:
Acquiring power spectral densities of the original voice and the countermeasure voice;
Determining a masking threshold according to the coincidence ratio of the power spectrum densities of the original voice and the countermeasure voice;
And processing the power spectrum density of the countermeasure voice according to the masking threshold and the power spectrum density of the original voice to obtain the target countermeasure voice meeting the preset power spectrum density condition.
In one embodiment, the adjusting the volume of the original voice and the target countermeasure voice, performing track splicing on the original voice and the target countermeasure voice to obtain the countermeasure spoofing voice, includes:
the original voice is subjected to volume adjustment according to a first target trend to obtain first original voice, and the target countermeasure voice is subjected to volume adjustment according to a second target trend to obtain first target countermeasure voice;
according to the first original voice and the first target countermeasure voice, performing track splicing to obtain initial countermeasure spoofed voice;
Determining a voice recognition accuracy and a spoofing rate according to the original voice and the initial spoofing resisting voice, and determining the initial spoofing resisting voice as the spoofing resisting voice when the voice recognition accuracy and the spoofing rate meet the preset voice spoofing condition.
In a second aspect, the application further provides a voice processing device. The device comprises:
the analysis module is used for extracting audio features of the original voice and the interference voice to obtain the audio features of the original voice and the audio features of the interference voice; the interference voice is determined according to a preset voice frequency condition;
the first processing module is used for carrying out phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain the countermeasure voice;
The second processing module is used for performing power spectrum density processing on the countermeasure voices to obtain processed target countermeasure voices;
and the third processing module is used for adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain the countermeasure spoofing voice.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Extracting audio features of original voice and interference voice to obtain the audio features of the original voice and the audio features of the interference voice; the interference voice is determined according to a preset voice frequency condition;
according to the audio characteristics of the original voice, carrying out phoneme stacking processing on the audio characteristics of the interference voice to obtain the countermeasure voice;
performing power spectrum density processing on the countermeasure voice to obtain processed target countermeasure voice;
and adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain countermeasure spoofing voice.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Extracting audio features of original voice and interference voice to obtain the audio features of the original voice and the audio features of the interference voice; the interference voice is determined according to a preset voice frequency condition;
according to the audio characteristics of the original voice, carrying out phoneme stacking processing on the audio characteristics of the interference voice to obtain the countermeasure voice;
performing power spectrum density processing on the countermeasure voice to obtain processed target countermeasure voice;
and adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain countermeasure spoofing voice.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
Extracting audio features of original voice and interference voice to obtain the audio features of the original voice and the audio features of the interference voice; the interference voice is determined according to a preset voice frequency condition;
according to the audio characteristics of the original voice, carrying out phoneme stacking processing on the audio characteristics of the interference voice to obtain the countermeasure voice;
performing power spectrum density processing on the countermeasure voice to obtain processed target countermeasure voice;
and adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain countermeasure spoofing voice.
In the voice processing method, apparatus, computer device, storage medium and computer program product, the voice processor extracts audio features from the original voice and the interference voice to obtain the audio features of each, the interference voice being determined according to a preset voice frequency condition; performs phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain the countermeasure voice; performs power spectral density processing on the countermeasure voice to obtain the processed target countermeasure voice; and adjusts the volume of the original voice and the target countermeasure voice and track-splices the two to obtain the countermeasure spoofing voice. By adopting this method, the original voice is processed through the above series of steps to obtain the anti-spoofing voice, so that the anti-spoofing voice is not affected by the environment, keeps a stable frequency, has good robustness and resistance to attacks, and can be transmitted over a channel, thereby improving the practicability of the anti-spoofing voice.
Drawings
FIG. 1 is a flow chart of a method of speech processing in one embodiment;
FIG. 2 is a flow chart of a method for determining interfering speech in one embodiment;
FIG. 3 is a flow chart of a method of processing a stacking of phonemes in one embodiment;
FIG. 4 is a flow chart of a power spectral density processing method in one embodiment;
FIG. 5 is a flow chart of a method of processing volume according to an embodiment;
FIG. 6 is a flow chart of a challenge sample optimization step in one embodiment;
FIG. 7 is a schematic diagram of an adaptive dynamic speech challenge sample generation flow in one embodiment;
FIG. 8 is a block diagram of a speech processing device in one embodiment;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a speech processing method is provided, and a speech processor to which the method is applied is taken as an example for explanation, and the method includes the following steps:
Step S102, extracting audio features of the original voice and the interference voice to obtain the audio features of the original voice and the audio features of the interference voice.
Wherein the interference voice is determined according to a preset voice frequency condition.
In implementation, when voice is transmitted between target voice communicants, in order to prevent the transmitted target voice from being intercepted by a non-target voice recipient and the information it carries from being obtained, the voice processor first encrypts the target voice before transmitting it. The voice processor applies the cocktail party effect in reverse and performs a series of processing on the target voice to be transmitted to obtain the encrypted target voice. The cocktail party effect refers to a person's capacity for selective hearing: attention is focused on one conversation while other conversations and background noise are ignored. Here, the target voice to be transmitted is the original voice, and the voice used to encrypt the original voice is the interference voice. The voice processor extracts audio features from the interference voice and the original voice based on a preset audio feature extraction algorithm to obtain the audio features of the original voice and of the interference voice respectively. Specifically, the voice processor splits the original voice and the interference voice to obtain the audio units contained in each; the audio units of the original voice serve as the audio features of the original voice, and the audio units of the interference voice serve as the audio features of the interference voice. Each audio unit contains a phoneme and the time information corresponding to that phoneme. Thus, the original voice x can be expressed by formula (1), and the interference voice δ' can be expressed by formula (2).
x = {(x_1, t_1), (x_2, t_2), ..., (x_n, t_n)}    (1)
δ' = {(δ'_1, t'_1), (δ'_2, t'_2), ..., (δ'_m, t'_m)}    (2)
wherein x_i is the i-th phoneme of the original voice x and t_i is the time at which that phoneme occurs (i = 1, ..., n, with x_n the n-th phoneme and t_n its time of occurrence); δ'_i is the i-th phoneme of the interference voice δ' and t'_i is the time at which that phoneme occurs (i = 1, ..., m, with δ'_m the m-th phoneme and t'_m its time of occurrence).
Optionally, the speech processor may have a sample testing stage prior to processing the target speech. This sample testing stage is to train the speech processor. The trained voice processor can encrypt the original voice according to the set program module. The target speech (i.e., the original speech) during this training phase may be randomly selected.
Alternatively, the preset audio feature extraction algorithm may be Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), or i-vectors (identity vectors); the specific audio feature extraction algorithm may be preconfigured according to actual requirements, which is not limited by the embodiments of the present application.
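As a minimal sketch of the phoneme/time audio units described by formulas (1) and (2), the representation can be modeled as follows (the `AudioUnit` type and `to_units` helper are hypothetical names for illustration, not from the patent):

```python
from dataclasses import dataclass

@dataclass
class AudioUnit:
    phoneme: str   # phoneme symbol of this unit
    time: float    # time at which the phoneme occurs, in seconds

def to_units(phonemes, times):
    """Pair each phoneme with its occurrence time, per formulas (1) and (2)."""
    if len(phonemes) != len(times):
        raise ValueError("phonemes and times must align one-to-one")
    return [AudioUnit(p, t) for p, t in zip(phonemes, times)]

# Original voice x and interference voice delta' as phoneme/time sequences
x = to_units(["n", "i", "h", "ao"], [0.00, 0.12, 0.25, 0.33])
delta = to_units(["b", "o", "n"], [0.00, 0.10, 0.21])
```

Any phonemizer output (phoneme labels plus onset times) can be packed into such units before the stacking step.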
Step S104, according to the audio characteristics of the original voice, carrying out phoneme stacking processing on the audio characteristics of the interference voice to obtain the countermeasure voice.
In an implementation, the speech processor performs a phoneme stacking process on the audio features of the interfering speech based on the audio features of the original speech to obtain the antagonistic speech. Specifically, the speech processor performs a phoneme-level speech processing on the audio units of the interfering speech based on phonemes and times corresponding to the phonemes contained in each of the audio units of the original speech, and generates a countermeasure speech based on the phonemes of the interfering speech.
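A hedged sketch of the phoneme stacking idea: for each (phoneme, time) unit of the original voice, pick a matching phoneme from the interference voice and place it at the original unit's time. The `phoneme_stack` function and the match predicate are illustrative assumptions; the patent does not specify the matching rule.

```python
def phoneme_stack(original_units, interference_phonemes, match):
    """original_units: list of (phoneme, time) pairs of the original voice.
    interference_phonemes: phonemes available in the interference voice.
    match: predicate deciding whether an interference phoneme matches
    an original phoneme. Returns the countermeasure voice as (phoneme, time)
    pairs ordered by the original time information."""
    adversarial = []
    for o_ph, o_t in original_units:
        # first interference phoneme the predicate accepts; None if no match
        cand = next((p for p in interference_phonemes if match(o_ph, p)), None)
        if cand is not None:
            adversarial.append((cand, o_t))
    return sorted(adversarial, key=lambda u: u[1])

units = [("i", 0.1), ("a", 0.0)]
result = phoneme_stack(units, ["a", "o", "i"], lambda o, p: o == p)
# result: [("a", 0.0), ("i", 0.1)]
```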
Step S106, performing power spectrum density processing on the countermeasure voices to obtain processed target countermeasure voices.
In implementation, to prevent the original voice and the countermeasure voice from frequency-masking each other, the voice processor adjusts the power spectral density of the countermeasure voice. Frequency masking is the psychoacoustic phenomenon whereby a louder signal makes other signals of nearby frequency difficult to detect. The voice processor calculates the power spectral densities of the countermeasure voice and the original voice using a short-time Fourier transform algorithm; the power spectral density characterizes the spectral distribution of the signal's energy. The voice processor sets a masking threshold based on the information reflected in the power spectral densities of the countermeasure voice and the original voice. Then, the voice processor adjusts the power spectral density of the countermeasure voice according to the relationship between the masking threshold and the power spectral density of the original voice to obtain the target countermeasure voice.
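A sketch of the power-spectral-density step under simplifying assumptions: the PSD is estimated by averaging Hann-windowed periodograms (a basic Welch-style estimate), and the masking threshold is reduced to a fixed decibel margin below the original voice's PSD. The patent derives the threshold from the coincidence of the two PSDs, so `apply_masking` is only an illustrative stand-in; both function names are hypothetical.

```python
import numpy as np

def power_spectral_density(signal, fs, nperseg=256):
    """Estimate PSD by averaging periodograms of Hann-windowed frames."""
    win = np.hanning(nperseg)
    hop = nperseg // 2
    frames = [signal[i:i + nperseg] * win
              for i in range(0, len(signal) - nperseg + 1, hop)]
    spectra = [np.abs(np.fft.rfft(f)) ** 2 for f in frames]
    psd = np.mean(spectra, axis=0) / (fs * np.sum(win ** 2))
    freqs = np.fft.rfftfreq(nperseg, 1.0 / fs)
    return freqs, psd

def apply_masking(psd_orig, psd_adv, margin_db=6.0):
    """Cap the countermeasure-voice PSD at margin_db below the original
    voice's PSD in every frequency bin (simplified masking rule)."""
    ceiling = psd_orig * 10 ** (-margin_db / 10)
    return np.minimum(psd_adv, ceiling)
```

In practice the capped PSD would be mapped back to a time-domain signal, e.g. by rescaling STFT magnitudes and inverting the transform.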
Step S108, the volume of the original voice and the target countermeasure voice is adjusted, and the original voice and the target countermeasure voice are subjected to track splicing, so that the countermeasure spoofing voice is obtained.
In implementation, the voice processor adjusts the volume of the original voice and of the target countermeasure voice respectively, and track-splices the volume-adjusted original voice and target countermeasure voice to obtain the initial anti-spoofing voice. The voice processor then measures the spoofing rate and the voice recognition rate of the initial anti-spoofing voice, and judges whether they meet the preset spoofing requirement; if they do, the voice processor determines the initial anti-spoofing voice as the anti-spoofing voice.
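The volume adjustment and track splicing above can be sketched as simple gain scaling plus sample-wise overlay of the two tracks. This assumes both tracks share a sample rate and that "track splicing" means overlaying them into one channel; the function names are hypothetical.

```python
import numpy as np

def adjust_volume(signal, gain_db):
    """Scale a waveform by a gain expressed in decibels."""
    return signal * 10 ** (gain_db / 20)

def mix_tracks(original, adversarial):
    """Overlay the two tracks sample-by-sample, zero-padding the shorter
    one, and normalize only if the overlay would clip."""
    n = max(len(original), len(adversarial))
    out = np.zeros(n)
    out[:len(original)] += original
    out[:len(adversarial)] += adversarial
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out
```

Usage: attenuate the countermeasure track, e.g. `mix_tracks(orig, adjust_volume(adv, -6.0))`, then feed the result to the spoofing-rate check.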
In the above voice processing method, the voice processor determines the interference voice through a preset voice frequency condition, then performs phoneme feature recognition, phoneme stacking processing and power spectral density processing on the interference voice to obtain the target countermeasure voice. The voice processor adjusts the volume of the original voice and the target countermeasure voice, and track-splices the volume-adjusted original voice and the volume-adjusted target countermeasure voice to obtain the initial anti-spoofing voice. The voice processor judges whether the initial anti-spoofing voice meets the preset voice spoofing condition; if it does, the anti-spoofing voice is obtained. By adopting this method, the original voice is processed through the above series of steps to obtain the anti-spoofing voice, so that the anti-spoofing voice is not affected by the environment, keeps a stable frequency, has good robustness and resistance to attacks, and can be transmitted over a channel, thereby improving the practicability of the anti-spoofing voice.
In one embodiment, as shown in fig. 2, before step 102, the speech processing method further includes:
step S202, an interference voice set is obtained, voice frequency recognition is carried out on a plurality of candidate interference voices contained in the interference voice set, and voice frequency recognition results of the candidate interference voices are obtained.
In implementation, an interference voice set is stored in the voice processor; the interference voice set is a set of voices available for encrypting the original voice. The voice processor acquires the interference voice set and performs frequency recognition on each candidate interference voice in the set to obtain the frequency recognition result of each candidate interference voice.
Step S204, determining a target voice frequency recognition result meeting the voice frequency condition according to the preset voice frequency condition and each voice frequency recognition result, and taking the candidate interference voice corresponding to the target voice frequency recognition result as the target candidate interference voice.
In implementation, in order to prevent the anti-spoofing voice from being filtered out by the channel during channel transmission, the interference voice used to generate the anti-spoofing voice needs to be screened. Specifically, the voice processor determines, according to the preset voice frequency condition and the voice frequency recognition result of each candidate interference voice, each target voice frequency recognition result in the interference voice set that meets the condition. The voice processor then takes the candidate interference voices corresponding to the target voice frequency recognition results as the target candidate interference voice set. The preset voice frequency condition may be the range of voice frequencies audible to the human ear, usually 20 Hz-20000 Hz.
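One simple way to realize the frequency screening above is to keep only candidates whose dominant frequency falls in the audible band. Using the spectral peak as the "voice frequency recognition result" is an assumption for illustration; the patent does not fix the recognition method, and the function names are hypothetical.

```python
import numpy as np

def dominant_frequency(signal, fs):
    """Frequency (Hz) of the largest magnitude bin in the signal's spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.fft.rfftfreq(len(signal), 1.0 / fs)[np.argmax(spectrum)]

def frequency_screen(candidates, fs, lo=20.0, hi=20000.0):
    """Keep candidate interference voices whose dominant frequency lies in
    the audible 20 Hz-20000 Hz band, so the channel does not filter them."""
    return [c for c in candidates if lo <= dominant_frequency(c, fs) <= hi]
```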
In step S206, language recognition is performed on the target candidate interfering speech and the original speech, and the interfering speech is determined based on the languages of the target candidate interfering speech and the original speech.
In implementation, the recognition unit in the voice processor performs language recognition on each target candidate interference voice and on the original voice to obtain their language recognition results. The voice processor then analyzes the language recognition result of the original voice against that of one target candidate interference voice in the target candidate interference voice set. If the language recognition results of the target candidate interference voice and the original voice meet the preset language recognition condition, the voice processor determines that target candidate interference voice as the interference voice δ'.
Optionally, if the language recognition results of the target candidate interfering voice and the original voice do not meet the preset language recognition condition, the voice processor analyzes the language recognition results of the other target candidate interfering voices and the language recognition results of the original voice until the language recognition results of the selected target candidate interfering voice and the original voice meet the preset language recognition condition. The speech processor determines the selected target candidate interfering speech as interfering speech delta'.
In this embodiment, the voice processor performs frequency screening on the interference voice set according to a preset voice frequency condition to obtain a target candidate interference voice set. Then, the voice processor performs language screening on the target candidate voice set based on the language of the original voice to obtain interference voice. The anti-spoofing voice generated based on the interfering voice can be transmitted through the channel, and the environmental impact to which the anti-spoofing voice is subjected is small, thereby improving the voice recognition rate of the anti-spoofing voice.
In one embodiment, the determining the interfering speech in step S206 based on the language of the target candidate interfering speech and the language of the original speech includes:
and comparing the languages of the target candidate interference voice and the original voice, and if the languages of the target candidate interference voice are different from those of the original voice, determining the target candidate interference voice as the interference voice.
In implementation, the voice processor screens the target candidate interference voice set by language using the native-language-priority principle to obtain the interference voice. Native-language priority refers to the psychological tendency of the brain to attend more readily to one's native language while naturally filtering out other languages. Specifically, the recognition module in the voice processor judges, from the language recognition result of one target candidate interference voice in the set and that of the original voice, whether the two languages are the same. If the languages of the original voice and the target candidate interference voice differ, the voice processor determines that target candidate interference voice as the interference voice δ'.
Optionally, if the language recognition results of the target candidate interfering speech and the original speech are the same, the speech processor selects another target candidate interfering speech and compares its language recognition result with that of the original speech, repeating until a target candidate interfering speech whose language recognition result differs from that of the original speech is found, and determines that selected target candidate interfering speech as the interfering speech δ′.
In this embodiment, the speech processor performs language screening on each target candidate interference speech based on the language of the original speech, so as to obtain the interference speech. The voice processor generates anti-deception voices based on the interference voices, and improves voice recognition rate of the anti-deception voices.
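The language-screening step described above can be sketched as follows. This is a minimal illustration only: it assumes each target candidate interfering speech carries a precomputed language recognition result, and the function name and dictionary fields are hypothetical, not taken from the embodiment.

```python
def select_interfering_speech(candidates, original_language):
    """Return the first target candidate interfering speech whose language
    recognition result differs from the original speech's language
    (native-language-priority screening)."""
    for candidate in candidates:
        if candidate["language"] != original_language:
            return candidate  # this becomes the interfering speech delta'
    return None  # no candidate in a different language was found

candidates = [
    {"name": "candidate_1", "language": "zh"},  # same language: skipped
    {"name": "candidate_2", "language": "en"},  # different language: chosen
]
chosen = select_interfering_speech(candidates, "zh")
```

In line with the optional step above, the loop simply moves on to the next candidate whenever the languages match.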
In one embodiment, the audio features of the original speech include a first phoneme feature and time information corresponding to the first phoneme feature, and the audio features of the interfering speech include a second phoneme feature, as shown in fig. 3, and the specific implementation process of step S104 includes:
Step S302, determining target second phoneme characteristics matched with the first phoneme characteristics from the second phoneme characteristics contained in the audio characteristics of the interference voice according to the first phoneme characteristics.
In an implementation, the speech processor traverses the phonemes of all audio units of the interfering speech based on the phoneme of the first audio unit of the original speech, and determines the phoneme among them that matches the first audio unit of the original speech as the target phoneme. After completing the phoneme matching for the first audio unit of the original speech, the speech processor continues to match the phoneme of the next audio unit after the first, until phoneme matching is completed for all audio units of the original speech.
Specifically, the matching process of the audio unit of the original voice and the audio unit of the interfering voice is as follows:
The speech processor determines, for each audio unit in the original speech, whether the phoneme and time in that audio unit are equal to the phoneme and time in an audio unit of the interfering speech. If both are equal, the distance between the original-speech audio unit and the interfering-speech audio unit is 0, and the two audio units are perfectly matched; the speech processor takes the phoneme of the interfering-speech audio unit as the target phoneme. If the original-speech audio unit and an interfering-speech audio unit are not exactly equal in phoneme and time, three cases arise, specifically:
If the original-speech audio unit and the interfering-speech audio unit are equal in phoneme but not in time, the distance between them is 1, and the time of the interfering-speech audio unit needs to be adjusted. The speech processor adjusts the time of the interfering-speech audio unit according to the time of the original-speech audio unit so that the two times become equal. After this adjustment, the speech processor determines that the distance between the original-speech audio unit and the interfering-speech audio unit is 0; the two audio units, now equal in both phoneme and time, are perfectly matched, and the speech processor takes the phoneme of the interfering-speech audio unit as the target phoneme. If the original-speech audio unit and an interfering-speech audio unit are unequal in phoneme, whether or not their times are equal, the speech processor does not adjust the time of the interfering-speech audio unit; it determines that the distance between the two audio units is 1, and that interfering-speech audio unit does not match the original-speech audio unit. The speech processor then continues phoneme matching for the next original-speech audio unit until phoneme matching is completed for all audio units of the original speech.
Each audio unit is matched between the original speech and the interfering speech based on the distance between the original speech and the interfering speech. Specifically, the distance between the original speech and the interfering speech is the sum of the distances between the audio units of the original speech and the corresponding audio units of the interfering speech; when this distance is minimal, the phonemes of the audio units of the original speech are matched to the greatest extent with the phonemes of the interfering speech.
The distance D(x, δ′) between the original speech x and the interfering speech δ′ is defined in the following equation (3), and the distance d(x_i, δ_j′) between an audio unit of the original speech x and an audio unit of the interfering speech δ′ is defined in the following equation (4):

D(x, δ′) = ‖d(x_i, δ_j′)‖_1  (0 < i ≤ n, i ≤ j)  (3)

d(x_i, δ_j′) = 0 if x_i and δ_j′ are equal in both phoneme and time, and d(x_i, δ_j′) = 1 otherwise  (4)
The meaning of each parameter in the formula (3) and the formula (4) is described in detail in step S102 of the embodiment of the present application, so the disclosure is not repeated here.
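The unit distance and the total L1 distance described above can be illustrated with a minimal sketch. It assumes each audio unit is represented as a (phoneme, time) pair and that units are compared pairwise in order; the function names are hypothetical.

```python
def unit_distance(x_unit, delta_unit):
    """d(x_i, delta_j'): 0 when phoneme and time are both equal, 1 otherwise."""
    phoneme_x, time_x = x_unit
    phoneme_d, time_d = delta_unit
    return 0 if (phoneme_x == phoneme_d and time_x == time_d) else 1

def total_distance(x_units, delta_units):
    """D(x, delta'): the L1 norm, i.e. the sum of the per-unit distances."""
    return sum(unit_distance(x, d) for x, d in zip(x_units, delta_units))

# The second pair has equal phonemes but different times, contributing 1.
d = total_distance([("a", 1), ("b", 2)], [("a", 1), ("b", 3)])
```

A unit pair equal in phoneme but not time contributes 1 until the time of the interfering-speech unit is adjusted, after which its contribution drops to 0.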
Step S304, constructing and obtaining an audio unit according to the time information corresponding to each first phoneme feature and the target second phoneme feature matched with the first phoneme feature.
In practice, the speech processor dynamically adjusts the times of the audio units of the interfering speech according to the time sequence corresponding to the phonemes of the original-speech audio units, forming the audio units of the countermeasure speech, so as to minimize the distance between the original speech and the interfering speech. Specifically, the speech processor adjusts the time of each audio unit of the interfering speech according to the time of the corresponding audio unit of the original speech, so that the two times are equal. That is, with the phonemes fixed, the speech processor dynamically adjusts the times of the audio units of the interfering speech to construct the audio units of the countermeasure speech such that D(x, δ″) = min D(x, δ′), where min D(x, δ′) is the minimum distance between the original speech and the interfering speech; in other words, the speech processor dynamically adjusts the times of the audio units of the interfering speech so that the distance between the original speech and the interfering speech is minimized.
Optionally, the speech processor assigns a time sequence in the interfering speech audio unit based on the time sequence of the original speech audio unit to form an audio unit against speech. The method for dynamically adjusting the time of the audio unit in the interfering voice according to each specific embodiment may be preconfigured according to the actual requirement, and the embodiment of the present application is not limited herein.
In step S306, the countermeasure voices are generated for the audio units based on the sequence of the time information corresponding to the first phoneme feature included in the audio units.
In an implementation, the speech processor sequentially sorts the audio units of the countermeasure speech according to a time sequence of the audio units of the countermeasure speech, to obtain a sequence of audio units of the countermeasure speech. The speech processor then generates a challenge speech delta "from the sequence of challenge speech audio units as shown in equation (5) below:
δ″ = {δ″_1, δ″_2, …, δ″_m, …}, with δ″_m = (p″_m, t″_m)  (5)

where p″_m is the m-th phoneme of the countermeasure speech δ″, t″_m is the time at which the m-th phoneme of the countermeasure speech δ″ occurs, and δ″_m is the m-th audio unit of the countermeasure speech δ″.
In this embodiment, the speech processor adjusts the audio features of the interfering speech according to the phonemes and times of the audio units in the audio features of the original speech, so that the distance between the original speech and the interfering speech is minimized, that is, so that the phonemes of each audio unit of the original speech are matched to the greatest extent with the phonemes of the interfering speech. The speech processor takes the adjusted interfering speech as the countermeasure speech. The anti-spoofing speech generated based on this countermeasure speech achieves improved speech recognition accuracy without reducing its spoofing rate.
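Steps S302 to S306 can be sketched as follows: for each (phoneme, time) unit of the original speech, a unit with a matching phoneme is sought in the interfering speech and re-assigned the original unit's time, and the resulting units are ordered by time to form the countermeasure speech. This is a simplified illustration; the helper name and the (phoneme, time) unit representation are assumptions.

```python
def build_countermeasure_units(original_units, interfering_units):
    """Match each original (phoneme, time) unit to an interfering unit with
    the same phoneme, re-assign the original unit's time (driving the unit
    distance to 0), then order the resulting units by time."""
    matched = []
    for phoneme, time in original_units:
        for other_phoneme, _other_time in interfering_units:
            if other_phoneme == phoneme:               # target second phoneme feature
                matched.append((other_phoneme, time))  # time adjusted to the original's
                break
    # step S306: generate the countermeasure speech in time order of the units
    return sorted(matched, key=lambda unit: unit[1])

units = build_countermeasure_units(
    [("a", 2), ("b", 1)],   # original speech audio units
    [("b", 9), ("a", 7)],   # interfering speech audio units
)
```

Because every matched unit inherits the original unit's time, the per-unit distances are all 0 and the total distance is at its minimum.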
In one embodiment, as shown in fig. 4, the step 106 specifically includes:
step S402, obtaining the power spectral densities of the original speech and the countermeasure speech.
In practice, the speech processor performs a short-time Fourier transform on the original speech x to obtain the spectrum of the original speech. Then, the speech processor applies a Hanning window to the spectrum of the original speech to obtain the processed spectrum of the original speech. Specifically, the speech processor sets the window length to 2048 and the frame shift to 512. Letting s_x(k) denote the k-th frame spectrum of the processed original speech, the speech processor calculates the log power spectral density of the k-th frame of the original speech from s_x(k) according to the following equation (6):

p_x(k) = 10 log_10 |s_x(k)|^2  (6)
Following the same principle as for the original speech, the speech processor calculates the log power spectral density of the countermeasure speech δ″ from the k-th frame spectrum s_δ″(k) of the processed countermeasure speech, as shown in the following equation (7); after the calculation is completed, the speech processor has obtained the log power spectral densities of both the original speech and the countermeasure speech:

p_δ″(k) = 10 log_10 |s_δ″(k)|^2  (7)
Optionally, to obtain the spectrum of the original speech, the speech processor may use not only the short-time Fourier transform but also the ordinary Fourier transform, the wavelet transform, and the like; the specific spectrum-acquisition method may be preconfigured according to actual requirements, and the embodiment of the present application is not limited herein.
Optionally, when performing windowing, the speech processor may use not only the Hanning window but also a Hamming window, a rectangular window, and the like; the specific windowing method may be preconfigured according to actual requirements, and the embodiment of the present application is not limited herein.
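Step S402 can be sketched with NumPy, using the embodiment's window length of 2048 and frame shift of 512 and assuming the conventional per-frame log power spectral density p(k) = 10·log10 |s(k)|²; the function name and the added numerical floor are illustrative assumptions.

```python
import numpy as np

def log_psd(signal, window_len=2048, hop=512):
    """Per-frame log power spectral density via a Hanning-windowed STFT:
    one row per frame k, one column per frequency bin."""
    window = np.hanning(window_len)
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        frame = signal[start:start + window_len] * window
        spectrum = np.fft.rfft(frame)                   # s(k): k-th frame spectrum
        power = np.abs(spectrum) ** 2
        frames.append(10.0 * np.log10(power + 1e-12))   # floor avoids log(0)
    return np.array(frames)

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)  # a 440 Hz test tone
psd = log_psd(x)
```

A 4096-sample signal yields 5 frames of 1025 bins each with these settings; the same routine would be applied to the countermeasure speech.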
Step S404, determining a masking threshold according to the coincidence degree of the power spectrum densities of the original voice and the countermeasure voice.
In practice, in order to prevent the countermeasure speech and the original speech from masking each other, which would leave the transmission information in the original speech unencrypted or cause it to be lost, the speech processor needs to process the power spectral density of the countermeasure speech. Specifically, the speech processor determines the image overlap ratio from the images of the log power spectral densities of the original speech and the countermeasure speech. The image overlap ratio is determined by the intersection points of the two log power spectral density curves: the more intersection points, the higher the overlap ratio; the fewer intersection points, the lower the overlap ratio. These images characterize the spectral density information of the energy signals of the original speech and the countermeasure speech. The speech processor then determines the masking threshold based on the image overlap ratio. Specifically, at the lowest overlap ratio (i.e., where the log power spectral density images of the original speech and the countermeasure speech have the fewest intersection points), the speech processor takes the intermediate value between the two curves as the masking threshold θ and draws the image of the masking threshold, which is a straight line with a varying abscissa and a constant ordinate, the abscissa indicating the time at which the speech occurs and the ordinate indicating the log power spectral density of the speech. Further, the image of the masking threshold so determined has the fewest intersections with the log power spectral density images of the original speech and the countermeasure speech.
Step S406, processing the power spectrum density of the countermeasure voice according to the masking threshold and the power spectrum density of the original voice to obtain the target countermeasure voice meeting the preset power spectrum density condition.
In implementations, the speech processor determines whether the power spectral density of the original speech exceeds the masking threshold. When the power spectral density of the original speech is greater than or equal to the masking threshold, the speech processor adjusts the power spectral density of the countermeasure speech so that it approaches 0; when the power spectral density of the original speech is smaller than the masking threshold, the speech processor keeps the power spectral density of the countermeasure speech unchanged, thereby obtaining the target countermeasure speech. That is, with the masking threshold set to θ, the speech processor adjusts p_δ″(k) so that p_δ″(k) → 0 when p_x(k) ≥ θ, and keeps p_δ″(k) unchanged when p_x(k) < θ, thereby obtaining the target countermeasure speech δ″′.
In this embodiment, the speech processor calculates the power spectral densities of the original speech and the counter speech, and then the speech processor sets the masking threshold according to the coincidence of the power spectral densities of the original speech and the counter speech. Then, the voice processor adjusts the power spectrum density of the countermeasure voice according to the relation between the masking threshold and the power spectrum density of the original voice, and the target countermeasure voice is obtained. According to the target anti-deception voice generated by the target anti-deception voice, the voice recognition accuracy of the anti-deception voice can be improved, the influence of the environment on the anti-deception voice is reduced, and the phenomenon that transmission information in the original voice is not encrypted or the transmission information is missing is avoided.
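The masking-threshold processing of step S406 can be sketched as a minimal per-frame illustration; the function name is an assumption. Where the original speech's PSD reaches the threshold θ, the countermeasure speech's PSD is pushed to 0, and elsewhere it is left unchanged.

```python
def apply_masking_threshold(psd_original, psd_counter, theta):
    """Per-frame adjustment: when p_x(k) >= theta, push the countermeasure
    PSD p_delta''(k) toward 0; when p_x(k) < theta, keep it unchanged."""
    adjusted = []
    for p_x, p_c in zip(psd_original, psd_counter):
        adjusted.append(0.0 if p_x >= theta else p_c)
    return adjusted

# First frame: original PSD (5.0) meets theta, so the countermeasure PSD drops.
target = apply_masking_threshold([5.0, 1.0], [3.0, 3.0], theta=4.0)
```
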
In one embodiment, as shown in FIG. 5, step 108 includes:
step S502, volume adjustment is performed on the original voice according to a first target trend to obtain a first original voice, and volume adjustment is performed on the target countermeasure voice according to a second target trend to obtain a first target countermeasure voice.
In an implementation, the voice processor adjusts according to a first target trend with the volume of the original voice as a starting point, so as to obtain a plurality of first original voices with decreasing volume including the original voice. Wherein the first target trend is a decreasing trend. And simultaneously, the voice processor adjusts the voice according to the second target trend by taking the volume of the target countermeasure voices as a starting point to obtain a plurality of first target countermeasure voices with increasing volume including the target countermeasure voices. Wherein the second target trend is an increasing trend.
Step S504, according to the first original voice and the first target countermeasure voice, performing track splicing to obtain an initial countermeasure spoofed voice.
In the implementation, the voice processor performs track splicing on each first original voice and each first target countermeasure voice to obtain initial countermeasure spoofing voice.
Step S506, determining the voice recognition accuracy and the deception rate according to the original voice and the initial deception voice, and determining the initial deception voice as the deception voice when the voice recognition accuracy and the deception rate meet the preset voice deception conditions.
In the implementation, the original speech and the initial anti-spoofing speech are input into the recognition unit of the speech processor, which recognizes them to obtain recognition results for both. The recognition unit then passes the recognition results of the original speech and the initial anti-spoofing speech to the discrimination unit of the speech processor, which discriminates them to obtain the spoofing rate of the initial anti-spoofing speech. A manual recognition result of the initial anti-spoofing speech is also acquired and input into the speech processor; the manual recognition result is obtained by the target speech receiver listening directly with the human ear. The discrimination unit in the speech processor discriminates the manual recognition result to obtain the speech recognition accuracy of the initial anti-spoofing speech. As shown in fig. 6, the speech processor repeats steps S502 to S504 to obtain the speech recognition accuracy and spoofing rate of each first original speech and each first target countermeasure speech at different signal-to-noise ratios (the signal-to-noise ratio being the energy ratio of the first original speech to the first target countermeasure speech). The signal-to-noise ratio of the first original speech and the first target countermeasure speech can be expressed as SNR = 10 lg(P_x / P_δ″′), where P_x and P_δ″′ are the energies of the first original speech and the first target countermeasure speech, respectively.
The voice processor determines the initial anti-deception voice as the anti-deception voice when the voice recognition accuracy and the deception rate are judged to be optimal, i.e., when the voice recognition accuracy and the deception rate are balanced and both the value of the voice recognition accuracy and the value of the deception rate are the highest.
In this embodiment, the voice processor adjusts the volume of the original voice and the target countermeasure voice. The voice processor performs track splicing on the original voice with the adjusted volume and the target countermeasure voice with the adjusted volume to obtain initial countermeasure spoofed voice. Then, the voice processor determines the voice recognition accuracy and the deception rate of the initial deception voice according to the original voice with the adjusted volume and the target deception voice with the adjusted volume. The voice processor judges whether the initial anti-deception voice meets the preset voice deception condition according to the voice recognition accuracy and deception rate of the initial anti-deception voice. When the initial anti-deception speech meets the preset speech deception condition, the speech processor determines the initial anti-deception speech as the anti-deception speech, so that the speech recognition accuracy and deception rate of the anti-deception speech can be improved, and the influence of the environment on the frequency of the anti-deception speech can be reduced.
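The signal-to-noise ratio and the opposing volume trends used in steps S502 to S506 can be illustrated as follows. This is a sketch: the SNR is the conventional 10·log10 of the energy ratio, and the geometric volume ladder is only an assumed example of a "decreasing/increasing trend", not the patent's exact schedule.

```python
import math

def snr_db(p_x, p_adv):
    """SNR in dB as 10*log10(P_x / P_adv), the energy ratio of the first
    original speech to the first target countermeasure speech."""
    return 10.0 * math.log10(p_x / p_adv)

def volume_ladder(start, steps, factor):
    """Assumed volume schedule starting from the current volume: decreasing
    for the original speech (factor < 1), increasing for the target
    countermeasure speech (factor > 1)."""
    return [start * factor ** i for i in range(steps)]

snr = snr_db(100.0, 1.0)                  # energies differing by a factor of 100
originals = volume_ladder(1.0, 3, 0.5)    # first original speeches, decreasing
adversaries = volume_ladder(1.0, 3, 2.0)  # first target countermeasure speeches, increasing
```

Each (original, countermeasure) volume pair yields a different SNR, and the pair whose spliced track best balances recognition accuracy and spoofing rate is kept.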
In this embodiment, an example of a voice processing method is provided, as shown in fig. 7, where the method includes:
Step S701, performing language identification on the target candidate interference voice and the original voice, and determining the interference voice based on the languages of the target candidate interference voice and the original voice; the target candidate interference voice is determined according to a preset voice frequency condition.
Step S702, extracting audio features of the original voice and the interference voice to obtain the audio features of the original voice and the audio features of the interference voice;
Step S703, performing phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain the countermeasure voice;
step S704, calculating power spectrum densities of the original voice and the countermeasure voice;
Step S705, determining a masking threshold according to the coincidence degree of the power spectrum densities of the original voice and the countermeasure voice; processing the power spectrum density of the countermeasure voice according to the masking threshold and the power spectrum density of the original voice to obtain a target countermeasure voice meeting the preset power spectrum density condition;
Step S706, the volume of the original voice and the target countermeasure voice is adjusted, and the original voice and the target countermeasure voice are subjected to track splicing, so that the countermeasure spoofing voice is obtained.
Specifically, the speech processor stores an interfering speech set ("spoofed content" in fig. 7). The speech processor screens the interfering speech set according to a preset speech frequency condition to obtain the interfering speech δ′ ("language confusion voice" in fig. 7). The screening process in the speech processor includes language recognition of the original speech and the target candidate interfering speech (in order to distinguish the language of the interfering speech from that of the original speech) and determination of the target candidate interfering speech (frequency screening of the interfering speech set to obtain the target candidate interfering speech). The speech processor performs phoneme recognition on the original speech and the interfering speech respectively to obtain the audio features of the original speech and the audio features of the interfering speech. Based on the audio features of the original speech, the speech processor performs phoneme stacking processing on the interfering speech to obtain the countermeasure speech δ″. Then, the speech processor performs power spectral density calculation on the original speech and the countermeasure speech to obtain their respective power spectral density calculation results; this is the "PSD (Power Spectral Density) calculation" in fig. 7. The speech processor performs frequency optimization according to the power spectral density calculation results of the original speech and the countermeasure speech to obtain the target countermeasure speech δ″′.
The original speech and the target countermeasure speech are input to a speech processor, which performs countermeasure sample optimization processing on the original speech and the target countermeasure speech to obtain countermeasure spoofed speech ("directional countermeasure audio" in fig. 7).
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited in the present application, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a voice processing device for realizing the above related voice processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the speech processing device provided below may refer to the limitation of the speech processing method described above, and will not be repeated here.
In one embodiment, as shown in FIG. 8, there is provided a speech processing apparatus 800 comprising: an analysis module 801, a first processing module 802, a second processing module 803, and a third processing module 804, wherein:
The analysis module 801 is configured to perform audio feature extraction on the original speech and the interfering speech to obtain an audio feature of the original speech and an audio feature of the interfering speech; the interfering speech is determined according to a predetermined speech frequency condition.
The first processing module 802 is configured to perform phoneme stacking processing on the audio features of the interfering speech according to the audio features of the original speech, so as to obtain the antagonistic speech.
The second processing module 803 is configured to perform power spectral density processing on the countermeasure voice, so as to obtain a processed target countermeasure voice.
The third processing module 804 adjusts the volume of the original voice and the target countermeasure voice, and performs track splicing on the original voice and the target countermeasure voice to obtain the countermeasure spoofed voice.
In an exemplary embodiment, before the analysis module performs the operation, the determination module includes:
The recognition sub-module is used for acquiring the interference voice set, and carrying out voice frequency recognition on a plurality of candidate interference voices contained in the interference voice set to obtain voice frequency recognition results of the candidate interference voices.
The first determining sub-module is used for determining a target voice frequency recognition result meeting the voice frequency condition according to a preset voice frequency condition and each voice frequency recognition result, and taking candidate interference voices corresponding to the target voice frequency recognition result as target candidate interference voices.
And the second determining module is used for carrying out language identification on the target candidate interference voice and the original voice and determining the interference voice based on the languages of the target candidate interference voice and the original voice.
In an exemplary embodiment, the second determining module includes:
and the third determination submodule is used for comparing the languages of the target candidate interference voice and the original voice, and determining the target candidate interference voice as the interference voice if the languages of the target candidate interference voice are different from the languages of the original voice.
In an exemplary embodiment, the first processing module 802 includes:
And the fourth determining submodule is used for determining target second phoneme characteristics matched with each first phoneme characteristic in each second phoneme characteristic contained in the audio characteristics of the interference voice according to each first phoneme characteristic.
And the construction sub-module is used for constructing the audio unit according to the time information corresponding to each first phoneme feature and the target second phoneme feature matched with that first phoneme feature.
And the generation sub-module is used for generating the countermeasure voice for each audio unit based on the sequence of the time information corresponding to the first phoneme characteristic contained in each audio unit.
In an exemplary embodiment, the second processing module 803 includes:
And the acquisition sub-module is used for acquiring the power spectrum densities of the original voice and the countermeasure voice.
And a fifth determination submodule, configured to determine a masking threshold according to the overlap ratio of the power spectral densities of the original speech and the countermeasure speech.
The first processing sub-module is used for processing the power spectrum density of the countermeasure voice according to the masking threshold and the power spectrum density of the original voice to obtain the target countermeasure voice meeting the preset power spectrum density condition.
In an exemplary embodiment, the third processing module 804 includes:
And the adjusting sub-module is used for adjusting the volume of the original voice according to the first target trend to obtain a first original voice, and adjusting the volume of the target countermeasure voice according to the second target trend to obtain a first target countermeasure voice.
And the second processing sub-module is used for performing track splicing according to the first original voice and the first target countermeasure voice to obtain initial countermeasure spoofing voice.
And a sixth determining sub-module for determining a speech recognition accuracy and a fraud rate according to the original speech and the initial fraud prevention speech, and determining the initial fraud prevention speech as the fraud prevention speech when the speech recognition accuracy and the fraud rate satisfy a preset speech fraud condition.
The various modules in the speech processing device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech processing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the flows of the methods described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored on a non-transitory computer-readable storage medium and, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A method of speech processing, the method comprising:
acquiring an interference voice set, and performing voice frequency recognition on a plurality of candidate interference voices contained in the interference voice set to obtain a voice frequency recognition result of each candidate interference voice;
determining, according to a preset voice frequency condition and each voice frequency recognition result, a target voice frequency recognition result meeting the voice frequency condition, and taking the candidate interference voice corresponding to the target voice frequency recognition result as a target candidate interference voice;
performing language identification on the target candidate interference voice and the original voice, and determining the interference voice based on the languages of the target candidate interference voice and the original voice;
extracting audio features of the original voice and the interference voice to obtain the audio features of the original voice and the audio features of the interference voice, wherein the interference voice is determined according to the preset voice frequency condition;
performing phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain a countermeasure voice;
performing power spectral density processing on the countermeasure voice to obtain a processed target countermeasure voice; and
adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain a countermeasure spoofing voice;
wherein the determining the interference voice based on the languages of the target candidate interference voice and the original voice comprises:
comparing the language of the target candidate interference voice with the language of the original voice, and determining the target candidate interference voice as the interference voice if the two languages are different; and
the performing phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain the countermeasure voice comprises:
performing phoneme-level voice processing on the audio units of the interference voice based on the phonemes contained in each audio unit of the original voice and the times corresponding to the phonemes, and generating the countermeasure voice based on the phonemes of the interference voice.
2. The method according to claim 1, wherein the audio features of the original voice include first phoneme features and time information corresponding to the first phoneme features, the audio features of the interference voice include second phoneme features, and the performing phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain the countermeasure voice comprises:
determining, from the second phoneme features contained in the audio features of the interference voice, a target second phoneme feature matched with each first phoneme feature;
constructing audio units according to the time information corresponding to each first phoneme feature and the target second phoneme feature matched with that first phoneme feature; and
generating the countermeasure voice from the audio units based on the order of the time information corresponding to the first phoneme features contained in the audio units.
3. The method of claim 1, wherein the performing power spectral density processing on the countermeasure voice to obtain the processed target countermeasure voice comprises:
acquiring the power spectral densities of the original voice and the countermeasure voice;
determining a masking threshold according to the overlap ratio of the power spectral densities of the original voice and the countermeasure voice; and
processing the power spectral density of the countermeasure voice according to the masking threshold and the power spectral density of the original voice to obtain a target countermeasure voice meeting a preset power spectral density condition.
4. The method according to claim 3, wherein the overlap ratio is determined by the intersection points of the power spectral densities of the original voice and the countermeasure voice; the more intersection points, the higher the overlap ratio.
5. The method of claim 1, wherein the adjusting the volume of the original voice and the target countermeasure voice, and performing track splicing on the original voice and the target countermeasure voice to obtain the countermeasure spoofing voice, comprises:
adjusting the volume of the original voice according to a first target trend to obtain a first original voice, and adjusting the volume of the target countermeasure voice according to a second target trend to obtain a first target countermeasure voice;
performing track splicing according to the first original voice and the first target countermeasure voice to obtain an initial countermeasure spoofing voice; and
determining a voice recognition accuracy and a spoofing rate according to the original voice and the initial countermeasure spoofing voice, and determining the initial countermeasure spoofing voice as the countermeasure spoofing voice when the voice recognition accuracy and the spoofing rate meet a preset voice spoofing condition.
6. The method of claim 5, wherein the adjusting the volume of the original voice according to the first target trend to obtain the first original voice, and adjusting the volume of the target countermeasure voice according to the second target trend to obtain the first target countermeasure voice, comprises:
adjusting, with the volume of the original voice as a starting point, according to the first target trend to obtain a plurality of first original voices of decreasing volume, including the original voice; and
adjusting, with the volume of the target countermeasure voice as a starting point, according to the second target trend to obtain a plurality of first target countermeasure voices of increasing volume, including the target countermeasure voice.
7. A speech processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire an interference voice set, and perform voice frequency recognition on a plurality of candidate interference voices contained in the interference voice set to obtain a voice frequency recognition result of each candidate interference voice; and determine, according to a preset voice frequency condition and each voice frequency recognition result, a target voice frequency recognition result meeting the voice frequency condition, and take the candidate interference voice corresponding to the target voice frequency recognition result as a target candidate interference voice;
a determining module, configured to perform language identification on the target candidate interference voice and the original voice, and determine the interference voice based on the languages of the target candidate interference voice and the original voice;
an analysis module, configured to extract audio features of the original voice and the interference voice to obtain the audio features of the original voice and the audio features of the interference voice, wherein the interference voice is determined according to the preset voice frequency condition;
a first processing module, configured to perform phoneme stacking processing on the audio features of the interference voice according to the audio features of the original voice to obtain a countermeasure voice;
a second processing module, configured to perform power spectral density processing on the countermeasure voice to obtain a processed target countermeasure voice; and
a third processing module, configured to adjust the volume of the original voice and the target countermeasure voice, and perform track splicing on the original voice and the target countermeasure voice to obtain a countermeasure spoofing voice;
wherein the determining module is specifically configured to compare the language of the target candidate interference voice with the language of the original voice, and determine the target candidate interference voice as the interference voice if the two languages are different; and
the first processing module is specifically configured to perform phoneme-level voice processing on the audio units of the interference voice based on the phonemes contained in each audio unit of the original voice and the times corresponding to the phonemes, and generate the countermeasure voice based on the phonemes of the interference voice.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the speech processing method of any one of claims 1 to 6 when executing the computer program.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the speech processing method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the speech processing method of any of claims 1 to 6.
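The phoneme stacking of claims 1 and 2 can be sketched as time-aligned segment matching: each phoneme interval of the original voice is paired with an interfering segment that carries the same phoneme, and the interfering segment is re-timed onto the original timeline. This is a minimal illustrative sketch, not the patent's implementation; the `Segment` type and `stack_phonemes` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    phoneme: str   # phoneme label, e.g. "a" or "sh"
    start: float   # start time in seconds
    end: float     # end time in seconds

def stack_phonemes(original: List[Segment], interference: List[Segment]) -> List[Segment]:
    """For each phoneme interval of the original voice, pick an interfering
    segment with a matching phoneme label and re-time it onto the original's
    interval, yielding countermeasure segments ordered by original time."""
    by_label: Dict[str, List[Segment]] = {}
    for seg in interference:
        by_label.setdefault(seg.phoneme, []).append(seg)
    stacked: List[Segment] = []
    for seg in original:
        matches = by_label.get(seg.phoneme)
        if not matches:
            continue  # no matching phoneme available in the interference voice
        # Reuse the first match; align it to the original phoneme's interval.
        stacked.append(Segment(matches[0].phoneme, seg.start, seg.end))
    return stacked
```

In practice each matched segment would carry audio samples to be overlaid; here the sketch only tracks labels and timing.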
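The overlap ratio of claims 3 and 4 is driven by intersection points of two power spectral density curves. One plausible reading, sketched below under that assumption (the function name and the normalisation are hypothetical), counts the places where the sampled curves cross or touch:

```python
from typing import List, Optional

def overlap_ratio(psd_a: List[float], psd_b: List[float]) -> float:
    """Count intersection points of two sampled PSD curves: a crossing is
    where the sign of their difference flips, and an exact touch counts too.
    More intersection points yield a higher ratio (claim 4)."""
    diff = [a - b for a, b in zip(psd_a, psd_b)]
    crossings = 0
    prev: Optional[bool] = None
    for d in diff:
        if d == 0:            # the curves touch exactly: count as an intersection
            crossings += 1
            prev = None
            continue
        sign = d > 0
        if prev is not None and sign != prev:
            crossings += 1    # sign flip: the curves crossed between bins
        prev = sign
    # Normalise by the number of intervals so the ratio is scale-free.
    return crossings / max(len(diff) - 1, 1)
```

A masking threshold could then be chosen as an increasing function of this ratio, keeping the countermeasure voice's PSD under the original's masking curve.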
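The volume adjustment and acceptance test of claims 5 and 6 can be sketched as a gain ladder (decreasing for the original voice, increasing for the target countermeasure voice) plus a threshold check on recognition accuracy and spoofing rate. All names, the step size, and both thresholds below are illustrative assumptions, not values taken from the patent.

```python
from typing import List

def volume_ladder(base_gain: float, steps: int, increasing: bool,
                  step_size: float = 0.1) -> List[float]:
    """Generate a sequence of gains starting at base_gain: decreasing for
    the original voice, increasing for the target countermeasure voice
    (claim 6). The 0.1 step size is an illustrative assumption."""
    delta = step_size if increasing else -step_size
    return [round(base_gain + i * delta, 10) for i in range(steps)]

def accept_spoof(recognition_accuracy: float, spoof_rate: float,
                 acc_threshold: float = 0.9,
                 spoof_rate_threshold: float = 0.8) -> bool:
    """Accept a candidate countermeasure spoofing voice only if the original
    content is still recognized well enough AND the spoofing rate is high
    enough (claim 5); both thresholds are hypothetical."""
    return recognition_accuracy >= acc_threshold and spoof_rate >= spoof_rate_threshold
```

Each pair of gains from the two ladders would be applied before track splicing, and the first spliced candidate passing `accept_spoof` would be kept as the countermeasure spoofing voice.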
CN202211150068.9A 2022-09-21 2022-09-21 Speech processing method, device, computer equipment and storage medium Active CN115512683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211150068.9A CN115512683B (en) 2022-09-21 2022-09-21 Speech processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115512683A CN115512683A (en) 2022-12-23
CN115512683B true CN115512683B (en) 2024-05-24

Family

ID=84504413

Country Status (1)

Country Link
CN (1) CN115512683B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH096381A (en) * 1995-06-21 1997-01-10 Meidensha Corp Voice word recognition method
JP2008233672A (en) * 2007-03-22 2008-10-02 Yamaha Corp Masking sound generation apparatus, masking sound generation method, program, and recording medium
US7555432B1 (en) * 2005-02-10 2009-06-30 Purdue Research Foundation Audio steganography method and apparatus using cepstrum modification
WO2012070655A1 (en) * 2010-11-25 2012-05-31 ヤマハ株式会社 Masker sound generation device, storage medium which stores masker sound signal, masker sound player device, and program
RU2622631C1 (en) * 2016-03-11 2017-06-16 Федеральное государственное казенное военное образовательное учреждение высшего образования "Академия Федеральной службы охраны Российской Федерации" (Академия ФСО России) Method of forming masking interference for protection of speech information
CN114093371A (en) * 2021-10-11 2022-02-25 浙江大学 Phoneme-level voiceprint recognition countermeasure sample construction system and method based on neural network generation model
CN114783447A (en) * 2022-04-21 2022-07-22 浙江大学 Physical domain identity camouflage system and method based on voiceprint recognition antagonistic sample

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant