
CN114627876A - Intelligent voice recognition security defense method and device based on audio dynamic adjustment

Info

Publication number: CN114627876A
Application number: CN202210498651.2A
Authority: CN
Inventors: 王滨; 李超豪; 王星; 闫琛; 王伟; 钱锦
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2022-05-09
Filing date: 2022-05-09
Publication date: 2022-06-14
Anticipated expiration: 2042-05-09
Also published as: CN114627876B

Abstract

The application provides an intelligent voice recognition security defense method and device based on audio dynamic adjustment. The method comprises the following steps: acquiring audio data to be protected; performing a speed doubling operation on the audio data to be protected by using an initial speed doubling value selected from a preset speed doubling value range, to obtain audio data subjected to the speed doubling operation; determining, from the preset speed doubling value range, a target speed doubling value that makes a first recognition result inconsistent with a second recognition result, according to the initial speed doubling value and a comparison result of the first recognition result and the second recognition result, where the first recognition result is the recognition result of an intelligent voice recognition model on the audio data to be protected and the second recognition result is its recognition result on the audio data subjected to the speed doubling operation; and outputting the audio data subjected to the speed doubling operation using the target speed doubling value. The method can achieve a highly concealed, high-fidelity audio security defense effect.

Description

Intelligent voice recognition security defense method and device based on audio dynamic adjustment

Technical Field

The application relates to the field of voice recognition security, in particular to an intelligent voice recognition security defense method and device based on audio dynamic adjustment.

Background

With the development of voice recognition technology, an intelligent voice recognition system gradually becomes one of important intelligent components equipped in the internet of things, so that voice interaction becomes an important scene in human-computer interaction of the internet of things. The intelligent voice recognition system can be used for scenes such as voice intelligent translation, voice control assistants and the like, and the life and the working efficiency of a user are greatly improved by automatically transcribing the input audio file.

For an input audio, the intelligent speech recognition system first performs signal preprocessing on the input audio to reduce noise in the original audio and remove extraneous frequency components. The processed audio signal is then further divided into audio frames of shorter length. Then, the intelligent speech recognition system extracts acoustic features, such as Mel Frequency Cepstral Coefficients (MFCCs) from the audio frames, and maps the extracted acoustic features to a text sequence with the highest probability based on a pre-trained speech recognition model.
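As a minimal illustration of the framing step described above, the following sketch splits a sample sequence into overlapping frames. The function name, the 16 kHz sample rate, the 25 ms frame length and the 10 ms hop are assumptions chosen for illustration and are not specified by the application.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_len: int, hop_len: int
) -> List[List[float]]:
    """Split a 1-D sample sequence into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


# Assumed illustrative values: 16 kHz audio, 25 ms frames, 10 ms hop.
sample_rate = 16000
frame_len = sample_rate * 25 // 1000
hop_len = sample_rate * 10 // 1000
signal = [0.0] * sample_rate  # one second of silence as placeholder audio
frames = split_into_frames(signal, frame_len, hop_len)
```

Acoustic features such as MFCCs would then be computed per frame; that step is omitted here.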

However, while the intelligent voice recognition system improves the convenience of people's life and work, attackers also use it for illegal intelligent monitoring and other malicious behaviors, which poses a great threat to the privacy and property safety of legitimate users.

Disclosure of Invention

In view of this, the present application provides an intelligent voice recognition security defense method and apparatus based on audio dynamic adjustment.

Specifically, the method is realized through the following technical scheme:

according to a first aspect of the embodiments of the present application, there is provided an intelligent voice recognition security defense method based on audio dynamics adjustment, including:

acquiring audio data to be protected;

carrying out speed doubling operation on the audio data to be protected by using an initial speed doubling value selected from a preset speed doubling value range to obtain audio data subjected to speed doubling operation;

determining, from the preset speed doubling value range, a target speed doubling value that makes a first recognition result inconsistent with a second recognition result, according to the initial speed doubling value and a comparison result of the first recognition result and the second recognition result; the first recognition result is the recognition result of the intelligent voice recognition model on the audio data to be protected, and the second recognition result is the recognition result of the intelligent voice recognition model on the audio data after the speed doubling operation;

and outputting the audio data subjected to speed doubling operation by using the target speed doubling value.

According to a second aspect of the embodiments of the present application, there is provided an intelligent voice recognition security defense device based on audio dynamics adjustment, including:

an acquisition unit, configured to acquire audio data to be protected;

the speed doubling operation unit is used for carrying out speed doubling operation on the audio data to be protected by utilizing an initial speed doubling value selected from a preset speed doubling value range to obtain the audio data subjected to speed doubling operation;

a determining unit, configured to determine, from the preset speed doubling value range, a target speed doubling value that makes a first recognition result inconsistent with a second recognition result, according to the initial speed doubling value and a comparison result of the first recognition result and the second recognition result; the first recognition result is the recognition result of the intelligent voice recognition model on the audio data to be protected, and the second recognition result is the recognition result of the intelligent voice recognition model on the audio data after the speed doubling operation;

and the output unit is used for outputting the audio data subjected to the speed doubling operation by using the target speed doubling value.

According to a third aspect of embodiments of the present application, there is provided an electronic apparatus including:

a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine-executable instructions to implement the above-described method.

In the intelligent voice recognition security defense method based on audio dynamic adjustment of the embodiments of the application, a speed doubling value range is set according to the ability of the human ear to understand audio at different playing speeds, and a speed doubling operation is performed on the audio data to be protected using a speed doubling value determined within that range. According to the recognition results of the intelligent voice recognition model on the audio data before and after the speed doubling operation, a target speed doubling value that makes those recognition results inconsistent is determined within the range, and the speed doubling operation is performed on the audio data to be protected using the target speed doubling value. The resulting audio data can be understood normally by the human ear but is recognized incorrectly by the intelligent voice recognition model. In this way, a reasonable speed doubling value can be selected adaptively for different audio data. No additional hardware is needed: the defense against illegal intelligent voice recognition systems is achieved by preprocessing the audio at the software level, which is convenient to deploy and highly extensible. Because the audio content does not need to be modified, a highly concealed, high-fidelity security defense effect is achieved.

Drawings

FIG. 1 is a flowchart illustrating an intelligent voice recognition security defense method based on audio dynamic adjustment according to an exemplary embodiment of the present application;

FIG. 2 is a flowchart illustrating an intelligent voice recognition security defense method based on audio dynamic adjustment according to an exemplary embodiment of the present application;

FIG. 3 is a flowchart illustrating an intelligent voice recognition security defense method based on audio dynamic adjustment according to an exemplary embodiment of the present application;

FIG. 4 is a schematic diagram illustrating an intelligent voice recognition security defense apparatus based on audio dynamic adjustment according to an exemplary embodiment of the present application;

FIG. 5 is a schematic diagram of a hardware structure of an electronic device according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

In order to make those skilled in the art better understand the technical solutions provided by the embodiments of the present application, a brief description will be given below of some terms related to the embodiments of the present application.

1. Intelligent speech recognition system: refers to a speech recognition system that is capable of automatically recognizing audio files and outputting transcribed text.

2. Speed doubling operation: refers to an operation of changing the audio playback speed without changing the audio tone.

3. Speed doubling value: the ratio of the playing speed of the audio after the double speed operation to the original audio playing speed is referred to.
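As a small worked illustration of the speed doubling value defined above: because it is the ratio of the new playing speed to the original one, the playing time of the audio scales by its reciprocal. The function name below is hypothetical.

```python
def duration_after_speed_change(duration_s: float, speed_value: float) -> float:
    """Return the playing time, in seconds, after a speed doubling operation.

    speed_value is the ratio of the new playing speed to the original one,
    so a value above 1 shortens the audio and a value below 1 lengthens it.
    """
    if speed_value <= 0:
        raise ValueError("speed_value must be positive")
    return duration_s / speed_value


faster = duration_after_speed_change(10.0, 1.25)  # played at 1.25x speed
slower = duration_after_speed_change(10.0, 0.5)   # played at 0.5x speed
```

Only the duration changes in this relation; by the definition above, the speed doubling operation leaves the audio tone unchanged.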

In order to make the aforementioned objects, features and advantages of the embodiments of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, which is a schematic flowchart of an intelligent voice recognition security defense method based on audio dynamic adjustment according to an embodiment of the present application, the method may include the following steps:

Step S100, acquiring audio data to be protected.

Illustratively, the audio data to be protected may include, but is not limited to, an audio verification code.

For example, the audio verification code to be generated (i.e., the audio data used to generate the audio verification code) or the audio verification code to be propagated may be used as the audio data to be protected in the process of generating or propagating the audio verification code.

Step S110, performing speed doubling operation on the audio data to be protected by using an initial speed doubling value selected from a preset speed doubling value range, to obtain audio data after speed doubling operation.

In the embodiment of the application, for the acquired audio data to be protected, a speed doubling value can be selected from a preset speed doubling value range to serve as an initial speed doubling value, and the audio data to be protected is subjected to speed doubling operation to obtain the audio data subjected to speed doubling operation.

For example, the preset speed doubling value range may be an empirical value range set according to the comprehension capability of the human ear to the audio data (i.e., the intelligibility of the audio) under a normal condition, that is, after the audio data to be protected is subjected to speed doubling operation according to any speed doubling value within the preset speed doubling value range, the human ear can normally understand the audio data after the speed doubling operation (i.e., the user can understand the audio data after the speed doubling operation).

Step S120, determining, from the preset speed doubling value range, a target speed doubling value that makes a first recognition result inconsistent with a second recognition result, according to the initial speed doubling value and a comparison result of the first recognition result and the second recognition result; the first recognition result is the recognition result of the intelligent voice recognition model on the audio data to be protected, and the second recognition result is the recognition result of the intelligent voice recognition model on the audio data after the speed doubling operation.

In the embodiment of the application, in order to prevent the audio data to be protected from being maliciously monitored and cracked, when the speed doubling operation is performed on the audio data to be protected, the protective effect of the operation can be verified according to the recognition results of the intelligent voice recognition model on the audio data before and after the operation. The aim is that the recognition result of the intelligent voice recognition model on the original audio data to be protected (referred to herein as the first recognition result) is inconsistent with its recognition result on the audio data subjected to the speed doubling operation (referred to herein as the second recognition result).

Accordingly, in the case that the audio data to be protected is subjected to the double-speed operation in the manner described in step S110 to obtain the audio data after the double-speed operation, the audio data after the double-speed operation may be subjected to the speech recognition according to the intelligent speech recognition model to obtain a recognition result (i.e., the second recognition result), and the second recognition result and the recognition result of the intelligent speech recognition model on the original audio data (i.e., the first recognition result) are compared to determine whether the first recognition result is consistent with the second recognition result.
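The comparison of the first and second recognition results can be sketched as follows. Everything in this sketch is a toy stand-in rather than part of the application: the audio is modeled as a (transcript, speed value) pair, `change_speed` and `toy_recognize` are hypothetical placeholders for a real tone-preserving speed change and a real recognition model, and comparing normalized transcripts for equality is an assumed comparison rule.

```python
from typing import Tuple

# Toy stand-in (hypothetical): audio is modeled as (spoken text, speed value).
Audio = Tuple[str, float]


def change_speed(audio: Audio, speed_value: float) -> Audio:
    """Placeholder for a speed doubling operation that keeps the tone."""
    text, _ = audio
    return (text, speed_value)


def toy_recognize(audio: Audio) -> str:
    """Placeholder recognizer: correct up to 1.4x, otherwise drops words."""
    text, speed = audio
    if speed <= 1.4:
        return text
    return " ".join(text.split()[::2])


def results_consistent(first: str, second: str) -> bool:
    """Assumed rule: transcripts match after case and whitespace cleanup."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(first) == normalize(second)


original = ("seven three nine one", 1.0)
first_result = toy_recognize(original)                       # before the operation
second_result = toy_recognize(change_speed(original, 1.6))   # after the operation
consistent = results_consistent(first_result, second_result)
```

In a real deployment the two placeholder functions would be replaced by an actual audio time-stretching routine and the recognition model being defended against.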

For example, a target speed doubling value for making the first recognition result inconsistent with the second recognition result may be determined from a preset speed doubling value range according to a comparison result between the first recognition result and the second recognition result and the initial speed doubling value.

Step S130, outputting the audio data subjected to the speed doubling operation using the target speed doubling value.

In the embodiment of the present application, when the target multiple speed value is determined as described above, the audio data after the multiple speed operation using the target multiple speed value may be output.

It can be seen that, in the process flow of the method shown in FIG. 1, a speed doubling value range is set according to the ability of the human ear to understand audio at different playing speeds, and a speed doubling operation is performed on the audio data to be protected using a speed doubling value determined within that range. According to the recognition results of the intelligent voice recognition model on the audio data before and after the speed doubling operation, a target speed doubling value that makes those recognition results inconsistent is determined within the range, and the speed doubling operation is performed on the audio data to be protected using the target speed doubling value. The resulting audio data can be understood normally by the human ear but is recognized incorrectly by the intelligent voice recognition model, so that a reasonable speed doubling value can be selected adaptively for different audio data. No additional hardware is needed: the defense against illegal intelligent voice recognition systems is achieved by preprocessing the audio at the software level, which is convenient to deploy and highly extensible. Because the audio content does not need to be modified, a highly concealed, high-fidelity security defense effect is achieved.

In some embodiments, in step S120, determining a target multiple speed value that makes the first recognition result inconsistent with the second recognition result from a preset multiple speed value range according to the comparison result between the first recognition result and the second recognition result and the initial multiple speed value may include:

under the condition that the first recognition result is consistent with the second recognition result, updating the initial speed multiplication value within a preset speed multiplication value range to obtain a target speed multiplication value;

and determining the initial speed multiplication value as the target speed multiplication value when the first recognition result is inconsistent with the second recognition result.

For example, in order to search for a speed doubling value which makes the intelligent speech recognition model recognize the audio data incorrectly within a preset speed doubling value range, under the condition that the selected speed doubling value is used for performing speed doubling operation on the audio data to be protected to obtain the audio data after the speed doubling operation, the recognition results (i.e. the first recognition result and the second recognition result) of the intelligent speech recognition model on the audio data before and after the speed doubling operation can be compared.

In the case that the first recognition result is consistent with the second recognition result, the selected speed doubling value can be updated within the preset speed doubling value range, that is, a new speed doubling value is selected from the range. The speed doubling operation is then performed on the audio data to be protected using the updated speed doubling value, and it is determined whether the updated second recognition result is consistent with the first recognition result. If so, the selected speed doubling value continues to be updated within the preset range; once the second recognition result is inconsistent with the first recognition result, the currently selected speed doubling value is determined as the target speed doubling value.

For example, considering that a larger speed doubling value usually gives a higher probability that the intelligent speech recognition model misrecognizes the audio data after the speed doubling operation, when the second recognition result corresponding to the currently selected speed doubling value is consistent with the first recognition result (that is, when the selected speed doubling value needs to be updated), a higher speed doubling value can be selected within the preset speed doubling value range.

In the case where the first recognition result does not coincide with the second recognition result, the currently selected multiple speed value may be determined as the target multiple speed value.

For example, if the currently selected multiple speed value is the initial multiple speed value, the initial multiple speed value may be determined as a target multiple speed value, in which case, the audio data after the multiple speed operation performed by using the target multiple speed value is the audio data after the multiple speed operation obtained in step S110.
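The embodiment above can be sketched as a simple loop over candidate speed doubling values. This is only an illustrative sketch under stated assumptions: `is_consistent` is a hypothetical callback that performs the speed doubling operation, runs the recognition model and reports whether the second recognition result matches the first; the candidate grid (1.0 to 2.0 in steps of 0.1) and returning `None` when no candidate works are also assumptions, since the text does not specify them.

```python
from typing import Callable, Optional, Sequence


def find_target_speed(
    candidates: Sequence[float],
    is_consistent: Callable[[float], bool],
) -> Optional[float]:
    """Return the first candidate that makes the recognition results differ.

    candidates should start with the initial speed doubling value and
    continue with the values tried on each update (here: increasing values
    within the preset range). Returns None if every candidate is still
    recognized consistently (an assumption of this sketch).
    """
    for speed in candidates:
        if not is_consistent(speed):
            return speed
    return None


# Assumed preset range 1.0..2.0 searched on a 0.1 grid, lowest value first.
candidates = [round(1.0 + 0.1 * i, 2) for i in range(11)]


def toy_is_consistent(speed: float) -> bool:
    """Toy callback: the model is assumed to fail from 1.5x upward."""
    return speed < 1.5


target = find_target_speed(candidates, toy_is_consistent)
```

If the initial speed doubling value already makes the results inconsistent, the loop returns it immediately, which corresponds to the case in which the output of step S110 is used directly.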

In other embodiments, in step S120, determining a target multiple speed value that makes the first recognition result inconsistent with the second recognition result from a preset multiple speed value range according to the comparison result between the first recognition result and the second recognition result and the initial multiple speed value may include:

performing iterative updating on the currently used speed doubling value within the preset speed doubling value range by using a specified updating strategy; wherein the specified updating strategy comprises: if the first recognition result is inconsistent with the current second recognition result, adjusting the currently used speed doubling value downwards within the preset speed doubling value range; if the first recognition result is consistent with the current second recognition result, adjusting the currently used speed doubling value upwards within the preset speed doubling value range;

and under the condition that a preset jumping-out condition is reached, determining a target speed doubling value.

For example, the closer the finally selected speed doubling value is to 1, the better the human ear can understand the audio data after the speed doubling operation. Therefore, when the audio data to be protected is protected through the speed doubling operation, a speed doubling value as close to 1 as possible can be searched for, subject to the recognition results of the audio data before and after the speed doubling operation being inconsistent. This improves the security of the audio data while keeping the audio understandable to the human ear, thereby optimizing the user experience.

For example, the currently used speed value may be iteratively updated within a preset speed value range by using a specified updating policy according to a comparison result (such as a consistency or inconsistency) between a recognition result (i.e., a first recognition result) of the original to-be-protected audio data by the intelligent speech recognition model and a recognition result (i.e., a second recognition result) of the audio data after the speed doubling operation by the intelligent speech recognition model, and the initial speed value, until a preset jump-out condition is reached, and a target speed value is determined.

Exemplarily, in the process of iteratively updating the currently used speed doubling value with the specified updating strategy: in the case that the selected speed doubling value (i.e., the current speed doubling value) makes the first recognition result inconsistent with the second recognition result, a smaller speed doubling value can be selected within the preset speed doubling value range to update the speed doubling value, that is, the currently used speed doubling value is adjusted downwards; in the case that the selected speed doubling value makes the first recognition result consistent with the second recognition result, a larger speed doubling value can be selected within the preset speed doubling value range to update the speed doubling value, that is, the currently used speed doubling value is adjusted upwards. This continues until the jump-out condition is reached, and the last selected speed doubling value that makes the first recognition result inconsistent with the second recognition result is determined as the target speed doubling value.

In an example, the iteratively updating, by using the specified updating policy, the speed value currently used in the preset speed value range may include:

under the condition that the lower limit of the preset speed doubling value range is larger than or equal to 1, if the first recognition result is inconsistent with the current second recognition result, the upper limit of the current speed doubling value range is adjusted downwards according to a preset adjustment interval, the currently used speed doubling value is updated to the median of the adjusted speed doubling value range, the second recognition result is updated according to the updated speed doubling value, and the speed doubling value range is continuously adjusted according to the comparison result of the first recognition result and the updated second recognition result;

if the first recognition result is consistent with the current second recognition result and the currently used speed doubling value is equal to the lower limit of the current speed doubling value range, updating the currently used speed doubling value to the middle value of the current speed doubling value range, updating the second recognition result according to the updated speed doubling value, and continuously adjusting the speed doubling value range according to the comparison result of the first recognition result and the updated second recognition result;

if the first recognition result is consistent with the current second recognition result, and the currently used speed doubling value is larger than the lower limit of the current speed doubling value range and smaller than the upper limit of the current speed doubling value range, adjusting the lower limit of the current speed doubling value range upwards according to a preset adjustment interval, updating the currently used speed doubling value to the middle value of the adjusted speed doubling value range, updating the second recognition result according to the updated speed doubling value, and continuously adjusting the speed doubling value range according to the comparison result of the first recognition result and the updated second recognition result;

the determining the target speed doubling value when the preset jumping-out condition is reached may include:

and under the condition that the lower limit of the adjusted speed value range is larger than or equal to the upper limit of the adjusted speed value range, determining the median of the adjusted speed value range as the target speed value.

For example, consider that when the speed doubling value is less than 1, the larger the selected speed doubling value, the closer it is to 1 and the better the human ear can understand the audio data after the speed doubling operation; and when the speed doubling value is greater than 1, the smaller the selected speed doubling value, the closer it is to 1 and the better the human ear can understand the audio data after the speed doubling operation.

Correspondingly, in the case that the lower limit of the preset speed doubling value range is greater than or equal to 1, if the second recognition result corresponding to the currently selected speed doubling value is inconsistent with the first recognition result, the currently selected speed doubling value already makes the intelligent voice recognition model recognize incorrectly. In this case, in order to improve the human ear's understanding of the audio data after the speed doubling operation, the upper limit of the current speed doubling value range can be adjusted downwards according to the preset adjustment interval, that is, the upper limit of the range is reduced; for example, the upper limit is updated to the difference between the median of the current speed doubling value range and the preset adjustment interval. The currently used speed doubling value is then updated to the median of the adjusted speed doubling value range, the second recognition result is updated according to the updated speed doubling value, the first recognition result is compared with the updated second recognition result, and the speed doubling value range continues to be adjusted according to the comparison result.

If the first recognition result is consistent with the current second recognition result, different speed-doubling value updating strategies can be further adopted according to whether the currently selected speed-doubling value is the lower limit or the upper limit of the current speed-doubling value range.

For example, if the first recognition result is consistent with the current second recognition result, and the currently used multiple speed value is equal to the lower limit of the current multiple speed value range, the currently used multiple speed value may be updated to the middle value of the current multiple speed value range, the second recognition result may be updated according to the updated multiple speed value, and the multiple speed value range may be continuously adjusted according to the comparison result between the first recognition result and the updated second recognition result.

If the first recognition result is consistent with the current second recognition result, and the currently used multiple speed value is larger than the lower limit of the current multiple speed value range and smaller than the upper limit of the current multiple speed value range, the lower limit of the current multiple speed value range can be adjusted upwards according to a preset adjustment interval, for example, the lower limit of the multiple speed value range is updated to the sum of the median of the current multiple speed value range and the preset adjustment interval, the currently used multiple speed value is updated to the median of the adjusted multiple speed value range, the second recognition result is updated according to the updated multiple speed value, and the multiple speed value range is continuously adjusted according to the comparison result of the first recognition result and the updated second recognition result.

In the case that the lower limit of the adjusted speed multiplication value range is greater than or equal to the upper limit of the adjusted speed multiplication value range, it may be determined that the jump-out condition is reached, in this case, the search and update of the speed multiplication value may be ended, and the median of the adjusted speed multiplication value range is determined as the target speed multiplication value, and a specific implementation manner thereof may be described below with reference to an example.
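One possible reading of the range-narrowing strategy above, for a preset range whose lower limit is greater than or equal to 1, is sketched below. It is an illustrative sketch, not the definitive implementation. Assumptions: `is_consistent` is a hypothetical callback that performs the speed doubling operation, runs the recognition model and reports whether the second recognition result matches the first; a consistent result at the upper limit is handled like the interior case; and, because the median of the final range may be untested or may fall outside the preset range, the sketch returns it only after verifying it and otherwise falls back to the last value known to make the results inconsistent.

```python
from typing import Callable, Optional


def search_speed_at_least_one(
    lower: float,
    upper: float,
    step: float,
    initial: float,
    is_consistent: Callable[[float], bool],
) -> Optional[float]:
    """Narrow [lower, upper] (lower >= 1) toward a value close to 1 that
    still makes the recognition results inconsistent."""
    if step <= 0 or not (1.0 <= lower <= initial <= upper):
        raise ValueError("need step > 0 and 1 <= lower <= initial <= upper")

    preset_lower, preset_upper = lower, upper
    speed = initial
    last_inconsistent: Optional[float] = None

    while lower < upper:  # jump-out condition: lower limit >= upper limit
        mid = (lower + upper) / 2
        if not is_consistent(speed):
            # Model is fooled: remember the value, then try values nearer 1.
            last_inconsistent = speed
            upper = mid - step
        elif speed == lower:
            # Still recognized at the lower limit: move to the median and
            # leave the range unchanged.
            speed = mid
            continue
        else:
            # Still recognized elsewhere (upper limit included, by
            # assumption): raise the lower limit.
            lower = mid + step
        speed = (lower + upper) / 2

    candidate = (lower + upper) / 2
    # Assumption: accept the final median only if it is in the preset range
    # and verified to fool the model; otherwise use the last verified value.
    if preset_lower <= candidate <= preset_upper and not is_consistent(candidate):
        return candidate
    return last_inconsistent


# Toy usage: the model is assumed to fail from 1.33x upward.
target = search_speed_at_least_one(1.0, 2.0, 0.1, 1.0, lambda s: s < 1.33)
```

The sketch returns `None` when no tested value in the range fools the model; how that case should be handled is not stated in the text above.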

In another example, the above iteratively updating, by using the specified update policy, the currently used multiple speed value within a preset multiple speed value range may include:

under the condition that the upper limit of a preset speed doubling value range is less than or equal to 1, if the first identification result is inconsistent with the current second identification result, adjusting the lower limit of the current speed doubling value range upwards according to a preset adjustment interval, updating the currently used speed doubling value to the middle value of the adjusted speed doubling value range, updating the second identification result according to the updated speed doubling value, and continuously adjusting the speed doubling value range according to the comparison result of the first identification result and the updated second identification result;

if the first recognition result is consistent with the current second recognition result and the currently used speed doubling value is equal to the upper limit of the current speed doubling value range, updating the currently used speed doubling value to the middle value of the current speed doubling value range, updating the second recognition result according to the updated speed doubling value, and continuously adjusting the speed doubling value range according to the comparison result of the first recognition result and the updated second recognition result;

if the first recognition result is consistent with the current second recognition result, and the currently used speed doubling value is larger than the lower limit of the current speed doubling value range and smaller than the upper limit of the current speed doubling value range, downwards adjusting the upper limit of the current speed doubling value range according to the preset adjustment interval, updating the currently used speed doubling value to the middle value of the adjusted speed doubling value range, updating the second recognition result according to the updated speed doubling value, and continuously adjusting the speed doubling value range according to the comparison result of the first recognition result and the updated second recognition result.

For example, when the speed multiplication value is less than 1, the smaller the value is, the more the playback speed differs from that of the original audio to be protected, and the harder the audio data after the speed multiplication operation generally is for the human ear to understand.

Therefore, in the case that the upper limit of the speed multiplication value range is less than or equal to 1, the upper limit or the lower limit of the speed multiplication value range is adjusted in a strategy opposite to the case that the lower limit of the speed multiplication value range is greater than or equal to 1.

Namely, under the condition that the first recognition result is inconsistent with the second recognition result, the lower limit of the value range of the current speed doubling value can be adjusted upwards; and under the condition that the first recognition result is consistent with the second recognition result, the upper limit of the current speed doubling value range can be adjusted downwards.

In other respects, the speed multiplication value updating strategy for the case that the upper limit of the speed multiplication value range is less than or equal to 1 is similar to that for the case that the lower limit of the speed multiplication value range is greater than or equal to 1, and is not described again here.
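The mirrored rule for a range whose upper limit is at most 1 can be sketched in Python as follows. This is only an illustrative sketch: the names `midpoint` and `update_slow_range`, the half-up rounding, and the use of Mid plus or minus 0.01 as the new bound (borrowed from the worked embodiment for the range [1, 3]) are assumptions, since the text above only states that the bounds are moved according to a preset adjustment interval.

```python
from decimal import Decimal, ROUND_HALF_UP

STEP = Decimal("0.01")  # assumed preset adjustment interval


def midpoint(low, high):
    """Median of [low, high], kept to 2 decimal places (half-up is assumed)."""
    return ((low + high) / 2).quantize(STEP, rounding=ROUND_HALF_UP)


def update_slow_range(low, high, current, mismatch):
    """One update step for a range whose upper limit is <= 1.

    mismatch is True when the first and second recognition results differ.
    Returns the adjusted (low, high) and the next speed value to try.
    """
    mid = midpoint(low, high)
    if mismatch:
        # Recognition already fails: raise the lower limit, moving toward 1.
        low = mid + STEP
    elif current != high:
        # Recognition still succeeds: lower the upper limit, moving away from 1.
        # The source only names the case low < current < high; treating
        # current == low the same way is an assumption of this sketch.
        high = mid - STEP
    # If results match and current == high, the range is left unchanged.
    return low, high, midpoint(low, high)
```

Starting from the range [0.5, 1] with the current value at the upper limit, the first step leaves the range unchanged and only moves the current value to the median.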

In another example, the iteratively updating the currently used multiple speed value within the preset multiple speed value range by using the specified updating policy may include:

under the condition that the lower limit of a preset speed multiplication value range is less than 1 and the upper limit of the preset speed multiplication value range is greater than 1, dividing the preset speed multiplication value range into a first sub-value range and a second sub-value range, wherein the upper limit of the first sub-value range is 1, and the lower limit of the second sub-value range is 1;

respectively carrying out iterative updating on the currently used speed doubling value in the first sub-value range and carrying out iterative updating on the currently used speed doubling value in the second sub-value range by using an appointed updating strategy;

the determining the target speed doubling value when the preset jump-out condition is reached includes:

under the condition that a preset jumping-out condition is achieved, respectively obtaining a first target speed doubling value in the first sub-value range and obtaining a second target speed doubling value in the second sub-value range; determining the second target speed-multiplying value as a target speed-multiplying value under the condition that the reciprocal of the first target speed-multiplying value is larger than the second target speed-multiplying value;

determining the first target speed-multiplying value as a target speed-multiplying value under the condition that the reciprocal of the first target speed-multiplying value is smaller than the second target speed-multiplying value;

in the case where the reciprocal of the first target double speed value is equal to the second target double speed value, the first target double speed value or the second target double speed value is determined as the target double speed value.

For example, under the condition that the speed multiplication value range spans 1, that is, under the condition that the upper limit of the speed multiplication value range is greater than 1 and the lower limit is less than 1, the speed multiplication value range can be searched in a segmented manner to obtain the target speed multiplication value.

For example, the speed multiplication value range may be divided into a sub-value range with an upper limit of 1 (referred to as a first sub-value range herein) and a sub-value range with a lower limit of 1 (referred to as a second sub-value range herein), and a target speed multiplication value (referred to as a first target speed multiplication value herein) and a target speed multiplication value (referred to as a second target speed multiplication value herein) are respectively determined in the first sub-value range and the second sub-value range by using a specified update policy according to a comparison result between the first recognition result and the second recognition result and the initial speed multiplication value.

For example, of the first target speed multiplication value and the second target speed multiplication value, the one with the smaller rate of change may be determined as the target speed multiplication value.

For example, in the case that the double speed value is less than 1, the rate of change of the double speed value is the ratio of 1 to the double speed value; in the case where the double speed value is greater than 1, the rate of change of the double speed value is the ratio of the double speed value to 1.

Accordingly, in the case where the reciprocal of the first target double speed value is greater than the second target double speed value, that is, the rate of change in the second target double speed value is smaller, the second target double speed value may be determined as the target double speed value.

In the case where the reciprocal of the first target speed multiplication value is smaller than the second target speed multiplication value, that is, the rate of change of the first target speed multiplication value is smaller, the first target speed multiplication value may be determined as the target speed multiplication value.

In the case where the reciprocal of the first target speed multiplication value is equal to the second target speed multiplication value, that is, the two rates of change are the same, either the first target speed multiplication value or the second target speed multiplication value may be determined as the target speed multiplication value.
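The segmented search and the final selection described above can be illustrated with a short Python sketch. The function names `split_range`, `rate_of_change` and `choose_target` are hypothetical, and returning the first target value on a tie is an arbitrary choice, since the text allows either value in that case.

```python
def split_range(low, high):
    """Split a range that spans 1 into the first and second sub-ranges."""
    if not (low < 1 < high):
        raise ValueError("the range must span 1")
    return (low, 1), (1, high)


def rate_of_change(speed):
    """Rate of change: 1/speed below 1, speed/1 otherwise."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1 / speed if speed < 1 else speed


def choose_target(first_target, second_target):
    """Pick the target value whose rate of change is smaller.

    first_target comes from the sub-range below 1,
    second_target from the sub-range above 1.
    """
    if rate_of_change(first_target) > rate_of_change(second_target):
        return second_target
    # Smaller or equal rate of change: keep the first target value.
    return first_target
```

For instance, with a first target value of 0.5 (rate of change 2) and a second target value of 1.5, the second value would be selected because it deviates less from the original playback speed.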

In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present application, the technical solutions provided by the embodiments of the present application are described below with reference to specific examples.

In this embodiment, an intelligent voice recognition security defense method based on audio dynamic adjustment is provided to address the security threats posed by illegal intelligent voice recognition systems. The method dynamically adjusts the audio in key processes such as audio generation or transmission (for example, audio verification codes), for example by performing a speed multiplication operation on the whole audio, so that the adjusted audio can still be understood normally by a legitimate user while causing recognition errors in an illegal intelligent voice recognition system, thereby effectively protecting the played audio.

As shown in fig. 2 and fig. 3, the intelligent voice recognition security defense method based on audio dynamic adjustment provided by this embodiment may include the following processes:

S1: For a given input audio x, a speed multiplication operation is performed on the whole audio x, with a speed multiplication value of S.

Illustratively, assume that the initial input audio is the original audio x₀ (i.e., the audio data to be protected described above) and that the initial speed multiplication value is 1.

Exemplary algorithms or tools for performing the speed multiplication operation on the audio data to be protected may include, but are not limited to, FFmpeg, soundport, the Waveform Similarity Overlap-Add (WSOLA) algorithm, or the Phase Vocoder (PV-TSM) algorithm.
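As one possible illustration of step S1 with FFmpeg, the sketch below builds a command line that uses the `atempo` audio filter. The text does not say which FFmpeg filter is used, so the choice of `atempo`, the helper names `atempo_chain` and `build_ffmpeg_command`, and the chaining of several stages to keep each factor within 0.5 to 2.0 are all assumptions made for this example.

```python
def atempo_chain(speed):
    """Express a speed factor as chained atempo stages within [0.5, 2.0]."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    stages = []
    remaining = speed
    while remaining > 2.0:      # peel off factors of 2 for large speeds
        stages.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:      # peel off factors of 0.5 for small speeds
        stages.append(0.5)
        remaining /= 0.5
    stages.append(remaining)
    return ",".join(f"atempo={s:g}" for s in stages)


def build_ffmpeg_command(src, dst, speed):
    """Return the argument list; pass it to subprocess.run to execute it."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", atempo_chain(speed), dst]
```

Building the argument list separately from running it keeps the sketch testable on a machine without FFmpeg installed.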

S2: The audio data after the speed multiplication operation is input to the intelligent speech recognition model, and the output recognition result (i.e., the second recognition result) is recorded.

Illustratively, alternative intelligent speech recognition models (which may be referred to as test intelligent speech recognition models) may include, but are not limited to, one or more of DeepSpeech, Kaldi, and similar models.

S3: The recognition result of the intelligent speech recognition model is input to a bisection-based speed multiplication value adjusting module. If the second recognition result is inconsistent with the recognition result of the original audio data (i.e., the first recognition result), mark = 1; otherwise, mark = 0.

For example, the speed value search range of the speed value adjustment module based on the bisection method may be determined according to the speed value range of the adopted speed operation algorithm or tool.

For example, assume that the speed doubling value ranges from 1 to 3 (i.e., the lower limit is 1 and the upper limit is 3), and is spaced at 0.01 intervals (i.e., the preset adjustment interval is 0.01). As shown in fig. 3, a specific implementation flow of the speed doubling value adjusting module based on the bisection method may include:

S3.1: Set the value range of the speed multiplication value to [Low, High].

For example, in this embodiment, the initial value of Low is 1, and the initial value of High is 3, that is, the lower limit of the speed doubling value range is equal to or greater than 1.

S3.2: If mark = 1, let High = Mid - 0.01; if mark = 0 and S is not equal to Low, let Low = Mid + 0.01; if mark = 0 and S is equal to Low, Low is unchanged. Then let Mid = ROUND((Low + High)/2, 2).

For example, if mark = 1, that is, the second recognition result is inconsistent with the first recognition result, the upper limit of the current speed multiplication value range may be adjusted to the median of the current speed multiplication value range minus 0.01.

If mark = 0, that is, the second recognition result is consistent with the first recognition result, and S is not equal to Low, that is, the currently used speed multiplication value is not equal to the lower limit of the current speed multiplication value range, the lower limit of the current speed multiplication value range may be adjusted to the median of the current speed multiplication value range plus 0.01.

If mark = 0, that is, the second recognition result is consistent with the first recognition result, and S = Low, that is, the currently used speed multiplication value is equal to the lower limit of the current speed multiplication value range, the current speed multiplication value range may be left unadjusted.

For example, in the case that the value range of the adjusted multiple speed value is determined in the above manner, a median of the value range of the adjusted multiple speed value may be determined.

Here, the ROUND(x, 2) function rounds x to 2 decimal places; that is, when the median of the adjusted speed multiplication value range is calculated, the result is kept to 2 decimal places.
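The median calculation can be sketched in Python with the standard `decimal` module, which avoids binary floating-point surprises at a step size of 0.01. The helper name `round_mid` is hypothetical, and the text does not state how ties are rounded, so the rounding mode is left as a parameter with half-up as an assumed default.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN


def round_mid(low, high, rounding=ROUND_HALF_UP):
    """Median of [low, high], kept to 2 decimal places."""
    return ((low + high) / 2).quantize(Decimal("0.01"), rounding=rounding)


# Initial range of the example: Low = 1, High = 3.
first_mid = round_mid(Decimal("1"), Decimal("3"))

# A tie case: (2.01 + 3) / 2 = 2.505, where the rounding mode matters.
tie_half_up = round_mid(Decimal("2.01"), Decimal("3"))
tie_half_even = round_mid(Decimal("2.01"), Decimal("3"), ROUND_HALF_EVEN)
```

Because exact ties such as 2.505 occur regularly with a 0.01 step, the choice of rounding mode can shift an intermediate Mid by one step.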

S3.3: if Low is more than or equal to High, determining that the jump-out condition is reached, and outputting the Mid as a target speed doubling value; otherwise, Mid is assigned to S, and the process goes to step S1.

S4: After the jump-out condition is reached, the speed multiplication operation is performed on the audio data to be protected by using the output target speed multiplication value, thereby obtaining the output audio x₁.

It should be noted that, in the embodiment of the present application, if the last output Mid is greater than the initial value of High, it indicates that the original audio (i.e., the audio data to be protected) cannot generate an output audio that can resist the illegal intelligent speech recognition system within the available range of the speed doubling value.
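Steps S1 to S4 for the range [1, 3] can be put together in the following Python sketch. It is an illustration under stated assumptions rather than the patented implementation: `change_speed` and `recognize` are hypothetical callables standing in for the speed multiplication tool and the test speech recognition model, Mid is initialised to the starting value because the text leaves it undefined before S3.2, and half-up rounding is assumed for ROUND.

```python
from decimal import Decimal, ROUND_HALF_UP

STEP = Decimal("0.01")  # preset adjustment interval from the example


def midpoint(low, high):
    """ROUND((Low + High) / 2, 2); half-up rounding is an assumption."""
    return ((low + high) / 2).quantize(STEP, rounding=ROUND_HALF_UP)


def search_target_speed(audio, change_speed, recognize,
                        low=Decimal("1"), high=Decimal("3")):
    """Return the final Mid of S3.3, or None if it exceeds the initial High."""
    initial_high = high
    first_result = recognize(audio)       # first recognition result
    speed = low                           # S starts at 1, which equals Low
    mid = speed                           # assumed initial Mid
    while True:
        adjusted = change_speed(audio, speed)              # S1
        second_result = recognize(adjusted)                # S2
        mark = 1 if second_result != first_result else 0   # S3
        if mark == 1:
            high = mid - STEP             # S3.2: mismatch, lower High
        elif speed != low:
            low = mid + STEP              # S3.2: match, raise Low
        mid = midpoint(low, high)
        if low >= high:                   # S3.3: jump-out condition
            break
        speed = mid
    return None if mid > initial_high else mid


# Hypothetical stand-ins: audio is a (text, speed) pair, and the stub
# recognizer fails once the speed reaches 1.80.
def fake_change_speed(audio, speed):
    return (audio[0], speed)


def fake_recognize(audio):
    text, speed = audio
    return text if speed < Decimal("1.80") else "<garbled>"


target = search_target_speed(("open the door", Decimal("1")),
                             fake_change_speed, fake_recognize)
```

The sketch returns the final Mid exactly as step S3.3 describes and does not re-check that this value actually produces a mismatch; a caller could run one more recognition pass on the result if that guarantee is needed.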

The methods provided herein are described above. The following describes the apparatus provided in the present application:

Referring to fig. 4, which is a schematic structural diagram of an intelligent voice recognition security defense device based on audio dynamic adjustment according to an embodiment of the present application, the device may include:

an obtaining unit 410, configured to obtain audio data to be protected;

a speed doubling operation unit 420, configured to perform speed doubling operation on the audio data to be protected by using an initial speed doubling value selected from a preset speed doubling value range, to obtain audio data after speed doubling operation;

a determining unit 430, configured to determine, according to a comparison result between a first recognition result and a second recognition result and the initial multiple speed value, a target multiple speed value that makes the first recognition result inconsistent with the second recognition result from the preset multiple speed value range; the first recognition result is a recognition result of the intelligent voice recognition model on the audio data to be protected, and the second recognition result is a recognition result of the intelligent voice recognition model on the audio data after the double-speed operation;

an output unit 440 for outputting the audio data subjected to the speed doubling operation using the target speed doubling value.

In some embodiments, the determining unit 430 determines, according to the comparison result between the first recognition result and the second recognition result and the initial multiple speed value, a target multiple speed value that makes the first recognition result inconsistent with the second recognition result from the preset multiple speed value range, including:

under the condition that the first recognition result is consistent with the second recognition result, updating the initial speed multiplication value within the preset speed multiplication value range to obtain the target speed multiplication value;

determining the initial speed-doubling value as the target speed-doubling value if the first recognition result is inconsistent with the second recognition result.

performing iterative updating on the currently used speed doubling value within the preset speed doubling value range by using a specified updating strategy; wherein the specifying an update policy comprises: if the first recognition result is inconsistent with the current second recognition result, adjusting the currently used speed doubling value in the preset speed doubling value range; if the first recognition result is consistent with the current second recognition result, adjusting the currently used speed doubling value upwards within the preset speed doubling value range;

and under the condition that a preset jumping-out condition is reached, outputting the target speed doubling value.

In some embodiments, the determining unit 430 performs iterative update on the currently used multiple speed value within the preset multiple speed value range by using a specified update policy, including:

under the condition that the lower limit of the preset speed doubling value range is larger than or equal to 1, if the first identification result is inconsistent with the current second identification result, the upper limit of the current speed doubling value range is adjusted downwards according to a preset adjustment interval, the currently used speed doubling value is updated to the middle value of the adjusted speed doubling value range, the second identification result is updated according to the updated speed doubling value, and the speed doubling value range is continuously adjusted according to the comparison result of the first identification result and the updated second identification result;

if the first recognition result is consistent with the current second recognition result, and the currently used speed multiplication value is larger than the lower limit of the current speed multiplication value range and smaller than the upper limit of the current speed multiplication value range, adjusting the lower limit of the current speed multiplication value range upwards according to the preset adjustment interval, updating the currently used speed multiplication value to the middle value of the adjusted speed multiplication value range, updating the second recognition result according to the updated speed multiplication value, and continuously adjusting the speed multiplication value range according to the comparison result of the first recognition result and the updated second recognition result;

the determining unit 430 determines the target speed doubling value when a preset jump-out condition is reached, including:

under the condition that the upper limit of the preset speed multiplication value range is less than or equal to 1, if the first identification result is inconsistent with the current second identification result, adjusting the lower limit of the current speed multiplication value range upwards according to a preset adjustment interval, updating the currently used speed multiplication value to the middle value of the adjusted speed multiplication value range, updating the second identification result according to the updated speed multiplication value, and continuously adjusting the speed multiplication value range according to the comparison result of the first identification result and the updated second identification result;

if the first recognition result is consistent with the current second recognition result and the currently used speed multiplication value is equal to the upper limit of the current speed multiplication value range, updating the currently used speed multiplication value to the middle value of the current speed multiplication value range, updating the second recognition result according to the updated speed multiplication value, and continuously adjusting the speed multiplication value range according to the comparison result of the first recognition result and the updated second recognition result;

if the first recognition result is consistent with the current second recognition result, and the currently used speed multiplication value is larger than the lower limit of the current speed multiplication value range and smaller than the upper limit of the current speed multiplication value range, then downwards adjusting the upper limit of the current speed multiplication value range according to the preset adjustment interval, updating the currently used speed multiplication value into the middle value of the adjusted speed multiplication value range, updating the second recognition result according to the updated speed multiplication value, and continuously adjusting the speed multiplication value range according to the comparison result of the first recognition result and the updated second recognition result;

under the condition that the lower limit of the preset speed multiplication value range is less than 1 and the upper limit of the preset speed multiplication value range is greater than 1, dividing the preset speed multiplication value range into a first sub-value range and a second sub-value range, wherein the upper limit of the first sub-value range is 1 and the lower limit of the second sub-value range is 1;

under the condition that a preset jumping-out condition is achieved, respectively obtaining a first target speed doubling value in the first sub-value range and obtaining a second target speed doubling value in the second sub-value range;

determining the second target speed-multiplied value as the target speed-multiplied value in a case where the reciprocal of the first target speed-multiplied value is greater than the second target speed-multiplied value;

determining the first target speed-multiplied value as the target speed-multiplied value in a case where a reciprocal of the first target speed-multiplied value is smaller than the second target speed-multiplied value;

determining the first target speed-multiplied value or the second target speed-multiplied value as the target speed-multiplied value in a case where an inverse of the first target speed-multiplied value is equal to the second target speed-multiplied value.

Correspondingly, the application also provides a hardware structure of the device shown in fig. 4. Referring to fig. 5, the hardware structure may include: a processor and a machine-readable storage medium having stored thereon machine-executable instructions executable by the processor; the processor is configured to execute machine-executable instructions to implement the methods disclosed in the above examples of the present application.

Based on the same application concept as the method, embodiments of the present application further provide a machine-readable storage medium, where a plurality of machine-executable instructions are stored, and when the machine-executable instructions are executed by a processor, the method disclosed in the above example of the present application can be implemented.

The machine-readable storage medium may be, for example, any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims

1. An intelligent voice recognition security defense method based on audio dynamic adjustment is characterized by comprising the following steps:

acquiring audio data to be protected;

and outputting the audio data subjected to speed multiplication operation by using the target speed multiplication value.

2. The method according to claim 1, wherein the determining a target speed multiplication value that makes the first recognition result inconsistent with the second recognition result from the preset speed multiplication value range according to the comparison result between the first recognition result and the second recognition result and the initial speed multiplication value comprises:

and determining the initial speed value as the target speed value when the first recognition result is inconsistent with the second recognition result.

3. The method according to claim 1, wherein the determining a target speed multiplication value that makes the first recognition result inconsistent with the second recognition result from the preset speed multiplication value range according to the comparison result between the first recognition result and the second recognition result and the initial speed multiplication value comprises:

and under the condition that a preset jumping-out condition is reached, determining the target speed doubling value.

4. The method according to claim 3, wherein the iteratively updating the currently used speed doubling value within the preset speed doubling value range by using the specified updating policy includes:

if the first identification result is consistent with the current second identification result and the currently used double speed value is equal to the lower limit of the value range of the current double speed value, updating the currently used double speed value to be the median of the value range of the current double speed value, updating the second identification result according to the updated double speed value, and continuously adjusting the value range of the double speed value according to the comparison result of the first identification result and the updated second identification result;

determining the target speed doubling value under the condition that a preset jumping-out condition is reached comprises the following steps:

and under the condition that the lower limit of the value range of the adjusted double speed value is more than or equal to the upper limit of the value range of the adjusted double speed value, determining the median of the value range of the adjusted double speed value as the target double speed value.

5. The method according to claim 3, wherein the iteratively updating the currently used speed doubling value within the preset speed doubling value range by using the specified updating policy includes:

6. The method according to claim 3, wherein the iteratively updating the currently used speed doubling value within the preset speed doubling value range by using the specified updating policy includes:

determining the first target speed value or the second target speed value as the target speed value in a case where an inverse of the first target speed value is equal to the second target speed value.

7. An intelligent voice recognition security defense device based on audio dynamic adjustment, comprising:

the determining unit is used for determining a target speed doubling value which enables the first recognition result and the second recognition result to be inconsistent from the preset speed doubling value range according to a comparison result of the first recognition result and the second recognition result and the initial speed doubling value; the first recognition result is the recognition result of the intelligent voice recognition model on the audio data to be protected, and the second recognition result is the recognition result of the intelligent voice recognition model on the audio data after speed doubling operation;

8. The apparatus according to claim 7, wherein the determining unit determines a target speed-doubling value that makes the first recognition result inconsistent with the second recognition result from the preset speed-doubling value range according to the comparison result between the first recognition result and the second recognition result and the initial speed-doubling value, and comprises:

9. The apparatus according to claim 7, wherein the determining unit determines a target speed-doubling value that makes the first recognition result inconsistent with the second recognition result from the preset speed-doubling value range according to the comparison result between the first recognition result and the second recognition result and the initial speed-doubling value, and comprises:

10. The apparatus according to claim 9, wherein the determining unit performs iterative update on the currently used multiple speed value within the preset multiple speed value range by using a specified update policy, including:

the determining unit determines the target speed doubling value when a preset jumping-out condition is reached, and the determining unit comprises:

11. The apparatus according to claim 9, wherein the determining unit performs iterative update on the currently used multiple speed value within the preset multiple speed value range by using a specified update policy, including:

12. The apparatus according to claim 9, wherein the determining unit performs iterative update on the currently used multiple speed value within the preset multiple speed value range by using a specified update policy, including:

respectively carrying out iterative updating on the currently used double-speed value in the first sub-value range and carrying out iterative updating on the currently used double-speed value in the second sub-value range by using a specified updating strategy;