CN117116275A - Multi-mode fused audio watermarking method, device and storage medium - Google Patents

Multi-mode fused audio watermarking method, device and storage medium

Info

Publication number
CN117116275A
CN117116275A (application number CN202311372299.9A)
Authority
CN
China
Prior art keywords
audio
features
watermark
audio data
feature
Prior art date
Legal status
Granted
Application number
CN202311372299.9A
Other languages
Chinese (zh)
Other versions
CN117116275B (en)
Inventor
吕少卿
俞鸣园
王克彦
曹亚曦
孙俊伟
费敏健
Current Assignee
Zhejiang Huachuang Video Signal Technology Co Ltd
Original Assignee
Zhejiang Huachuang Video Signal Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Huachuang Video Signal Technology Co Ltd
Priority to CN202311372299.9A
Publication of CN117116275A
Application granted
Publication of CN117116275B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/018: Audio watermarking, i.e. embedding inaudible data in the audio signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The application discloses a multi-modal fused audio watermarking method, a device, and a storage medium. The method comprises the following steps: acquiring audio data; acquiring audio features of the audio data and acquiring biological features corresponding to the audio data; fusing the audio features and the biological features to obtain an audio watermark; and embedding the audio watermark in the audio data. With this method, an audio watermark that is difficult to detect can be generated, thereby improving the security of the audio data.

Description

Multi-mode fused audio watermarking method, device and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, and a storage medium for adding a multi-modal fused audio watermark.
Background
With the rapid development of the internet, audio and video content has become increasingly easy to obtain. Protecting generated audio data and preventing it from being leaked or used without authorization is therefore a critical task in application scenarios such as video conferencing, live streaming, recorded broadcasting, and education. To effectively protect the copyright of audio content, an audio watermark is typically added to it.
However, current audio watermarks are generated in a relatively simple way and are easy to crack, so a watermark embedded in audio data can easily be erased.
Disclosure of Invention
The application mainly solves the technical problem of providing a multi-modal fused audio watermarking method, a device, and a storage medium that can generate an audio watermark which is difficult to detect, thereby improving the security of audio data.
To solve the above technical problem, the first technical solution adopted by the application is to provide a multi-modal fused audio watermarking method comprising the following steps: acquiring audio data; acquiring audio features of the audio data and acquiring biological features corresponding to the audio data; fusing the audio features and the biological features to obtain an audio watermark; and embedding the audio watermark in the audio data.
To solve the above technical problem, the second technical solution adopted by the application is to provide a computer device comprising a processor, a memory, and a communication circuit; the communication circuit and the memory are each coupled to the processor, the memory is used to store a computer program, and the processor is used to read and execute the computer program to implement the method described in the first technical solution.
To solve the above technical problem, the third technical solution adopted by the application is to provide a computer-readable storage medium storing a computer program that can be read and executed by a processor to implement the method described in the first technical solution.
The beneficial effects of the application are as follows. Compared with the prior art, the audio watermark is obtained by acquiring the audio data, acquiring the audio features and biological features of the audio data, and fusing the two; the resulting watermark is robust and difficult to detect or delete, so efficient security protection of the audio data is achieved. In addition, the application can fuse biological features and audio features in a variety of application scenarios, such as video conferencing, recorded broadcasting, live streaming, and educational video, so efficient protection of audio data can also be achieved across different scenarios.
Drawings
Fig. 1 is a schematic diagram of the system components of an embodiment of the audio watermarking system of the present application;
Fig. 2 is a first schematic flow chart of an embodiment of the audio watermarking method of the present application;
Fig. 3 is a second schematic flow chart of an embodiment of the audio watermarking method of the present application;
Fig. 4 is a third schematic flow chart of an embodiment of the audio watermarking method of the present application;
Fig. 5 is a schematic circuit diagram of an embodiment of a computer device of the present application;
Fig. 6 is a schematic circuit diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. It is evident that the described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
With the rapid development of the internet, audio and video content has become increasingly easy to obtain. Protecting generated audio data and preventing it from being leaked or used without authorization is therefore a critical task in application scenarios such as video conferencing, live streaming, recorded broadcasting, and education. If audio content played online is captured through illicit use of recording equipment, the copyright of the audio and video content cannot be well protected; in particular, audio and video conference scenarios that require confidentiality run a higher risk of being leaked through recordings of the conference. Therefore, to effectively protect the copyright of audio content, an audio watermark is typically added to it.
Through long-term research, the inventors found that audio watermarks are currently generated in a relatively simple way and are easy to crack, so a watermark in audio data can easily be erased. To improve on or solve the above technical problem, the application proposes at least the following embodiments.
As shown in fig. 1, the audio watermarking system 1 described in an embodiment of the present application may comprise an audio generating terminal 20, a server 10, and an audio output terminal 30. The audio generating terminal 20 may be a terminal that generates audio data by collecting sound at the terminal, such as an anchor terminal in live streaming, a speaker terminal in an online conference, or a teacher terminal in an online lesson. The audio output terminal 30 may be a terminal that plays the audio data generated at the audio generating terminal 20. To protect the audio data, the server 10 or the audio generating terminal 20 may add a watermark to the audio data. The server 10 then transmits the watermarked audio data to the audio output terminal 30, which plays the watermarked audio, thereby protecting the audio.
This embodiment of the audio watermarking system can watermark not only a complete piece of audio data but also audio data generated in real time. For example, after the presenter of an online conference speaks, the server or the presenter terminal may, upon receiving the audio data, process it to add the watermark and send the processed audio data to the terminals of the conference participants so that the watermarked audio is played. In this way, the security of audio data can be ensured in scenarios such as online conferences, live streaming, and recorded broadcasting.
As shown in fig. 2, an embodiment of the audio watermarking method of the present application may take the server 10 as the execution body, and the method may include: S100: acquiring audio data; S200: acquiring audio features of the audio data and acquiring biological features corresponding to the audio data; S300: fusing the audio features and the biological features to obtain an audio watermark; S400: embedding the audio watermark in the audio data.
By acquiring the audio data, acquiring the audio features and biological features of the audio data, and fusing them, the resulting audio watermark is a robust watermark that is difficult to detect and delete, so efficient security protection of the audio data is achieved. In addition, the application can fuse biological features and audio features in a variety of application scenarios, such as video conferencing, recorded broadcasting, live streaming, and educational video, so efficient protection of audio data can also be achieved across different scenarios.
Embodiments of the audio watermarking method of the present application are described in detail below.
S100: audio data is acquired.
The audio data may be obtained directly from a piece of audio, or it may be extracted from a piece of video. The audio data may be a file storing audio, such as a wav file, an mp3 file, or an avi file.
In this embodiment, the audio watermark added to the audio data is determined according to the audio characteristics and features of the audio data; that is, different audio data correspond to different audio watermarks. Therefore, to improve the quality of the generated watermark, the audio data may first be processed to reduce the chance that low-quality audio data leads to a low-quality watermark. For example, noise reduction and normalization may be applied to the audio data to provide suitable audio data for the subsequent steps.
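A minimal sketch of such preprocessing is shown below. It assumes the librosa library is available; the peak-normalization target and the simple spectral noise gate are illustrative stand-ins, not the specific denoising method of the application.

```python
import numpy as np
import librosa

def preprocess_audio(y: np.ndarray, noise_floor_db: float = -40.0) -> np.ndarray:
    """Peak-normalize the signal and apply a crude spectral noise gate.

    Illustrative only: any normalization/denoising method could be used here.
    """
    # Peak normalization to the range [-1, 1]
    peak = float(np.max(np.abs(y))) or 1.0
    y = y / peak

    # Simple spectral gate: zero STFT bins whose magnitude is far below the peak
    stft = librosa.stft(y, n_fft=2048, hop_length=512)
    mag = np.abs(stft)
    threshold = mag.max() * (10.0 ** (noise_floor_db / 20.0))
    stft[mag < threshold] = 0.0
    return librosa.istft(stft, hop_length=512, length=len(y))

# Usage: y, sr = librosa.load("speech.wav", sr=16000); y_clean = preprocess_audio(y)
```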
As described above, this embodiment can determine the audio watermark based on the audio characteristics and features of the audio data. The audio characteristics may refer to attributes or properties inherent to the audio itself, such as the tempo, pitch, timbre, volume changes, duration, and noise level of the audio to which the data corresponds. The audio features may refer to pieces of information or attributes extracted from the audio and used for tasks such as description, classification, and identification; features may be data converted from the audio data into a form that is more easily processed by a machine learning model or other algorithm. In this embodiment, the audio data may be processed to obtain both the audio characteristics and the audio features, and the watermark may be determined from both. Of course, in some embodiments the watermark may also be determined using only the audio characteristics or only the audio features. The features may be obtained by processing the audio data directly, or by making use of the audio characteristics; see the steps following S100.
S200: and acquiring the audio characteristics of the audio data and acquiring the biological characteristics corresponding to the audio data.
To increase the complexity of the generated watermark and make it hard to crack, this embodiment generates the watermark by combining audio information with biological information from the audio. After the audio data is acquired, it can therefore be processed to obtain the audio features and the corresponding biological features. The audio features and biological features may be information that is easy for a machine learning model or other algorithm to process, such as numbers, vectors, or matrices.
The biological features may be physiological or behavioral features of the human body, for example the voiceprint, fingerprint, or iris information of the speaker in the audio. When the source of the audio data is a piece of audio, the biological features may be features corresponding to that audio obtained from another module. When the source of the audio data is a piece of video, the biological features may be features of the speaker in the video obtained after another module processes the video.
Optionally, in this embodiment of the application, multiple biological features may be processed to generate the audio watermark, so that the generated watermark is highly complex and hard to crack, thereby protecting the watermark.
Depending on how they are extracted, audio features can be divided into features extracted directly from the audio data (such as the zero-crossing rate), features obtained by converting the signal into the frequency domain (such as the spectral centroid), features obtained from a specific model (such as melody), and features obtained by rescaling the quantized features according to human auditory perception (such as mel-frequency cepstral coefficients). In particular, the audio features may be energy features, time-domain features, frequency-domain features, music-theory features, or perceptual features of the audio data. The time-domain features may describe how the audio data changes over time, such as attack time, zero-crossing rate, and autocorrelation. The frequency-domain features may describe the audio data in terms of frequency, such as the spectral centroid, mel-frequency cepstral coefficients, spectral flatness, and spectral flux.
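As an illustration of the feature types listed above (not of the extraction networks described later), several of them can be computed with a standard audio library; this sketch assumes librosa:

```python
import numpy as np
import librosa

def basic_audio_features(y: np.ndarray, sr: int) -> dict:
    """Compute a few of the time-domain, frequency-domain and perceptual
    features named in the text (illustrative only)."""
    return {
        # Time-domain feature: how often the waveform crosses zero
        "zero_crossing_rate": float(np.mean(librosa.feature.zero_crossing_rate(y))),
        # Frequency-domain feature: center of mass of the spectrum
        "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        # Frequency-domain feature: how noise-like the spectrum is
        "spectral_flatness": float(np.mean(librosa.feature.spectral_flatness(y=y))),
        # Perceptually motivated feature: mel-frequency cepstral coefficients
        "mfcc": np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1),
    }

# Usage: y, sr = librosa.load("speech.wav", sr=16000); feats = basic_audio_features(y, sr)
```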
By acquiring the audio features and biological features of the audio data, both can be processed so that the generated watermark is related to them, which increases the complexity of the watermark. Because the audio features and biological features belong to different modalities, this embodiment processes features of multiple modalities, and the generated watermark is therefore related to multi-modal features, which improves its complexity and gives the generated watermark stronger individuality and security.
Referring to fig. 3, in order to improve the match between the audio data and the audio watermark so that the watermark is difficult to detect once added, the audio features may be obtained according to the audio characteristics of the audio data. The audio characteristics may differ from period to period; for example, a piece of audio may be louder in the first 10 seconds and quieter afterwards, or its spectral characteristics may be pronounced in one period while its time-series pattern is more pronounced in another. Different processing methods can then be chosen for different audio characteristics so that the extracted features are more accurate; see the following steps included in S200:
S210: The structures and network parameters of the convolutional neural network and the long short-term memory network are determined based on the audio characteristics of the audio data.
In this embodiment, time-domain features and frequency-domain features may be extracted from the audio data as the audio features used to generate the watermark; for example, the time-domain and frequency-domain features may be combined into the audio features. Since a single feature extraction method may not accurately extract every type of audio feature, a separate feature extraction network may be provided for each. In this embodiment, the frequency-domain features of the audio data may be extracted by a convolutional neural network (CNN), and the time-domain features by a long short-term memory (LSTM) network.
The convolutional neural network is used to extract the spectral characteristics of the audio data and thereby obtain its frequency-domain features. In the convolutional neural network, convolution kernels slide over the audio spectrum so that frequency-domain features of different regions can be extracted. These spectral features may then be integrated through fully connected layers or other types of layers to form global frequency-domain features.
The long short-term memory network is used to capture how the audio data changes in the time dimension. In the LSTM network, the audio data is treated as a time series. The data at each time step is fed into an LSTM cell, which produces a new state from the current input and the past state. This new state can be seen as a representation of the temporal characteristics of the audio data up to the current time step, i.e. the time-domain features.
In this embodiment, to extract the frequency-domain and time-domain features more accurately, the structures and network parameters of the CNN and the LSTM network may be determined based on the audio characteristics of the audio data. For example, if analysis shows that the spectral characteristics of the audio data are more pronounced, the number of convolution layers can be increased, or the kernel size and stride adjusted, so that the spectral characteristics, and hence the frequency-domain features, are extracted more accurately. Likewise, when the time-series pattern of the audio data is more pronounced, the number of hidden units of the LSTM network can be increased, or the weights of its input and output gates adjusted, so that the time-series pattern, and hence the time-domain features, are captured more accurately.
Alternatively, the structures and network parameters of the feature extraction networks may be determined automatically by a reinforcement learning agent using the audio characteristics and past experience. The past experience may be the structures and network parameters previously derived from audio characteristics that yielded accurate audio features. In reinforcement learning, the agent selects an action based on its current observations and policy and receives a reward from the environment. The reward is a result of the agent's action: a good result yields a positive reward, a poor result a negative one. The agent uses this reward information to continually update its policy so as to make better decisions in the future. By continually trying and learning, the agent can therefore dynamically decide how to adjust the structures and network parameters of the CNN and the LSTM network to achieve the best feature extraction effect.
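The application does not spell out the agent's design, so the following is only a minimal illustration of the idea: an epsilon-greedy agent that picks one of a few candidate CNN/LSTM configurations and updates its value estimates from a reward such as the watermark/audio matching score. The configuration values and the evaluation function are assumptions.

```python
import random

# Candidate feature-extractor configurations (illustrative values only)
CONFIGS = [
    {"conv_layers": 2, "kernel": 3, "lstm_hidden": 64},
    {"conv_layers": 3, "kernel": 5, "lstm_hidden": 128},
    {"conv_layers": 4, "kernel": 3, "lstm_hidden": 256},
]

class EpsilonGreedyAgent:
    """Tiny bandit-style agent: choose a configuration, observe a reward
    (e.g. the watermark/audio matching score), and update its estimate."""

    def __init__(self, n_actions: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.values = [0.0] * n_actions   # running reward estimate per configuration
        self.counts = [0] * n_actions

    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))                       # explore
        return max(range(len(self.values)), key=lambda a: self.values[a])   # exploit

    def update(self, action: int, reward: float) -> None:
        self.counts[action] += 1
        # Incremental mean of the rewards observed for this configuration
        self.values[action] += (reward - self.values[action]) / self.counts[action]

# Usage sketch:
# agent = EpsilonGreedyAgent(len(CONFIGS))
# a = agent.select(); cfg = CONFIGS[a]
# reward = evaluate_matching_degree(cfg)   # hypothetical evaluation function
# agent.update(a, reward)
```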
S220: The frequency-domain features are obtained based on the structure and network parameters of the convolutional neural network, and the time-domain features are obtained based on the structure and network parameters of the long short-term memory network.
After the audio data is acquired, the structures and network parameters of the feature extraction networks are determined based on the audio characteristics of the audio data. Once the structures and network parameters of the CNN and the LSTM network are determined, the frequency-domain features can be obtained with the CNN and the time-domain features with the LSTM network.
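A minimal PyTorch sketch of the two extractors is given below: a small CNN pools a spectrogram into a global frequency-domain feature vector, and an LSTM summarizes a frame sequence into a time-domain feature vector. The layer sizes and input shapes are illustrative assumptions, not the parameters chosen by the method.

```python
import torch
import torch.nn as nn

class FreqFeatureCNN(nn.Module):
    """Extracts a global frequency-domain feature from a spectrogram
    of shape (batch, 1, n_freq_bins, n_frames)."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),              # integrate local spectral regions
        )
        self.fc = nn.Linear(32 * 4 * 4, out_dim)       # global frequency-domain feature

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(spec).flatten(1))

class TimeFeatureLSTM(nn.Module):
    """Extracts a time-domain feature from a frame sequence
    of shape (batch, n_frames, frame_dim)."""
    def __init__(self, frame_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]                                  # state at the last time step

# Usage: spec = torch.randn(2, 1, 257, 100); frames = torch.randn(2, 100, 64)
# freq_feat = FreqFeatureCNN()(spec); time_feat = TimeFeatureLSTM()(frames)
```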
By analyzing the audio data to obtain its audio characteristics and then deriving the time-domain and frequency-domain features from those characteristics, rich and fine-grained feature information can be extracted, which improves the match between the audio features and the audio data and, in turn, the match between the audio watermark and the audio data.
It should be noted that in this embodiment of the application, other types of audio features may also be extracted, for example music-theory features and perceptual features, and integrated into the audio features to further improve the match between the audio features and the audio data.
Optionally, after the audio features have been extracted with the currently determined structures and network parameters of the feature extraction networks, the audio features and biological features may be fused to generate the audio watermark. In this embodiment, the watermarked audio data may be evaluated to determine how well the watermark matches the audio data, for example whether the watermark suits the audio data and whether the match reaches a threshold. If the match does not reach the threshold, feature extraction can be performed on the audio data again and the watermark regenerated from the re-extracted features. When feature extraction is repeated, the structures and network parameters of the feature extraction networks can be adjusted based on the match between the watermark and the audio data. In particular, a reinforcement learning agent may be used to make this adjustment; that is, the agent uses the degree of matching between the watermark and the audio data to decide how to adjust the structures and network parameters of the feature extraction networks.
Alternatively, the degree of matching between the audio watermark and the audio data may be judged according to several criteria. One is imperceptibility: the audible difference between the audio before and after watermarking should be as small as possible so that the listener cannot perceive the watermark. Another is robustness: the watermark should resist various attacks and noise, such as MP3 compression, accelerated playback, or downsampling. Another is extraction accuracy: when the watermark is extracted from the audio data, the watermark information should be recoverable accurately and completely. Another is computational efficiency: embedding and extraction should be efficient enough to meet real-time or near-real-time requirements. Another is capacity: the amount of information the watermark can carry. When the watermark scores well on imperceptibility, robustness, computational efficiency, and capacity, its degree of matching with the audio data is correspondingly high.
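The application does not specify concrete metrics, but as an illustration, imperceptibility could be approximated with a signal-to-noise ratio between the original and watermarked signals, and extraction accuracy with a bit error rate:

```python
import numpy as np

def snr_db(original: np.ndarray, watermarked: np.ndarray) -> float:
    """Higher SNR means the watermark changes the signal less (imperceptibility proxy)."""
    noise = watermarked - original
    return 10.0 * np.log10(np.sum(original ** 2) / (np.sum(noise ** 2) + 1e-12))

def bit_error_rate(embedded_bits: np.ndarray, extracted_bits: np.ndarray) -> float:
    """Lower BER means more accurate watermark extraction."""
    return float(np.mean(embedded_bits != extracted_bits))

# Example: bit_error_rate(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])) -> 0.25
```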
Acquiring the audio features and biological features of the audio data makes it convenient to derive the watermark from them, which increases the watermark's complexity and its match with the audio data and makes the watermark hard to detect. Converting the audio data, via the audio and biological features, into a form that is easy for a computer to process also improves the efficiency of watermark generation.
To make the generated watermark difficult to detect, after the audio features and biological features have been acquired they may be fused, which increases the complexity of the generated watermark; see the following steps after S200:
S300: The audio features and the biological features are fused to obtain the audio watermark.
Digital watermarking is an information hiding technique. An audio digital watermarking algorithm embeds a digital watermark into audio data through a watermark embedding algorithm without significantly affecting the original sound quality, or in such a way that the effect of the watermark cannot be perceived by the human ear. The digital watermark can be fully extracted from the host audio data by a watermark extraction algorithm; the watermark that is embedded and extracted in this way is the audio watermark.
After the audio features and biological features are obtained, they can be fused to obtain the audio watermark. In some implementations, the audio features and biological features may simply be concatenated to achieve feature fusion. In other embodiments, they may be deeply fused. The watermark obtained from fusing the audio and biological features may be the fused feature itself, or it may be obtained by further processing the fused feature.
Alternatively, as noted earlier, the audio features and biological features may be information that is easy for a machine learning model or other algorithm to process; for example, they may be represented as vectors. How the audio and biological features are fused is described in the following steps included in S300:
S310: The audio features and the biological features are fused into an output vector having a specific distribution to obtain the audio watermark.
To obtain an audio watermark that matches the audio data, the audio data may be represented in a form that is easier for a machine learning model or other algorithm to process, for example as specific values, vectors, or matrices.
In this embodiment, the output vector having a specific distribution is used to model the distribution of the audio data over the feature space; that is, the real audio data can be accurately reconstructed from the output vector having the specific distribution. Therefore, when this output vector is used to obtain the audio watermark, the match between the watermark and the audio data can be improved.
Alternatively, there are many ways to fuse the audio features and biological features into an output vector, such as feature concatenation, feature summation, or a generative adversarial network. In this embodiment, a generative adversarial network may be used to fuse the features and increase the complexity of the watermark; see the following step included in S310:
S311: The audio features and the biological features are fused into an output vector having a specific distribution based on a generative adversarial network.
A generative adversarial network (GAN) comprises a generator and a discriminator. The purpose of the GAN is to learn the distribution of the audio data, i.e. the distribution of the audio features and biological features within it. To learn this distribution, an input noise variable may first be defined, and the noise vector is then mapped by the generator. The data output by the generator is fed to the discriminator, which judges whether its input comes from the generative model or from the audio data.
Specifically, the generator may consist of multiple neural network layers that learn to produce output vectors with a specific distribution in which the audio features and biological features are fused, so as to model the joint distribution of the two and fuse them deeply. In this process, the generator may receive a random noise vector as input, which is transformed by a series of nonlinear transformations that combine the audio features and biological features into an output vector with the specific distribution.
To ensure that the output vector produced by the generator adequately models the audio data, it may be fed to the discriminator, which judges whether the generator's output with the specific distribution can accurately imitate the real audio data. When the discriminator can easily tell the generator's output apart from the audio data, the generator can be trained further so that it models the feature distribution of the audio data better.
By using the GAN to fuse the audio features with the biological features, the generator and discriminator compete and cooperate during training: the generator tries to produce increasingly realistic output vectors so that the discriminator cannot distinguish them from real audio data, while the discriminator tries to improve its ability to identify the generator's output. Through this continual game, both improve, so the output vector with the specific distribution in which the audio and biological features are fused is of high quality.
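A minimal PyTorch sketch of this fusion under stated assumptions: the generator maps noise concatenated with the audio and biological feature vectors to a fused output vector, and the discriminator tries to tell generated vectors from vectors drawn from real audio data. Dimensions and training details are illustrative, not the application's specific design.

```python
import torch
import torch.nn as nn

class FusionGenerator(nn.Module):
    """Maps (noise, audio features, biological features) to a fused output vector."""
    def __init__(self, noise_dim=32, audio_dim=128, bio_dim=128, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + audio_dim + bio_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, z, audio_feat, bio_feat):
        return self.net(torch.cat([z, audio_feat, bio_feat], dim=1))

class FusionDiscriminator(nn.Module):
    """Scores whether a vector comes from real audio data (close to 1) or the generator (close to 0)."""
    def __init__(self, in_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# One adversarial step (sketch): the generator tries to make D(fused) -> 1,
# while the discriminator tries to separate real feature vectors from fused ones.
g, d = FusionGenerator(), FusionDiscriminator()
z = torch.randn(4, 32)
audio_feat, bio_feat = torch.randn(4, 128), torch.randn(4, 128)
fused = g(z, audio_feat, bio_feat)   # output vector with the learned distribution
score = d(fused)                     # discriminator judgement in (0, 1)
```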
Optionally, since this embodiment fuses the audio and biological features deeply, and since the emphasis of the audio and biological features differs from period to period within a piece of audio, the features that contribute most to the final watermark may also differ. When the audio and biological features are fused, the importance of each feature can therefore be determined to improve the quality and robustness of the fused output vector; see the following step included in S310:
S312: The weights of the audio features and the biological features are determined using an attention mechanism.
The basic idea of the attention mechanism is to focus on the more important input features and feature regions so that the quality and robustness of the output vector with the specific distribution can be improved. Determining the weights of the audio and biological features with the attention mechanism may be applied while generating the output vector, or afterwards by processing the output vector with the weights to obtain a weighted output vector.
In this embodiment, the weights of the audio features and biological features may be determined using the attention mechanism. Each feature has a corresponding weight that characterizes how much attention it should receive; for example, features with higher weights should be given more attention during fusion.
The attention mechanism can effectively capture internal relationships or patterns that may exist between different features or modalities and yields weights among the features, which improves the quality of the fused features. And because the attention mechanism is well suited to multi-modal data, it also strengthens the server's ability to process multi-modal data.
Optionally, how the attention mechanism is used to determine the weights of the audio features and the biological features is described in the following steps included in S312:
S3121: The inner products between the audio features and the biological features are calculated.
Calculating the inner product between the audio features and the biological features may mean calculating the inner product between an audio feature and a particular biological feature, or between two biological features. As noted earlier, the features may be represented as vectors, and the inner product reflects the correlation between two feature vectors. For example, the inner product between the audio feature and each biological feature can be calculated to obtain their correlations, and the inner product between each biological feature and the remaining biological features can be calculated to obtain the correlations among the biological features.
In one implementation, a larger inner product between two feature vectors means a stronger correlation between the corresponding features, i.e. the two features may be similar in some way or have some relationship. When the features are fused, attention can therefore be paid to the relationships among them, so that fusion is carried out more accurately on that basis, interaction among the multi-modal features is strengthened, and the quality of the fused features is improved.
S3122: The inner products are converted into a probability distribution to obtain the weights of the audio features and the biological features.
After the inner products between the audio features and biological features are calculated, they can be converted into a probability distribution, which yields the weights of the features and shows more intuitively how important each feature is and how much it contributes to the final task. In one embodiment, the inner products may be converted into a probability distribution using the softmax function: the softmax function maps the inner product values onto a probability distribution so that each feature receives a weight between 0 and 1, and it guarantees that the weights sum to 1.
Determining the weights of the audio features and biological features reflects the relationships among the features and makes it convenient to weight each feature later, so the features can be fused in combination with their weights and the quality of the fused features is improved.
After the weights of the features are determined, they can be processed to obtain weighted features corresponding to the output vector with the specific distribution; see the following step after S312:
S313: Weighted versions of the audio features and the biological features are obtained using the weights.
In step S312, the weights of the various multi-modal features are determined; they characterize the relative importance of the features. After the weights are obtained, they can be combined with the corresponding feature vectors to obtain the weighted features. The weighted features reflect the interactions among the features, i.e. between the audio and biological features, and highlight the features that genuinely matter for the final task (generating an audio watermark that matches the audio data well), so the quality of the fused features produced afterwards is improved.
For example, suppose the attention mechanism yields a weight for each of 10 features, giving 10 weights. The weights are then combined with the feature vector, which may be a list of values characterizing the features of the audio data; with 10 features, the feature vector contains 10 values. If the weight of the first feature is 0.5 and its value is 10, the weighted feature value is 5. Doing this for every feature yields the weighted features.
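A minimal numeric sketch of S3121 through S313 (dimensions are assumed): the inner products between the audio feature and each candidate feature are turned into weights with softmax, and each feature vector is then scaled by its weight.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def attention_weighted_features(audio_feat: np.ndarray, bio_feats: np.ndarray):
    """audio_feat: (d,) vector; bio_feats: (n, d) matrix of biological features.

    Returns the attention weights over all features and the weighted features.
    """
    feats = np.vstack([audio_feat, bio_feats])   # (n + 1, d)
    # S3121: inner products between the audio feature and every feature
    scores = feats @ audio_feat                  # correlation with the audio feature
    # S3122: convert the inner products into a probability distribution (weights sum to 1)
    weights = softmax(scores)
    # S313: scale each feature vector by its weight
    weighted = feats * weights[:, None]
    return weights, weighted

# Example: one audio feature and two biological features of dimension 4
w, wf = attention_weighted_features(np.ones(4), np.array([[1.0, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]]))
```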
S314: The weighted features are fused with the output vector having the specific distribution to obtain the audio watermark.
In some embodiments, the output vector with the specific distribution and the weighted features may be concatenated through a fully connected layer and then transformed with a nonlinear activation function to increase the expressive power of the resulting fused features.
Fusing the weighted features with the output vector with the specific distribution and using the final fused features to generate the audio watermark makes the watermark more closely related to the important features, which improves the quality of the generated watermark.
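A small sketch of this final fusion step (dimensions are assumed): the weighted features and the output vector are concatenated, passed through a fully connected layer, and transformed with a nonlinear activation.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate the weighted features with the output vector having the
    specific distribution, then project through a fully connected layer."""
    def __init__(self, weighted_dim=128, vector_dim=128, fused_dim=128):
        super().__init__()
        self.fc = nn.Linear(weighted_dim + vector_dim, fused_dim)
        self.act = nn.Tanh()   # nonlinear activation to increase expressive power

    def forward(self, weighted_feat, output_vector):
        return self.act(self.fc(torch.cat([weighted_feat, output_vector], dim=1)))

# fused = FusionHead()(torch.randn(4, 128), torch.randn(4, 128))
```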
In one embodiment, after the audio features and biological features are fused, the resulting output vector with the specific distribution may itself be embedded in the audio data as the audio watermark. In another embodiment, the fused features of the audio and biological features may be further processed to generate the watermark that is embedded in the audio data, where the fused feature may be the generated output vector with the specific distribution. As shown in fig. 4, for how a device such as the server generates the watermark from the fused features, see the following step included in S300:
S330: The fused features are processed with a variational autoencoder to obtain the audio watermark.
As mentioned above, the audio features and biological features may be fused using the generative adversarial network and the attention mechanism to produce the fused feature, i.e. a random vector with a specific distribution. To generate an audio watermark corresponding to the fused feature, a variational autoencoder (VAE) may be used. A VAE comprises an encoder and a decoder. The encoder maps the fused feature to a latent space and outputs two parameters: a mean and a variance in the latent space. These two parameters define a Gaussian distribution that is regarded as the latent representation of the fused feature. A point is then sampled from this Gaussian distribution and treated as the latent representation of the fused feature. The decoder maps the latent representation back to the original data space and generates a new data sample, i.e. the audio watermark. The goal of the decoder is to make the generated watermark as close as possible to the original audio and biological features. That is, the VAE maps the fused feature to a latent space via the encoder, samples in that latent space, and maps the sample back to the data space to generate the watermark.
The variational autoencoder uses variational inference to minimise the reconstruction error and the difference between the latent distribution and the prior distribution. The reconstruction error is the difference between the original data and the decoder output: the smaller, the better. The difference between the latent distribution and the prior can be measured by the KL divergence (Kullback-Leibler divergence): the smaller the divergence, the closer the latent distribution is to the prior. Training a VAE therefore involves two competing goals: on the one hand, the VAE should generate a watermark as close as possible to the input data, which requires minimising the reconstruction error; on the other hand, the VAE should learn a latent distribution with a simple structure that approximates a standard normal distribution, which requires minimising the KL divergence. Through such training, the VAE learns an effective latent distribution of the input data while satisfying both goals.
When the watermark is generated, the audio features and biological features are fed into the trained VAE. The encoder maps them to the latent space and outputs the corresponding Gaussian distribution parameters. A point is then sampled from this distribution and mapped back to the data space by the decoder to obtain a new watermark.
Generating the watermark with a variational autoencoder means a unique watermark can be produced from arbitrary audio and biological features. Because the watermark is produced by a complex VAE model rather than a simple linear transformation or embedding, it is not only closely related to the input audio data but also well concealed. In addition, in some embodiments the parameters of the VAE may be adjusted to control specific characteristics of the generated watermark, such as its size or strength, so that the watermark can meet different application requirements.
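A minimal PyTorch sketch of such a VAE (dimensions and loss weighting are illustrative assumptions): the encoder maps the fused feature to a mean and log-variance, a latent point is sampled with the reparameterization trick, and the decoder maps it back to a watermark vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WatermarkVAE(nn.Module):
    def __init__(self, feat_dim=128, latent_dim=32, wm_dim=128):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 64)
        self.mu = nn.Linear(64, latent_dim)        # mean in the latent space
        self.logvar = nn.Linear(64, latent_dim)    # log-variance in the latent space
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, wm_dim))

    def forward(self, fused_feat):
        h = F.relu(self.enc(fused_feat))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample a latent point from N(mu, sigma^2)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        watermark = self.dec(z)
        return watermark, mu, logvar

def vae_loss(watermark, target, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = F.mse_loss(watermark, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# wm, mu, logvar = WatermarkVAE()(torch.randn(4, 128))
```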
After the audio watermark is obtained, how it can remain hidden once embedded in the audio data, without affecting the audio quality, is described in the following steps after S300:
s400: an audio watermark is embedded in the audio data.
After the audio features and biological features are fused to generate the watermark, the watermark can be embedded in the audio data so that the data carries the watermark, which improves the security of the audio data and protects it.
Embedding the audio watermark in the audio data may mean adding the watermark signal to the audio data signal. If the watermark is added improperly, the noise it introduces may be audible to the human ear and degrade the quality of the audio data, so the watermark must be embedded appropriately. Improper watermarking may mean an improper embedding position, or an improper watermark size or strength.
So that the quality of the watermarked audio is not degraded and the watermark is hard to perceive, a characteristic analysis may be performed on the audio data before embedding, to determine the watermark parameters from the audio characteristics and/or features. Optionally, the watermark parameters may include where the watermark is added to the audio data, e.g. a frequency interval and a specific position, as well as properties of the watermark, e.g. its size and shape. The frequency interval refers to the frequency range into which the watermark is added, e.g. 100 Hz to 200 Hz. The specific position determines at which point or points within that interval the watermark is embedded, e.g. at 150 Hz. The size refers to the strength or amplitude of the watermark: in louder audio a relatively stronger or larger watermark can be added without being easily detected, whereas in quieter audio a large watermark may cause audible interference. As for the shape, the watermark need not be a simple frequency-domain signal; it may have a specific shape or pattern to aid concealment.
The characteristic analysis of the audio data may build on the audio characteristics and/or audio features obtained in step S200 to analyse the audio further and improve the accuracy of the determined watermark parameters. For example, the analysis may consider the volume, pitch, and frequency-domain characteristics of the audio: a louder portion may be better suited to watermarking than a quieter one because it conceals the watermark better. The analysis may also consider how quickly the audio changes over time, since the rate of change affects the detectability of the watermark, or the tempo, timbre, and harmony of the audio.
For deep analysis of the audio data, a data module may therefore be provided. Its input may be the original audio data together with the audio features obtained in step S200: common audio characteristics such as volume and pitch can be extracted from the original audio data, while the audio features characterise finer and deeper information about the content and structure of the audio. To obtain where the watermark should be added and what its properties should be, the output of the data module may be the watermark parameters, so that the watermark is added to the audio data according to them.
From the above, after the characteristic analysis of the audio data, the key frequency intervals suitable for adding the watermark can be output, for example intervals in the audio data that are loud, stable, and not easily perceived by the human ear. Optionally, after the key frequency intervals are obtained, an optimal frequency interval can be derived from them for embedding the watermark; for example, a reinforcement learning model can dynamically select from the key intervals the one best suited to embedding. When the reinforcement learning model makes this selection, the requirements on the combined audio data and watermark can be taken into account: besides the characteristics of the audio data, the concealment of the watermark and the change in audio quality can also be considered. In this way the audio quality changes little after watermarking and the watermark is hard to detect and delete; that is, the audio quality is preserved and the goal of hiding the watermark is achieved.
In some embodiments, the watermark parameters may be determined not only from the characteristics of the audio data but also from specific requirements on the watermark, for example a requirement that the watermark reach a preset accuracy, which is then also taken into account when generating the watermark.
After the watermark parameters are determined, the watermarking step can be carried out; see the following steps included in S400:
S420: A characteristic analysis is performed on the audio data to determine the parameters for adding the audio watermark in the frequency distribution.
As mentioned above, before the watermark is added to the audio data, a characteristic analysis may be performed to determine the watermark parameters from the audio characteristics and/or audio features. The parameters may include the size, shape, frequency interval, and specific position of the audio watermark.
Optionally, how the parameters of the audio watermark are determined may be referred to the following steps comprised in S420:
S421: The frequency distribution of the audio data is obtained.
S422: based on the frequency distribution, a frequency interval is determined in which the audio watermark is added.
To facilitate adding the audio watermark, the audio data may be converted from the time domain to the frequency domain or time-frequency domain to obtain its frequency distribution, so that the watermark can be added to particular frequency bins. The time domain describes the relationship between the audio signal and time; the frequency domain or time-frequency domain describes how the audio signal is distributed over frequency. In this embodiment, the domain conversion may be performed with a Fourier transform or a short-time Fourier transform.
To make the added watermark hard to detect, the frequency distribution of the audio data can be analysed and a frequency interval suitable for adding the watermark found within it, so that the watermark is hidden in the frequency distribution of the audio data. The frequency interval for the watermark can thus be determined based on the frequency distribution.
Optionally, a reinforcement learning model may be utilized to determine the frequency bins in which the audio watermark is added.
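As a simple illustration of picking a candidate frequency interval from the frequency distribution (the application leaves the selection policy to a reinforcement learning model; here a plain energy heuristic and the librosa library stand in for it):

```python
import numpy as np
import librosa

def loudest_band(y: np.ndarray, sr: int, band_width_hz: float = 100.0):
    """Return the (low_hz, high_hz) band with the highest average energy,
    as a crude stand-in for the key-frequency-interval analysis."""
    stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    energy = stft.mean(axis=1)                       # average energy per frequency bin
    bins_per_band = max(1, int(band_width_hz / (freqs[1] - freqs[0])))
    band_energy = [energy[i:i + bins_per_band].mean()
                   for i in range(0, len(energy) - bins_per_band, bins_per_band)]
    best = int(np.argmax(band_energy)) * bins_per_band
    return float(freqs[best]), float(freqs[best + bins_per_band])

# Usage: low, high = loudest_band(y, sr)  # e.g. a loud, stable band to hide the watermark in
```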
After the characteristic analysis of the audio data, the watermark parameters are obtained, after which the watermark can be embedded in the audio data according to them; see the following steps after S420.
S430: the audio watermark is embedded into the audio data according to the parameters of the audio watermark.
As mentioned above, the watermark parameters may include a frequency interval, i.e. an interval suitable for watermarking, such as one where the change in sound quality after embedding is small and the watermark is hard to delete and detect. When the watermark is embedded, it can be added into that frequency interval of the frequency distribution to obtain the watermarked frequency distribution, so that the added watermark is secure and imperceptible and hard to detect or delete, thereby protecting the audio data.
Before the watermark is embedded in the audio data, the size, shape, and position of the generated watermark can be adjusted according to the characteristics of the audio data, i.e. according to the result of the characteristic analysis.
In addition, before the watermark is added to the frequency distribution, it needs some preprocessing to convert it into a frequency-domain signal so that it can be added conveniently; for example, a Hadamard transform may be used. Specifically, when the watermark is embedded in the audio data, the preprocessed watermark signal may be added to the determined frequency bins, or phase or amplitude encoding may be applied over those bins to embed the watermark in the audio.
In some embodiments, before the watermark is embedded, an optimisation algorithm may be designed to optimise the watermark parameters so that they achieve the best result, for example using gradient descent or simulated annealing, to minimise the loss in sound quality while keeping the watermark detectable and imperceptible.
S440: the watermarked frequency distribution is converted into audio data.
Converting the frequency distribution into audio data means converting the watermarked audio data from the frequency domain or time-frequency domain back into the time domain; the audio data obtained in this step already carries the watermark. After embedding, the audio data may be converted back to the time domain for ease of transmission: the watermarked frequency distribution is converted into the time domain, yielding audio data that carries the audio watermark. This conversion may be performed with an inverse Fourier transform or an inverse short-time Fourier transform.
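A minimal sketch of S430 and S440 under simplifying assumptions (librosa, additive magnitude embedding, a fixed strength factor): the watermark vector is added to the STFT magnitudes inside the chosen frequency interval, and the result is converted back to the time domain with the inverse short-time Fourier transform. A real system would encode the watermark more carefully (e.g. phase or amplitude coding) and optimise the strength.

```python
import numpy as np
import librosa

def embed_in_band(y, sr, watermark, low_hz, high_hz, strength=0.01):
    """Add a watermark vector into the magnitudes of one frequency band and
    convert the watermarked spectrum back to a time-domain signal."""
    stft = librosa.stft(y, n_fft=2048, hop_length=512)
    mag, phase = np.abs(stft), np.angle(stft)

    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    band = np.where((freqs >= low_hz) & (freqs <= high_hz))[0]

    # Tile the watermark over the selected bins and all frames, scaled by `strength`
    wm = np.resize(watermark, (len(band), mag.shape[1]))
    mag[band, :] += strength * np.max(mag) * wm

    # S440: back to the time domain with the inverse short-time Fourier transform
    watermarked_stft = mag * np.exp(1j * phase)
    return librosa.istft(watermarked_stft, hop_length=512, length=len(y))

# Usage: y_wm = embed_in_band(y, sr, watermark=np.random.rand(16), low_hz=100, high_hz=200)
```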
Adding the audio watermark to the audio data in this way not only adds the watermark without damaging the audio quality, but also gives the watermark high security, uniqueness, robustness, and imperceptibility.
As shown in fig. 5, the computer device 100 described in the computer device embodiment of the present application may include a processor 110, a memory 120, and a communication circuit 130.
The Memory 120 is used to store a computer program, and may be a ROM (Read-Only Memory), a RAM (random access Memory), random Access Memory, or other type of storage device. In particular, the memory may include one or more computer-readable storage media, which may be non-transitory. The memory may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory is used to store at least one piece of program code.
The processor 110 is used to control the operation of the computer device 100, and may also be referred to as a CPU (Central Processing Unit). The processor 110 may be an integrated circuit chip with signal processing capabilities. The processor 110 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor 110 may be any conventional processor or the like.
The processor 110 is configured to execute a computer program stored in the memory 120 to implement the audio watermarking method described in the audio watermarking method embodiment of the present application.
The computer device 100 may also include a communication circuit 130, which is the device or circuit through which the computer device 100 communicates with external devices, enabling the processor 110 to interact with those external devices via the communication circuit 130.
For detailed descriptions of functions and execution processes of each functional module or component in the embodiment of the computer device of the present application, reference may be made to the descriptions in the above embodiment of the audio watermarking method of the present application, which are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed computer device 100 and audio watermarking method may be implemented in other ways. For example, the embodiments of the computer device 100 described above are merely illustrative; the division into modules or units is only a logical functional division, and other divisions may be used in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Referring to fig. 6, the integrated unit described above, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in the computer-readable storage medium 200. Based on this understanding, the technical solution of the present application may, in essence or in whole or in part, be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions/computer programs for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk, as well as electronic terminals equipped with such storage media, such as computers, mobile phones, notebook computers, tablet computers and cameras.
For the execution of the program data stored in the computer-readable storage medium, reference may be made to the above embodiments of the audio watermarking method of the present application, which are not repeated herein.
The foregoing description is only illustrative of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present application.

Claims (14)

1. A method for adding a multimodal fusion audio watermark, comprising:
acquiring audio data;
acquiring audio features of the audio data and acquiring biometric features corresponding to the audio data;
fusing the audio features and the biometric features to obtain an audio watermark; and
embedding the audio watermark into the audio data.
2. The method of claim 1, wherein the fusing the audio features and the biometric features comprises:
fusing the audio features and the biometric features into an output vector having a specific distribution to obtain the audio watermark.
3. The method of claim 2, wherein the fusing the audio features and the biometric features into an output vector having a specific distribution comprises:
fusing the audio features and the biometric features into the output vector having the specific distribution based on a generative adversarial network.
4. The method of claim 2, wherein the fusing the audio features and the biometric features comprises:
determining weights of the audio features and the biometric features using an attention mechanism;
acquiring weighted features of the audio features and the biometric features using the weights; and
fusing the weighted features with the output vector having the specific distribution to obtain the audio watermark.
5. The method of claim 4, wherein the determining weights of the audio features and the biometric features using an attention mechanism comprises:
calculating an inner product between the audio features and the biometric features; and
converting the inner product into a probability distribution to obtain the weights of the audio features and the biometric features.
6. The method of claim 1, wherein the fusing the audio features and the biometric features to obtain an audio watermark comprises:
fusing the audio features and the biometric features into a fused feature; and
processing the fused feature with a variational autoencoder to obtain the audio watermark.
7. The method of claim 1, wherein the embedding the audio watermark into the audio data comprises:
performing characteristic analysis on the audio data to determine parameters for adding the audio watermark; and
embedding the audio watermark into the audio data according to the parameters of the audio watermark.
8. The method of claim 7, wherein the performing characteristic analysis on the audio data to determine parameters for adding the audio watermark comprises:
obtaining a frequency distribution of the audio data; and
determining, based on the frequency distribution, a frequency interval for adding the audio watermark;
and the embedding the audio watermark into the audio data according to the parameters of the audio watermark comprises:
adding the audio watermark to the frequency interval in the frequency distribution to obtain a watermarked frequency distribution; and
converting the watermarked frequency distribution into audio data.
9. The method of claim 8, wherein the determining, based on the frequency distribution, a frequency interval for adding the audio watermark comprises:
determining, using a reinforcement learning model and based on the frequency distribution, the frequency interval for adding the audio watermark.
10. The method of claim 8, wherein the parameters further include a size and a shape of the audio watermark, and the adding the audio watermark to the frequency interval in the frequency distribution comprises:
adding the audio watermark to the frequency interval in the frequency distribution according to the size and the shape.
11. The method of claim 7, wherein the performing characteristic analysis on the audio data comprises:
performing the characteristic analysis on the audio data using the audio features of the audio data.
12. The method of claim 1, wherein the audio features include time-domain features and frequency-domain features, and the acquiring the audio features of the audio data comprises:
determining structures and network parameters of a convolutional neural network and a long short-term memory network based on characteristics of the audio data; and
acquiring the time-domain features based on the structure and network parameters of the convolutional neural network, and acquiring the frequency-domain features based on the structure and network parameters of the long short-term memory network.
13. A computer device, comprising: a processor, a memory, and a communication circuit; the communication circuit and the memory are respectively coupled to the processor, the memory is used for storing a computer program, and the processor is used for reading and executing the computer program to realize the method according to any one of claims 1-12.
14. A computer-readable storage medium, characterized in that a computer program is stored thereon, the computer program being readable and executable by a processor to implement the method according to any one of claims 1-12.
CN202311372299.9A 2023-10-23 2023-10-23 Multi-mode fused audio watermarking method, device and storage medium Active CN117116275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311372299.9A CN117116275B (en) 2023-10-23 2023-10-23 Multi-mode fused audio watermarking method, device and storage medium

Publications (2)

Publication Number Publication Date
CN117116275A true CN117116275A (en) 2023-11-24
CN117116275B CN117116275B (en) 2024-02-20

Family

ID=88805935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311372299.9A Active CN117116275B (en) 2023-10-23 2023-10-23 Multi-mode fused audio watermarking method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117116275B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1910534A (en) * 2004-01-20 2007-02-07 皇家飞利浦电子股份有限公司 Method and apparatus for protection of content using biometric watermarks
CN101345054A (en) * 2008-08-25 2009-01-14 苏州大学 Digital watermark production and recognition method used for audio document
CN102306305A (en) * 2011-07-06 2012-01-04 北京航空航天大学 Method for authenticating safety identity based on organic characteristic watermark
CN103425920A (en) * 2013-08-26 2013-12-04 江苏物联网研究发展中心 Audio information-based database security access control method of digital watermark
CN104851428A (en) * 2015-04-15 2015-08-19 东莞中山大学研究院 Method for realizing watermark encryption and decryption based on audio signal
WO2018072061A1 (en) * 2016-10-17 2018-04-26 哈尔滨工业大学深圳研究生院 Method and device for encrypting electronic file
US20180146370A1 (en) * 2016-11-22 2018-05-24 Ashok Krishnaswamy Method and apparatus for secured authentication using voice biometrics and watermarking
CN108289254A (en) * 2018-01-30 2018-07-17 北京小米移动软件有限公司 Web conference information processing method and device
US20210050025A1 (en) * 2019-08-14 2021-02-18 Modulate, Inc. Generation and Detection of Watermark for Real-Time Voice Conversion
CN111091841A (en) * 2019-12-12 2020-05-01 天津大学 Identity authentication audio watermarking algorithm based on deep learning
CN111681662A (en) * 2020-06-01 2020-09-18 科大讯飞股份有限公司 In-vehicle interactive audio encryption method, device and equipment
CN112687282A (en) * 2020-12-02 2021-04-20 四川大学 Voice source tracking method based on fingerprint image perceptual hashing
CN114999502A (en) * 2022-05-19 2022-09-02 贵州财经大学 Adaptive word framing based voice content watermark generation and embedding method and voice content integrity authentication and tampering positioning method
CN116028899A (en) * 2022-06-13 2023-04-28 厦门大学嘉庚学院 Multiple high-precision deep learning model black box watermarking method
CN115499668A (en) * 2022-11-16 2022-12-20 中南大学 Audio and video zero watermark generation, registration and copyright identification method and related system
CN116778935A (en) * 2023-08-09 2023-09-19 北京百度网讯科技有限公司 Watermark generation, information processing and audio watermark generation model training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
江波: "Research on identity authentication audio watermarking algorithm based on deep learning", China Master's Theses Full-text Database, Information Science and Technology, pages 15-16 *

Also Published As

Publication number Publication date
CN117116275B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
Chintha et al. Recurrent convolutional structures for audio spoof and video deepfake detection
Reimao et al. For: A dataset for synthetic speech detection
WO2019210796A1 (en) Speech recognition method and apparatus, storage medium, and electronic device
Zhao et al. Audio splicing detection and localization using environmental signature
WO2020010338A1 (en) Hybrid audio synthesis using neural networks
CN113257259A (en) Secure audio watermarking based on neural networks
Wang et al. Detection of speech tampering using sparse representations and spectral manipulations based information hiding
San Roman et al. Proactive detection of voice cloning with localized watermarking
Zhao et al. Anti-forensics of environmental-signature-based audio splicing detection and its countermeasure via rich-features classification
Dixit et al. Review of audio deepfake detection techniques: Issues and prospects
Leonzio et al. Audio splicing detection and localization based on acquisition device traces
EP2030195B1 (en) Speech differentiation
CN117116275B (en) Multi-mode fused audio watermarking method, device and storage medium
Khan et al. Using visual speech information in masking methods for audio speaker separation
Doets et al. Distortion estimation in compressed music using only audio fingerprints
Liu et al. Detecting Voice Cloning Attacks via Timbre Watermarking
CN113241054B (en) Speech smoothing model generation method, speech smoothing method and device
Juvela et al. Collaborative Watermarking for Adversarial Speech Synthesis
Groß-Vogt et al. The augmented floor-assessing auditory augmentation
CN115273826A (en) Singing voice recognition model training method, singing voice recognition method and related device
Mahum et al. DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection
Roman et al. Proactive Detection of Voice Cloning with Localized Watermarking
Guo et al. Exploring a new method for food likability rating based on DT-CWT theory
CN117496394B (en) Fake video detection method and device based on multi-mode fusion of image and voice
CN112992186B (en) Audio processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant