WO2017016363A1

WO2017016363A1 - Method for processing digital audio signal

Info

Publication number: WO2017016363A1
Application number: PCT/CN2016/087445
Authority: WO
Inventors: 李庆成; 鹿毅忠
Original assignee: 李庆成; 鹿毅忠
Priority date: 2015-07-27
Filing date: 2016-06-28
Publication date: 2017-02-02
Also published as: CN106409301A

Abstract

A method for processing a digital audio signal. Other content is embedded in a certain form into a digital audio signal so that digital information can be delivered covertly; the digital audio signal is made to carry predetermined data mainly by exploiting the masking effect of the human auditory system. With the method, data to be delivered can be embedded at suitable positions of a digital audio signal. When the digital audio signal is played, the audio signal embedded at the embedding positions to express the data is masked, so that it is not perceived by the human ear but can be received by a device with audio signal processing capability.

Description

Digital audio signal processing method

Technical field

The present invention relates to a digital audio signal processing technique, and more particularly to a method for digital audio signal processing based on psychoacoustics using a masking effect.

Background art

Using digital audio signals to carry information is a technology that has attracted wide attention and substantial research and development investment from the industry. With such a technique, a person listening to music or watching a television program in the normal way can use a device with audio signal processing capability, such as a mobile communication terminal, to obtain information carried in that music or television program. An important criterion for evaluating the maturity and suitability of this technology is that the carried data can be accurately captured and transmitted, and that playback of the digital audio signal itself produces no interference or noise that humans can perceive.

Chinese Patent Application No. 201410301832.7 discloses a technique of encoding and modulating digital information to be transmitted to form a sound encoded signal, and mixing the sound encoded signal with an audio signal in a preselected audio and video program for output. With this technology, the "digital information to be transmitted" can indeed be added to the normal sound by mixing. However, because the "digital information to be transmitted" is unpredictable, the coded signal formed from it by coded modulation will in a considerable number of cases be heard as noise, and in other cases as other sounds that interfere with the normally played sound. To avoid such problems, the specification of the above patent application proposes the following improvement:

"The digital information to be transmitted is encoded and modulated to form a sound encoded signal. The sound encoded signal can be written as a digital sound signal file, or can be converted into a sound analog signal by a digital-to-analog converter. The frequency of the sound analog signal can be selected to be above 18 kHz. In the frequency band below 20 kHz, the human ear is difficult to detect and does not affect the normal playback of the original TV sound or music signal. In the subsequent steps, the local receiving device of the user needs to receive and extract the digital information to be transmitted. Therefore, the voice coding information needs to have certain characteristics, that is, the signal energy distribution is only in a certain frequency range: 18 kHz or more and 20 kHz or less."

Obviously, in order to prevent the human ear from perceiving the sound code formed by the "digital information to be transmitted", the energy distribution of the portion of the sound coded information must be set within the frequency range of 18 kHz to 20 kHz.

It is widely known that the full range of sound audible to the human ear is 20 Hz to 20 kHz. Adults with good hearing can typically hear frequencies between 30 Hz and 16 kHz, while older people with poor hearing can typically hear frequencies between 50 Hz and 10 kHz. Children, however, can usually hear higher frequencies. Sound in the 18 kHz to 20 kHz range used in the above technical solution is audible to many children. Therefore, even if the energy of the sound coded information is confined to the 18 kHz to 20 kHz range, a considerable number of people, especially children, can still hear it, and these listeners will still suffer noise or interference when listening to TV and radio programs that have been sound-coded with this technology.

On the other hand, the energy of the sound coded information could instead be distributed outside the range audible to the human ear (20 Hz to 20 kHz). However, the frequency response of most audio devices is designed and manufactured around the range of human hearing, and audio signals outside 20 Hz to 20 kHz are generally filtered out as noise. Therefore, even if the sound coded information can be mixed into a normal audio signal, it cannot be played by the audio equipment and therefore cannot be obtained by the receiving equipment.

In summary, the existing technologies described above are not yet mature and therefore cannot be widely applied.

Summary of the invention

It is an object of the present invention to provide a method for digital audio signal processing that uses psychoacoustic principles to process a digital audio signal and embed specific target data, representing the information to be transmitted, into that signal. When the digital audio signal is played by an audio device, the embedded target data is played along with it and can be received and extracted by a device with audio signal processing capability without being perceived by the human ear.

The above object of the present invention is achieved by the following technical solution:

Framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectrum data respectively corresponding to the audio frame data;

Mapping the plurality of first spectrum data to the auditory critical bands (Bark domain) and calculating a masking threshold for each subband of the auditory critical bands; the masking thresholds correspond one-to-one with the subbands;

Selecting, among the plurality of first spectrum data, frequency points smaller than the foregoing masking threshold as embedding positions;

Quantizing the target data with a quantizer whose quantized result supports blind detection, and assigning the result of the quantization to the discrete Fourier coefficients at the embedding positions, thereby obtaining a plurality of second spectrum data respectively corresponding to the plurality of first spectrum data;

Performing an inverse discrete Fourier transform on the plurality of second spectrum data to obtain a second digital audio signal.

With the above method of the present invention, target data to be transmitted can be embedded at suitable positions of the first digital audio signal in accordance with psychoacoustic principles. When the resulting digital audio signal is played, the signal embedded at the embedding positions to express the target data is masked and is therefore not perceived by the human ear, but it can be detected and recovered by a device with audio signal processing capability.

Another object of the present invention is to provide a method for extracting data from a digital audio signal, by which a digital audio signal received while it is being played by an audio device can be processed, using psychoacoustic principles, to extract the target data embedded in it.

Framing the received first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectrum data respectively corresponding to the audio frame data;

Mapping the plurality of first spectrum data to the auditory critical bands, and calculating a masking threshold for each subband of the auditory critical bands; the masking thresholds correspond one-to-one with the subbands;

Selecting, among the plurality of first spectrum data, frequency points smaller than the corresponding masking threshold as embedding positions;

Performing inverse quantization on the discrete Fourier coefficients at the embedding positions, using a quantizer whose quantized result supports blind detection, to obtain a target data sequence embedded in the first digital audio signal; wherein the target data sequence consists of one or more specific audio data and/or encoded data serially arranged in a predetermined order, and each specific audio data corresponds to a particular loudness and/or a particular pitch and/or a particular timbre.

According to the above method of the present invention, when the first digital audio signal is received, the target data sequence that it carries by virtue of the masking effect is extracted from it, and the corresponding target data is then recovered. Throughout this process, although the embedded target data sequence is played by the audio device together with the digital audio signal, it is not perceived by the human ear.

Detailed description

In a first type of embodiment of the invention, some target data needs to be embedded into the target digital audio signal.

In order to embed the above-mentioned target data in a digital audio signal, it is necessary to frame the digital audio signal into a plurality of audio frame data, and on this basis, each audio frame data is windowed. Then, frequency domain discrete Fourier transform is performed on each audio frame data subjected to windowing, and a plurality of first spectrum data respectively corresponding to the respective audio frame data can be obtained.

After the plurality of first spectrum data are obtained, each is mapped to the auditory critical bands, and a masking threshold is calculated for each subband of the auditory critical bands; the number of masking thresholds corresponds to the number of subbands.

Among the plurality of first spectrum data, frequency points smaller than the masking threshold are selected as embedding positions for the target data. The target data is then quantized by a quantizer whose quantized result supports blind detection, and the result of the quantization is assigned to (replaces) the discrete Fourier coefficients at the embedding positions, so that second spectrum data corresponding to each of the foregoing first spectrum data are obtained.

A second digital audio signal can be obtained by performing an inverse discrete Fourier transform on the plurality of second spectrum data. The target data described above is embedded in the newly obtained second digital audio signal.

It should be noted that when the first digital audio signal is framed and windowed, the length of each audio frame and the size of the window may be determined by the relevant technician according to specific design requirements, and at least two schemes are available. One scheme is similar to that used in speech recognition, in which adjacent frames overlap; here the window length is typically 25 to 35 ms and the frame shift is 10 ms (although it may of course be greater or less than 10 ms). The other scheme uses non-overlapping frames, with the window length specified directly as a number of time-domain sampling points, generally a power of two, 2^N (N a positive integer); for example, 256 or 512 sampling points form one window of data.
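As a minimal illustration of the two framing schemes, and not as the implementation used by the invention, the following Python sketch frames a signal, applies a window, and takes the discrete Fourier transform of each frame. The sampling rate of 16 kHz, the Hann window, the placeholder noise signal, and the function names `frame_signal` and `frames_to_spectra` are all assumptions made for the example; the text above does not specify them.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into frames of frame_len samples, advancing by hop samples."""
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frames_to_spectra(frames):
    """Apply a Hann window (an assumed choice) to each frame and take the one-sided DFT."""
    window = np.hanning(frames.shape[1])
    return np.fft.rfft(frames * window, axis=1)

fs = 16000  # assumed sampling rate in Hz
x = np.random.default_rng(0).standard_normal(fs)  # one second of placeholder audio

# Scheme 1: overlapping frames, 25 ms window with a 10 ms frame shift.
overlap_frames = frame_signal(x, frame_len=fs * 25 // 1000, hop=fs * 10 // 1000)

# Scheme 2: non-overlapping frames of 2^N samples (here 512).
block_frames = frame_signal(x, frame_len=512, hop=512)

overlap_spectra = frames_to_spectra(overlap_frames)
block_spectra = frames_to_spectra(block_frames)
```

In this sketch the overlapping scheme yields many short frames that share samples, while the block scheme yields fewer frames whose length is convenient for a fast Fourier transform.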

In addition, the aforementioned "mapping" specifically refers to converting a linear frequency into a Bark domain frequency; for example, one available conversion formula is as follows:

z = 13 arctan(0.00076 f) + 3.5 arctan[(f/7500)²]

Where f is the linear frequency in Hz, and z is the corresponding position in the Bark domain (the critical band number).

For the correspondence between linear frequency in Hz and the Bark domain, refer to: Zwicker, E., "Subdivision of the Audible Frequency Range into Critical Bands", The Journal of the Acoustical Society of America, Vol. 33, No. 2; and Traunmüller, H. (1990), "Analytical expressions for the tonotopic sensory scale", in the same Journal, Vol. 88, starting at p. 97.
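The conversion formula above can be written directly in code. The sketch below is illustrative only; the function names are hypothetical, and obtaining an integer subband index by flooring the Bark value is an assumption, since the text does not state how frequency points are grouped into subbands.

```python
import numpy as np

def hz_to_bark(f):
    """Convert linear frequency in Hz to a Bark-domain value using the formula above."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_index(f):
    """Assumed grouping rule: the subband index is the floor of the Bark value."""
    return np.floor(hz_to_bark(f)).astype(int)
```

Applied to the centre frequency of every DFT bin, `bark_band_index` would assign each frequency point to one subband, which is the grouping needed before a per-subband masking threshold can be computed.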

It is well known that when a signal x passes through a quantizer Q, it is quantized to a quantization level y, i.e. y = Q(x); conversely, obtaining a signal x' from the quantization level y is inverse quantization, i.e. x' = Q^-1(y). Because of quantization error, the signal x and the signal x' may not coincide exactly.

An ordinary quantizer of the kind just described is not sufficient for the present invention. The quantizer used in the present invention has an adaptive step size and supports blind detection of the quantization result. Blind detection here refers to blind detection of steganographic information: the secret data sequence is quantized by such a quantizer and written into the carrier, and in the extraction (decoding) phase the written (embedded) data can be extracted from the quantized data by the same kind of quantizer without the participation of the original carrier data. Those skilled in the art may use any quantizer that achieves this effect.
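The text does not name a particular quantizer. One well-known family with the blind detection property is quantization index modulation (QIM), and the following sketch uses it purely as an assumed example: each bit selects one of two interleaved quantization lattices, and the detector decides which lattice a received value is closer to, without needing the original value. The function names are hypothetical.

```python
def qim_embed(x, bit, step):
    """Quantize x onto the lattice selected by bit (0 or 1).

    Lattice 0 is {k * step}; lattice 1 is {k * step + step / 2}.
    """
    offset = bit * step / 2.0
    return step * round((x - offset) / step) + offset

def qim_detect(y, step):
    """Recover the embedded bit from y alone by finding the nearer lattice."""
    d0 = abs(y - qim_embed(y, 0, step))
    d1 = abs(y - qim_embed(y, 1, step))
    return 0 if d0 <= d1 else 1
```

With this scheme the change made to a coefficient never exceeds half a step, and the bit survives any later perturbation smaller than a quarter of a step, which is why the step size governs the trade-off between audibility and robustness.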

According to the above implementation of the present invention, performing the above operations on each audio frame of the first digital audio signal embeds the data to be transmitted into a first digital audio signal of a given duration.

Beyond the first type of embodiment, the specific improvements or additions of the present invention described below may be combined with one another in any way on the basis of the first type of embodiment, forming different specific technical solutions according to different design needs.

In the above embodiment of the present invention, a preferred way of quantizing the target data with a quantizer whose quantized result supports blind detection, and of assigning the result to (replacing) the discrete Fourier coefficients at the embedding position, is as follows:

Calculating an embedding intensity coefficient at the embedding position according to an energy value or a power spectrum parameter of the audio frame data at that position, where the embedding intensity coefficient determines the amount of target data that can be embedded in the corresponding audio frame data;

According to the embedding intensity coefficient calculated in the above step, quantizing the target data with a quantizer whose quantized result supports blind detection, and assigning the result of the quantization to (replacing) the discrete Fourier coefficients at the embedding position.

The advantage of this preferred solution is that the amount of embedded data is adjusted automatically according to the audio frame data at each embedding position. For example, in an audio signal with more audio content and higher energy, the amount of embedded data can be increased as far as the masking effect allows; in an audio signal with less audio content and lower energy (for example, during quiet passages), the amount of embedded data can be reduced correspondingly so that the masking effect is preserved.

Calculating the embedding intensity coefficient from the energy value or power spectrum of the audio frame data essentially amounts to calculating the quantization step size. In the present invention, so that auditory masking better preserves the imperceptibility of the stego audio (the audio carrying the secret data), a non-uniform quantization step size can be adopted: the quantization step size adapts to the masking threshold of each frame, ensuring that the steganographic information cannot be heard. In a specific embodiment, the quantization step size representing the embedding strength can be calculated using the following formula:

Δ' = Δ + lb × LT_min / 50

Where Δ' is the quantization step size representing the embedding strength, Δ is the base quantization step size, and LT_min is the masking threshold of the audio frame in which the secret information is to be embedded. Clearly, the larger the masking threshold, the larger the quantization step size that can be used. lb is the scaling factor for the quantization step increment; it lies between 0 and 1 and usually takes the value 1.
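The step size formula can be evaluated as in the following sketch. The function name and the example numbers are hypothetical, and the unit of LT_min is not stated in the text, so the values used here are placeholders rather than realistic thresholds.

```python
def adaptive_step(base_step, lt_min, lb=1.0):
    """Adaptive quantization step: base_step + lb * lt_min / 50.

    base_step: base quantization step size
    lt_min:    masking threshold of the frame (unit not specified in the text)
    lb:        scaling factor between 0 and 1, usually 1
    """
    if not 0.0 <= lb <= 1.0:
        raise ValueError("lb is expected to lie between 0 and 1")
    return base_step + lb * lt_min / 50.0

# Placeholder values: a frame with a higher masking threshold gets a larger step.
quiet_step = adaptive_step(base_step=0.5, lt_min=10.0)
loud_step = adaptive_step(base_step=0.5, lt_min=40.0)
```

A larger step tolerates more distortion between embedding and extraction, so frames that mask more strongly can carry data more robustly.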

Although the embedding position of the target data is located at a frequency point determined by the masking threshold, the masking thresholds of the individual subbands of the critical bands usually differ. To ensure that the embedded target data is fully masked and cannot be heard, it is preferred, in the first type of embodiment of the present invention, to select in each subband the frequency point corresponding to the smallest masking threshold as the embedding position, and to embed the target data at that position.

It is well known that, for humans, the full audio frequency range is 20 Hz to 20 kHz; in practice, not everyone can hear every sound in this range. For this reason, and in order to reduce the amount of transmitted data or improve equipment or system performance, the industry often designs and manufactures audio playback devices and systems so that they attenuate or even filter out high-frequency audio signals and enhance low-frequency signals. Therefore, if the target data is embedded in the high frequency band under the technical solution of the first type of embodiment, then when the corresponding audio signal is played by such systems or devices, the target data embedded in the high frequency band may be difficult to extract and recover, and sometimes may not be received at all. To solve this problem and ensure the robustness of the technical solution of the present invention, frequency points located in the middle and low frequency bands are preferably used, on the basis of the above embodiments, as the embedding positions of the target data.

Specifically, in the present invention the low frequency band is 30 to 150 Hz, the medium-low frequency band is 30 to 500 Hz, and the medium-high frequency band is 500 to 5000 Hz; in general, the frequency range most suitable for embedding target data in the present invention is 30 to 4000 Hz. Of course, those skilled in the art can also select other frequency bands as the range in which the target data is embedded, according to specific design requirements.
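Restricting candidate embedding positions to the preferred 30 to 4000 Hz range amounts to selecting the DFT bins whose centre frequencies fall inside it. The sketch below illustrates this under assumed values (a 512-point frame at a 16 kHz sampling rate); the function name is hypothetical.

```python
import numpy as np

def bins_in_band(n_fft, fs, f_low=30.0, f_high=4000.0):
    """Return indices of one-sided DFT bins whose frequency lies in [f_low, f_high]."""
    # Bin k of an n_fft-point DFT corresponds to k * fs / n_fft Hz.
    freqs = np.arange(n_fft // 2 + 1) * (fs / n_fft)
    return np.nonzero((freqs >= f_low) & (freqs <= f_high))[0]

candidate_bins = bins_in_band(n_fft=512, fs=16000)
```

The masking-threshold test would then be applied only to these candidate bins rather than to the full spectrum.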

The basic objects of the present invention can be achieved by the schemes described above. In some cases, however, the following measure is also needed to optimize the scheme further. The essence of the technical solution of the present invention is to embed specific target data in the original digital audio signal, and the embedded target data can be regarded as a noise signal in the new digital audio signal obtained after embedding. It is well known that when a noise signal is strong enough, it degrades the quality of the new digital audio signal and also affects the transmission and extraction of the target data. It is therefore necessary to evaluate the quality of the new digital audio signal obtained after embedding the target data, and only then decide whether to use or output it.

Therefore, when the second digital audio signal is obtained by any of the foregoing embodiments of the present invention, its signal-to-noise ratio may further be calculated, and the quality of the second digital audio signal after embedding the target data evaluated from the result. If the calculated signal-to-noise ratio is less than a preset threshold (which can be set by the relevant technician according to specific design requirements, for example 17 dB, 20 dB, or 23 dB), the quality of the second digital audio signal does not meet the predetermined signal-to-noise ratio requirement. In this case, the embedding positions of the target data, the Fourier coefficients, and so on are re-determined according to the foregoing solution, and the steps of the foregoing embodiments are re-executed until the signal-to-noise ratio of the resulting second digital audio signal meets the predetermined requirement; the second digital audio signal that meets the requirement is then output.
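The quality check described above can be sketched as follows. The text does not give a formula for the signal-to-noise ratio; this example assumes the common definition in which the original (first) signal is the signal and the difference introduced by embedding is the noise. The function names and the default threshold of 20 dB (one of the example values mentioned) are illustrative.

```python
import numpy as np

def snr_db(original, embedded):
    """SNR in dB, treating (embedded - original) as the noise (assumed definition)."""
    original = np.asarray(original, dtype=float)
    embedded = np.asarray(embedded, dtype=float)
    noise_power = np.sum((embedded - original) ** 2)
    if noise_power == 0.0:
        return float("inf")  # nothing was changed
    return float(10.0 * np.log10(np.sum(original ** 2) / noise_power))

def meets_snr(original, embedded, threshold_db=20.0):
    """True if the embedded signal satisfies the preset SNR threshold."""
    return snr_db(original, embedded) >= threshold_db
```

In a complete embedder, a failed check would trigger re-selection of the embedding positions or step size and a repeat of the embedding steps, as the paragraph above describes.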

In all of the above embodiments of the present invention, the embedded target data is in fact a target data sequence formed by serially arranging one or more specific audio data and/or encoded data in a predetermined order. Specifically, each specific audio data corresponds to a specific loudness and/or a specific pitch and/or a specific timbre, and the encoded data is a number expressed in a numeral system used by computers. A given target data sequence may consist solely of one or more specific audio data serially arranged in a predetermined order; or solely of one or more specific encoded data serially arranged in a predetermined order; or of one or more specific audio data and one or more specific encoded data interleaved according to a predetermined rule and serially arranged in a predetermined order.

The advantage of a target data sequence consisting solely of serially arranged encoded data is that the target data can be embedded, received, and extracted at high speed, which suits applications that need to transmit data frequently and quickly, for example live interaction scenarios.

In cases that are insensitive to the timeliness and speed of data transmission but require a large amount of data to be transmitted, it is more appropriate for a target data sequence to consist solely of serially arranged specific audio data.

In a particular embodiment of the invention, a preferred solution is that each specific audio data corresponds to a particular loudness and/or a particular pitch and/or a particular timbre. Loudness, also known as volume, is the strength of a sound as perceived by the human ear; it is a subjective sense of how loud the sound is, and its objective measure is the amplitude of the sound. Pitch is how high or low a sound is; it is determined by the vibration frequency, and the higher the frequency, the higher the pitch. Timbre, also called tone quality, is the characteristic quality of a sound as perceived by the listener; it is determined mainly by the spectrum of the sound, that is, by the composition of the fundamental and its harmonics.

In the various embodiments described above, a target data sequence may consist of a specified number of specific audio data. Since each specific audio data can be defined by its loudness, pitch, and timbre as described above, all the target data sequences composed of a predetermined number of specific audio data in the foregoing technical solutions together correspond to one information codebook, which can be made large enough to cover the data to be transmitted.

For example, different pitches have different frequency values; suppose n different frequency values are selected, so that the n pitches can be denoted A, B, C, D, E, F, G, H, I, J, and so on. Different loudnesses have different sound intensity values; suppose m different sound intensity values are selected, so that the m loudnesses can be denoted a, b, c, d, e, f, g, h, and so on. Different timbres have different sound spectra; suppose k different sound spectra are selected, so that the k timbres can be denoted 1, 2, 3, ..., k. On this basis, any one audio data can be described by a combination of X, Y and Z,

where X is the pitch, of which there are n; Y is the loudness, of which there are m; and Z is the timbre, of which there are k.

Therefore, the information codebook capacity W of any one of the audio data in the present invention can be calculated by the following formula:

W=n×m×k

Suppose that, in a target data sequence of the present invention, a unit audio data group consists simply of five audio data; the information codebook capacity of any such unit audio data group is then:

W = (n×m×k)⁵

When n = 10, m = 8, k = 8, the value of W is:

W = 640⁵ = 2³⁰ × 10⁵ > 10¹⁴

Of course, n, m and k are all natural numbers, which the relevant skilled person can select according to the required codebook capacity when implementing the present invention.
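The codebook capacity arithmetic can be checked with a few lines of Python. The function name is hypothetical; the formula and the example values n = 10, m = 8, k = 8 with groups of five audio data are those given above.

```python
def codebook_capacity(n, m, k, group_size=1):
    """Number of distinct codes for a group of group_size audio data.

    Each audio data combines one of n pitches, m loudnesses, and k timbres.
    """
    return (n * m * k) ** group_size

single = codebook_capacity(10, 8, 8)               # one audio data
group_of_five = codebook_capacity(10, 8, 8, 5)     # unit group of five audio data
```

Because the capacity grows exponentially with the group size, even modest values of n, m and k give a very large codebook for short groups.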

As described above, in the above embodiments of the present invention, a target data sequence can be constructed from a single form of target data, for example solely from audio data or solely from encoded data. In some cases, however, it may be necessary to construct a target data sequence from a mixture of audio data and encoded data. So that the receiver can extract the data information from the digital audio signal by the correct means, a predetermined identification data sequence needs to be inserted at a predetermined position of the target data sequence; after parsing and recognizing the identification data sequence, the receiving device can then apply the recognition scheme it indicates to extract the corresponding data, for example using a pattern recognition scheme to identify the audio data in the target data sequence.

Of course, even if a target data sequence is a mixture of audio data and encoded data, as long as it is used in a completely closed information system, any target data sequence can be constructed in an agreed manner without inserting any identification data sequence into it; in an open information system, by contrast, the identification data sequence is almost indispensable. Therefore, whether to use an identification data sequence should be decided by the relevant technical personnel according to specific needs when designing the system.

In the various embodiments of the invention described above, if an identification data sequence is employed, it is preferably constructed from encoded data. However, the relevant technician may also choose, according to specific design requirements, to form the identification data sequence from audio data or from a combination of audio data and encoded data.

In summary, an important advantage of the present invention is that, because the above target data sequence is inserted at positions below the masking threshold of the digital audio signal, the masking effect ensures that the inserted sequence is not perceived by the human ear when the digital audio signal containing the target data sequence is played.

In addition, because the present invention uses audio signals of several dimensions (loudness, pitch, and timbre) to form an audio data sequence, the information codebook has a large capacity, and a limited amount of audio data can deliver a sufficient amount of information.

In order to receive and acquire a target data sequence embedded in a digital audio signal using the foregoing various aspects of the present invention, the present invention also provides the following technical solutions:

When a digital audio signal in which an audio signal sequence has been embedded is received by a device (for example, a mobile phone or a smart device with a microphone and audio processing capability), the received digital audio signal is framed into a plurality of audio frame data and windowed; a discrete Fourier transform is performed on the plurality of audio frame data to obtain a plurality of spectrum data respectively corresponding to the audio frame data;

Mapping the spectral data to an auditory critical band (Bark domain), and calculating a masking threshold of each subband in the auditory critical band; the number of the masking thresholds is in one-to-one correspondence with the number of the aforementioned subbands;

Selecting, from the plurality of spectrum data, frequency points smaller than the masking threshold as embedding positions; and, using a quantizer whose quantized result supports blind detection, performing inverse quantization on the discrete Fourier coefficients at the embedding positions to obtain the one-dimensional data sequence embedded in the digital audio signal. As described in the foregoing digital audio signal processing embodiments of the present invention, the target data sequence consists of one or more specific audio data and/or encoded data serially arranged in a predetermined order, and each specific audio data corresponds to a particular loudness and/or a particular pitch and/or a particular timbre.
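To illustrate what blind extraction from a frame might look like, the following sketch embeds bits into one frame and then recovers them from the modified frame alone. It is a simplified illustration under several assumptions that the text does not specify: the quantizer is quantization index modulation applied to DFT magnitudes with the phase preserved, the frame is not windowed, the step size is fixed, and the embedding positions are passed in explicitly instead of being derived from masking thresholds. All function names are hypothetical.

```python
import numpy as np

def qim_embed(values, bits, step):
    """Move each value onto the lattice chosen by its bit (offset 0 or step / 2)."""
    offsets = bits * step / 2.0
    return step * np.round((values - offsets) / step) + offsets

def qim_detect(values, step):
    """Decide each bit from the received values alone (nearest lattice)."""
    d0 = np.abs(values - qim_embed(values, np.zeros_like(values), step))
    d1 = np.abs(values - qim_embed(values, np.ones_like(values), step))
    return (d1 < d0).astype(int)

def embed_frame(frame, positions, bits, step):
    """Quantize the DFT magnitudes at the given bins and return the new time-domain frame."""
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum[positions])
    phase = np.angle(spectrum[positions])
    new_magnitude = qim_embed(magnitude, np.asarray(bits, dtype=float), step)
    spectrum[positions] = new_magnitude * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=len(frame))

def extract_frame(frame, positions, step):
    """Recover the bits from a received frame without access to the original frame."""
    spectrum = np.fft.rfft(frame)
    return qim_detect(np.abs(spectrum[positions]), step)
```

For example, embedding 32 bits at bins 4 to 35 of a 512-sample frame with `embed_frame` and then calling `extract_frame` on the result with the same positions and step returns the same bits. In the method described above the receiver would instead recompute the embedding positions from the masking thresholds of the received signal.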

With the above embodiment for extracting data from a digital audio signal, the corresponding one-dimensional data sequence can be extracted from a digital audio signal in which a target data sequence is embedded. However, as mentioned before, when the one-dimensional data sequence consists of audio data or of a mixture of audio data and encoded data, or when the digital audio signal is transmitted in an open information system, it is necessary to find the predetermined identification data sequence in the extracted one-dimensional data sequence and, according to its indication, perform pattern recognition on the audio data at the corresponding positions of the extracted one-dimensional data sequence, finally obtaining the corresponding target data sequence.

In some cases, obtaining the target data sequence means obtaining the actual information, for example when the target data sequence consists only of encoded data. In other cases, for example when the target data sequence consists of audio data or of a mixture of audio data and encoded data, even after the target data sequence has been extracted by pattern recognition according to the indication of the identification data sequence, it may still be necessary to transform the target data sequence using a predetermined coding table in order to finally obtain the target data embedded in the digital audio signal.

Of course, in the present invention, the one-dimensional data sequence or the target data sequence may be obtained by a receiving device, such as a mobile phone or a smart device having a microphone and audio processing capability, and then sent to a server. The server then carries out operations such as searching for the predetermined identification data sequence, extracting the target data sequence by pattern recognition as indicated by the identification data sequence, and transforming the target data sequence using the predetermined coding table to finally obtain the target data embedded in the digital audio signal.

A specific application example is as follows: after the target data sequence embedded in the digital audio signal has been extracted using the above embodiments, if the target data sequence is composed solely of audio data, the specific audio data in the target data sequence, and combinations thereof, can be matched against the codes; that is, the data information corresponding to the audio data sequence can be looked up in a predetermined coding table.

The predetermined coding table generally includes at least the following one-to-one correspondence information: an audio data sequence and the specific information corresponding to it. For example, taking the above-mentioned audio data sequence composed of loudness, pitch, and timbre, an audio data sequence of a specified length may correspond to the letter "A", to the word "energy", to the phrase "spectral data", to an item such as "mobile phone", to a web page link address such as "www.baidu.com", and so on. Transmitting information in this way is somewhat similar to using a telegraph code; however, as described above, if the capacity of the information codebook is sufficiently large, the method of transmitting information of the present invention can dispense with the telegraph-code approach and transmit the data directly.
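
A coding-table lookup of this kind might be sketched as below. The symbol names used as keys are hypothetical stand-ins for recognized loudness, pitch and timbre combinations; only the decoded values are taken from the examples in the text, and the fixed group length is an assumption.

```python
# Hypothetical coding table: each key is a fixed-length tuple of recognized
# audio symbols, each value is the information that the sequence stands for.
CODING_TABLE = {
    ("low", "low"): "A",
    ("low", "high"): "energy",
    ("high", "low"): "spectral data",
    ("high", "high"): "www.baidu.com",
}

def decode_with_table(symbols, table, group_length):
    """Split symbols into groups of group_length and look each group up.

    Groups that are not in the table decode to None.
    """
    if len(symbols) % group_length != 0:
        raise ValueError("symbol count is not a multiple of the group length")
    decoded = []
    for i in range(0, len(symbols), group_length):
        key = tuple(symbols[i:i + group_length])
        decoded.append(table.get(key))
    return decoded
```

With this table, the symbol sequence low, high, high, high decodes to "energy" followed by "www.baidu.com".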

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

A method of digital audio signal processing, comprising:

Framing a first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectral data respectively corresponding to the plurality of audio frame data;

Mapping the plurality of first spectral data to an auditory critical band, and calculating a masking threshold of each subband in the auditory critical band; the number of the masking thresholds corresponding to the number of the subbands;

Selecting a frequency point of the plurality of first spectrum data that is smaller than the masking threshold as an embedded position;

Quantizing the target data with a quantizer capable of performing blind detection on the quantized result, and assigning the result of the quantization process to the discrete Fourier coefficients of the embedded position, to obtain a plurality of second spectral data corresponding to the plurality of first spectral data;

Performing an inverse discrete Fourier transform on the plurality of second spectral data to obtain a second digital audio signal.
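
The steps of this method can be illustrated for a single frame with the following sketch. It rests on several simplifying assumptions that are not part of the claimed method: the masking threshold of each subband is approximated as a fixed margin below the strongest component of that subband (equal-width subbands stand in for auditory critical bands), the quantizer is a quantization index modulation (QIM) scheme on coefficient magnitudes, and the embedded positions are passed to the extractor instead of being recomputed from the marked signal. All function names and parameter values are hypothetical.

```python
import numpy as np

def qim_quantize(value, bit, step):
    """Snap value to the lattice for bit: multiples of step, shifted by step/2 for bit 1."""
    offset = bit * step / 2.0
    return np.round((value - offset) / step) * step + offset

def embed_frame(frame, bits, step=0.05, n_bands=8, margin_db=20.0):
    """Embed bits into one frame; returns (marked_frame, embedded_positions)."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))   # windowing + DFT
    mag, phase = np.abs(spectrum), np.angle(spectrum)

    # Crude stand-in for a psychoacoustic model: one threshold per subband,
    # set margin_db below the strongest component in that subband.
    candidate_bins = np.arange(1, len(mag) - 1)      # skip DC and Nyquist
    positions = []
    for band in np.array_split(candidate_bins, n_bands):
        threshold = mag[band].max() * 10.0 ** (-margin_db / 20.0)
        positions.extend(int(k) for k in band if mag[k] < threshold)
    if len(positions) < len(bits):
        raise ValueError("not enough masked frequency points in this frame")
    positions = positions[:len(bits)]

    # Assign the quantization result to the coefficients at the embedded positions.
    for k, bit in zip(positions, bits):
        mag[k] = qim_quantize(mag[k], bit, step)

    # Inverse transform of the modified spectrum gives the marked (windowed) frame.
    marked = np.fft.irfft(mag * np.exp(1j * phase), n=n)
    return marked, positions

def extract_frame(marked, positions, step=0.05):
    """Blindly detect the bits at the given positions of a marked frame."""
    mag = np.abs(np.fft.rfft(marked))
    bits = []
    for k in positions:
        d0 = abs(mag[k] - qim_quantize(mag[k], 0, step))
        d1 = abs(mag[k] - qim_quantize(mag[k], 1, step))
        bits.append(0 if d0 <= d1 else 1)
    return bits
```

In this sketch the quantization step is a fixed constant; a fuller treatment would tie it to the masking threshold so that the modified coefficients stay inaudible, and would handle frame overlap and the analysis window on the receiving side.
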
The method of claim 1 wherein said target data is obtained by the following steps:

Obtaining one or more specific audio data and/or encoded data, and serially arranging the one or more specific audio data and/or encoded data into a target data sequence in a predetermined order; or

Obtaining one or more specific audio data and/or encoded data, and serially arranging the one or more specific audio data and/or encoded data into a target data sequence in a predetermined order; and inserting a predetermined identification data sequence at a predetermined position of the target data sequence, the identification data sequence being formed of predetermined encoded data arranged according to a predetermined length and order;

Wherein the specific audio data corresponds to a specific loudness and/or a specific pitch and/or timbre.
The method according to claim 1 or 2, wherein quantizing the target data with the quantizer capable of performing blind detection on the quantized result, and assigning the result of the quantization process to the discrete Fourier coefficients of the embedded position, comprises:

Calculating a corresponding embedding strength from the masking threshold of the audio frame data in which the embedded position is located, so as to determine the amount of data embedded in the corresponding audio frame data;

According to the embedding strength, the target data is quantized by a quantizer that can perform blind detection on the quantization result, and the discrete Fourier coefficients of the embedding position are assigned by the result of the quantization process.
The method according to claim 1 or 2, further comprising:

Using the frequency point as an embedded position when the corresponding first spectral data at the frequency point is smaller than a minimum masking threshold, and/or when the frequency point is located in the low and middle frequency band of the audio, the low and middle frequency band being 30 Hz to 4 kHz; and/or,

Calculating a signal-to-noise ratio of the second digital audio signal, and outputting the second digital audio signal when the signal-to-noise ratio of the second digital audio signal is above a predetermined threshold.
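
The signal-to-noise check can be sketched as follows, assuming the noise is taken to be the difference between the second (marked) and first (original) digital audio signals and that the ratio is expressed in decibels. The function names and the 30 dB default are illustrative assumptions, not values given in the text.

```python
import numpy as np

def snr_db(original, marked):
    """Signal-to-noise ratio in dB, treating (marked - original) as the noise."""
    original = np.asarray(original, dtype=float)
    noise = np.asarray(marked, dtype=float) - original
    noise_power = np.sum(noise ** 2)
    if noise_power == 0.0:
        return float("inf")
    return 10.0 * np.log10(np.sum(original ** 2) / noise_power)

def passes_snr_check(original, marked, threshold_db=30.0):
    """Return True when the marked signal may be output (assumed threshold)."""
    return snr_db(original, marked) > threshold_db
```
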
A method of extracting data from a digital audio signal, comprising:

Framing a first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectral data respectively corresponding to the plurality of audio frame data;

Mapping the plurality of first spectral data to an auditory critical band, and calculating a masking threshold of each subband in the auditory critical band; the number of the masking thresholds corresponding to the number of the subbands;

Selecting a frequency point of the plurality of first spectrum data that is smaller than the masking threshold as an embedded position;

Performing inverse quantization processing on the discrete Fourier coefficients of the embedded position by using a quantizer capable of performing blind detection on the quantized result, to obtain a target data sequence embedded in the first digital audio signal; wherein the target data sequence consists of one or more specific audio data and/or encoded data serially arranged in a predetermined order, and the specific audio data corresponds to a specific loudness and/or a specific pitch and/or timbre.
The method of claim 5, further comprising:

Finding a predetermined identification data sequence in the target data sequence, and performing pattern recognition on the target data sequence according to the identification data sequence to obtain a corresponding target data sequence; or

Finding a predetermined identification data sequence in the target data sequence, performing pattern recognition on the target data sequence according to the identification data sequence to obtain a corresponding target data sequence, and transforming the target data sequence using a predetermined coding table to obtain the target data embedded in the first digital audio signal.