CN110728991B - Improved recording equipment identification algorithm - Google Patents

Improved recording equipment identification algorithm

Info

Publication number
CN110728991B
CN110728991B (application CN201910841092.9A)
Authority
CN
China
Prior art keywords
model
layer
frame
identification algorithm
recording device
Prior art date
Legal status
Active
Application number
CN201910841092.9A
Other languages
Chinese (zh)
Other versions
CN110728991A (en)
Inventor
包永强
梁瑞宇
王青云
冯月芹
唐闺臣
朱悦
Current Assignee
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanjing Institute of Technology
Priority to CN201910841092.9A
Publication of CN110728991A
Application granted
Publication of CN110728991B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 - Feature extraction for speech recognition; selection of recognition unit
    • G10L15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress-induced speech
    • G10L25/12 - Speech or voice analysis techniques characterised by the type of extracted parameters, the parameters being prediction coefficients
    • G10L25/18 - Speech or voice analysis techniques characterised by the type of extracted parameters, the parameters being spectral information of each sub-band
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the parameters being the cepstrum
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45 - Speech or voice analysis techniques characterised by the type of analysis window

Abstract

The invention discloses an improved recording device recognition algorithm. A first model is constructed from a bidirectional gated recurrent neural network layer, a unidirectional gated recurrent neural network layer, and an attention layer; a second model is constructed from convolutional layers, a skip connection, and a global average pooling layer. The audio signal to be examined is framed and preprocessed; its multi-dimensional frame-level features serve as the input of the first model, and its Mel-spectrum features serve as the input of the second model. The output features of the two models are concatenated and fused, and a recognition result is obtained by classification. The recognition algorithm of the invention preserves the temporal characteristics of the audio signal and, through the attention mechanism, the skip-connection structure, and the hidden-unit concatenation method, finally obtains high-quality device-related feature parameters, improving both the recognition accuracy for recording devices and the robustness of the model.

Description

Improved recording equipment identification algorithm
Technical Field
The invention relates to the technical field of recording equipment, in particular to an improved recording equipment identification algorithm.
Background
Sound is the most natural means of human communication. With the increasing maturity of audio technology, audio has spread widely through many aspects of social life. Recording equipment from different manufacturers generally uses different digital signal processing methods and circuits, and these differences leave device-specific traces in the recorded audio signal. Therefore, a recording device can be identified, to some extent, by analyzing the audio it produced. In judicial cases, parties often claim that a piece of audio evidence was recorded with a particular device, so determining which device actually produced a recording is an urgent problem for forensic authorities.
With the development of machine learning and deep learning techniques, researchers have proposed a variety of effective recognition models. In 2007, Christian Kraetzer et al. identified microphone devices using mixed time-domain and frequency-domain features; experiments with a naive Bayes classifier and similar methods achieved a recognition rate of 75.99%. In 2009, Robert Buchholz used naive Bayes, logistic regression, and a support vector machine as classifiers to classify microphones, with the Fourier coefficients of the audio as the feature input. In 2011, the effectiveness of pitch frequency, formant frequencies, and MFCCs for recording device identification was verified. In 2012, Cemal Hanilçi extracted Mel-Frequency Cepstral Coefficients (MFCC) from audio as features and used a support vector machine as the classifier to identify 14 different telephone devices, reaching a recognition rate of 96.42%. In 2014, Vandana Pandey found that the power spectral density of audio can distinguish microphone devices to some extent. In the same year, Ling Zou et al. demonstrated that recording devices can be effectively distinguished using MFCC and power-normalized cepstral coefficients (PNCC).
From the present state of research, relatively few studies target recording device identification specifically. First, recording device feature databases are scarce: with the arrival of the 4G era, the brands and models of mobile phones on the market keep increasing, while existing databases are not updated in time. Second, regarding feature extraction, recording device recognition generally borrows features designed for speech recognition rather than features designed specifically for recording devices. Finally, regarding recognition models, existing recording device identification models are models that perform well in speech or speaker recognition; their parameter settings and designs have not been specially adapted to the characteristics of recording devices.
Disclosure of Invention
The purpose of the invention is as follows: to overcome the defects of the prior art, the invention provides an improved recording device identification algorithm that addresses the low recognition rate and poor generalization of existing methods and can effectively identify the mobile phones and computer devices most widely used on the current market.
The technical scheme is as follows: to achieve this purpose, the invention adopts the following technical scheme:
an improved sound recording device identification algorithm comprising the steps of:
step S1, framing and preprocessing the audio signal to be detected;
s2, constructing a first model, wherein the first model comprises a bidirectional gate recurrent neural network layer, a unidirectional gate recurrent neural network layer and an attention layer which are sequentially arranged, and multi-dimensional frame-level features of the signals in the S1 are extracted as input of the first model;
s3, constructing a second model, wherein the second model comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a jump connection layer, a fourth convolutional layer and a global average pooling layer which are sequentially arranged, and Mel frequency spectrum characteristics of the signals in the step S1 are extracted to serve as input of the second model;
and S4, splicing and fusing the output characteristics of the first model and the second model, classifying and obtaining a recognition result.
Preferably, in step S2, 72-dimensional frame-level features are extracted, and after processing by the first model a 1000-dimensional feature vector is output.
Preferably, in step S3, the output result of the first convolution layer and the output result of the third convolution layer are superimposed to be the final output of the third convolution layer.
Preferably, in step S1, the audio signal is framed, the frame length is 1024, the frame shift is 25%, and Hanning window processing is performed on the signal to extract the multi-dimensional frame-level features.
Preferably, in step S1, the audio signal is framed, the frame length is 1024, and the frame shift is 25%; an FFT is calculated for each frame of data, with 2048 FFT points; and a log-Mel spectrogram is then computed using a Mel filter bank with 80 sub-band filters.
Preferably, in step S2, the multi-dimensional frame-level features include short-time zero-crossing rate, root-mean-square energy, fundamental frequency, spectral centroid, spectral spread, spectral entropy, spectral flux, formant frequency, first-order difference Mel cepstral coefficients, second-order difference Mel cepstral coefficients, linear prediction coefficients, and Bark-frequency cepstral coefficients.
Preferably, in step S2, the output s of the attention layer is expressed as the expectation under the class probability distribution P(v|x,q):

s = Σ_{i=1}^{n} P(v = i | x, q) · x_i

where the input sequence is x = (x_1, x_2, …, x_n) and the corresponding query is q.
Advantageous effects: the improved recording device recognition algorithm of the invention has the following advantages:
1) frame-level features of the signal are introduced into the recording device identification algorithm, preserving the temporal characteristics of the audio signal;
2) an attention mechanism is added to weight and sum the high-level features by importance, finally obtaining high-quality device-related feature parameters and improving the robustness of the model;
3) the standard convolutional neural network model is improved by adding a skip-connection structure, further improving model performance;
4) the final model fusion is realized by a hidden-unit concatenation method, which improves the recognition accuracy of recording device identification and the robustness of the model, giving the invention good application prospects.
Drawings
FIG. 1 is a schematic diagram of a model structure of an improved recognition algorithm of a recording device according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the improved recording device recognition algorithm of the present invention comprises the following steps. Step (1): extract 72-dimensional frame-level feature parameters from each audio clip as the input of the first model. Because an audio signal is quasi-stationary over short intervals but non-stationary over long ones, the signal is first framed; the frame length in the invention is 1024. To smooth the transition between consecutive frames, adjacent frames must overlap, with an overlap ratio of 25%. Because framing causes spectral leakage, each frame is Hanning-windowed.
Finally, the features are extracted. For each frame signal, 72-dimensional features are computed: short-time zero-crossing rate, root-mean-square energy, fundamental frequency, spectral centroid, spectral spread, spectral entropy, spectral flux, formant frequencies, first-order difference Mel cepstral coefficients, second-order difference Mel cepstral coefficients, linear prediction coefficients, and Bark-frequency cepstral coefficients; the specific parameters are given in Table 1, and a minimal extraction sketch follows the table. The per-frame features are then stacked in order, so that each frame carries 72 speech features and the frame sequence preserves the timing information of the original audio signal. The resulting feature matrix has dimension (number of frames × 72); because the number of frames varies with the audio length, this resolves the contradiction between fixed-dimensional features and variable speech length.
TABLE 1
[Table 1 is reproduced as images in the original publication; it lists the specific parameters of the 72 frame-level feature dimensions enumerated above.]
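The following is a minimal sketch of this frame-level extraction for a subset of the listed features, assuming librosa as the signal-processing library (librosa's spectral features apply a Hann window internally, matching the windowing step above). The hop length is an assumption: the description states a 25% overlap while the preferred embodiments state a 25% frame shift, so the 256-sample hop below (a 25% shift of the 1024-sample frame) may need adjusting.

```python
import numpy as np
import librosa

def frame_level_features(path, frame_length=1024, hop_length=256):
    """Stack per-frame features as (num_frames, dims); a subset of the 72 dims."""
    y, sr = librosa.load(path, sr=None)  # keep the native sample rate
    # Short-time zero-crossing rate and RMS energy, framed as described above.
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)
    # Spectral centroid, plus spectral bandwidth standing in for spectral spread.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_length,
                                                 hop_length=hop_length)
    spread = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=frame_length,
                                                hop_length=hop_length)
    # First- and second-order difference MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame_length,
                                hop_length=hop_length)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([zcr, rms, centroid, spread, d1, d2])
    return feats.T  # (num_frames, dims); frame order preserves timing information
```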
Step (2): construct the first model. The first model is built from one bidirectional gated recurrent unit (GRU) layer, one unidirectional GRU layer, and one attention layer. A recurrent neural network handles time-series signals well, while an attention mechanism can autonomously learn the characteristics of a time-series signal; combining the two effectively mines the characteristic parameters of the signal. The input of the model is the 72-dimensional frame-level feature sequence.
The principle of the attention mechanism is to emulate the human visual attention mechanism. Suppose the input sequence is x = (x_1, x_2, …, x_n) and the corresponding query is q. The standard attention mechanism uses a function f(x_i, q) to compute an alignment score a_i between q and x_i. The alignment scores of q over all of x are written a = (a_1, a_2, …, a_n). Finally, a soft-max function maps a to a class probability distribution P(v|x,q), in which P(v = i|x,q) represents the probability of selecting x_i according to q, as in the following equation:

P(v = i | x, q) = exp(a_i) / Σ_{j=1}^{n} exp(a_j)        (1)

Equation (2) expresses the attention output s as the expectation under the class probability distribution P(v|x,q):

s = Σ_{i=1}^{n} P(v = i | x, q) · x_i        (2)
the attention mechanism can endow different importance to the local part of the same sample, automatically learn the characteristics of a time sequence signal and improve the robustness of the model. And outputting a 1000-dimensional characteristic vector by the model, overlapping the output of the model II, and finally classifying.
Step (3): extract a Mel spectrum from each audio clip as the input of the second model. First, the audio signal is framed, with a frame length of 1024 and a frame shift of 25%; second, an FFT is calculated for each frame of data, with 2048 FFT points; third, a log-Mel spectrogram is calculated using a Mel filter bank with 80 sub-band filters.
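A sketch of this log-Mel input, again assuming librosa; the 1024-sample frame, 2048-point FFT, and 80 Mel sub-bands follow the text, while the hop length carries the same caveat as the earlier framing sketch.

```python
import librosa

def log_mel(path, frame_length=1024, hop_length=256, n_fft=2048, n_mels=80):
    """Log-Mel spectrogram of shape (80, num_frames) for the second model."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         win_length=frame_length,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)  # logarithmic Mel spectrogram
```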
Step (4): construct the second model. The input of the second model is the Mel spectrum obtained in step (3). The first three layers of the second model are convolutional layers to which a skip connection is added: the output of the first convolutional layer is superimposed on the output of the third convolutional layer to form the final feature of the third layer. These are followed by one further convolutional layer and a global average pooling layer.
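A PyTorch sketch of the second model under stated assumptions: kernel sizes, channel widths, and activations are not specified in the text, so the values below are illustrative. The skip connection adds the first convolution's output to the third convolution's output as described; same-size padding keeps the two shapes compatible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelTwo(nn.Module):
    def __init__(self, channels=64, out_dim=1000):
        super().__init__()
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(channels, out_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling

    def forward(self, x):                     # x: (batch, 1, 80, frames)
        h1 = F.relu(self.conv1(x))
        h2 = F.relu(self.conv2(h1))
        h3 = F.relu(self.conv3(h2)) + h1      # skip connection: conv1 + conv3
        h4 = F.relu(self.conv4(h3))
        return self.pool(h4).flatten(1)       # (batch, 1000) high-level feature
```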
Step (5): the first model, comprising one bidirectional GRU layer, one unidirectional GRU layer, and an attention layer, finally extracts a 1000-dimensional high-level feature; the second model, whose first three convolutional layers carry a skip connection and which ends with a further convolutional layer and global average pooling, likewise extracts a 1000-dimensional high-level feature. The output features of the two models are concatenated and fused, and the fused vector is classified.
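A sketch of this fusion step: the two 1000-dimensional high-level features are concatenated (the hidden-unit splicing of step S4) and classified by a linear layer. The number of device classes is an assumption.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, model_one, model_two, num_devices=10):
        super().__init__()
        self.m1, self.m2 = model_one, model_two
        self.fc = nn.Linear(2000, num_devices)  # 1000 + 1000 fused dimensions

    def forward(self, frame_feats, mel_spec):
        fused = torch.cat([self.m1(frame_feats), self.m2(mel_spec)], dim=1)
        return self.fc(fused)                   # per-device class logits

# Example shapes: frame_feats (batch, frames, 72), mel_spec (batch, 1, 80, frames)
# logits = FusionClassifier(ModelOne(), ModelTwo())(frame_feats, mel_spec)
```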
TABLE 2 Comparison of recognition rates of different models

Model | Support vector machine | Recurrent neural network | Standard convolutional neural network | Model fusion
Average recognition rate | 81% | 82.3% | 81.5% | 87.5%
In conclusion, the improved recording device identification algorithm of the invention reaches an accuracy of 87.5%. Its characteristics are: 1) the model-fusion structure improves the robustness of the system; 2) frame-level feature extraction effectively mines the recording device information in the audio; 3) the attention mechanism assigns different importance to different parts of the same sample and automatically learns the characteristics of the time-series signal; 4) the skip-connection operation preserves lower-level features. In practical application, the algorithm can therefore effectively distinguish different recording devices, such as the mobile phones and computers most widely used on the current market, from a detected audio signal. The invention overcomes the low recognition rate of traditional recording device recognition models, improves recognition accuracy and model robustness, and has good application prospects.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (7)

1. An improved sound recording device identification algorithm comprising the steps of:
step S1, framing and preprocessing the audio signal to be detected;
s2, constructing a first model, wherein the first model comprises a bidirectional gate recurrent neural network layer, a unidirectional gate recurrent neural network layer and an attention layer which are sequentially arranged, and multi-dimensional frame-level features of the signals in the S1 are extracted as input of the first model;
s3, constructing a second model, wherein the second model comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a global average pooling layer which are sequentially arranged, the first three layers of the second model are convolutional layers, jump connection is added, the output result of the first convolutional layer and the output result of the third convolutional layer are superposed to form the final characteristic of the third layer, and the Mel frequency spectrum characteristic of the signal in the S1 is extracted as the input of the second model;
and S4, splicing and fusing the output characteristics of the first model and the second model, classifying and obtaining a recognition result.
2. An improved sound recording device identification algorithm as claimed in claim 1, wherein: in step S2, 72-dimensional frame-level features are extracted, and after processing by the first model a 1000-dimensional feature vector is output.
3. An improved sound recording device identification algorithm as claimed in claim 1 wherein: in step S3, the output result of the first convolution layer and the output result of the third convolution layer are superimposed to be the final output of the third convolution layer.
4. An improved sound recording device identification algorithm as claimed in claim 1 wherein: in step S1, the audio signal is framed, the frame length is 1024, the frame shift is 25%, and Hanning window processing is performed on the signal to extract multi-dimensional frame-level features.
5. An improved sound recording device identification algorithm as claimed in claim 1, wherein: in step S1, the audio signal is framed, with a frame length of 1024 and a frame shift of 25%; an FFT is calculated for each frame of data, with 2048 FFT points; and a log-Mel spectrogram is then computed using a Mel filter bank with 80 sub-band filters.
6. An improved sound recording device identification algorithm as claimed in claim 1, wherein: in step S2, the multi-dimensional frame-level features include short-time zero-crossing rate, root-mean-square energy, fundamental frequency, spectral centroid, spectral spread, spectral entropy, spectral flux, formant frequency, first-order difference Mel cepstral coefficients, second-order difference Mel cepstral coefficients, linear prediction coefficients, and Bark-frequency cepstral coefficients.
7. An improved sound recording device identification algorithm as claimed in claim 1, wherein: in step S2, the output s of the attention layer is expressed as the expectation under the class probability distribution P(v|x,q):

s = Σ_{i=1}^{n} P(v = i | x, q) · x_i

wherein the input sequence is x = (x_1, x_2, …, x_n) and the corresponding query is q;

P denotes the probability distribution function, i.e. P(v = i|x,q) is the probability of selecting x_i according to q;

q and x_i are related by

a_i = f(x_i, q)

i.e. following the standard attention mechanism principle, a function f(x_i, q) is used to compute the alignment score a_i between q and x_i.
CN201910841092.9A 2019-09-06 2019-09-06 Improved recording equipment identification algorithm Active CN110728991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910841092.9A CN110728991B (en) 2019-09-06 2019-09-06 Improved recording equipment identification algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910841092.9A CN110728991B (en) 2019-09-06 2019-09-06 Improved recording equipment identification algorithm

Publications (2)

Publication Number Publication Date
CN110728991A CN110728991A (en) 2020-01-24
CN110728991B (en) 2022-03-01

Family

ID=69217918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910841092.9A Active CN110728991B (en) 2019-09-06 2019-09-06 Improved recording equipment identification algorithm

Country Status (1)

Country Link
CN (1) CN110728991B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508519B (en) * 2020-04-03 2022-04-26 北京达佳互联信息技术有限公司 Method and device for enhancing voice of audio signal
CN112466298B (en) * 2020-11-24 2023-08-11 杭州网易智企科技有限公司 Voice detection method, device, electronic equipment and storage medium
CN113220934B (en) * 2021-06-01 2023-06-23 平安科技(深圳)有限公司 Singer recognition model training and singer recognition method and device and related equipment
CN113793602B (en) * 2021-08-24 2022-05-10 北京数美时代科技有限公司 Audio recognition method and system for juveniles

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102394062A (en) * 2011-10-26 2012-03-28 华南理工大学 Method and system for automatically identifying voice recording equipment source
CN105023581A (en) * 2015-07-24 2015-11-04 南京工程学院 Audio tampering detection device based on time-frequency domain joint features
CN105513610A (en) * 2015-11-23 2016-04-20 南京工程学院 Voice analysis method and device
CN106952643A (en) * 2017-02-24 2017-07-14 华南理工大学 Recording device clustering method based on Gaussian mean supervector and spectral clustering
CN107507626A (en) * 2017-07-07 2017-12-22 宁波大学 Mobile phone source identification method based on speech spectrum fusion features
CN108831443A (en) * 2018-06-25 2018-11-16 华中师范大学 Mobile recording device source identification method based on stacked autoencoder networks
CN109285538A (en) * 2018-09-19 2019-01-29 宁波大学 Mobile phone source identification method in additive noise environments based on the constant-Q transform domain
CN109923559A (en) * 2016-11-04 2019-06-21 易享信息技术有限公司 Quasi-recurrent neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3822863B1 (en) * 2016-09-06 2022-11-02 DeepMind Technologies Limited Generating audio using neural networks
CN106887225B (en) * 2017-03-21 2020-04-07 百度在线网络技术(北京)有限公司 Acoustic feature extraction method and device based on convolutional neural network and terminal equipment
CN107481715B (en) * 2017-09-29 2020-12-08 百度在线网络技术(北京)有限公司 Method and apparatus for generating information

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102394062A (en) * 2011-10-26 2012-03-28 华南理工大学 Method and system for automatically identifying voice recording equipment source
CN105023581A (en) * 2015-07-24 2015-11-04 南京工程学院 Audio tampering detection device based on time-frequency domain joint features
CN105513610A (en) * 2015-11-23 2016-04-20 南京工程学院 Voice analysis method and device
CN109923559A (en) * 2016-11-04 2019-06-21 易享信息技术有限公司 Quasi-recurrent neural networks
CN109952580A (en) * 2016-11-04 2019-06-28 易享信息技术有限公司 Encoder-decoder model based on quasi-recurrent neural networks
CN106952643A (en) * 2017-02-24 2017-07-14 华南理工大学 Recording device clustering method based on Gaussian mean supervector and spectral clustering
CN107507626A (en) * 2017-07-07 2017-12-22 宁波大学 Mobile phone source identification method based on speech spectrum fusion features
CN108831443A (en) * 2018-06-25 2018-11-16 华中师范大学 Mobile recording device source identification method based on stacked autoencoder networks
CN109285538A (en) * 2018-09-19 2019-01-29 宁波大学 Mobile phone source identification method in additive noise environments based on the constant-Q transform domain

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on recording device discrimination based on CNN; Gao Chonghong et al.; Informatization Research; 2016-04-20 (No. 02); full text *
Device source identification based on linear prediction Mel-frequency cepstral coefficients; Qin Tianyun et al.; Data Communication; 2018-08-28 (No. 04); full text *
Research progress on recording device identification in audio forensics; Bao Yongqiang et al.; Journal of Data Acquisition and Processing; 2018-09-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN110728991A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728991B (en) Improved recording equipment identification algorithm
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN108305616B (en) Audio scene recognition method and device based on long-time and short-time feature extraction
US9818431B2 (en) Multi-speaker speech separation
US20160189730A1 (en) Speech separation method and system
Alamdari et al. Improving deep speech denoising by noisy2noisy signal mapping
CN105788592A (en) Audio classification method and apparatus thereof
CN103377651B (en) Automatic speech synthesis device and method
CN108986798B (en) Voice data processing method, device and equipment
CN103985381A (en) Audio indexing method based on parameter-fusion optimized decision
CN109378014A (en) Mobile device source identification method and system based on convolutional neural networks
CN107358947A (en) Speaker re-recognition method and system
CN107274892A (en) Speaker recognition method and device
Salekin et al. Distant emotion recognition
CN109300470A (en) Audio mixing separation method and audio mixing separator
Yan et al. Audio deepfake detection system with neural stitching for add 2022
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
Jannu et al. Shuffle attention u-Net for speech enhancement in time domain
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
Cui et al. Research on Audio Recognition Based on the Deep Neural Network in Music Teaching
Uhle et al. Speech enhancement of movie sound
WO2021217750A1 (en) Method and system for eliminating channel difference in voice interaction, electronic device, and medium
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Zhou et al. Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant