CN113921041A - Recording equipment identification method and system based on packet convolution attention network - Google Patents

Recording equipment identification method and system based on packet convolution attention network

Info

Publication number
CN113921041A
CN113921041A
Authority
CN
China
Prior art keywords
attention
audio
feature map
channel
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111183247.8A
Other languages
Chinese (zh)
Inventor
李晔
李姝�
张鹏
冯涛
汪付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202111183247.8A priority Critical patent/CN113921041A/en
Publication of CN113921041A publication Critical patent/CN113921041A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a recording device identification method and system based on a packet convolution attention network, comprising the following steps: detecting the non-speech segments of the audio to be tested with a first packet convolution attention network; after detection, screening the non-speech segments and splicing them into one complete non-speech audio; extracting from the non-speech audio the random spectral features that serve as the inherent track of the recording device to be identified and, based on these features, identifying the recording device with a second packet convolution attention network. The method applies packet convolution attention networks to non-speech segment detection and to recording device identification respectively, reducing the complexity of the overall recording device identification model while keeping it highly effective.

Description

Recording equipment identification method and system based on packet convolution attention network
Technical Field
The invention belongs to the technical field of audio identification, and particularly relates to a recording equipment identification method and system based on a packet convolution attention network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In recent years, audio evidence plays an increasingly important role in proving the fact of a case due to the characteristics of convenience in recording, visual presentation and the like.
Audio forensics includes identification of the recording location, identification of the recording device, identification of audio tampering, and the like. A recording device identification model mainly comprises non-speech segment detection (i.e. voice endpoint detection), feature extraction, pattern recognition, database construction and so on. The purpose of non-speech segment detection is to determine whether an audio segment is a speech segment or a non-speech segment; because speech segments account for a large share of the energy of the whole audio signal and strongly affect the recording device characteristics, recording device identification generally processes only non-speech frames. The accuracy of non-speech segment detection is therefore the basis of the recording device identification accuracy. However, research on recording device identification is still at an early stage, with the following main problems:
1) noise interference still affects non-speech segment detection, and the influence of non-stationary noise cannot be overcome;
2) the recording device characteristics cannot be accurately separated from other characteristics, and the characteristic parameters of the recording device cannot be accurately extracted;
3) recognition accuracy is low, and intelligent algorithms (such as deep learning models) do not yet identify recording devices accurately.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a recording device identification method based on a packet convolution attention network, in which packet convolution attention networks are used for non-speech segment detection and for recording device identification respectively, reducing the complexity of the overall recording device identification model while keeping it highly effective.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
in a first aspect, a recording device identification method based on a packet convolution attention network is disclosed, which comprises the following steps:
detecting the non-speech segments of the audio to be tested with a first packet convolution attention network, so as to remove the influence of speech segments on the recording device characteristics; after the non-speech segment detection is finished, screening the non-speech segments of the audio to be tested and splicing them into one complete non-speech audio, which serves as the new audio to be tested;
extracting from the new audio to be tested the random spectral features that serve as the inherent track of the recording device to be identified and, based on these features, identifying the recording device with a second packet convolution attention network.
According to the further technical scheme, when the non-speech section of the audio to be detected is detected, the detection is carried out based on the generated local correlation spectrogram.
According to a further technical scheme, the step of generating the local correlation spectrogram comprises the following steps:
windowing and framing the complete audio to obtain a spectrogram of the complete audio;
processing the spectrogram by using a self-attention locality sensitive hashing algorithm: and selecting corresponding k indexes representing similar frame positions according to the attention score value, and stacking the frequency spectrums corresponding to the k positions along the time dimension to generate a local correlation spectrogram. The spectrum information of each frame can be well expanded.
In a further aspect, the first packet convolutional attention network is configured to perform:
carrying out down-sampling feed-forward operation on the local correlation spectrogram;
performing feedback operation of up-sampling based on the down-sampling operation result;
and taking the result of the up-sampling operation as a carrier of a grouping attention module for guiding feature learning, so that the network learns more essential features of the spectrogram.
In a further aspect, the first packet convolutional attention network is further configured to perform:
dividing the result of the up-sampling operation into a plurality of groups according to the channel dimension;
for any group, the group is divided into three branches which are respectively used for generating a channel attention feature map, a frequency spectrum attention feature map and a time attention feature map;
performing splicing operation on the channel attention feature map, the frequency spectrum attention feature map and the time attention feature map along the channel dimension to generate a complete group feature map;
aggregating the multiple groups of group feature maps along the channel dimension to generate an aggregated attention feature map of the first layer;
channel shuffling is carried out on the aggregation attention feature map, and the information loss influence caused by channel grouping is eliminated;
repeating the first packet convolutional attention network operation to generate 4 aggregated attention feature maps;
and finally, reducing the size of the aggregation feature map of the fourth layer through average pooling operation, and detecting the non-speech segments.
In a further technical solution, the process of generating the channel attention feature map includes:
performing global average pooling operation on a first channel branch in the three branches to generate channel statistical information;
and passing the channel statistical information through a full connection layer, and generating channel attention weight distribution by using an activation function so as to generate a channel attention feature map.
In a further technical solution, the process of generating the spectrum attention feature map comprises: and after the second frequency spectrum branch in the three branches is used for generating frequency spectrum statistical information, carrying out group standardization operation on the frequency spectrum statistical information, generating frequency spectrum attention weight distribution by using an activation function after passing through a full connection layer, and further generating a frequency spectrum attention feature map.
In a further technical solution, the process of generating the time attention feature map comprises: and after the third time branch in the three branches is used for generating time statistical information, carrying out group standardization operation on the time statistical information, generating time attention weight distribution by using an activation function after passing through a full connection layer, and further generating a time attention feature map.
According to a further technical scheme, the extracting of the random spectral characteristic features of the non-speech segments comprises:
firstly, windowing and framing a non-speech section signal, and obtaining a spectrogram through fast Fourier transform;
calculating the short-time power spectrum of the non-speech segment signal, taking the logarithm of the power spectrum and averaging along the time axis to obtain an average power spectrum;
adopting an orthogonal random Gaussian matrix, and reducing the dimension of the average power spectrum through matrix multiplication to obtain random spectrum characteristic parameters of the non-speech section signals;
and generating a two-dimensional random spectral characteristic map based on the random spectral characteristic parameters.
According to a further technical scheme, a second packet convolutional attention network is used for recording device identification; the number of neurons in its output layer equals the number of recording device types used for training, the Softmax layer outputs the predicted probability of each type, and the index of the maximum value is the final identification type of the frame;
given the identification results of all frames, the type with the largest proportion is determined as the number of the recording device to which the audio under test belongs.
In a second aspect, a system for identifying a sound recording device based on a packet convolutional attention network is disclosed, comprising:
a non-speech segment detection module configured to: detect the non-speech segments of the audio to be tested with the first packet convolution attention network, and, after detection, screen the non-speech segments and splice them into one complete non-speech audio serving as the new audio to be tested;
a sound recording device identification module configured to: and extracting the random spectral characteristic feature which is used as the inherent track of the sound recording device to be detected from the non-speech segment, and based on the feature, utilizing a second grouped convolution attention network to identify the sound recording device.
The above one or more technical solutions have the following beneficial effects:
the method is respectively used for non-speech segment detection and recording equipment identification based on the grouping convolution attention network, and the complexity of the identification model of the whole recording equipment is reduced while the high efficiency of the identification model is ensured.
When detecting a non-speech segment, the local correlation spectrogram is generated by using a self-attention local sensitive Hash algorithm for each frame, and the local correlation spectrogram is formed by a plurality of similar frames and used as an expansion feature map of the frame, so that the sense field of the convolutional layer is expanded while the frame information is increased.
The first packet convolution attention network of the present invention is used for non-speech segment detection: the aim is to remove the influence of speech segments on the characteristics of the recording device. Firstly, a carrier feature map module promotes and guides feature learning of a grouping convolution attention module through a local downsampling-upsampling mode, the grouping attention module divides a local related spectrogram into a plurality of groups along a channel dimension, each group is divided into a channel branch, a spectrum branch and a time branch, parallel and efficient modeling of channel, frequency domain and time domain attention is achieved, a network is helped to focus attention on VAD features which are helpful for non-speech section detection, and other irrelevant features are restrained.
When the recording equipment is identified, firstly, the stochastic spectral characteristics (RSFs) representing the power track of the recording equipment are extracted, and a stochastic spectral characteristic diagram is generated.
The second packet convolutional attention network of the present invention is used for recording device identification. It differs from the first in that the packet attention module divides the stochastic spectral feature map into several groups along the channel dimension and splits each group into a channel branch and a stochastic spectrum branch, the latter capturing the inherent track of the recording device. This helps the network focus attention on feature regions related to the recording device.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a recording device identification according to an embodiment of the present invention;
FIG. 2 is a block diagram of a packet convolution attention network flow according to an embodiment of the present invention;
fig. 3 is a first packet convolutional attention network of an embodiment of the present invention: a carrier characteristic map module and a packet attention module;
FIG. 4 is a flow chart of the random spectral feature extraction according to an embodiment of the present invention;
FIG. 5 is a second packet convolutional attention network of an embodiment of the present invention: packet attention module schematic.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment discloses a recording device identification method based on a packet convolution attention network, and a recording device identification flow chart is shown in FIG. 1.
The method comprises the following steps:
S1, using a first packet convolution attention network to perform non-speech segment detection;
S2, extracting random spectral features (RSFs) of the non-speech segments as the characteristic parameters for recording device identification;
S3, identifying the recording device with a second packet convolution attention network.
1. Non-speech segment detection using a first packet convolution attention network
In step S1, the non-speech segment detection process includes:
step 1.1, generating a local correlation spectrogram;
specifically, the generating a local correlation spectrogram includes:
windowing and framing the complete audio to obtain a spectrogram of the complete audio; the 0 th dimension of the spectrogram represents a frequency spectrum and the 1 st dimension represents a time frame. Assuming that the number of generated frames is M, and the M frame data are sequentially used as query points, for each query point q, k frame data most similar to q, that is, k neighboring frames, need to be found from the M frame data.
Searching for the k neighboring frames uses a self-attention locality-sensitive hashing algorithm: for each query point q, its k neighboring frames are found automatically by taking the k indices with the largest attention scores over similar frame positions. The self-attention score is defined as
score(Q, K) = dist - 2 * Q · K^T
where D denotes the number of frequency points per frame, Q and K are the data spaces formed by the M frame spectra, dist collects the sums of squares of the frequency points of the i-th frame in Q and in K respectively, and (·)^T denotes the transpose operation. Each row of score(Q, K) gives the relevance scores of a query point to the M frames: when two frames are sufficiently similar the corresponding score is sufficiently large, otherwise it is small.
The k corresponding indices are selected according to the attention score values, and the spectra at those k positions are stacked along the time dimension to generate the local correlation spectrogram, whose dimensions C, F' and T' denote the channel, spectrum and time dimensions respectively. The local correlation spectrogram greatly expands the spectral information of each frame and provides a larger feature-learning space for the network.
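As a concrete illustration of the frame-similarity search and stacking described above, the following Python sketch computes the score matrix and builds per-frame local correlation spectrograms. It is a minimal sketch only: the function name, the [M, D] tensor layout and the parameter k are assumptions for illustration, not the patented implementation.

```python
import torch

def local_correlation_spectrogram(spec: torch.Tensor, k: int) -> torch.Tensor:
    """spec: [M, D] magnitude spectrogram (M frames, D frequency points).
    Returns a [M, D, k] tensor: for every query frame, the spectra of its
    k most related frames stacked along a new time dimension, i.e. one
    [D, k] local correlation spectrogram per frame."""
    sq = (spec ** 2).sum(dim=1)                  # sum of squares of each frame, [M]
    dist = sq.unsqueeze(1) + sq.unsqueeze(0)     # pairwise squared-norm term, [M, M]
    score = dist - 2.0 * spec @ spec.T           # score(Q, K) with Q = K = spec
    # the text selects the k indices with the maximum attention scores
    _, idx = torch.topk(score, k, dim=1, largest=True)   # [M, k]
    return spec[idx].permute(0, 2, 1)            # stack the spectra along the time dim
```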
Step 1.2 first packet convolution attention network
Referring to fig. 2, the complete packet convolution attention network consists of four layers of carrier feature map modules and packet attention modules.
Step 1.2.1 Carrier feature map Module
First, the local correlation spectrogram passes through a two-dimensional convolution layer and a maximum pooling layer; the resulting feature map is denoted X.
The carrier feature map module comprises:
a down-sampling operation performed on X, producing M_down by a residual block f_C-B-R(·), consisting of a two-dimensional convolution layer, a batch normalization layer and a ReLU layer, together with MaxPool(·), the max-pooling (down-sampling) operation;
an up-sampling operation performed on M_down, producing M_up, where UpSample(·) denotes the up-sampling operation, implemented by a bilinear interpolation up-sampling layer.
Through this feed-forward and feedback process of down-sampling and up-sampling, the network is prompted to learn more essential characteristics of the spectrogram, so that M_up, used as the carrier of the grouping attention, can better guide feature learning.
Step 1.2.2 regarding the group attention module, comprising:
A. The carrier feature map module output M_up is divided into G groups along the channel dimension: M_up = [X_1, ..., X_G], so that each group X_i contains C/G channels.
B. Any group X_i is divided into three blocks: X_i = [X_i^c, X_i^f, X_i^t], j ∈ [c, f, t], which are used to generate the channel attention feature map, the frequency spectrum attention feature map and the time attention feature map respectively;
the process of generating the channel attention feature map comprises the following steps: will be provided with
Figure BDA0003298117730000086
Performing a global average pooling operation
Figure BDA0003298117730000087
Generating channel statistics
Figure BDA0003298117730000088
Figure BDA0003298117730000089
Wherein the content of the first and second substances,
Figure BDA00032981177300000810
represents
Figure BDA00032981177300000811
Frequency point (c).
Channel statistical information scThrough the full connection layer
Figure BDA00032981177300000812
Generating channel attention weight distribution by using Sigmoid activation function, and further generating channel attention feature map
Figure BDA00032981177300000813
Figure BDA00032981177300000814
Wherein the content of the first and second substances,
Figure BDA00032981177300000815
representing the extension of the channel attention weight distribution dimension to
Figure BDA00032981177300000816
After the same, the dot product operation is performed with the same.
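A compact sketch of this channel attention branch (global average pooling, fully connected layer, Sigmoid, point-wise scaling) is given below; the single Linear layer and the channel count are placeholders, not the patented configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch: F_gp -> F_fc -> Sigmoid -> scale X_i^c."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling F_gp
        self.fc = nn.Linear(channels, channels)  # fully connected layer F_fc
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_c: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x_c.shape
        s_c = self.gap(x_c).view(b, c)                   # channel statistics s_c
        w = self.sigmoid(self.fc(s_c)).view(b, c, 1, 1)  # channel attention weights
        return x_c * w                                   # channel attention feature map
```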
The process of generating the frequency spectrum attention feature map comprises the following steps: spectrum statistics s_f are generated from X_i^f; s_f then undergoes a group normalization operation, passes through a fully connected layer, and a Sigmoid activation function generates the spectrum attention weight distribution, producing the frequency spectrum attention feature map.
The process of generating the time attention feature map is similar to the spectrum attention: time statistics s_t are generated from X_i^t; s_t then undergoes a group normalization operation, passes through a fully connected layer, and a Sigmoid activation function generates the time attention weight distribution, producing the time attention feature map.
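The spectrum and time branches share the group normalization, fully connected, Sigmoid pattern. The sketch below applies GroupNorm directly to the branch input and uses a 1x1 convolution as the fully connected layer; both choices are assumptions made here for illustration, since the exact statistic computation is not spelled out.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Shared pattern for the spectrum and time branches:
    group normalization -> fully connected (1x1 conv) -> Sigmoid -> scale."""
    def __init__(self, channels: int, gn_groups: int = 1):
        super().__init__()
        # channels must be divisible by gn_groups
        self.gn = nn.GroupNorm(gn_groups, channels)             # group standardization
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)  # F_fc as a 1x1 conv
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_j: torch.Tensor) -> torch.Tensor:
        s_j = self.gn(x_j)                 # spectrum / time statistics
        w = self.sigmoid(self.fc(s_j))     # attention weight distribution
        return x_j * w                     # spectrum or time attention feature map

# the same module is instantiated once for the spectrum branch (j = f)
# and once for the time branch (j = t)
```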
C. The channel attention feature map, the frequency spectrum attention feature map and the time attention feature map are concatenated along the channel dimension (Concat[·]) to generate the complete group feature map X_i'.
Finally, the feature maps of the G groups are aggregated along the channel dimension to generate the aggregated attention feature map of the first layer Y_1, whose channel number is C = G × N, N being the number of channels per group.
D. The computation of each attention feature map only needs part of the channel information; in particular, the time-frequency attention does not need all channels. However, when multiple grouped convolution attention modules are stacked, the output of a given channel group depends only on the information of the corresponding input channel group, which weakens the information flow across the whole set of channel groups. A channel shuffling operation is therefore added at this step, so that every group in the subsequent packet convolution attention module receives sufficient information.
The channel shuffling operation is: the aggregated attention feature map Y_1 is expanded and transposed into [N, G, F, T] in the channel dimension and then flattened back to the channel dimension [C, F, T], serving as the input of the next layer. The carrier feature map module and the group attention module are shown in fig. 3.
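The channel shuffling operation admits a compact implementation by reshaping and transposing the channel dimension, as in this sketch (the [B, C, F, T] tensor layout is assumed for illustration):

```python
import torch

def channel_shuffle(y: torch.Tensor, groups: int) -> torch.Tensor:
    """y: [B, C, F, T] aggregated attention feature map with C = groups * N.
    Expand C into [G, N], transpose to [N, G], then flatten back to C so the
    next layer's channel groups receive information from every group."""
    b, c, f, t = y.shape
    n = c // groups
    y = y.view(b, groups, n, f, t).transpose(1, 2).contiguous()
    return y.view(b, c, f, t)
```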
Step 1.2.3: steps 1.2.1 to 1.2.2 are repeated three times, so that each layer outputs an aggregated feature map Y_l (l = 1, ..., 4).
Finally, an average pooling operation reduces the size of Y_4 to [C, 1] for the final detection.
Step 1.2.4 non-speech segment detection
The detection operation classifies each segment as speech or non-speech; it consists of a fully connected layer and a Sigmoid layer, and the output layer has a single neuron.
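A minimal sketch of this detection head follows, assuming the average-pooled fourth-layer feature map is flattened to a C-dimensional vector before the fully connected layer; the class name and channel count are illustrative placeholders.

```python
import torch
import torch.nn as nn

class VADHead(nn.Module):
    """Speech / non-speech detection: average pooling -> FC -> Sigmoid (1 output neuron)."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # reduce Y_4 to [C, 1]
        self.fc = nn.Linear(channels, 1)      # output layer with one neuron
        self.sigmoid = nn.Sigmoid()

    def forward(self, y4: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = y4.shape
        v = self.pool(y4).view(b, c)
        return self.sigmoid(self.fc(v))       # speech / non-speech probability
```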
After the non-speech segment detection is finished, the detected non-speech segments are screened and spliced into one complete non-speech audio, which serves as the new audio to be tested.
Step S2: extracting the random spectral features (RSFs) of the non-speech segments. Assuming the device under test is a linear time-invariant system, the influence of the recording device on the speech can be modeled as the convolution of the original speech with the device impulse response. Since the spectrum of any speech segment is the product of the spectrum of the original speech signal and the device frequency response, the identity of each recording device is embedded in the speech. Based on this assumption, stochastic spectral features (RSFs) can be used as the inherent track of the recording device under test.
The extracting of the Random Spectral Features (RSFs) of the non-speech segments comprises:
first, the non-speech (noise) segment signal is windowed and framed, with a frame length of 64 ms and a frame shift of 32 ms, and a 2048-point spectrum of each frame, denoted x(t), is obtained through the fast Fourier transform;
the short-time power spectrum w of the non-speech segment signal is then computed; w represents the distribution of the average power of the non-speech segment signal over the frequency domain, i.e. how the power per unit frequency band varies with frequency;
the logarithm of the power spectrogram w is taken and averaged along the time axis to obtain a 2048-dimensional average power spectrum;
an orthogonal random Gaussian matrix of size d × 2048 is adopted, and the average power spectrum is reduced by matrix multiplication to dimension d < 2048, giving the d-dimensional RSF parameters of the noise signal;
a two-dimensional stochastic spectral feature map is generated from the RSF parameters, where A represents amplitude and the map describes how the average power varies with frequency. The complete extraction process is shown in fig. 4.
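The RSF extraction steps can be summarized in a short numpy sketch (64 ms frames, 32 ms shift and a 2048-point FFT as stated above; the Hann window, the target dimension d and the random seed are assumptions for illustration):

```python
import numpy as np

def extract_rsf(x: np.ndarray, sr: int, d: int = 128, seed: int = 0) -> np.ndarray:
    """x: non-speech (noise) segment signal, sr: sampling rate.
    Returns the d-dimensional RSF parameters of the segment."""
    frame_len, hop, n_fft = int(0.064 * sr), int(0.032 * sr), 2048
    window = np.hanning(frame_len)
    frames = np.stack([x[i:i + frame_len] * window
                       for i in range(0, len(x) - frame_len + 1, hop)])
    power = np.abs(np.fft.fft(frames, n=n_fft, axis=1)) ** 2   # short-time power spectrum
    avg_log_power = np.log(power + 1e-12).mean(axis=0)         # 2048-dim average log power spectrum
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_fft, d))
    q, _ = np.linalg.qr(g)                                     # orthonormal columns, [2048, d]
    return q.T @ avg_log_power                                 # d x 2048 projection -> d-dim RSFs
```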
Step S3, identifying the recording device with the second packet convolutional attention network
The process of using the second packet convolutional attention network for recording device identification follows the same operations as steps 1.2.1 to 1.2.4, except that in the packet attention module:
after the carrier feature map module output M_up is divided into G groups, any group X_i is divided into two blocks: X_i = [X_i^c, X_i^r], j ∈ [c, r], where X_i^r is used to generate the stochastic spectral attention feature map:
spatial statistics of the random spectrum s_r are generated from X_i^r; s_r then undergoes a group normalization operation and passes through a fully connected layer F_fc(·) that enhances the representation of s_r; a Sigmoid activation function generates the stochastic spectral attention weight values, producing the stochastic spectral attention feature map.
The packet attention module of the second packet convolutional attention network is shown in fig. 5.
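A sketch of this two-branch group attention is given below. The channel branch repeats the global average pooling, fully connected, Sigmoid pattern described earlier; the 1x1-convolution fully connected layer, the GroupNorm group count and the even channel split are assumptions for illustration, not the patented configuration.

```python
import torch
import torch.nn as nn

class TwoBranchGroupAttention(nn.Module):
    """Group attention of the second network: each group X_i is split into a
    channel branch X_i^c and a stochastic-spectrum branch X_i^r."""
    def __init__(self, group_channels: int, gn_groups: int = 1):
        super().__init__()
        half = group_channels // 2
        # channel branch: global average pooling -> FC -> Sigmoid
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc_c = nn.Linear(half, half)
        # stochastic-spectrum branch: GroupNorm -> FC (1x1 conv) -> Sigmoid
        self.gn = nn.GroupNorm(gn_groups, half)
        self.fc_r = nn.Conv2d(half, half, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_i: torch.Tensor) -> torch.Tensor:
        x_c, x_r = x_i.chunk(2, dim=1)                      # split the group into two blocks
        b, c, _, _ = x_c.shape
        s_c = self.gap(x_c).view(b, c)                      # channel statistics
        out_c = x_c * self.sigmoid(self.fc_c(s_c)).view(b, c, 1, 1)
        s_r = self.gn(x_r)                                  # spatial statistics of the random spectrum
        out_r = x_r * self.sigmoid(self.fc_r(s_r))          # stochastic-spectrum attention
        return torch.cat([out_c, out_r], dim=1)             # complete group feature map
```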
In the final recording device detection stage, the number of neurons in the output layer is set to the number of recording device types used for training; the Softmax layer outputs the predicted probability of each type, and the index of the maximum value is the final recognition type of that frame. Given the recognition results of all frames, the type with the largest proportion is determined as the number of the recording device to which the audio under test belongs.
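This decision rule (per-frame Softmax followed by a majority vote over all frames) can be sketched as follows; the function name and the frame_logits layout are illustrative placeholders.

```python
import torch

def identify_device(frame_logits: torch.Tensor) -> int:
    """frame_logits: [num_frames, num_devices] outputs of the second network.
    Each frame is classified by the arg-max of its Softmax probabilities, and
    the device label occurring most often across frames is returned."""
    probs = torch.softmax(frame_logits, dim=1)
    frame_labels = probs.argmax(dim=1)                 # per-frame identification
    return int(torch.bincount(frame_labels).argmax())  # majority vote over frames
```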
Example two
It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Example four
This embodiment provides, in a second aspect, a recording device identification system based on a packet convolution attention network, comprising:
a non-speech segment detection module configured to: detecting the non-speech section of the audio to be detected by utilizing the first packet convolution attention network, screening the non-speech section of the audio to be detected after the non-speech section is detected, and splicing the non-speech section into a complete non-speech section audio serving as a new audio to be detected;
a sound recording device identification module configured to: and extracting random spectral characteristic features serving as inherent tracks of the recording equipment to be detected from the new audio to be detected, and identifying the recording equipment by utilizing a second grouped convolution attention network based on the features.
The first packet convolutional attention network consists of four layers of carrier feature map modules and packet attention modules; the specific implementation and functions of the modules are described in the first embodiment and are not repeated here.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. The method for identifying the recording equipment based on the packet convolution attention network is characterized by comprising the following steps:
detecting the non-speech section of the audio to be detected by utilizing the first packet convolution attention network, screening the non-speech section of the audio to be detected after the non-speech section is detected, and splicing the non-speech section into a complete non-speech section audio serving as a new audio to be detected;
and extracting random spectral characteristic features serving as inherent tracks of the recording equipment to be detected from the new audio to be detected, and identifying the recording equipment by utilizing a second grouped convolution attention network based on the features.
2. The method of claim 1, wherein the detecting the non-speech segment of the audio to be tested is performed based on the generated local correlation spectrogram;
further, the step of generating the local correlation spectrogram is as follows:
windowing and framing the complete audio to obtain a spectrogram of the complete audio;
processing the spectrogram by using a self-attention locality sensitive hashing algorithm: and selecting corresponding k indexes representing similar frame positions according to the attention score value, and stacking the frequency spectrums corresponding to the k positions along the time dimension to generate a local correlation spectrogram.
3. The method for packet convolutional attention network based audio recording device identification as claimed in claim 1 wherein the first packet convolutional attention network is configured to perform:
carrying out down-sampling feed-forward operation on the local correlation spectrogram;
performing feedback operation of up-sampling based on the down-sampling operation result;
and taking the result of the up-sampling operation as a carrier of a grouping attention module for guiding feature learning, so that the network learns more essential features of the spectrogram.
4. The method for packet convolutional attention network based audio recording device identification of claim 3 wherein the first packet convolutional attention network is further configured to perform:
the results of the upsampling operation are divided into a plurality of groups;
for any group, the group is divided into three branches which are respectively used for generating a channel attention feature map, a frequency spectrum attention feature map and a time attention feature map;
performing splicing operation on the channel attention feature map, the frequency spectrum attention feature map and the time attention feature map along the channel dimension to generate a complete group feature map;
aggregating the multiple groups of group feature maps along the channel dimension to generate an aggregated attention feature map of a first layer, wherein the aggregated attention feature map has the same dimensions as the local correlation spectrogram;
the aggregated attention feature map is subjected to channel shuffling operation, so that the information loss influence caused by channel grouping is eliminated;
repeating the first packet convolutional attention network operation to generate 4 aggregated attention feature maps;
and finally, reducing the size of the aggregation feature map of the fourth layer through average pooling operation, and detecting the non-speech segments.
5. The method for audio recording device identification based on packet convolution attention network of claim 4 wherein the process of generating the channel attention feature map is:
performing global average pooling operation on a first channel branch in the three branches to generate channel statistical information;
the channel statistical information passes through a full connection layer, and an activation function is used for generating channel attention weight distribution so as to generate a channel attention feature map;
the process of generating the spectrum attention feature map comprises the following steps: after the second, frequency spectrum branch of the three branches is used for generating frequency spectrum statistical information, the frequency spectrum statistical information is subjected to a group standardization operation; after passing through a full connection layer, an activation function generates a spectrum attention weight distribution, and a frequency spectrum attention feature map is generated;
the process of generating the time attention feature map comprises the following steps: and after the third time branch in the three branches is used for generating time statistical information, carrying out group standardization operation on the time statistical information, generating time attention weight distribution by using an activation function after passing through a full connection layer, and further generating a time attention feature map.
6. The method for sound recording device identification based on packet convolution attention network as claimed in claim 1, wherein said extracting the random spectral feature of the non-speech segment includes:
firstly, windowing and framing a noise section signal, and obtaining a spectrogram through fast Fourier transform;
calculating a short-time power spectrum of the non-speech segment signals, taking the logarithm of the power spectrum and averaging along a time axis to obtain an average power spectrum;
adopting an orthogonal random Gaussian matrix, and reducing the dimension of the average power spectrum through matrix multiplication to obtain random spectrum characteristic parameters of the noise signal;
and generating a two-dimensional random spectral characteristic map based on the random spectral characteristic parameters.
7. The method of claim 4, wherein the recording device identification is performed with a second packet convolutional attention network; the number of neurons in the output layer of the second packet convolutional attention network is the number of recording device types used for training, the Softmax layer outputs the predicted probability of each type, and the index of the maximum value is the final identification type of the frame;
given the identification results of all frames, the type with the largest proportion is determined as the number of the recording device to which the audio under test belongs.
8. Recording equipment identification system based on grouping convolution attention network is characterized by comprising:
a non-speech segment detection module configured to: detect the non-speech segments of the audio to be tested with the first packet convolution attention network, and, after the non-speech segment detection is finished, screen the detected non-speech segments and splice them into one complete non-speech audio serving as the new audio to be tested;
a recording device identification module configured to: extract, from the new audio to be tested, the random spectral features serving as the inherent track of the recording device to be tested, and identify the recording device with a second packet convolution attention network based on these features.
9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of the preceding claims 1 to 7.
CN202111183247.8A 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network Pending CN113921041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111183247.8A CN113921041A (en) 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111183247.8A CN113921041A (en) 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network

Publications (1)

Publication Number Publication Date
CN113921041A true CN113921041A (en) 2022-01-11

Family

ID=79239310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111183247.8A Pending CN113921041A (en) 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network

Country Status (1)

Country Link
CN (1) CN113921041A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596879A (en) * 2022-03-25 2022-06-07 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
CN114596879B (en) * 2022-03-25 2022-12-30 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
WO2024040601A1 (en) * 2022-08-26 2024-02-29 Intel Corporation Head architecture for deep neural network (dnn)

Similar Documents

Publication Publication Date Title
Becker et al. Interpreting and explaining deep neural networks for classification of audio signals
CN110516305B (en) Intelligent fault diagnosis method under small sample based on attention mechanism meta-learning model
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN108986798B (en) Processing method, device and the equipment of voice data
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN113921041A (en) Recording equipment identification method and system based on packet convolution attention network
CN104424290A (en) Voice based question-answering system and method for interactive voice system
CN111341319B (en) Audio scene identification method and system based on local texture features
WO2021075063A1 (en) Neural network-based signal processing apparatus, neural network-based signal processing method, and computer-readable storage medium
CN113205820B (en) Method for generating voice coder for voice event detection
CN111341294A (en) Method for converting text into voice with specified style
CN113409827B (en) Voice endpoint detection method and system based on local convolution block attention network
Xie et al. KD-CLDNN: Lightweight automatic recognition model based on bird vocalization
CN114582325A (en) Audio detection method and device, computer equipment and storage medium
Imran et al. An analysis of audio classification techniques using deep learning architectures
JP2010515085A (en) Audio segmentation method and apparatus
Zhu et al. Speech emotion recognition using semi-supervised learning with efficient labeling strategies
CN112735466A (en) Audio detection method and device
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Wang et al. Revealing the processing history of pitch-shifted voice using CNNs
CN115083422B (en) Voice traceability evidence obtaining method and device, equipment and storage medium
CN116312628A (en) False audio detection method and system based on self knowledge distillation
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination