CN113921041A - Recording equipment identification method and system based on packet convolution attention network - Google Patents

Recording equipment identification method and system based on packet convolution attention network

Info

Publication number
CN113921041A
CN113921041A
Authority
CN
China
Prior art keywords
attention
audio
feature map
channel
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111183247.8A
Other languages
Chinese (zh)
Inventor
李晔
李姝�
张鹏
冯涛
汪付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202111183247.8A priority Critical patent/CN113921041A/en
Publication of CN113921041A publication Critical patent/CN113921041A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a recording device identification method and system based on a packet convolution attention network, comprising the following steps: detecting the non-speech segments of the audio to be tested with a first packet convolution attention network; after detection, screening the non-speech segments and splicing them into one complete non-speech audio; extracting from the non-speech audio the random spectral features that serve as the inherent track of the recording device to be identified and, based on these features, identifying the recording device with a second packet convolution attention network. The method applies packet convolution attention networks to non-speech segment detection and to recording device identification respectively, reducing the complexity of the overall recording device identification model while keeping it highly effective.

Description

Recording equipment identification method and system based on packet convolution attention network
Technical Field
The invention belongs to the technical field of audio identification, and particularly relates to a recording equipment identification method and system based on a packet convolution attention network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In recent years, audio evidence plays an increasingly important role in proving the fact of a case due to the characteristics of convenience in recording, visual presentation and the like.
Audio forensics includes identification of the recording location, identification of the recording device, identification of audio tampering, and the like. A recording device identification model mainly comprises non-speech segment detection (i.e. voice endpoint detection), feature extraction, pattern recognition, database construction and so on. The purpose of non-speech segment detection is to determine whether an audio segment is a speech segment or a non-speech segment; because speech segments account for a large share of the energy of the whole audio signal and strongly affect the recording device characteristics, recording device identification generally processes only non-speech frames. The accuracy of non-speech segment detection is therefore the basis of the recording device identification accuracy. However, research on recording device identification is still at an early stage, with the following main problems:
1) noise interference still affects non-speech segment detection, and the influence of non-stationary noise cannot be overcome;
2) the recording device characteristics cannot be accurately separated from other characteristics, and the characteristic parameters of the recording device cannot be accurately extracted;
3) recognition accuracy is low, and intelligent algorithms (such as deep learning models) do not yet identify recording devices accurately.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a recording device identification method based on a packet convolution attention network, in which packet convolution attention networks are used for non-speech segment detection and for recording device identification respectively, reducing the complexity of the overall recording device identification model while keeping it highly effective.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
in a first aspect, a recording device identification method based on a packet convolution attention network is disclosed, which comprises the following steps:
detecting the non-speech segments of the audio to be tested with a first packet convolution attention network, so as to remove the influence of speech segments on the recording device characteristics; after the non-speech segment detection is finished, screening the non-speech segments of the audio to be tested and splicing them into one complete non-speech audio, which serves as the new audio to be tested;
extracting from the new audio to be tested the random spectral features that serve as the inherent track of the recording device to be identified and, based on these features, identifying the recording device with a second packet convolution attention network.
According to the further technical scheme, when the non-speech section of the audio to be detected is detected, the detection is carried out based on the generated local correlation spectrogram.
According to a further technical scheme, the step of generating the local correlation spectrogram comprises the following steps:
windowing and framing the complete audio to obtain a spectrogram of the complete audio;
processing the spectrogram by using a self-attention locality sensitive hashing algorithm: and selecting corresponding k indexes representing similar frame positions according to the attention score value, and stacking the frequency spectrums corresponding to the k positions along the time dimension to generate a local correlation spectrogram. The spectrum information of each frame can be well expanded.
In a further aspect, the first packet convolutional attention network is configured to perform:
carrying out down-sampling feed-forward operation on the local correlation spectrogram;
performing feedback operation of up-sampling based on the down-sampling operation result;
and taking the result of the up-sampling operation as a carrier of a grouping attention module for guiding feature learning, so that the network learns more essential features of the spectrogram.
In a further aspect, the first packet convolutional attention network is further configured to perform:
dividing the result of the up-sampling operation into a plurality of groups according to the channel dimension;
for any group, the group is divided into three branches which are respectively used for generating a channel attention feature map, a frequency spectrum attention feature map and a time attention feature map;
performing splicing operation on the channel attention feature map, the frequency spectrum attention feature map and the time attention feature map along the channel dimension to generate a complete group feature map;
aggregating the multiple groups of group feature maps along the channel dimension to generate an aggregated attention feature map of the first layer;
channel shuffling is carried out on the aggregation attention feature map, and the information loss influence caused by channel grouping is eliminated;
repeating the first packet convolutional attention network operation to generate 4 aggregated attention feature maps;
and finally, reducing the size of the aggregation feature map of the fourth layer through average pooling operation, and detecting the non-speech segments.
In a further technical solution, the process of generating the channel attention feature map includes:
performing global average pooling operation on a first channel branch in the three branches to generate channel statistical information;
and passing the channel statistical information through a full connection layer, and generating channel attention weight distribution by using an activation function so as to generate a channel attention feature map.
In a further technical solution, the process of generating the spectrum attention feature map comprises: and after the second frequency spectrum branch in the three branches is used for generating frequency spectrum statistical information, carrying out group standardization operation on the frequency spectrum statistical information, generating frequency spectrum attention weight distribution by using an activation function after passing through a full connection layer, and further generating a frequency spectrum attention feature map.
In a further technical solution, the process of generating the time attention feature map comprises: and after the third time branch in the three branches is used for generating time statistical information, carrying out group standardization operation on the time statistical information, generating time attention weight distribution by using an activation function after passing through a full connection layer, and further generating a time attention feature map.
According to a further technical scheme, the extracting of the random spectral characteristic features of the non-speech segments comprises:
firstly, windowing and framing a non-speech section signal, and obtaining a spectrogram through fast Fourier transform;
calculating the short-time power spectrum of the non-speech segment signal, taking the logarithm of the power spectrum and averaging along the time axis to obtain an average power spectrum;
adopting an orthogonal random Gaussian matrix, and reducing the dimension of the average power spectrum through matrix multiplication to obtain random spectrum characteristic parameters of the non-speech section signals;
and generating a two-dimensional random spectral characteristic map based on the random spectral characteristic parameters.
According to a further technical scheme, a second packet convolutional attention network is used for recording device identification; the number of neurons in its output layer equals the number of recording device types used for training, the Softmax layer outputs the predicted probability of each type, and the index of the maximum value is the final identification type of the frame;
given the identification results of all frames, the type with the largest proportion is determined as the number of the recording device to which the audio under test belongs.
In a second aspect, a system for identifying a sound recording device based on a packet convolutional attention network is disclosed, comprising:
a non-speech segment detection module configured to: detect the non-speech segments of the audio to be tested with the first packet convolution attention network, and, after detection, screen the non-speech segments and splice them into one complete non-speech audio serving as the new audio to be tested;
a sound recording device identification module configured to: and extracting the random spectral characteristic feature which is used as the inherent track of the sound recording device to be detected from the non-speech segment, and based on the feature, utilizing a second grouped convolution attention network to identify the sound recording device.
The above one or more technical solutions have the following beneficial effects:
the method is respectively used for non-speech segment detection and recording equipment identification based on the grouping convolution attention network, and the complexity of the identification model of the whole recording equipment is reduced while the high efficiency of the identification model is ensured.
When detecting a non-speech segment, the local correlation spectrogram is generated by using a self-attention local sensitive Hash algorithm for each frame, and the local correlation spectrogram is formed by a plurality of similar frames and used as an expansion feature map of the frame, so that the sense field of the convolutional layer is expanded while the frame information is increased.
The first packet convolution attention network of the present invention is used for non-speech segment detection: the aim is to remove the influence of speech segments on the characteristics of the recording device. Firstly, a carrier feature map module promotes and guides feature learning of a grouping convolution attention module through a local downsampling-upsampling mode, the grouping attention module divides a local related spectrogram into a plurality of groups along a channel dimension, each group is divided into a channel branch, a spectrum branch and a time branch, parallel and efficient modeling of channel, frequency domain and time domain attention is achieved, a network is helped to focus attention on VAD features which are helpful for non-speech section detection, and other irrelevant features are restrained.
When the recording equipment is identified, firstly, the stochastic spectral characteristics (RSFs) representing the power track of the recording equipment are extracted, and a stochastic spectral characteristic diagram is generated.
The second packet convolutional attention network of the present invention is used for recording device identification. It differs from the first in that the packet attention module divides the stochastic spectral feature map into several groups along the channel dimension and splits each group into a channel branch and a stochastic spectrum branch, the latter capturing the inherent track of the recording device. This helps the network focus attention on feature regions related to the recording device.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a recording device identification according to an embodiment of the present invention;
FIG. 2 is a block diagram of a packet convolution attention network flow according to an embodiment of the present invention;
fig. 3 is a first packet convolutional attention network of an embodiment of the present invention: a carrier characteristic map module and a packet attention module;
FIG. 4 is a flow chart of the random spectral feature extraction according to an embodiment of the present invention;
FIG. 5 is a second packet convolutional attention network of an embodiment of the present invention: packet attention module schematic.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment discloses a recording device identification method based on a packet convolution attention network, and a recording device identification flow chart is shown in FIG. 1.
The method comprises the following steps:
S1, using a first packet convolution attention network to perform non-speech segment detection;
S2, extracting random spectral features (RSFs) of the non-speech segments as the characteristic parameters for recording device identification;
S3, identifying the recording device with a second packet convolution attention network.
1. Non-speech segment detection using a first packet convolution attention network
In step S1, the non-speech segment detection process includes:
step 1.1, generating a local correlation spectrogram;
specifically, the generating a local correlation spectrogram includes:
windowing and framing the complete audio to obtain a spectrogram of the complete audio; the 0 th dimension of the spectrogram represents a frequency spectrum and the 1 st dimension represents a time frame. Assuming that the number of generated frames is M, and the M frame data are sequentially used as query points, for each query point q, k frame data most similar to q, that is, k neighboring frames, need to be found from the M frame data.
Searching for the k neighboring frames uses a self-attention locality-sensitive hashing algorithm: for each query point q, its k neighboring frames are found automatically by taking the k indices with the largest attention scores over similar frame positions. The self-attention score is defined as
score(Q, K) = dist - 2 * Q · K^T
where D denotes the number of frequency points per frame, Q and K are the data spaces formed by the M frame spectra, dist collects the sums of squares of the frequency points of the i-th frame in Q and in K respectively, and (·)^T denotes the transpose operation. Each row of score(Q, K) gives the relevance scores of a query point to the M frames: when two frames are sufficiently similar the corresponding score is sufficiently large, otherwise it is small.
The k corresponding indices are selected according to the attention score values, and the spectra at those k positions are stacked along the time dimension to generate the local correlation spectrogram, whose dimensions C, F' and T' denote the channel, spectrum and time dimensions respectively. The local correlation spectrogram greatly expands the spectral information of each frame and provides a larger feature-learning space for the network.
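As a concrete illustration of the frame-similarity search and stacking described above, the following Python sketch computes the score matrix and builds per-frame local correlation spectrograms. It is a minimal sketch only: the function name, the [M, D] tensor layout and the parameter k are assumptions for illustration, not the patented implementation.

```python
import torch

def local_correlation_spectrogram(spec: torch.Tensor, k: int) -> torch.Tensor:
    """spec: [M, D] magnitude spectrogram (M frames, D frequency points).
    Returns a [M, D, k] tensor: for every query frame, the spectra of its
    k most related frames stacked along a new time dimension, i.e. one
    [D, k] local correlation spectrogram per frame."""
    sq = (spec ** 2).sum(dim=1)                  # sum of squares of each frame, [M]
    dist = sq.unsqueeze(1) + sq.unsqueeze(0)     # pairwise squared-norm term, [M, M]
    score = dist - 2.0 * spec @ spec.T           # score(Q, K) with Q = K = spec
    # the text selects the k indices with the maximum attention scores
    _, idx = torch.topk(score, k, dim=1, largest=True)   # [M, k]
    return spec[idx].permute(0, 2, 1)            # stack the spectra along the time dim
```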
Step 1.2 first packet convolution attention network
Referring to fig. 2, the complete packet convolution attention network consists of four layers of carrier feature map modules and packet attention modules.
Step 1.2.1 Carrier feature map Module
First, the local correlation spectrogram passes through a two-dimensional convolution layer and a maximum pooling layer; the resulting feature map is denoted X.
The carrier feature map module comprises:
a down-sampling operation performed on X, producing M_down by a residual block f_C-B-R(·), consisting of a two-dimensional convolution layer, a batch normalization layer and a ReLU layer, together with MaxPool(·), the max-pooling (down-sampling) operation;
an up-sampling operation performed on M_down, producing M_up, where UpSample(·) denotes the up-sampling operation, implemented by a bilinear interpolation up-sampling layer.
Through this feed-forward and feedback process of down-sampling and up-sampling, the network is prompted to learn more essential characteristics of the spectrogram, so that M_up, used as the carrier of the grouping attention, can better guide feature learning.
Step 1.2.2 regarding the group attention module, comprising:
A. The carrier feature map module output M_up is divided into G groups along the channel dimension: M_up = [X_1, ..., X_G], so that each group X_i contains C/G channels.
B. Any group X_i is divided into three blocks: X_i = [X_i^c, X_i^f, X_i^t], j ∈ [c, f, t], which are used to generate the channel attention feature map, the frequency spectrum attention feature map and the time attention feature map respectively;
the process of generating the channel attention feature map comprises the following steps: will be provided with
Figure BDA0003298117730000086
Performing a global average pooling operation
Figure BDA0003298117730000087
Generating channel statistics
Figure BDA0003298117730000088
Figure BDA0003298117730000089
Wherein the content of the first and second substances,
Figure BDA00032981177300000810
represents
Figure BDA00032981177300000811
Frequency point (c).
Channel statistical information scThrough the full connection layer
Figure BDA00032981177300000812
Generating channel attention weight distribution by using Sigmoid activation function, and further generating channel attention feature map
Figure BDA00032981177300000813
Figure BDA00032981177300000814
Wherein the content of the first and second substances,
Figure BDA00032981177300000815
representing the extension of the channel attention weight distribution dimension to
Figure BDA00032981177300000816
After the same, the dot product operation is performed with the same.
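A compact sketch of this channel attention branch (global average pooling, fully connected layer, Sigmoid, point-wise scaling) is given below; the single Linear layer and the channel count are placeholders, not the patented configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch: F_gp -> F_fc -> Sigmoid -> scale X_i^c."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling F_gp
        self.fc = nn.Linear(channels, channels)  # fully connected layer F_fc
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_c: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x_c.shape
        s_c = self.gap(x_c).view(b, c)                   # channel statistics s_c
        w = self.sigmoid(self.fc(s_c)).view(b, c, 1, 1)  # channel attention weights
        return x_c * w                                   # channel attention feature map
```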
The process of generating the frequency spectrum attention feature map comprises the following steps: spectrum statistics s_f are generated from X_i^f; s_f then undergoes a group normalization operation, passes through a fully connected layer, and a Sigmoid activation function generates the spectrum attention weight distribution, producing the frequency spectrum attention feature map.
The process of generating the time attention feature map is similar to the spectrum attention: time statistics s_t are generated from X_i^t; s_t then undergoes a group normalization operation, passes through a fully connected layer, and a Sigmoid activation function generates the time attention weight distribution, producing the time attention feature map.
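The spectrum and time branches share the group normalization, fully connected, Sigmoid pattern. The sketch below applies GroupNorm directly to the branch input and uses a 1x1 convolution as the fully connected layer; both choices are assumptions made here for illustration, since the exact statistic computation is not spelled out.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Shared pattern for the spectrum and time branches:
    group normalization -> fully connected (1x1 conv) -> Sigmoid -> scale."""
    def __init__(self, channels: int, gn_groups: int = 1):
        super().__init__()
        # channels must be divisible by gn_groups
        self.gn = nn.GroupNorm(gn_groups, channels)             # group standardization
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)  # F_fc as a 1x1 conv
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_j: torch.Tensor) -> torch.Tensor:
        s_j = self.gn(x_j)                 # spectrum / time statistics
        w = self.sigmoid(self.fc(s_j))     # attention weight distribution
        return x_j * w                     # spectrum or time attention feature map

# the same module is instantiated once for the spectrum branch (j = f)
# and once for the time branch (j = t)
```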
C. The channel attention feature map, the frequency spectrum attention feature map and the time attention feature map are concatenated along the channel dimension (Concat[·]) to generate the complete group feature map X_i'.
Finally, the feature maps of the G groups are aggregated along the channel dimension to generate the aggregated attention feature map of the first layer Y_1, whose channel number is C = G × N, N being the number of channels per group.
D. The computation of each attention feature map only needs part of the channel information; in particular, the time-frequency attention does not need all channels. However, when multiple grouped convolution attention modules are stacked, the output of a given channel group depends only on the information of the corresponding input channel group, which weakens the information flow across the whole set of channel groups. A channel shuffling operation is therefore added at this step, so that every group in the subsequent packet convolution attention module receives sufficient information.
The channel shuffling operation is: the aggregated attention feature map Y_1 is expanded and transposed into [N, G, F, T] in the channel dimension and then flattened back to the channel dimension [C, F, T], serving as the input of the next layer. The carrier feature map module and the group attention module are shown in fig. 3.
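The channel shuffling operation admits a compact implementation by reshaping and transposing the channel dimension, as in this sketch (the [B, C, F, T] tensor layout is assumed for illustration):

```python
import torch

def channel_shuffle(y: torch.Tensor, groups: int) -> torch.Tensor:
    """y: [B, C, F, T] aggregated attention feature map with C = groups * N.
    Expand C into [G, N], transpose to [N, G], then flatten back to C so the
    next layer's channel groups receive information from every group."""
    b, c, f, t = y.shape
    n = c // groups
    y = y.view(b, groups, n, f, t).transpose(1, 2).contiguous()
    return y.view(b, c, f, t)
```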
Step 1.2.3: steps 1.2.1 to 1.2.2 are repeated three times, so that each layer outputs an aggregated feature map Y_l (l = 1, ..., 4).
Finally, an average pooling operation reduces the size of Y_4 to [C, 1] for the final detection.
Step 1.2.4 non-speech segment detection
The detection operation classifies each segment as speech or non-speech; it consists of a fully connected layer and a Sigmoid layer, and the output layer has a single neuron.
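A minimal sketch of this detection head follows, assuming the average-pooled fourth-layer feature map is flattened to a C-dimensional vector before the fully connected layer; the class name and channel count are illustrative placeholders.

```python
import torch
import torch.nn as nn

class VADHead(nn.Module):
    """Speech / non-speech detection: average pooling -> FC -> Sigmoid (1 output neuron)."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # reduce Y_4 to [C, 1]
        self.fc = nn.Linear(channels, 1)      # output layer with one neuron
        self.sigmoid = nn.Sigmoid()

    def forward(self, y4: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = y4.shape
        v = self.pool(y4).view(b, c)
        return self.sigmoid(self.fc(v))       # speech / non-speech probability
```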
After the non-speech segment detection is finished, the detected non-speech segments are screened and spliced into one complete non-speech audio, which serves as the new audio to be tested.
Step S2: extracting the random spectral features (RSFs) of the non-speech segments. Assuming the device under test is a linear time-invariant system, the influence of the recording device on the speech can be modeled as the convolution of the original speech with the device impulse response. Since the spectrum of any speech segment is the product of the spectrum of the original speech signal and the device frequency response, the identity of each recording device is embedded in the speech. Based on this assumption, stochastic spectral features (RSFs) can be used as the inherent track of the recording device under test.
The extracting of the Random Spectral Features (RSFs) of the non-speech segments comprises:
first, the non-speech (noise) segment signal is windowed and framed, with a frame length of 64 ms and a frame shift of 32 ms, and a 2048-point spectrum of each frame, denoted x(t), is obtained through the fast Fourier transform;
the short-time power spectrum w of the non-speech segment signal is then computed; w represents the distribution of the average power of the non-speech segment signal over the frequency domain, i.e. how the power per unit frequency band varies with frequency;
the logarithm of the power spectrogram w is taken and averaged along the time axis to obtain a 2048-dimensional average power spectrum;
an orthogonal random Gaussian matrix of size d × 2048 is adopted, and the average power spectrum is reduced by matrix multiplication to dimension d < 2048, giving the d-dimensional RSF parameters of the noise signal;
a two-dimensional stochastic spectral feature map is generated from the RSF parameters, where A represents amplitude and the map describes how the average power varies with frequency. The complete extraction process is shown in fig. 4.
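The RSF extraction steps can be summarized in a short numpy sketch (64 ms frames, 32 ms shift and a 2048-point FFT as stated above; the Hann window, the target dimension d and the random seed are assumptions for illustration):

```python
import numpy as np

def extract_rsf(x: np.ndarray, sr: int, d: int = 128, seed: int = 0) -> np.ndarray:
    """x: non-speech (noise) segment signal, sr: sampling rate.
    Returns the d-dimensional RSF parameters of the segment."""
    frame_len, hop, n_fft = int(0.064 * sr), int(0.032 * sr), 2048
    window = np.hanning(frame_len)
    frames = np.stack([x[i:i + frame_len] * window
                       for i in range(0, len(x) - frame_len + 1, hop)])
    power = np.abs(np.fft.fft(frames, n=n_fft, axis=1)) ** 2   # short-time power spectrum
    avg_log_power = np.log(power + 1e-12).mean(axis=0)         # 2048-dim average log power spectrum
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_fft, d))
    q, _ = np.linalg.qr(g)                                     # orthonormal columns, [2048, d]
    return q.T @ avg_log_power                                 # d x 2048 projection -> d-dim RSFs
```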
Step S3, identifying the recording device with the second packet convolutional attention network
The process of using the second packet convolutional attention network for recording device identification follows the same operations as steps 1.2.1 to 1.2.4, except that in the packet attention module:
after the carrier feature map module output M_up is divided into G groups, any group X_i is divided into two blocks: X_i = [X_i^c, X_i^r], j ∈ [c, r], where X_i^r is used to generate the stochastic spectral attention feature map:
spatial statistics of the random spectrum s_r are generated from X_i^r; s_r then undergoes a group normalization operation and passes through a fully connected layer F_fc(·) that enhances the representation of s_r; a Sigmoid activation function generates the stochastic spectral attention weight values, producing the stochastic spectral attention feature map.
The packet attention module of the second packet convolutional attention network is shown in fig. 5.
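A sketch of this two-branch group attention is given below. The channel branch repeats the global average pooling, fully connected, Sigmoid pattern described earlier; the 1x1-convolution fully connected layer, the GroupNorm group count and the even channel split are assumptions for illustration, not the patented configuration.

```python
import torch
import torch.nn as nn

class TwoBranchGroupAttention(nn.Module):
    """Group attention of the second network: each group X_i is split into a
    channel branch X_i^c and a stochastic-spectrum branch X_i^r."""
    def __init__(self, group_channels: int, gn_groups: int = 1):
        super().__init__()
        half = group_channels // 2
        # channel branch: global average pooling -> FC -> Sigmoid
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc_c = nn.Linear(half, half)
        # stochastic-spectrum branch: GroupNorm -> FC (1x1 conv) -> Sigmoid
        self.gn = nn.GroupNorm(gn_groups, half)
        self.fc_r = nn.Conv2d(half, half, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_i: torch.Tensor) -> torch.Tensor:
        x_c, x_r = x_i.chunk(2, dim=1)                      # split the group into two blocks
        b, c, _, _ = x_c.shape
        s_c = self.gap(x_c).view(b, c)                      # channel statistics
        out_c = x_c * self.sigmoid(self.fc_c(s_c)).view(b, c, 1, 1)
        s_r = self.gn(x_r)                                  # spatial statistics of the random spectrum
        out_r = x_r * self.sigmoid(self.fc_r(s_r))          # stochastic-spectrum attention
        return torch.cat([out_c, out_r], dim=1)             # complete group feature map
```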
In the final recording device detection stage, the number of neurons in the output layer is set to the number of recording device types used for training; the Softmax layer outputs the predicted probability of each type, and the index of the maximum value is the final recognition type of that frame. Given the recognition results of all frames, the type with the largest proportion is determined as the number of the recording device to which the audio under test belongs.
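This decision rule (per-frame Softmax followed by a majority vote over all frames) can be sketched as follows; the function name and the frame_logits layout are illustrative placeholders.

```python
import torch

def identify_device(frame_logits: torch.Tensor) -> int:
    """frame_logits: [num_frames, num_devices] outputs of the second network.
    Each frame is classified by the arg-max of its Softmax probabilities, and
    the device label occurring most often across frames is returned."""
    probs = torch.softmax(frame_logits, dim=1)
    frame_labels = probs.argmax(dim=1)                 # per-frame identification
    return int(torch.bincount(frame_labels).argmax())  # majority vote over frames
```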
Example two
It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Example four
This embodiment provides, in a second aspect, a recording device identification system based on a packet convolution attention network, comprising:
a non-speech segment detection module configured to: detecting the non-speech section of the audio to be detected by utilizing the first packet convolution attention network, screening the non-speech section of the audio to be detected after the non-speech section is detected, and splicing the non-speech section into a complete non-speech section audio serving as a new audio to be detected;
a sound recording device identification module configured to: and extracting random spectral characteristic features serving as inherent tracks of the recording equipment to be detected from the new audio to be detected, and identifying the recording equipment by utilizing a second grouped convolution attention network based on the features.
The first packet convolutional attention network consists of four layers of carrier feature map modules and packet attention modules; the specific implementation and functions of the modules are described in the first embodiment and are not repeated here.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. The method for identifying the recording equipment based on the packet convolution attention network is characterized by comprising the following steps:
detecting the non-speech section of the audio to be detected by utilizing the first packet convolution attention network, screening the non-speech section of the audio to be detected after the non-speech section is detected, and splicing the non-speech section into a complete non-speech section audio serving as a new audio to be detected;
and extracting random spectral characteristic features serving as inherent tracks of the recording equipment to be detected from the new audio to be detected, and identifying the recording equipment by utilizing a second grouped convolution attention network based on the features.
2. The method of claim 1, wherein the detecting the non-speech segment of the audio to be tested is performed based on the generated local correlation spectrogram;
further, the step of generating the local correlation spectrogram is as follows:
windowing and framing the complete audio to obtain a spectrogram of the complete audio;
processing the spectrogram by using a self-attention locality sensitive hashing algorithm: and selecting corresponding k indexes representing similar frame positions according to the attention score value, and stacking the frequency spectrums corresponding to the k positions along the time dimension to generate a local correlation spectrogram.
3. The method for packet convolutional attention network based audio recording device identification as claimed in claim 1 wherein the first packet convolutional attention network is configured to perform:
carrying out down-sampling feed-forward operation on the local correlation spectrogram;
performing feedback operation of up-sampling based on the down-sampling operation result;
and taking the result of the up-sampling operation as a carrier of a grouping attention module for guiding feature learning, so that the network learns more essential features of the spectrogram.
4. The method for packet convolutional attention network based audio recording device identification of claim 3 wherein the first packet convolutional attention network is further configured to perform:
the results of the upsampling operation are divided into a plurality of groups;
for any group, the group is divided into three branches which are respectively used for generating a channel attention feature map, a frequency spectrum attention feature map and a time attention feature map;
performing splicing operation on the channel attention feature map, the frequency spectrum attention feature map and the time attention feature map along the channel dimension to generate a complete group feature map;
aggregating the multiple groups of group feature maps along the channel dimension to generate an aggregated attention feature map of a first layer, wherein the aggregated attention feature map has the same dimensions as the local correlation spectrogram;
the aggregated attention feature map is subjected to channel shuffling operation, so that the information loss influence caused by channel grouping is eliminated;
repeating the first packet convolutional attention network operation to generate 4 aggregated attention feature maps;
and finally, reducing the size of the aggregation feature map of the fourth layer through average pooling operation, and detecting the non-speech segments.
5. The method for audio recording device identification based on packet convolution attention network of claim 4 wherein the process of generating the channel attention feature map is:
performing global average pooling operation on a first channel branch in the three branches to generate channel statistical information;
the channel statistical information passes through a full connection layer, and an activation function is used for generating channel attention weight distribution so as to generate a channel attention feature map;
the process of generating the spectrum attention feature map comprises the following steps: after the second, frequency spectrum branch of the three branches is used for generating frequency spectrum statistical information, the frequency spectrum statistical information is subjected to a group standardization operation; after passing through a full connection layer, an activation function generates a spectrum attention weight distribution, and a frequency spectrum attention feature map is generated;
the process of generating the time attention feature map comprises the following steps: and after the third time branch in the three branches is used for generating time statistical information, carrying out group standardization operation on the time statistical information, generating time attention weight distribution by using an activation function after passing through a full connection layer, and further generating a time attention feature map.
6. The method for sound recording device identification based on packet convolution attention network as claimed in claim 1, wherein said extracting the random spectral feature of the non-speech segment includes:
firstly, windowing and framing a noise section signal, and obtaining a spectrogram through fast Fourier transform;
calculating a short-time power spectrum of the non-speech segment signals, taking the logarithm of the power spectrum and averaging along a time axis to obtain an average power spectrum;
adopting an orthogonal random Gaussian matrix, and reducing the dimension of the average power spectrum through matrix multiplication to obtain random spectrum characteristic parameters of the noise signal;
and generating a two-dimensional random spectral characteristic map based on the random spectral characteristic parameters.
7. The method of claim 4, wherein the recording device identification is performed with a second packet convolutional attention network; the number of neurons in the output layer of the second packet convolutional attention network is the number of recording device types used for training, the Softmax layer outputs the predicted probability of each type, and the index of the maximum value is the final identification type of the frame;
given the identification results of all frames, the type with the largest proportion is determined as the number of the recording device to which the audio under test belongs.
8. Recording equipment identification system based on grouping convolution attention network is characterized by comprising:
a non-speech segment detection module configured to: detect the non-speech segments of the audio to be tested with the first packet convolution attention network, and, after the non-speech segment detection is finished, screen the detected non-speech segments and splice them into one complete non-speech audio serving as the new audio to be tested;
a recording device identification module configured to: extract, from the new audio to be tested, the random spectral features serving as the inherent track of the recording device to be tested, and identify the recording device with a second packet convolution attention network based on these features.
9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of the preceding claims 1 to 7.
CN202111183247.8A 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network Pending CN113921041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111183247.8A CN113921041A (en) 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111183247.8A CN113921041A (en) 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network

Publications (1)

Publication Number Publication Date
CN113921041A true CN113921041A (en) 2022-01-11

Family

ID=79239310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111183247.8A Pending CN113921041A (en) 2021-10-11 2021-10-11 Recording equipment identification method and system based on packet convolution attention network

Country Status (1)

Country Link
CN (1) CN113921041A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596879A (en) * 2022-03-25 2022-06-07 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
CN114596879B (en) * 2022-03-25 2022-12-30 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
WO2024040601A1 (en) * 2022-08-26 2024-02-29 Intel Corporation Head architecture for deep neural network (dnn)

Similar Documents

Publication Publication Date Title
Becker et al. Interpreting and explaining deep neural networks for classification of audio signals
CN110516305B (en) Intelligent fault diagnosis method under small sample based on attention mechanism meta-learning model
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN108986798B (en) Processing method, device and the equipment of voice data
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN113921041A (en) Recording equipment identification method and system based on packet convolution attention network
CN104424290A (en) Voice based question-answering system and method for interactive voice system
CN111341319B (en) Audio scene identification method and system based on local texture features
WO2021075063A1 (en) Neural network-based signal processing apparatus, neural network-based signal processing method, and computer-readable storage medium
CN113205820B (en) Method for generating voice coder for voice event detection
CN111341294A (en) Method for converting text into voice with specified style
CN113409827B (en) Voice endpoint detection method and system based on local convolution block attention network
Xie et al. KD-CLDNN: Lightweight automatic recognition model based on bird vocalization
CN114582325A (en) Audio detection method and device, computer equipment and storage medium
Imran et al. An analysis of audio classification techniques using deep learning architectures
JP2010515085A (en) Audio segmentation method and apparatus
Zhu et al. Speech emotion recognition using semi-supervised learning with efficient labeling strategies
CN112735466A (en) Audio detection method and device
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Wang et al. Revealing the processing history of pitch-shifted voice using CNNs
CN115083422B (en) Voice traceability evidence obtaining method and device, equipment and storage medium
CN116312628A (en) False audio detection method and system based on self knowledge distillation
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination