CN113129908B - End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion


Info

Publication number
CN113129908B
CN113129908B · CN202110313689.3A
Authority
CN
China
Prior art keywords
macaque
frame
voice
fusion
feature
Prior art date
Legal status
Active
Application number
CN202110313689.3A
Other languages
Chinese (zh)
Other versions
CN113129908A (en)
Inventor
Li Songbin (李松斌)
Tang Jigang (唐计刚)
Liu Peng (刘鹏)
Current Assignee
Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences
Original Assignee
Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences filed Critical Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences
Priority to CN202110313689.3A
Publication of CN113129908A
Application granted
Publication of CN113129908B


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion. The method comprises the following steps: preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments; and inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification. The macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence. The backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps the frame-level features into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments.

Description

End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Technical Field
The invention relates to the technical field of computers, in particular to an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
Background
Primates are facing a serious survival crisis. To protect primates effectively, it is important to understand the activity ranges of individual animals and the changes in their populations, both of which rely on individual animal verification and tracking. Individual animal verification, as a piece of basic research, is thus an important foundation for individual animal tracking and has significant research value.
The commonly used techniques for individual animal verification are manual observation, DNA fingerprinting, marking, image-based verification and voice-based verification. Most primates live in mountain forests, where effective visual observation is difficult; they are also highly alert and hard for humans to approach, which makes direct observation, DNA fingerprinting and marking difficult to implement.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
To achieve the above object, the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, the method comprising:
preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence; the backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps them into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments.
As an improvement of the above method, the preprocessing of the macaque voice pair to be verified specifically comprises:
removing the silent segments from the two voice segments to be verified to obtain preprocessed voice-format data.
As an improvement of the above method, the input of the backbone network is the preprocessed voice-format data and the output is a frame-level feature vector; the backbone network comprises, connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 1×1 channel-conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing is as follows:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice-format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension; the 1×1 channel-conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
As an improvement of the above method, the input of the feature fusion network is the frame-level feature vectors and the output is fused frame feature vectors. The feature fusion network comprises a cyclic frame interception and grouping unit and a channel group. The cyclic frame interception and grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F and groups F by a preset step length to obtain a grouped representation FG of the feature sequence; the channel group maps FG into fused frame features based on a channel-weighted fusion mechanism. The channel group comprises a plurality of CFM layers, each comprising a first branch and a second branch connected in parallel; the specific processing of a CFM layer is as follows:
the grouped frame-level feature vectors are transposed by the first branch to obtain the first part of the fused frame features; the grouped frame-level feature vectors are processed by the 2 FC layers of the second branch and then pass through 1 max pooling layer and 1 average pooling layer respectively, the outputs of the max pooling layer and the average pooling layer are added element-wise, and a sigmoid layer then performs activation to obtain the second part of the fused frame features; and the dot product of the first part and the second part yields the fused frame feature vector.
As an improvement of the above method, the input of the feature compression network is the fused frame feature vectors and the output is a sentence-level feature vector of dimension d; the feature compression network comprises 1 gated recurrent unit and 1 fully connected layer connected in sequence;
the output of the feature compression network is a sentence-level feature vector e of dimension d:

e = h(x), e ∈ R^d, len(x) = l

where e is a vector of d real numbers, x is the frame-level feature vector matrix, l is the length of x, h(·) is the embedding mapping function, and len(·) gives the number of frames in the feature vector matrix.
As an improvement of the above method, the method further comprises training and testing steps for the macaque voiceprint verification model, specifically:
step 1) preprocessing the voice segments of a macaque corpus, the preprocessed macaque corpus comprising corpora of T macaques;
step 2) randomly selecting the voice segments of T1 macaques from the preprocessed macaque corpus as the training corpus, the voice segments of the remaining T-T1 macaques serving as the test corpus;
step 3) selecting data from the training corpus to build a training set, randomly divided into q groups, each group comprising m macaques with n voice segments each;
step 4) extracting equal numbers of positive and negative sample voice pairs from the test corpus to form a test set, a positive sample voice pair being two different voices of the same macaque and a negative sample voice pair being voices of two different macaques;
step 5) sequentially inputting the q groups of training data into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with slope -0.3 and training with AMSGrad; computing the loss function, back-propagating the loss value through the back-propagation algorithm and updating the network parameters, one training period being completed after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period and computing the accuracy result for the current training period;
step 7) repeating step 5) and step 6) until P training periods are completed, and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
As an improvement of the above method, the specific process of calculating the loss function is:
according to the sentence-level features of each group of macaque voice segments, the cosine distance dist(A, A') is calculated as:

dist(A, A') = (A · A') / (‖A‖ ‖A'‖)

where A denotes one sentence-level feature, A' denotes another sentence-level feature, and ‖·‖ denotes the second-order (L2) norm;
from the cosine distance dist(A, A'), the intra-class loss and the inter-class loss are calculated as:

S_{ji,k} = w · dist(e_{ji}, e_k) + b

where j indexes a voice segment of the ith macaque and k indexes another voice segment; j = k means the two segments belong to the same macaque and the calculated loss value S_{ji,k} is an intra-class loss; j ≠ k means the two segments belong to different macaques and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value and b an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is then calculated as:

Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
An end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: a trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module preprocesses the macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
the verification module inputs the preprocessed macaque voice pair into the pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification.
Compared with the prior art, the invention has the following advantages:
1. the method processes macaque voices into voice pairs; the designed backbone network automatically extracts frame-level features from the sampled data of a macaque voice pair, the designed feature fusion network maps the frame-level features into fused frame features, and the feature compression network then compresses the fused frame features to obtain the sentence-level features corresponding to the macaque voice segments;
2. the macaque voiceprint verification method realizes the shift from a closed-set multi-classification model to a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the definition of the individual animal verification task is converted from a multi-classification task into feature representation. The voiceprint verification algorithm can therefore serve more application scenarios, and the trained model can verify or identify unknown targets;
3. the method can be applied to voiceprint verification of macaques and, subsequently, to voiceprint identification of other animals, for which it has guiding significance.
Drawings
FIG. 1 is a schematic flow chart of the end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion provided by the present invention;
FIG. 2 is a schematic structural diagram of the backbone network provided by the present invention;
FIG. 3 is a schematic diagram of the overall structure of the end-to-end macaque voiceprint verification network based on cyclic frame-level feature fusion provided by the present invention;
FIG. 4 is a block diagram of the end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion provided by the present invention.
Reference numerals
410 — data processing module; 420 — backbone network; 430 — feature fusion network;
440 — feature compression network; 450 — network parameter update module; 460 — verification module
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, embodiment 1 of the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, comprising the following steps:
Step 110: selecting voice pairs from the preprocessed macaque corpus according to rules to construct a training set and a test set. The training set data is randomly divided into q groups, each group comprising the voice segments of m macaques.
In the prior art, feature extraction is usually performed on the voice data in the preprocessing stage to obtain MFCC, LPC or spectrogram features for a classification model. The preprocessing of the present invention only intercepts the effective voice segments, i.e. cuts the silent segments out of the original voice, without extracting features; the preprocessed macaque corpus therefore remains data in voice (waveform) format.
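As a concrete illustration, this silence-cutting step can be sketched as follows in Python; the use of librosa and the 25 dB threshold are assumptions for illustration, since the patent only states that silent segments are removed while the data stay in voice format.

import numpy as np
import librosa

def trim_silence(wav_path, top_db=25):
    """Cut silent segments out of a recording, keeping raw waveform data.

    top_db is an assumed energy threshold; the patent does not specify one.
    """
    y, sr = librosa.load(wav_path, sr=None)               # keep native sample rate
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent (start, end) pairs
    if len(intervals) == 0:
        return y, sr
    voiced = np.concatenate([y[s:e] for s, e in intervals])
    return voiced, sr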
Step 120: randomly reading a group of macaque voice segments from the training set, inputting them into the backbone network and performing feature extraction to obtain the frame-level feature vectors of the macaque voice segments.
The backbone network provided by the invention is designed based on CNNs; a feature extraction network such as VGG or ResNet could also be adopted. The invention designs a new backbone network combining SincNet, ResNet and Transformers to extract the frame-level features of macaque voice.
Step 130: performing cyclic frame interception on the frame-level feature vectors output by the backbone network, inputting them into the feature fusion network and performing feature fusion to obtain the fused frame feature vectors of the macaque voice segments.
The feature fusion network provided by the invention is designed based on a channel fusion mechanism, mapping the frame-level features into fused frame features through weighted fusion of several adjacent channel features.
Step 140: inputting the fused frame feature vectors output by the feature fusion network into the feature compression network and performing feature compression to obtain the sentence-level feature vectors corresponding to the macaque voice segments.
Step 150: computing the intra-class and inter-class losses from the sentence-level feature vectors of the group of macaque voice segments using the cosine distance between vectors, and updating the parameters of the backbone network, the feature fusion network and the feature compression network with AMSGrad.
Step 160: repeating steps 120-150 iteratively until the trained network achieves the highest accuracy on the test set, thereby obtaining the optimal parameter combination of the network.
A group of data is repeatedly and randomly selected from the constructed training set and input into the backbone network, and steps 120 to 150 are executed repeatedly; a training period is finished when all the data in the training set have been used once. Each time a training period finishes, the network model is tested once on the test set and the test accuracy is recorded. When the preset number of training periods has been completed, the network model parameters that achieved the highest test accuracy are taken as the optimal parameter combination of the network.
Step 170: verifying whether any macaque voice pair comes from the same individual based on the network with the optimal parameters.
The network model with its parameters set to the optimal combination is used to examine macaque voice pairs outside the training and test sets; it can judge whether the two voices in a pair were produced by the same macaque, realizing macaque voiceprint verification.
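A minimal sketch of this verification step, assuming a PyTorch `model` that maps a preprocessed waveform tensor to a d-dimensional sentence-level embedding; the decision threshold is an illustrative assumption, as the patent does not state one.

import torch
import torch.nn.functional as F

def verify_pair(model, wav_a, wav_b, threshold=0.5):
    """Decide whether two preprocessed voice segments come from the same macaque."""
    model.eval()
    with torch.no_grad():
        e_a = model(wav_a.unsqueeze(0)).squeeze(0)   # sentence-level embedding of segment A
        e_b = model(wav_b.unsqueeze(0)).squeeze(0)   # sentence-level embedding of segment B
    score = F.cosine_similarity(e_a, e_b, dim=0).item()
    return score >= threshold, score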
By processing macaque voice into voice pairs, this embodiment automatically extracts frame-level features from the sampled data of a macaque voice pair through the designed backbone network, maps the frame-level features into fused frame features with the designed feature fusion network, and then compresses the fused frame features with the feature compression network to obtain the sentence-level features corresponding to the macaque voice segments.
Optionally, assuming the preprocessed macaque corpus contains the sounds of T macaques, T1 macaque corpora are randomly selected as the training corpus and the remaining T-T1 macaque corpora serve as the test corpus; positive and negative sample pairs are extracted from the training corpus and the test corpus respectively to construct the training set and the test set. A positive sample pair is two voice segments randomly selected from the corpus of the same macaque; a negative sample pair is one voice segment randomly selected from each of two different macaque corpora.
For example, the corpus comprises the voices of 144 macaques aged 0-2 years; each macaque contributes 5-30 minutes of voice, 2143.11 minutes in total, of which 171.35 minutes of effective voice remain after cutting the silent segments, for a total of 18309 macaque call segments. During training, the first 100 macaque corpora can be selected as the training set and the remaining 44 as the test set. A positive sample voice pair is constructed by randomly selecting two voice segments from one macaque corpus. A negative sample voice pair is constructed by randomly selecting two targets (i.e. the voice file directories of two different macaques) from the test corpus and then randomly selecting one voice segment from each of the two target corpora. Finally, 80000 voice pairs are constructed as the test set for macaque voiceprint verification, with a 1:1 ratio of positive to negative samples, i.e. 40000 positive and 40000 negative sample voice pairs. Grouping the 80000 voice pairs by a preset number b yields n = 80000/b groups of test data. The test set extracts voice pairs in the same manner as the training set.
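The pair-construction rule described above can be sketched as follows; `corpus` (a mapping from macaque ID to its list of preprocessed call segments) is a hypothetical structure introduced for illustration, and each macaque is assumed to have at least two segments.

import random

def build_pairs(corpus, n_pairs=80000):
    """Build a 1:1 mix of positive and negative voice pairs.

    corpus: dict mapping macaque id -> list of that macaque's call segments.
    Returns (segment_a, segment_b, label) tuples, label 1 for the same individual.
    """
    ids = list(corpus)
    pairs = []
    for _ in range(n_pairs // 2):
        same = random.choice(ids)
        a, b = random.sample(corpus[same], 2)        # two different calls, same macaque
        pairs.append((a, b, 1))
        m1, m2 = random.sample(ids, 2)               # two different macaques
        pairs.append((random.choice(corpus[m1]), random.choice(corpus[m2]), 0))
    random.shuffle(pairs)
    return pairs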
When testing across data sets, to handle the differing voice durations of different data sets, the test voice can be compressed or expanded to a fixed length L by cutting or copying. Expressed as a formula:

x' = x[1:L],         len(x) ≥ L
x' = repeat(x)[1:L], len(x) < L
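One way to read that cutting-or-copying rule in code, sketched under the assumption that `x` is a one-dimensional sample array:

import numpy as np

def to_fixed_length(x, target_len):
    """Crop a long segment; tile (copy) and then crop a short one."""
    if len(x) >= target_len:
        return x[:target_len]
    reps = int(np.ceil(target_len / len(x)))
    return np.tile(x, reps)[:target_len]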
optionally, intra-class loss and inter-class loss calculation is performed according to sentence-level features of each group of macaque speech segments by using cosine distances of vectors. Wherein the cosine distance calculation of the vector is represented as:
Figure BDA0002990277960000071
wherein, A represents a sentence-level feature, A' represents another sentence-level feature, | | | | | represents a second-order norm;
the intra-class loss and inter-class loss calculations are expressed as:
Figure BDA0002990277960000072
wherein, A represents the jth section pronunciation of ith kiwi fruit, and k represents another pronunciation section in the pronunciation centering, and it indicates that two pronunciation sections belong to same kiwi fruit to note A', j ═ k, calculates and obtains the loss in the class, otherwise, indicates that two pronunciation sections belong to different kiwi fruits, represents another pronunciation section with B, calculates and obtains the loss between the class. The calculation of the total loss function of the network constructed by the invention is represented as follows:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
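A hedged sketch of this loss in PyTorch. The S_{ji,k} = w·dist(·,·) + b formulation mirrors the generalized end-to-end (GE2E) family of losses, so the per-individual centroids and the initial values of w and b below are assumptions for illustration rather than details stated in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseCosineLoss(nn.Module):
    """GE2E-style loss: S_{ji,k} = w * cos(e_ji, c_k) + b, softmax over k."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))   # learnable weight w (assumed init)
        self.b = nn.Parameter(torch.tensor(-5.0))   # learnable offset b (assumed init)

    def forward(self, emb):
        # emb: (m, n, d) = macaques x segments per macaque x embedding dim
        m, n, _ = emb.shape
        cent = emb.mean(dim=1)                      # (m, d) per-macaque centroid
        sim = F.cosine_similarity(emb.unsqueeze(2),              # (m, n, 1, d)
                                  cent.unsqueeze(0).unsqueeze(0),  # (1, 1, m, d)
                                  dim=-1)                        # -> (m, n, m)
        s = self.w * sim + self.b                   # S_{ji,k}
        target = torch.arange(m, device=emb.device).repeat_interleave(n)
        # cross-entropy reproduces -S_{ji,j} + log sum_k exp(S_{ji,k})
        return F.cross_entropy(s.reshape(m * n, m), target)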
VGG and ResNet are widely applied to speaker verification, but different input data place different structural requirements on the feature extraction network, so the invention designs a backbone network for extracting features from macaque voices. FIG. 2 is a schematic structural diagram of the backbone network, which is composed of a learnable band-pass filter SincNet layer, 6 residual convolution blocks (ResBlock), 1 channel-conversion convolutional layer and two Transformer blocks.
The SincNet layer is composed, in sequence, of 1 sinc one-dimensional convolution, 1 max pooling layer with a pooling window of 3, 1 BN layer and a LeakyReLU activation. Each ResBlock is composed of two convolution units, 1 max pooling layer and 1 feature-map scaling layer (FMS), where the pooling window of the max pooling layer is 3 and each convolution unit consists, in sequence, of 1 BN, a LeakyReLU and 1 convolution Conv with kernel size 3 and stride 1. The channel-conversion convolutional layer has kernel size 1 and stride 1. Each Transformer block is composed of 2 groups of fully connected units, each consisting of 1 multi-head attention mechanism (MHA) followed by 2 groups of FC and Dropout layers. The network parameters and data dimensions of each layer of the backbone network are shown in Table 1.
Table 1 (network parameters and data dimensions of each layer of the backbone network; presented as an image in the original publication)
FMS is a feature-weight scaling mechanism that assigns a different weight to each frame; without it, the ResBlock would assign every frame the same weight. MHA is a multi-head attention mechanism.
In the embodiment of the invention, a group of macaque voice segments is read randomly from the training set and their voice sample values are input into the backbone network. The SincNet layer converts the time-domain information of the macaque voice into frequency-domain information, and the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension. To facilitate the multi-head design of the Transformer blocks, a 1×1 channel-conversion convolution performs channel conversion on the output features of the residual convolution blocks before the data enter the Transformer blocks; finally, the backbone network outputs a two-dimensional feature map as the input of the feature fusion network.
Through the residual structure composed of one-dimensional convolution residual units and pooling layers, frame-level feature representations can be extracted from the raw voice, features can be transferred between different convolutional layers, and the fusion of earlier and later semantic features during network learning is enhanced. A 1×1 convolutional layer is added to connect the residual module to the Transformer module and to convert the number of feature-map channels, allowing the number of heads to be chosen in the subsequent Transformer design. Adding the Transformer structure further strengthens the feature extraction capability of the backbone network, realizes end-to-end feature extraction, automatically extracts the frame-level features of macaque voice, and removes the need for complex hand-designed feature extraction algorithms.
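A structural sketch of this backbone in PyTorch. The plain Conv1d front end stands in for the learnable sinc band-pass layer, and the channel counts, kernel sizes and FMS form are illustrative assumptions; only the overall layout (band-pass front end → 6 residual blocks with pooling → 1×1 channel conversion → 2 Transformer blocks) follows the text above.

import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """One residual block: two (BN -> LeakyReLU -> Conv) units, max pooling, FMS gate."""
    def __init__(self, ch):
        super().__init__()
        unit = lambda: nn.Sequential(nn.BatchNorm1d(ch), nn.LeakyReLU(0.3),
                                     nn.Conv1d(ch, ch, kernel_size=3, padding=1))
        self.body = nn.Sequential(unit(), unit())
        self.pool = nn.MaxPool1d(3)                       # pooling window 3
        self.fms = nn.Linear(ch, ch)                      # feature-map scaling weights

    def forward(self, x):                                 # x: (B, C, T)
        y = self.pool(x + self.body(x))                   # residual connection + pooling
        w = torch.sigmoid(self.fms(y.mean(dim=2)))        # per-channel gate from global average
        return y * w.unsqueeze(2)

class Backbone(nn.Module):
    def __init__(self, ch=64, d_model=64, n_heads=4):
        super().__init__()
        # stand-in for the learnable band-pass (sinc) convolution layer
        self.front = nn.Conv1d(1, ch, kernel_size=251, padding=125)
        self.res = nn.Sequential(*[ResBlock1D(ch) for _ in range(6)])
        self.chconv = nn.Conv1d(ch, d_model, kernel_size=1)   # 1x1 channel conversion
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)

    def forward(self, wav):                               # wav: (B, 1, samples)
        z = self.chconv(self.res(self.front(wav)))        # (B, d_model, T)
        return self.transformer(z.transpose(1, 2))        # (B, T, d_model) frame-level features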
FIG. 3 is a schematic diagram of the overall structure of the macaque voiceprint verification network according to an embodiment of the present invention. The waveform at the far left of FIG. 3 represents the input voice-pair sample data and CNN represents the backbone network. The frame-level feature vectors output by the backbone network are denoted M = [m_1, m_2, ..., m_T], where m_i ∈ R^n, n is the frame-level feature dimension and T is the number of frames. Cyclic frame interception is performed before the frame-level features output by the backbone network are input into the feature fusion network. Specifically, the frame-level feature vectors are connected end to end to obtain the feature sequence F:

F = [f_1, f_2, ..., f_T, f_1, f_2, ..., f_{c-1}]

Grouping F by a preset step length yields the grouped representation FG of the feature sequence:

FG = ([f_1, f_2, ..., f_c], [f_2, f_3, ..., f_{c+1}], ..., [f_T, f_1, f_2, ..., f_{c-1}])

where c is the number of frames per feature group. The grouped frame-level feature vectors in FG are input into the feature fusion network for feature fusion, yielding the fused frame feature vectors of the macaque voice segment.
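The cyclic interception and grouping above (with the step length assumed to be 1, as in the FG example) can be sketched as:

import torch

def cyclic_group(frames, c):
    """frames: (T, n) frame-level features -> FG: (T, c, n).

    Group i is [f_i, f_{i+1}, ..., f_{i+c-1}] with wrap-around, which is
    equivalent to concatenating the first c-1 frames onto the end of M.
    """
    T = frames.shape[0]
    idx = (torch.arange(T).unsqueeze(1) + torch.arange(c).unsqueeze(0)) % T
    return frames[idx]                                   # advanced indexing gathers groups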
As shown in FIG. 3, to achieve more efficient feature fusion, the invention designs a feature fusion network based on a Channel Fusion Mechanism (CFM). The feature fusion network comprises two branches: the first branch is a transposition operation; the second branch consists, in sequence, of 2 FC layers in series, a max pooling (MaxPool) layer and an average pooling (AvgPool) layer in parallel, and 1 sigmoid layer.
The grouped frame-level feature vectors are transposed by the first branch to obtain the first part of the fused frame features. After being processed by the 2 FC layers of the second branch, the grouped frame-level feature vectors pass through the MaxPool layer and the AvgPool layer respectively; the outputs of the MaxPool and AvgPool layers are added element-wise, and a sigmoid layer then performs activation to obtain the second part of the fused frame features. The dot product of the first and second parts gives the fused frame feature vector of the voice segment.
The calculation process of the fused frame feature vector can be expressed as:

FG_i' = g(f(FG_i)) = w_2(w_1(FG_i) + b_1) + b_2
m = MaxP(FG_i', c)
a = AvgP(FG_i', c)
w_i = σ(m + a)
f_i* = tp(FG_i) · w_i

where f and g are the mapping functions of the two fully connected layers, whose parameters are w_1, b_1 and w_2, b_2 respectively; MaxP and AvgP denote max pooling and average pooling; σ is the sigmoid function; tp denotes transposition; and · denotes the dot product. The ith feature group FG_i is thus mapped to the fused feature vector f_i*, where FG_i ∈ R^{c×n} and f_i* ∈ R^n.
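A sketch of one CFM layer following these formulas; the hidden width of the two FC layers is an assumption, and batched group tensors of shape (B, T, c, n) are used for convenience.

import torch
import torch.nn as nn

class CFMLayer(nn.Module):
    """Map each group FG_i in R^{c x n} to a fused frame feature f_i* in R^n."""

    def __init__(self, n, hidden=None):
        super().__init__()
        hidden = hidden or n
        # the two FC layers f and g with parameters (w1, b1) and (w2, b2)
        self.fc = nn.Sequential(nn.Linear(n, hidden), nn.Linear(hidden, n))

    def forward(self, fg):                       # fg: (B, T, c, n)
        g = self.fc(fg)                          # FG_i' = g(f(FG_i))
        m = g.max(dim=-1).values                 # MaxP over the feature axis -> (B, T, c)
        a = g.mean(dim=-1)                       # AvgP over the feature axis -> (B, T, c)
        w = torch.sigmoid(m + a)                 # w_i = sigma(m + a)
        # f_i* = tp(FG_i) . w_i : weight the c frames of each group and sum -> (B, T, n)
        return torch.einsum('btnc,btc->btn', fg.transpose(2, 3), w)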
In a voiceprint verification task, the correlation between the feature representation of each frame (moment) of the time-frequency two-dimensional features and the final sentence-level features must be captured. The invention therefore proposes a channel-weighted fusion mechanism: a new frame-level feature representation is obtained by weighted fusion of several adjacent channel features, which effectively strengthens the inter-frame correlation of the feature map and realizes the mapping from frame-level features to fused frame features.
As shown in FIG. 3, the feature compression network comprises a gated recurrent unit (GRU) and a fully connected (FC) layer. After being processed by the GRU and the FC layer in sequence, the fused frame feature vectors are mapped into a sentence-level feature vector e of dimension d:

e = h(x), e ∈ R^d, len(x) = l

where e is a vector of d real numbers, x is the frame-level feature vector matrix, l is the length of x, h(·) is the embedding mapping function, and len(·) gives the number of frames in the feature vector matrix. The purpose of the feature compression network is to map the input two-dimensional features into one-dimensional features; this process can be regarded as an embedding of the original voice-segment data.
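The compression step admits a direct sketch: a single GRU whose final hidden state is projected by one FC layer to the d-dimensional embedding (the GRU hidden width is an assumption).

import torch.nn as nn

class FeatureCompression(nn.Module):
    """GRU + FC: map the fused frame sequence (l x n) to a d-dim sentence embedding."""

    def __init__(self, n, d=256, hidden=256):
        super().__init__()
        self.gru = nn.GRU(n, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, d)

    def forward(self, x):                        # x: (B, l, n) fused frame features
        _, h = self.gru(x)                       # h: (1, B, hidden), final hidden state
        return self.fc(h.squeeze(0))             # e = h(x), shape (B, d)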
The network model designed by the invention takes voice data of size m × n × l as input, where m is the number of macaques selected in each group of data, n is the number of voice segments selected from each macaque corpus and l is the number of frames of each voice segment. After processing by the backbone network, the feature fusion network and the feature compression network, a feature output of size m × n × d is obtained, where d is the sentence-level feature dimension and may be set to 256. In addition, a Leaky ReLU activation function with slope -0.3 may be used in the network, the learning rate of the AMSGrad optimizer may be set to 0.001 and its decay rate to 0.0001, the number of macaques per voice group may be set to m = 8, and the number of voice segments per macaque to n = 10.
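In PyTorch these hyperparameters map onto Adam's amsgrad variant; reading the "decay rate" as weight decay is an interpretive assumption, and the tiny Sequential below merely stands in for the full backbone-fusion-compression stack.

import torch
import torch.nn as nn

# placeholder model standing in for backbone + feature fusion + feature compression
model = nn.Sequential(nn.Linear(256, 256),
                      nn.LeakyReLU(negative_slope=0.3),   # "Leaky ReLU with slope -0.3"
                      nn.Linear(256, 256))

optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001,             # learning rate 0.001
                             weight_decay=0.0001,  # assumed reading of the decay rate
                             amsgrad=True)         # enables the AMSGrad variant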
The macaque voiceprint verification method of the invention realizes the shift from a closed-set multi-classification model to a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the definition of the individual animal verification task is converted from a multi-classification task into feature representation. The voiceprint verification algorithm can therefore serve more application scenarios, and the trained model can verify or identify unknown targets.
Example 2
As shown in FIG. 4, embodiment 2 of the present invention provides a macaque voiceprint verification system implemented on the basis of the above macaque voiceprint verification method. In the training and testing stage, the system specifically comprises:
the data processing module 410, configured to select voice pairs from the preprocessed macaque corpus according to rules, construct the training set and the test set, and group them;
the backbone network 420, configured to extract features from the input macaque voice segments to obtain their frame-level feature vectors;
the feature fusion network 430, configured to perform cyclic frame interception on the frame-level feature vectors output by the backbone network and then perform feature fusion to obtain the fused frame feature vectors of the macaque voice segments;
the feature compression network 440, configured to compress the fused frame features output by the feature fusion network to obtain the sentence-level feature vectors corresponding to the macaque voice segments;
the network parameter update module 450, configured to compute the intra-class and inter-class losses from the sentence-level feature vectors of the macaque voice segments using the cosine distance between vectors, update the parameters of the backbone network, the feature fusion network and the feature compression network with AMSGrad, and iterate until the trained network achieves the highest accuracy on the test set, thereby obtaining the optimal parameter combination of the network;
and the verification module 460, configured to verify whether any macaque voice pair belongs to the same individual according to the output of the network with the optimal parameters.
The macaque voiceprint verification system provided by the invention processes macaque voice into voice pairs; the designed backbone network automatically extracts frame-level features from the sampled data of a macaque voice pair, the designed feature fusion network maps the frame-level features into fused frame features, and the feature compression network then compresses the fused frame features to obtain the sentence-level features corresponding to the macaque voice segments.
Optionally, the data processing module 410 further includes:
the corpus dividing unit, configured to randomly select T1 macaque corpora as the training corpus from the T macaque sounds contained in the preprocessed macaque corpus, the remaining T-T1 macaque corpora serving as the test corpus;
and the voice pair extraction unit, configured to extract positive and negative sample pairs from the training corpus and the test corpus respectively to construct the training set and the test set, a positive sample pair being two voice segments randomly selected from the corpus of the same macaque and a negative sample pair being one voice segment randomly selected from each of two different macaque corpora.
The backbone network 420 is designed based on CNNs, and a feature extraction network such as VGG or ResNet may be adopted. The invention designs a new backbone network combining SincNet, ResNet and Transformers to extract the frame-level features of macaque voice; its specific structure is shown in FIG. 2 and Table 1 and is not repeated here. The specific structures of the feature fusion network 430 and the feature compression network 440 are shown in FIG. 3; see the corresponding method description for details, which are likewise not repeated here.
After the training and testing are finished, the trained macaque voiceprint verification model is obtained and the practical application stage begins. The system then comprises: the trained macaque voiceprint verification model, the data processing module 410 and the verification module 460; wherein,
the data processing module preprocesses the macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
the verification module inputs the preprocessed macaque voice pair into the pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification.
The trained macaque voiceprint verification model comprises the backbone network 420, the feature fusion network 430 and the feature compression network 440 connected in sequence.
It should be noted that, for convenience of description, only some, not all, of the content related to the embodiments of the present invention is shown in the drawings. Some example embodiments are described as processes or methods depicted as flow diagrams; although a flow diagram describes operations (or steps) as sequential, many of the operations can be performed in parallel or concurrently, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure.
Finally, it should be noted that the above embodiments are merely used to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art should understand that the technical solutions of the present invention may be modified or replaced with equivalents without departing from their spirit and scope, and all such modifications should be covered by the scope of the claims of the present invention.

Claims (7)

1. An end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, the method comprising:
preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
wherein the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence; the backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps them into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments;
the input of the backbone network is the preprocessed voice-format data and the output is a frame-level feature vector; the backbone network comprises, connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 1×1 channel-conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing is as follows:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice-format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension; the 1×1 channel-conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
2. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 1, wherein preprocessing the macaque voice pair to be verified specifically comprises:
removing the silent segments from the two voice segments to be verified to obtain the preprocessed voice-format data.
3. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 2, wherein the input of the feature fusion network is the frame-level feature vectors and the output is fused frame feature vectors; the feature fusion network comprises a cyclic frame interception and grouping unit and a channel group; the cyclic frame interception and grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F and groups F by a preset step length to obtain a grouped representation FG of the feature sequence; the channel group maps FG into fused frame features based on a channel-weighted fusion mechanism; the channel group comprises a plurality of CFM layers, each comprising a first branch and a second branch connected in parallel; the specific processing of a CFM layer is as follows:
the grouped frame-level feature vectors are transposed by the first branch to obtain the first part of the fused frame features; the grouped frame-level feature vectors are processed by the 2 FC layers of the second branch and then pass through 1 max pooling layer and 1 average pooling layer respectively, the outputs of the max pooling layer and the average pooling layer are added element-wise, and a sigmoid layer then performs activation to obtain the second part of the fused frame features; and the dot product of the first part and the second part yields the fused frame feature vector.
4. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 3, wherein the input of the feature compression network is the fused frame feature vectors and the output is a sentence-level feature vector of dimension d; the feature compression network comprises 1 gated recurrent unit and 1 fully connected layer connected in sequence;
the output of the feature compression network is a sentence-level feature vector e of dimension d:

e = h(x), e ∈ R^d, len(x) = l

where e is a vector of d real numbers, x is the frame-level feature vector matrix, l is the length of x, h(·) is the embedding mapping function, and len(·) gives the number of frames in the feature vector matrix.
5. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 4, wherein the method further comprises training and testing steps for the macaque voiceprint verification model, specifically:
step 1) preprocessing the voice segments of a macaque corpus, the preprocessed macaque corpus comprising corpora of T macaques;
step 2) randomly selecting the voice segments of T1 macaques from the preprocessed macaque corpus as the training corpus, the voice segments of the remaining T-T1 macaques serving as the test corpus;
step 3) selecting data from the training corpus to build a training set, randomly divided into q groups, each group comprising m macaques with n voice segments each;
step 4) extracting equal numbers of positive and negative sample voice pairs from the test corpus to form a test set, a positive sample voice pair being two different voices of the same macaque and a negative sample voice pair being voices of two different macaques;
step 5) sequentially inputting the q groups of training data into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with slope -0.3 and training with AMSGrad; computing the loss function, back-propagating the loss value through the back-propagation algorithm and updating the network parameters, one training period being completed after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period and computing the accuracy result for the current training period;
step 7) repeating step 5) and step 6) until P training periods are completed, and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
6. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 5, wherein the specific process of calculating the loss function is:
according to the sentence-level features of each group of macaque voice segments, the cosine distance dist(A, A') is calculated as:

dist(A, A') = (A · A') / (‖A‖ ‖A'‖)

where A denotes one sentence-level feature, A' denotes another sentence-level feature, and ‖·‖ denotes the second-order (L2) norm;
from the cosine distance dist(A, A'), the intra-class loss and the inter-class loss are calculated as:

S_{ji,k} = w · dist(e_{ji}, e_k) + b

where j indexes a voice segment of the ith macaque and k indexes another voice segment; j = k means the two segments belong to the same macaque and the calculated loss value S_{ji,k} is an intra-class loss; j ≠ k means the two segments belong to different macaques and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value and b an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is then calculated as:

Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
7. An end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: a trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module preprocesses the macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
the verification module inputs the preprocessed macaque voice pair into the pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence; the backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps them into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments;
the input of the backbone network is the preprocessed voice-format data and the output is a frame-level feature vector; the backbone network comprises, connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 1×1 channel-conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing is as follows:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice-format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension; the 1×1 channel-conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
CN202110313689.3A 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion Active CN113129908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313689.3A CN113129908B (en) 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110313689.3A CN113129908B (en) 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion

Publications (2)

Publication Number Publication Date
CN113129908A CN113129908A (en) 2021-07-16
CN113129908B (en) 2022-07-26

Family

ID=76774077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313689.3A Active CN113129908B (en) 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion

Country Status (1)

Country Link
CN (1) CN113129908B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763966B (en) * 2021-09-09 2024-03-19 武汉理工大学 End-to-end text irrelevant voiceprint recognition method and system
CN116386647B (en) * 2023-05-26 2023-08-22 北京瑞莱智慧科技有限公司 Audio verification method, related device, storage medium and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN111243602A (en) * 2020-01-06 2020-06-05 天津大学 Voiceprint recognition method based on gender, nationality and emotional information
CN111524525A (en) * 2020-04-28 2020-08-11 平安科技(深圳)有限公司 Original voice voiceprint recognition method, device, equipment and storage medium
CN112331216A (en) * 2020-10-29 2021-02-05 同济大学 Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356168B2 (en) * 2004-04-23 2008-04-08 Hitachi, Ltd. Biometric verification system and method utilizing a data classifier and fusion model
CN105656887A (en) * 2015-12-30 2016-06-08 百度在线网络技术(北京)有限公司 Artificial intelligence-based voiceprint authentication method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN111243602A (en) * 2020-01-06 2020-06-05 天津大学 Voiceprint recognition method based on gender, nationality and emotional information
CN111524525A (en) * 2020-04-28 2020-08-11 平安科技(深圳)有限公司 Original voice voiceprint recognition method, device, equipment and storage medium
CN112331216A (en) * 2020-10-29 2021-02-05 同济大学 Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speaker Recognition Based on Long Short-Term Memory Networks; Qihang Xu; 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP); 2021-02-04; full text *
Research and development of voiceprint recognition based on machine learning; Qi Xiaobo; China Master's Theses Full-text Database; 2020-02-15 (No. 2); full text *

Also Published As

Publication number Publication date
CN113129908A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN113488058B (en) Voiceprint recognition method based on short voice
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN108922559A (en) Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
WO2021051628A1 (en) Method, apparatus and device for constructing speech recognition model, and storage medium
CN112908341B (en) Language learner voiceprint recognition method based on multitask self-attention mechanism
CN112669820B (en) Examination cheating recognition method and device based on voice recognition and computer equipment
CN110349588A (en) LSTM network voiceprint recognition method based on word embedding
Golovko et al. A new technique for restricted Boltzmann machine learning
CN112183107A (en) Audio processing method and device
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
CN111312228A (en) End-to-end-based voice navigation method applied to electric power enterprise customer service
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN105741853A (en) Digital speech perception hash method based on formant frequency
CN114898775B (en) Voice emotion recognition method and system based on cross-layer cross fusion
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
Li et al. Fdn: Finite difference network with hierarchical convolutional features for text-independent speaker verification
CN111933117A (en) Voice verification method and device, storage medium and electronic device
CN110689875A (en) Language identification method and device and readable storage medium
CN113129926A (en) Voice emotion recognition model training method, voice emotion recognition method and device
CN117909486B (en) Multi-mode question-answering method and system based on emotion recognition and large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant