CN113129908A - End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion - Google Patents

End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion

Info

Publication number
CN113129908A
CN113129908A (application CN202110313689.3A; granted as CN113129908B)
Authority
CN
China
Prior art keywords
macaque
voice
frame
fusion
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110313689.3A
Other languages
Chinese (zh)
Other versions
CN113129908B (English)
Inventor
Li Songbin (李松斌)
Tang Jigang (唐计刚)
Liu Peng (刘鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences
Original Assignee
Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences filed Critical Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences
Priority to CN202110313689.3A priority Critical patent/CN113129908B/en
Publication of CN113129908A publication Critical patent/CN113129908A/en
Application granted granted Critical
Publication of CN113129908B publication Critical patent/CN113129908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion, wherein the method comprises the following steps: preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments; and inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion on whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification. The macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network which are connected in sequence: the backbone network is used for extracting frame-level features; the feature fusion network is used for performing cyclic frame interception and grouping on the extracted frame-level feature vectors and mapping the frame-level features into fusion-frame features based on a channel-weighted fusion mechanism; and the feature compression network is used for compressing the fusion-frame features to obtain sentence-level features corresponding to the macaque voice segments.

Description

End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion
Technical Field
The invention relates to the field of computer technology, in particular to an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
Background
Primates face a serious survival crisis. To protect primates effectively, it is important to know individual animals' ranges of motion and population changes, and both depend on individual animal verification and tracking. Individual animal verification is thus a foundational research problem, an important basis for realizing individual animal tracking, and of significant research value.
Commonly used individual animal verification techniques include manual observation, DNA fingerprinting, marking, image-based verification and voice-based verification. Most primates live in mountain forests where effective visual observation is difficult, and they are highly alert and hard for humans to approach, which makes direct observation, DNA fingerprinting and marking difficult to implement.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
In order to achieve the above object, the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame level feature fusion, the method comprising:
preprocessing a voice pair of a macaque to be verified; the macaque voice pair is two macaque voice sections;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion on whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network which are connected in sequence; the backbone network is used for extracting frame-level features; the feature fusion network is used for performing cyclic frame interception and grouping on the extracted frame-level feature vectors and mapping the frame-level features into fusion-frame features based on a channel-weighted fusion mechanism; and the feature compression network is used for compressing the fusion-frame features to obtain sentence-level features corresponding to the macaque voice segments.
As an improvement of the above method, the preprocessing of the macaque voice pair to be verified specifically comprises the following step:
cutting off the mute sections in the two voice segments to be verified to obtain the preprocessed voice format data.
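The silence-cutting preprocessing described above can be sketched with a simple short-time-energy rule; the frame size, hop and threshold below are assumed values for illustration, not parameters given in the patent:

```python
import numpy as np

def trim_silence(signal, frame_len=400, hop=160, threshold_db=-40.0):
    """Cut the mute (low-energy) sections out of a 1-D waveform."""
    # Short-time energy of each frame, in dB relative to the loudest frame.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(f ** 2) + 1e-12 for f in frames])
    energies_db = 10.0 * np.log10(energies / energies.max())
    # Keep every sample covered by at least one voiced (above-threshold) frame.
    keep = np.zeros(len(signal), dtype=bool)
    for idx, e_db in enumerate(energies_db):
        if e_db > threshold_db:
            keep[idx * hop: idx * hop + frame_len] = True
    return signal[keep]
```

Note the output is still waveform data (voice-format data), not extracted features, matching the patent's remark that preprocessing only removes silence.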
As an improvement of the above method, the input of the backbone network is preprocessed voice format data, and the output is a frame-level feature vector; the backbone network comprises the following components connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 × 1 channel conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing procedure comprises:
the learnable band-pass filter convolution layer converts the time-domain information of the voice format data into frequency-domain information; combined with the 6 one-dimensional residual convolution blocks and their pooling operations, feature dimensions are reduced while features are extracted; the 1 × 1 channel conversion convolution layer performs channel conversion on the output features of the residual convolution blocks, after which the multi-head Transformer blocks output the frame-level feature vectors.
As an improvement of the above method, the input of the feature fusion network is a frame-level feature vector, and the output is a fusion-frame feature vector. The feature fusion network comprises a cyclic frame interception and grouping unit and a channel group: the cyclic frame interception and grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F, and groups F by a preset step length to obtain a grouped representation FG of the feature sequence; the channel group is used for mapping FG into fusion-frame features based on a channel-weighted fusion mechanism. The channel group comprises a plurality of CFM layers, and each CFM layer comprises a first branch and a second branch connected in parallel; the specific processing procedure of the CFM layer is as follows:
transposing the grouped frame-level feature vectors through the first branch to obtain the first part of the fusion-frame features; processing the grouped frame-level feature vectors through the 2 fully connected (FC) layers of the second branch, then through 1 max pooling layer and 1 average pooling layer respectively, performing matrix addition on the outputs of the max pooling layer and the average pooling layer, and applying a sigmoid layer for activation to obtain the second part of the fusion-frame features; and performing a dot product between the first part and the second part of the fusion-frame features to obtain the fusion-frame feature vector.
As an improvement of the above method, the input of the feature compression network is a fusion-frame feature vector, and the output is a sentence-level feature vector of dimension d; the feature compression network comprises 1 gated recurrent unit (GRU) and 1 fully connected layer connected in sequence;
the output of the feature compression network is a sentence-level feature vector e with the dimension d:
e = h(x),  e ∈ R^d
len(x) = l
where R represents a real number, e includes d real numbers, x represents a feature vector matrix at a frame level, l is the length of x, h () represents an embedded mapping function, and len () represents the number of frames in the feature vector matrix.
As an improvement of the above method, the method further comprises a training step and a testing step of the macaque voiceprint verification model; the method specifically comprises the following steps:
step 1) preprocessing a voice section of a macaque corpus, wherein the preprocessed macaque corpus comprises a plurality of corpora of T macaques;
step 2) randomly selecting T1 voice segments of the macaques from the preprocessed macaque corpus as a training corpus, and using the rest T-T1 voice segments of the macaques as a test corpus;
step 3) selecting data from the training corpus to establish a training set; randomly dividing training set data into q groups, wherein each group comprises m macaques, and each macaque has n voice sections;
step 4) extracting equal number of positive sample voice pairs and negative sample voice pairs from a test corpus to form a test set, wherein the positive sample voice pairs are two different voices of the same macaque, and the negative sample voice pairs are respective voices of the two macaques;
step 5) sequentially inputting the q groups of data in the training set into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with a slope of -0.3, training with the AMSGrad optimizer, calculating the loss function, back-propagating the loss value through the back-propagation algorithm and updating the network parameters; a training period is complete after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period, and calculating the accuracy result for the current training period;
step 7) repeating step 5) and step 6) until P training periods are finished; and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
As an improvement of the above method, the specific process of calculating the loss function is as follows:
according to the sentence-level features of each group of macaque voice segments, the cosine distance dist(A, A') is calculated as follows:
dist(A, A') = (A · A') / (||A||₂ · ||A'||₂)
wherein, A represents a sentence-level feature, A' represents another sentence-level feature, | | | | | represents a second-order norm;
from the cosine distance dist (A, A'), the intra-class loss and inter-class loss are calculated as follows:
S_{ji,k} = w · dist(A, A') + b
wherein j denotes the jth voice segment of the ith macaque and k denotes another voice segment; j = k indicates that the two voice segments belong to the same macaque, and the calculated loss value S_{ji,k} is an intra-class loss; j ≠ k indicates that the two voice segments belong to different macaques, and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value and b is an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is calculated by the following formula:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
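The intra-/inter-class contrast described above can be sketched in NumPy under a GE2E-style reading of the loss: each utterance embedding is scored against a per-macaque centroid, scaled by w and offset by b, and contrasted through a log-sum-exp. The centroid comparison and the w, b values below are our assumptions for illustration:

```python
import numpy as np

def cosine_distance(a, b):
    # dist(A, A') = (A . A') / (||A||2 ||A'||2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verification_loss(embeddings, w=10.0, b=-5.0):
    """Average softmax contrast of intra- vs inter-class similarities.

    embeddings: array of shape (m, n, d) -- m macaques, n utterances each,
    d-dimensional sentence-level features.  w and b play the role of the
    patent's learnable weight and offset in S_{ji,k} = w * dist + b.
    """
    m, n, d = embeddings.shape
    centroids = embeddings.mean(axis=1)          # one centroid per macaque
    total = 0.0
    for i in range(m):
        for j in range(n):
            e = embeddings[i, j]
            # similarity of utterance ji to every macaque centroid k
            s = np.array([w * cosine_distance(e, centroids[k]) + b
                          for k in range(m)])
            # loss_ji = -S_{ji, own class} + log(sum_k exp(S_{ji,k}))
            total += -s[i] + np.log(np.sum(np.exp(s)))
    return total / (m * n)
```

Well-separated per-macaque clusters drive the loss toward zero, while collapsed embeddings leave it near log(m), which is what the training in step 5) exploits.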
an end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: the trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module is used for preprocessing a voice pair of the macaque to be verified; the macaque voice pair is two macaque voice sections;
the verification module is used for inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion whether the macaque voice pair to be verified belongs to the same individual macaque or not, and therefore voiceprint verification is achieved.
Compared with the prior art, the invention has the advantages that:
1. by processing macaque voice as voice pairs, the designed backbone network can automatically extract frame-level features from the sampled data of a macaque voice pair; the designed feature fusion network maps the frame-level features into fusion-frame features; and the feature compression network then compresses the fusion-frame features to obtain sentence-level features corresponding to the macaque voice segments;
2. the macaque voiceprint verification method provided by the invention realizes the shift from a closed-data-set multi-classification model to a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the animal individual verification task is redefined from a multi-classification task to a feature representation task. The voiceprint verification algorithm can therefore meet the requirements of more application scenarios, and the trained model can be used to verify or identify unknown targets;
3. the method of the invention can be applied to voiceprint verification of macaques and voiceprint identification of other animals subsequently, and has guiding significance.
Drawings
FIG. 1 is a schematic flow chart of the end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion according to the present invention;
fig. 2 is a schematic structural diagram of the backbone network provided by the present invention;
FIG. 3 is a schematic diagram of the overall structure of the end-to-end macaque voiceprint verification network based on cyclic frame-level feature fusion provided by the present invention;
fig. 4 is a block diagram of the end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion provided by the invention.
Reference numerals
410: data processing module; 420: backbone network; 430: feature fusion network;
440: feature compression network; 450: network parameter update module; 460: verification module
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, embodiment 1 of the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, which includes the following steps:
step 110: and selecting a voice pair from the preprocessed macaque corpus according to rules to construct a training set and a test set. The training set data is randomly divided into q groups, and each group comprises the voice sections of m macaques.
In the prior art, feature extraction is usually performed on voice data in a preprocessing stage to obtain MFCC, LPC or a spectrogram for classification by a classification model. The preprocessing of the invention is to intercept the effective voice segment, namely to cut off the mute segment in the original voice, but not to extract the characteristics, and the preprocessed macaque corpus is still the data of the voice format.
Step 120: randomly read a group of macaque voice segments from the training set, input them into the backbone network, and perform feature extraction to obtain the frame-level feature vectors of the macaque voice segments.
The backbone network provided by the invention is designed based on a CNN; an existing feature extraction network such as VGG or ResNet could also be adopted. The invention designs a new backbone network combining SincNet, ResNet and Transformers to extract the frame-level features of macaque voice.
Step 130: perform cyclic frame interception on the frame-level feature vectors output by the backbone network, input them into the feature fusion network, and perform feature fusion to obtain the fusion-frame feature vectors of the macaque voice segments.
The feature fusion network provided by the invention is designed based on a channel fusion mechanism, and the frame-level features are mapped into fusion frame features by weighting and fusing a plurality of adjacent channel features.
Step 140: input the fusion-frame feature vectors output by the feature fusion network into the feature compression network and perform feature compression to obtain the sentence-level feature vectors corresponding to the macaque voice segments.
Step 150: according to the sentence-level feature vectors of the group of macaque voice segments, calculate the intra-class loss and inter-class loss using the cosine distances of the vectors, and update the parameters of the backbone network, feature fusion network and feature compression network with AMSGrad.
Step 160: repeat steps 120-150 iteratively until the trained network achieves the highest accuracy on the test set, obtaining the optimal parameter combination of the network.
A group of data is repeatedly drawn at random from the constructed training set and input into the backbone network, and steps 120 to 150 are repeated; a training period ends when all data in the training set have been used once. At the end of each training period, the network model is tested once on the test set and the test accuracy recorded. When the preset number of training periods is completed, the network parameters obtained at the highest test accuracy are taken as the optimal parameter combination of the network.
Step 170: verify whether any macaque voice pair comes from the same individual based on the network with optimal parameters.
The network model with its parameters set to the optimal combination is used to examine macaque voice pairs outside the training and test sets; it can judge whether the two voices in a pair were produced by the same macaque, realizing macaque voiceprint verification.
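The final decision reduces to thresholding the cosine similarity of the two sentence-level embeddings produced by the model; a minimal sketch, with an assumed operating threshold not given in the patent:

```python
import numpy as np

def verify_pair(emb_a, emb_b, threshold=0.5):
    """Decide whether two sentence-level embeddings come from the same
    macaque by thresholding their cosine similarity.  The threshold is an
    assumed operating point; in practice it would be tuned on the test set."""
    sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return bool(sim >= threshold), float(sim)
```

A returned value of True corresponds to the conclusion that the voice pair belongs to the same individual macaque.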
By processing macaque voice as voice pairs, this embodiment realizes automatic frame-level feature extraction from the sampled data of a macaque voice pair through the designed backbone network, maps the frame-level features into fusion-frame features with the designed feature fusion network, and then compresses the fusion-frame features with the feature compression network to obtain sentence-level features corresponding to the macaque voice segments.
Optionally, assuming that the preprocessed macaque corpus contains sounds of T macaques, randomly selecting T1 macaque corpora as a training corpus, and taking the rest T-T1 macaque corpora as a test corpus; respectively extracting positive sample pairs and negative sample pairs from a training corpus and a test corpus to construct the training set and the test set; the positive sample pair randomly selects two sections of voices from the corpus of the same macaque, and the negative sample pair randomly selects one section of voice from two different macaque corpora respectively.
For example, the corpus includes the voices of 144 macaques aged 0-2 years; the voice duration of each macaque is 5-30 minutes, the total voice duration is 2143.11 minutes, the effective duration after cutting the mute sections is 171.35 minutes, and there are 18309 macaque voice segments in total. During training, the first 100 macaque corpora can be selected as the training set and the last 44 macaque corpora as the test set. A positive sample voice pair is constructed by randomly selecting two voice segments from one macaque's corpus. A negative sample voice pair is constructed by randomly selecting two targets (i.e. the voice file directories of two different macaques) from the test corpus, and then randomly selecting one voice segment from each of the two target corpora. Finally, 80000 voice pairs are constructed as the test set for macaque voiceprint verification, with a positive-to-negative sample ratio of 1:1, i.e. 40000 positive sample voice pairs and 40000 negative sample voice pairs. The 80000 voice pairs are grouped by a preset number b to obtain n = 80000/b groups of test data. The test set extracts voice pairs in the same way as the training set.
When testing across data sets, in order to solve the voice duration difference of different data sets, the testing voice can be compressed or expanded to a fixed length in a cutting or copying mode. The formula is expressed as follows:
x' = x[1:L], if len(x) ≥ L
x' = (x copied end to end)[1:L], if len(x) < L
where L is the fixed target length: a longer segment is cut to L samples, and a shorter one is copied until L samples are reached.
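The cut-or-copy rule for fixing the duration of cross-dataset test voice can be sketched as follows (the function name is ours):

```python
import numpy as np

def to_fixed_length(x, target_len):
    """Crop a segment longer than target_len, or tile (copy) a shorter one
    end to end, so that test voices share a single fixed duration."""
    if len(x) >= target_len:
        return x[:target_len]          # cut
    reps = int(np.ceil(target_len / len(x)))
    return np.tile(x, reps)[:target_len]  # copy, then trim the overshoot
```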
optionally, intra-class loss and inter-class loss calculation is performed according to sentence-level features of each group of macaque speech segments by using cosine distances of vectors. Wherein the cosine distance calculation of the vector is represented as:
dist(A, A') = (A · A') / (||A||₂ · ||A'||₂)
wherein, A represents a sentence-level feature, A' represents another sentence-level feature, | | | | | represents a second-order norm;
the intra-class loss and inter-class loss calculations are expressed as:
S_{ji,k} = w · dist(A, A') + b
wherein A denotes the jth voice segment of the ith macaque, and k indexes the other voice segment in the pair, denoted A'; j = k indicates that the two voice segments belong to the same macaque and the calculation yields an intra-class loss; otherwise the two voice segments belong to different macaques and the calculation yields an inter-class loss. The total loss function of the network constructed by the invention is calculated as:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
VGG and ResNet are widely applied to speaker verification, but differences in input data place different structural requirements on the feature extraction network, so the invention designs a backbone network for extracting features of macaque voices. Fig. 2 is a schematic structural diagram of the backbone network, which is composed of a learnable band-pass filter SincNet layer, 6 residual convolution blocks (ResBlock), 1 channel conversion convolution layer and 2 Transformer blocks.
The SincNet layer is composed, in sequence, of 1 sinc one-dimensional convolution, 1 max pooling layer with a pooling window of 3, 1 BN layer and a Leaky ReLU activation. Each ResBlock is composed of two convolution units, 1 max pooling layer and 1 feature weight scaling layer (FMS), where the pooling window of the max pooling layer is 3 and each convolution unit consists, in sequence, of 1 BN layer, a Leaky ReLU activation and 1 two-dimensional convolution Conv with kernel size 3 and stride 1. The channel conversion convolution layer has kernel size 1 and stride 1. Each Transformer block is composed of 2 fully connected units, each consisting, in sequence, of 1 multi-head attention mechanism (MHA) and 2 groups of FC and Dropout layers. The network parameters and data dimensions of each layer of the backbone network are shown in Table 1.
TABLE 1: network parameters and data dimensions of each layer of the backbone network (presented as an image in the original publication)
FMS is a feature weight scaling mechanism that assigns a different weight to each frame; without it, ResBlock would assign the same weight to every frame. MHA is a multi-head attention mechanism.
In the embodiment of the invention, a group of macaque voice segments is randomly read from the training set and the voice sample values of the macaque voice segments are input into the backbone network. The SincNet layer converts the time-domain information of the macaque voice into frequency-domain information, and, combined with the 6 one-dimensional residual convolution blocks and their pooling operations, feature dimensions are reduced while features are extracted. To accommodate the multi-head design of the Transformer blocks, before the data enters the Transformer blocks one 1 × 1 channel conversion convolution performs channel conversion on the output features of the residual convolution blocks; finally the backbone network outputs a two-dimensional feature map as the input of the feature fusion network.
The residual structure consisting of one-dimensional convolution residual units and pooling layers can extract frame-level feature representations from raw voice, realizes feature transfer between different convolution layers, and helps fuse semantic features from earlier and later stages of network learning. A 1 × 1 convolution layer is added to connect the residual modules with the Transformer module and to convert the number of feature-map channels, so that the number of heads can be chosen in the subsequent Transformer design. Adding the Transformer structure further strengthens the feature extraction capability of the backbone network, realizes end-to-end feature extraction, automatically extracts the frame-level features of macaque voice, and removes the need for complex hand-designed feature extraction algorithms.
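A SincNet-style band-pass filter is parameterized only by its two cutoff frequencies, which is what makes the layer learnable with so few parameters. The sketch below builds one such windowed-sinc kernel with fixed (non-learned) cutoffs, purely to illustrate the construction:

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_len=101, sr=16000):
    """Impulse response of a band-pass filter built as the difference of two
    low-pass sinc filters, Hamming-windowed for finite length.  In a SincNet
    layer only f_low and f_high would be trained; here they are fixed."""
    t = np.arange(kernel_len) - (kernel_len - 1) / 2
    def lowpass(fc):
        # ideal low-pass with cutoff fc (np.sinc is the normalized sinc)
        return 2 * fc / sr * np.sinc(2 * fc / sr * t)
    h = lowpass(f_high) - lowpass(f_low)
    return h * np.hamming(kernel_len)
```

Convolving the raw waveform with a bank of such kernels is what converts the time-domain voice data into band-limited (frequency-domain) information in the backbone's first layer.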
Fig. 3 is a schematic diagram of the overall structure of the macaque voiceprint verification network according to an embodiment of the present invention. The waveform at the leftmost end of fig. 3 represents the input voice-pair sample data, CNN represents the backbone network, and the frame-level feature vector output by the backbone network is denoted M = [m_1, m_2, ..., m_T], where m_i ∈ R^n, n is the frame-level feature dimension and T is the number of frames. Cyclic frame interception is performed before the frame-level features output by the backbone network are input into the feature fusion network. Specifically, the frame-level feature vectors are connected end to end to obtain a feature sequence F as follows:
F = [f_1, f_2, ..., f_T, f_1, f_2, ..., f_{c-1}]
grouping F by preset step length to obtain a grouping representation FG of the characteristic sequence:
FG = ([f_1, f_2, ..., f_c], [f_2, f_3, ..., f_{c+1}], ..., [f_T, f_1, f_2, ..., f_{c-1}])
where c represents the number of frames per feature set. And inputting the frame-level feature vectors grouped in the FG into a feature fusion network for feature fusion, thereby obtaining the fusion frame feature vectors of the voice segments of the macaque.
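The cyclic frame interception and grouping just described can be sketched as follows (with an assumed step length of 1, so one group is produced per original frame):

```python
import numpy as np

def cyclic_frame_groups(frames, c, step=1):
    """Connect the frame-level features end to end and slide a window of c
    frames over them, producing the grouped representation FG.

    frames: array of shape (T, n); returns an array of shape (T//step, c, n).
    """
    T = frames.shape[0]
    # F: append the first c-1 frames to the end (cyclic extension)
    extended = np.concatenate([frames, frames[:c - 1]], axis=0)
    return np.stack([extended[i:i + c] for i in range(0, T, step)])
```

The last group wraps around to the first frames, which is exactly what the cyclic extension of F provides.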
As shown in fig. 3, in order to perform more efficient feature fusion, the present invention designs a feature fusion network based on a Channel Fusion Mechanism (CFM). The feature fusion network comprises two branches: the first branch is a transposition operation, and the second branch is composed, in sequence, of 2 FC layers in series, a max pooling (MaxPool) layer and an average pooling (AvgPool) layer in parallel, and 1 sigmoid layer;
transposing the grouped frame-level feature vectors through the first branch to obtain the first part of the fusion-frame features; processing the grouped frame-level feature vectors through the 2 FC layers of the second branch, then through the MaxPool layer and the AvgPool layer respectively; performing matrix addition on the outputs of the MaxPool and AvgPool layers (i.e. adding corresponding elements), then applying the sigmoid layer for activation to obtain the second part of the fusion-frame features; and performing a dot product between the first part and the second part of the fusion-frame features to obtain the fusion-frame feature vector of the voice segment.
The calculation process of the fusion frame feature vector can be represented as follows:
FG_i' = g(f(FG_i)) = ω_2(ω_1(FG_i) + b_1) + b_2
m = MaxP(FG_i', c)
a = AvgP(FG_i', c)
w_i = σ(m + a)
f_i* = tp(FG_i) · w_i
wherein f and g denote the mapping functions of the two fully connected layers, whose parameters are ω_1, b_1 and ω_2, b_2 respectively; MaxP and AvgP denote max pooling and average pooling; σ is the sigmoid function; tp denotes transposition; and · denotes the dot product. The ith feature group FG_i is finally mapped to a feature f_i*, the ith fused feature vector, where FG_i ∈ R^{c×n} and f_i* ∈ R^n.
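Under one plausible reading of the dimensions above (pooling across the n channels to yield one weight per frame, so that tp(FG_i) · w_i lands in R^n), a single CFM step can be sketched as follows; the FC weights are placeholders that would be learned in practice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cfm_layer(FG_i, w1, b1, w2, b2):
    """One Channel Fusion Mechanism step for a single group FG_i of shape
    (c, n): two FC layers, parallel max/average pooling, sigmoid gating,
    and a dot product with the transposed group."""
    h = (FG_i @ w1 + b1) @ w2 + b2   # FG_i': two FC layers, shape (c, n)
    m = h.max(axis=1)                # max pooling, one value per frame
    a = h.mean(axis=1)               # average pooling, one value per frame
    w_i = sigmoid(m + a)             # frame weights w_i in (0, 1)^c
    return FG_i.T @ w_i              # f_i* = tp(FG_i) . w_i, in R^n
```

With zero weights the gate is uniformly 0.5 and the layer reduces to half the sum of the c frames, which makes the weighted-fusion interpretation easy to check.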
In the voiceprint verification task, the correlation between the feature representation of each frame (moment) of the time-frequency two-dimensional features and the final sentence-level features needs to be represented, so that the invention provides a channel weighting fusion mechanism, a new frame-level feature representation is obtained through the weighting fusion of a plurality of adjacent channel features, the interframe relevance degree of a feature graph is effectively enhanced, and the effective mapping from the frame-level features to the fusion frame features is realized.
As shown in fig. 3, the feature compression network includes a gated recurrent unit (GRU) and a fully connected (FC) layer. After being processed by the GRU unit and the FC layer in sequence, the fusion frame feature vector is mapped into a sentence-level feature vector e with dimensionality d:
e = h(x), e ∈ R^d
len(x) = l
where R denotes the set of real numbers (e consists of d real numbers), x denotes the frame-level feature vector matrix, l is the length of x, h() denotes the embedding mapping function, and len() denotes the number of frames in the feature vector matrix. The feature compression network maps the input two-dimensional features into one-dimensional features; this process can be regarded as an embedding of the original voice segment data.
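The GRU-plus-FC compression step can be sketched in PyTorch as follows. The GRU hidden size is an illustrative assumption; d = 256 follows the sentence-level dimension stated in this description.

```python
# Sketch of the feature compression network: 1 GRU followed by 1 FC layer.
# Assumption: the GRU hidden size (128) is not specified in the patent.
import torch
import torch.nn as nn

class FeatureCompression(nn.Module):
    def __init__(self, n: int, hidden: int = 128, d: int = 256):
        super().__init__()
        self.gru = nn.GRU(input_size=n, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, n) -- sequence of l fusion frame feature vectors.
        _, h = self.gru(x)               # final hidden state: (1, batch, hidden)
        return self.fc(h.squeeze(0))     # sentence-level embedding e: (batch, d)
```

The final GRU hidden state summarizes the whole segment, so the FC output is a single d-dimensional embedding e per voice segment, as in e = h(x) above.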
The network model designed by the invention takes voice data of size m × n × l as input, where m is the number of macaques selected for each group of data, n is the number of voice segments selected from each macaque's corpus, and l is the number of frames in each voice segment. After processing by the backbone network, the feature fusion network and the feature compression network, a feature output of size m × n × d is obtained, where d is the sentence-level feature dimension and may be set to 256. In addition, a Leaky ReLU activation function with a slope of -0.3 may be used in the network, the learning rate of the AMSGrad optimizer may be set to 0.001, the decay rate to 0.0001, the number of macaques per voice group to m = 8, and the number of voice segments per macaque to n = 10.
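Under the stated settings, the training configuration might look like the following sketch. Here `model` is a hypothetical placeholder for the full backbone + fusion + compression network, the "attenuation rate" is interpreted as weight decay (an assumption), and PyTorch's LeakyReLU takes the magnitude of the negative slope as a positive `negative_slope` argument.

```python
# Hedged sketch of the training configuration stated above.
# Assumptions: `model` is a placeholder; "attenuation rate" == weight decay.
import torch
import torch.nn as nn

# Placeholder network; the real model combines backbone, fusion, compression.
model = nn.Sequential(nn.Linear(32, 256), nn.LeakyReLU(negative_slope=0.3))

# AMSGrad is the amsgrad=True variant of Adam in PyTorch.
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.001, weight_decay=0.0001, amsgrad=True
)

# Batch geometry from the description: m macaques x n segments x l frames.
m, n, l, d = 8, 10, 100, 256          # l is illustrative here
batch = torch.randn(m * n, l, 32)     # m*n speech segments per training group
```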
The macaque voiceprint verification method provided by the invention converts the design of a closed-set multi-classification model into a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the animal individual verification task is redefined from a multi-classification task to a feature representation task. The voiceprint verification algorithm can therefore serve more application scenarios, and the trained model can be used to verify or identify unknown targets.
Example 2
As shown in fig. 4, embodiment 2 of the present invention provides a macaque voiceprint verification system, which is implemented based on the macaque voiceprint verification method. In the training and testing stage, the system specifically includes:
the data processing module 410 is configured to select a voice pair from the preprocessed macaque corpus according to rules, construct a training set and a test set, and group the training set and the test set;
the backbone network 420 is used for extracting features of the input macaque speech segment to obtain a frame-level feature vector of the macaque speech segment;
the feature fusion network 430 is configured to perform feature fusion after performing cyclic frame interception on the frame-level feature vectors output by the backbone network, so as to obtain a fusion frame feature vector of a voice segment of the macaque;
a feature compression network 440, configured to perform feature compression on the fusion frame features output by the feature fusion network to obtain sentence-level feature vectors corresponding to the macaque speech segments;
a network parameter updating module 450, configured to calculate intra-class loss and inter-class loss from the sentence-level feature vectors of the macaque speech segments using cosine distances between vectors, update parameters in the backbone network, the feature fusion network, and the feature compression network using AMSGrad, and iterate until the trained network reaches the highest accuracy on the test set, so as to obtain the optimal parameter combination of the network;
And the verification module 460 is configured to verify whether any macaque voice pair belongs to the same individual according to the output of the optimal parameter network.
The macaque voiceprint verification system provided by the invention processes macaque voices into voice pairs. Frame-level features of the sampled data of a macaque voice pair are extracted automatically by the designed backbone network, mapped into fusion frame features by the designed feature fusion network, and then compressed by the feature compression network to obtain the sentence-level features corresponding to the macaque voice segments.
Optionally, the data processing module 410 further includes:
a corpus dividing unit, configured to randomly select the corpora of T1 macaques from the T macaques contained in the preprocessed macaque corpus as the training corpus, with the corpora of the remaining T-T1 macaques serving as the test corpus;
a voice pair extraction unit, configured to extract positive sample pairs and negative sample pairs from the training corpus and the test corpus respectively to construct the training set and the test set; a positive sample pair consists of two voice segments randomly selected from the corpus of the same macaque, and a negative sample pair consists of one voice segment randomly selected from each of two different macaques' corpora.
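The pair construction rule above can be sketched as follows. The `corpus` dictionary, the macaque IDs, and the segment labels are hypothetical stand-ins for the preprocessed corpus; only the sampling logic reflects the description.

```python
# Sketch of positive/negative voice pair sampling.
# Assumption: `corpus` maps a macaque ID to its list of preprocessed segments.
import random

def sample_positive_pair(corpus: dict) -> tuple:
    """Two different voice segments from the same macaque."""
    monkey = random.choice(list(corpus))
    return tuple(random.sample(corpus[monkey], 2))

def sample_negative_pair(corpus: dict) -> tuple:
    """One voice segment from each of two different macaques."""
    a, b = random.sample(list(corpus), 2)
    return random.choice(corpus[a]), random.choice(corpus[b])

# Hypothetical toy corpus: 4 macaques with 5 segments each.
corpus = {f"m{i}": [f"m{i}_seg{j}" for j in range(5)] for i in range(4)}
pos = sample_positive_pair(corpus)
neg = sample_negative_pair(corpus)
```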
The backbone network 420 is designed based on CNNs and may employ a feature extraction network such as VGG or ResNet. The invention designs a new backbone network combining SincNet, ResNet and Transformer blocks to extract the frame-level features of macaque voice; its specific structure is shown in FIG. 2 and Table 1 and is not repeated here. In addition, the specific structures of the feature fusion network 430 and the feature compression network 440 are shown in fig. 3; reference may be made to the detailed description in the method section, which is not repeated herein.
After training and testing are finished, the trained macaque voiceprint verification model is obtained and the practical application stage begins. The system then comprises: the trained macaque voiceprint verification model, a data processing module 410 and a verification module 460; wherein,
the data processing module is used for preprocessing a voice pair of the macaque to be verified; the macaque voice pair is two macaque voice sections;
the verification module is used for inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion whether the macaque voice pair to be verified belongs to the same individual macaque or not, and therefore voiceprint verification is achieved.
The trained macaque voiceprint verification model comprises a backbone network 420, a feature fusion network 430 and a feature compression network 440 which are connected in sequence.
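A minimal sketch of the verification decision follows, assuming that the two sentence-level embeddings produced by the trained model are compared by cosine similarity against a threshold. The threshold value is illustrative, not taken from the patent.

```python
# Sketch of the verification module's decision rule.
# Assumption: the decision threshold (0.5) is illustrative.
import torch
import torch.nn.functional as F

def verify(e1: torch.Tensor, e2: torch.Tensor, threshold: float = 0.5) -> bool:
    """True if the two sentence-level embeddings likely belong to one macaque."""
    return F.cosine_similarity(e1, e2, dim=0).item() >= threshold

# Usage with toy embeddings: nearly parallel vectors pass, orthogonal ones fail.
same = verify(torch.tensor([1.0, 0.0]), torch.tensor([0.9, 0.1]))
diff = verify(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]))
```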
It should be noted that, for convenience of description, the drawings show only some, not all, of the content related to the embodiments of the present invention. Some example embodiments are described as processes or methods depicted as flowcharts; although a flowchart describes operations (or steps) as a sequential process, many of the operations can be performed in parallel or concurrently, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. An end-to-end macaque voiceprint verification method based on cycle frame level feature fusion, the method comprising:
preprocessing a voice pair of a macaque to be verified; the macaque voice pair is two macaque voice sections;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion as to whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network which are connected in sequence; the backbone network is used for extracting frame-level features; the feature fusion network is used for performing cyclic frame interception and grouping on the extracted frame-level feature vectors and mapping the frame-level features into fusion frame features based on a channel weighting fusion mechanism; and the feature compression network is used for compressing the fusion frame features to obtain sentence-level features corresponding to the macaque voice segments.
2. The end-to-end macaque voiceprint validation method based on loop frame-level feature fusion as claimed in claim 1, wherein the macaque speech pair to be validated is preprocessed; the method specifically comprises the following steps:
cutting off the mute section in the two sections of voice sections to be verified to obtain the preprocessed voice format data.
3. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion according to claim 2, wherein the input of the backbone network is preprocessed voice format data, and the output is a frame-level feature vector; the backbone network comprises the following components which are connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 × 1 channel conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing procedure comprises the following steps:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimensions; the 1 × 1 channel conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
4. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion according to claim 3, wherein the input of the feature fusion network is a frame-level feature vector and the output is a fusion frame feature vector; the feature fusion network comprises a cyclic frame interception grouping unit and a channel group; the cyclic frame interception grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F, and groups F according to a preset step length to obtain a grouped representation FG of the feature sequence; the channel group is used for mapping FG into fusion frame features based on a channel weighting fusion mechanism; the channel group comprises a plurality of CFM layers, and each CFM layer comprises a first branch and a second branch which are connected in parallel; the specific processing procedure of the CFM layer is as follows:
transposing the grouped frame-level feature vectors through a first branch to obtain a first part of the fusion frame features; after the grouped frame-level feature vectors are processed by 2 FC of a second branch, the grouped frame-level feature vectors respectively pass through 1 maximum pooling layer and 1 average pooling layer, the output of the maximum pooling layer and the output of the average pooling layer are subjected to matrix addition calculation, and then a sigmoid layer is used for activation processing to obtain a second part of the fusion frame feature; and performing dot product calculation on the first part of the fusion frame characteristics and the second part of the fusion frame characteristics to obtain a fusion frame characteristic vector.
5. The end-to-end macaque voiceprint verification method based on cycle frame level feature fusion according to claim 4, wherein the input of the feature compression network is a fusion frame feature vector, the output is a sentence level feature vector with dimension d, and the feature compression network comprises 1 gate control cycle unit and 1 full connection layer which are connected in sequence;
the output of the feature compression network is a sentence-level feature vector e with the dimension d:
e = h(x), e ∈ R^d
len(x) = l
where R denotes the set of real numbers (e consists of d real numbers), x denotes the frame-level feature vector matrix, l is the length of x, h() denotes the embedding mapping function, and len() denotes the number of frames in the feature vector matrix.
6. The end-to-end macaque voiceprint validation method based on cycle frame level feature fusion as claimed in claim 5, wherein the method further comprises a training step and a testing step of a macaque voiceprint validation model; the method specifically comprises the following steps:
step 1) preprocessing a voice section of a macaque corpus, wherein the preprocessed macaque corpus comprises a plurality of corpora of T macaques;
step 2) randomly selecting the corpora of T1 macaques from the preprocessed macaque corpus as the training corpus, with the corpora of the remaining T-T1 macaques serving as the test corpus;
step 3) selecting data from the training corpus to establish a training set; randomly dividing training set data into q groups, wherein each group comprises m macaques, and each macaque has n voice sections;
step 4) extracting an equal number of positive sample voice pairs and negative sample voice pairs from the test corpus to form a test set, wherein a positive sample voice pair consists of two different voice segments of the same macaque, and a negative sample voice pair consists of one voice segment from each of two different macaques;
step 5) sequentially inputting the q groups of data in the training set into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with a slope of -0.3, training with AMSGrad, calculating the loss function, back-propagating the loss value through the back-propagation algorithm, and updating the network parameters; one training period is completed after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period, and calculating the accuracy result of the current training period;
step 7) repeating step 5) and step 6) until P training periods are finished; and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
7. The end-to-end macaque voiceprint verification method based on cycle frame level feature fusion as claimed in claim 6, wherein the specific process of calculating the loss function is as follows:
according to the sentence-level features of each group of macaque speech segments, the cosine distance dist(A, A') is calculated as follows:
dist(A, A') = (A · A') / (||A|| ||A'||)
wherein A denotes one sentence-level feature, A' denotes another sentence-level feature, and || · || denotes the second-order (L2) norm;
from the cosine distance dist (A, A'), the intra-class loss and inter-class loss are calculated as follows:
S_{ji,k} = w · dist(e_{ji}, e_k) + b
wherein j denotes a voice segment of the i-th macaque and k denotes another voice segment; when j = k, the two voice segments belong to the same macaque, and the calculated loss value S_{ji,k} is an intra-class loss; when j ≠ k, the two voice segments belong to different macaques, and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value, and b is an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is calculated by the following formula:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
8. An end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: a trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module is used for preprocessing a voice pair of the macaque to be verified; the macaque voice pair is two macaque voice sections;
the verification module is used for inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion whether the macaque voice pair to be verified belongs to the same individual macaque or not, and therefore voiceprint verification is achieved.