CN113129908B - End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion


Info

Publication number
CN113129908B
CN113129908B · CN202110313689.3A
Authority
CN
China
Prior art keywords
macaque
frame
voice
fusion
feature
Prior art date
Legal status
Active
Application number
CN202110313689.3A
Other languages
Chinese (zh)
Other versions
CN113129908A (en)
Inventor
Li Songbin (李松斌)
Tang Jigang (唐计刚)
Liu Peng (刘鹏)
Current Assignee
Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences
Original Assignee
Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences filed Critical Nanhai Research Station Institute Of Acoustics Chinese Academy Of Sciences
Priority to CN202110313689.3A
Publication of CN113129908A
Application granted
Publication of CN113129908B


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion. The method comprises the following steps: preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments; and inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification. The macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence. The backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps the frame-level features into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments.

Description

End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Technical Field
The invention relates to the technical field of computers, in particular to an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
Background
Primates are facing a serious survival crisis. To protect primates effectively, it is important to understand the activity ranges of individual animals and the changes in their populations, both of which rely on individual animal verification and tracking. Individual animal verification, as a piece of basic research, is thus an important foundation for individual animal tracking and has significant research value.
The commonly used techniques for individual animal verification are manual observation, DNA fingerprinting, marking, image-based verification and voice-based verification. Most primates live in mountain forests, where effective visual observation is difficult; they are also highly alert and hard for humans to approach, which makes direct observation, DNA fingerprinting and marking difficult to implement.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
To achieve the above object, the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, the method comprising:
preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence; the backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps them into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments.
As an improvement of the above method, the preprocessing of the macaque voice pair to be verified specifically comprises:
removing the silent segments from the two voice segments to be verified to obtain preprocessed voice-format data.
As an improvement of the above method, the input of the backbone network is the preprocessed voice-format data and the output is a frame-level feature vector; the backbone network comprises, connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 1×1 channel-conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing is as follows:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice-format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension; the 1×1 channel-conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
As an improvement of the above method, the input of the feature fusion network is the frame-level feature vectors and the output is fused frame feature vectors. The feature fusion network comprises a cyclic frame interception and grouping unit and a channel group. The cyclic frame interception and grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F and groups F by a preset step length to obtain a grouped representation FG of the feature sequence; the channel group maps FG into fused frame features based on a channel-weighted fusion mechanism. The channel group comprises a plurality of CFM layers, each comprising a first branch and a second branch connected in parallel; the specific processing of a CFM layer is as follows:
the grouped frame-level feature vectors are transposed by the first branch to obtain the first part of the fused frame features; the grouped frame-level feature vectors are processed by the 2 FC layers of the second branch and then pass through 1 max pooling layer and 1 average pooling layer respectively, the outputs of the max pooling layer and the average pooling layer are added element-wise, and a sigmoid layer then performs activation to obtain the second part of the fused frame features; and the dot product of the first part and the second part yields the fused frame feature vector.
As an improvement of the above method, the input of the feature compression network is the fused frame feature vectors and the output is a sentence-level feature vector of dimension d; the feature compression network comprises 1 gated recurrent unit and 1 fully connected layer connected in sequence;
the output of the feature compression network is a sentence-level feature vector e of dimension d:

e = h(x), e ∈ R^d, len(x) = l

where e is a vector of d real numbers, x is the frame-level feature vector matrix, l is the length of x, h(·) is the embedding mapping function, and len(·) gives the number of frames in the feature vector matrix.
As an improvement of the above method, the method further comprises training and testing steps for the macaque voiceprint verification model, specifically:
step 1) preprocessing the voice segments of a macaque corpus, the preprocessed macaque corpus comprising corpora of T macaques;
step 2) randomly selecting the voice segments of T1 macaques from the preprocessed macaque corpus as the training corpus, the voice segments of the remaining T-T1 macaques serving as the test corpus;
step 3) selecting data from the training corpus to build a training set, randomly divided into q groups, each group comprising m macaques with n voice segments each;
step 4) extracting equal numbers of positive and negative sample voice pairs from the test corpus to form a test set, a positive sample voice pair being two different voices of the same macaque and a negative sample voice pair being voices of two different macaques;
step 5) sequentially inputting the q groups of training data into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with slope -0.3 and training with AMSGrad; computing the loss function, back-propagating the loss value through the back-propagation algorithm and updating the network parameters, one training period being completed after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period and computing the accuracy result for the current training period;
step 7) repeating step 5) and step 6) until P training periods are completed, and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
As an improvement of the above method, the specific process of calculating the loss function is:
according to the sentence-level features of each group of macaque voice segments, the cosine distance dist(A, A') is calculated as:

dist(A, A') = (A · A') / (‖A‖ ‖A'‖)

where A denotes one sentence-level feature, A' denotes another sentence-level feature, and ‖·‖ denotes the second-order (L2) norm;
from the cosine distance dist(A, A'), the intra-class loss and the inter-class loss are calculated as:

S_{ji,k} = w · dist(e_{ji}, e_k) + b

where j indexes a voice segment of the ith macaque and k indexes another voice segment; j = k means the two segments belong to the same macaque and the calculated loss value S_{ji,k} is an intra-class loss; j ≠ k means the two segments belong to different macaques and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value and b an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is then calculated as:

Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
An end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: a trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module preprocesses the macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
the verification module inputs the preprocessed macaque voice pair into the pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification.
Compared with the prior art, the invention has the following advantages:
1. the method processes macaque voices into voice pairs; the designed backbone network automatically extracts frame-level features from the sampled data of a macaque voice pair, the designed feature fusion network maps the frame-level features into fused frame features, and the feature compression network then compresses the fused frame features to obtain the sentence-level features corresponding to the macaque voice segments;
2. the macaque voiceprint verification method realizes the shift from a closed-set multi-classification model to a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the definition of the individual animal verification task is converted from a multi-classification task into feature representation. The voiceprint verification algorithm can therefore serve more application scenarios, and the trained model can verify or identify unknown targets;
3. the method can be applied to voiceprint verification of macaques and, subsequently, to voiceprint identification of other animals, for which it has guiding significance.
Drawings
FIG. 1 is a schematic flow chart of the end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion provided by the present invention;
FIG. 2 is a schematic structural diagram of the backbone network provided by the present invention;
FIG. 3 is a schematic diagram of the overall structure of the end-to-end macaque voiceprint verification network based on cyclic frame-level feature fusion provided by the present invention;
FIG. 4 is a block diagram of the end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion provided by the present invention.
Reference numerals
410 — data processing module; 420 — backbone network; 430 — feature fusion network;
440 — feature compression network; 450 — network parameter update module; 460 — verification module
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, embodiment 1 of the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, comprising the following steps:
Step 110: selecting voice pairs from the preprocessed macaque corpus according to rules to construct a training set and a test set. The training set data is randomly divided into q groups, each group comprising the voice segments of m macaques.
In the prior art, feature extraction is usually performed on the voice data in the preprocessing stage to obtain MFCC, LPC or spectrogram features for a classification model. The preprocessing of the present invention only intercepts the effective voice segments, i.e. cuts the silent segments out of the original voice, without extracting features; the preprocessed macaque corpus therefore remains data in voice (waveform) format.
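As a concrete illustration, this silence-cutting step can be sketched as follows in Python; the use of librosa and the 25 dB threshold are assumptions for illustration, since the patent only states that silent segments are removed while the data stay in voice format.

import numpy as np
import librosa

def trim_silence(wav_path, top_db=25):
    """Cut silent segments out of a recording, keeping raw waveform data.

    top_db is an assumed energy threshold; the patent does not specify one.
    """
    y, sr = librosa.load(wav_path, sr=None)               # keep native sample rate
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent (start, end) pairs
    if len(intervals) == 0:
        return y, sr
    voiced = np.concatenate([y[s:e] for s, e in intervals])
    return voiced, sr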
Step 120: randomly reading a group of macaque voice segments from the training set, inputting them into the backbone network and performing feature extraction to obtain the frame-level feature vectors of the macaque voice segments.
The backbone network provided by the invention is designed based on CNNs; a feature extraction network such as VGG or ResNet could also be adopted. The invention designs a new backbone network combining SincNet, ResNet and Transformers to extract the frame-level features of macaque voice.
Step 130: performing cyclic frame interception on the frame-level feature vectors output by the backbone network, inputting them into the feature fusion network and performing feature fusion to obtain the fused frame feature vectors of the macaque voice segments.
The feature fusion network provided by the invention is designed based on a channel fusion mechanism, mapping the frame-level features into fused frame features through weighted fusion of several adjacent channel features.
Step 140: inputting the fused frame feature vectors output by the feature fusion network into the feature compression network and performing feature compression to obtain the sentence-level feature vectors corresponding to the macaque voice segments.
Step 150: computing the intra-class and inter-class losses from the sentence-level feature vectors of the group of macaque voice segments using the cosine distance between vectors, and updating the parameters of the backbone network, the feature fusion network and the feature compression network with AMSGrad.
Step 160: repeating steps 120-150 iteratively until the trained network achieves the highest accuracy on the test set, thereby obtaining the optimal parameter combination of the network.
A group of data is repeatedly and randomly selected from the constructed training set and input into the backbone network, and steps 120 to 150 are executed repeatedly; a training period is finished when all the data in the training set have been used once. Each time a training period finishes, the network model is tested once on the test set and the test accuracy is recorded. When the preset number of training periods has been completed, the network model parameters that achieved the highest test accuracy are taken as the optimal parameter combination of the network.
Step 170: verifying whether any macaque voice pair comes from the same individual based on the network with the optimal parameters.
The network model with its parameters set to the optimal combination is used to examine macaque voice pairs outside the training and test sets; it can judge whether the two voices in a pair were produced by the same macaque, realizing macaque voiceprint verification.
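A minimal sketch of this verification step, assuming a PyTorch `model` that maps a preprocessed waveform tensor to a d-dimensional sentence-level embedding; the decision threshold is an illustrative assumption, as the patent does not state one.

import torch
import torch.nn.functional as F

def verify_pair(model, wav_a, wav_b, threshold=0.5):
    """Decide whether two preprocessed voice segments come from the same macaque."""
    model.eval()
    with torch.no_grad():
        e_a = model(wav_a.unsqueeze(0)).squeeze(0)   # sentence-level embedding of segment A
        e_b = model(wav_b.unsqueeze(0)).squeeze(0)   # sentence-level embedding of segment B
    score = F.cosine_similarity(e_a, e_b, dim=0).item()
    return score >= threshold, score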
By processing macaque voice into voice pairs, this embodiment automatically extracts frame-level features from the sampled data of a macaque voice pair through the designed backbone network, maps the frame-level features into fused frame features with the designed feature fusion network, and then compresses the fused frame features with the feature compression network to obtain the sentence-level features corresponding to the macaque voice segments.
Optionally, assuming the preprocessed macaque corpus contains the sounds of T macaques, T1 macaque corpora are randomly selected as the training corpus and the remaining T-T1 macaque corpora serve as the test corpus; positive and negative sample pairs are extracted from the training corpus and the test corpus respectively to construct the training set and the test set. A positive sample pair is two voice segments randomly selected from the corpus of the same macaque; a negative sample pair is one voice segment randomly selected from each of two different macaque corpora.
For example, the corpus comprises the voices of 144 macaques aged 0-2 years; each macaque contributes 5-30 minutes of voice, 2143.11 minutes in total, of which 171.35 minutes of effective voice remain after cutting the silent segments, for a total of 18309 macaque call segments. During training, the first 100 macaque corpora can be selected as the training set and the remaining 44 as the test set. A positive sample voice pair is constructed by randomly selecting two voice segments from one macaque corpus. A negative sample voice pair is constructed by randomly selecting two targets (i.e. the voice file directories of two different macaques) from the test corpus and then randomly selecting one voice segment from each of the two target corpora. Finally, 80000 voice pairs are constructed as the test set for macaque voiceprint verification, with a 1:1 ratio of positive to negative samples, i.e. 40000 positive and 40000 negative sample voice pairs. Grouping the 80000 voice pairs by a preset number b yields n = 80000/b groups of test data. The test set extracts voice pairs in the same manner as the training set.
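The pair-construction rule described above can be sketched as follows; `corpus` (a mapping from macaque ID to its list of preprocessed call segments) is a hypothetical structure introduced for illustration, and each macaque is assumed to have at least two segments.

import random

def build_pairs(corpus, n_pairs=80000):
    """Build a 1:1 mix of positive and negative voice pairs.

    corpus: dict mapping macaque id -> list of that macaque's call segments.
    Returns (segment_a, segment_b, label) tuples, label 1 for the same individual.
    """
    ids = list(corpus)
    pairs = []
    for _ in range(n_pairs // 2):
        same = random.choice(ids)
        a, b = random.sample(corpus[same], 2)        # two different calls, same macaque
        pairs.append((a, b, 1))
        m1, m2 = random.sample(ids, 2)               # two different macaques
        pairs.append((random.choice(corpus[m1]), random.choice(corpus[m2]), 0))
    random.shuffle(pairs)
    return pairs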
When testing across data sets, to handle the differing voice durations of different data sets, the test voice can be compressed or expanded to a fixed length L by cutting or copying. Expressed as a formula:

x' = x[1:L],         len(x) ≥ L
x' = repeat(x)[1:L], len(x) < L
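One way to read that cutting-or-copying rule in code, sketched under the assumption that `x` is a one-dimensional sample array:

import numpy as np

def to_fixed_length(x, target_len):
    """Crop a long segment; tile (copy) and then crop a short one."""
    if len(x) >= target_len:
        return x[:target_len]
    reps = int(np.ceil(target_len / len(x)))
    return np.tile(x, reps)[:target_len]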
optionally, intra-class loss and inter-class loss calculation is performed according to sentence-level features of each group of macaque speech segments by using cosine distances of vectors. Wherein the cosine distance calculation of the vector is represented as:
Figure BDA0002990277960000071
wherein, A represents a sentence-level feature, A' represents another sentence-level feature, | | | | | represents a second-order norm;
the intra-class loss and inter-class loss calculations are expressed as:
Figure BDA0002990277960000072
wherein, A represents the jth section pronunciation of ith kiwi fruit, and k represents another pronunciation section in the pronunciation centering, and it indicates that two pronunciation sections belong to same kiwi fruit to note A', j ═ k, calculates and obtains the loss in the class, otherwise, indicates that two pronunciation sections belong to different kiwi fruits, represents another pronunciation section with B, calculates and obtains the loss between the class. The calculation of the total loss function of the network constructed by the invention is represented as follows:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
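A hedged sketch of this loss in PyTorch. The S_{ji,k} = w·dist(·,·) + b formulation mirrors the generalized end-to-end (GE2E) family of losses, so the per-individual centroids and the initial values of w and b below are assumptions for illustration rather than details stated in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseCosineLoss(nn.Module):
    """GE2E-style loss: S_{ji,k} = w * cos(e_ji, c_k) + b, softmax over k."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))   # learnable weight w (assumed init)
        self.b = nn.Parameter(torch.tensor(-5.0))   # learnable offset b (assumed init)

    def forward(self, emb):
        # emb: (m, n, d) = macaques x segments per macaque x embedding dim
        m, n, _ = emb.shape
        cent = emb.mean(dim=1)                      # (m, d) per-macaque centroid
        sim = F.cosine_similarity(emb.unsqueeze(2),              # (m, n, 1, d)
                                  cent.unsqueeze(0).unsqueeze(0),  # (1, 1, m, d)
                                  dim=-1)                        # -> (m, n, m)
        s = self.w * sim + self.b                   # S_{ji,k}
        target = torch.arange(m, device=emb.device).repeat_interleave(n)
        # cross-entropy reproduces -S_{ji,j} + log sum_k exp(S_{ji,k})
        return F.cross_entropy(s.reshape(m * n, m), target)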
VGG and ResNet are widely applied to speaker verification, but different input data place different structural requirements on the feature extraction network, so the invention designs a backbone network for extracting features from macaque voices. FIG. 2 is a schematic structural diagram of the backbone network, which is composed of a learnable band-pass filter SincNet layer, 6 residual convolution blocks (ResBlock), 1 channel-conversion convolutional layer and two Transformer blocks.
The SincNet layer is composed, in sequence, of 1 sinc one-dimensional convolution, 1 max pooling layer with a pooling window of 3, 1 BN layer and a LeakyReLU activation. Each ResBlock is composed of two convolution units, 1 max pooling layer and 1 feature-map scaling layer (FMS), where the pooling window of the max pooling layer is 3 and each convolution unit consists, in sequence, of 1 BN, a LeakyReLU and 1 convolution Conv with kernel size 3 and stride 1. The channel-conversion convolutional layer has kernel size 1 and stride 1. Each Transformer block is composed of 2 groups of fully connected units, each consisting of 1 multi-head attention mechanism (MHA) followed by 2 groups of FC and Dropout layers. The network parameters and data dimensions of each layer of the backbone network are shown in Table 1.
Table 1 (network parameters and data dimensions of each layer of the backbone network; presented as an image in the original publication)
FMS is a feature-weight scaling mechanism that assigns a different weight to each frame; without it, the ResBlock would assign every frame the same weight. MHA is a multi-head attention mechanism.
In the embodiment of the invention, a group of macaque voice segments is read randomly from the training set and their voice sample values are input into the backbone network. The SincNet layer converts the time-domain information of the macaque voice into frequency-domain information, and the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension. To facilitate the multi-head design of the Transformer blocks, a 1×1 channel-conversion convolution performs channel conversion on the output features of the residual convolution blocks before the data enter the Transformer blocks; finally, the backbone network outputs a two-dimensional feature map as the input of the feature fusion network.
Through the residual structure composed of one-dimensional convolution residual units and pooling layers, frame-level feature representations can be extracted from the raw voice, features can be transferred between different convolutional layers, and the fusion of earlier and later semantic features during network learning is enhanced. A 1×1 convolutional layer is added to connect the residual module to the Transformer module and to convert the number of feature-map channels, allowing the number of heads to be chosen in the subsequent Transformer design. Adding the Transformer structure further strengthens the feature extraction capability of the backbone network, realizes end-to-end feature extraction, automatically extracts the frame-level features of macaque voice, and removes the need for complex hand-designed feature extraction algorithms.
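A structural sketch of this backbone in PyTorch. The plain Conv1d front end stands in for the learnable sinc band-pass layer, and the channel counts, kernel sizes and FMS form are illustrative assumptions; only the overall layout (band-pass front end → 6 residual blocks with pooling → 1×1 channel conversion → 2 Transformer blocks) follows the text above.

import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """One residual block: two (BN -> LeakyReLU -> Conv) units, max pooling, FMS gate."""
    def __init__(self, ch):
        super().__init__()
        unit = lambda: nn.Sequential(nn.BatchNorm1d(ch), nn.LeakyReLU(0.3),
                                     nn.Conv1d(ch, ch, kernel_size=3, padding=1))
        self.body = nn.Sequential(unit(), unit())
        self.pool = nn.MaxPool1d(3)                       # pooling window 3
        self.fms = nn.Linear(ch, ch)                      # feature-map scaling weights

    def forward(self, x):                                 # x: (B, C, T)
        y = self.pool(x + self.body(x))                   # residual connection + pooling
        w = torch.sigmoid(self.fms(y.mean(dim=2)))        # per-channel gate from global average
        return y * w.unsqueeze(2)

class Backbone(nn.Module):
    def __init__(self, ch=64, d_model=64, n_heads=4):
        super().__init__()
        # stand-in for the learnable band-pass (sinc) convolution layer
        self.front = nn.Conv1d(1, ch, kernel_size=251, padding=125)
        self.res = nn.Sequential(*[ResBlock1D(ch) for _ in range(6)])
        self.chconv = nn.Conv1d(ch, d_model, kernel_size=1)   # 1x1 channel conversion
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)

    def forward(self, wav):                               # wav: (B, 1, samples)
        z = self.chconv(self.res(self.front(wav)))        # (B, d_model, T)
        return self.transformer(z.transpose(1, 2))        # (B, T, d_model) frame-level features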
FIG. 3 is a schematic diagram of the overall structure of the macaque voiceprint verification network according to an embodiment of the present invention. The waveform at the far left of FIG. 3 represents the input voice-pair sample data and CNN represents the backbone network. The frame-level feature vectors output by the backbone network are denoted M = [m_1, m_2, ..., m_T], where m_i ∈ R^n, n is the frame-level feature dimension and T is the number of frames. Cyclic frame interception is performed before the frame-level features output by the backbone network are input into the feature fusion network. Specifically, the frame-level feature vectors are connected end to end to obtain the feature sequence F:

F = [f_1, f_2, ..., f_T, f_1, f_2, ..., f_{c-1}]

Grouping F by a preset step length yields the grouped representation FG of the feature sequence:

FG = ([f_1, f_2, ..., f_c], [f_2, f_3, ..., f_{c+1}], ..., [f_T, f_1, f_2, ..., f_{c-1}])

where c is the number of frames per feature group. The grouped frame-level feature vectors in FG are input into the feature fusion network for feature fusion, yielding the fused frame feature vectors of the macaque voice segment.
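The cyclic interception and grouping above (with the step length assumed to be 1, as in the FG example) can be sketched as:

import torch

def cyclic_group(frames, c):
    """frames: (T, n) frame-level features -> FG: (T, c, n).

    Group i is [f_i, f_{i+1}, ..., f_{i+c-1}] with wrap-around, which is
    equivalent to concatenating the first c-1 frames onto the end of M.
    """
    T = frames.shape[0]
    idx = (torch.arange(T).unsqueeze(1) + torch.arange(c).unsqueeze(0)) % T
    return frames[idx]                                   # advanced indexing gathers groups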
As shown in FIG. 3, to achieve more efficient feature fusion, the invention designs a feature fusion network based on a Channel Fusion Mechanism (CFM). The feature fusion network comprises two branches: the first branch is a transposition operation; the second branch consists, in sequence, of 2 FC layers in series, a max pooling (MaxPool) layer and an average pooling (AvgPool) layer in parallel, and 1 sigmoid layer.
The grouped frame-level feature vectors are transposed by the first branch to obtain the first part of the fused frame features. After being processed by the 2 FC layers of the second branch, the grouped frame-level feature vectors pass through the MaxPool layer and the AvgPool layer respectively; the outputs of the MaxPool and AvgPool layers are added element-wise, and a sigmoid layer then performs activation to obtain the second part of the fused frame features. The dot product of the first and second parts gives the fused frame feature vector of the voice segment.
The calculation process of the fused frame feature vector can be expressed as:

FG_i' = g(f(FG_i)) = w_2(w_1(FG_i) + b_1) + b_2
m = MaxP(FG_i', c)
a = AvgP(FG_i', c)
w_i = σ(m + a)
f_i* = tp(FG_i) · w_i

where f and g are the mapping functions of the two fully connected layers, whose parameters are w_1, b_1 and w_2, b_2 respectively; MaxP and AvgP denote max pooling and average pooling; σ is the sigmoid function; tp denotes transposition; and · denotes the dot product. The ith feature group FG_i is thus mapped to the fused feature vector f_i*, where FG_i ∈ R^{c×n} and f_i* ∈ R^n.
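A sketch of one CFM layer following these formulas; the hidden width of the two FC layers is an assumption, and batched group tensors of shape (B, T, c, n) are used for convenience.

import torch
import torch.nn as nn

class CFMLayer(nn.Module):
    """Map each group FG_i in R^{c x n} to a fused frame feature f_i* in R^n."""

    def __init__(self, n, hidden=None):
        super().__init__()
        hidden = hidden or n
        # the two FC layers f and g with parameters (w1, b1) and (w2, b2)
        self.fc = nn.Sequential(nn.Linear(n, hidden), nn.Linear(hidden, n))

    def forward(self, fg):                       # fg: (B, T, c, n)
        g = self.fc(fg)                          # FG_i' = g(f(FG_i))
        m = g.max(dim=-1).values                 # MaxP over the feature axis -> (B, T, c)
        a = g.mean(dim=-1)                       # AvgP over the feature axis -> (B, T, c)
        w = torch.sigmoid(m + a)                 # w_i = sigma(m + a)
        # f_i* = tp(FG_i) . w_i : weight the c frames of each group and sum -> (B, T, n)
        return torch.einsum('btnc,btc->btn', fg.transpose(2, 3), w)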
In a voiceprint verification task, the correlation between the feature representation of each frame (moment) of the time-frequency two-dimensional features and the final sentence-level features must be captured. The invention therefore proposes a channel-weighted fusion mechanism: a new frame-level feature representation is obtained by weighted fusion of several adjacent channel features, which effectively strengthens the inter-frame correlation of the feature map and realizes the mapping from frame-level features to fused frame features.
As shown in FIG. 3, the feature compression network comprises a gated recurrent unit (GRU) and a fully connected (FC) layer. After being processed by the GRU and the FC layer in sequence, the fused frame feature vectors are mapped into a sentence-level feature vector e of dimension d:

e = h(x), e ∈ R^d, len(x) = l

where e is a vector of d real numbers, x is the frame-level feature vector matrix, l is the length of x, h(·) is the embedding mapping function, and len(·) gives the number of frames in the feature vector matrix. The purpose of the feature compression network is to map the input two-dimensional features into one-dimensional features; this process can be regarded as an embedding of the original voice-segment data.
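The compression step admits a direct sketch: a single GRU whose final hidden state is projected by one FC layer to the d-dimensional embedding (the GRU hidden width is an assumption).

import torch.nn as nn

class FeatureCompression(nn.Module):
    """GRU + FC: map the fused frame sequence (l x n) to a d-dim sentence embedding."""

    def __init__(self, n, d=256, hidden=256):
        super().__init__()
        self.gru = nn.GRU(n, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, d)

    def forward(self, x):                        # x: (B, l, n) fused frame features
        _, h = self.gru(x)                       # h: (1, B, hidden), final hidden state
        return self.fc(h.squeeze(0))             # e = h(x), shape (B, d)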
The network model designed by the invention takes voice data of size m × n × l as input, where m is the number of macaques selected in each group of data, n is the number of voice segments selected from each macaque corpus and l is the number of frames of each voice segment. After processing by the backbone network, the feature fusion network and the feature compression network, a feature output of size m × n × d is obtained, where d is the sentence-level feature dimension and may be set to 256. In addition, a Leaky ReLU activation function with slope -0.3 may be used in the network, the learning rate of the AMSGrad optimizer may be set to 0.001 and its decay rate to 0.0001, the number of macaques per voice group may be set to m = 8, and the number of voice segments per macaque to n = 10.
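In PyTorch these hyperparameters map onto Adam's amsgrad variant; reading the "decay rate" as weight decay is an interpretive assumption, and the tiny Sequential below merely stands in for the full backbone-fusion-compression stack.

import torch
import torch.nn as nn

# placeholder model standing in for backbone + feature fusion + feature compression
model = nn.Sequential(nn.Linear(256, 256),
                      nn.LeakyReLU(negative_slope=0.3),   # "Leaky ReLU with slope -0.3"
                      nn.Linear(256, 256))

optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001,             # learning rate 0.001
                             weight_decay=0.0001,  # assumed reading of the decay rate
                             amsgrad=True)         # enables the AMSGrad variant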
The macaque voiceprint verification method of the invention realizes the shift from a closed-set multi-classification model to a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the definition of the individual animal verification task is converted from a multi-classification task into feature representation. The voiceprint verification algorithm can therefore serve more application scenarios, and the trained model can verify or identify unknown targets.
Example 2
As shown in FIG. 4, embodiment 2 of the present invention provides a macaque voiceprint verification system implemented on the basis of the above macaque voiceprint verification method. In the training and testing stage, the system specifically comprises:
the data processing module 410, configured to select voice pairs from the preprocessed macaque corpus according to rules, construct the training set and the test set, and group them;
the backbone network 420, configured to extract features from the input macaque voice segments to obtain their frame-level feature vectors;
the feature fusion network 430, configured to perform cyclic frame interception on the frame-level feature vectors output by the backbone network and then perform feature fusion to obtain the fused frame feature vectors of the macaque voice segments;
the feature compression network 440, configured to compress the fused frame features output by the feature fusion network to obtain the sentence-level feature vectors corresponding to the macaque voice segments;
the network parameter update module 450, configured to compute the intra-class and inter-class losses from the sentence-level feature vectors of the macaque voice segments using the cosine distance between vectors, update the parameters of the backbone network, the feature fusion network and the feature compression network with AMSGrad, and iterate until the trained network achieves the highest accuracy on the test set, thereby obtaining the optimal parameter combination of the network;
and the verification module 460, configured to verify whether any macaque voice pair belongs to the same individual according to the output of the network with the optimal parameters.
The macaque voiceprint verification system provided by the invention processes macaque voice into voice pairs; the designed backbone network automatically extracts frame-level features from the sampled data of a macaque voice pair, the designed feature fusion network maps the frame-level features into fused frame features, and the feature compression network then compresses the fused frame features to obtain the sentence-level features corresponding to the macaque voice segments.
Optionally, the data processing module 410 further includes:
the corpus dividing unit, configured to randomly select T1 macaque corpora as the training corpus from the T macaque sounds contained in the preprocessed macaque corpus, the remaining T-T1 macaque corpora serving as the test corpus;
and the voice pair extraction unit, configured to extract positive and negative sample pairs from the training corpus and the test corpus respectively to construct the training set and the test set, a positive sample pair being two voice segments randomly selected from the corpus of the same macaque and a negative sample pair being one voice segment randomly selected from each of two different macaque corpora.
The backbone network 420 is designed based on CNNs, and a feature extraction network such as VGG or ResNet may be adopted. The invention designs a new backbone network combining SincNet, ResNet and Transformers to extract the frame-level features of macaque voice; its specific structure is shown in FIG. 2 and Table 1 and is not repeated here. The specific structures of the feature fusion network 430 and the feature compression network 440 are shown in FIG. 3; see the corresponding method description for details, which are likewise not repeated here.
After the training and testing are finished, the trained macaque voiceprint verification model is obtained and the practical application stage begins. The system then comprises: the trained macaque voiceprint verification model, the data processing module 410 and the verification module 460; wherein,
the data processing module preprocesses the macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
the verification module inputs the preprocessed macaque voice pair into the pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification.
The trained macaque voiceprint verification model comprises the backbone network 420, the feature fusion network 430 and the feature compression network 440 connected in sequence.
It should be noted that, for convenience of description, only some, not all, of the content related to the embodiments of the present invention is shown in the drawings. Some example embodiments are described as processes or methods depicted as flow diagrams; although a flow diagram describes operations (or steps) as sequential, many of the operations can be performed in parallel or concurrently, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure.
Finally, it should be noted that the above embodiments are merely used to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art should understand that the technical solutions of the present invention may be modified or replaced with equivalents without departing from their spirit and scope, and all such modifications should be covered by the scope of the claims of the present invention.

Claims (7)

1. An end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, the method comprising:
preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
wherein the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence; the backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps them into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments;
the input of the backbone network is the preprocessed voice-format data and the output is a frame-level feature vector; the backbone network comprises, connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 1×1 channel-conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing is as follows:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice-format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension; the 1×1 channel-conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
2. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 1, wherein preprocessing the macaque voice pair to be verified specifically comprises:
removing the silent segments from the two voice segments to be verified to obtain the preprocessed voice-format data.
3. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 2, wherein the input of the feature fusion network is the frame-level feature vectors and the output is fused frame feature vectors; the feature fusion network comprises a cyclic frame interception and grouping unit and a channel group; the cyclic frame interception and grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F and groups F by a preset step length to obtain a grouped representation FG of the feature sequence; the channel group maps FG into fused frame features based on a channel-weighted fusion mechanism; the channel group comprises a plurality of CFM layers, each comprising a first branch and a second branch connected in parallel; the specific processing of a CFM layer is as follows:
the grouped frame-level feature vectors are transposed by the first branch to obtain the first part of the fused frame features; the grouped frame-level feature vectors are processed by the 2 FC layers of the second branch and then pass through 1 max pooling layer and 1 average pooling layer respectively, the outputs of the max pooling layer and the average pooling layer are added element-wise, and a sigmoid layer then performs activation to obtain the second part of the fused frame features; and the dot product of the first part and the second part yields the fused frame feature vector.
4. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 3, wherein the input of the feature compression network is the fused frame feature vectors and the output is a sentence-level feature vector of dimension d; the feature compression network comprises 1 gated recurrent unit and 1 fully connected layer connected in sequence;
the output of the feature compression network is a sentence-level feature vector e of dimension d:

e = h(x), e ∈ R^d, len(x) = l

where e is a vector of d real numbers, x is the frame-level feature vector matrix, l is the length of x, h(·) is the embedding mapping function, and len(·) gives the number of frames in the feature vector matrix.
5. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 4, wherein the method further comprises training and testing steps for the macaque voiceprint verification model, specifically:
step 1) preprocessing the voice segments of a macaque corpus, the preprocessed macaque corpus comprising corpora of T macaques;
step 2) randomly selecting the voice segments of T1 macaques from the preprocessed macaque corpus as the training corpus, the voice segments of the remaining T-T1 macaques serving as the test corpus;
step 3) selecting data from the training corpus to build a training set, randomly divided into q groups, each group comprising m macaques with n voice segments each;
step 4) extracting equal numbers of positive and negative sample voice pairs from the test corpus to form a test set, a positive sample voice pair being two different voices of the same macaque and a negative sample voice pair being voices of two different macaques;
step 5) sequentially inputting the q groups of training data into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with slope -0.3 and training with AMSGrad; computing the loss function, back-propagating the loss value through the back-propagation algorithm and updating the network parameters, one training period being completed after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period and computing the accuracy result for the current training period;
step 7) repeating step 5) and step 6) until P training periods are completed, and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
6. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion of claim 5, wherein the specific process of calculating the loss function is:
according to the sentence-level features of each group of macaque voice segments, the cosine distance dist(A, A') is calculated as:

dist(A, A') = (A · A') / (‖A‖ ‖A'‖)

where A denotes one sentence-level feature, A' denotes another sentence-level feature, and ‖·‖ denotes the second-order (L2) norm;
from the cosine distance dist(A, A'), the intra-class loss and the inter-class loss are calculated as:

S_{ji,k} = w · dist(e_{ji}, e_k) + b

where j indexes a voice segment of the ith macaque and k indexes another voice segment; j = k means the two segments belong to the same macaque and the calculated loss value S_{ji,k} is an intra-class loss; j ≠ k means the two segments belong to different macaques and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value and b an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is then calculated as:

Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
7. An end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: a trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module preprocesses the macaque voice pair to be verified, the macaque voice pair being two macaque voice segments;
the verification module inputs the preprocessed macaque voice pair into the pre-trained macaque voiceprint verification model to conclude whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network connected in sequence; the backbone network extracts frame-level features; the feature fusion network performs cyclic frame interception and grouping on the extracted frame-level feature vectors and maps them into fused frame features based on a channel-weighted fusion mechanism; and the feature compression network compresses the fused frame features to obtain sentence-level features corresponding to the macaque voice segments;
the input of the backbone network is the preprocessed voice-format data and the output is a frame-level feature vector; the backbone network comprises, connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 1×1 channel-conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing is as follows:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice-format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimension; the 1×1 channel-conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
CN202110313689.3A 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion Active CN113129908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313689.3A CN113129908B (en) 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110313689.3A CN113129908B (en) 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion

Publications (2)

Publication Number Publication Date
CN113129908A CN113129908A (en) 2021-07-16
CN113129908B (en) 2022-07-26

Family

ID=76774077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313689.3A Active CN113129908B (en) 2021-03-24 2021-03-24 End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion

Country Status (1)

Country Link
CN (1) CN113129908B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763966B (en) * 2021-09-09 2024-03-19 武汉理工大学 End-to-end text irrelevant voiceprint recognition method and system
CN116386647B (en) * 2023-05-26 2023-08-22 北京瑞莱智慧科技有限公司 Audio verification method, related device, storage medium and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN111243602A (en) * 2020-01-06 2020-06-05 天津大学 Voiceprint recognition method based on gender, nationality and emotional information
CN111524525A (en) * 2020-04-28 2020-08-11 平安科技(深圳)有限公司 Original voice voiceprint recognition method, device, equipment and storage medium
CN112331216A (en) * 2020-10-29 2021-02-05 同济大学 Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356168B2 (en) * 2004-04-23 2008-04-08 Hitachi, Ltd. Biometric verification system and method utilizing a data classifier and fusion model
CN105656887A (en) * 2015-12-30 2016-06-08 百度在线网络技术(北京)有限公司 Artificial intelligence-based voiceprint authentication method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN111243602A (en) * 2020-01-06 2020-06-05 天津大学 Voiceprint recognition method based on gender, nationality and emotional information
CN111524525A (en) * 2020-04-28 2020-08-11 平安科技(深圳)有限公司 Original voice voiceprint recognition method, device, equipment and storage medium
CN112331216A (en) * 2020-10-29 2021-02-05 同济大学 Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speaker Recognition Based on Long Short-Term Memory Networks; Qihang Xu; 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP); 2021-02-04; full text *
Research and development of voiceprint recognition based on machine learning; Qi Xiaobo; China Master's Theses Full-text Database; 2020-02-15 (No. 2); full text *

Also Published As

Publication number Publication date
CN113129908A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN113488058B (en) Voiceprint recognition method based on short voice
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN108922559A (en) Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
WO2021051628A1 (en) Method, apparatus and device for constructing speech recognition model, and storage medium
CN112908341B (en) Language learner voiceprint recognition method based on multitask self-attention mechanism
CN112669820B (en) Examination cheating recognition method and device based on voice recognition and computer equipment
CN110349588A (en) LSTM network voiceprint recognition method based on word embedding
Golovko et al. A new technique for restricted Boltzmann machine learning
CN112183107A (en) Audio processing method and device
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
CN111312228A (en) End-to-end-based voice navigation method applied to electric power enterprise customer service
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN105741853A (en) Digital speech perception hash method based on formant frequency
CN114898775B (en) Voice emotion recognition method and system based on cross-layer cross fusion
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
Li et al. Fdn: Finite difference network with hierarchical convolutional features for text-independent speaker verification
CN111933117A (en) Voice verification method and device, storage medium and electronic device
CN110689875A (en) Language identification method and device and readable storage medium
CN113129926A (en) Voice emotion recognition model training method, voice emotion recognition method and device
CN117909486B (en) Multi-mode question-answering method and system based on emotion recognition and large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant