CN113129908A - End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion - Google Patents

End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion

Info

Publication number
CN113129908A
CN113129908A (application CN202110313689.3A; granted as CN113129908B)
Authority
CN
China
Prior art keywords
macaque
voice
frame
fusion
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110313689.3A
Other languages
Chinese (zh)
Other versions
CN113129908B (English)
Inventor
Li Songbin (李松斌)
Tang Jigang (唐计刚)
Liu Peng (刘鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences
Original Assignee
Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences filed Critical Research Station Of South China Sea Institute Of Acoustics Chinese Academy Of Sciences
Priority to CN202110313689.3A priority Critical patent/CN113129908B/en
Publication of CN113129908A publication Critical patent/CN113129908A/en
Application granted granted Critical
Publication of CN113129908B publication Critical patent/CN113129908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion, wherein the method comprises the following steps: preprocessing a macaque voice pair to be verified, the macaque voice pair being two macaque voice segments; and inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion on whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification. The macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network which are connected in sequence: the backbone network is used for extracting frame-level features; the feature fusion network is used for performing cyclic frame interception and grouping on the extracted frame-level feature vectors and mapping the frame-level features into fusion-frame features based on a channel-weighted fusion mechanism; and the feature compression network is used for compressing the fusion-frame features to obtain sentence-level features corresponding to the macaque voice segments.

Description

End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion
Technical Field
The invention relates to the field of computer technology, in particular to an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
Background
Primates face a serious survival crisis. To protect primates effectively, it is important to know individual animals' ranges of motion and population changes, and both depend on individual animal verification and tracking. Individual animal verification is thus a foundational research problem, an important basis for realizing individual animal tracking, and of significant research value.
Commonly used individual animal verification techniques include manual observation, DNA fingerprinting, marking, image-based verification and voice-based verification. Most primates live in mountain forests where effective visual observation is difficult, and they are highly alert and hard for humans to approach, which makes direct observation, DNA fingerprinting and marking difficult to implement.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an end-to-end macaque voiceprint verification method and system based on cyclic frame-level feature fusion.
In order to achieve the above object, the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame level feature fusion, the method comprising:
preprocessing a voice pair of a macaque to be verified; the macaque voice pair is two macaque voice sections;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion on whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network which are connected in sequence; the backbone network is used for extracting frame-level features; the feature fusion network is used for performing cyclic frame interception and grouping on the extracted frame-level feature vectors and mapping the frame-level features into fusion-frame features based on a channel-weighted fusion mechanism; and the feature compression network is used for compressing the fusion-frame features to obtain sentence-level features corresponding to the macaque voice segments.
As an improvement of the above method, the preprocessing of the macaque voice pair to be verified specifically comprises the following step:
cutting off the mute sections in the two voice segments to be verified to obtain the preprocessed voice format data.
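The silence-cutting preprocessing described above can be sketched with a simple short-time-energy rule; the frame size, hop and threshold below are assumed values for illustration, not parameters given in the patent:

```python
import numpy as np

def trim_silence(signal, frame_len=400, hop=160, threshold_db=-40.0):
    """Cut the mute (low-energy) sections out of a 1-D waveform."""
    # Short-time energy of each frame, in dB relative to the loudest frame.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(f ** 2) + 1e-12 for f in frames])
    energies_db = 10.0 * np.log10(energies / energies.max())
    # Keep every sample covered by at least one voiced (above-threshold) frame.
    keep = np.zeros(len(signal), dtype=bool)
    for idx, e_db in enumerate(energies_db):
        if e_db > threshold_db:
            keep[idx * hop: idx * hop + frame_len] = True
    return signal[keep]
```

Note the output is still waveform data (voice-format data), not extracted features, matching the patent's remark that preprocessing only removes silence.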
As an improvement of the above method, the input of the backbone network is preprocessed voice format data, and the output is a frame-level feature vector; the backbone network comprises the following components connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 × 1 channel conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing procedure comprises:
the learnable band-pass filter convolution layer converts the time-domain information of the voice format data into frequency-domain information; combined with the 6 one-dimensional residual convolution blocks and their pooling operations, feature dimensions are reduced while features are extracted; the 1 × 1 channel conversion convolution layer performs channel conversion on the output features of the residual convolution blocks, after which the multi-head Transformer blocks output the frame-level feature vectors.
As an improvement of the above method, the input of the feature fusion network is a frame-level feature vector, and the output is a fusion-frame feature vector. The feature fusion network comprises a cyclic frame interception and grouping unit and a channel group: the cyclic frame interception and grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F, and groups F by a preset step length to obtain a grouped representation FG of the feature sequence; the channel group is used for mapping FG into fusion-frame features based on a channel-weighted fusion mechanism. The channel group comprises a plurality of CFM layers, and each CFM layer comprises a first branch and a second branch connected in parallel; the specific processing procedure of the CFM layer is as follows:
transposing the grouped frame-level feature vectors through the first branch to obtain the first part of the fusion-frame features; processing the grouped frame-level feature vectors through the 2 fully connected (FC) layers of the second branch, then through 1 max pooling layer and 1 average pooling layer respectively, performing matrix addition on the outputs of the max pooling layer and the average pooling layer, and applying a sigmoid layer for activation to obtain the second part of the fusion-frame features; and performing a dot product between the first part and the second part of the fusion-frame features to obtain the fusion-frame feature vector.
As an improvement of the above method, the input of the feature compression network is a fusion-frame feature vector, and the output is a sentence-level feature vector of dimension d; the feature compression network comprises 1 gated recurrent unit (GRU) and 1 fully connected layer connected in sequence;
the output of the feature compression network is a sentence-level feature vector e with the dimension d:
e = h(x),  e ∈ R^d
len(x) = l
where R represents a real number, e includes d real numbers, x represents a feature vector matrix at a frame level, l is the length of x, h () represents an embedded mapping function, and len () represents the number of frames in the feature vector matrix.
As an improvement of the above method, the method further comprises a training step and a testing step of the macaque voiceprint verification model; the method specifically comprises the following steps:
step 1) preprocessing a voice section of a macaque corpus, wherein the preprocessed macaque corpus comprises a plurality of corpora of T macaques;
step 2) randomly selecting T1 voice segments of the macaques from the preprocessed macaque corpus as a training corpus, and using the rest T-T1 voice segments of the macaques as a test corpus;
step 3) selecting data from the training corpus to establish a training set; randomly dividing training set data into q groups, wherein each group comprises m macaques, and each macaque has n voice sections;
step 4) extracting equal number of positive sample voice pairs and negative sample voice pairs from a test corpus to form a test set, wherein the positive sample voice pairs are two different voices of the same macaque, and the negative sample voice pairs are respective voices of the two macaques;
step 5) sequentially inputting the q groups of data in the training set into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with a slope of -0.3, training with the AMSGrad optimizer, calculating the loss function, back-propagating the loss value through the back-propagation algorithm and updating the network parameters; a training period is complete after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period, and calculating the accuracy result for the current training period;
step 7) repeating step 5) and step 6) until P training periods are finished; and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
As an improvement of the above method, the specific process of calculating the loss function is as follows:
according to the sentence-level features of each group of macaque voice segments, the cosine distance dist(A, A') is calculated as follows:
dist(A, A') = (A · A') / (||A||₂ · ||A'||₂)
wherein, A represents a sentence-level feature, A' represents another sentence-level feature, | | | | | represents a second-order norm;
from the cosine distance dist (A, A'), the intra-class loss and inter-class loss are calculated as follows:
S_{ji,k} = w · dist(A, A') + b
wherein j denotes the jth voice segment of the ith macaque and k denotes another voice segment; j = k indicates that the two voice segments belong to the same macaque, and the calculated loss value S_{ji,k} is an intra-class loss; j ≠ k indicates that the two voice segments belong to different macaques, and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value and b is an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is calculated by the following formula:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
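The intra-/inter-class contrast described above can be sketched in NumPy under a GE2E-style reading of the loss: each utterance embedding is scored against a per-macaque centroid, scaled by w and offset by b, and contrasted through a log-sum-exp. The centroid comparison and the w, b values below are our assumptions for illustration:

```python
import numpy as np

def cosine_distance(a, b):
    # dist(A, A') = (A . A') / (||A||2 ||A'||2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verification_loss(embeddings, w=10.0, b=-5.0):
    """Average softmax contrast of intra- vs inter-class similarities.

    embeddings: array of shape (m, n, d) -- m macaques, n utterances each,
    d-dimensional sentence-level features.  w and b play the role of the
    patent's learnable weight and offset in S_{ji,k} = w * dist + b.
    """
    m, n, d = embeddings.shape
    centroids = embeddings.mean(axis=1)          # one centroid per macaque
    total = 0.0
    for i in range(m):
        for j in range(n):
            e = embeddings[i, j]
            # similarity of utterance ji to every macaque centroid k
            s = np.array([w * cosine_distance(e, centroids[k]) + b
                          for k in range(m)])
            # loss_ji = -S_{ji, own class} + log(sum_k exp(S_{ji,k}))
            total += -s[i] + np.log(np.sum(np.exp(s)))
    return total / (m * n)
```

Well-separated per-macaque clusters drive the loss toward zero, while collapsed embeddings leave it near log(m), which is what the training in step 5) exploits.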
an end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: the trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module is used for preprocessing a voice pair of the macaque to be verified; the macaque voice pair is two macaque voice sections;
the verification module is used for inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion whether the macaque voice pair to be verified belongs to the same individual macaque or not, and therefore voiceprint verification is achieved.
Compared with the prior art, the invention has the advantages that:
1. by processing macaque voice as voice pairs, the designed backbone network can automatically extract frame-level features from the sampled data of a macaque voice pair; the designed feature fusion network maps the frame-level features into fusion-frame features; and the feature compression network then compresses the fusion-frame features to obtain sentence-level features corresponding to the macaque voice segments;
2. the macaque voiceprint verification method provided by the invention realizes the shift from a closed-data-set multi-classification model to a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the animal individual verification task is redefined from a multi-classification task to a feature representation task. The voiceprint verification algorithm can therefore meet the requirements of more application scenarios, and the trained model can be used to verify or identify unknown targets;
3. the method of the invention can be applied to voiceprint verification of macaques and voiceprint identification of other animals subsequently, and has guiding significance.
Drawings
FIG. 1 is a schematic flow chart of the end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion according to the present invention;
fig. 2 is a schematic structural diagram of the backbone network provided by the present invention;
FIG. 3 is a schematic diagram of the overall structure of the end-to-end macaque voiceprint verification network based on cyclic frame-level feature fusion provided by the present invention;
fig. 4 is a block diagram of the end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion provided by the invention.
Reference numerals
410: data processing module; 420: backbone network; 430: feature fusion network;
440: feature compression network; 450: network parameter update module; 460: verification module
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, embodiment 1 of the present invention provides an end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion, which includes the following steps:
step 110: and selecting a voice pair from the preprocessed macaque corpus according to rules to construct a training set and a test set. The training set data is randomly divided into q groups, and each group comprises the voice sections of m macaques.
In the prior art, feature extraction is usually performed on voice data in a preprocessing stage to obtain MFCC, LPC or a spectrogram for classification by a classification model. The preprocessing of the invention is to intercept the effective voice segment, namely to cut off the mute segment in the original voice, but not to extract the characteristics, and the preprocessed macaque corpus is still the data of the voice format.
Step 120: randomly read a group of macaque voice segments from the training set, input them into the backbone network, and perform feature extraction to obtain the frame-level feature vectors of the macaque voice segments.
The backbone network provided by the invention is designed based on a CNN; an existing feature extraction network such as VGG or ResNet could also be adopted. The invention designs a new backbone network combining SincNet, ResNet and Transformers to extract the frame-level features of macaque voice.
Step 130: perform cyclic frame interception on the frame-level feature vectors output by the backbone network, input them into the feature fusion network, and perform feature fusion to obtain the fusion-frame feature vectors of the macaque voice segments.
The feature fusion network provided by the invention is designed based on a channel fusion mechanism, and the frame-level features are mapped into fusion frame features by weighting and fusing a plurality of adjacent channel features.
Step 140: input the fusion-frame feature vectors output by the feature fusion network into the feature compression network and perform feature compression to obtain the sentence-level feature vectors corresponding to the macaque voice segments.
Step 150: according to the sentence-level feature vectors of the group of macaque voice segments, calculate the intra-class loss and inter-class loss using the cosine distances of the vectors, and update the parameters of the backbone network, feature fusion network and feature compression network with AMSGrad.
Step 160: repeat steps 120-150 iteratively until the trained network achieves the highest accuracy on the test set, obtaining the optimal parameter combination of the network.
A group of data is repeatedly drawn at random from the constructed training set and input into the backbone network, and steps 120 to 150 are repeated; a training period ends when all data in the training set have been used once. At the end of each training period, the network model is tested once on the test set and the test accuracy recorded. When the preset number of training periods is completed, the network parameters obtained at the highest test accuracy are taken as the optimal parameter combination of the network.
Step 170: verify whether any macaque voice pair comes from the same individual based on the network with optimal parameters.
The network model with its parameters set to the optimal combination is used to examine macaque voice pairs outside the training and test sets; it can judge whether the two voices in a pair were produced by the same macaque, realizing macaque voiceprint verification.
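The final decision reduces to thresholding the cosine similarity of the two sentence-level embeddings produced by the model; a minimal sketch, with an assumed operating threshold not given in the patent:

```python
import numpy as np

def verify_pair(emb_a, emb_b, threshold=0.5):
    """Decide whether two sentence-level embeddings come from the same
    macaque by thresholding their cosine similarity.  The threshold is an
    assumed operating point; in practice it would be tuned on the test set."""
    sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return bool(sim >= threshold), float(sim)
```

A returned value of True corresponds to the conclusion that the voice pair belongs to the same individual macaque.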
By processing macaque voice as voice pairs, this embodiment realizes automatic frame-level feature extraction from the sampled data of a macaque voice pair through the designed backbone network, maps the frame-level features into fusion-frame features with the designed feature fusion network, and then compresses the fusion-frame features with the feature compression network to obtain sentence-level features corresponding to the macaque voice segments.
Optionally, assuming that the preprocessed macaque corpus contains sounds of T macaques, randomly selecting T1 macaque corpora as a training corpus, and taking the rest T-T1 macaque corpora as a test corpus; respectively extracting positive sample pairs and negative sample pairs from a training corpus and a test corpus to construct the training set and the test set; the positive sample pair randomly selects two sections of voices from the corpus of the same macaque, and the negative sample pair randomly selects one section of voice from two different macaque corpora respectively.
For example, the corpus includes the voices of 144 macaques aged 0-2 years; the voice duration of each macaque is 5-30 minutes, the total voice duration is 2143.11 minutes, the effective duration after cutting the mute sections is 171.35 minutes, and there are 18309 macaque voice segments in total. During training, the first 100 macaque corpora can be selected as the training set and the last 44 macaque corpora as the test set. A positive sample voice pair is constructed by randomly selecting two voice segments from one macaque's corpus. A negative sample voice pair is constructed by randomly selecting two targets (i.e. the voice file directories of two different macaques) from the test corpus, and then randomly selecting one voice segment from each of the two target corpora. Finally, 80000 voice pairs are constructed as the test set for macaque voiceprint verification, with a positive-to-negative sample ratio of 1:1, i.e. 40000 positive sample voice pairs and 40000 negative sample voice pairs. The 80000 voice pairs are grouped by a preset number b to obtain n = 80000/b groups of test data. The test set extracts voice pairs in the same way as the training set.
When testing across data sets, in order to solve the voice duration difference of different data sets, the testing voice can be compressed or expanded to a fixed length in a cutting or copying mode. The formula is expressed as follows:
x' = x[1:L], if len(x) ≥ L
x' = (x copied end to end)[1:L], if len(x) < L
where L is the fixed target length: a longer segment is cut to L samples, and a shorter one is copied until L samples are reached.
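The cut-or-copy rule for fixing the duration of cross-dataset test voice can be sketched as follows (the function name is ours):

```python
import numpy as np

def to_fixed_length(x, target_len):
    """Crop a segment longer than target_len, or tile (copy) a shorter one
    end to end, so that test voices share a single fixed duration."""
    if len(x) >= target_len:
        return x[:target_len]          # cut
    reps = int(np.ceil(target_len / len(x)))
    return np.tile(x, reps)[:target_len]  # copy, then trim the overshoot
```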
optionally, intra-class loss and inter-class loss calculation is performed according to sentence-level features of each group of macaque speech segments by using cosine distances of vectors. Wherein the cosine distance calculation of the vector is represented as:
dist(A, A') = (A · A') / (||A||₂ · ||A'||₂)
wherein, A represents a sentence-level feature, A' represents another sentence-level feature, | | | | | represents a second-order norm;
the intra-class loss and inter-class loss calculations are expressed as:
S_{ji,k} = w · dist(A, A') + b
wherein A denotes the jth voice segment of the ith macaque, and k indexes the other voice segment in the pair, denoted A'; j = k indicates that the two voice segments belong to the same macaque and the calculation yields an intra-class loss; otherwise the two voice segments belong to different macaques and the calculation yields an inter-class loss. The total loss function of the network constructed by the invention is calculated as:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
VGG and ResNet are widely applied to speaker verification, but differences in input data place different structural requirements on the feature extraction network, so the invention designs a backbone network for extracting features of macaque voices. Fig. 2 is a schematic structural diagram of the backbone network, which is composed of a learnable band-pass filter SincNet layer, 6 residual convolution blocks (ResBlock), 1 channel conversion convolution layer and 2 Transformer blocks.
The SincNet layer is composed, in sequence, of 1 sinc one-dimensional convolution, 1 max pooling layer with a pooling window of 3, 1 BN layer and a Leaky ReLU activation. Each ResBlock is composed of two convolution units, 1 max pooling layer and 1 feature weight scaling layer (FMS), where the pooling window of the max pooling layer is 3 and each convolution unit consists, in sequence, of 1 BN layer, a Leaky ReLU activation and 1 two-dimensional convolution Conv with kernel size 3 and stride 1. The channel conversion convolution layer has kernel size 1 and stride 1. Each Transformer block is composed of 2 fully connected units, each consisting, in sequence, of 1 multi-head attention mechanism (MHA) and 2 groups of FC and Dropout layers. The network parameters and data dimensions of each layer of the backbone network are shown in Table 1.
TABLE 1: network parameters and data dimensions of each layer of the backbone network (presented as an image in the original publication)
FMS is a feature weight scaling mechanism that assigns a different weight to each frame; without it, ResBlock would assign the same weight to every frame. MHA is a multi-head attention mechanism.
In the embodiment of the invention, a group of macaque voice segments is randomly read from the training set and the voice sample values of the macaque voice segments are input into the backbone network. The SincNet layer converts the time-domain information of the macaque voice into frequency-domain information, and, combined with the 6 one-dimensional residual convolution blocks and their pooling operations, feature dimensions are reduced while features are extracted. To accommodate the multi-head design of the Transformer blocks, before the data enters the Transformer blocks one 1 × 1 channel conversion convolution performs channel conversion on the output features of the residual convolution blocks; finally the backbone network outputs a two-dimensional feature map as the input of the feature fusion network.
The residual structure consisting of one-dimensional convolution residual units and pooling layers can extract frame-level feature representations from raw voice, realizes feature transfer between different convolution layers, and helps fuse semantic features from earlier and later stages of network learning. A 1 × 1 convolution layer is added to connect the residual modules with the Transformer module and to convert the number of feature-map channels, so that the number of heads can be chosen in the subsequent Transformer design. Adding the Transformer structure further strengthens the feature extraction capability of the backbone network, realizes end-to-end feature extraction, automatically extracts the frame-level features of macaque voice, and removes the need for complex hand-designed feature extraction algorithms.
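A SincNet-style band-pass filter is parameterized only by its two cutoff frequencies, which is what makes the layer learnable with so few parameters. The sketch below builds one such windowed-sinc kernel with fixed (non-learned) cutoffs, purely to illustrate the construction:

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_len=101, sr=16000):
    """Impulse response of a band-pass filter built as the difference of two
    low-pass sinc filters, Hamming-windowed for finite length.  In a SincNet
    layer only f_low and f_high would be trained; here they are fixed."""
    t = np.arange(kernel_len) - (kernel_len - 1) / 2
    def lowpass(fc):
        # ideal low-pass with cutoff fc (np.sinc is the normalized sinc)
        return 2 * fc / sr * np.sinc(2 * fc / sr * t)
    h = lowpass(f_high) - lowpass(f_low)
    return h * np.hamming(kernel_len)
```

Convolving the raw waveform with a bank of such kernels is what converts the time-domain voice data into band-limited (frequency-domain) information in the backbone's first layer.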
Fig. 3 is a schematic diagram of the overall structure of the macaque voiceprint verification network according to an embodiment of the present invention. The waveform at the leftmost end of fig. 3 represents the input voice-pair sample data, CNN represents the backbone network, and the frame-level feature vector output by the backbone network is denoted M = [m_1, m_2, ..., m_T], where m_i ∈ R^n, n is the frame-level feature dimension and T is the number of frames. Cyclic frame interception is performed before the frame-level features output by the backbone network are input into the feature fusion network. Specifically, the frame-level feature vectors are connected end to end to obtain a feature sequence F as follows:
F = [f_1, f_2, ..., f_T, f_1, f_2, ..., f_{c-1}]
grouping F by preset step length to obtain a grouping representation FG of the characteristic sequence:
FG = ([f_1, f_2, ..., f_c], [f_2, f_3, ..., f_{c+1}], ..., [f_T, f_1, f_2, ..., f_{c-1}])
where c represents the number of frames per feature set. And inputting the frame-level feature vectors grouped in the FG into a feature fusion network for feature fusion, thereby obtaining the fusion frame feature vectors of the voice segments of the macaque.
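The cyclic frame interception and grouping just described can be sketched as follows (with an assumed step length of 1, so one group is produced per original frame):

```python
import numpy as np

def cyclic_frame_groups(frames, c, step=1):
    """Connect the frame-level features end to end and slide a window of c
    frames over them, producing the grouped representation FG.

    frames: array of shape (T, n); returns an array of shape (T//step, c, n).
    """
    T = frames.shape[0]
    # F: append the first c-1 frames to the end (cyclic extension)
    extended = np.concatenate([frames, frames[:c - 1]], axis=0)
    return np.stack([extended[i:i + c] for i in range(0, T, step)])
```

The last group wraps around to the first frames, which is exactly what the cyclic extension of F provides.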
As shown in fig. 3, in order to perform more efficient feature fusion, the present invention designs a feature fusion network based on a Channel Fusion Mechanism (CFM). The feature fusion network comprises two branches: the first branch is a transposition operation, and the second branch is composed, in sequence, of 2 FC layers in series, a max pooling (MaxPool) layer and an average pooling (AvgPool) layer in parallel, and 1 sigmoid layer;
transposing the grouped frame-level feature vectors through the first branch to obtain the first part of the fusion-frame features; processing the grouped frame-level feature vectors through the 2 FC layers of the second branch, then through the MaxPool layer and the AvgPool layer respectively; performing matrix addition on the outputs of the MaxPool and AvgPool layers (i.e. adding corresponding elements), then applying the sigmoid layer for activation to obtain the second part of the fusion-frame features; and performing a dot product between the first part and the second part of the fusion-frame features to obtain the fusion-frame feature vector of the voice segment.
The calculation process of the fusion frame feature vector can be represented as follows:
FG_i' = g(f(FG_i)) = ω_2(ω_1(FG_i) + b_1) + b_2
m = MaxP(FG_i', c)
a = AvgP(FG_i', c)
w_i = σ(m + a)
f_i* = tp(FG_i) · w_i
wherein f and g denote the mapping functions of the two fully connected layers, whose parameters are ω_1, b_1 and ω_2, b_2 respectively; MaxP and AvgP denote max pooling and average pooling; σ is the sigmoid function; tp denotes transposition; and · denotes the dot product. The ith feature group FG_i is finally mapped to a feature f_i*, the ith fused feature vector, where FG_i ∈ R^{c×n} and f_i* ∈ R^n.
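Under one plausible reading of the dimensions above (pooling across the n channels to yield one weight per frame, so that tp(FG_i) · w_i lands in R^n), a single CFM step can be sketched as follows; the FC weights are placeholders that would be learned in practice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cfm_layer(FG_i, w1, b1, w2, b2):
    """One Channel Fusion Mechanism step for a single group FG_i of shape
    (c, n): two FC layers, parallel max/average pooling, sigmoid gating,
    and a dot product with the transposed group."""
    h = (FG_i @ w1 + b1) @ w2 + b2   # FG_i': two FC layers, shape (c, n)
    m = h.max(axis=1)                # max pooling, one value per frame
    a = h.mean(axis=1)               # average pooling, one value per frame
    w_i = sigmoid(m + a)             # frame weights w_i in (0, 1)^c
    return FG_i.T @ w_i              # f_i* = tp(FG_i) . w_i, in R^n
```

With zero weights the gate is uniformly 0.5 and the layer reduces to half the sum of the c frames, which makes the weighted-fusion interpretation easy to check.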
In the voiceprint verification task, the correlation between the feature representation of each frame (moment) of the time-frequency two-dimensional features and the final sentence-level features needs to be represented, so that the invention provides a channel weighting fusion mechanism, a new frame-level feature representation is obtained through the weighting fusion of a plurality of adjacent channel features, the interframe relevance degree of a feature graph is effectively enhanced, and the effective mapping from the frame-level features to the fusion frame features is realized.
As shown in fig. 3, the feature compression network includes a gated recurrent unit (GRU) and a fully connected (FC) layer. After being processed by the GRU unit and the FC layer in sequence, the fusion frame feature vector is mapped into a sentence-level feature vector e with dimensionality d:
e = h(x), e ∈ R^d
len(x) = l
where R denotes the set of real numbers (e consists of d real numbers), x denotes the frame-level feature vector matrix, l is the length of x, h() denotes the embedding mapping function, and len() denotes the number of frames in the feature vector matrix. The feature compression network maps the input two-dimensional features into one-dimensional features; this process can be regarded as an embedding of the original voice segment data.
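The GRU-plus-FC compression step can be sketched in PyTorch as follows. The GRU hidden size is an illustrative assumption; d = 256 follows the sentence-level dimension stated in this description.

```python
# Sketch of the feature compression network: 1 GRU followed by 1 FC layer.
# Assumption: the GRU hidden size (128) is not specified in the patent.
import torch
import torch.nn as nn

class FeatureCompression(nn.Module):
    def __init__(self, n: int, hidden: int = 128, d: int = 256):
        super().__init__()
        self.gru = nn.GRU(input_size=n, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, n) -- sequence of l fusion frame feature vectors.
        _, h = self.gru(x)               # final hidden state: (1, batch, hidden)
        return self.fc(h.squeeze(0))     # sentence-level embedding e: (batch, d)
```

The final GRU hidden state summarizes the whole segment, so the FC output is a single d-dimensional embedding e per voice segment, as in e = h(x) above.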
The network model designed by the invention takes voice data of size m × n × l as input, where m is the number of macaques selected for each group of data, n is the number of voice segments selected from each macaque's corpus, and l is the number of frames in each voice segment. After processing by the backbone network, the feature fusion network and the feature compression network, a feature output of size m × n × d is obtained, where d is the sentence-level feature dimension and may be set to 256. In addition, a Leaky ReLU activation function with a slope of -0.3 may be used in the network, the learning rate of the AMSGrad optimizer may be set to 0.001, the decay rate to 0.0001, the number of macaques per voice group to m = 8, and the number of voice segments per macaque to n = 10.
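Under the stated settings, the training configuration might look like the following sketch. Here `model` is a hypothetical placeholder for the full backbone + fusion + compression network, the "attenuation rate" is interpreted as weight decay (an assumption), and PyTorch's LeakyReLU takes the magnitude of the negative slope as a positive `negative_slope` argument.

```python
# Hedged sketch of the training configuration stated above.
# Assumptions: `model` is a placeholder; "attenuation rate" == weight decay.
import torch
import torch.nn as nn

# Placeholder network; the real model combines backbone, fusion, compression.
model = nn.Sequential(nn.Linear(32, 256), nn.LeakyReLU(negative_slope=0.3))

# AMSGrad is the amsgrad=True variant of Adam in PyTorch.
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.001, weight_decay=0.0001, amsgrad=True
)

# Batch geometry from the description: m macaques x n segments x l frames.
m, n, l, d = 8, 10, 100, 256          # l is illustrative here
batch = torch.randn(m * n, l, 32)     # m*n speech segments per training group
```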
The macaque voiceprint verification method provided by the invention converts the design of a closed-set multi-classification model into a sentence-level embedded feature representation model, overcoming a limitation of existing sound-based animal verification algorithms: the animal individual verification task is redefined from a multi-classification task to a feature representation task. The voiceprint verification algorithm can therefore serve more application scenarios, and the trained model can be used to verify or identify unknown targets.
Example 2
As shown in fig. 4, embodiment 2 of the present invention provides a macaque voiceprint verification system, which is implemented based on the macaque voiceprint verification method. In the training and testing stage, the system specifically includes:
the data processing module 410 is configured to select a voice pair from the preprocessed macaque corpus according to rules, construct a training set and a test set, and group the training set and the test set;
the backbone network 420 is used for extracting features of the input macaque speech segment to obtain a frame-level feature vector of the macaque speech segment;
the feature fusion network 430 is configured to perform feature fusion after performing cyclic frame interception on the frame-level feature vectors output by the backbone network, so as to obtain a fusion frame feature vector of a voice segment of the macaque;
a feature compression network 440, configured to perform feature compression on the fusion frame features output by the feature fusion network to obtain sentence-level feature vectors corresponding to the macaque speech segments;
a network parameter updating module 450, configured to calculate intra-class loss and inter-class loss from the sentence-level feature vectors of the macaque speech segments using cosine distances between vectors, update parameters in the backbone network, the feature fusion network, and the feature compression network using AMSGrad, and iterate until the trained network reaches the highest accuracy on the test set, so as to obtain the optimal parameter combination of the network;
And the verification module 460 is configured to verify whether any macaque voice pair belongs to the same individual according to the output of the optimal parameter network.
The macaque voiceprint verification system provided by the invention processes macaque voices into voice pairs. Frame-level features of the sampled data of a macaque voice pair are extracted automatically by the designed backbone network, mapped into fusion frame features by the designed feature fusion network, and then compressed by the feature compression network to obtain the sentence-level features corresponding to the macaque voice segments.
Optionally, the data processing module 410 further includes:
a corpus dividing unit, configured to randomly select the corpora of T1 macaques from the T macaques contained in the preprocessed macaque corpus as the training corpus, with the corpora of the remaining T-T1 macaques serving as the test corpus;
a voice pair extraction unit, configured to extract positive sample pairs and negative sample pairs from the training corpus and the test corpus respectively to construct the training set and the test set; a positive sample pair consists of two voice segments randomly selected from the corpus of the same macaque, and a negative sample pair consists of one voice segment randomly selected from each of two different macaques' corpora.
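The pair construction rule above can be sketched as follows. The `corpus` dictionary, the macaque IDs, and the segment labels are hypothetical stand-ins for the preprocessed corpus; only the sampling logic reflects the description.

```python
# Sketch of positive/negative voice pair sampling.
# Assumption: `corpus` maps a macaque ID to its list of preprocessed segments.
import random

def sample_positive_pair(corpus: dict) -> tuple:
    """Two different voice segments from the same macaque."""
    monkey = random.choice(list(corpus))
    return tuple(random.sample(corpus[monkey], 2))

def sample_negative_pair(corpus: dict) -> tuple:
    """One voice segment from each of two different macaques."""
    a, b = random.sample(list(corpus), 2)
    return random.choice(corpus[a]), random.choice(corpus[b])

# Hypothetical toy corpus: 4 macaques with 5 segments each.
corpus = {f"m{i}": [f"m{i}_seg{j}" for j in range(5)] for i in range(4)}
pos = sample_positive_pair(corpus)
neg = sample_negative_pair(corpus)
```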
The backbone network 420 is designed based on CNNs and may employ a feature extraction network such as VGG or ResNet. The invention designs a new backbone network combining SincNet, ResNet and Transformer blocks to extract the frame-level features of macaque voice; its specific structure is shown in FIG. 2 and Table 1 and is not repeated here. In addition, the specific structures of the feature fusion network 430 and the feature compression network 440 are shown in fig. 3; reference may be made to the detailed description in the method section, which is not repeated herein.
After training and testing are finished, the trained macaque voiceprint verification model is obtained and the practical application stage begins. The system then comprises: the trained macaque voiceprint verification model, a data processing module 410 and a verification module 460; wherein,
the data processing module is used for preprocessing a voice pair of the macaque to be verified; the macaque voice pair is two macaque voice sections;
the verification module is used for inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion whether the macaque voice pair to be verified belongs to the same individual macaque or not, and therefore voiceprint verification is achieved.
The trained macaque voiceprint verification model comprises a backbone network 420, a feature fusion network 430 and a feature compression network 440 which are connected in sequence.
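A minimal sketch of the verification decision follows, assuming that the two sentence-level embeddings produced by the trained model are compared by cosine similarity against a threshold. The threshold value is illustrative, not taken from the patent.

```python
# Sketch of the verification module's decision rule.
# Assumption: the decision threshold (0.5) is illustrative.
import torch
import torch.nn.functional as F

def verify(e1: torch.Tensor, e2: torch.Tensor, threshold: float = 0.5) -> bool:
    """True if the two sentence-level embeddings likely belong to one macaque."""
    return F.cosine_similarity(e1, e2, dim=0).item() >= threshold

# Usage with toy embeddings: nearly parallel vectors pass, orthogonal ones fail.
same = verify(torch.tensor([1.0, 0.0]), torch.tensor([0.9, 0.1]))
diff = verify(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]))
```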
It should be noted that, for convenience of description, the drawings show only some, not all, of the content related to the embodiments of the present invention. Some example embodiments are described as processes or methods depicted as flowcharts; although a flowchart describes operations (or steps) as a sequential process, many of the operations can be performed in parallel or concurrently, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. An end-to-end macaque voiceprint verification method based on cycle frame level feature fusion, the method comprising:
preprocessing a voice pair of a macaque to be verified; the macaque voice pair is two macaque voice sections;
inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion as to whether the macaque voice pair to be verified belongs to the same individual macaque, thereby realizing voiceprint verification;
the macaque voiceprint verification model comprises a backbone network, a feature fusion network and a feature compression network which are connected in sequence; the backbone network is used for extracting frame-level features; the feature fusion network is used for performing cyclic frame interception and grouping on the extracted frame-level feature vectors and mapping the frame-level features into fusion frame features based on a channel weighting fusion mechanism; and the feature compression network is used for compressing the fusion frame features to obtain sentence-level features corresponding to the macaque voice segments.
2. The end-to-end macaque voiceprint validation method based on loop frame-level feature fusion as claimed in claim 1, wherein the macaque speech pair to be validated is preprocessed; the method specifically comprises the following steps:
cutting off the mute section in the two sections of voice sections to be verified to obtain the preprocessed voice format data.
3. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion according to claim 2, wherein the input of the backbone network is preprocessed voice format data, and the output is a frame-level feature vector; the backbone network comprises the following components which are connected in sequence: 1 learnable band-pass filter convolutional layer, 6 one-dimensional residual convolution blocks, 1 × 1 channel conversion convolutional layer and 2 multi-head Transformer blocks; the specific processing procedure comprises the following steps:
the learnable band-pass filter convolutional layer converts the time-domain information of the voice format data into frequency-domain information; the 6 one-dimensional residual convolution blocks and their pooling operations extract features while reducing the feature dimensions; the 1 × 1 channel conversion convolutional layer performs channel conversion on the output features of the residual convolution blocks; and the multi-head Transformer blocks then output the frame-level feature vectors.
4. The end-to-end macaque voiceprint verification method based on cyclic frame-level feature fusion according to claim 3, wherein the input of the feature fusion network is a frame-level feature vector and the output is a fusion frame feature vector; the feature fusion network comprises a cyclic frame interception grouping unit and a channel group; the cyclic frame interception grouping unit connects the frame-level feature vectors end to end to obtain a feature sequence F, and groups F according to a preset step length to obtain a grouped representation FG of the feature sequence; the channel group is used for mapping FG into fusion frame features based on a channel weighting fusion mechanism; the channel group comprises a plurality of CFM layers, and each CFM layer comprises a first branch and a second branch which are connected in parallel; the specific processing procedure of the CFM layer is as follows:
transposing the grouped frame-level feature vectors through a first branch to obtain a first part of the fusion frame features; after the grouped frame-level feature vectors are processed by 2 FC of a second branch, the grouped frame-level feature vectors respectively pass through 1 maximum pooling layer and 1 average pooling layer, the output of the maximum pooling layer and the output of the average pooling layer are subjected to matrix addition calculation, and then a sigmoid layer is used for activation processing to obtain a second part of the fusion frame feature; and performing dot product calculation on the first part of the fusion frame characteristics and the second part of the fusion frame characteristics to obtain a fusion frame characteristic vector.
5. The end-to-end macaque voiceprint verification method based on cycle frame level feature fusion according to claim 4, wherein the input of the feature compression network is a fusion frame feature vector, the output is a sentence level feature vector with dimension d, and the feature compression network comprises 1 gate control cycle unit and 1 full connection layer which are connected in sequence;
the output of the feature compression network is a sentence-level feature vector e with the dimension d:
e = h(x), e ∈ R^d
len(x) = l
where R denotes the set of real numbers (e consists of d real numbers), x denotes the frame-level feature vector matrix, l is the length of x, h() denotes the embedding mapping function, and len() denotes the number of frames in the feature vector matrix.
6. The end-to-end macaque voiceprint validation method based on cycle frame level feature fusion as claimed in claim 5, wherein the method further comprises a training step and a testing step of a macaque voiceprint validation model; the method specifically comprises the following steps:
step 1) preprocessing a voice section of a macaque corpus, wherein the preprocessed macaque corpus comprises a plurality of corpora of T macaques;
step 2) randomly selecting the corpora of T1 macaques from the preprocessed macaque corpus as the training corpus, with the corpora of the remaining T-T1 macaques serving as the test corpus;
step 3) selecting data from the training corpus to establish a training set; randomly dividing training set data into q groups, wherein each group comprises m macaques, and each macaque has n voice sections;
step 4) extracting an equal number of positive sample voice pairs and negative sample voice pairs from the test corpus to form a test set, wherein a positive sample voice pair consists of two different voice segments of the same macaque, and a negative sample voice pair consists of one voice segment from each of two different macaques;
step 5) sequentially inputting the q groups of data in the training set into the macaque voiceprint verification model, setting the learning rate to 0.001 and the decay rate to 0.0001, using a Leaky ReLU activation function with a slope of -0.3, training with AMSGrad, calculating the loss function, back-propagating the loss value through the back-propagation algorithm, and updating the network parameters; one training period is completed after all q groups of data have been input once;
step 6) sequentially inputting the test set data into the macaque voiceprint verification model obtained in the current training period, and calculating the accuracy result of the current training period;
step 7) repeating step 5) and step 6) until P training periods are finished; and selecting the network parameter combination corresponding to the maximum of the P accuracy results as the optimal parameter combination of the macaque voiceprint verification model, thereby obtaining the trained macaque voiceprint verification model.
7. The end-to-end macaque voiceprint verification method based on cycle frame level feature fusion as claimed in claim 6, wherein the specific process of calculating the loss function is as follows:
according to the sentence-level features of each group of macaque speech segments, the cosine distance dist(A, A') is calculated as follows:
dist(A, A') = (A · A') / (||A|| ||A'||)
wherein A denotes one sentence-level feature, A' denotes another sentence-level feature, and || · || denotes the second-order (L2) norm;
from the cosine distance dist (A, A'), the intra-class loss and inter-class loss are calculated as follows:
S_{ji,k} = w · dist(e_{ji}, e_k) + b
wherein j denotes a voice segment of the i-th macaque and k denotes another voice segment; when j = k, the two voice segments belong to the same macaque, and the calculated loss value S_{ji,k} is an intra-class loss; when j ≠ k, the two voice segments belong to different macaques, and the calculated loss value S_{ji,k} is an inter-class loss; w is a weight value, and b is an offset;
the loss function Loss_{ji} of the macaque voiceprint verification model is calculated by the following formula:
Loss_{ji} = -S_{ji,j} + log Σ_k exp(S_{ji,k})
8. An end-to-end macaque voiceprint verification system based on cyclic frame-level feature fusion, the system comprising: a trained macaque voiceprint verification model, a data processing module and a verification module; wherein,
the data processing module is used for preprocessing a voice pair of the macaque to be verified; the macaque voice pair is two macaque voice sections;
the verification module is used for inputting the preprocessed macaque voice pair into a pre-trained macaque voiceprint verification model to obtain a conclusion whether the macaque voice pair to be verified belongs to the same individual macaque or not, and therefore voiceprint verification is achieved.