CN115083433A - DNN-based text-independent timbre clustering method - Google Patents

DNN-based text-independent timbre clustering method

Info

Publication number
CN115083433A
CN115083433A (application CN202210634114.6A)
Authority: CN (China)
Prior art keywords: clustering, DNN, training, embedding, voice
Legal status: Withdrawn
Application number: CN202210634114.6A
Other languages: Chinese (zh)
Inventor: 蒋竺芳
Current Assignee: Individual
Original Assignee: Individual
Application filed by Individual
Priority to CN202210634114.6A; publication of CN115083433A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/18 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 characterised by the analysis technique
    • G10L25/30 characterised by the analysis technique, using neural networks
    • G10L25/45 characterised by the type of analysis window

Abstract

The invention discloses a DNN-based text-independent timbre clustering method, which comprises the following steps: step S10: acquiring a large amount of speaker voice data as a training data set; step S20: constructing an acoustic recognition model; step S30: training the acoustic model; step S40: extracting timbre embedding feature vectors; step S50: applying spectral clustering to the embedding feature vectors to cluster timbres. The invention constructs a multi-class cross entropy objective function, learns a data-driven timbre characterization, and completes timbre clustering with a spectral clustering algorithm better suited to high-dimensional features, thereby improving the matching rate for a specific person's voice and, in turn, the accuracy of speaker-specific voice recognition.

Description

DNN-based text-independent timbre clustering method
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a DNN-based text-independent timbre clustering method.
Background
Timbre clustering is the process of partitioning the audio in an input set into homogeneous segments according to the timbres of multiple speakers, distinguishing each timbre from the others. It attempts to answer the question "what characteristics does a person's voice have", and has a wide range of applications including multimedia information retrieval, speaker analysis, and audio processing. At present, data sets with audio timbre category labels are scarce, so unsupervised clustering algorithms are mostly adopted for timbre identification. A typical timbre clustering system generally consists of two parts: (1) audio feature extraction, in which specific features such as MFCC, LPC, or LPCC are extracted from the original audio; and (2) clustering, in which the number of timbre categories is determined and the extracted audio features are grouped into those categories.
In previous research, MFCC has been widely used as a speech feature. It carries both speaker information and channel information and performs well for speech recognition and other tasks that are sensitive to speech content. Timbre clustering, however, is concerned with speaker identity, so interference from content-dependent information in the features should be reduced; timbre clustering therefore requires a text-independent feature representation.
Clustering: grouping similar data together without concern for what each group means; clustering is an unsupervised learning method. The clustering algorithm used most often in timbre clustering is K-means, and using K-means directly to classify timbres raises several problems. Speech data is typically non-Gaussian, in which case a K-means cluster center is not sufficient to represent a class. Furthermore, speech characteristics are influenced by many factors such as gender, age, and accent, and this structure makes direct clustering perform poorly: for example, the difference between a male and a female speaker is much larger than the difference between two female speakers, which in practice often leads K-means to wrongly gather all male speakers into one cluster and all female speakers into another. The preset number of clusters is also hard to fix, and the initial centroids of K-means are generally chosen at random: different initial centroids give different final clusterings, and misclassification caused by an unreasonable choice of initial centroids is the most common problem.
Disclosure of Invention
The invention aims to provide a DNN (deep neural network)-based text-independent timbre clustering method to overcome the above defects in the prior art.
In order to achieve the purpose, the invention provides the following technical scheme:
A DNN-based text-independent timbre clustering method comprises the following steps:
step S10: acquiring a large amount of speaker voice data as a training data set;
step S20: constructing an acoustic recognition model, the overall structure of which is a feedforward DNN that computes embedding features from variable-length voice data; the computation is divided into acoustic feature extraction and DNN acoustic model construction;
step S30: training the acoustic model, with a multi-class cross entropy objective function constructed as follows:
E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_nk ln P(spkr_k | x(n)_1, ..., x(n)_T)
training the network with the multi-class cross entropy objective function to classify the training speakers, wherein the training optimization assumes that each speaker's timbre forms one class, so that the K speakers in the N training segments give K timbre classes; P(spkr_k | x(n)_1, ..., x(n)_T) is defined as the probability that the timbre class is k given the T input frames x(n)_1, x(n)_2, ..., x(n)_T of segment n, and d_nk is 1 if the speaker label of the speech segment n is k and 0 otherwise;
step S40: extracting timbre embedding feature vectors, wherein the network is trained to produce embedding vectors that carry as much speaker timbre information as possible and generalize to speakers (timbres) not present in the training data, the embedding features being used to capture the timbre of the whole utterance;
step S50: and (3) applying spectral clustering on the embedding characteristic vector to perform tone clustering, and regarding a similarity matrix S, regarding Sij as the weight of an edge between nodes i and j in an undirected graph. Spectral clustering divides an original graph into subgraphs by removing weak edges with smaller weights.
Preferably, the acoustic feature extraction in step S20 comprises the following steps:
s201: receiving an unknown voice signal, and performing pre-emphasis, framing and windowing processing on the unknown voice signal;
s202: performing audio framing with a frame length of 25 ms and a frame shift of 10 ms;
s203: carrying out FFT (fast Fourier transform) on the preprocessed unknown voice signal;
s204: the unknown voice signal after FFT is passed through a Mel filter bank to obtain Mel frequency spectrum;
s205: carrying out logarithmic energy processing on the Mel frequency spectrum to obtain a logarithmic frequency spectrum;
s206: performing a DCT transform on the logarithmic spectrum, taking the first 13 cepstral coefficients together with their first-order and second-order differences, and combining them with the frame energy to form 40-dimensional features that serve as the feature parameters of the DNN model;
preferably, in the step S20, a DNN acoustic model is constructed, where the model overall architecture includes a plurality of frame-level TDNN layers, a statistical pooling layer representing aggregation at a segment level, two sentence-level fully-connected layers, and a last softmax output layer; the first 5 layers of the network are constructed at the frame level, and a time delay neural network is adopted, and the method comprises the following steps:
s211: a designated person reads a section of Chinese text within time t, and takes the section of voice as a voice signal to preprocess the voice signal;
s212: assuming that t is the current time step, firstly splicing the frames at (t-2, t-1, t, t +1, t +2) together as network input;
s213: the next two layers respectively splice the outputs of the previous layer at (t-2, t, t+2) and (t-3, t, t+3); the fourth and fifth layers also operate at the frame level but add no further temporal context.
Preferably, the statistics pooling layer receives the final frame-level output as input, aggregates it over the input segment by computing the mean and standard deviation of the TDNN network output, concatenates these segment-level statistics, and passes them to the two sentence-level fully connected layers.
Preferably, in step S30 the acoustic model's data set is cleaned by removing any audio shorter than 4 seconds, and the training data consists of hundreds of hours of speech from on the order of a thousand speakers.
Preferably, the spectral clustering in step S50 includes the following steps:
s501: constructing a similarity matrix S from the embedding feature vectors, where S_ij is the cosine similarity between the embedding vectors of the i-th and j-th voice segments, and the diagonal elements (i = j) are set to 0;
s502: the laplacian matrix L is calculated and normalization is performed:
L = D - S
L_norm = D^(-1) L
wherein D is a diagonal matrix with
D_ii = Σ_j S_ij
s503: calculating the eigenvalues and eigenvectors of L_norm;
s504: calculating the number of clusters k: in the similarity graph represented by L_norm, the number of connected components equals the algebraic multiplicity of the eigenvalue 0, so a threshold β is set and the number of eigenvalues below β is counted as k;
s505: taking the k smallest eigenvalues λ1, λ2, ..., λk of L_norm and their corresponding eigenvectors p1, p2, ..., pk, and constructing the matrix P ∈ R^(n×k) with p1, p2, ..., pk as columns;
S506: clustering the row vectors y1, y2, ..., yn of P with the k-means algorithm.
Compared with the prior art, the invention has the following beneficial effects: without any manual input, the invention completes timbre clustering by constructing a multi-class cross entropy objective function, learning a data-driven timbre characterization, and applying a spectral clustering algorithm better suited to high-dimensional features, thereby improving the matching rate for a specific person's voice and the accuracy of speaker-specific voice recognition.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic structural diagram of a DNN acoustic model of the present invention;
FIG. 3 is a flow chart of the speech feature extraction of the speech recognition technique of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Referring to fig. 1 to fig. 3, a DNN-based text-independent timbre clustering method according to an embodiment of the present invention includes the following steps:
step S10: acquiring a large amount of speaker voice data as a training data set;
step S20: constructing an acoustic recognition model, the overall structure of which is a feedforward DNN that computes embedding features from variable-length voice data; the computation is divided into acoustic feature extraction and DNN acoustic model construction;
step S30: training the acoustic model, with a multi-class cross entropy objective function constructed as follows:
E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_nk ln P(spkr_k | x(n)_1, ..., x(n)_T)
training the network with the multi-class cross entropy objective function to classify the training speakers. The training optimization assumes that each speaker's timbre forms one class, so that the K speakers in the N training segments give K timbre classes; P(spkr_k | x(n)_1, ..., x(n)_T) is defined as the probability that the timbre class is k given the T input frames x(n)_1, x(n)_2, ..., x(n)_T of segment n, and d_nk is 1 if the speaker label of segment n is k and 0 otherwise. The data set is cleaned by removing any audio shorter than 4 seconds, and the training data consists of hundreds of hours of speech from on the order of a thousand speakers;
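For one-hot labels d_nk, the multi-class cross entropy objective above reduces to summing -ln P over the labelled class of each segment. A minimal NumPy sketch; the toy softmax probabilities and labels below are invented for illustration:

```python
import numpy as np

def multiclass_cross_entropy(probs, labels):
    """Multi-class cross entropy over N training segments and K speakers.

    probs  : (N, K) array, probs[n, k] = P(spkr_k | x(n)_1, ..., x(n)_T)
    labels : (N,) array of speaker indices, so d_nk = 1 iff labels[n] == k
    """
    n = np.arange(len(labels))
    # E = -sum_n sum_k d_nk * ln P; d_nk selects one probability per segment
    return -np.sum(np.log(probs[n, labels]))

# toy example: 3 segments, 2 speakers
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3]])
labels = np.array([0, 1, 0])
loss = multiclass_cross_entropy(probs, labels)
```

Minimizing this loss drives the softmax outputs toward the speaker labels, which is what forces the hidden layers to encode speaker timbre.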
step S40: extracting timbre embedding feature vectors. The network is trained to produce embedding vectors that carry as much speaker timbre information as possible and generalize to speakers (timbres) not present in the training data; the embedding captures the timbre of the whole utterance rather than operating at the frame level. Any fully connected layer after the statistics pooling layer is therefore a reasonable position from which to extract the embedding feature, i.e. the position of embedding a or embedding b in fig. 1, where embedding a is the output of the fully connected layer directly above the statistics pooling layer, and embedding b is extracted from the fully connected layer after the ReLU activation function, so it is a non-linear transform of the pooled statistics.
Step S50: applying spectral clustering to the embedding feature vectors to cluster timbres: each element S_ij of a similarity matrix S is regarded as the weight of the edge between nodes i and j in an undirected graph, and spectral clustering partitions the original graph into subgraphs by cutting weak edges with small weights.
Specifically, the step S20 of extracting the acoustic features includes the following steps:
s201: receiving an unknown voice signal and performing pre-emphasis, framing, and windowing on it; pre-emphasis boosts the high-frequency part of the speech, removes the influence of lip radiation, and increases the high-frequency resolution so that the spectrum becomes flat and the whole band, from low frequency to high frequency, can be analysed with the same signal-to-noise ratio;
s202: performing audio framing with a frame length of 25 ms and a frame shift of 10 ms; framing and windowing yield the smoothed frames of the input voice;
s203: carrying out FFT (fast Fourier transform) on the preprocessed unknown voice signal, namely carrying out FFT on each smooth frame of the voice;
s204: passing the unknown voice signal after FFT through a Mel filter bank to obtain Mel frequency spectrum S (m);
s205: applying logarithmic energy processing to the Mel spectrum S(m) to obtain the logarithmic spectrum L(m), i.e.
L(m) = ln(S(m))
S206: performing a DCT transform on the logarithmic spectrum, taking the first 13 cepstral coefficients together with their first-order and second-order differences, and combining them with the frame energy to form 40-dimensional features that serve as the feature parameters of the DNN model.
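Steps S201-S206 can be sketched end-to-end in NumPy. This is an illustrative reconstruction, not the patent's implementation: the 16 kHz sample rate, 512-point FFT, 24 mel filters, and 0.97 pre-emphasis coefficient are assumptions, and the delta/energy assembly into 40 dimensions is omitted for brevity:

```python
import numpy as np

def mfcc_13(signal, sr=16000, n_mels=24, n_fft=512):
    """Pre-emphasis, 25 ms frames with 10 ms shift, Hamming window, FFT,
    mel filterbank, log, DCT -> first 13 cepstral coefficients per frame."""
    # S201: pre-emphasis boosts the high-frequency part of the speech
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # S202: framing, 25 ms frames shifted by 10 ms, with a Hamming window
    flen, fstep = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # S203: FFT power spectrum of each smoothed frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # S204: triangular mel filterbank -> mel spectrum S(m)
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = power @ fbank.T

    # S205: log energy -> logarithmic spectrum L(m)
    log_spec = np.log(np.maximum(mel_spec, 1e-10))

    # S206: DCT-II over the filter axis, keep the first 13 coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(13), (2 * k + 1) / (2 * n_mels)))
    return log_spec @ dct.T

# one second of random "speech" at 16 kHz -> 98 frames of 13 MFCCs
feats = mfcc_13(np.random.default_rng(0).standard_normal(16000))
```

Appending first- and second-order differences of the 13 coefficients plus the frame energy would give the 40-dimensional vectors described above.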
Specifically, in step S20 a DNN acoustic model is constructed whose overall architecture comprises several frame-level TDNN layers, a statistics pooling layer that aggregates at the segment level, two sentence-level fully connected layers, and a final softmax output layer; the first five layers of the network operate at the frame level and use a time-delay neural network, as follows:
s211: a designated person reads a section of Chinese text within time t, and takes the section of voice as a voice signal to preprocess the voice signal;
s212: assuming that t is the current time step, firstly splicing the frames at (t-2, t-1, t, t +1, t +2) together as network input;
s213: the next two layers respectively splice the outputs of the previous layer at (t-2, t, t+2) and (t-3, t, t+3); the fourth and fifth layers also operate at the frame level but add no further temporal context.
Structurally, each TDNN layer is still a convolutional layer; the difference is that the input of each layer splices historical, current, and future features, which introduces temporal information. Compared with an LSTM, the TDNN architecture can parallelize training; compared with a plain CNN, it adds temporal context information.
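The context splicing of steps S212-S213 amounts to gathering features at fixed time offsets around each frame and concatenating them. A minimal sketch; clamping edge frames to the segment boundary is an illustrative assumption the patent does not specify:

```python
import numpy as np

def splice(feats, offsets):
    """Concatenate the feature vectors at the given time offsets around
    each frame t, mimicking TDNN context splicing (edges clamped)."""
    T = len(feats)
    cols = [feats[np.clip(np.arange(T) + o, 0, T - 1)] for o in offsets]
    return np.concatenate(cols, axis=1)

frames = np.arange(10, dtype=float).reshape(10, 1)  # 10 frames, 1-dim feature
layer1_in = splice(frames, (-2, -1, 0, 1, 2))       # {t-2, t-1, t, t+1, t+2}
layer2_in = splice(frames, (-2, 0, 2))              # {t-2, t, t+2}
layer3_in = splice(frames, (-3, 0, 3))              # {t-3, t, t+3}
```

Stacking these layers widens the effective temporal receptive field of each frame-level output without recurrent connections, which is why training parallelizes.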
Specifically, the statistics pooling layer receives the final frame-level output as input, aggregates it over the input segment by computing the mean and standard deviation of the TDNN network output, concatenates these segment-level statistics, and passes them to the two sentence-level fully connected layers; this yields a sentence-level feature expression, and each fully connected layer has dimension 512 (either one can be used to extract the embedding feature vector). The acoustic model ends with a softmax output layer whose number of nodes equals the number of speakers in the training set.
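The statistics pooling described above can be sketched as a mean/standard-deviation concatenation over the frame axis; the 64-dimensional frame output below is an arbitrary illustrative size, not the patent's:

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Aggregate frame-level TDNN outputs over a whole segment: the mean
    and standard deviation are concatenated into one segment-level vector,
    which then feeds the sentence-level fully connected layers."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])  # shape (2 * feature_dim,)

# e.g. 200 frames of 64-dim TDNN output -> one 128-dim segment statistic
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 64))
pooled = statistics_pooling(x)
```

Because the pooled vector has a fixed size regardless of the number of frames, the layers after it can process variable-length utterances uniformly.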
Specifically, the spectral clustering in step S50 includes the following steps:
s501: constructing a similarity matrix S from the embedding feature vectors, where S_ij is the cosine similarity between the embedding vectors of the i-th and j-th voice segments, and the diagonal elements (i = j) are set to 0;
s502: the laplacian matrix L is calculated and normalization is performed:
L = D - S
L_norm = D^(-1) L
wherein D is a diagonal matrix with
D_ii = Σ_j S_ij
s503: calculating the eigenvalues and eigenvectors of L_norm;
s504: calculating the number of clusters k: in the similarity graph represented by L_norm, the number of connected components equals the algebraic multiplicity of the eigenvalue 0, so a threshold β is set and the number of eigenvalues below β is counted as k;
s505: taking the k smallest eigenvalues λ1, λ2, ..., λk of L_norm and their corresponding eigenvectors p1, p2, ..., pk, and constructing the matrix P ∈ R^(n×k) with p1, p2, ..., pk as columns;
S506: clustering the row vectors y1, y2, ..., yn of P with the k-means algorithm.
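Steps S501-S506 can be sketched as follows. The threshold β = 0.5, the farthest-point k-means initialization, and the toy two-cluster embedding data are illustrative assumptions, not values from the patent:

```python
import numpy as np

def spectral_cluster(emb, beta=0.5, iters=20):
    """Spectral clustering of embedding vectors via the normalized
    Laplacian, with the cluster count k inferred from small eigenvalues."""
    # S501: cosine similarity matrix S with diagonal (i = j) set to 0
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = unit @ unit.T
    np.fill_diagonal(S, 0.0)

    # S502: L = D - S, L_norm = D^(-1) L, with D_ii = sum_j S_ij
    d = S.sum(axis=1)
    L_norm = np.diag(1.0 / d) @ (np.diag(d) - S)

    # S503/S504: eigen-decompose; k = count of eigenvalues below beta
    vals, vecs = np.linalg.eig(L_norm)
    vals, vecs = vals.real, vecs.real
    order = np.argsort(vals)
    k = int(np.sum(vals < beta))

    # S505: P in R^{n x k} from the eigenvectors of the k smallest eigenvalues
    P = vecs[:, order[:k]]

    # S506: k-means on the rows of P (farthest-point init, Lloyd iterations)
    centers = P[[0]]
    for _ in range(k - 1):
        dist = ((P[:, None] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, P[np.argmax(dist)]])
    for _ in range(iters):
        labels = ((P[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.vstack([P[labels == c].mean(axis=0) for c in range(k)])
    return labels

# toy data: two tight "timbre" clusters pointing in different directions
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.05, (10, 8)); a[:, 0] += 1.0
b = rng.normal(0.0, 0.05, (10, 8)); b[:, 1] += 1.0
labels = spectral_cluster(np.vstack([a, b]))
```

On this toy input the two near-zero eigenvalues of L_norm give k = 2, and the rows of P separate cleanly, so k-means recovers the two groups.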
In conclusion, without any manual input, the invention constructs a multi-class cross entropy objective function and applies deep learning in the feature extraction module: a deep neural network learns the timbre characterization embedding, mapping timbre information into a vector space in which the embeddings of voices with similar timbres lie close together and the embeddings of voices with different timbres lie as far apart as possible. Spectral clustering, which is better suited to high-dimensional feature clustering, is then applied to the embedding feature vectors to complete the data-driven timbre clustering, thereby improving the matching rate for a specific person's voice and the accuracy of speaker-specific voice recognition.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (6)

1. A DNN-based text-independent timbre clustering method, characterized by comprising the following steps:
step S10: acquiring a large amount of speaker voice data as a training data set;
step S20: constructing an acoustic recognition model, the overall structure of which is a feedforward DNN that computes embedding features from variable-length voice data, the computation being divided into acoustic feature extraction and DNN acoustic model construction;
step S30: training the acoustic model, with a multi-class cross entropy objective function constructed as follows:
E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_nk ln P(spkr_k | x(n)_1, ..., x(n)_T)
training the network with the multi-class cross entropy objective function to classify the training speakers, wherein the training optimization assumes that each speaker's timbre forms one class, so that the K speakers in the N training segments give K timbre classes; P(spkr_k | x(n)_1, ..., x(n)_T) is defined as the probability that the timbre class is k given the T input frames x(n)_1, x(n)_2, ..., x(n)_T of segment n, and d_nk is 1 if the speaker label of the speech segment n is k and 0 otherwise;
step S40: extracting timbre embedding feature vectors, wherein the network is trained to produce embedding vectors that carry as much speaker timbre information as possible and generalize to speakers (timbres) not present in the training data, the embedding features being used to capture the timbre of the whole utterance;
step S50: applying spectral clustering to the embedding feature vectors to cluster timbres, wherein, based on a graph clustering algorithm, each element S_ij of a similarity matrix S is regarded as the weight of the edge between nodes i and j in an undirected graph, and spectral clustering partitions the original graph into subgraphs by cutting weak edges with small weights.
2. The DNN-based text-independent timbre clustering method according to claim 1, wherein the acoustic feature extraction in step S20 comprises the following steps:
s201: receiving an unknown voice signal, and performing pre-emphasis, framing and windowing processing on the unknown voice signal;
s202: performing audio framing with a frame length of 25 ms and a frame shift of 10 ms;
s203: carrying out FFT (fast Fourier transform) on the preprocessed unknown voice signal;
s204: the unknown voice signal after FFT is passed through a Mel filter bank to obtain Mel frequency spectrum;
s205: carrying out logarithmic energy processing on the Mel frequency spectrum to obtain a logarithmic frequency spectrum;
s206: performing a DCT transform on the logarithmic spectrum, taking the first 13 cepstral coefficients together with their first-order and second-order differences, and combining them with the frame energy to form 40-dimensional features that serve as the feature parameters of the DNN model.
3. The DNN-based text-independent timbre clustering method according to claim 1, wherein the DNN acoustic model constructed in step S20 has an overall architecture comprising several frame-level TDNN layers, a statistics pooling layer that aggregates at the segment level, two sentence-level fully connected layers, and a final softmax output layer; the first five layers of the network operate at the frame level and use a time-delay neural network, as follows:
s211: a designated person reads a section of Chinese text within time t, and takes the section of voice as a voice signal to preprocess the voice signal;
s212: assuming that t is the current time step, firstly splicing the frames at (t-2, t-1, t, t +1, t +2) together as network input;
s213: the next two layers respectively splice the outputs of the previous layer at (t-2, t, t+2) and (t-3, t, t+3); the fourth and fifth layers also operate at the frame level but add no further temporal context.
4. The DNN-based text-independent timbre clustering method according to claim 3, wherein the statistics pooling layer receives the final frame-level output as input, aggregates it over the input segment by computing the mean and standard deviation of the TDNN network output, concatenates these segment-level statistics, and passes them to the two sentence-level fully connected layers.
5. The DNN-based text-independent timbre clustering method according to claim 1, wherein in step S30 the acoustic model's data set is cleaned by removing any audio shorter than 4 seconds, and the training data consists of hundreds of hours of speech from on the order of a thousand speakers.
6. The DNN-based text-independent timbre clustering method according to claim 1, wherein the spectral clustering in step S50 comprises the following steps:
s501: constructing a similarity matrix S from the embedding feature vectors, where S_ij is the cosine similarity between the embedding vectors of the i-th and j-th voice segments, and the diagonal elements (i = j) are set to 0;
s502: the laplacian matrix L is calculated and normalization is performed:
L = D - S
L_norm = D^(-1) L
wherein D is a diagonal matrix with
D_ii = Σ_j S_ij
s503: calculating the eigenvalues and eigenvectors of L_norm;
s504: calculating the number of clusters k: in the similarity graph represented by L_norm, the number of connected components equals the algebraic multiplicity of the eigenvalue 0, so a threshold β is set and the number of eigenvalues below β is counted as k;
s505: taking the k smallest eigenvalues λ1, λ2, ..., λk of L_norm and their corresponding eigenvectors p1, p2, ..., pk, and constructing the matrix P ∈ R^(n×k) with p1, p2, ..., pk as columns;
s506: clustering the row vectors y1, y2, ..., yn of P with the k-means algorithm.
CN202210634114.6A 2022-06-06 2022-06-06 DNN-based text-independent timbre clustering method Withdrawn CN115083433A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210634114.6A CN115083433A (en) 2022-06-06 2022-06-06 DNN-based text-independent timbre clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210634114.6A CN115083433A (en) 2022-06-06 2022-06-06 DNN-based text-independent timbre clustering method

Publications (1)

Publication Number Publication Date
CN115083433A (en) 2022-09-20

Family

ID=83251980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210634114.6A DNN-based text-independent timbre clustering method 2022-06-06 2022-06-06

Country Status (1)

Country Link
CN (1) CN115083433A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727333A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Biological diversity monitoring method and system based on acoustic recognition
CN117727333B (en) * 2024-02-18 2024-04-23 百鸟数据科技(北京)有限责任公司 Biological diversity monitoring method and system based on acoustic recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220920