CN114360551A - Gender and language-based speaker identification method and system - Google Patents


Info

Publication number
CN114360551A
CN114360551A
Authority
CN
China
Prior art keywords
speaker
recognition
voice
layer
audio
Prior art date
Legal status
Pending
Application number
CN202210014706.8A
Other languages
Chinese (zh)
Inventor
徐文渊
冀晓宇
程雨诗
高逸卓
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210014706.8A priority Critical patent/CN114360551A/en
Publication of CN114360551A publication Critical patent/CN114360551A/en
Pending legal-status Critical Current


Abstract

The invention discloses a gender- and language-based speaker recognition method and system, belonging to the field of speaker recognition. The method comprises the following steps: acquiring voice data to be recognized, specifically an audio file containing valid speaker audio; performing noise reduction on the audio file to obtain low-noise speech audio; extracting SMAC features from the denoised speech audio to obtain a speech spectral feature map; inputting the speech spectral feature map into a ResNet model to obtain a voice feature vector; inputting the voice feature vector into a multi-objective learning model to recognize the speaker's identity, the speaker's gender, and the language used by the speaker; and performing weighted fusion of the three recognition results to obtain the speaker recognition result corresponding to the voice data to be recognized. The invention comprehensively utilizes the gender information and language information in speech, effectively improves the robustness of speaker recognition, and achieves high recognition accuracy especially when the speaker's voice changes.

Description

Gender and language-based speaker identification method and system
Technical Field
The invention relates to the field of speaker identification, in particular to a speaker identification method and system based on gender and language.
Background
With the continuous development of artificial intelligence, more and more intelligent recognition technologies are being applied in daily life, including face recognition, fingerprint recognition, and, emerging in recent years, voiceprint recognition. Voiceprint recognition, also known as speaker recognition, determines which speaker a piece of audio belongs to by analyzing its content. Speaker recognition can be used for identity authentication and has attracted widespread attention because of its convenience.
Most existing speaker recognition methods consider only a single factor, namely the speaker's identity. Such methods require the speaker to maintain a similar speaking style in both the voiceprint enrollment and voiceprint verification stages, so recognition accuracy drops when the speaker uses a different tone of voice.
Disclosure of Invention
The invention provides a gender- and language-based speaker recognition method and system that recognizes a speaker by combining the gender information and language information contained in the speech content, solving the technical problem that single-factor recognition methods lose accuracy when the speaker's tone changes.
To achieve this purpose, the invention adopts the following technical solutions:
a first object of the present invention is to provide a gender and language based speaker recognition method, the method comprising:
acquiring voice data to be recognized, wherein the voice data is an audio file in wav format containing effective speaker audio;
carrying out noise reduction processing on the audio file to obtain low-noise voice audio;
extracting voice audio frequency through SMAC characteristics to obtain a voice frequency spectrum characteristic diagram;
inputting the voice frequency spectrum characteristic diagram into a ResNet model to obtain a voice characteristic vector;
inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
and performing weighted fusion on the identified speaker identity, the speaker gender and the language information used by the speaker to obtain a speaker identification result corresponding to the voice data to be identified.
Further, the multi-objective learning model comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers, three hidden layers, and a fusion layer;
the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task;
the fusion layer fuses the output results of the three recognition tasks; the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
A second object of the present invention is to provide a gender- and language-based speaker recognition system for implementing the above speaker recognition method; the system comprises:
the voice acquisition module is used for acquiring voice audio data of a speaker;
the audio filtering module is used for filtering the collected voice audio data and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the voice audio data after the filtering processing;
and the recognition result display module is used for carrying out visualization processing on the recognition result.
The beneficial effects of the invention are as follows: the invention comprehensively utilizes the gender information and language information in speech, effectively improves the robustness of speaker recognition, and achieves high recognition accuracy especially when the speaker's voice changes.
Drawings
FIG. 1 is a block diagram of the gender- and language-based speaker recognition method according to the present invention.
FIG. 2 is a schematic diagram of the speaker recognition framework in an example of the present invention.
FIG. 3 is a schematic diagram of the gender- and language-based speaker recognition system according to the present invention.
Detailed Description
The technical framework of the invention is explained below with reference to the accompanying drawings.
Most existing speaker recognition methods consider only a single factor, namely the speaker's identity. Such methods require the speaker to maintain a similar speaking style in both the voiceprint enrollment and voiceprint verification stages, so recognition accuracy drops when the speaker uses a different tone of voice.
To solve the technical problem that speaker recognition in the prior art has low robustness because most methods identify speakers based on a single factor, embodiments of the present invention provide a gender- and language-based speaker recognition method and system.
The technical solutions provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A gender- and language-based speaker recognition method, as shown in FIG. 1, comprises:
step S101, acquiring data of the voice to be recognized.
The voice data is an audio file in wav format containing effective speaker audio.
Step S102: perform noise reduction on the audio file to obtain low-noise speech audio, and apply spectral conversion to the denoised audio to obtain a speech spectral feature map.
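The patent does not fix a specific noise-reduction algorithm at this step; claim 7 mentions cutting off signal components below a frequency threshold, so the sketch below assumes a simple Butterworth low-cut (high-pass) filter. The function name and the 80 Hz cutoff are illustrative choices, not values from the patent.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def denoise_low_cut(wav_path: str, cutoff_hz: float = 80.0):
    """Load a wav file and suppress low-frequency noise below cutoff_hz."""
    rate, audio = wavfile.read(wav_path)
    audio = audio.astype(np.float64)
    if audio.ndim > 1:                       # mix stereo down to mono
        audio = audio.mean(axis=1)
    # 4th-order Butterworth high-pass filter, applied forward and backward
    sos = butter(4, cutoff_hz, btype="highpass", fs=rate, output="sos")
    return rate, sosfiltfilt(sos, audio)
```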
step S103, inputting the voice frequency spectrum characteristic diagram into a ResNet model to obtain a voice characteristic vector;
step S104, inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
with respect to step S104, in one example, as shown in FIG. 2, the model framework of the multi-objective learning model includes a plurality of recognition tasks: speaker recognition (primary task), gender recognition, and language recognition, by introducing multiple secondary recognition factors, improve the accuracy of speaker recognition. In addition, the framework comprises a group of sharing layers, wherein parameters in the sharing layers are common to a plurality of recognition tasks, and each recognition task can optimize the parameters of the sharing layers under the model in the training process. The framework comprises a plurality of hidden layers specific to the tasks, wherein the hidden layers are specific to each recognition task and are embodied in the training process, and only the results of the corresponding recognition tasks can influence the parameters of the hidden layers.
Step S105: perform weighted fusion of the recognized speaker identity, speaker gender, and language information to obtain the speaker recognition result for the voice to be recognized. In this embodiment, this is realized by introducing a fusion layer after the three hidden layers of the multi-objective learning model; the fusion layer fuses the output results of the three recognition tasks, the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
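As a concrete illustration of this architecture, the PyTorch sketch below wires N shared layers into three task-specific heads and a fusion step with trainable weights. It is a minimal sketch under stated assumptions, not the patent's implementation: the layer sizes are arbitrary, and since the patent does not spell out how the gender and language outputs are mapped back onto speakers, each enrolled speaker is assumed to have a registered gender and language index so that the auxiliary scores can be gathered per speaker.

```python
import torch
import torch.nn as nn

class MultiTaskSpeakerNet(nn.Module):
    def __init__(self, feat_dim, hidden, n_speakers, n_langs,
                 speaker_gender, speaker_lang, n_shared=2):
        super().__init__()
        layers = []
        for i in range(n_shared):                      # N shared layers
            layers += [nn.Linear(feat_dim if i == 0 else hidden, hidden),
                       nn.ReLU()]
        self.shared = nn.Sequential(*layers)
        self.id_head = nn.Linear(hidden, n_speakers)   # speaker identity task
        self.gender_head = nn.Linear(hidden, 2)        # gender task
        self.lang_head = nn.Linear(hidden, n_langs)    # language task
        self.fusion_w = nn.Parameter(torch.ones(3))    # trainable fusion weights
        # registered gender / language index per enrolled speaker (LongTensors)
        self.register_buffer("spk_gender", speaker_gender)
        self.register_buffer("spk_lang", speaker_lang)

    def forward(self, x):
        h = self.shared(x)
        id_p = self.id_head(h).softmax(-1)             # (B, n_speakers)
        g_p = self.gender_head(h).softmax(-1)          # (B, 2)
        l_p = self.lang_head(h).softmax(-1)            # (B, n_langs)
        # score each enrolled speaker by how well the predicted gender and
        # language match that speaker's registered attributes
        g_match = g_p[:, self.spk_gender]              # (B, n_speakers)
        l_match = l_p[:, self.spk_lang]                # (B, n_speakers)
        w = self.fusion_w.softmax(0)
        fused = w[0] * id_p + w[1] * g_match + w[2] * l_match
        return fused, (id_p, g_p, l_p)
```

During training, each of the three heads would receive its own loss, so that all three tasks update the shared layers while each head is updated only by its own task, matching the parameter-sharing behavior described above.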
An optimal combination of weight coefficients is selected when fusing the three recognition results: during model training, the accuracy of different weight-coefficient combinations is tested, and the combination with the highest accuracy is selected as the final weights, as sketched below. The weight coefficients are influenced by the discriminative power of each recognition task and by the relationship of the subtasks to the primary task.
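A sketch of this selection step, assuming a simple grid search: candidate weight combinations are scored on held-out validation data and the most accurate one is kept. The candidate grid and the score_fn callable (which should fuse the three task outputs under a given weight triple) are illustrative assumptions, not details from the patent.

```python
import itertools
import torch

def select_fusion_weights(score_fn, val_feats, val_labels,
                          grid=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Try every weight triple from the grid; keep the most accurate one."""
    best_acc, best_w = -1.0, None
    for w in itertools.product(grid, repeat=3):
        fused = score_fn(val_feats, torch.tensor(w))   # (B, n_speakers) scores
        acc = (fused.argmax(-1) == val_labels).float().mean().item()
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc
```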
In one specific implementation of the present invention, the speech spectral feature map is formed from the SMAC features of the speech. The SMAC feature extraction method comprises:
processing the speech audio through a set of filters:
H_q(ω): the qth band filter, with bandwidth parameter α and center frequency ω_q (the filter formula is given as an image in the original)
X_q(ω, t) = X(ω, t) · H_q(ω),  q = 1, 2, …, Q
where ω is the frequency variable of the spectrum, t is the frame index, X(ω, t) is the spectral intensity of frame t at frequency ω, H_q(ω) denotes the qth filter, α denotes a parameter controlling the filter bandwidth, ω_q is the center frequency of the qth filter, Q is the number of filters, and X_q(ω, t) denotes the filtering result of the qth filter.
Then the 0th-order and 1st-order central moments of the filtering result are calculated:
M_m(q, t): the mth-order central moment of the filtering result of the qth filter at frame t (the moment formula is given as an image in the original), where m denotes the order of the central moment.
Finally, the ratio of the 1st-order central moment to the 0th-order central moment is taken as the speech spectral feature:
R_1(q, t) = M_1(q, t) / M_0(q, t)
where R_1(q, t) denotes the qth speech spectral feature; the Q speech spectral features together form the speech spectral feature map.
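Below is a numpy sketch of this SMAC extraction pipeline. The filter H_q(ω) and the moment formula are given only as images in the original, so a Gaussian band filter exp(-α(ω - ω_q)^2) and raw spectral moments summed over ω are assumed here; the number of filters, α, and the STFT settings are likewise illustrative values.

```python
import numpy as np
from scipy.signal import stft

def smac_features(audio, rate, n_filters=40, alpha=1e-5, n_fft=512):
    """Return an assumed (Q, n_frames) SMAC-style speech spectral feature map."""
    freqs, _, X = stft(audio, fs=rate, nperseg=n_fft)   # X: (n_freqs, n_frames)
    mag = np.abs(X)
    centers = np.linspace(freqs[1], freqs[-1], n_filters)
    feats = np.empty((n_filters, mag.shape[1]))
    for q, w_q in enumerate(centers):
        H = np.exp(-alpha * (freqs - w_q) ** 2)         # assumed Gaussian filter
        Xq = mag * H[:, None]                           # X_q(w, t) = X(w, t) H_q(w)
        m0 = Xq.sum(axis=0)                             # 0th-order moment M_0(q, t)
        m1 = (freqs[:, None] * Xq).sum(axis=0)          # 1st-order moment M_1(q, t)
        feats[q] = m1 / np.maximum(m0, 1e-12)           # R_1(q, t) = M_1 / M_0
    return feats
```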
Unlike general speaker recognition methods, the invention is optimized for speaker recognition under pitch variation: it uses the SMAC feature rather than the common MFCC feature, and the SMAC feature's robustness to pitch satisfies the requirement of speaker identity authentication in scenarios where the pitch changes.
A gender- and language-based speaker recognition system, as shown in FIG. 3, comprises:
the voice acquisition module is used for acquiring the audio data of the speaker;
the audio filtering module is used for filtering the collected sound audio and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the audio frequency after the filtering processing; the module comprises:
the audio spectrum conversion module is used for performing spectral analysis on the input audio, extracting SMAC features from the original input audio to obtain a speech spectral feature map;
the spectral feature extraction module is used for extracting the feature vector of the speech spectral feature map: it takes the speech spectral feature map as input and extracts the feature vector through a deep network;
and the multi-objective learning model module, which comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers, three hidden layers, and a fusion layer; the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task; the fusion layer fuses the output results of the three recognition tasks, the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
And the recognition result display module is used for visualizing the recognition result. This module comprises:
a voice prompt module, which plays the recognition result through voice and plays an alarm sound if the speaker is not in the list of enrolled speakers;
and a text display module, which displays the recognized speaker information as text; if a problem occurs during the recognition process, an error is displayed on this module.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are given by way of example only and are not limiting of the invention. The objects of the present invention have been fully and effectively accomplished. The functional and structural principles of the present invention have been shown and described in the examples, and any variations or modifications of the embodiments of the present invention may be made without departing from the principles.

Claims (8)

1. A gender- and language-based speaker recognition method, comprising:
acquiring voice data to be recognized, the voice data being an audio file containing valid speaker audio;
performing noise reduction on the audio file to obtain low-noise speech audio;
extracting SMAC features from the denoised speech audio to obtain a speech spectral feature map;
inputting the speech spectral feature map into a ResNet model to obtain a voice feature vector;
inputting the voice feature vector into a multi-objective learning model to recognize the speaker's identity, the speaker's gender, and the language used by the speaker;
and performing weighted fusion of the recognized speaker identity, speaker gender, and language information to obtain the speaker recognition result corresponding to the voice data to be recognized.
2. The method of claim 1, wherein the SMAC feature extraction method comprises:
processing the speech audio through a set of filters:
H_q(ω): the qth band filter, with bandwidth parameter α and center frequency ω_q (the filter formula is given as an image in the original)
X_q(ω, t) = X(ω, t) · H_q(ω),  q = 1, 2, …, Q
where ω is the frequency variable of the spectrum, t is the frame index, X(ω, t) is the spectral intensity of frame t at frequency ω, H_q(ω) denotes the qth filter, α denotes a parameter controlling the filter bandwidth, ω_q is the center frequency of the qth filter, Q is the number of filters, and X_q(ω, t) denotes the filtering result of the qth filter;
calculating the 0th-order and 1st-order central moments of the filtering result:
M_m(q, t): the mth-order central moment of the filtering result of the qth filter at frame t (the moment formula is given as an image in the original), where m denotes the order of the central moment;
and taking the ratio of the 1st-order central moment to the 0th-order central moment as the speech spectral feature:
R_1(q, t) = M_1(q, t) / M_0(q, t)
where R_1(q, t) denotes the qth speech spectral feature, and the Q speech spectral features form the speech spectral feature map.
3. The gender- and language-based speaker recognition method of claim 1, wherein the multi-objective learning model comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers and three hidden layers;
the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task.
4. The gender- and language-based speaker recognition method of claim 3, wherein the multi-objective learning model further comprises a fusion layer for fusing the output results of the three recognition tasks; the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
5. A gender and language based speaker recognition system for implementing the speaker recognition method of claim 1, the speaker recognition system comprising:
the voice acquisition module is used for acquiring voice audio data of a speaker;
the audio filtering module is used for filtering the collected voice audio data and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the voice audio data after the filtering processing;
and the recognition result display module is used for carrying out visualization processing on the recognition result.
6. The system of claim 5, wherein the speaker recognition module comprises:
the audio spectrum conversion module is used for extracting SMAC features from the speech audio and converting them to obtain a speech spectral feature map;
the spectral feature extraction module is used for extracting the feature vector of the speech spectral feature map;
and the multi-objective learning model module, which comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers, three hidden layers, and a fusion layer; the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task; the fusion layer fuses the output results of the three recognition tasks, the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
7. The system according to claim 5, wherein the audio filtering module removes noise by cutting off all signal components below a frequency threshold (low-frequency cutting).
8. The gender- and language-based speaker recognition system of claim 5, wherein the recognition result display module comprises:
a voice prompt module, which plays the recognition result through voice and plays an alarm sound if the speaker is not in the list of enrolled speakers;
and a text display module, which displays the recognized speaker information as text and displays an error if a problem occurs during the recognition process.
CN202210014706.8A 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system Pending CN114360551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210014706.8A CN114360551A (en) 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210014706.8A CN114360551A (en) 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system

Publications (1)

Publication Number Publication Date
CN114360551A true CN114360551A (en) 2022-04-15

Family

ID=81107786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210014706.8A Pending CN114360551A (en) 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system

Country Status (1)

Country Link
CN (1) CN114360551A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913278A (en) * 2023-09-12 2023-10-20 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN116913278B (en) * 2023-09-12 2023-11-17 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination