CN114360551A - Gender and language-based speaker identification method and system - Google Patents
- Publication number
- CN114360551A (application number CN202210014706.8A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- recognition
- voice
- layer
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a gender- and language-based speaker recognition method and system, belonging to the field of speaker identification. The method comprises the following steps: acquiring voice data to be recognized, specifically an audio file containing valid speaker audio; performing noise reduction on the audio file to obtain low-noise voice audio; performing SMAC feature extraction on the noise-reduced voice audio to obtain a speech spectrum feature map; inputting the speech spectrum feature map into a ResNet model to obtain a speech feature vector; inputting the speech feature vector into a multi-target learning model to recognize the speaker identity, the speaker gender, and the language used by the speaker; and performing weighted fusion of the three recognition results to obtain the speaker recognition result corresponding to the voice data to be recognized. The invention comprehensively exploits the gender and language information in speech, effectively improves the robustness of speaker recognition, and achieves high recognition accuracy especially when the speaker's voice changes.
Description
Technical Field
The invention relates to the field of speaker identification, in particular to a speaker identification method and system based on gender and language.
Background
With the continuous development of artificial intelligence, more and more intelligent recognition technologies are applied in daily life, including face recognition, fingerprint recognition, and, emerging in recent years, voiceprint recognition. Voiceprint recognition, also known as speaker recognition, analyzes a piece of audio to determine which speaker it belongs to. Because of its convenience, speaker recognition has attracted widespread attention for identity authentication.
Most speaker recognition methods in the prior art consider only a single factor, the speaker's identity, and require the speaker to maintain a similar speaking style in both the voiceprint enrollment and voiceprint recognition stages; recognition accuracy therefore drops when the speaker uses a different tone of voice.
Disclosure of Invention
The invention provides a gender- and language-based speaker recognition method and system that recognizes a speaker by combining the gender and language information contained in the speech, solving the technical problem that single-factor recognition methods lose accuracy when the speaking tone changes.
In order to achieve the purpose, the invention adopts the following technical scheme:
a first object of the present invention is to provide a gender and language based speaker recognition method, the method comprising:
acquiring voice data to be recognized, wherein the voice data is an audio file in wav format containing effective speaker audio;
carrying out noise reduction processing on the audio file to obtain low-noise voice audio;
performing SMAC feature extraction on the noise-reduced voice audio to obtain a speech spectrum feature map;
inputting the voice frequency spectrum characteristic diagram into a ResNet model to obtain a voice characteristic vector;
inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
and performing weighted fusion on the identified speaker identity, the speaker gender and the language information used by the speaker to obtain a speaker identification result corresponding to the voice data to be identified.
Further, the multi-target learning model covers three recognition tasks: speaker identity recognition, speaker gender recognition, and spoken-language recognition; the model consists of N shared layers, three task-specific hidden layers, and a fusion layer;
the N layers of sharing layers are connected in sequence, and in the training process, the parameters of the sharing layers are influenced by the recognition results of the three tasks; the input of the three hidden layers is respectively connected with the output of the Nth layer sharing layer, the output of the three hidden layers is respectively the identity of the speaker, the gender of the speaker and the recognition result of the language information used by the speaker, and in the training process, the parameters of the hidden layers are only influenced by the corresponding recognition task;
the fusion layer is used for fusing the output results of the three recognition tasks, the output result of each recognition task is provided with a trainable weight parameter, and the fusion layer takes the weighting results of the three recognition tasks as a final recognition result.
The second objective of the present invention is to provide a speaker recognition system based on gender and language, which is used for implementing the speaker recognition method; the system comprises:
the voice acquisition module is used for acquiring voice audio data of a speaker;
the audio filtering module is used for filtering the collected voice audio data and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the voice audio data after the filtering processing;
and the recognition result display module is used for carrying out visualization processing on the recognition result.
The invention has the following beneficial effects: it comprehensively exploits the gender and language information in speech, effectively improves the robustness of speaker recognition, and achieves high recognition accuracy especially when the speaker's voice changes.
Drawings
FIG. 1 is a block diagram of a gender and language based speaker recognition method and system according to the present invention.
FIG. 2 is a schematic diagram of a speaker recognition framework according to an example of the present invention.
FIG. 3 is a schematic diagram of the gender- and language-based speaker recognition system according to the present invention.
Detailed Description
The technical framework of the invention is explained below with reference to the accompanying drawings.
Most speaker recognition methods in the prior art consider only a single factor, the speaker's identity, and require the speaker to maintain a similar speaking style in both the voiceprint enrollment and voiceprint recognition stages; recognition accuracy therefore drops when the speaker uses a different tone of voice.
To solve the technical problem that speaker recognition in the prior art has low robustness because most methods identify the speaker from a single factor, embodiments of the present invention provide a gender- and language-based speaker recognition method and system.
The technical solutions provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A gender- and language-based speaker recognition method, as shown in FIG. 1, comprising:
step S101, acquiring data of the voice to be recognized.
The voice data is an audio file in wav format containing effective speaker audio.
Step S102, performing noise reduction on the audio file to obtain low-noise voice audio, and performing spectrum conversion on the noise-reduced audio to obtain a speech spectrum feature map.
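As a concrete illustration of steps S101-S102, the sketch below loads a wav file, suppresses low-frequency noise, and converts the result to the magnitude spectrum X(ω,t) later consumed by the SMAC extraction. The patent does not specify the denoising algorithm; the Butterworth high-pass filter, the 80 Hz cutoff, and the frame sizes are assumptions chosen only to make the sketch runnable.

```python
# A minimal sketch of steps S101-S102, assuming scipy/numpy are available.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt, stft

def load_and_denoise(path, cutoff_hz=80.0):
    sr, audio = wavfile.read(path)            # wav file with valid speaker audio
    audio = audio.astype(np.float64)
    if audio.ndim > 1:                        # mix down to mono if needed
        audio = audio.mean(axis=1)
    # Stand-in denoiser: cut low-frequency noise below a threshold frequency.
    sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sr, sosfilt(sos, audio)            # low-noise voice audio

def spectrum(audio, sr, frame_ms=25, hop_ms=10):
    nperseg = int(sr * frame_ms / 1000)
    noverlap = nperseg - int(sr * hop_ms / 1000)
    freqs, times, X = stft(audio, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return freqs, np.abs(X)                   # |X(omega, t)|: one column per frame
```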
Step S103, inputting the speech spectrum feature map into a ResNet model to obtain a speech feature vector.
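By way of a non-authoritative sketch of step S103, the PyTorch snippet below adapts a ResNet backbone to single-channel spectrum feature maps of shape (batch, 1, filters, frames). The patent does not name a ResNet variant or an embedding size; resnet18 and the 256-dimensional output are illustrative assumptions.

```python
# A hedged sketch of step S103, assuming torch/torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SpectrumEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)
        # Spectrum feature maps have one channel, not three RGB channels.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Replace the classification head with an embedding projection.
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, spec):                  # spec: (batch, 1, freq, time)
        return self.backbone(spec)            # speech feature vector

features = SpectrumEncoder()(torch.randn(4, 1, 64, 300))  # -> (4, 256)
```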
step S104, inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
With respect to step S104, in one example, as shown in FIG. 2, the multi-target learning model comprises multiple recognition tasks: speaker recognition (the primary task), gender recognition, and language recognition; introducing the auxiliary recognition factors improves the accuracy of speaker recognition. The framework contains a group of shared layers whose parameters are common to all recognition tasks; during training, every recognition task can optimize the shared-layer parameters. The framework also contains task-specific hidden layers, one per recognition task; during training, only the results of the corresponding recognition task influence a hidden layer's parameters.
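A minimal sketch of this shared/task-specific structure, assuming PyTorch; the layer widths, N=2 shared layers, and the speaker/language class counts are placeholders, not values from the patent.

```python
# A hedged sketch of the multi-target model in FIG. 2.
import torch
import torch.nn as nn

class MultiTaskSpeakerNet(nn.Module):
    def __init__(self, in_dim=256, hidden=128, n_shared=2,
                 n_speakers=100, n_langs=5):
        super().__init__()
        shared = []
        for i in range(n_shared):             # N shared layers, connected in sequence
            shared += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        self.shared = nn.Sequential(*shared)  # updated by all three task losses
        # Task-specific heads; only the corresponding task's loss reaches each one.
        self.speaker_head = nn.Linear(hidden, n_speakers)
        self.gender_head = nn.Linear(hidden, 2)
        self.language_head = nn.Linear(hidden, n_langs)

    def forward(self, feat):                  # feat: speech feature vector
        h = self.shared(feat)
        return self.speaker_head(h), self.gender_head(h), self.language_head(h)
```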
Step S105, performing weighted fusion of the recognized speaker identity, speaker gender, and spoken-language information to obtain the speaker recognition result for the voice to be recognized. In this embodiment, this is realized by introducing a fusion layer after the three hidden layers of the multi-target learning model; the fusion layer fuses the outputs of the three recognition tasks, each task output carries a trainable weight parameter, and the weighted combination of the three task outputs is taken as the final recognition result.
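The sketch below illustrates one way such a fusion layer could be realized. The patent only states that each task output carries a trainable weight and that the weighted result is the final recognition; mapping gender and language scores onto enrolled speakers through registration metadata is an assumption made here for concreteness.

```python
# A hedged sketch of step S105's fusion layer, assuming PyTorch.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, speaker_gender, speaker_lang):
        super().__init__()
        # One trainable weight per task (speaker, gender, language); starting
        # the auxiliary weights lower reflects speaker identity being primary.
        self.w = nn.Parameter(torch.tensor([1.0, 0.5, 0.5]))
        # Assumed enrollment metadata: gender id / language id per registrant.
        self.register_buffer("gender_of", speaker_gender)  # (n_speakers,)
        self.register_buffer("lang_of", speaker_lang)      # (n_speakers,)

    def forward(self, spk_logits, gender_logits, lang_logits):
        s = spk_logits.softmax(-1)                         # (batch, n_speakers)
        g = gender_logits.softmax(-1)[:, self.gender_of]   # gender score per speaker
        l = lang_logits.softmax(-1)[:, self.lang_of]       # language score per speaker
        score = self.w[0] * s + self.w[1] * g + self.w[2] * l
        # Return `score` instead of argmax when training through the weights.
        return score.argmax(-1)                            # fused speaker identity
```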
An optimal weight-coefficient combination is selected for fusing the three recognition results: during model training, the accuracy of different weight-coefficient combinations is tested, and the combination with the highest accuracy is chosen as the final weights. The weight coefficients are influenced by the discriminative power of each recognition task and by the relation of the subtasks to the main task.
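A small sketch of this selection procedure, assuming a held-out validation set: candidate weight triples are tried exhaustively and the most accurate combination is kept. The grid values are illustrative, not from the patent.

```python
# A hedged sketch of selecting the fusion weight coefficients.
import itertools
import numpy as np

def select_weights(spk_p, gen_p, lang_p, labels, grid=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """spk_p/gen_p/lang_p: per-speaker score matrices (n_samples, n_speakers),
    already mapped as in the fusion layer above; labels: true speaker ids."""
    best, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=3):
        pred = (w[0] * spk_p + w[1] * gen_p + w[2] * lang_p).argmax(axis=1)
        acc = float((pred == labels).mean())
        if acc > best_acc:                    # keep the most accurate combination
            best, best_acc = w, acc
    return best, best_acc
```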
In one specific implementation of the present invention, the speech spectrum feature map is formed by SMAC features of speech, and the SMAC feature extraction method includes:
the speech audio is processed through a filter:
X_q(ω,t) = X(ω,t)·H_q(ω),  q = 1, 2, ..., Q
where ω is the frequency variable of the spectrum, t is the frame index, and X(ω,t) is the spectral intensity at frequency ω in the tth frame; H_q(ω) denotes the qth filter, α denotes a parameter controlling the filter bandwidth, ω_q is the center frequency of the qth filter, Q is the number of filters, and X_q(ω,t) is the filtering result of the qth filter.
The 0th-order and 1st-order spectral moments of the filtering result are then computed:
M_m(q,t) = ∫ ω^m · X_q(ω,t) dω,  m = 0, 1
where m is the order of the moment and M_m(q,t) is the mth-order moment of the filtering result.
The ratio of the 1st-order moment to the 0th-order moment is taken as the speech spectral feature:
R_1(q,t) = M_1(q,t) / M_0(q,t)
where R_1(q,t) is the qth speech spectral feature; the Q speech spectral features together form the speech spectrum feature map.
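A hedged NumPy sketch of the SMAC computation above: each filter weights the magnitude spectrum, and the ratio of the 1st- to 0th-order moment yields one feature per filter and frame. The patent describes α as a bandwidth parameter but does not give the filter shape, so the Gaussian form, the filter count, and the α value are assumptions made here for illustration.

```python
# A minimal sketch of SMAC feature extraction under the stated assumptions.
import numpy as np

def smac_features(mag_spec, freqs, n_filters=20, alpha=100.0):
    """mag_spec: |X(omega, t)| with shape (n_freqs, n_frames);
    freqs: frequency axis in Hz. Returns a (Q, n_frames) feature map."""
    centers = np.linspace(freqs[1], freqs[-1], n_filters)    # omega_q
    feats = np.empty((n_filters, mag_spec.shape[1]))
    for q, wq in enumerate(centers):
        H = np.exp(-((freqs - wq) ** 2) / (2 * alpha ** 2))  # assumed H_q(omega)
        Xq = H[:, None] * mag_spec                           # X_q(omega, t)
        m0 = Xq.sum(axis=0)                                  # M_0(q, t)
        m1 = (freqs[:, None] * Xq).sum(axis=0)               # M_1(q, t)
        feats[q] = m1 / np.maximum(m0, 1e-12)                # R_1(q, t)
    return feats    # speech spectrum feature map
```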
Unlike general speaker recognition methods, the invention is optimized for speaker recognition under pitch variation and uses the SMAC feature rather than the common MFCC feature; the SMAC feature is robust to pitch and can meet the requirements of speaker identity authentication in pitch-changing scenarios.
A gender- and language-based speaker recognition system, as shown in FIG. 3, comprising:
the voice acquisition module is used for acquiring the audio data of the speaker;
the audio filtering module is used for filtering the collected sound audio and eliminating noise;
the speaker recognition module is used for performing speaker recognition on the filtered voice audio; the module comprises:
the audio spectrum conversion module is used for performing spectrum analysis on the input audio: SMAC feature extraction is applied to the original input audio to obtain a speech spectrum feature map;
the spectrum feature extraction module is used for extracting a feature vector from the speech spectrum feature map: the feature map is input, and the feature vector is extracted through a deep network;
the multi-target learning model module covers three recognition tasks: speaker identity recognition, speaker gender recognition, and spoken-language recognition, and consists of N shared layers, three task-specific hidden layers, and a fusion layer. The N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks. The inputs of the three hidden layers are each connected to the output of the Nth shared layer; their outputs are respectively the speaker identity, the speaker gender, and the spoken-language recognition result, and during training each hidden layer's parameters are influenced only by its corresponding recognition task. The fusion layer fuses the outputs of the three recognition tasks; each task output carries a trainable weight parameter, and the fusion layer takes the weighted combination of the three task outputs as the final recognition result.
And the recognition result display module is used for visualizing the recognition result. The module comprises:
the voice prompt module, which plays the recognition result as speech and plays an alarm sound if the speaker is not in the registrant list;
and the text display module, which displays the identified speaker information as text; if a problem occurs during recognition, the module displays an error.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are given by way of example only and are not limiting of the invention. The objects of the present invention have been fully and effectively accomplished. The functional and structural principles of the present invention have been shown and described in the examples, and any variations or modifications of the embodiments of the present invention may be made without departing from the principles.
Claims (8)
1. A method for speaker recognition based on gender and language, comprising:
acquiring voice data to be recognized, wherein the voice data is an audio file containing effective speaker audio;
carrying out noise reduction processing on the audio file to obtain low-noise voice audio;
performing SMAC feature extraction on the noise-reduced voice audio to obtain a speech spectrum feature map;
inputting the voice frequency spectrum characteristic diagram into a ResNet model to obtain a voice characteristic vector;
inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
and performing weighted fusion on the identified speaker identity, the speaker gender and the language information used by the speaker to obtain a speaker identification result corresponding to the voice data to be identified.
2. The method of claim 1, wherein the SMAC feature extraction method comprises:
the speech audio is processed through a filter:
X_q(ω,t) = X(ω,t)·H_q(ω),  q = 1, 2, ..., Q
wherein ω is the frequency variable of the spectrum, t is the frame index, and X(ω,t) is the spectral intensity at frequency ω in the tth frame; H_q(ω) denotes the qth filter, α denotes a parameter controlling the filter bandwidth, ω_q is the center frequency of the qth filter, Q is the number of filters, and X_q(ω,t) is the filtering result of the qth filter;
and calculating the 0th-order and 1st-order spectral moments of the filtering result:
M_m(q,t) = ∫ ω^m · X_q(ω,t) dω,  m = 0, 1
wherein m is the order of the moment and M_m(q,t) is the mth-order moment of the filtering result;
and taking the ratio of the 1st-order moment to the 0th-order moment as the speech spectral feature:
R_1(q,t) = M_1(q,t) / M_0(q,t)
wherein R_1(q,t) is the qth speech spectral feature, and the Q speech spectral features form the speech spectrum feature map.
3. The method as claimed in claim 1, wherein the multi-target learning model covers three recognition tasks: speaker identity recognition, speaker gender recognition, and spoken-language recognition, and consists of N shared layers and three task-specific hidden layers;
the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the speaker identity, the speaker gender, and the spoken-language recognition result, and during training each hidden layer's parameters are influenced only by its corresponding recognition task.
4. The method as claimed in claim 3, wherein the multi-objective learning model further comprises a fusion layer, the fusion layer is configured to fuse the output results of the three recognition tasks, each output result of the recognition task is provided with a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
5. A gender and language based speaker recognition system for implementing the speaker recognition method of claim 1, the speaker recognition system comprising:
the voice acquisition module is used for acquiring voice audio data of a speaker;
the audio filtering module is used for filtering the collected voice audio data and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the voice audio data after the filtering processing;
and the recognition result display module is used for carrying out visualization processing on the recognition result.
6. The system of claim 5, wherein the speaker recognition module comprises:
the audio spectrum conversion module is used for extracting SMAC features from the voice audio and converting them into a speech spectrum feature map;
the spectrum feature extraction module is used for extracting a feature vector from the speech spectrum feature map; and
the multi-target learning model module, which covers three recognition tasks: speaker identity recognition, speaker gender recognition, and spoken-language recognition, and consists of N shared layers, three task-specific hidden layers, and a fusion layer. The N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks. The inputs of the three hidden layers are each connected to the output of the Nth shared layer; their outputs are respectively the speaker identity, the speaker gender, and the spoken-language recognition result, and during training each hidden layer's parameters are influenced only by its corresponding recognition task. The fusion layer fuses the outputs of the three recognition tasks; each task output carries a trainable weight parameter, and the fusion layer takes the weighted combination of the three task outputs as the final recognition result.
7. The system according to claim 5, wherein the audio filtering module removes noise signals below a threshold frequency by low-frequency cut-off.
8. The system for gender and language based speaker recognition as claimed in claim 5, wherein the recognition result presentation module comprises:
the voice prompt module, which plays the recognition result as speech and plays an alarm sound if the speaker is not in the registrant list;
and the text display module, which displays the identified speaker information as text and displays an error if a problem occurs during recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210014706.8A CN114360551A (en) | 2022-01-07 | 2022-01-07 | Gender and language-based speaker identification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210014706.8A CN114360551A (en) | 2022-01-07 | 2022-01-07 | Gender and language-based speaker identification method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114360551A true CN114360551A (en) | 2022-04-15 |
Family
ID=81107786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210014706.8A Pending CN114360551A (en) | 2022-01-07 | 2022-01-07 | Gender and language-based speaker identification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114360551A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116913278A (en) * | 2023-09-12 | 2023-10-20 | 腾讯科技(深圳)有限公司 | Voice processing method, device, equipment and storage medium |
CN116913278B (en) * | 2023-09-12 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Voice processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||