CN114360551A - Gender and language-based speaker identification method and system - Google Patents


Info

Publication number
CN114360551A
CN114360551A
Authority
CN
China
Prior art keywords
speaker
recognition
voice
layer
audio
Prior art date
Legal status
Pending
Application number
CN202210014706.8A
Other languages
Chinese (zh)
Inventor
徐文渊
冀晓宇
程雨诗
高逸卓
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210014706.8A priority Critical patent/CN114360551A/en
Publication of CN114360551A publication Critical patent/CN114360551A/en
Pending legal-status Critical Current


Abstract

The invention discloses a gender- and language-based speaker recognition method and system, belonging to the field of speaker recognition. The method comprises the following steps: acquiring voice data to be recognized, specifically an audio file containing valid speaker audio; performing noise reduction on the audio file to obtain low-noise speech audio; extracting SMAC features from the denoised speech audio to obtain a speech spectral feature map; inputting the speech spectral feature map into a ResNet model to obtain a voice feature vector; inputting the voice feature vector into a multi-objective learning model to recognize the speaker's identity, the speaker's gender, and the language used by the speaker; and performing weighted fusion of the three recognition results to obtain the speaker recognition result corresponding to the voice data to be recognized. The invention comprehensively utilizes the gender information and language information in speech, effectively improves the robustness of speaker recognition, and achieves high recognition accuracy especially when the speaker's voice changes.

Description

Gender and language-based speaker identification method and system
Technical Field
The invention relates to the field of speaker identification, in particular to a speaker identification method and system based on gender and language.
Background
With the continuous development of artificial intelligence, more and more intelligent recognition technologies are being applied in daily life, including face recognition, fingerprint recognition, and, emerging in recent years, voiceprint recognition. Voiceprint recognition, also known as speaker recognition, determines which speaker a piece of audio belongs to by analyzing its content. Speaker recognition can be used for identity authentication and has attracted widespread attention because of its convenience.
Most existing speaker recognition methods consider only a single factor, namely the speaker's identity. Such methods require the speaker to maintain a similar speaking style in both the voiceprint enrollment and voiceprint verification stages, so recognition accuracy drops when the speaker uses a different tone of voice.
Disclosure of Invention
The invention provides a gender- and language-based speaker recognition method and system that recognizes a speaker by combining the gender information and language information contained in the speech content, solving the technical problem that single-factor recognition methods lose accuracy when the speaker's tone changes.
To achieve this purpose, the invention adopts the following technical solutions:
a first object of the present invention is to provide a gender and language based speaker recognition method, the method comprising:
acquiring voice data to be recognized, wherein the voice data is an audio file in wav format containing effective speaker audio;
carrying out noise reduction processing on the audio file to obtain low-noise voice audio;
extracting voice audio frequency through SMAC characteristics to obtain a voice frequency spectrum characteristic diagram;
inputting the voice frequency spectrum characteristic diagram into a ResNet model to obtain a voice characteristic vector;
inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
and performing weighted fusion on the identified speaker identity, the speaker gender and the language information used by the speaker to obtain a speaker identification result corresponding to the voice data to be identified.
Further, the multi-objective learning model comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers, three hidden layers, and a fusion layer;
the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task;
the fusion layer fuses the output results of the three recognition tasks; the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
A second object of the present invention is to provide a gender- and language-based speaker recognition system for implementing the above speaker recognition method; the system comprises:
the voice acquisition module is used for acquiring voice audio data of a speaker;
the audio filtering module is used for filtering the collected voice audio data and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the voice audio data after the filtering processing;
and the recognition result display module is used for carrying out visualization processing on the recognition result.
The beneficial effects of the invention are as follows: the invention comprehensively utilizes the gender information and language information in speech, effectively improves the robustness of speaker recognition, and achieves high recognition accuracy especially when the speaker's voice changes.
Drawings
FIG. 1 is a block diagram of the gender- and language-based speaker recognition method according to the present invention.
FIG. 2 is a schematic diagram of the speaker recognition framework in an example of the present invention.
FIG. 3 is a schematic diagram of the gender- and language-based speaker recognition system according to the present invention.
Detailed Description
The technical framework of the invention is explained below with reference to the accompanying drawings.
Most existing speaker recognition methods consider only a single factor, namely the speaker's identity. Such methods require the speaker to maintain a similar speaking style in both the voiceprint enrollment and voiceprint verification stages, so recognition accuracy drops when the speaker uses a different tone of voice.
To solve the technical problem that speaker recognition in the prior art has low robustness because most methods identify speakers based on a single factor, embodiments of the present invention provide a gender- and language-based speaker recognition method and system.
The technical solutions provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A gender- and language-based speaker recognition method, as shown in FIG. 1, comprises:
step S101, acquiring data of the voice to be recognized.
The voice data is an audio file in wav format containing effective speaker audio.
Step S102: perform noise reduction on the audio file to obtain low-noise speech audio, and apply spectral conversion to the denoised audio to obtain a speech spectral feature map.
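The patent does not fix a specific noise-reduction algorithm at this step; claim 7 mentions cutting off signal components below a frequency threshold, so the sketch below assumes a simple Butterworth low-cut (high-pass) filter. The function name and the 80 Hz cutoff are illustrative choices, not values from the patent.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def denoise_low_cut(wav_path: str, cutoff_hz: float = 80.0):
    """Load a wav file and suppress low-frequency noise below cutoff_hz."""
    rate, audio = wavfile.read(wav_path)
    audio = audio.astype(np.float64)
    if audio.ndim > 1:                       # mix stereo down to mono
        audio = audio.mean(axis=1)
    # 4th-order Butterworth high-pass filter, applied forward and backward
    sos = butter(4, cutoff_hz, btype="highpass", fs=rate, output="sos")
    return rate, sosfiltfilt(sos, audio)
```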
step S103, inputting the voice frequency spectrum characteristic diagram into a ResNet model to obtain a voice characteristic vector;
step S104, inputting the voice characteristic vector into a multi-target learning model, and identifying to obtain the identity of a speaker, the gender of the speaker and language information used by the speaker;
with respect to step S104, in one example, as shown in FIG. 2, the model framework of the multi-objective learning model includes a plurality of recognition tasks: speaker recognition (primary task), gender recognition, and language recognition, by introducing multiple secondary recognition factors, improve the accuracy of speaker recognition. In addition, the framework comprises a group of sharing layers, wherein parameters in the sharing layers are common to a plurality of recognition tasks, and each recognition task can optimize the parameters of the sharing layers under the model in the training process. The framework comprises a plurality of hidden layers specific to the tasks, wherein the hidden layers are specific to each recognition task and are embodied in the training process, and only the results of the corresponding recognition tasks can influence the parameters of the hidden layers.
Step S105: perform weighted fusion of the recognized speaker identity, speaker gender, and language information to obtain the speaker recognition result for the voice to be recognized. In this embodiment, this is realized by introducing a fusion layer after the three hidden layers of the multi-objective learning model; the fusion layer fuses the output results of the three recognition tasks, the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
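As a concrete illustration of this architecture, the PyTorch sketch below wires N shared layers into three task-specific heads and a fusion step with trainable weights. It is a minimal sketch under stated assumptions, not the patent's implementation: the layer sizes are arbitrary, and since the patent does not spell out how the gender and language outputs are mapped back onto speakers, each enrolled speaker is assumed to have a registered gender and language index so that the auxiliary scores can be gathered per speaker.

```python
import torch
import torch.nn as nn

class MultiTaskSpeakerNet(nn.Module):
    def __init__(self, feat_dim, hidden, n_speakers, n_langs,
                 speaker_gender, speaker_lang, n_shared=2):
        super().__init__()
        layers = []
        for i in range(n_shared):                      # N shared layers
            layers += [nn.Linear(feat_dim if i == 0 else hidden, hidden),
                       nn.ReLU()]
        self.shared = nn.Sequential(*layers)
        self.id_head = nn.Linear(hidden, n_speakers)   # speaker identity task
        self.gender_head = nn.Linear(hidden, 2)        # gender task
        self.lang_head = nn.Linear(hidden, n_langs)    # language task
        self.fusion_w = nn.Parameter(torch.ones(3))    # trainable fusion weights
        # registered gender / language index per enrolled speaker (LongTensors)
        self.register_buffer("spk_gender", speaker_gender)
        self.register_buffer("spk_lang", speaker_lang)

    def forward(self, x):
        h = self.shared(x)
        id_p = self.id_head(h).softmax(-1)             # (B, n_speakers)
        g_p = self.gender_head(h).softmax(-1)          # (B, 2)
        l_p = self.lang_head(h).softmax(-1)            # (B, n_langs)
        # score each enrolled speaker by how well the predicted gender and
        # language match that speaker's registered attributes
        g_match = g_p[:, self.spk_gender]              # (B, n_speakers)
        l_match = l_p[:, self.spk_lang]                # (B, n_speakers)
        w = self.fusion_w.softmax(0)
        fused = w[0] * id_p + w[1] * g_match + w[2] * l_match
        return fused, (id_p, g_p, l_p)
```

During training, each of the three heads would receive its own loss, so that all three tasks update the shared layers while each head is updated only by its own task, matching the parameter-sharing behavior described above.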
An optimal combination of weight coefficients is selected when fusing the three recognition results: during model training, the accuracy of different weight-coefficient combinations is tested, and the combination with the highest accuracy is selected as the final weights, as sketched below. The weight coefficients are influenced by the discriminative power of each recognition task and by the relationship of the subtasks to the primary task.
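A sketch of this selection step, assuming a simple grid search: candidate weight combinations are scored on held-out validation data and the most accurate one is kept. The candidate grid and the score_fn callable (which should fuse the three task outputs under a given weight triple) are illustrative assumptions, not details from the patent.

```python
import itertools
import torch

def select_fusion_weights(score_fn, val_feats, val_labels,
                          grid=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Try every weight triple from the grid; keep the most accurate one."""
    best_acc, best_w = -1.0, None
    for w in itertools.product(grid, repeat=3):
        fused = score_fn(val_feats, torch.tensor(w))   # (B, n_speakers) scores
        acc = (fused.argmax(-1) == val_labels).float().mean().item()
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc
```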
In one specific implementation of the present invention, the speech spectral feature map is formed from the SMAC features of the speech. The SMAC feature extraction method comprises:
processing the speech audio through a set of filters:
H_q(ω): the qth band filter, with bandwidth parameter α and center frequency ω_q (the filter formula is given as an image in the original)
X_q(ω, t) = X(ω, t) · H_q(ω),  q = 1, 2, …, Q
where ω is the frequency variable of the spectrum, t is the frame index, X(ω, t) is the spectral intensity of frame t at frequency ω, H_q(ω) denotes the qth filter, α denotes a parameter controlling the filter bandwidth, ω_q is the center frequency of the qth filter, Q is the number of filters, and X_q(ω, t) denotes the filtering result of the qth filter.
Then the 0th-order and 1st-order central moments of the filtering result are calculated:
M_m(q, t): the mth-order central moment of the filtering result of the qth filter at frame t (the moment formula is given as an image in the original), where m denotes the order of the central moment.
Finally, the ratio of the 1st-order central moment to the 0th-order central moment is taken as the speech spectral feature:
R_1(q, t) = M_1(q, t) / M_0(q, t)
where R_1(q, t) denotes the qth speech spectral feature; the Q speech spectral features together form the speech spectral feature map.
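Below is a numpy sketch of this SMAC extraction pipeline. The filter H_q(ω) and the moment formula are given only as images in the original, so a Gaussian band filter exp(-α(ω - ω_q)^2) and raw spectral moments summed over ω are assumed here; the number of filters, α, and the STFT settings are likewise illustrative values.

```python
import numpy as np
from scipy.signal import stft

def smac_features(audio, rate, n_filters=40, alpha=1e-5, n_fft=512):
    """Return an assumed (Q, n_frames) SMAC-style speech spectral feature map."""
    freqs, _, X = stft(audio, fs=rate, nperseg=n_fft)   # X: (n_freqs, n_frames)
    mag = np.abs(X)
    centers = np.linspace(freqs[1], freqs[-1], n_filters)
    feats = np.empty((n_filters, mag.shape[1]))
    for q, w_q in enumerate(centers):
        H = np.exp(-alpha * (freqs - w_q) ** 2)         # assumed Gaussian filter
        Xq = mag * H[:, None]                           # X_q(w, t) = X(w, t) H_q(w)
        m0 = Xq.sum(axis=0)                             # 0th-order moment M_0(q, t)
        m1 = (freqs[:, None] * Xq).sum(axis=0)          # 1st-order moment M_1(q, t)
        feats[q] = m1 / np.maximum(m0, 1e-12)           # R_1(q, t) = M_1 / M_0
    return feats
```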
Unlike general speaker recognition methods, the invention is optimized for speaker recognition under pitch variation: it uses the SMAC feature rather than the common MFCC feature, and the SMAC feature's robustness to pitch satisfies the requirement of speaker identity authentication in scenarios where the pitch changes.
A gender- and language-based speaker recognition system, as shown in FIG. 3, comprises:
the voice acquisition module is used for acquiring the audio data of the speaker;
the audio filtering module is used for filtering the collected sound audio and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the audio frequency after the filtering processing; the module comprises:
the audio spectrum conversion module is used for performing spectral analysis on the input audio, extracting SMAC features from the original input audio to obtain a speech spectral feature map;
the spectral feature extraction module is used for extracting the feature vector of the speech spectral feature map: it takes the speech spectral feature map as input and extracts the feature vector through a deep network;
and the multi-objective learning model module, which comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers, three hidden layers, and a fusion layer; the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task; the fusion layer fuses the output results of the three recognition tasks, the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
And the recognition result display module is used for visualizing the recognition result. This module comprises:
a voice prompt module, which plays the recognition result through voice and plays an alarm sound if the speaker is not in the list of enrolled speakers;
and a text display module, which displays the recognized speaker information as text; if a problem occurs during the recognition process, an error is displayed on this module.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are given by way of example only and are not limiting of the invention. The objects of the present invention have been fully and effectively accomplished. The functional and structural principles of the present invention have been shown and described in the examples, and any variations or modifications of the embodiments of the present invention may be made without departing from the principles.

Claims (8)

1. A gender- and language-based speaker recognition method, comprising:
acquiring voice data to be recognized, the voice data being an audio file containing valid speaker audio;
performing noise reduction on the audio file to obtain low-noise speech audio;
extracting SMAC features from the denoised speech audio to obtain a speech spectral feature map;
inputting the speech spectral feature map into a ResNet model to obtain a voice feature vector;
inputting the voice feature vector into a multi-objective learning model to recognize the speaker's identity, the speaker's gender, and the language used by the speaker;
and performing weighted fusion of the recognized speaker identity, speaker gender, and language information to obtain the speaker recognition result corresponding to the voice data to be recognized.
2. The method of claim 1, wherein the SMAC feature extraction method comprises:
processing the speech audio through a set of filters:
H_q(ω): the qth band filter, with bandwidth parameter α and center frequency ω_q (the filter formula is given as an image in the original)
X_q(ω, t) = X(ω, t) · H_q(ω),  q = 1, 2, …, Q
where ω is the frequency variable of the spectrum, t is the frame index, X(ω, t) is the spectral intensity of frame t at frequency ω, H_q(ω) denotes the qth filter, α denotes a parameter controlling the filter bandwidth, ω_q is the center frequency of the qth filter, Q is the number of filters, and X_q(ω, t) denotes the filtering result of the qth filter;
calculating the 0th-order and 1st-order central moments of the filtering result:
M_m(q, t): the mth-order central moment of the filtering result of the qth filter at frame t (the moment formula is given as an image in the original), where m denotes the order of the central moment;
and taking the ratio of the 1st-order central moment to the 0th-order central moment as the speech spectral feature:
R_1(q, t) = M_1(q, t) / M_0(q, t)
where R_1(q, t) denotes the qth speech spectral feature, and the Q speech spectral features form the speech spectral feature map.
3. The gender- and language-based speaker recognition method of claim 1, wherein the multi-objective learning model comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers and three hidden layers;
the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task.
4. The gender- and language-based speaker recognition method of claim 3, wherein the multi-objective learning model further comprises a fusion layer for fusing the output results of the three recognition tasks; the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
5. A gender and language based speaker recognition system for implementing the speaker recognition method of claim 1, the speaker recognition system comprising:
the voice acquisition module is used for acquiring voice audio data of a speaker;
the audio filtering module is used for filtering the collected voice audio data and eliminating noise;
the speaker recognition module is used for carrying out speaker recognition on the voice audio data after the filtering processing;
and the recognition result display module is used for carrying out visualization processing on the recognition result.
6. The system of claim 5, wherein the speaker recognition module comprises:
the audio spectrum conversion module is used for extracting SMAC features from the speech audio and converting them to obtain a speech spectral feature map;
the spectral feature extraction module is used for extracting the feature vector of the speech spectral feature map;
and the multi-objective learning model module, which comprises three recognition tasks (speaker identity recognition, speaker gender recognition, and recognition of the language used by the speaker) and consists of N shared layers, three hidden layers, and a fusion layer; the N shared layers are connected in sequence, and during training the shared-layer parameters are influenced by the recognition results of all three tasks; the inputs of the three hidden layers are each connected to the output of the Nth shared layer, their outputs are respectively the recognition results for the speaker's identity, the speaker's gender, and the language used by the speaker, and during training the parameters of each hidden layer are influenced only by the corresponding recognition task; the fusion layer fuses the output results of the three recognition tasks, the output of each recognition task carries a trainable weight parameter, and the fusion layer takes the weighted result of the three recognition tasks as the final recognition result.
7. The system according to claim 5, wherein the audio filtering module removes noise by cutting off all signal components below a frequency threshold (low-frequency cutting).
8. The gender- and language-based speaker recognition system of claim 5, wherein the recognition result display module comprises:
a voice prompt module, which plays the recognition result through voice and plays an alarm sound if the speaker is not in the list of enrolled speakers;
and a text display module, which displays the recognized speaker information as text and displays an error if a problem occurs during the recognition process.
CN202210014706.8A 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system Pending CN114360551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210014706.8A CN114360551A (en) 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210014706.8A CN114360551A (en) 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system

Publications (1)

Publication Number Publication Date
CN114360551A true CN114360551A (en) 2022-04-15

Family

ID=81107786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210014706.8A Pending CN114360551A (en) 2022-01-07 2022-01-07 Gender and language-based speaker identification method and system

Country Status (1)

Country Link
CN (1) CN114360551A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913278A (en) * 2023-09-12 2023-10-20 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN116913278B (en) * 2023-09-12 2023-11-17 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination