CN113077794A - Human voice recognition system - Google Patents

Human voice recognition system

Info

Publication number
CN113077794A
CN113077794A (application CN202110367218.0A)
Authority
CN
China
Prior art keywords
module
voiceprint
voice
recognition
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110367218.0A
Other languages
Chinese (zh)
Inventor
程杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xinzhiyicai Technology Co ltd
Original Assignee
Nanjing Xinzhiyicai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xinzhiyicai Technology Co ltd filed Critical Nanjing Xinzhiyicai Technology Co ltd
Priority to CN202110367218.0A priority Critical patent/CN113077794A/en
Publication of CN113077794A publication Critical patent/CN113077794A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a human voice recognition system, applied in the technical field of speech recognition, comprising: a human voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a model training module and a voiceprint recognition module. The invention makes the input features more complete, with less noise and higher algorithm precision; a deep convolutional neural network is adopted to extract and classify the high-dimensional features of the voice, so the speaker's vocal characteristics are extracted and recognized directly, avoiding the drawback of identifying the speaker by the content of their speech and improving recognition accuracy.

Description

Human voice recognition system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a human voice recognition system.
Background
Speech recognition technology is an information technology in which a machine, through a process of recognition and understanding, converts speech, words, or phrases uttered by a person into corresponding text or symbols, or produces a response. A voiceprint is the spectrum of sound waves, carrying speech information, displayed by an electro-acoustic instrument. The production of human speech is a complex physiological and physical process involving the language centers of the brain and the vocal organs. The size and shape of the vocal organs used in speaking, such as the tongue, teeth, larynx, lungs and nasal cavity, differ greatly from person to person, so no two people have identical voiceprint spectra. Because the sound-wave spectra of different users differ when they speak, a unique user can be identified by voiceprint.
In the prior art, voiceprint recognition suffers from inaccurate recognition; compared with identity recognition methods such as face recognition and fingerprint recognition, this shortcoming has kept voiceprint recognition from being widely applied.
Therefore, providing a human voice recognition system capable of accurately recognizing a speaker is an urgent problem for those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a human voice recognition system that can accurately recognize a user's identity through the voiceprint.
To achieve the above purpose, the invention adopts the following technical scheme:
a voice recognition system comprises a voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a voiceprint recognition module and a model training module;
the voice acquisition module is connected with the input end of the preprocessing module and is used for acquiring voice print information after acquiring voice;
the preprocessing module is connected with the input end of the voiceprint feature extraction module and is used for carrying out noise reduction processing on the voiceprint information;
the voiceprint feature extraction module is connected with the input end of the function switching module and used for extracting voiceprint features;
the function switching module is used for selecting the voiceprint recognition function and the model training function;
the model training module is connected with the first output end of the function switching module and used for performing model training on the voiceprint features to obtain a voiceprint template;
the voiceprint template library is connected with the output end of the model training template and used for acquiring and storing the voiceprint template;
the input end of the voiceprint recognition module is connected with the second output end of the function switching module, and the first input/output end of the voiceprint recognition module is connected with the input/output end of the voiceprint template library and used for recognizing the identity of the user according to the voiceprint template.
Preferably, the human voice acquisition module includes: a sound collection unit and a volume adaptive unit;
the sound collection unit is connected with the input end of the volume adaptive unit and is used for collecting the user's voice for human voice recognition; the volume adaptive unit is used for adaptively processing the volume of the user's voice, normalizing it overall to the same maximum value for recognition and model training.
The technical effect realized by this technical scheme is as follows: normalizing the user's volume to the same maximum value balances the intensity of the sound signal, which facilitates voiceprint feature extraction.
Preferably, the preprocessing module comprises: a noise reduction unit and a signal enhancement unit;
the noise reduction unit is used for carrying out noise reduction processing on the voiceprint information to obtain the voiceprint information subjected to noise reduction; at least one of a spectrum elimination method and/or a learning identification method and/or a noise reduction automatic encoder is adopted for noise suppression; the signal enhancement unit is connected with the input end of the noise reduction unit and used for enhancing the voiceprint information of the human voice acquisition module.
The technical effect realized by the technical scheme is as follows: the noise of the user voiceprint information is reduced, and the final distinguishing accuracy is improved.
Preferably, the voiceprint feature extraction module includes: a voiceprint feature extraction unit and a spectrogram conversion unit; the voiceprint feature extraction unit is used for extracting the voiceprint features of the user's voice through a trained neural network algorithm model; and the spectrogram conversion unit is connected with the output end of the voiceprint feature extraction unit and is used for converting the obtained voiceprint features into a spectrogram.
Preferably, the method further comprises the following steps: a feedback voice module and a voice output module;
the feedback voice module is connected with the input/output end of the voiceprint recognition module, acquires the recognition result of the voiceprint recognition module and outputs a corresponding voice feedback signal to the voiceprint recognition module; and the sound output module is connected with the third input end of the voiceprint recognition module and used for receiving and outputting the voice feedback signal.
The technical effect realized by the technical scheme is as follows: the result of the human voice (identity) recognition is directly obtained through voice.
Through the above technical scheme, compared with the prior art, the invention provides a human voice recognition system. The invention makes the input features more complete, with less noise and higher algorithm precision; a deep convolutional neural network is adopted to extract and classify the high-dimensional features of the voice, so the speaker's vocal characteristics are extracted and recognized directly, avoiding the drawback of identifying the speaker by the content of their speech and improving recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a block diagram of a human voice recognition system according to the present invention;
FIG. 2 is a block diagram of a human voice acquisition module according to the present invention;
FIG. 3 is a block diagram of a preprocessing module according to the present invention;
FIG. 4 is a block diagram of the voiceprint feature extraction module of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, the present embodiment discloses a human voice recognition system,
the voice recognition system comprises a voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a voiceprint recognition module and a model training module;
the voice acquisition module is connected with the input end of the preprocessing module and is used for acquiring voice print information after acquiring voice;
the preprocessing module is connected with the input end of the voiceprint feature extraction module and is used for carrying out noise reduction processing on the voiceprint information;
the voiceprint feature extraction module is connected with the input end of the function switching module and used for extracting voiceprint features;
the function switching module is used for selecting the voiceprint recognition function and the model training function;
the model training module is connected with the first output end of the function switching module and used for carrying out model training on the voiceprint characteristics to obtain a voiceprint template;
the voiceprint template library is connected with the output end of the model training template and used for acquiring and storing the voiceprint template;
the input end of the voiceprint recognition module is connected with the second output end of the function switching module, and the first input/output end of the voiceprint recognition module is connected with the input/output end of the voiceprint template library and used for recognizing the identity of the user according to the voiceprint template.
In one embodiment, the function switching module provides two functions: voiceprint template acquisition and voiceprint recognition. In the voiceprint template acquisition state, voice information with fixed content is collected, and a voiceprint template is obtained through model training. In the voiceprint recognition state, the voice of the person currently being recognized is collected and, after speech preprocessing and voiceprint feature extraction, compared directly against the voiceprints in the voiceprint template library.
In a specific embodiment, the model trained on the extracted voiceprint features in the model training module can be any prior-art neural network model capable of achieving the technical effect.
In one embodiment, the human voice collecting module comprises: a sound collection unit and a volume adaptive unit;
the sound collection unit is connected with the input end of the volume self-adaptive unit and is used for collecting user sound for human voice recognition;
and the volume adaptive unit is used for adaptively processing the volume of the user's voice, normalizing it overall to the same maximum value for recognition and model training.
In one embodiment, the adaptive volume processing distinguishes voiced regions from silent regions. For voiced regions:
1) find the maximum value of the current sound data;
2) compute a coefficient relative to a constant, with reference to that maximum volume;
3) multiply the current sound data by the coefficient to obtain volume-adapted data.
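These three steps amount to peak normalization. A minimal sketch in Python (the patent specifies no implementation language, and the target-peak constant is an assumption):

```python
import numpy as np

def adapt_volume(samples, target_peak=0.9):
    """Scale sound data so its peak matches a target maximum.

    Mirrors the voiced-region steps: 1) find the maximum absolute value
    of the current sound data, 2) compute a coefficient relative to a
    constant (here the assumed target peak), 3) multiply the data by it.
    """
    samples = np.asarray(samples, dtype=np.float64)
    peak = np.max(np.abs(samples))
    if peak == 0.0:  # pure silence: nothing to scale
        return samples
    coeff = target_peak / peak
    return samples * coeff
```

Applying this to every utterance gives all users' recordings the same maximum value, which is the balancing of signal intensity described above.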
Specifically, the adaptive processing of silent regions adopts a silence self-truncation algorithm on the input sound features, which comprises the following steps:
1) average the absolute values over a window of 1600 sample values (about 0.1 s of audio);
2) judge whether the current data is a silent region by comparing this average against a threshold;
3) truncate the data judged to be a silent region.
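A minimal sketch of this silence self-truncation, assuming a 16 kHz sampling rate (so 1600 samples is about 0.1 s) and an illustrative threshold value, which the patent does not specify:

```python
import numpy as np

def truncate_silence(samples, window=1600, threshold=0.01):
    """Drop windows judged to be silence.

    For each 1600-sample window, average the absolute sample values and
    compare against a threshold; windows below it are truncated (cut).
    The threshold of 0.01 is an assumed value for illustration.
    """
    samples = np.asarray(samples, dtype=np.float64)
    kept = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if np.mean(np.abs(chunk)) >= threshold:  # voiced: keep
            kept.append(chunk)
    return np.concatenate(kept) if kept else samples[:0]
```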
In one particular embodiment, the preprocessing module includes: a noise reduction unit and a signal enhancement unit;
the noise reduction unit is used for performing noise reduction processing on the voiceprint information to obtain denoised voiceprint information; at least one of a spectral subtraction method, a learning-based recognition method, or a denoising autoencoder is used for noise suppression;
the signal enhancement unit is connected with the input end of the noise reduction unit and is used for enhancing the voiceprint information from the human voice acquisition module.
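Of the noise-suppression options listed, spectral subtraction is the simplest to illustrate. The sketch below is an assumed minimal implementation, not the patent's method: the frame length and the use of a separate noise clip for the noise estimate are illustrative choices.

```python
import numpy as np

def spectral_subtract(signal, noise_estimate, frame=256):
    """Frame-wise spectral subtraction sketch.

    Estimate the noise magnitude spectrum from a noise-only clip, subtract
    it from each frame's magnitude spectrum (flooring at zero), and
    resynthesize each frame with the original phase.
    """
    signal = np.asarray(signal, dtype=np.float64)
    noise_mag = np.abs(np.fft.rfft(np.asarray(noise_estimate, dtype=np.float64)[:frame]))
    out = signal.copy()
    for start in range(0, len(signal) - frame + 1, frame):
        spec = np.fft.rfft(signal[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract, floor at 0
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```

A denoising autoencoder would replace the subtraction step with a learned mapping from noisy to clean spectra; the overall frame-in, frame-out structure stays the same.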
In a specific embodiment, the voiceprint feature extraction module comprises: a voiceprint feature extraction unit and a spectrogram conversion unit; the voiceprint feature extraction unit is used for extracting the voiceprint features of the user's voice through the trained neural network algorithm model; and the spectrogram conversion unit is connected with the output end of the voiceprint feature extraction unit and is used for converting the obtained voiceprint features into a spectrogram.
In a specific embodiment, a convolutional neural network combined with an attention mechanism performs feature extraction on the user's voice to obtain a frame-level feature vector sequence; the frame-level feature vector sequence is down-sampled, in combination with the attention mechanism, into an intermediate feature vector of a preset dimension; and a fully connected operation on the intermediate feature vector yields a sentence-level voiceprint feature vector.
In a specific embodiment, the extracting features of the input target speech data by using a convolutional neural network in combination with an attention mechanism to obtain a frame-level feature vector sequence includes:
sequentially performing at least one convolution and rectification operation on the target voice data to obtain a first feature vector sequence; calculating a channel attention vector from the first feature vector sequence, and weighting the first feature vector sequence with the channel attention vector to obtain a second feature vector sequence; calculating a time attention vector from the second feature vector sequence, and weighting the second feature vector sequence with the time attention vector to obtain a third feature vector sequence; and rectifying the third feature vector sequence to obtain the frame-level feature vector sequence.
In one embodiment, calculating the channel attention vector from the first feature vector sequence and weighting the first feature vector sequence with it to obtain the second feature vector sequence includes: aggregating the time information of each channel of the first feature vector sequence using an average pooling operation and a maximum pooling operation, respectively; feeding the results of the average pooling operation and the maximum pooling operation into a multilayer perceptron; computing the channel attention vector from the perceptron's outputs using a Sigmoid function; and multiplying the first feature vector sequence by the channel attention vector element-wise to obtain the second feature vector sequence.
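These steps can be sketched as follows. The MLP weight shapes `w1` and `w2` are assumptions, since the patent gives no layer sizes, and summing the two pooled branches before the Sigmoid follows the common CBAM-style formulation rather than anything the patent states:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel attention over a (channels, time) feature map.

    Average pooling and max pooling aggregate time information per channel;
    both pooled vectors pass through a shared two-layer perceptron; a
    Sigmoid yields the per-channel attention vector, which reweights x
    element-wise.
    """
    avg = x.mean(axis=1)                               # (channels,) time average
    mx = x.max(axis=1)                                 # (channels,) time maximum
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)       # shared MLP, ReLU hidden layer
    attn = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # Sigmoid -> (channels,)
    return x * attn[:, None]                           # broadcast over time
```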
In a specific embodiment, calculating the time attention vector from the second feature vector sequence and weighting the second feature vector sequence with it to obtain the third feature vector sequence includes: aggregating the channel information at each time point of the second feature vector sequence using an average pooling operation and a maximum pooling operation, respectively; merging the results of the average pooling operation and the maximum pooling operation into a multidimensional vector; convolving the multidimensional vector with a preset convolution kernel to obtain the time attention vector; and multiplying the second feature vector sequence by the time attention vector element-wise to obtain the third feature vector sequence.
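A corresponding sketch for the temporal branch. The two-element `kernel` here is an assumed stand-in for the patent's unspecified preset convolution kernel (it acts as a 1x1 convolution over the two pooled rows):

```python
import numpy as np

def temporal_attention(x, kernel):
    """Temporal attention over a (channels, time) feature map.

    Average pooling and max pooling aggregate channel information at each
    time point; the two pooled rows are merged by the kernel; a Sigmoid
    gives the per-time attention, which reweights x element-wise.
    """
    avg = x.mean(axis=0)                       # (time,) channel average
    mx = x.max(axis=0)                         # (time,) channel maximum
    merged = kernel[0] * avg + kernel[1] * mx  # merge the two pooled rows
    attn = 1.0 / (1.0 + np.exp(-merged))       # Sigmoid -> (time,)
    return x * attn[None, :]                   # broadcast over channels
```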
In a specific embodiment, after the time attention vector is calculated from the second feature vector sequence and used to weight it into the third feature vector sequence, the method further includes: adding a residual connection on top of the third feature vector sequence; the rectification operation then applies to the sum of the third feature vector sequence and the residual to obtain the frame-level feature vector sequence.
In a specific embodiment, the neural network algorithm model comprises an input layer, an SVM layer, a convolution layer, a pooling layer and a fully connected layer. The input layer receives spectrum information obtained by applying a Laplace transform to the voiceprint information output by the preprocessing module; the spectrum information that the SVM layer feeds to the fully connected layer is the feature vector obtained by the voiceprint feature extraction module. The convolution layer uses a 5 × 5 convolution kernel and 8 filters; the pooling layer uses a 3 × 3 pooling window and 16 channels; the fully connected layer uses 16 filters and a 3 × 3 convolution kernel, and its input comes from the output of the pooling layer.
the pooling method of the pooling layer is as follows:
X_e = f(u_e + φ(u_e))
u_e = w_e · x_{e-1} + b_e + δ
wherein X_e represents the output of the current layer, u_e represents the input of the activation function, f(·) represents the activation function, w_e represents the weight of the current layer, φ represents the loss function, x_{e-1} represents the output of the previous layer, b_e represents the offset, and δ represents a constant.
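A scalar sketch of these pooling relations: X_e = f(u_e + φ(u_e)), with the pre-activation form u_e = w_e · x_{e-1} + b_e + δ inferred from the symbol definitions. The activation f and the function φ are left as assumed callables, since the patent names neither.

```python
def layer_forward(x_prev, w, b, delta, f, phi):
    """Scalar sketch of the pooling-layer equations:
        u_e = w_e * x_{e-1} + b_e + delta
        X_e = f(u_e + phi(u_e))
    f (activation) and phi are assumed callables supplied by the caller.
    """
    u = w * x_prev + b + delta
    return f(u + phi(u))
```

For example, with a ReLU activation and φ ≡ 0, the layer reduces to the familiar affine-plus-activation form.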
In a specific embodiment, the method further comprises the following steps: a feedback voice module and a voice output module;
the voice feedback module is connected with the input/output end of the voiceprint recognition module, acquires the recognition result of the voiceprint recognition module and outputs a corresponding voice feedback signal to the voiceprint recognition module;
and the sound output module is connected with the third input end of the voiceprint recognition module and is used for receiving and outputting the voice feedback signal, for example: recognition successful, name Zhang San, student number 001, examination subject mathematics.
In one embodiment, the user is a student taking an examination, and human voice recognition is performed to confirm the student's identity, preventing impersonation in the examination.
In a particular embodiment, the sound collection unit comprises a microphone.
In a particular embodiment, the sound output module comprises a speaker.
In one particular embodiment, the noise reduction unit includes an XFM10412 chip.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A human voice recognition system, comprising: a human voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a voiceprint recognition module and a model training module;
the human voice acquisition module is connected with the input end of the preprocessing module and is used for collecting the human voice and obtaining voiceprint information;
the preprocessing module is connected with the input end of the voiceprint feature extraction module and is used for performing noise reduction processing on the voiceprint information;
the voiceprint feature extraction module is connected with the input end of the function switching module and is used for extracting voiceprint features;
the function switching module is used for selecting between the voiceprint recognition function and the model training function;
the model training module is connected with the first output end of the function switching module and is used for performing model training on the voiceprint features to obtain a voiceprint template;
a voiceprint template library is connected with the output end of the model training module and is used for acquiring and storing the voiceprint template;
and the input end of the voiceprint recognition module is connected with the second output end of the function switching module, and the first input/output end of the voiceprint recognition module is connected with the input/output end of the voiceprint template library, for recognizing the identity of the user according to the voiceprint template.
2. A human voice recognition system according to claim 1, wherein:
the human voice acquisition module comprises: a sound collection unit and a volume adaptive unit;
the sound collection unit is connected with the input end of the volume self-adaptive unit and is used for collecting user sound for human voice recognition;
and the volume adaptive unit is used for adaptively processing the volume of the user's voice, normalizing it overall to the same maximum value for recognition and model training.
3. A human voice recognition system according to claim 1, wherein:
the preprocessing module comprises: a noise reduction unit and a signal enhancement unit;
the noise reduction unit is used for performing noise reduction processing on the voiceprint information to obtain denoised voiceprint information; at least one of a spectral subtraction method, a learning-based recognition method, or a denoising autoencoder is used for noise suppression;
the signal enhancement unit is connected with the input end of the noise reduction unit and used for enhancing the voiceprint information of the human voice acquisition module.
4. A human voice recognition system according to claim 1, wherein:
the voiceprint feature extraction module comprises: the voice print feature extraction unit and the voice spectrum chip conversion unit;
the voiceprint feature extraction unit is used for extracting the voiceprint features of the voice of the user through a trained neural network algorithm model;
and the voice spectrum picture conversion unit is connected with the output end of the voiceprint feature extraction unit and is used for converting the obtained voiceprint features into a voice spectrum.
5. A human voice recognition system according to any one of claims 1 to 4, wherein:
further comprising: a feedback voice module and a voice output module;
the feedback voice module is connected with the input/output end of the voiceprint recognition module, acquires the recognition result of the voiceprint recognition module and outputs a corresponding voice feedback signal to the voiceprint recognition module;
and the sound output module is connected with the third input end of the voiceprint recognition module and used for receiving and outputting the voice feedback signal.
CN202110367218.0A 2021-04-06 2021-04-06 Human voice recognition system Withdrawn CN113077794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110367218.0A CN113077794A (en) 2021-04-06 2021-04-06 Human voice recognition system


Publications (1)

Publication Number Publication Date
CN113077794A true CN113077794A (en) 2021-07-06

Family

ID=76615844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110367218.0A Withdrawn CN113077794A (en) 2021-04-06 2021-04-06 Human voice recognition system

Country Status (1)

Country Link
CN (1) CN113077794A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113948109A (en) * 2021-10-14 2022-01-18 广州蓝仕威克软件开发有限公司 System for recognizing physiological phenomenon based on voice


Similar Documents

Publication Publication Date Title
Tiwari MFCC and its applications in speaker recognition
US20130297299A1 (en) Sparse Auditory Reproducing Kernel (SPARK) Features for Noise-Robust Speech and Speaker Recognition
Rajisha et al. Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM
KR101785500B1 (en) A monophthong recognition method based on facial surface EMG signals by optimizing muscle mixing
Murugappan et al. DWT and MFCC based human emotional speech classification using LDA
US20230298616A1 (en) System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input with Haptic Output
Usman et al. Heart rate detection and classification from speech spectral features using machine learning
Kanabur et al. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition
Grewal et al. Isolated word recognition system for English language
CN113077794A (en) Human voice recognition system
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Karthikeyan et al. Hybrid machine learning classification scheme for speaker identification
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Tripathi et al. CNN based Parkinson's Disease Assessment using Empirical Mode Decomposition.
Hemmerling et al. Parkinson’s disease classification based on vowel sound
Sengupta et al. Optimization of cepstral features for robust lung sound classification
CN115050353A (en) Human voice recognition system
Abushariah et al. Voice based automatic person identification system using vector quantization
Nazifa et al. Gender prediction by speech analysis
CN114881668A (en) Multi-mode-based deception detection method
Daqrouq et al. Arabic vowels recognition based on wavelet average framing linear prediction coding and neural network
Sas et al. Gender recognition using neural networks and ASR techniques
Arpitha et al. Diagnosis of disordered speech using automatic speech recognition
CN111508503B (en) Method and device for identifying same speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210706)