CN113077794A - Human voice recognition system - Google Patents

Human voice recognition system

Info

Publication number
CN113077794A
CN113077794A (application CN202110367218.0A)
Authority
CN
China
Prior art keywords
module
voiceprint
voice
recognition
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110367218.0A
Other languages
Chinese (zh)
Inventor
程杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xinzhiyicai Technology Co ltd
Original Assignee
Nanjing Xinzhiyicai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xinzhiyicai Technology Co ltd filed Critical Nanjing Xinzhiyicai Technology Co ltd
Priority to CN202110367218.0A priority Critical patent/CN113077794A/en
Publication of CN113077794A publication Critical patent/CN113077794A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a human voice recognition system, applied in the technical field of speech recognition, comprising: a human voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a model training module and a voiceprint recognition module. The invention makes the input features more complete, with less noise and higher algorithm precision; a deep convolutional neural network is adopted to extract and classify the high-dimensional features of the voice, so the speaker's vocal characteristics are extracted and recognized directly, avoiding the drawback of identifying the speaker by the content of their speech and improving recognition accuracy.

Description

Human voice recognition system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a human voice recognition system.
Background
Speech recognition technology is an information technology in which a machine, through a process of recognition and understanding, converts speech, words, or phrases uttered by a person into corresponding text or symbols, or produces a response. A voiceprint is the spectrum of sound waves, carrying speech information, displayed by an electro-acoustic instrument. The production of human speech is a complex physiological and physical process involving the language centers of the brain and the vocal organs. The size and shape of the vocal organs used in speaking, such as the tongue, teeth, larynx, lungs and nasal cavity, differ greatly from person to person, so no two people have identical voiceprint spectra. Because the sound-wave spectra of different users differ when they speak, a unique user can be identified by voiceprint.
In the prior art, voiceprint recognition suffers from inaccurate recognition; compared with identity recognition methods such as face recognition and fingerprint recognition, this shortcoming has kept voiceprint recognition from being widely applied.
Therefore, providing a human voice recognition system capable of accurately recognizing a speaker is an urgent problem for those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a human voice recognition system that can accurately recognize a user's identity through the voiceprint.
To achieve the above purpose, the invention adopts the following technical scheme:
a voice recognition system comprises a voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a voiceprint recognition module and a model training module;
the voice acquisition module is connected with the input end of the preprocessing module and is used for acquiring voice print information after acquiring voice;
the preprocessing module is connected with the input end of the voiceprint feature extraction module and is used for carrying out noise reduction processing on the voiceprint information;
the voiceprint feature extraction module is connected with the input end of the function switching module and used for extracting voiceprint features;
the function switching module is used for selecting the voiceprint recognition function and the model training function;
the model training module is connected with the first output end of the function switching module and used for performing model training on the voiceprint features to obtain a voiceprint template;
the voiceprint template library is connected with the output end of the model training template and used for acquiring and storing the voiceprint template;
the input end of the voiceprint recognition module is connected with the second output end of the function switching module, and the first input/output end of the voiceprint recognition module is connected with the input/output end of the voiceprint template library and used for recognizing the identity of the user according to the voiceprint template.
Preferably, the human voice acquisition module includes: a sound collection unit and a volume adaptive unit;
the sound collection unit is connected with the input end of the volume adaptive unit and is used for collecting the user's voice for human voice recognition; the volume adaptive unit is used for adaptively processing the volume of the user's voice, normalizing it overall to the same maximum value for recognition and model training.
The technical effect realized by this technical scheme is as follows: normalizing the user's volume to the same maximum value balances the intensity of the sound signal, which facilitates voiceprint feature extraction.
Preferably, the preprocessing module comprises: a noise reduction unit and a signal enhancement unit;
the noise reduction unit is used for carrying out noise reduction processing on the voiceprint information to obtain the voiceprint information subjected to noise reduction; at least one of a spectrum elimination method and/or a learning identification method and/or a noise reduction automatic encoder is adopted for noise suppression; the signal enhancement unit is connected with the input end of the noise reduction unit and used for enhancing the voiceprint information of the human voice acquisition module.
The technical effect realized by the technical scheme is as follows: the noise of the user voiceprint information is reduced, and the final distinguishing accuracy is improved.
Preferably, the voiceprint feature extraction module includes: a voiceprint feature extraction unit and a spectrogram conversion unit; the voiceprint feature extraction unit is used for extracting the voiceprint features of the user's voice through a trained neural network algorithm model; and the spectrogram conversion unit is connected with the output end of the voiceprint feature extraction unit and is used for converting the obtained voiceprint features into a spectrogram.
Preferably, the method further comprises the following steps: a feedback voice module and a voice output module;
the feedback voice module is connected with the input/output end of the voiceprint recognition module, acquires the recognition result of the voiceprint recognition module and outputs a corresponding voice feedback signal to the voiceprint recognition module; and the sound output module is connected with the third input end of the voiceprint recognition module and used for receiving and outputting the voice feedback signal.
The technical effect realized by the technical scheme is as follows: the result of the human voice (identity) recognition is directly obtained through voice.
Through the above technical scheme, compared with the prior art, the invention provides a human voice recognition system. The invention makes the input features more complete, with less noise and higher algorithm precision; a deep convolutional neural network is adopted to extract and classify the high-dimensional features of the voice, so the speaker's vocal characteristics are extracted and recognized directly, avoiding the drawback of identifying the speaker by the content of their speech and improving recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a block diagram of a human voice recognition system according to the present invention;
FIG. 2 is a block diagram of a human voice acquisition module according to the present invention;
FIG. 3 is a block diagram of a preprocessing module according to the present invention;
FIG. 4 is a block diagram of the voiceprint feature extraction module of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, the present embodiment discloses a human voice recognition system,
the voice recognition system comprises a voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a voiceprint recognition module and a model training module;
the voice acquisition module is connected with the input end of the preprocessing module and is used for acquiring voice print information after acquiring voice;
the preprocessing module is connected with the input end of the voiceprint feature extraction module and is used for carrying out noise reduction processing on the voiceprint information;
the voiceprint feature extraction module is connected with the input end of the function switching module and used for extracting voiceprint features;
the function switching module is used for selecting the voiceprint recognition function and the model training function;
the model training module is connected with the first output end of the function switching module and used for carrying out model training on the voiceprint characteristics to obtain a voiceprint template;
the voiceprint template library is connected with the output end of the model training template and used for acquiring and storing the voiceprint template;
the input end of the voiceprint recognition module is connected with the second output end of the function switching module, and the first input/output end of the voiceprint recognition module is connected with the input/output end of the voiceprint template library and used for recognizing the identity of the user according to the voiceprint template.
In one embodiment, the function switching module provides two functions: voiceprint template acquisition and voiceprint recognition. In the voiceprint template acquisition state, voice information with fixed content is collected, and a voiceprint template is obtained through model training. In the voiceprint recognition state, the voice of the person currently being recognized is collected and, after speech preprocessing and voiceprint feature extraction, compared directly against the voiceprints in the voiceprint template library.
In a specific embodiment, the model trained on the extracted voiceprint features in the model training module can be any prior-art neural network model capable of achieving the technical effect.
In one embodiment, the human voice collecting module comprises: a sound collection unit and a volume adaptive unit;
the sound collection unit is connected with the input end of the volume self-adaptive unit and is used for collecting user sound for human voice recognition;
and the volume adaptive unit is used for adaptively processing the volume of the user's voice, normalizing it overall to the same maximum value for recognition and model training.
In one embodiment, the adaptive volume processing distinguishes voiced regions from silent regions. For voiced regions:
1) find the maximum value of the current sound data;
2) compute a coefficient relative to a constant, with reference to that maximum volume;
3) multiply the current sound data by the coefficient to obtain volume-adapted data.
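These three steps amount to peak normalization. A minimal sketch in Python (the patent specifies no implementation language, and the target-peak constant is an assumption):

```python
import numpy as np

def adapt_volume(samples, target_peak=0.9):
    """Scale sound data so its peak matches a target maximum.

    Mirrors the voiced-region steps: 1) find the maximum absolute value
    of the current sound data, 2) compute a coefficient relative to a
    constant (here the assumed target peak), 3) multiply the data by it.
    """
    samples = np.asarray(samples, dtype=np.float64)
    peak = np.max(np.abs(samples))
    if peak == 0.0:  # pure silence: nothing to scale
        return samples
    coeff = target_peak / peak
    return samples * coeff
```

Applying this to every utterance gives all users' recordings the same maximum value, which is the balancing of signal intensity described above.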
Specifically, the adaptive processing of silent regions adopts a silence self-truncation algorithm on the input sound features, which comprises the following steps:
1) average the absolute values over a window of 1600 sample values (about 0.1 s of audio);
2) judge whether the current data is a silent region by comparing this average against a threshold;
3) truncate the data judged to be a silent region.
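A minimal sketch of this silence self-truncation, assuming a 16 kHz sampling rate (so 1600 samples is about 0.1 s) and an illustrative threshold value, which the patent does not specify:

```python
import numpy as np

def truncate_silence(samples, window=1600, threshold=0.01):
    """Drop windows judged to be silence.

    For each 1600-sample window, average the absolute sample values and
    compare against a threshold; windows below it are truncated (cut).
    The threshold of 0.01 is an assumed value for illustration.
    """
    samples = np.asarray(samples, dtype=np.float64)
    kept = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if np.mean(np.abs(chunk)) >= threshold:  # voiced: keep
            kept.append(chunk)
    return np.concatenate(kept) if kept else samples[:0]
```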
In one particular embodiment, the preprocessing module includes: a noise reduction unit and a signal enhancement unit;
the noise reduction unit is used for performing noise reduction processing on the voiceprint information to obtain denoised voiceprint information; at least one of a spectral subtraction method, a learning-based recognition method, or a denoising autoencoder is used for noise suppression;
the signal enhancement unit is connected with the input end of the noise reduction unit and is used for enhancing the voiceprint information from the human voice acquisition module.
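Of the noise-suppression options listed, spectral subtraction is the simplest to illustrate. The sketch below is an assumed minimal implementation, not the patent's method: the frame length and the use of a separate noise clip for the noise estimate are illustrative choices.

```python
import numpy as np

def spectral_subtract(signal, noise_estimate, frame=256):
    """Frame-wise spectral subtraction sketch.

    Estimate the noise magnitude spectrum from a noise-only clip, subtract
    it from each frame's magnitude spectrum (flooring at zero), and
    resynthesize each frame with the original phase.
    """
    signal = np.asarray(signal, dtype=np.float64)
    noise_mag = np.abs(np.fft.rfft(np.asarray(noise_estimate, dtype=np.float64)[:frame]))
    out = signal.copy()
    for start in range(0, len(signal) - frame + 1, frame):
        spec = np.fft.rfft(signal[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract, floor at 0
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```

A denoising autoencoder would replace the subtraction step with a learned mapping from noisy to clean spectra; the overall frame-in, frame-out structure stays the same.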
In a specific embodiment, the voiceprint feature extraction module comprises: a voiceprint feature extraction unit and a spectrogram conversion unit; the voiceprint feature extraction unit is used for extracting the voiceprint features of the user's voice through the trained neural network algorithm model; and the spectrogram conversion unit is connected with the output end of the voiceprint feature extraction unit and is used for converting the obtained voiceprint features into a spectrogram.
In a specific embodiment, a convolutional neural network combined with an attention mechanism performs feature extraction on the user's voice to obtain a frame-level feature vector sequence; the frame-level feature vector sequence is down-sampled, in combination with the attention mechanism, into an intermediate feature vector of a preset dimension; and a fully connected operation on the intermediate feature vector yields a sentence-level voiceprint feature vector.
In a specific embodiment, the extracting features of the input target speech data by using a convolutional neural network in combination with an attention mechanism to obtain a frame-level feature vector sequence includes:
sequentially performing at least one convolution and rectification operation on the target voice data to obtain a first feature vector sequence; calculating a channel attention vector from the first feature vector sequence, and weighting the first feature vector sequence with the channel attention vector to obtain a second feature vector sequence; calculating a time attention vector from the second feature vector sequence, and weighting the second feature vector sequence with the time attention vector to obtain a third feature vector sequence; and rectifying the third feature vector sequence to obtain the frame-level feature vector sequence.
In one embodiment, calculating the channel attention vector from the first feature vector sequence and weighting the first feature vector sequence with it to obtain the second feature vector sequence includes: aggregating the time information of each channel of the first feature vector sequence using an average pooling operation and a maximum pooling operation, respectively; feeding the results of the average pooling operation and the maximum pooling operation into a multilayer perceptron; computing the channel attention vector from the perceptron's outputs using a Sigmoid function; and multiplying the first feature vector sequence by the channel attention vector element-wise to obtain the second feature vector sequence.
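These steps can be sketched as follows. The MLP weight shapes `w1` and `w2` are assumptions, since the patent gives no layer sizes, and summing the two pooled branches before the Sigmoid follows the common CBAM-style formulation rather than anything the patent states:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel attention over a (channels, time) feature map.

    Average pooling and max pooling aggregate time information per channel;
    both pooled vectors pass through a shared two-layer perceptron; a
    Sigmoid yields the per-channel attention vector, which reweights x
    element-wise.
    """
    avg = x.mean(axis=1)                               # (channels,) time average
    mx = x.max(axis=1)                                 # (channels,) time maximum
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)       # shared MLP, ReLU hidden layer
    attn = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # Sigmoid -> (channels,)
    return x * attn[:, None]                           # broadcast over time
```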
In a specific embodiment, calculating the time attention vector from the second feature vector sequence and weighting the second feature vector sequence with it to obtain the third feature vector sequence includes: aggregating the channel information at each time point of the second feature vector sequence using an average pooling operation and a maximum pooling operation, respectively; merging the results of the average pooling operation and the maximum pooling operation into a multidimensional vector; convolving the multidimensional vector with a preset convolution kernel to obtain the time attention vector; and multiplying the second feature vector sequence by the time attention vector element-wise to obtain the third feature vector sequence.
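A corresponding sketch for the temporal branch. The two-element `kernel` here is an assumed stand-in for the patent's unspecified preset convolution kernel (it acts as a 1x1 convolution over the two pooled rows):

```python
import numpy as np

def temporal_attention(x, kernel):
    """Temporal attention over a (channels, time) feature map.

    Average pooling and max pooling aggregate channel information at each
    time point; the two pooled rows are merged by the kernel; a Sigmoid
    gives the per-time attention, which reweights x element-wise.
    """
    avg = x.mean(axis=0)                       # (time,) channel average
    mx = x.max(axis=0)                         # (time,) channel maximum
    merged = kernel[0] * avg + kernel[1] * mx  # merge the two pooled rows
    attn = 1.0 / (1.0 + np.exp(-merged))       # Sigmoid -> (time,)
    return x * attn[None, :]                   # broadcast over channels
```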
In a specific embodiment, after the time attention vector is calculated from the second feature vector sequence and used to weight it into the third feature vector sequence, the method further includes: adding a residual connection on top of the third feature vector sequence; the rectification operation then applies to the sum of the third feature vector sequence and the residual to obtain the frame-level feature vector sequence.
In a specific embodiment, the neural network algorithm model comprises an input layer, an SVM layer, a convolution layer, a pooling layer and a fully connected layer. The input layer receives spectrum information obtained by applying a Laplace transform to the voiceprint information output by the preprocessing module; the spectrum information that the SVM layer feeds to the fully connected layer is the feature vector obtained by the voiceprint feature extraction module. The convolution layer uses a 5 × 5 convolution kernel and 8 filters; the pooling layer uses a 3 × 3 pooling window and 16 channels; the fully connected layer uses 16 filters and a 3 × 3 convolution kernel, and its input comes from the output of the pooling layer.
the pooling method of the pooling layer is as follows:
X_e = f(u_e + φ(u_e))
u_e = w_e · x_{e-1} + b_e + δ
wherein X_e represents the output of the current layer, u_e represents the input of the activation function, f(·) represents the activation function, w_e represents the weight of the current layer, φ represents the loss function, x_{e-1} represents the output of the previous layer, b_e represents the offset, and δ represents a constant.
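A scalar sketch of these pooling relations: X_e = f(u_e + φ(u_e)), with the pre-activation form u_e = w_e · x_{e-1} + b_e + δ inferred from the symbol definitions. The activation f and the function φ are left as assumed callables, since the patent names neither.

```python
def layer_forward(x_prev, w, b, delta, f, phi):
    """Scalar sketch of the pooling-layer equations:
        u_e = w_e * x_{e-1} + b_e + delta
        X_e = f(u_e + phi(u_e))
    f (activation) and phi are assumed callables supplied by the caller.
    """
    u = w * x_prev + b + delta
    return f(u + phi(u))
```

For example, with a ReLU activation and φ ≡ 0, the layer reduces to the familiar affine-plus-activation form.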
In a specific embodiment, the method further comprises the following steps: a feedback voice module and a voice output module;
the voice feedback module is connected with the input/output end of the voiceprint recognition module, acquires the recognition result of the voiceprint recognition module and outputs a corresponding voice feedback signal to the voiceprint recognition module;
and the sound output module is connected with the third input end of the voiceprint recognition module and is used for receiving and outputting the voice feedback signal, for example: recognition successful, name Zhang San, student number 001, examination subject mathematics.
In one embodiment, the user is a student taking an examination, and human voice recognition is performed to confirm the student's identity, preventing impersonation in the examination.
In a particular embodiment, the sound collection unit comprises a microphone.
In a particular embodiment, the sound output module comprises a speaker.
In one particular embodiment, the noise reduction unit includes an XFM10412 chip.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A human voice recognition system, comprising: a human voice acquisition module, a preprocessing module, a voiceprint feature extraction module, a function switching module, a voiceprint recognition module and a model training module;
the human voice acquisition module is connected with the input end of the preprocessing module and is used for collecting the human voice and obtaining voiceprint information;
the preprocessing module is connected with the input end of the voiceprint feature extraction module and is used for performing noise reduction processing on the voiceprint information;
the voiceprint feature extraction module is connected with the input end of the function switching module and is used for extracting voiceprint features;
the function switching module is used for selecting between the voiceprint recognition function and the model training function;
the model training module is connected with the first output end of the function switching module and is used for performing model training on the voiceprint features to obtain a voiceprint template;
a voiceprint template library is connected with the output end of the model training module and is used for acquiring and storing the voiceprint template;
and the input end of the voiceprint recognition module is connected with the second output end of the function switching module, and the first input/output end of the voiceprint recognition module is connected with the input/output end of the voiceprint template library, for recognizing the identity of the user according to the voiceprint template.
2. A human voice recognition system according to claim 1, wherein:
the human voice acquisition module comprises: a sound collection unit and a volume adaptive unit;
the sound collection unit is connected with the input end of the volume self-adaptive unit and is used for collecting user sound for human voice recognition;
and the volume adaptive unit is used for adaptively processing the volume of the user's voice, normalizing it overall to the same maximum value for recognition and model training.
3. A human voice recognition system according to claim 1, wherein:
the preprocessing module comprises: a noise reduction unit and a signal enhancement unit;
the noise reduction unit is used for performing noise reduction processing on the voiceprint information to obtain denoised voiceprint information; at least one of a spectral subtraction method, a learning-based recognition method, or a denoising autoencoder is used for noise suppression;
the signal enhancement unit is connected with the input end of the noise reduction unit and used for enhancing the voiceprint information of the human voice acquisition module.
4. A human voice recognition system according to claim 1, wherein:
the voiceprint feature extraction module comprises: the voice print feature extraction unit and the voice spectrum chip conversion unit;
the voiceprint feature extraction unit is used for extracting the voiceprint features of the voice of the user through a trained neural network algorithm model;
and the voice spectrum picture conversion unit is connected with the output end of the voiceprint feature extraction unit and is used for converting the obtained voiceprint features into a voice spectrum.
5. A human voice recognition system according to any one of claims 1 to 4, wherein:
further comprising: a feedback voice module and a voice output module;
the feedback voice module is connected with the input/output end of the voiceprint recognition module, acquires the recognition result of the voiceprint recognition module and outputs a corresponding voice feedback signal to the voiceprint recognition module;
and the sound output module is connected with the third input end of the voiceprint recognition module and used for receiving and outputting the voice feedback signal.
CN202110367218.0A 2021-04-06 2021-04-06 Human voice recognition system Withdrawn CN113077794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110367218.0A CN113077794A (en) 2021-04-06 2021-04-06 Human voice recognition system


Publications (1)

Publication Number Publication Date
CN113077794A true CN113077794A (en) 2021-07-06

Family

ID=76615844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110367218.0A Withdrawn CN113077794A (en) 2021-04-06 2021-04-06 Human voice recognition system

Country Status (1)

Country Link
CN (1) CN113077794A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113948109A (en) * 2021-10-14 2022-01-18 广州蓝仕威克软件开发有限公司 System for recognizing physiological phenomenon based on voice


Similar Documents

Publication Publication Date Title
Tiwari MFCC and its applications in speaker recognition
US20130297299A1 (en) Sparse Auditory Reproducing Kernel (SPARK) Features for Noise-Robust Speech and Speaker Recognition
Rajisha et al. Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM
KR101785500B1 (en) A monophthong recognition method based on facial surface EMG signals by optimizing muscle mixing
Murugappan et al. DWT and MFCC based human emotional speech classification using LDA
US20230298616A1 (en) System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input with Haptic Output
Usman et al. Heart rate detection and classification from speech spectral features using machine learning
Kanabur et al. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition
Grewal et al. Isolated word recognition system for English language
CN113077794A (en) Human voice recognition system
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Karthikeyan et al. Hybrid machine learning classification scheme for speaker identification
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Tripathi et al. CNN based Parkinson's Disease Assessment using Empirical Mode Decomposition.
Hemmerling et al. Parkinson’s disease classification based on vowel sound
Sengupta et al. Optimization of cepstral features for robust lung sound classification
CN115050353A (en) Human voice recognition system
Abushariah et al. Voice based automatic person identification system using vector quantization
Nazifa et al. Gender prediction by speech analysis
CN114881668A (en) Multi-mode-based deception detection method
Daqrouq et al. Arabic vowels recognition based on wavelet average framing linear prediction coding and neural network
Sas et al. Gender recognition using neural networks and ASR techniques
Arpitha et al. Diagnosis of disordered speech using automatic speech recognition
CN111508503B (en) Method and device for identifying same speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210706)