CN115641839A

CN115641839A - Intelligent voice recognition method and system

Info

Publication number: CN115641839A
Application number: CN202211093905.9A
Authority: CN
Inventors: 魏亚峰; 魏子晨
Original assignee: Xuzhou Taiyu Network Technology Co ltd
Current assignee: Xuzhou Taiyu Network Technology Co ltd
Priority date: 2022-09-08
Filing date: 2022-09-08
Publication date: 2023-01-24

Abstract

The invention discloses an intelligent voice recognition method and system, wherein the method respectively obtains voice contents of a teacher and students by recognizing and classifying voices in a classroom, judges the correlation degree of the content of the teacher on class and courses, calculates the attention concentration degree of the students based on the voice contents of the students, calculates the understanding degree of the courses of the students based on the voice related to the classroom, and calculates the correlation between the content of the teacher on class and the attention concentration degree of the students; and the teacher is prompted according to the correlation and the class understanding degree of the students, so that the quantitative evaluation of the teaching quality of the teacher in the class and the attention concentration degree of the students is realized, the real-time feedback of the teaching of the teacher is realized, and the teaching quality of the class is improved.

Description

Intelligent voice recognition method and system

Technical Field

The invention relates to the field of voice recognition, in particular to an intelligent voice recognition method and system.

Background

At present, artificial intelligence has been incorporated into the aspects of our lives. Speech recognition is also used extensively in life. The rapid development and application of artificial intelligence improve the application effect of the voice recognition technology, and the requirements of people in various aspects are met by extracting character data from voice signals.

In the traditional teaching quality assessment, the quality assessment of the class of a teacher is usually obtained in an assessment or questionnaire mode, but the class quality of the teacher and the reaction condition of students cannot be supervised in real time in class. Moreover, such an evaluation result has subjective influence, and objective evaluation cannot be achieved. Meanwhile, the performance of students in class and the influence factors can not be quantitatively analyzed.

Disclosure of Invention

Technical problem to be solved

In order to solve the technical problems, the invention provides an intelligent voice recognition method and an intelligent voice recognition system, wherein the method respectively obtains voice contents of a teacher and students by recognizing and classifying voices in a classroom, judges the correlation degree of the content of the teacher on the class and courses, calculates the attention concentration degree of the students based on the voice contents of the students, calculates the understanding degree of the courses of the students based on the voices related to the classroom, and calculates the correlation between the content of the teacher on the class and the attention concentration degree of the students; and prompting the teacher according to the correlation and the understanding degree of the student courses.

(II) technical scheme

In order to solve the technical problems and achieve the purpose of the invention, the invention is realized by the following technical scheme:

an intelligent speech recognition method, comprising the steps of:

s1: acquiring classroom voice signals through a plurality of microphones, specifically, arranging a plurality of microphones on the front, rear, left and right walls and the ceiling of a classroom, and acquiring the classroom voice signals;

s2: performing signal preprocessing on a voice signal, wherein the voice signal preprocessing comprises the processes of pre-emphasis, endpoint detection, denoising, framing and windowing;

s3: distinguishing the ages of speakers in the voice signals by identifying the ages of the voices, judging roles according to the ages, and determining the identities of the speakers as teachers or students;

s4: respectively recognizing the voice contents of teacher and student, and converting the voice information into text information

S5: judging the correlation degree of the lesson taking content of the teacher and the lessons;

s6: classifying voice contents of students, dividing the voice contents into class-irrelevant voices and class-relevant voices, calculating to obtain the attention concentration degree of the students based on the class-irrelevant voices, and calculating to obtain the course understanding degree of the students based on the class-relevant voices;

s7: calculating the correlation between the content of the teacher in class and the attention concentration degree of the students; and prompting the teacher according to the correlation and the understanding degree of the student course.

Further, the endpoint detection adopts an LTSD algorithm and is classified based on the signal-to-noise ratio of the voice signals.

Further, the endpoint detection includes threshold discrimination, and based on the adaptive threshold value, the calculation method is as follows:

wherein E (k) is a noise estimation value, σ (k) is a noise mean square error estimation value, and f (SNR) is a signal-to-noise ratio correlation function.

And the signal-to-noise ratio correlation function is obtained by fitting according to historical data.

Further, the step S3 includes:

s31: and generating a spectrogram, wherein the spectrogram generating process comprises power spectrum calculation and obtaining Mel spectrogram output through a Mel filter bank.

S32: extracting characteristics, wherein the obtained MFCC characteristic parameters of the voice signals are used as characteristic parameters of voice recognition;

s33: and (3) establishing and training a model, training the extracted features based on an LSTM neural network model, and using the trained model for speech age recognition after training. And determining the speaker as a teacher or a student through the recognition result.

Further, the mel filter bank tests the background noise in the classroom environment, extracts the fundamental frequency fn of the background noise, and adds the fundamental frequency fn into the triangular filter bank.

Further, the background noise can be classified into classroom external noise and classroom internal noise, and the classroom internal noise includes sound of the rotation of the fan, sound of a computer or other machines, and the like, so that two fundamental frequency signals inside and outside the classroom can be respectively extracted, and the two fundamental frequency signals are integrated, and the integration mode is as follows:

f _add ＝af ₁ +bf ₂

wherein a and b are weight coefficients, f ₁ 、f ₂ Respectively, the fundamental frequencies of the noise inside and outside the classroom.

Further, the step S4 further includes correcting the character recognition result by recognizing the voice content. Further judging whether or not the speaker 'S character can be specified in the sentence relating to the own character in the recognized speech content, and if so, comparing the recognized speaker' S character with the character in step S3, and if not, modifying the original judgment result.

Further, the step S7 further includes:

the correlation of the content of the teacher's lesson and the attention concentration of the students is expressed as follows:

wherein H is the concentration of the subject of the student's voice irrelevant to the class, c ₁ To adjust the coefficients.

The invention also provides an intelligent voice recognition system, which specifically comprises:

the microphone array is used for acquiring classroom voice signals through a plurality of microphones, and optionally, a plurality of microphones are arranged on the front, rear, left and right walls and the ceiling of a classroom to acquire the classroom voice signals.

And the voice signal preprocessing module is used for preprocessing the collected voice signals, and the voice signal preprocessing comprises the processes of pre-emphasis, endpoint detection, denoising, framing and windowing.

And the voice age identification module is used for distinguishing the age of the speaker in the voice signal by identifying the voice age, judging the role according to the age and determining the identity of the speaker as a teacher or a student.

The voice recognition module is used for respectively recognizing the voice contents of the teacher and the students and converting the voice information into text information through recognition;

and the correlation degree calculation module is used for acquiring the electronic teaching plan of the teaching contents, extracting keywords from the electronic teaching plan, comparing the identified teaching contents of the teacher with the keywords in the teaching plan and acquiring the correlation degree between the teaching contents and the keywords in the teaching plan.

The student learning state acquisition module is used for classifying the voice content of the student, dividing the voice content into class-independent voice and class-related voice, calculating the attention concentration degree of the student based on the class-independent voice, and calculating the course understanding degree of the student based on the class-related voice.

The prompting module is used for calculating the correlation between the content of the lessons taken by the teacher and the attention concentration degree of the students; and prompting the teacher according to the correlation and the understanding degree of the student course.

(III) advantageous effects

The invention has the beneficial effects that:

(1) The intelligent voice recognition based quantitative assessment of the teaching quality of teachers and the attention concentration degree of students in a classroom is realized.

(2) Calculating the correlation between the content of the teacher in class and the attention concentration degree of the students through voice recognition; and the teacher is prompted according to the correlation and the class understanding degree of the students, so that real-time feedback of teaching of the teacher is realized, and the class teaching quality is improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic flow diagram of an intelligent speech recognition method according to an embodiment of the present application;

fig. 2 is a schematic diagram of a speech signal preprocessing process according to an embodiment of the present application.

Detailed Description

The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in the specification. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. The disclosure may be embodied or carried out in various other specific embodiments, and various modifications and changes may be made in the details within the description without departing from the spirit of the disclosure. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It should be further noted that the drawings provided in the following embodiments are only schematic illustrations of the basic concepts of the present disclosure, and the drawings only show the components related to the present disclosure rather than the numbers, shapes and dimensions of the components in actual implementation, and the types, the numbers and the proportions of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.

Referring to fig. 1, an intelligent speech recognition method includes:

s1: obtaining classroom speech signals through multiple microphones

A plurality of microphones are arranged on the front, rear, left and right walls and the ceiling of a classroom to collect voice signals in the classroom.

S2: signal preprocessing for speech signals

The voice signal is data collected by a microphone, and interference waves such as noise are introduced in the data collection process, so that the voice signal needs to be preprocessed in order to highlight acoustic features in the voice signal and reduce the influence of the noise on a recognition result.

As shown in fig. 2, the speech signal preprocessing includes pre-emphasis, endpoint detection, de-noising, framing, and windowing processes.

S21: pre-emphasis

In order to eliminate the influence of vocal cords and lips on the speech recognition definition in the process of phonation, increase the resolution of high frequency and highlight the high-frequency formant, the pre-emphasis operation needs to be performed on the speech signal, and the process is realized by adopting a high-pass filter and is expressed as follows:

wherein ρ ∈ [0.9,1].

S22: endpoint detection

Speech signals have discontinuities and therefore it is important to recognize speech segments and non-speech segments of a speech signal for subsequent speech recognition. The method performs end point detection based on the LTSD algorithm, calculates the signal-to-noise ratio in each window in a sliding window mode, and judges whether the signal-to-noise ratio meets a threshold value or not by using a self-adaptive threshold value method.

In the prior art, the traditional LTSD algorithm is adopted to calculate and process the signal energy of the full frequency band, the signal adaptability is poor, and the noise characteristic cannot be accurately identified. The invention analyzes the noise characteristics of classroom scenes to obtain, in the collected voice signals, the noise energy is mostly concentrated in a low frequency band, and the unvoiced sound which is easy to be identified by mistake is in a high frequency band, so that the noise interference has different influences in different frequency bands, on the basis, the invention improves the traditional LTSD algorithm, classifies based on the signal-to-noise ratio of the voice signals, and realizes a better detection result, and the method comprises the following steps:

(1) And calculating the long-time spectrum envelope in the following way:

wherein, X (k, l) is the amplitude spectrum of the first frame speech of the speech signal at the frequency k. The LTSE utilizes a long-time principle to add a long-time window for analysis on the basis of a short-time amplitude spectrum. The LTSD of order N is calculated using the following equation.

Where N (k) is the amplitude spectrum of the background noise at frequency k, k =0,1.. NFFT-1.

(2) Noise estimation

When calculating the LTSD features and updating the discrimination thresholds, the noise magnitude spectrum needs to be estimated. And setting data of a few frames before the beginning of a section of voice as non-voice, initializing the noise by the section of data, and then updating the noise according to statistical information. When the number of frames continuously determined as non-speech reaches p frames, updating of noise information is started.

(3) Threshold discrimination

On the basis of the traditional threshold discrimination, the invention improves the selection of the threshold value, sets the self-adaptive threshold value, and has the following calculation mode:

S23: framing and windowing

The invention adopts the Hamming window as the window function of the framing operation, and because the Hamming window has the characteristics of wider main lobe width and less side lobe quantity, the Hamming window is taken as the window function of the framing operation, so that the frequency leakage can be reduced.

S3: the age of the speaker in the voice signal is distinguished by identifying the age of the voice, and the identity of the speaker is determined to be a teacher or a student according to the role discrimination of the age.

The classroom application scene of the invention is applied to middle and primary school classrooms, and the ages of the students and teachers are greatly different, so that the speaker in the voice signal can be distinguished as a teacher or a student by identifying the ages of the voices.

The invention realizes the identification of the sound age based on a deep learning algorithm, and specifically comprises the following steps:

s31: generating a spectrogram

The process of generating the spectrogram comprises the steps of calculating a power spectrum, and obtaining output of the Mel spectrogram through a Mel filter bank.

(1) Power spectrum calculation

Performing fast Fourier transform on the voice signal, and calculating to obtain a power spectrum of a spectrogram:

(2) Passing through a mel filter bank

The mel filter bank of the present invention is based on a triangular filter bank, wherein the frequency response of each triangular filter is represented as:

where f (m) is the center frequency of the mth triangular filter.

Because the noise interference suffered by the classrooms at different positions is different, for example, the background noise interference is different between the classroom close to the road and the classroom at the position where the campus is quieter, in order to increase the filtering accuracy, the invention tests the noise interference in the classroom environmentBackground noise and extracting fundamental frequency f of the background noise _n And adding the triangular filter group into a traditional triangular filter group to form a new triangular filter group.

f _add ＝af ₁ +bf ₂

And the integrated fundamental frequency is adopted to obtain a Mel filter bank, so that the identification precision is higher.

(3) Logarithmic discrete cosine transform (DCT cepstrum)

The DCT cepstrum is calculated by the following formula:

where N is the frame length of the speech signal x (N) and C (k) is the orthogonality factor.

S32: feature extraction

The age of the speaker is a variation factor that affects the acoustic characteristics of human speech, and therefore, features that vary significantly in the age dimension are extracted as the feature parameters for recognition.

The feature parameters of voice recognition, which are the MFCC feature parameters of the obtained voice signal:

wherein S (i, m) is Mel energy; m is the mth Mel filter of M filters, i is the ith frame after the speech signal is framed, and n is the number of spectral lines after DCT cepstrum.

S33: model building and training

And training the extracted features based on an LSTM neural network model, and using the trained model for speech age recognition after training. And determining the speaker as a teacher or a student through the recognition result.

S4: respectively identifying the voice contents of a teacher and students;

and converting the voice information into text information through recognition, and optionally, performing voice recognition based on a convolutional neural network.

Further, the character recognition result can be corrected by recognizing the voice content. Further judging whether the speaker 'S role can be specified or not for the sentence related to the self-role in the recognized speech content, if so, comparing the recognized speaker' S role with the role in step S3, and if not, modifying the original judgment result.

obtaining an electronic teaching plan of the content in class, and extracting a keyword K = { K = from the electronic teaching plan ₁ ,k ₂ ,k ₃ ,…,k _n }；

Comparing the identified lesson content of the teacher with the keywords in the teaching plan to obtain the correlation degree between the lesson content and the keywords, wherein the specific calculation mode is as follows:

carrying out sentence division marking on the lesson contents of the teacher, adding 1 to the mark p when the sentence contains the keyword k, and counting the relation between the number m of all sentences and the mark p to obtain the correlation degree between the lesson contents of the teacher and the lessons:

s61: classifying the voice contents of students, and classifying the voice contents into class-independent voices and class-related voices;

(1) Feature extraction

The invention is based on TF-IDF algorithm, namely word frequency-inverse text frequency algorithm to extract the characteristics of the student voice content. The calculation method is as follows:

wherein N is the number of times of occurrence of a specific keyword in the text, N is the total number of terms in the speech, W is the total number of documents in the corpus, and W is the number of documents containing the specific keyword. The larger the value of D, the greater the importance of the word to the text.

The algorithm ranks the weights of the extracted keywords, and the higher the weight is, the more important the ranking is to analyze the whole voice.

(2) Classification model building

Because the classification of the output layer is a problem of two classifications, and the vocabulary related to classroom learning has obvious recognition characteristics, and simultaneously, the characteristic of high effectiveness requirement of speech recognition is considered, the classification is realized by adopting an SVM algorithm. The SVM is a linear classifier, and the SVM algorithm has the advantages of high precision and high operation speed for the problem of two-classification, so that the method is used for classifying the voice contents of the students on the basis of the SVM.

The classification model returns two results, a prediction label and a confidence. The confidence is calculated as follows:

wherein k is the number of all the supported discriminants, n is the number of all the categories, s _i Scores for all supported discriminant classes.

The basic probability distribution function of the model is as follows:

wherein m is ₁ 、m ₂ 、m ₀ Basic probability distribution functions, w, of positive, negative and complete sets, respectively _m (k _i ) Is the weight of the mth base classifier, k _i Is the true class of the ith sample data, P _i Is the posterior probability of the ith sample data.

In order to improve the convergence rate and the classification accuracy of the classification algorithm, a kernel function is improved, and the improved kernel function is as follows:

wherein beta is a weight coefficient, sigma is the kernel radius of the RBF kernel function, tanh is a double tangent function, v is the coefficient of the multilayer perceptron kernel function, and g is an offset.

The improved kernel function has the advantages of a plurality of kernel functions, and the classification accuracy is improved.

(3) Result output

And obtaining the number of the speeches irrelevant to the classroom and the related speeches in the student speech content and the time length.

S62: calculating to obtain the attention concentration degree Col of the student based on the voice irrelevant to the classroom;

wherein a and b are adjustment coefficients, T ₁ For the duration of the classroom-independent speech time within the sampling period, T ₀ Is a sampling period, and n is the time length of the voice irrelevant to the classroom in the sampling period.

S63: calculating to obtain the student course understanding degree based on the classroom related voice;

the student can obtain the understanding degree of the student on the course in the classroom communication, the invention judges whether the voice relates to the question of the content taught by the teacher based on the classroom communication voice of the student, and obtains the understanding degree of the student course according to the number and the density of the voice of the question of the student.

Further, the speech density of the question may be the sum of the durations of the relevant speech in a unit time.

S7: calculating the correlation between the content of the teacher in class and the attention concentration degree of the students; and prompting the teacher according to the correlation and the understanding degree of the student courses.

The change factor of the attention concentration of the students comprises the content of lessons given by a teacher or the influence of external factors. When the concentration of the subjects of the voice irrelevant to the class of the students is low, the situation that the attention of the students is dispersed due to the fact that the content of the teachers in class is relatively boring and the like is confirmed, and then the students can talk irrelevant to the content of class learning in class time; when the concentration of the subjects of the voices irrelevant to the class is high, it is confirmed that the attention of the students is distracted due to the influence of some external factors. The correlation of the teacher's content in class and the attention concentration of the students is expressed as follows:

wherein H is the concentration of the subject of the voice of the students irrelevant to the classroom, c ₁ To adjust the coefficients.

In the embodiment, the voice in the class is recognized and classified, the voice contents of a teacher and the voices of students are respectively obtained, the correlation degree between the teaching content of the teacher and the class and the attention concentration degree of the students are judged, the understanding degree of the student classes is obtained based on the voice related to the class, and the correlation between the teaching content of the teacher and the attention concentration degree of the students is calculated; and prompting the teacher according to the correlation and the understanding degree of the student course. The real-time feedback of the classroom learning condition is realized, and the teaching scientificity is improved.

The embodiment of the present invention further provides an intelligent speech recognition system, which specifically includes:

And the voice age identification module is used for distinguishing the age of the speaker in the voice signal by identifying the voice age, carrying out role discrimination according to the age and determining the identity of the speaker as a teacher or a student.

and the correlation degree calculation module of the lesson contents and the lessons is used for acquiring the electronic teaching plan of the lesson contents, extracting keywords from the electronic teaching plan, comparing the recognized lesson contents of the teacher with the keywords in the electronic teaching plan and acquiring the correlation degree between the lesson contents and the keywords in the electronic teaching plan.

The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention made by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. An intelligent speech recognition method, comprising the steps of:

s3: distinguishing the ages of speakers in the voice signals by identifying the ages of the voices, and judging roles according to the ages to determine the identities of the speakers as teachers or students;

s4: respectively recognizing the voice contents of teacher and student, and converting the voice information into text information by recognition

2. The intelligent speech recognition method of claim 1, wherein the endpoint detection employs an LTSD algorithm for classification based on a signal-to-noise ratio of the speech signal.

3. The intelligent speech recognition method of claim 2 wherein the endpoint detection comprises a threshold discrimination based on an adaptive threshold value calculated as follows:

wherein E (k) is a noise estimation value, sigma (k) is a noise mean square error estimation value, and f (SNR) is a signal-to-noise ratio correlation function;

4. The intelligent speech recognition method of claim 1, wherein the step S3 comprises:

s31: generating a spectrogram, wherein the spectrogram generating process comprises power spectrum calculation, and obtaining Mel spectrogram output through a Mel filter bank;

s33: and establishing and training a model, training the extracted features based on an LSTM neural network model, using the trained model for voice age recognition after the training is finished, and determining a speaker as a teacher or a student according to a recognition result.

5. The intelligent speech recognition method of claim 4, wherein the Mel filterbank passes the background noise in the classroom environment and extracts the fundamental frequency f of the background noise _n It is added to the triangular filter bank.

6. The intelligent speech recognition method of claim 5, wherein the background noise is divided into external classroom noise and internal classroom noise, and the internal classroom noise includes the sound of a fan, a computer or other machines, and so on, so that two fundamental frequency signals inside and outside the classroom can be extracted separately and integrated as follows:

f _add ＝af ₁ +bf ₂

7. The intelligent voice recognition method according to claim 1, wherein the step S4 further comprises correcting the character recognition result by recognizing voice content; further judging whether or not the speaker 'S character can be specified in the sentence relating to the own character in the recognized speech content, and if so, comparing the recognized speaker' S character with the character in step S3, and if not, modifying the original judgment result.

8. The intelligent speech recognition method of claim 1, wherein the step S5 further comprises:

carrying out sentence division marking on the content of the lessons of the teacher, adding 1 to the mark p when the sentence contains a keyword k, and counting the relation between the number m of all sentences and the mark p as the correlation degree of the content of the lessons of the teacher and the lessons:

9. the intelligent speech recognition method of claim 1, wherein the step S7 further comprises:

10. A system for an intelligent speech recognition method according to any of claims 1-9, wherein the system comprises:

the system comprises a microphone array, a microphone array and a control unit, wherein the microphone array is used for acquiring classroom voice signals through a plurality of microphones, and optionally, a plurality of microphones are arranged on the front, rear, left and right walls and the ceiling of a classroom and used for acquiring the classroom voice signals;

the voice signal preprocessing module is used for preprocessing the collected voice signals, and the voice signal preprocessing comprises the processes of pre-emphasis, endpoint detection, denoising, framing and windowing;

the voice age identification module is used for distinguishing the age of the speaker in the voice signal by identifying the voice age, carrying out role discrimination according to the age and determining the identity of the speaker as a teacher or a student;

the system comprises a lesson content and course correlation degree calculation module, a lesson content and course correlation degree calculation module and a lesson learning module, wherein the lesson content correlation degree calculation module is used for acquiring an electronic teaching plan of the lesson content, extracting keywords from the electronic teaching plan, comparing the identified lesson content of the lesson with the keywords in the teaching plan and acquiring the correlation degree between the lesson content and the keywords in the teaching plan;

the student learning state acquisition module is used for classifying the voice content of the student, dividing the voice content into class-independent voice and class-related voice, calculating the attention concentration degree of the student based on the class-independent voice, and calculating the course understanding degree of the student based on the class-related voice;

the prompting module is used for calculating the correlation between the content of the lessons taken by the teacher and the attention concentration degree of the students; and prompting the teacher according to the correlation and the understanding degree of the student courses.