CN115376499A - Learning monitoring method applied to an intelligent earphone in the learning field - Google Patents
- Publication number: CN115376499A
- Application number: CN202210994128.9A
- Authority: CN (China)
- Prior art keywords: theme, word, audio signal, result, model
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/08 — Speech classification or search
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention relates to the technical field of learning monitoring, and discloses a learning monitoring method applied to an intelligent earphone in the learning field, comprising the following steps: extracting features from the preprocessed intelligent earphone audio signal; constructing an improved end-to-end speech recognition model and inputting the extracted audio signal features into the model; performing word pair conversion on the recognized speech result and calculating the word pair compactness of the converted word pairs; constructing a dynamically weighted speech topic recognition model based on word pair compactness and inputting the word pair compactness of the speech recognition result into the model; and comparing the recognized speech topic with the topic of the current teaching content. The method mainly examines word pairs with higher compactness in the speech recognition content; such word pairs carry a higher topic recognition weight, while the topic selection probabilities of other word pairs are considered dynamically during topic recognition, thereby achieving more accurate speech topic recognition based on global topic recognition.
Description
Technical Field
The invention relates to the technical field of learning monitoring, and in particular to a learning monitoring method applied to an intelligent earphone in the learning field.
Background
With the popularization of online teaching, more and more students use earphones to study in class. However, because a computer can play several audio and video streams at once, a student may appear to be listening to the lesson audio while actually listening to something else, and in an online class the teacher cannot make an effective judgment from the student's on-camera image. CN111383659A provides a distributed voice monitoring method, apparatus, system, storage medium and device: audio stream data belonging to the same machine room is obtained; audio data to be audited is extracted from the stream according to a preset push strategy; the audio data is input into a pre-trained audio recognition model to obtain a predicted value; and a machine audit result is generated from the predicted value. That method achieves a high input-output ratio, high coverage, low delay, a high recognition rate and efficient voice monitoring and auditing, and meets the audio content monitoring requirements of high-activity social voice applications deployed in multi-operator mixed network environments. CN107292271A provides a learning monitoring method, apparatus and electronic device, in which an in-class image of a classroom student is acquired; the image is analyzed to obtain feature data of the student, comprising at least one of facial feature data, visual feature data and limb feature data; and the student's in-class state is determined from that feature data.
Such systems can effectively and accurately monitor the attendance of students who study through a computer and a network, and provide a useful reference for subsequent study and teaching. However, although existing learning monitoring methods can monitor a student's in-class state from feature data, two problems remain: first, existing audio monitoring methods can only detect whether audio violates rules, whereas what must be judged is whether the content played in the earphone matches the topic of the classroom teaching content; second, traditional methods do not monitor the in-class state from auditory characteristics. To solve these problems, this patent provides an intelligent earphone learning monitoring method applied to the learning field.
Disclosure of Invention
In view of this, the invention provides a learning monitoring method for an intelligent earphone applied in the learning field, which aims to (1) use a gated recurrent unit (GRU) to replace the LSTM part of a traditional speech recognition model, since the GRU has fewer parameters and a lower computational cost and can therefore complete speech recognition faster; and (2) dynamically consider the topic selection probabilities of other word pairs during topic recognition, achieving more accurate speech topic recognition based on global topic recognition.
The invention provides a learning monitoring method applied to an intelligent earphone in the learning field, comprising the following steps:
s1: collecting an intelligent earphone audio signal, preprocessing the collected signal, and performing feature extraction on the preprocessed signal to obtain the extracted audio signal features;
s2: constructing an improved end-to-end speech recognition model whose input is the extracted audio signal features and whose output is the speech recognition result, and inputting the extracted features into the model to obtain the speech recognition result;
s3: performing word pair conversion on the recognized speech result to obtain the converted word pairs, and calculating the word pair compactness of the converted word pairs;
s4: constructing a dynamically weighted speech topic recognition model based on word pair compactness, whose input is the speech recognition result with word pair compactness and whose output is the speech topic, and inputting the word pair compactness of the speech recognition result into the model to obtain the speech topic;
s5: performing similarity calculation between the recognized speech topic and the current teaching content topic; if the similarity result is smaller than a specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise the student is attending the lesson.
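The s1 to s5 flow can be sketched as a small pipeline. Every function name below is hypothetical and each body is a stand-in stub, not the patent's implementation; only the control flow follows the steps above.

```python
def preprocess(audio):            # s1: high-pass filter, sampling, windowing (stub)
    return audio

def extract_features(signal):     # s1: FFT, filter bank, cepstral features (stub)
    return signal

def recognize_speech(features):   # s2: improved end-to-end model (stub)
    return "lecture audio about derivatives and integrals"

def topic_of(text):               # s3 + s4: word pairs, compactness, topic model (stub)
    return "calculus"

def similarity(topic_a, topic_b):  # s5: stand-in for cosine similarity of topics
    return 1.0 if topic_a == topic_b else 0.0

def monitor(audio, teaching_topic, threshold=0.5):
    # s5: below-threshold similarity means the earphone topic differs from the
    # teaching topic, i.e. the student is not attending the lesson.
    speech_topic = topic_of(recognize_speech(extract_features(preprocess(audio))))
    return similarity(speech_topic, teaching_topic) >= threshold
```

The threshold value 0.5 is illustrative; the patent leaves the threshold unspecified.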
As a further improvement of the method of the invention:
Optionally, collecting the intelligent earphone audio signal in step S1 and preprocessing the collected signal includes:
collecting the audio signal in the intelligent earphone with the earphone's built-in signal acquisition device, and preprocessing the collected intelligent earphone audio signal, the preprocessing process being as follows:
A high-pass filter is arranged in the signal acquisition device. If the frequency of the collected signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency is lower than the cut-off frequency, the capacitive reactance is large and the signal is attenuated, and the lower the frequency of the collected signal, the stronger the attenuation. The audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] represents the timing information of the collected audio signal, t_0 the initial moment of audio signal acquisition, and t_L the cut-off moment of audio signal acquisition;
The cut-off frequency f_t is calculated as:
f_t = 1 / (2πRC)
wherein:
R represents the resistance of the resistor in the high-pass filter;
C represents the capacitance of the capacitive element in the high-pass filter;
An analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a discrete digital signal; the conversion process is as follows:
build up of frequency f s Impulse sequence p (t) of (1):
The number of sampling points in the converted signal is N, so that the first sampling point is x(1) and the last sampling point is x(N); sampling x(t) with p(t) gives x(n′), n′ ∈ [1, N], the converted discrete signal;
The signal acquisition device performs windowing on the converted signal x(n′); the window function is a Hamming window, and the windowing formula is:
x(n″)=x(n′)×w(n′)
wherein:
w(n′) is the window function, w(n′) = a - (1 - a)cos(2πn′/(N - 1));
a represents the window function coefficient, set to 0.43;
x(n″) is the audio signal after windowing, and x(n″) is taken as the preprocessed intelligent earphone audio signal.
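A minimal sketch of this preprocessing stage, under the stated assumptions (first-order RC high-pass cut-off f_t = 1/(2πRC), generalized Hamming window with coefficient a = 0.43); the function names and the example component values are illustrative:

```python
import math

def highpass_cutoff(R, C):
    # f_t = 1 / (2*pi*R*C) for a first-order RC high-pass filter.
    return 1.0 / (2 * math.pi * R * C)

def hamming_window(N, a=0.43):
    # Generalized Hamming window with the patent's coefficient a = 0.43
    # (the standard Hamming window uses a = 0.54; 0.43 follows the text).
    return [a - (1 - a) * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def preprocess(x, a=0.43):
    # x(n'') = x(n') * w(n'): element-wise windowing of the sampled signal.
    w = hamming_window(len(x), a)
    return [xi * wi for xi, wi in zip(x, w)]

# Example: R = 1 kOhm, C = 100 nF gives a cut-off near 1.59 kHz.
fc = highpass_cutoff(1e3, 100e-9)
x = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(256)]
xw = preprocess(x)
```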
Optionally, performing feature extraction on the preprocessed intelligent earphone audio signal in step S1 to obtain the extracted audio signal features includes:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal feature F(x(n″)), the feature extraction process being as follows:
s11: performing fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency-domain characteristics of x(n″); the fast Fourier transform is:
X(k) = Σ_{n″=1}^{N} x(n″) · e^{-j2πkn″/K}
wherein:
X(k) represents the result of the fast Fourier transform of x(n″) at point k;
K is the number of points of the fast Fourier transform, and j is the imaginary unit, j² = -1;
s12: constructing a filter bank with M filters, the m-th filter in the bank having frequency response f_m(k); a standard triangular (mel) response consistent with the center-frequency definition below is:
f_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); and 0 otherwise
wherein:
f(m) represents the center frequency of the m-th filter;
s13: inputting the fast Fourier transform result into the filter bank, which outputs the frequency-domain energy of the signal:
En(m) = Σ_k |X(k)|² · f_m(k)
wherein:
En(m) represents the frequency-domain energy of the intelligent earphone audio signal under the m-th filter;
s14: extracting the features of the intelligent earphone audio signal, where the feature extraction formula (a discrete cosine transform of the log filter-bank energies) is:
C(l′) = Σ_{m=1}^{M} log(En(m)) · cos(π l′ (m - 0.5) / M)
wherein:
l′ represents the feature order, l′ = 1, 2, ..., 12;
C(l′) represents the extracted l′-th order feature;
the feature that maximizes C(l′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
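A sketch of the s11 to s14 chain under the assumptions above: a plain DFT stands in for the FFT, and the triangular filter and DCT forms are the assumed standard shapes, since the patent text only names them. All parameters (filter centers, K = 64) are illustrative.

```python
import cmath
import math

def dft(x, K):
    # X(k) = sum_n x[n] * exp(-j*2*pi*k*n/K): plain DFT standing in for the FFT.
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / K)
                for n in range(len(x))) for k in range(K)]

def triangular_filter(k, lo, center, hi):
    # Assumed standard triangular (mel-style) response around the center frequency.
    if lo <= k <= center:
        return (k - lo) / (center - lo)
    if center < k <= hi:
        return (hi - k) / (hi - center)
    return 0.0

def cepstral_features(x, centers, K=64, n_feats=12):
    X = dft(x, K)
    M = len(centers) - 2
    # En(m): energy of |X(k)|^2 under the m-th triangular filter.
    En = [sum(abs(X[k]) ** 2
              * triangular_filter(k, centers[m - 1], centers[m], centers[m + 1])
              for k in range(K)) for m in range(1, M + 1)]
    # C(l') = sum_m log(En(m)) * cos(pi*l'*(m-0.5)/M): DCT of log energies.
    return [sum(math.log(En[m - 1] + 1e-12) * math.cos(math.pi * l * (m - 0.5) / M)
                for m in range(1, M + 1)) for l in range(1, n_feats + 1)]

# A pure 8-cycle sine over 64 samples concentrates its energy in DFT bin 8.
x = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
feats = cepstral_features(x, centers=[2, 6, 10, 14, 18])
```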
Optionally, constructing the improved end-to-end speech recognition model in step S2 includes:
constructing an improved end-to-end speech recognition model whose input is the extracted audio signal features and whose output is the speech recognition result. The improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer: the convolution layer receives the audio signal features to be recognized and performs convolution processing on them; the encoding layer encodes the convolved features using gated recurrent units; the decoding layer decodes the encoding result, and the decoding result is the speech recognition result. The process of performing speech recognition on the audio signal features with the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (-) represents a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution features into the encoding layer based on gated recurrent units (GRUs); the activation function of the encoding layer is GeLU, the layer contains 5 gated recurrent units, each set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features;
s23: the decoding layer outputs, via a softmax function, the probability P(y | x) of each decoding result y corresponding to a segment x of the encoding result, and selects the y with the maximum probability as the decoding result of x; the decoding result is the speech recognition result;
The training objective function of the end-to-end speech recognition model (the character error rate of the model output) is:
L = (a_1,x + a_2,x + a_3,x) / N_x
wherein:
N_x represents the number of characters of the real speech text;
a_1,x represents the number of substituted characters in the speech recognition result output by the model, a_2,x the number of deleted characters, and a_3,x the number of inserted characters;
the extracted audio signal features F(x(n″)) are input into the model, yielding the speech recognition result y(x(n″)).
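The objective counts substitutions, deletions and insertions against the reference text. A sketch of how the a_1,x, a_2,x, a_3,x counts can be obtained with a Levenshtein alignment follows; the patent does not specify this algorithm, so it is an assumed standard construction.

```python
def edit_operation_counts(ref, hyp):
    # Levenshtein alignment returning (substitutions, deletions, insertions)
    # relative to the reference string: the a_1, a_2, a_3 counts.
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j].
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (c, s, d, ins)                       # match
            else:
                best = (c + 1, s + 1, d, ins)               # substitution
            cd = dp[i - 1][j]
            if cd[0] + 1 < best[0]:
                best = (cd[0] + 1, cd[1], cd[2] + 1, cd[3])  # deletion
            ci = dp[i][j - 1]
            if ci[0] + 1 < best[0]:
                best = (ci[0] + 1, ci[1], ci[2], ci[3] + 1)  # insertion
            dp[i][j] = best
    _, subs, dels, ins = dp[n][m]
    return subs, dels, ins

def character_error_objective(ref, hyp):
    # L = (a1 + a2 + a3) / N_x, with N_x the reference length.
    s, d, i = edit_operation_counts(ref, hyp)
    return (s + d + i) / max(len(ref), 1)
```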
Optionally, performing word pair conversion on the recognized speech result in step S3 to obtain the converted word pairs includes:
constructing corpora of different topic categories, each containing all words of that topic category, and combining them into a corpus A = {topic_1, topic_2, ..., topic_h}, where topic_h is the corpus of the h-th topic category; all words in corpus A are encoded with the one-hot encoding method in the order of the words in A, yielding the encoding result set B of corpus A;
in the embodiment of the invention, the one-hot encoding results of words of the same topic category are close in distance;
based on the word encoding results of corpus A, word pair conversion is performed on the recognized speech result y(x(n″)); the word pair conversion process is as follows:
s31: taking corpus A as the word segmentation dictionary and calculating the maximum word length z in the dictionary, the first z characters of the speech recognition result are intercepted and matched against the words in the dictionary; if the match succeeds, the match is taken as the segmentation result of the intercepted text, and this step is repeated on the remaining text, giving the segmentation result W = {w_1, w_2, ..., w_n_w} of the speech recognition result, where w_n_w is the n_w-th word and n_w is the total number of words; if the match fails, the first z - 1 characters are intercepted and matched again;
s32: any two segmentation results w_n1, w_n2 in W form a word pair (w_n1, w_n2), where n1 > n2 and n1, n2 ∈ [1, n_w];
s33: obtaining the set of converted word pairs of the speech recognition result G = {(w_n1, w_n2) | n1 > n2; n1, n2 ∈ [1, n_w]}.
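Steps s31 to s33 amount to forward maximum matching followed by pair enumeration. A toy sketch follows; the dictionary and strings are illustrative, and the single-character fallback for unmatched text is an assumption, since the patent leaves that case open.

```python
from itertools import combinations

def max_match_segment(text, dictionary):
    # Forward maximum matching (s31): try the longest dictionary word first,
    # shortening the intercepted prefix on failure. Single characters are
    # kept as a fallback so the scan always advances (an assumption).
    z = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for length in range(min(z, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary or length == 1:
                words.append(chunk)
                i += length
                break
    return words

def word_pairs(words):
    # s32/s33: every pair (w_n1, w_n2) with n1 > n2, later word first.
    return [(w2, w1) for w1, w2 in combinations(words, 2)]

words = max_match_segment("abcde", {"ab", "cde", "c"})
pairs = word_pairs(words)
```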
Optionally, calculating the word pair compactness of the converted word pairs in step S3 includes:
traversing the encoding result set B to select the encoding of each segmentation result in W, obtaining the encoding result set of the speech recognition result, and from it the encoding results of the word pairs of the speech recognition result;
calculating the word pair compactness dis(w_n1, w_n2) of each converted word pair in the speech recognition result:
wherein:
T represents transposition;
the word pair compactness set of the speech recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2; n1, n2 ∈ [1, n_w]}.
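The exact dis(·) formula is not reproduced in this text; only the use of a transpose is stated. As an assumed form consistent with that, a normalized inner product of the word encodings (cosine similarity) is sketched below, with toy encodings in which words of the same topic category get nearby vectors, as the embodiment describes.

```python
import math

def compactness(e1, e2):
    # Assumed form: cosine similarity e1 * e2^T / (|e1| |e2|).
    # The patent's exact dis(.) formula is not given in the text.
    dot = sum(a * b for a, b in zip(e1, e2))
    n1 = math.sqrt(sum(a * a for a in e1))
    n2 = math.sqrt(sum(b * b for b in e2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy encodings: same-topic words ("integral", "derivative") are close.
enc = {
    "integral":   [1.0, 0.9, 0.0],
    "derivative": [0.9, 1.0, 0.0],
    "guitar":     [0.0, 0.1, 1.0],
}
pairs = [("integral", "derivative"), ("integral", "guitar")]
G_d = {p: compactness(enc[p[0]], enc[p[1]]) for p in pairs}
```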
Optionally, constructing the dynamically weighted speech topic recognition model based on word pair compactness in step S4 includes:
constructing a dynamically weighted speech topic recognition model based on word pair compactness, whose input is the speech recognition result with word pair compactness and whose output is the speech topic; the topic recognition process of the model is as follows:
s41: the word pairs in the compactness set G_d whose compactness is higher than a specified threshold form the word pair set G1, and the topics of the word pairs in G1 are initialized, the candidate topic categories being the h topic categories of corpus A;
s42: setting the current iteration number of the model to r, with initial value 0, and the maximum iteration number to Max;
s43: at the (r + 1)-th iteration, calculating for any word pair q_v in G1 the probability that q_v belongs to each topic:
wherein:
α, β represent Dirichlet prior parameters;
the topic of word pair q_v at the (r + 1)-th iteration ranges over the h topic categories of corpus A;
the first count is the number of word pairs in G1, excluding q_v, assigned to the given topic at the r-th iteration;
the next counts are the numbers of times, up to the (r + 1)-th iteration and excluding q_v, that the first word of q_v, respectively the second word of q_v, has been assigned to the given topic;
the last count is the number of times, up to the (r + 1)-th iteration and excluding q_v, that all words have been assigned to the given topic;
s44: judging whether r + 1 is greater than or equal to Max; if so, outputting the probabilities of the topic categories to which the different word pairs in G1 belong at the (r + 1)-th iteration; otherwise setting r = r + 1 and returning to step S43;
inputting the word pair compactness of the speech recognition result into the model, the model outputs the speech topic; the formula for calculating the speech topic of the speech recognition result is:
wherein:
p(topic | G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] ranging over the h topic categories of corpus A; the topic with the highest probability is selected as the speech topic;
the occurrence counts in the speech recognition result of the two words of word pair q_v also enter the formula, and count_G represents the total number of words in the speech recognition result.
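A highly simplified, deterministic sketch of the s41 to s44 iteration follows. The patent's exact Gibbs-style probability formula is not reproduced in this text, so a plain count-plus-prior score stands in for it; all names and the reassignment rule are illustrative only.

```python
def assign_topics(word_pairs, h, max_iter=10, alpha=0.1, beta=0.1):
    # Simplified stand-in for s41-s44: iteratively reassign each word pair to
    # the topic favored by the current pair counts and per-word counts, with
    # Dirichlet-style smoothing (alpha, beta). Not the patent's formula.
    z = {q: i % h for i, q in enumerate(word_pairs)}   # s41: initialize topics
    for _ in range(max_iter):                          # s42-s44: iterate
        changed = False
        for q in word_pairs:
            w1, w2 = q
            scores = []
            for t in range(h):
                # Word pairs (other than q) currently assigned to topic t.
                pair_n = sum(1 for p in word_pairs if p != q and z[p] == t)
                # Occurrences of q's two words inside those pairs.
                word_n = sum(1 for p in word_pairs if p != q and z[p] == t
                             for w in p if w in (w1, w2))
                scores.append((pair_n + alpha) * (word_n + beta))
            best = max(range(h), key=lambda t: scores[t])
            if z[q] != best:
                z[q], changed = best, True
        if not changed:                                # converged early
            break
    return z

pairs = [("integral", "derivative"), ("derivative", "limit"), ("guitar", "chord")]
z = assign_topics(pairs, h=2)
```

With these toy pairs, the two pairs sharing the word "derivative" end up in the same topic.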
Optionally, in step S5, performing similarity calculation between the recognized speech topic and the current teaching content topic, where a similarity result smaller than a specified threshold indicates that the speech topic in the earphone differs from the teaching content topic and the student is not attending the lesson, and otherwise the student is attending, includes:
performing similarity calculation between the recognized speech topic and the current teaching content topic using the cosine similarity method; if the similarity result is smaller than the specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise the student is attending the lesson.
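The S5 decision can be sketched directly, assuming the two topics are represented as vectors; the threshold value 0.5 is illustrative, since the patent leaves the threshold unspecified.

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attending(speech_topic_vec, teaching_topic_vec, threshold=0.5):
    # Below-threshold similarity means the earphone audio topic differs from
    # the teaching topic, i.e. the student is not attending the lesson.
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold
```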
In order to solve the above problems, the present invention further provides a learning monitoring apparatus for an intelligent earphone applied in the learning field, the apparatus comprising:
a signal acquisition device for collecting the intelligent earphone audio signal, preprocessing the collected signal, and performing feature extraction on the preprocessed signal to obtain the extracted audio signal features;
a speech recognition module for constructing an improved end-to-end speech recognition model and inputting the extracted audio signal features into the model to obtain the speech recognition result;
a recognition monitoring device for performing word pair conversion on the recognized speech result, calculating the word pair compactness of the converted word pairs, constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model so that the model outputs the speech topic, and performing similarity calculation between the recognized speech topic and the current teaching content topic; if the similarity result is smaller than a specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson, otherwise the student is attending.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the above learning monitoring method applied to an intelligent earphone in the learning field.
In order to solve the above problem, the present invention further provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the above learning monitoring method applied to an intelligent earphone in the learning field.
Compared with the prior art, the present invention provides a learning monitoring method applied to an intelligent earphone in the learning field, with the following advantages:
First, this scheme proposes a speech recognition model: an improved end-to-end speech recognition model is constructed whose input is the extracted audio signal features and whose output is the speech recognition result. The model comprises a convolution layer, an encoding layer and a decoding layer: the convolution layer receives the audio signal features to be recognized and performs convolution processing on them; the encoding layer encodes the convolved features using gated recurrent units; the decoding layer decodes the encoding result, the decoding result being the speech recognition result. The recognition process is as follows. The convolution layer convolves the input features:
F1=Conv(F)
wherein: F represents the audio signal features input to the end-to-end speech recognition model; Conv(·) represents a convolution operation; F1 represents the convolution features. The convolution features are input into the GRU-based encoding layer, whose activation function is GeLU; the layer contains 5 gated recurrent units, each set to bidirectional recursion, and the output of the last unit is the encoding result of the audio signal features. The decoding layer outputs, via a softmax function, the probability P(y | x) of each decoding result y for a segment x of the encoding result, and selects the y with maximum probability as the decoding result, which is the speech recognition result. The training objective function of the end-to-end speech recognition model is:
L = (a_1,x + a_2,x + a_3,x) / N_x
wherein: N_x represents the number of characters of the real speech text; a_1,x represents the number of substituted characters in the speech recognition result output by the model, a_2,x the number of deleted characters, and a_3,x the number of inserted characters. The extracted audio signal features F(x(n″)) are input into the model, yielding the speech recognition result y(x(n″)). In this improved model, gated recurrent units replace the LSTM part of a traditional speech recognition model; the GRU has fewer parameters and a lower computational cost, so speech recognition can be completed faster.
Meanwhile, this scheme proposes a topic recognition model: a dynamically weighted speech topic recognition model is constructed based on word pair compactness, whose input is the speech recognition result with word pair compactness and whose output is the speech topic. The topic recognition process is as follows. The word pairs in the compactness set G_d whose compactness exceeds a specified threshold form the word pair set G1, and the topics of the word pairs in G1 are initialized, the candidate topics being the h topic categories of corpus A. The current iteration number of the model is r, with initial value 0 and maximum Max. At the (r + 1)-th iteration, the probability that any word pair q_v in G1 belongs to each topic is calculated,
wherein: α, β represent Dirichlet prior parameters, and the remaining quantities are, for each candidate topic, the topic of q_v at the (r + 1)-th iteration; the number of word pairs in G1 (excluding q_v) assigned to that topic at the r-th iteration; the numbers of times, up to the (r + 1)-th iteration and excluding q_v, that the first word and the second word of q_v have been assigned to that topic; and the number of times all words have been assigned to that topic. If r + 1 is greater than or equal to Max, the probabilities of the topic categories of the different word pairs in G1 at the (r + 1)-th iteration are output; otherwise r = r + 1 and the iteration repeats. The word pair compactness of the speech recognition result is input into the model and the model outputs the speech topic,
wherein: p(topic | G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] ranging over the h topic categories of corpus A, and the topic with the highest probability is selected as the speech topic; the occurrence counts in the speech recognition result of the two words of q_v and the total word count count_G also enter the formula. Similarity between the recognized speech topic and the current teaching content topic is then computed with the cosine similarity method; if the result is smaller than a specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson, otherwise the student is attending. A traditional model does not distinguish core word pairs from non-core word pairs and treats all word pairs uniformly, so core words that carry the text's topic information are ignored and topic recognition accuracy suffers; this scheme instead gives word pairs with higher compactness a higher topic recognition weight while dynamically considering the topic selection probabilities of other word pairs, realizing more accurate speech topic recognition based on global topic recognition.
Drawings
Fig. 1 is a schematic flowchart of a learning monitoring means of an intelligent headset applied in the learning field according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of one step of the embodiment of FIG. 1;
fig. 3 is a functional block diagram of a learning monitoring apparatus for an intelligent headset applied in the learning field according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring means applied to an intelligent headset in the learning field according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a learning monitoring means applied to an intelligent headset in the learning field. The execution subject of the learning monitoring means applied to the smart headset in the learning field includes, but is not limited to, at least one of electronic devices such as a server and a terminal that can be configured to execute the method provided by the embodiment of the present application. In other words, the learning monitoring means applied to the smart headset in the learning field may be executed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
s1: the method comprises the steps of collecting intelligent earphone audio signals, preprocessing the collected intelligent earphone audio signals, and extracting features of the preprocessed intelligent earphone audio signals to obtain extracted audio signal features.
The step S1 of collecting the smart headset audio signal and preprocessing the collected smart headset audio signal includes:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device; if the frequency of the acquired signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the acquired signal is lower than the cut-off frequency, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the acquired signal, the higher the degree of attenuation; the audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] represents the timing information of the acquired audio signal, t_0 the initial moment of audio signal acquisition, and t_L the cut-off moment of audio signal acquisition;
said cut-off frequency f_t is calculated by the formula f_t = 1/(2πRC):
wherein:
r represents a resistance value of a resistor in the high-pass filter;
c represents the capacitance of the capacitive element in the high-pass filter;
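Assuming the filter is a standard first-order RC high-pass stage (consistent with the R and C definitions above), the cut-off frequency can be sketched as follows; the component values in the example are illustrative, not taken from the patent:

```python
import math

def highpass_cutoff(R, C):
    """Cut-off frequency f_t = 1 / (2*pi*R*C) of a first-order RC high-pass filter."""
    return 1.0 / (2.0 * math.pi * R * C)

# Illustrative component values (not from the patent): R = 1 kOhm, C = 1 uF
f_t = highpass_cutoff(1_000.0, 1e-6)  # about 159.15 Hz
```

Signals below f_t are increasingly attenuated, matching the behavior described above.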
an analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a digital signal, and the conversion process is as follows:
an impulse train p(t) of frequency f_s is constructed, p(t) = Σ_n δ(t − n/f_s);
the number of sampling points in the digital signal is N, the first sampling point being x(1) and the last sampling point x(N); x(n′) is the converted digital signal;
the signal acquisition device performs windowing on the converted digital signal x(n′), wherein the window function of the windowing is a Hamming window, and the windowing formula is as follows:
x(n″)=x(n′)×w(n′)
wherein:
w(n′) is the window function, w(n′) = a − (1 − a)·cos(2πn′/(N − 1));
a represents the window function coefficient, which is set to 0.43;
and x(n″) is the audio signal after windowing; x(n″) is taken as the preprocessed smart headset audio signal.
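A minimal sketch of the windowing step, assuming the generalized-Hamming form w(n′) = a − (1 − a)·cos(2πn′/(N − 1)) for the window function; the patent fixes only the coefficient a = 0.43, so the exact window formula is an assumption:

```python
import math

def window_function(N, a=0.43):
    # Assumed generalized-Hamming form; the patent sets the coefficient a to 0.43
    # (a classic Hamming window would use a = 0.54).
    return [a - (1 - a) * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(x):
    # x(n'') = x(n') * w(n')
    w = window_function(len(x))
    return [xi * wi for xi, wi in zip(x, w)]
```

The window tapers the frame toward its edges, reducing spectral leakage before the Fourier transform in the next step.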
And the step S1 of performing feature extraction on the preprocessed smart headset audio signal to obtain the extracted audio signal features includes:
performing feature extraction on the preprocessed audio signal x (n ") of the smart headset to obtain an extracted audio signal feature F (x (n")), specifically, referring to fig. 2, the feature extraction process includes:
s11: performing fast Fourier transform processing on the preprocessed smart headset audio signal x(n″) to obtain the frequency-domain characteristics of x(n″), wherein the formula of the fast Fourier transform processing is as follows:
wherein:
x(k) represents the result of the fast Fourier transform of x(n″) at point k;
k is the number of points of the fast Fourier transform; j represents the imaginary unit, j² = −1;
S12: constructing a filter bank having M filters, the mth filter in the filter bank having a frequency response of f m (k):
Wherein:
f (m) represents the center frequency of the mth filter;
s13: inputting the result of the fast Fourier transform into the filter bank, the filter bank outputting the energy of the signal in the frequency domain:
wherein:
En represents the energy of the smart headset audio signal in the frequency domain;
extracting the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is as follows:
wherein:
L′ represents the characteristic order, L′ = 1, 2, ..., 12;
C(L′) represents the extracted L′-order feature;
the feature that maximizes C (L') is selected as the extracted smart headphone audio signal feature F (x (n ")).
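The filter-bank response and cepstral formulas above survive only partially in the text; steps S11 to S13 describe a conventional MFCC-style pipeline. A self-contained sketch under that assumption, using a naive DFT in place of an FFT and equal-width bands standing in for the mel filter bank:

```python
import math

def frame_features(frame, M=20, n_ceps=12):
    K = len(frame)
    # S11: power spectrum via a naive DFT (a real FFT would be used in practice)
    power = []
    for k in range(K):
        re = sum(x * math.cos(2 * math.pi * k * n / K) for n, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * n / K) for n, x in enumerate(frame))
        power.append(re * re + im * im)
    # S12/S13: log band energies; equal-width bands stand in for the triangular
    # mel responses f_m(k), which are not recoverable from the text
    band = max(1, K // M)
    energies = [math.log(sum(power[m * band:(m + 1) * band]) + 1e-12) for m in range(M)]
    # C(l'): cepstral features via a DCT of the log band energies, l' = 1..12
    return [sum(e * math.cos(math.pi * l * (m + 0.5) / M) for m, e in enumerate(energies))
            for l in range(1, n_ceps + 1)]
```

Applied per frame, this yields the 12-dimensional feature vector from which the maximal component is selected as F(x(n″)).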
S2: and constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result.
The step S2 of constructing an improved end-to-end speech recognition model includes:
constructing an improved end-to-end speech recognition model, wherein the model input is the extracted audio signal features and the model output is the speech recognition result; the improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer, wherein the convolution layer receives the audio signal features to be recognized and performs convolution processing on them, the encoding layer encodes the convolved features using gated recurrent units, and the decoding layer decodes the encoding result, the decoding result being the speech recognition result; the process of performing speech recognition on the audio signal features using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (·) denotes a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution characteristic into an encoding layer based on a gating cycle unit, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gating cycle units, each gating cycle unit is set to be bidirectional recursion, and the output result of the last gating cycle unit is used as the encoding result of the audio signal characteristic;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is the character error rate, loss = Σ_x (a_{1,x} + a_{2,x} + a_{3,x}) / N_x, summed over the training samples x:
wherein:
N x the number of characters representing the real speech text;
a_{1,x} represents the number of substituted characters in the speech recognition result output by the model, a_{2,x} the number of deleted characters in the speech recognition result output by the model, and a_{3,x} the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F(x(n″)) are input into the model, resulting in a speech recognition result y(x(n″)).
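The training objective is described through substitution, deletion and insertion counts against the reference length N_x, i.e. a character-error-rate criterion; the exact formula is not recoverable from the text. A sketch of the underlying edit-distance computation, assuming the form (a₁ + a₂ + a₃)/N_x per sample:

```python
def char_error_rate(ref, hyp):
    # Levenshtein distance: minimum number of substitutions (a1), deletions (a2)
    # and insertions (a3) turning the hypothesis into the reference, divided by
    # the reference length N_x
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[n][m] / max(n, 1)
```

A perfect transcript scores 0; each character-level error adds 1/N_x to the objective being minimized.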
S3: and performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair.
In the step S3, performing word pair conversion on the recognized speech result to obtain a converted word pair, including:
constructing corpora of different topic categories, each corpus containing all words of its topic category, and combining the constructed corpora into a corpus A = {topic_1, topic_2, ..., topic_h}, wherein topic_h is the corpus of the h-th topic category; all words in corpus A are encoded by one-hot encoding, in the order in which the words appear in corpus A, to obtain the encoding result set B of corpus A;
in the embodiment of the invention, the distances of the one-hot coding results of the words of the same subject category are similar;
based on the word encoding result of the corpus A, performing word pair conversion on the recognized voice result y (x (n ")), wherein the word pair conversion process comprises the following steps:
s31: taking corpus A as the word segmentation dictionary and computing the maximum word length z in the dictionary; intercepting the first z characters of the speech recognition result and matching the intercepted string against the words in the dictionary; if the match succeeds, taking the matched word as the segmentation result of the intercepted text, and if the match fails, intercepting the first z−1 characters and re-matching; repeating this step on the remaining speech recognition text to obtain the segmentation result {w_1, w_2, ..., w_{n_w}} of the speech recognition result, wherein w_{n_w} is the n_w-th word and n_w is the total number of words in the speech recognition result;
s32: any two segmentation results w_{n1}, w_{n2} in the segmentation result form a word pair (w_{n1}, w_{n2}), wherein n1 > n2 and n1, n2 ∈ [1, n_w];
S33: obtaining a set of conversion word pairs G = { (w) of speech recognition results n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
And the step S3 of calculating the word pair compactness of the converted word pair comprises the following steps:
traversing the encoding result set B to select the encoding of each segmentation result, obtaining the encoding result set {b_1, b_2, ..., b_{n_w}} of the speech recognition result, wherein b_i is the encoding of w_i, and thereby obtaining the encoding result set of the speech recognition word pairs G_b = {(b_{n1}, b_{n2}) | n1 > n2, n1, n2 ∈ [1, n_w]};
Calculating the word pair compactness of the conversion word pair in the voice recognition result:
wherein:
t represents transposition;
the word pair compactness set of the speech recognition result is: G_d = {dis(w_{n1}, w_{n2}) | n1 > n2, n1, n2 ∈ [1, n_w]}.
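The compactness formula itself is not recoverable from the text; given that it involves a transposition of the word encodings b_{n1}, b_{n2}, an inner-product form is assumed in this sketch:

```python
def compactness(b1, b2):
    # dis(w_n1, w_n2) = b_n1^T b_n2 -- an assumed inner-product reading of the
    # patent's formula, which is known only to involve a transposition
    return sum(u * v for u, v in zip(b1, b2))

# Toy dense encodings (with strict one-hot codes the product of two distinct
# words would always be zero, so dense codes are assumed here for illustration)
G_d = {("w1", "w2"): compactness([0.2, 0.9, 0.1], [0.1, 0.8, 0.3])}
```

Word pairs whose compactness exceeds the threshold are the ones kept in the set G1 of the topic model in step S4.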
S4: and constructing a dynamic weighted speech theme recognition model based on the word pair compactness, inputting the word compactness of the speech recognition result into the model, outputting the speech theme by the model, inputting the speech recognition result based on the word pair compactness by the model, and outputting the speech theme.
And in the step S4, a dynamic weighted speech topic recognition model is constructed based on the word pair compactness, and the method comprises the following steps:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: the word pairs in the compactness set G_d whose compactness is higher than the threshold form the word pair set G1, and the topic of each word pair in G1 is initialized, the word pair topic categories being the h topic categories in corpus A;
s42: setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max;
s43: at the (r+1)-th iteration, calculating the probability that any word pair q_v in the word pair set G1 belongs to a given topic:
wherein:
alpha, beta represents Dirichlet prior parameters;
the topic assigned to word pair q_v at the (r+1)-th iteration, drawn from the h topic categories in corpus A;
the number of word pairs in G1 assigned to the topic at the r-th iteration;
excluding word pair q_v, the number of times word pair q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times the second word of q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times all words have been assigned to the topic up to the (r+1)-th iteration;
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the word pair compactness of the speech recognition result into the model, the model outputting the speech topic, wherein the formula for calculating the speech topic of the speech recognition result is as follows:
wherein:
P(topic | G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] indexing the h topic categories in corpus A; the topic with the highest probability is selected as the speech topic;
the number of occurrences in the speech recognition result of the two words of word pair q_v; count_G represents the total number of words in the speech recognition result.
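The conditional-probability formula in step S43 survives only as a figure and cannot be recovered exactly from the text. For orientation, a standard biterm-topic-model Gibbs update built from the counts defined above would read as follows; this is an assumed reconstruction, not the patent's exact dynamically weighted expression:

```latex
P\left(z_v = k \mid \mathbf{z}_{-v}, G1\right) \propto
\left(n_{-v,k} + \alpha\right) \cdot
\frac{\left(n_{-v,k}^{w_{v,1}} + \beta\right)\left(n_{-v,k}^{w_{v,2}} + \beta\right)}
     {\left(n_{-v,k}^{(\cdot)} + W\beta\right)^{2}}
```

Here z_v is the topic of word pair q_v, k ranges over the h topic categories, n_{-v,k} is the number of word pairs in G1 assigned to topic k excluding q_v, n_{-v,k}^{w_{v,i}} counts assignments of the i-th word of q_v to topic k, n_{-v,k}^{(·)} counts assignments of all words to topic k, and W is the vocabulary size.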
S5: performing similarity calculation between the recognized speech topic and the current teaching content topic, wherein if the similarity calculation result is smaller than a specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
And the step S5 of performing similarity calculation between the recognized speech topic and the current teaching content topic, wherein if the similarity calculation result is smaller than the specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson, includes:
performing similarity calculation between the recognized speech topic and the current teaching content topic, the similarity calculation method being the cosine similarity method; if the similarity calculation result is smaller than the specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
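The S5 comparison can be sketched as follows; the topic vectors and the threshold value (0.8) are illustrative assumptions, since the patent does not specify them:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_attending(speech_topic_vec, teaching_topic_vec, threshold=0.8):
    # Below the threshold, the speech topic in the headset differs from the
    # teaching-content topic and the student is judged not to be attending
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold
```

Identical topic vectors give similarity 1.0; orthogonal vectors give 0.0 and trigger the "not attending" judgment.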
Example 2:
fig. 3 is a functional block diagram of a learning monitoring apparatus applied to an intelligent headset in the learning field according to an embodiment of the present invention, which can implement the learning monitoring means of the intelligent headset in embodiment 1.
The learning monitoring device 100 of the intelligent headset applied to the learning field of the present invention can be installed in an electronic device. According to the realized function, the learning monitoring device applied to the intelligent headset in the learning field may include a signal acquisition device 101, a voice recognition module 102 and a recognition monitoring device 103. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
The signal acquisition device 101 is used for acquiring an intelligent earphone audio signal, preprocessing the acquired intelligent earphone audio signal, and extracting the characteristics of the preprocessed intelligent earphone audio signal to obtain the extracted audio signal characteristics;
the speech recognition module 102 is configured to construct an improved end-to-end speech recognition model, and input the extracted audio signal characteristics into the model to obtain a speech recognition result;
and the recognition monitoring device 103 is used for performing word pair conversion on the recognized speech result, calculating the word pair compactness of the converted word pairs, constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, the model outputting the speech topic, and performing similarity calculation between the recognized speech topic and the current teaching content topic; if the similarity calculation result is smaller than a specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending.
In detail, when the modules in the learning monitoring apparatus 100 for an intelligent headset in the learning field according to the embodiment of the present invention are used, the same technical means as the learning monitoring means for an intelligent headset in the learning field described in fig. 1 above is adopted, and the same technical effects can be produced, which is not described herein again.
Example 3:
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring means applied to an intelligent headset in the learning field according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the program 12, but also to temporarily store data that has been output or will be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (programs 12 for learning monitoring, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 4 only shows an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions, which when executed in the processor 10, may implement:
collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
constructing an improved end-to-end voice recognition model, and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, and the model outputting the speech topic;
and performing similarity calculation between the recognized speech topic and the current teaching content topic, wherein if the similarity calculation result is smaller than a specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 4, which is not repeated herein.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, apparatus, article, or method that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.
Claims (8)
1. A learning monitoring means for an intelligent headset in the learning field, the means comprising:
s1: collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
s2: constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result;
s3: performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
s4: constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, and the model outputting the speech topic, wherein the model input is the speech recognition result based on word pair compactness and the model output is the speech topic, comprising the following steps:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: the word pairs in the compactness set G_d whose compactness is higher than the threshold form the word pair set G1, and the topic of each word pair in G1 is initialized, the word pair topic categories being the h topic categories in corpus A;
s42: setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max;
s43: at the (r+1)-th iteration, calculating the probability that any word pair q_v in the word pair set G1 belongs to a given topic:
wherein:
alpha, beta represents Dirichlet prior parameters;
the topic assigned to word pair q_v at the (r+1)-th iteration, drawn from the h topic categories in corpus A;
the number of word pairs in G1 assigned to the topic at the r-th iteration;
excluding word pair q_v, the number of times word pair q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times the second word of q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times all words have been assigned to the topic up to the (r+1)-th iteration;
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the word pair compactness of the speech recognition result into the model, the model outputting the speech topic, wherein the formula for calculating the speech topic of the speech recognition result is as follows:
wherein:
p(topic|G d g) represents the probability that the topic of the speech recognition result is topic, which belongs to [1, h ]]Representing h types of theme categories in the corpus A, and selecting the theme with the highest probability as a voice theme;
representing word pairs q v Two words in (1), number of occurrences in the speech recognition result, count G Representing the total number of words in the speech recognition result;
S5: perform similarity calculation between the recognized speech theme and the current teaching content theme; if the similarity calculation result is smaller than a specified threshold, the speech theme in the earphone differs from the teaching content theme, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
2. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S1 of collecting the audio signal of the intelligent earphone and preprocessing the collected audio signal comprises:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device; if the frequency of the collected signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the collected signal is lower than the cut-off frequency, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the collected signal, the higher the degree of attenuation; the audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] represents the timing information of the collected audio signal, t_0 represents the initial moment of audio signal collection, and t_L represents the cut-off moment of audio signal collection;
the cut-off frequency f_t is calculated as:

f_t = 1 / (2πRC)

wherein:

R represents the resistance value of the resistor in the high-pass filter;

C represents the capacitance of the capacitive element in the high-pass filter;
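The first-order RC high-pass cut-off above follows directly from R and C; the component values in the sketch below are illustrative, not from the patent:

```python
import math

def cutoff_frequency(r_ohms, c_farads):
    """First-order RC high-pass cut-off: f_t = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Example: R = 1.6 kOhm, C = 1 uF gives a cut-off of roughly 100 Hz,
# so content below ~100 Hz would be attenuated by the filter.
f_t = cutoff_frequency(1.6e3, 1e-6)
```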
an analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a digital signal, and the conversion process is as follows:

construct an impulse sequence p(t) of frequency f_s:

p(t) = Σ_(n=−∞)^(+∞) δ(t − n/f_s)

the number of sampling points in the converted signal is N, the first sampling point is x(1) and the last sampling point is x(N); sampling x(t) with p(t) gives x(n′) = x(n′/f_s), n′ ∈ [1, N], where x(n′) is the converted signal;
the signal acquisition device performs windowing processing on the converted signal x(n′), wherein the window function of the windowing processing is a Hamming window, and the windowing formula is:

x(n″) = x(n′) × w(n′)

wherein:

w(n′) = a − (1 − a) × cos(2πn′/(N − 1)) is the window function;

a represents the window function coefficient, which is set to 0.43;

x(n″) is the audio signal after windowing processing, and x(n″) is taken as the preprocessed audio signal of the intelligent earphone.
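The windowing step can be sketched as below; note that the claim fixes the coefficient a at 0.43, whereas the textbook Hamming value is 0.54, and the test signal here is an illustrative 50 Hz tone:

```python
import math

def hamming_window(N, a=0.43):
    """Generalized Hamming window w(n') = a - (1 - a) * cos(2*pi*n'/(N-1)).
    The claim sets a = 0.43 (the standard Hamming coefficient is 0.54)."""
    return [a - (1 - a) * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_signal(x, a=0.43):
    """x(n'') = x(n') * w(n'): element-wise product of signal and window."""
    w = hamming_window(len(x), a)
    return [xi * wi for xi, wi in zip(x, w)]

# Sample a 50 Hz tone at f_s = 1 kHz and apply the window
fs, N = 1000, 256
x = [math.sin(2 * math.pi * 50 * n / fs) for n in range(N)]
xw = window_signal(x)
```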
3. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 2, wherein the step S1 of performing feature extraction on the preprocessed audio signal of the intelligent earphone to obtain the extracted audio signal features comprises:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal features F(x(n″)), wherein the feature extraction process comprises the following steps:
S11: perform fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency domain characteristics of x(n″), wherein the formula of the fast Fourier transform processing is:

X(k) = Σ_(n″=1)^(N) x(n″) × e^(−j2πkn″/K), k = 1, 2, ..., K

wherein:

X(k) represents the result of the fast Fourier transform of x(n″) at point k;

K is the number of points of the fast Fourier transform, and j represents the imaginary unit, j² = −1;
S12: construct a filter bank having M filters, the frequency response of the m-th filter in the filter bank being f_m(k):

f_m(k) = 0, for k < f(m−1) or k > f(m+1)
f_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m)
f_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) < k ≤ f(m+1)

wherein:

f(m) represents the center frequency of the m-th filter;
S13: input the fast Fourier transform result into the filter bank, and the filter bank outputs the energy of the signal in the frequency domain:

En(m) = Σ_(k=1)^(K) |X(k)|² × f_m(k), m = 1, 2, ..., M

wherein:

En(m) represents the energy of the audio signal of the intelligent earphone in the frequency domain at the m-th filter;
S14: extract the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is:

C(L′) = Σ_(m=1)^(M) log(En(m)) × cos(πL′(m − 0.5)/M)

wherein:

L′ represents the feature order, L′ = 1, 2, ..., 12;

C(L′) represents the extracted L′-th order feature;

the feature that maximizes C(L′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
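Steps S11-S14 (transform, filter bank, energy, cepstral features) can be sketched end to end; the filter spacing, signal, and sizes below are illustrative assumptions (a production pipeline would use an FFT and mel-spaced centre frequencies):

```python
import cmath
import math

def dft_mag(x, K):
    """K-point DFT magnitudes (a slow stand-in for the FFT of step S11)."""
    return [abs(sum(xn * cmath.exp(-2j * math.pi * k * n / K)
                    for n, xn in enumerate(x))) for k in range(K // 2)]

def tri_filter_bank(M, n_bins):
    """M triangular filters with evenly spaced centre bins (step S12).
    A real implementation would place the centres on the mel scale."""
    centres = [int((m + 1) * n_bins / (M + 2)) for m in range(M + 2)]
    bank = []
    for m in range(1, M + 1):
        lo, c, hi = centres[m - 1], centres[m], centres[m + 1]
        f = [0.0] * n_bins
        for k in range(lo, c):          # rising edge of the triangle
            f[k] = (k - lo) / (c - lo)
        for k in range(c, hi):          # falling edge of the triangle
            f[k] = (hi - k) / (hi - c)
        bank.append(f)
    return bank

def cepstral_features(x, M=8, K=128, n_ceps=12):
    """Steps S11-S14: DFT -> filter-bank log-energies -> DCT cepstra."""
    mag = dft_mag(x, K)
    bank = tri_filter_bank(M, len(mag))
    en = [math.log(sum(p * p * f for p, f in zip(mag, flt)) + 1e-10)
          for flt in bank]
    return [sum(en[m] * math.cos(math.pi * l * (m + 0.5) / M)
                for m in range(M)) for l in range(1, n_ceps + 1)]

x = [math.sin(2 * math.pi * 12 * n / 128) for n in range(128)]
feats = cepstral_features(x)
```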
4. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S2 of constructing the improved end-to-end speech recognition model comprises:
constructing an improved end-to-end speech recognition model, wherein the model input is the extracted audio signal features and the model output is a speech recognition result; the improved end-to-end speech recognition model comprises a convolution layer, a coding layer, and a decoding layer: the convolution layer is used for receiving the audio signal features to be recognized and performing convolution processing on them, the coding layer is used for encoding the convolved features using gated recurrent units, and the decoding layer is used for decoding the encoding result, the decoding result being the speech recognition result; the process of performing speech recognition on the audio signal features using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (-) represents a convolution operation;
f1 represents a convolution feature;
S22: input the convolution features into the encoding layer based on gated recurrent units, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gated recurrent units, each set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is:

L = Σ_x (a_(1,x) + a_(2,x) + a_(3,x)) / N_x

wherein:

N_x represents the number of characters in the true speech text;

a_(1,x) represents the number of substituted characters in the speech recognition result output by the model, a_(2,x) represents the number of deleted characters in the speech recognition result output by the model, and a_(3,x) represents the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F (x (n ")) are input into a model, resulting in a speech recognition result y (x (n")).
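The training objective of claim 4 counts substituted, deleted, and inserted characters against the true text, i.e. a character error rate; a minimal sketch of those counts via a Levenshtein alignment (the sample strings are illustrative):

```python
def edit_counts(ref, hyp):
    """Substitutions, deletions, insertions turning ref into hyp (Levenshtein)."""
    R, H = len(ref), len(hyp)
    # d[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    d = [[None] * (H + 1) for _ in range(R + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)          # delete everything
    for j in range(1, H + 1):
        d[0][j] = (j, 0, 0, j)          # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                cands = [(d[i - 1][j - 1][0], d[i - 1][j - 1], (0, 0, 0))]
            else:                       # substitution
                cands = [(d[i - 1][j - 1][0] + 1, d[i - 1][j - 1], (1, 0, 0))]
            cands.append((d[i - 1][j][0] + 1, d[i - 1][j], (0, 1, 0)))  # deletion
            cands.append((d[i][j - 1][0] + 1, d[i][j - 1], (0, 0, 1)))  # insertion
            cost, prev, (s, dl, ins) = min(cands, key=lambda c: c[0])
            d[i][j] = (cost, prev[1] + s, prev[2] + dl, prev[3] + ins)
    return d[R][H][1:]

def error_rate(ref, hyp):
    """(a1 + a2 + a3) / N_x: per-utterance character error rate."""
    s, dl, ins = edit_counts(ref, hyp)
    return (s + dl + ins) / len(ref)

rate = error_rate("hello world", "helo wurld")  # one deletion + one substitution
```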
5. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S3 of performing word pair conversion on the recognized speech result to obtain converted word pairs comprises:
constructing corpora of different topic categories, wherein each corpus comprises all words in its topic category, and combining the constructed corpora of different topic categories into a corpus A = {topic_1, topic_2, ..., topic_h}, wherein topic_h is the corpus of the h-th topic category; all words in the corpus A are encoded by the one-hot encoding method according to their order in corpus A, obtaining the encoding result set B of corpus A;
based on the word encoding result of the corpus A, performing word pair conversion on the recognized voice result y (x (n ")), wherein the word pair conversion process comprises the following steps:
S31: take the corpus A as the word segmentation dictionary and let z be the maximum length of the words in the dictionary; intercept the first z characters of the speech recognition result and match the intercepted text against the words in the dictionary; if the matching succeeds, the matching result is taken as the segmentation result of the intercepted text, and this step is repeated on the remaining text of the speech recognition result, giving the segmentation result w = {w_1, w_2, ..., w_(n_w)}, where w_n is the n-th word of the speech recognition result and n_w is the total number of words of the speech recognition result; if the matching fails, the first z − 1 characters are intercepted and matched again;
S32: any two words w_n1, w_n2 in the segmentation result form a word pair (w_n1, w_n2), wherein n1 > n2 and n1, n2 ∈ [1, n_w];

S33: the set of converted word pairs of the speech recognition result is obtained as G = {(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
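Steps S31-S33 can be sketched as forward maximum matching followed by pair enumeration; the dictionary and text below are illustrative assumptions:

```python
def max_match(text, dictionary):
    """S31: forward maximum-match segmentation against a word dictionary."""
    z = max(len(w) for w in dictionary)   # maximum word length in the dictionary
    words, i = [], 0
    while i < len(text):
        # try the longest slice first, shrinking by one character on failure
        for length in range(min(z, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)       # single characters fall through
                i += length
                break
    return words

def word_pairs(words):
    """S32-S33: every pair (w_n1, w_n2) with n1 > n2."""
    return [(words[n1], words[n2])
            for n1 in range(1, len(words)) for n2 in range(n1)]

seg = max_match("machinelearningmodel", {"machine", "learning", "model", "earn"})
pairs = word_pairs(seg)
```

Note how the longest-first match prevents "earn" from splitting "learning", which is the point of taking z as the maximum dictionary word length.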
6. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 5, wherein the step S3 of calculating the word pair compactness of the converted word pairs comprises:
traverse the encoding result set B to select the encoding b_n of each word w_n in the segmentation result, obtaining the encoding result set of the speech recognition result, where b_n is the encoding result of w_n, and obtaining the encoding result set of the speech recognition word pairs G_b = {(b_n1, b_n2) | n1 > n2, n1, n2 ∈ [1, n_w]};
calculate the word pair compactness of the converted word pairs in the speech recognition result:

dis(w_n1, w_n2) = (b_n1 × b_n2^T) / (|b_n1| × |b_n2|)

wherein:

T represents transposition;

the word pair compactness set of the speech recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
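The compactness of claim 6 is a cosine-style score over word encodings. Note that for strictly one-hot codes the dot product of two distinct words is zero, so the sketch below uses small dense vectors purely to illustrate the form; the vectors are invented for the example:

```python
import math

def compactness(b1, b2):
    """dis(w_n1, w_n2) = b_n1 . b_n2^T / (|b_n1| * |b_n2|): cosine of two codes."""
    dot = sum(x * y for x, y in zip(b1, b2))
    norm = math.sqrt(sum(x * x for x in b1)) * math.sqrt(sum(x * x for x in b2))
    return dot / norm if norm else 0.0

# Illustrative 3-dimensional codes: the first pair points in similar
# directions (compact), the second pair does not.
close = compactness([0.9, 0.1, 0.2], [0.8, 0.2, 0.1])
far = compactness([0.9, 0.1, 0.2], [0.1, 0.9, 0.8])
```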
7. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S5 of performing similarity calculation between the recognized speech theme and the current teaching content theme, judging that the student is not attending the lesson when the similarity calculation result is smaller than a specified threshold (the speech theme in the earphone then differing from the teaching content theme), and otherwise that the student is attending the lesson, comprises:
performing similarity calculation between the recognized speech theme and the current teaching content theme, wherein the similarity calculation method is the cosine similarity method; if the similarity calculation result is smaller than the specified threshold, the speech theme in the earphone differs from the teaching content theme, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
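The attendance decision of step S5 can be sketched as a cosine-similarity threshold test; the topic vectors and the 0.5 threshold below are illustrative, as the claims do not fix a threshold value:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_attending(speech_topic_vec, teaching_topic_vec, threshold=0.5):
    """S5: below the threshold, the earphone topic differs from the lesson topic."""
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold

# Illustrative topic distributions over 3 hypothetical topic categories
attending = is_attending([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
distracted = is_attending([0.7, 0.2, 0.1], [0.05, 0.15, 0.8])
```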
8. A learning monitoring device for intelligent headsets applied in the learning field, the device comprising:
the signal acquisition device is used for acquiring the audio signal of the intelligent earphone, preprocessing the acquired audio signal of the intelligent earphone and extracting the characteristics of the preprocessed audio signal of the intelligent earphone to obtain the extracted audio signal characteristics;
the voice recognition module is used for constructing an improved end-to-end voice recognition model and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
the recognition monitoring device is used for performing word pair conversion on the recognized speech result, calculating the word pair compactness of the converted word pairs, constructing a dynamic weighted speech theme recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, which outputs the speech theme, and performing similarity calculation between the recognized speech theme and the current teaching content theme; if the similarity calculation result is smaller than a specified threshold, the speech theme in the earphone differs from the teaching content theme, indicating that the student is not attending the lesson, otherwise the student is attending the lesson; thereby realizing the learning monitoring method applied to intelligent earphones in the learning field as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210994128.9A CN115376499B (en) | 2022-08-18 | 2022-08-18 | Learning monitoring method of intelligent earphone applied to learning field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210994128.9A CN115376499B (en) | 2022-08-18 | 2022-08-18 | Learning monitoring method of intelligent earphone applied to learning field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115376499A true CN115376499A (en) | 2022-11-22 |
CN115376499B CN115376499B (en) | 2023-07-28 |
Family
ID=84064840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210994128.9A Active CN115376499B (en) | 2022-08-18 | 2022-08-18 | Learning monitoring method of intelligent earphone applied to learning field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115376499B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609672A (en) * | 2009-07-21 | 2009-12-23 | 北京邮电大学 | A kind of speech recognition semantic confidence feature extracting methods and device |
CN103680498A (en) * | 2012-09-26 | 2014-03-26 | 华为技术有限公司 | Speech recognition method and speech recognition equipment |
CN108836362A (en) * | 2018-04-03 | 2018-11-20 | 江苏理工学院 | It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state |
CN110415697A (en) * | 2019-08-29 | 2019-11-05 | 的卢技术有限公司 | A kind of vehicle-mounted voice control method and its system based on deep learning |
CN111105796A (en) * | 2019-12-18 | 2020-05-05 | 杭州智芯科微电子科技有限公司 | Wireless earphone control device and control method, and voice control setting method and system |
KR102228575B1 (en) * | 2020-11-06 | 2021-03-17 | (주)디라직 | Headset system capable of monitoring body temperature and online learning |
CN112863518A (en) * | 2021-01-29 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Method and device for voice data theme recognition |
CN113539268A (en) * | 2021-01-29 | 2021-10-22 | 南京迪港科技有限责任公司 | End-to-end voice-to-text rare word optimization method |
CN113887883A (en) * | 2021-09-13 | 2022-01-04 | 淮阴工学院 | Course teaching evaluation implementation method based on voice recognition technology |
CN114596839A (en) * | 2022-03-03 | 2022-06-07 | 网络通信与安全紫金山实验室 | End-to-end voice recognition method, system and storage medium |
CN114662484A (en) * | 2022-03-16 | 2022-06-24 | 平安科技(深圳)有限公司 | Semantic recognition method and device, electronic equipment and readable storage medium |
-
2022
- 2022-08-18 CN CN202210994128.9A patent/CN115376499B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609672A (en) * | 2009-07-21 | 2009-12-23 | 北京邮电大学 | A kind of speech recognition semantic confidence feature extracting methods and device |
CN103680498A (en) * | 2012-09-26 | 2014-03-26 | 华为技术有限公司 | Speech recognition method and speech recognition equipment |
CN108836362A (en) * | 2018-04-03 | 2018-11-20 | 江苏理工学院 | It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state |
CN110415697A (en) * | 2019-08-29 | 2019-11-05 | 的卢技术有限公司 | A kind of vehicle-mounted voice control method and its system based on deep learning |
CN111105796A (en) * | 2019-12-18 | 2020-05-05 | 杭州智芯科微电子科技有限公司 | Wireless earphone control device and control method, and voice control setting method and system |
KR102228575B1 (en) * | 2020-11-06 | 2021-03-17 | (주)디라직 | Headset system capable of monitoring body temperature and online learning |
CN112863518A (en) * | 2021-01-29 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Method and device for voice data theme recognition |
CN113539268A (en) * | 2021-01-29 | 2021-10-22 | 南京迪港科技有限责任公司 | End-to-end voice-to-text rare word optimization method |
CN113887883A (en) * | 2021-09-13 | 2022-01-04 | 淮阴工学院 | Course teaching evaluation implementation method based on voice recognition technology |
CN114596839A (en) * | 2022-03-03 | 2022-06-07 | 网络通信与安全紫金山实验室 | End-to-end voice recognition method, system and storage medium |
CN114662484A (en) * | 2022-03-16 | 2022-06-24 | 平安科技(深圳)有限公司 | Semantic recognition method and device, electronic equipment and readable storage medium |
Non-Patent Citations (2)
Title |
---|
YANG Huanzheng: "Design and Implementation of a Chinese Speech Recognition Model Based on Deep Learning", Journal of Hunan Post and Telecommunication College *
ZHAO Congjian; LEI Juyang; LI Mingming: "A Voice Sign-in System Based on Unsupervised Learning" *
Also Published As
Publication number | Publication date |
---|---|
CN115376499B (en) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
CN109859772B (en) | Emotion recognition method, emotion recognition device and computer-readable storage medium | |
CN112185348B (en) | Multilingual voice recognition method and device and electronic equipment | |
CN107369439B (en) | Voice awakening method and device | |
CN107103903A (en) | Acoustic training model method, device and storage medium based on artificial intelligence | |
CN110600033B (en) | Learning condition evaluation method and device, storage medium and electronic equipment | |
CN111261162B (en) | Speech recognition method, speech recognition apparatus, and storage medium | |
CN110148400A (en) | The pronunciation recognition methods of type, the training method of model, device and equipment | |
CN111274797A (en) | Intention recognition method, device and equipment for terminal and storage medium | |
CN110459205A (en) | Audio recognition method and device, computer can storage mediums | |
CN110164447A (en) | A kind of spoken language methods of marking and device | |
CN111694940A (en) | User report generation method and terminal equipment | |
CN108595406B (en) | User state reminding method and device, electronic equipment and storage medium | |
CN111651497A (en) | User label mining method and device, storage medium and electronic equipment | |
CN107274903A (en) | Text handling method and device, the device for text-processing | |
CN109800309A (en) | Classroom Discourse genre classification methods and device | |
CN113420556A (en) | Multi-mode signal based emotion recognition method, device, equipment and storage medium | |
CN113327586A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN113807103A (en) | Recruitment method, device, equipment and storage medium based on artificial intelligence | |
CN114155832A (en) | Speech recognition method, device, equipment and medium based on deep learning | |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium | |
CN112489628B (en) | Voice data selection method and device, electronic equipment and storage medium | |
CN109545226A (en) | A kind of audio recognition method, equipment and computer readable storage medium | |
CN112201253A (en) | Character marking method and device, electronic equipment and computer readable storage medium | |
CN115376499A (en) | Learning monitoring means applied to intelligent earphone in learning field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20230726 Address after: Room 102, Building 1, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province, 523000 Applicant after: DONGGUAN JIEXUN ELECTRONIC TECHNOLOGY Co.,Ltd. Address before: 523000 Building 2, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province Applicant before: Dongguan Leyi Electronic Technology Co.,Ltd. |