CN115376499A - Learning monitoring method applied to an intelligent earphone in the learning field - Google Patents
- Publication number: CN115376499A
- Application number: CN202210994128.9A
- Authority: CN (China)
- Prior art keywords: theme, word, audio signal, result, model
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/08 — Speech classification or search
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention relates to the technical field of learning monitoring, and discloses a learning monitoring method applied to an intelligent earphone in the learning field, comprising the following steps: extracting features from the preprocessed intelligent earphone audio signal; constructing an improved end-to-end speech recognition model and inputting the extracted audio signal features into the model; performing word pair conversion on the recognized speech result and calculating the word pair compactness of the converted word pairs; constructing a dynamically weighted speech topic recognition model based on word pair compactness and inputting the word pair compactness of the speech recognition result into the model; and comparing the recognized speech topic with the topic of the current teaching content. The method mainly examines word pairs with higher compactness in the speech recognition content; such word pairs carry a higher topic recognition weight, while the topic selection probabilities of other word pairs are considered dynamically during topic recognition, thereby achieving more accurate speech topic recognition based on global topic recognition.
Description
Technical Field
The invention relates to the technical field of learning monitoring, and in particular to a learning monitoring method applied to an intelligent earphone in the learning field.
Background
With the popularization of online teaching, more and more students use earphones to study in class. However, because a computer can play several audio and video streams at once, a student may appear to be listening to the lesson audio while actually listening to something else, and in an online class the teacher cannot make an effective judgment from the student's on-camera image. CN111383659A provides a distributed voice monitoring method, apparatus, system, storage medium and device: audio stream data belonging to the same machine room is obtained; audio data to be audited is extracted from the stream according to a preset push strategy; the audio data is input into a pre-trained audio recognition model to obtain a predicted value; and a machine audit result is generated from the predicted value. That method achieves a high input-output ratio, high coverage, low delay, a high recognition rate and efficient voice monitoring and auditing, and meets the audio content monitoring requirements of high-activity social voice applications deployed in multi-operator mixed network environments. CN107292271A provides a learning monitoring method, apparatus and electronic device, in which an in-class image of a classroom student is acquired; the image is analyzed to obtain feature data of the student, comprising at least one of facial feature data, visual feature data and limb feature data; and the student's in-class state is determined from that feature data.
Such systems can effectively and accurately monitor the attendance of students who study through a computer and a network, and provide a useful reference for subsequent study and teaching. However, although existing learning monitoring methods can monitor a student's in-class state from feature data, two problems remain: first, existing audio monitoring methods can only detect whether audio violates rules, whereas what must be judged is whether the content played in the earphone matches the topic of the classroom teaching content; second, traditional methods do not monitor the in-class state from auditory characteristics. To solve these problems, this patent provides an intelligent earphone learning monitoring method applied to the learning field.
Disclosure of Invention
In view of this, the invention provides a learning monitoring method for an intelligent earphone applied in the learning field, which aims to (1) use a gated recurrent unit (GRU) to replace the LSTM part of a traditional speech recognition model, since the GRU has fewer parameters and a lower computational cost and can therefore complete speech recognition faster; and (2) dynamically consider the topic selection probabilities of other word pairs during topic recognition, achieving more accurate speech topic recognition based on global topic recognition.
The invention provides a learning monitoring method applied to an intelligent earphone in the learning field, comprising the following steps:
s1: collecting an intelligent earphone audio signal, preprocessing the collected signal, and performing feature extraction on the preprocessed signal to obtain the extracted audio signal features;
s2: constructing an improved end-to-end speech recognition model whose input is the extracted audio signal features and whose output is the speech recognition result, and inputting the extracted features into the model to obtain the speech recognition result;
s3: performing word pair conversion on the recognized speech result to obtain the converted word pairs, and calculating the word pair compactness of the converted word pairs;
s4: constructing a dynamically weighted speech topic recognition model based on word pair compactness, whose input is the speech recognition result with word pair compactness and whose output is the speech topic, and inputting the word pair compactness of the speech recognition result into the model to obtain the speech topic;
s5: performing similarity calculation between the recognized speech topic and the current teaching content topic; if the similarity result is smaller than a specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise the student is attending the lesson.
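The s1 to s5 flow can be sketched as a small pipeline. Every function name below is hypothetical and each body is a stand-in stub, not the patent's implementation; only the control flow follows the steps above.

```python
def preprocess(audio):            # s1: high-pass filter, sampling, windowing (stub)
    return audio

def extract_features(signal):     # s1: FFT, filter bank, cepstral features (stub)
    return signal

def recognize_speech(features):   # s2: improved end-to-end model (stub)
    return "lecture audio about derivatives and integrals"

def topic_of(text):               # s3 + s4: word pairs, compactness, topic model (stub)
    return "calculus"

def similarity(topic_a, topic_b):  # s5: stand-in for cosine similarity of topics
    return 1.0 if topic_a == topic_b else 0.0

def monitor(audio, teaching_topic, threshold=0.5):
    # s5: below-threshold similarity means the earphone topic differs from the
    # teaching topic, i.e. the student is not attending the lesson.
    speech_topic = topic_of(recognize_speech(extract_features(preprocess(audio))))
    return similarity(speech_topic, teaching_topic) >= threshold
```

The threshold value 0.5 is illustrative; the patent leaves the threshold unspecified.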
As a further improvement of the method of the invention:
Optionally, collecting the intelligent earphone audio signal in step S1 and preprocessing the collected signal includes:
collecting the audio signal in the intelligent earphone with the earphone's built-in signal acquisition device, and preprocessing the collected intelligent earphone audio signal, the preprocessing process being as follows:
A high-pass filter is arranged in the signal acquisition device. If the frequency of the collected signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency is lower than the cut-off frequency, the capacitive reactance is large and the signal is attenuated, and the lower the frequency of the collected signal, the stronger the attenuation. The audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] represents the timing information of the collected audio signal, t_0 the initial moment of audio signal acquisition, and t_L the cut-off moment of audio signal acquisition;
The cut-off frequency f_t is calculated as:
f_t = 1 / (2πRC)
wherein:
R represents the resistance of the resistor in the high-pass filter;
C represents the capacitance of the capacitive element in the high-pass filter;
An analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a discrete digital signal; the conversion process is as follows:
build up of frequency f s Impulse sequence p (t) of (1):
The number of sampling points in the converted signal is N, so that the first sampling point is x(1) and the last sampling point is x(N); sampling x(t) with p(t) gives x(n′), n′ ∈ [1, N], the converted discrete signal;
The signal acquisition device performs windowing on the converted signal x(n′); the window function is a Hamming window, and the windowing formula is:
x(n″)=x(n′)×w(n′)
wherein:
w(n′) is the window function, w(n′) = a - (1 - a)cos(2πn′/(N - 1));
a represents the window function coefficient, set to 0.43;
x(n″) is the audio signal after windowing, and x(n″) is taken as the preprocessed intelligent earphone audio signal.
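A minimal sketch of this preprocessing stage, under the stated assumptions (first-order RC high-pass cut-off f_t = 1/(2πRC), generalized Hamming window with coefficient a = 0.43); the function names and the example component values are illustrative:

```python
import math

def highpass_cutoff(R, C):
    # f_t = 1 / (2*pi*R*C) for a first-order RC high-pass filter.
    return 1.0 / (2 * math.pi * R * C)

def hamming_window(N, a=0.43):
    # Generalized Hamming window with the patent's coefficient a = 0.43
    # (the standard Hamming window uses a = 0.54; 0.43 follows the text).
    return [a - (1 - a) * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def preprocess(x, a=0.43):
    # x(n'') = x(n') * w(n'): element-wise windowing of the sampled signal.
    w = hamming_window(len(x), a)
    return [xi * wi for xi, wi in zip(x, w)]

# Example: R = 1 kOhm, C = 100 nF gives a cut-off near 1.59 kHz.
fc = highpass_cutoff(1e3, 100e-9)
x = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(256)]
xw = preprocess(x)
```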
Optionally, performing feature extraction on the preprocessed intelligent earphone audio signal in step S1 to obtain the extracted audio signal features includes:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal feature F(x(n″)), the feature extraction process being as follows:
s11: performing fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency-domain characteristics of x(n″); the fast Fourier transform is:
X(k) = Σ_{n″=1}^{N} x(n″) · e^{-j2πkn″/K}
wherein:
X(k) represents the result of the fast Fourier transform of x(n″) at point k;
K is the number of points of the fast Fourier transform, and j is the imaginary unit, j² = -1;
s12: constructing a filter bank with M filters, the m-th filter in the bank having frequency response f_m(k); a standard triangular (mel) response consistent with the center-frequency definition below is:
f_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); and 0 otherwise
wherein:
f(m) represents the center frequency of the m-th filter;
s13: inputting the fast Fourier transform result into the filter bank, which outputs the frequency-domain energy of the signal:
En(m) = Σ_k |X(k)|² · f_m(k)
wherein:
En(m) represents the frequency-domain energy of the intelligent earphone audio signal under the m-th filter;
s14: extracting the features of the intelligent earphone audio signal, where the feature extraction formula (a discrete cosine transform of the log filter-bank energies) is:
C(l′) = Σ_{m=1}^{M} log(En(m)) · cos(π l′ (m - 0.5) / M)
wherein:
l′ represents the feature order, l′ = 1, 2, ..., 12;
C(l′) represents the extracted l′-th order feature;
the feature that maximizes C(l′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
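A sketch of the s11 to s14 chain under the assumptions above: a plain DFT stands in for the FFT, and the triangular filter and DCT forms are the assumed standard shapes, since the patent text only names them. All parameters (filter centers, K = 64) are illustrative.

```python
import cmath
import math

def dft(x, K):
    # X(k) = sum_n x[n] * exp(-j*2*pi*k*n/K): plain DFT standing in for the FFT.
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / K)
                for n in range(len(x))) for k in range(K)]

def triangular_filter(k, lo, center, hi):
    # Assumed standard triangular (mel-style) response around the center frequency.
    if lo <= k <= center:
        return (k - lo) / (center - lo)
    if center < k <= hi:
        return (hi - k) / (hi - center)
    return 0.0

def cepstral_features(x, centers, K=64, n_feats=12):
    X = dft(x, K)
    M = len(centers) - 2
    # En(m): energy of |X(k)|^2 under the m-th triangular filter.
    En = [sum(abs(X[k]) ** 2
              * triangular_filter(k, centers[m - 1], centers[m], centers[m + 1])
              for k in range(K)) for m in range(1, M + 1)]
    # C(l') = sum_m log(En(m)) * cos(pi*l'*(m-0.5)/M): DCT of log energies.
    return [sum(math.log(En[m - 1] + 1e-12) * math.cos(math.pi * l * (m - 0.5) / M)
                for m in range(1, M + 1)) for l in range(1, n_feats + 1)]

# A pure 8-cycle sine over 64 samples concentrates its energy in DFT bin 8.
x = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
feats = cepstral_features(x, centers=[2, 6, 10, 14, 18])
```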
Optionally, constructing the improved end-to-end speech recognition model in step S2 includes:
constructing an improved end-to-end speech recognition model whose input is the extracted audio signal features and whose output is the speech recognition result. The improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer: the convolution layer receives the audio signal features to be recognized and performs convolution processing on them; the encoding layer encodes the convolved features using gated recurrent units; the decoding layer decodes the encoding result, and the decoding result is the speech recognition result. The process of performing speech recognition on the audio signal features with the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (-) represents a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution features into the encoding layer based on gated recurrent units (GRUs); the activation function of the encoding layer is GeLU, the layer contains 5 gated recurrent units, each set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features;
s23: the decoding layer outputs, via a softmax function, the probability P(y | x) of each decoding result y corresponding to a segment x of the encoding result, and selects the y with the maximum probability as the decoding result of x; the decoding result is the speech recognition result;
The training objective function of the end-to-end speech recognition model (the character error rate of the model output) is:
L = (a_1,x + a_2,x + a_3,x) / N_x
wherein:
N_x represents the number of characters of the real speech text;
a_1,x represents the number of substituted characters in the speech recognition result output by the model, a_2,x the number of deleted characters, and a_3,x the number of inserted characters;
the extracted audio signal features F(x(n″)) are input into the model, yielding the speech recognition result y(x(n″)).
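The objective counts substitutions, deletions and insertions against the reference text. A sketch of how the a_1,x, a_2,x, a_3,x counts can be obtained with a Levenshtein alignment follows; the patent does not specify this algorithm, so it is an assumed standard construction.

```python
def edit_operation_counts(ref, hyp):
    # Levenshtein alignment returning (substitutions, deletions, insertions)
    # relative to the reference string: the a_1, a_2, a_3 counts.
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j].
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (c, s, d, ins)                       # match
            else:
                best = (c + 1, s + 1, d, ins)               # substitution
            cd = dp[i - 1][j]
            if cd[0] + 1 < best[0]:
                best = (cd[0] + 1, cd[1], cd[2] + 1, cd[3])  # deletion
            ci = dp[i][j - 1]
            if ci[0] + 1 < best[0]:
                best = (ci[0] + 1, ci[1], ci[2], ci[3] + 1)  # insertion
            dp[i][j] = best
    _, subs, dels, ins = dp[n][m]
    return subs, dels, ins

def character_error_objective(ref, hyp):
    # L = (a1 + a2 + a3) / N_x, with N_x the reference length.
    s, d, i = edit_operation_counts(ref, hyp)
    return (s + d + i) / max(len(ref), 1)
```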
Optionally, performing word pair conversion on the recognized speech result in step S3 to obtain the converted word pairs includes:
constructing corpora of different topic categories, each containing all words of that topic category, and combining them into a corpus A = {topic_1, topic_2, ..., topic_h}, where topic_h is the corpus of the h-th topic category; all words in corpus A are encoded with the one-hot encoding method in the order of the words in A, yielding the encoding result set B of corpus A;
in the embodiment of the invention, the one-hot encoding results of words of the same topic category are close in distance;
based on the word encoding results of corpus A, word pair conversion is performed on the recognized speech result y(x(n″)); the word pair conversion process is as follows:
s31: taking corpus A as the word segmentation dictionary and calculating the maximum word length z in the dictionary, the first z characters of the speech recognition result are intercepted and matched against the words in the dictionary; if the match succeeds, the match is taken as the segmentation result of the intercepted text, and this step is repeated on the remaining text, giving the segmentation result W = {w_1, w_2, ..., w_n_w} of the speech recognition result, where w_n_w is the n_w-th word and n_w is the total number of words; if the match fails, the first z - 1 characters are intercepted and matched again;
s32: any two segmentation results w_n1, w_n2 in W form a word pair (w_n1, w_n2), where n1 > n2 and n1, n2 ∈ [1, n_w];
s33: obtaining the set of converted word pairs of the speech recognition result G = {(w_n1, w_n2) | n1 > n2; n1, n2 ∈ [1, n_w]}.
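Steps s31 to s33 amount to forward maximum matching followed by pair enumeration. A toy sketch follows; the dictionary and strings are illustrative, and the single-character fallback for unmatched text is an assumption, since the patent leaves that case open.

```python
from itertools import combinations

def max_match_segment(text, dictionary):
    # Forward maximum matching (s31): try the longest dictionary word first,
    # shortening the intercepted prefix on failure. Single characters are
    # kept as a fallback so the scan always advances (an assumption).
    z = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for length in range(min(z, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary or length == 1:
                words.append(chunk)
                i += length
                break
    return words

def word_pairs(words):
    # s32/s33: every pair (w_n1, w_n2) with n1 > n2, later word first.
    return [(w2, w1) for w1, w2 in combinations(words, 2)]

words = max_match_segment("abcde", {"ab", "cde", "c"})
pairs = word_pairs(words)
```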
Optionally, calculating the word pair compactness of the converted word pairs in step S3 includes:
traversing the encoding result set B to select the encoding of each segmentation result in W, obtaining the encoding result set of the speech recognition result, and from it the encoding results of the word pairs of the speech recognition result;
calculating the word pair compactness dis(w_n1, w_n2) of each converted word pair in the speech recognition result:
wherein:
T represents transposition;
the word pair compactness set of the speech recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2; n1, n2 ∈ [1, n_w]}.
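The exact dis(·) formula is not reproduced in this text; only the use of a transpose is stated. As an assumed form consistent with that, a normalized inner product of the word encodings (cosine similarity) is sketched below, with toy encodings in which words of the same topic category get nearby vectors, as the embodiment describes.

```python
import math

def compactness(e1, e2):
    # Assumed form: cosine similarity e1 * e2^T / (|e1| |e2|).
    # The patent's exact dis(.) formula is not given in the text.
    dot = sum(a * b for a, b in zip(e1, e2))
    n1 = math.sqrt(sum(a * a for a in e1))
    n2 = math.sqrt(sum(b * b for b in e2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy encodings: same-topic words ("integral", "derivative") are close.
enc = {
    "integral":   [1.0, 0.9, 0.0],
    "derivative": [0.9, 1.0, 0.0],
    "guitar":     [0.0, 0.1, 1.0],
}
pairs = [("integral", "derivative"), ("integral", "guitar")]
G_d = {p: compactness(enc[p[0]], enc[p[1]]) for p in pairs}
```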
Optionally, constructing the dynamically weighted speech topic recognition model based on word pair compactness in step S4 includes:
constructing a dynamically weighted speech topic recognition model based on word pair compactness, whose input is the speech recognition result with word pair compactness and whose output is the speech topic; the topic recognition process of the model is as follows:
s41: the word pairs in the compactness set G_d whose compactness is higher than a specified threshold form the word pair set G1, and the topics of the word pairs in G1 are initialized, the candidate topic categories being the h topic categories of corpus A;
s42: setting the current iteration number of the model to r, with initial value 0, and the maximum iteration number to Max;
s43: at the (r + 1)-th iteration, calculating for any word pair q_v in G1 the probability that q_v belongs to each topic:
wherein:
α, β represent Dirichlet prior parameters;
the topic of word pair q_v at the (r + 1)-th iteration ranges over the h topic categories of corpus A;
the first count is the number of word pairs in G1, excluding q_v, assigned to the given topic at the r-th iteration;
the next counts are the numbers of times, up to the (r + 1)-th iteration and excluding q_v, that the first word of q_v, respectively the second word of q_v, has been assigned to the given topic;
the last count is the number of times, up to the (r + 1)-th iteration and excluding q_v, that all words have been assigned to the given topic;
s44: judging whether r + 1 is greater than or equal to Max; if so, outputting the probabilities of the topic categories to which the different word pairs in G1 belong at the (r + 1)-th iteration; otherwise setting r = r + 1 and returning to step S43;
inputting the word pair compactness of the speech recognition result into the model, the model outputs the speech topic; the formula for calculating the speech topic of the speech recognition result is:
wherein:
p(topic | G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] ranging over the h topic categories of corpus A; the topic with the highest probability is selected as the speech topic;
the occurrence counts in the speech recognition result of the two words of word pair q_v also enter the formula, and count_G represents the total number of words in the speech recognition result.
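A highly simplified, deterministic sketch of the s41 to s44 iteration follows. The patent's exact Gibbs-style probability formula is not reproduced in this text, so a plain count-plus-prior score stands in for it; all names and the reassignment rule are illustrative only.

```python
def assign_topics(word_pairs, h, max_iter=10, alpha=0.1, beta=0.1):
    # Simplified stand-in for s41-s44: iteratively reassign each word pair to
    # the topic favored by the current pair counts and per-word counts, with
    # Dirichlet-style smoothing (alpha, beta). Not the patent's formula.
    z = {q: i % h for i, q in enumerate(word_pairs)}   # s41: initialize topics
    for _ in range(max_iter):                          # s42-s44: iterate
        changed = False
        for q in word_pairs:
            w1, w2 = q
            scores = []
            for t in range(h):
                # Word pairs (other than q) currently assigned to topic t.
                pair_n = sum(1 for p in word_pairs if p != q and z[p] == t)
                # Occurrences of q's two words inside those pairs.
                word_n = sum(1 for p in word_pairs if p != q and z[p] == t
                             for w in p if w in (w1, w2))
                scores.append((pair_n + alpha) * (word_n + beta))
            best = max(range(h), key=lambda t: scores[t])
            if z[q] != best:
                z[q], changed = best, True
        if not changed:                                # converged early
            break
    return z

pairs = [("integral", "derivative"), ("derivative", "limit"), ("guitar", "chord")]
z = assign_topics(pairs, h=2)
```

With these toy pairs, the two pairs sharing the word "derivative" end up in the same topic.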
Optionally, in step S5, performing similarity calculation between the recognized speech topic and the current teaching content topic, where a similarity result smaller than a specified threshold indicates that the speech topic in the earphone differs from the teaching content topic and the student is not attending the lesson, and otherwise the student is attending, includes:
performing similarity calculation between the recognized speech topic and the current teaching content topic using the cosine similarity method; if the similarity result is smaller than the specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise the student is attending the lesson.
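The S5 decision can be sketched directly, assuming the two topics are represented as vectors; the threshold value 0.5 is illustrative, since the patent leaves the threshold unspecified.

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attending(speech_topic_vec, teaching_topic_vec, threshold=0.5):
    # Below-threshold similarity means the earphone audio topic differs from
    # the teaching topic, i.e. the student is not attending the lesson.
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold
```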
In order to solve the above problems, the present invention further provides a learning monitoring apparatus for an intelligent earphone applied in the learning field, the apparatus comprising:
a signal acquisition device for collecting the intelligent earphone audio signal, preprocessing the collected signal, and performing feature extraction on the preprocessed signal to obtain the extracted audio signal features;
a speech recognition module for constructing an improved end-to-end speech recognition model and inputting the extracted audio signal features into the model to obtain the speech recognition result;
a recognition monitoring device for performing word pair conversion on the recognized speech result, calculating the word pair compactness of the converted word pairs, constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model so that the model outputs the speech topic, and performing similarity calculation between the recognized speech topic and the current teaching content topic; if the similarity result is smaller than a specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson, otherwise the student is attending.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the above learning monitoring method applied to an intelligent earphone in the learning field.
In order to solve the above problem, the present invention further provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the above learning monitoring method applied to an intelligent earphone in the learning field.
Compared with the prior art, the present invention provides a learning monitoring method applied to an intelligent earphone in the learning field, with the following advantages:
First, this scheme proposes a speech recognition model: an improved end-to-end speech recognition model is constructed whose input is the extracted audio signal features and whose output is the speech recognition result. The model comprises a convolution layer, an encoding layer and a decoding layer: the convolution layer receives the audio signal features to be recognized and performs convolution processing on them; the encoding layer encodes the convolved features using gated recurrent units; the decoding layer decodes the encoding result, the decoding result being the speech recognition result. The recognition process is as follows. The convolution layer convolves the input features:
F1=Conv(F)
wherein: F represents the audio signal features input to the end-to-end speech recognition model; Conv(·) represents a convolution operation; F1 represents the convolution features. The convolution features are input into the GRU-based encoding layer, whose activation function is GeLU; the layer contains 5 gated recurrent units, each set to bidirectional recursion, and the output of the last unit is the encoding result of the audio signal features. The decoding layer outputs, via a softmax function, the probability P(y | x) of each decoding result y for a segment x of the encoding result, and selects the y with maximum probability as the decoding result, which is the speech recognition result. The training objective function of the end-to-end speech recognition model is:
L = (a_1,x + a_2,x + a_3,x) / N_x
wherein: N_x represents the number of characters of the real speech text; a_1,x represents the number of substituted characters in the speech recognition result output by the model, a_2,x the number of deleted characters, and a_3,x the number of inserted characters. The extracted audio signal features F(x(n″)) are input into the model, yielding the speech recognition result y(x(n″)). In this improved model, gated recurrent units replace the LSTM part of a traditional speech recognition model; the GRU has fewer parameters and a lower computational cost, so speech recognition can be completed faster.
Meanwhile, this scheme proposes a topic recognition model: a dynamically weighted speech topic recognition model is constructed based on word pair compactness, whose input is the speech recognition result with word pair compactness and whose output is the speech topic. The topic recognition process is as follows. The word pairs in the compactness set G_d whose compactness exceeds a specified threshold form the word pair set G1, and the topics of the word pairs in G1 are initialized, the candidate topics being the h topic categories of corpus A. The current iteration number of the model is r, with initial value 0 and maximum Max. At the (r + 1)-th iteration, the probability that any word pair q_v in G1 belongs to each topic is calculated,
wherein: α, β represent Dirichlet prior parameters, and the remaining quantities are, for each candidate topic, the topic of q_v at the (r + 1)-th iteration; the number of word pairs in G1 (excluding q_v) assigned to that topic at the r-th iteration; the numbers of times, up to the (r + 1)-th iteration and excluding q_v, that the first word and the second word of q_v have been assigned to that topic; and the number of times all words have been assigned to that topic. If r + 1 is greater than or equal to Max, the probabilities of the topic categories of the different word pairs in G1 at the (r + 1)-th iteration are output; otherwise r = r + 1 and the iteration repeats. The word pair compactness of the speech recognition result is input into the model and the model outputs the speech topic,
wherein: p(topic | G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] ranging over the h topic categories of corpus A, and the topic with the highest probability is selected as the speech topic; the occurrence counts in the speech recognition result of the two words of q_v and the total word count count_G also enter the formula. Similarity between the recognized speech topic and the current teaching content topic is then computed with the cosine similarity method; if the result is smaller than a specified threshold, the speech topic in the earphone differs from the teaching content topic, indicating that the student is not attending the lesson, otherwise the student is attending. A traditional model does not distinguish core word pairs from non-core word pairs and treats all word pairs uniformly, so core words that carry the text's topic information are ignored and topic recognition accuracy suffers; this scheme instead gives word pairs with higher compactness a higher topic recognition weight while dynamically considering the topic selection probabilities of other word pairs, realizing more accurate speech topic recognition based on global topic recognition.
Drawings
Fig. 1 is a schematic flowchart of a learning monitoring means of an intelligent headset applied in the learning field according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of one step of the embodiment of FIG. 1;
fig. 3 is a functional block diagram of a learning monitoring apparatus for an intelligent headset applied in the learning field according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring means applied to an intelligent headset in the learning field according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a learning monitoring means applied to an intelligent headset in the learning field. The execution subject of the learning monitoring means applied to the smart headset in the learning field includes, but is not limited to, at least one of electronic devices such as a server and a terminal that can be configured to execute the method provided by the embodiment of the present application. In other words, the learning monitoring means applied to the smart headset in the learning field may be executed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
s1: the method comprises the steps of collecting intelligent earphone audio signals, preprocessing the collected intelligent earphone audio signals, and extracting features of the preprocessed intelligent earphone audio signals to obtain extracted audio signal features.
The step S1 of collecting the smart headset audio signal and preprocessing the collected smart headset audio signal includes:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device; if the frequency of the acquired signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the acquired signal is lower than the cut-off frequency, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the acquired signal, the higher the degree of attenuation; the audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] represents the timing information of the acquired audio signal, t_0 the initial moment of audio signal acquisition, and t_L the cut-off moment of audio signal acquisition;
said cut-off frequency f_t is calculated by the formula f_t = 1/(2πRC):
wherein:
r represents a resistance value of a resistor in the high-pass filter;
c represents the capacitance of the capacitive element in the high-pass filter;
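Assuming the filter is a standard first-order RC high-pass stage (consistent with the R and C definitions above), the cut-off frequency can be sketched as follows; the component values in the example are illustrative, not taken from the patent:

```python
import math

def highpass_cutoff(R, C):
    """Cut-off frequency f_t = 1 / (2*pi*R*C) of a first-order RC high-pass filter."""
    return 1.0 / (2.0 * math.pi * R * C)

# Illustrative component values (not from the patent): R = 1 kOhm, C = 1 uF
f_t = highpass_cutoff(1_000.0, 1e-6)  # about 159.15 Hz
```

Signals below f_t are increasingly attenuated, matching the behavior described above.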
an analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a digital signal, and the conversion process is as follows:
an impulse train p(t) of frequency f_s is constructed, p(t) = Σ_n δ(t − n/f_s);
the number of sampling points in the digital signal is N, the first sampling point being x(1) and the last sampling point x(N); x(n′) is the converted digital signal;
the signal acquisition device performs windowing on the converted digital signal x(n′), wherein the window function of the windowing is a Hamming window, and the windowing formula is as follows:
x(n″)=x(n′)×w(n′)
wherein:
w(n′) is the window function, w(n′) = a − (1 − a)·cos(2πn′/(N − 1));
a represents the window function coefficient, which is set to 0.43;
and x(n″) is the audio signal after windowing; x(n″) is taken as the preprocessed smart headset audio signal.
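A minimal sketch of the windowing step, assuming the generalized-Hamming form w(n′) = a − (1 − a)·cos(2πn′/(N − 1)) for the window function; the patent fixes only the coefficient a = 0.43, so the exact window formula is an assumption:

```python
import math

def window_function(N, a=0.43):
    # Assumed generalized-Hamming form; the patent sets the coefficient a to 0.43
    # (a classic Hamming window would use a = 0.54).
    return [a - (1 - a) * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(x):
    # x(n'') = x(n') * w(n')
    w = window_function(len(x))
    return [xi * wi for xi, wi in zip(x, w)]
```

The window tapers the frame toward its edges, reducing spectral leakage before the Fourier transform in the next step.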
And the step S1 of performing feature extraction on the preprocessed smart headset audio signal to obtain the extracted audio signal features includes:
performing feature extraction on the preprocessed audio signal x (n ") of the smart headset to obtain an extracted audio signal feature F (x (n")), specifically, referring to fig. 2, the feature extraction process includes:
s11: performing fast Fourier transform processing on the preprocessed smart headset audio signal x(n″) to obtain the frequency-domain characteristics of x(n″), wherein the formula of the fast Fourier transform processing is as follows:
wherein:
x(k) represents the result of the fast Fourier transform of x(n″) at point k;
k is the number of points of the fast Fourier transform; j represents the imaginary unit, j² = −1;
S12: constructing a filter bank having M filters, the mth filter in the filter bank having a frequency response of f m (k):
Wherein:
f (m) represents the center frequency of the mth filter;
s13: inputting the result of the fast Fourier transform into the filter bank, the filter bank outputting the energy of the signal in the frequency domain:
wherein:
En represents the energy of the smart headset audio signal in the frequency domain;
extracting the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is as follows:
wherein:
L′ represents the characteristic order, L′ = 1, 2, ..., 12;
C(L′) represents the extracted L′-order feature;
the feature that maximizes C (L') is selected as the extracted smart headphone audio signal feature F (x (n ")).
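The filter-bank response and cepstral formulas above survive only partially in the text; steps S11 to S13 describe a conventional MFCC-style pipeline. A self-contained sketch under that assumption, using a naive DFT in place of an FFT and equal-width bands standing in for the mel filter bank:

```python
import math

def frame_features(frame, M=20, n_ceps=12):
    K = len(frame)
    # S11: power spectrum via a naive DFT (a real FFT would be used in practice)
    power = []
    for k in range(K):
        re = sum(x * math.cos(2 * math.pi * k * n / K) for n, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * n / K) for n, x in enumerate(frame))
        power.append(re * re + im * im)
    # S12/S13: log band energies; equal-width bands stand in for the triangular
    # mel responses f_m(k), which are not recoverable from the text
    band = max(1, K // M)
    energies = [math.log(sum(power[m * band:(m + 1) * band]) + 1e-12) for m in range(M)]
    # C(l'): cepstral features via a DCT of the log band energies, l' = 1..12
    return [sum(e * math.cos(math.pi * l * (m + 0.5) / M) for m, e in enumerate(energies))
            for l in range(1, n_ceps + 1)]
```

Applied per frame, this yields the 12-dimensional feature vector from which the maximal component is selected as F(x(n″)).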
S2: and constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result.
The step S2 of constructing an improved end-to-end speech recognition model includes:
constructing an improved end-to-end speech recognition model, wherein the model input is the extracted audio signal features and the model output is the speech recognition result; the improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer, wherein the convolution layer receives the audio signal features to be recognized and performs convolution processing on them, the encoding layer encodes the convolved features using gated recurrent units, and the decoding layer decodes the encoding result, the decoding result being the speech recognition result; the process of performing speech recognition on the audio signal features using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (·) denotes a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution characteristic into an encoding layer based on a gating cycle unit, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gating cycle units, each gating cycle unit is set to be bidirectional recursion, and the output result of the last gating cycle unit is used as the encoding result of the audio signal characteristic;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is the character error rate, loss = Σ_x (a_{1,x} + a_{2,x} + a_{3,x}) / N_x, summed over the training samples x:
wherein:
N x the number of characters representing the real speech text;
a_{1,x} represents the number of substituted characters in the speech recognition result output by the model, a_{2,x} the number of deleted characters in the speech recognition result output by the model, and a_{3,x} the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F(x(n″)) are input into the model, resulting in a speech recognition result y(x(n″)).
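The training objective is described through substitution, deletion and insertion counts against the reference length N_x, i.e. a character-error-rate criterion; the exact formula is not recoverable from the text. A sketch of the underlying edit-distance computation, assuming the form (a₁ + a₂ + a₃)/N_x per sample:

```python
def char_error_rate(ref, hyp):
    # Levenshtein distance: minimum number of substitutions (a1), deletions (a2)
    # and insertions (a3) turning the hypothesis into the reference, divided by
    # the reference length N_x
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[n][m] / max(n, 1)
```

A perfect transcript scores 0; each character-level error adds 1/N_x to the objective being minimized.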
S3: and performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair.
In the step S3, performing word pair conversion on the recognized speech result to obtain a converted word pair, including:
constructing corpora of different topic categories, each corpus containing all words of its topic category, and combining the constructed corpora into a corpus A = {topic_1, topic_2, ..., topic_h}, wherein topic_h is the corpus of the h-th topic category; all words in corpus A are encoded by one-hot encoding, in the order in which the words appear in corpus A, to obtain the encoding result set B of corpus A;
in the embodiment of the invention, the distances of the one-hot coding results of the words of the same subject category are similar;
based on the word encoding result of the corpus A, performing word pair conversion on the recognized voice result y (x (n ")), wherein the word pair conversion process comprises the following steps:
s31: taking corpus A as the word segmentation dictionary and computing the maximum word length z in the dictionary; intercepting the first z characters of the speech recognition result and matching the intercepted string against the words in the dictionary; if the match succeeds, taking the matched word as the segmentation result of the intercepted text, and if the match fails, intercepting the first z−1 characters and re-matching; repeating this step on the remaining speech recognition text to obtain the segmentation result {w_1, w_2, ..., w_{n_w}} of the speech recognition result, wherein w_{n_w} is the n_w-th word and n_w is the total number of words in the speech recognition result;
s32: any two segmentation results w_{n1}, w_{n2} in the segmentation result form a word pair (w_{n1}, w_{n2}), wherein n1 > n2 and n1, n2 ∈ [1, n_w];
S33: obtaining a set of conversion word pairs G = { (w) of speech recognition results n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
And the step S3 of calculating the word pair compactness of the converted word pair comprises the following steps:
traversing the encoding result set B to select the encoding of each segmentation result, obtaining the encoding result set {b_1, b_2, ..., b_{n_w}} of the speech recognition result, wherein b_i is the encoding of w_i, and thereby obtaining the encoding result set of the speech recognition word pairs G_b = {(b_{n1}, b_{n2}) | n1 > n2, n1, n2 ∈ [1, n_w]};
Calculating the word pair compactness of the conversion word pair in the voice recognition result:
wherein:
t represents transposition;
the word pair compactness set of the speech recognition result is: G_d = {dis(w_{n1}, w_{n2}) | n1 > n2, n1, n2 ∈ [1, n_w]}.
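The compactness formula itself is not recoverable from the text; given that it involves a transposition of the word encodings b_{n1}, b_{n2}, an inner-product form is assumed in this sketch:

```python
def compactness(b1, b2):
    # dis(w_n1, w_n2) = b_n1^T b_n2 -- an assumed inner-product reading of the
    # patent's formula, which is known only to involve a transposition
    return sum(u * v for u, v in zip(b1, b2))

# Toy dense encodings (with strict one-hot codes the product of two distinct
# words would always be zero, so dense codes are assumed here for illustration)
G_d = {("w1", "w2"): compactness([0.2, 0.9, 0.1], [0.1, 0.8, 0.3])}
```

Word pairs whose compactness exceeds the threshold are the ones kept in the set G1 of the topic model in step S4.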
S4: and constructing a dynamic weighted speech theme recognition model based on the word pair compactness, inputting the word compactness of the speech recognition result into the model, outputting the speech theme by the model, inputting the speech recognition result based on the word pair compactness by the model, and outputting the speech theme.
And in the step S4, a dynamic weighted speech topic recognition model is constructed based on the word pair compactness, and the method comprises the following steps:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: the word pairs in the compactness set G_d whose compactness is higher than the threshold form the word pair set G1, and the topic of each word pair in G1 is initialized, the word pair topic categories being the h topic categories in corpus A;
s42: setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max;
s43: at the (r+1)-th iteration, calculating the probability that any word pair q_v in the word pair set G1 belongs to a given topic:
wherein:
alpha, beta represents Dirichlet prior parameters;
the topic assigned to word pair q_v at the (r+1)-th iteration, drawn from the h topic categories in corpus A;
the number of word pairs in G1 assigned to the topic at the r-th iteration;
excluding word pair q_v, the number of times word pair q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times the second word of q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times all words have been assigned to the topic up to the (r+1)-th iteration;
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the word pair compactness of the speech recognition result into the model, the model outputting the speech topic, wherein the formula for calculating the speech topic of the speech recognition result is as follows:
wherein:
P(topic | G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] indexing the h topic categories in corpus A; the topic with the highest probability is selected as the speech topic;
the number of occurrences in the speech recognition result of the two words of word pair q_v; count_G represents the total number of words in the speech recognition result.
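The conditional-probability formula in step S43 survives only as a figure and cannot be recovered exactly from the text. For orientation, a standard biterm-topic-model Gibbs update built from the counts defined above would read as follows; this is an assumed reconstruction, not the patent's exact dynamically weighted expression:

```latex
P\left(z_v = k \mid \mathbf{z}_{-v}, G1\right) \propto
\left(n_{-v,k} + \alpha\right) \cdot
\frac{\left(n_{-v,k}^{w_{v,1}} + \beta\right)\left(n_{-v,k}^{w_{v,2}} + \beta\right)}
     {\left(n_{-v,k}^{(\cdot)} + W\beta\right)^{2}}
```

Here z_v is the topic of word pair q_v, k ranges over the h topic categories, n_{-v,k} is the number of word pairs in G1 assigned to topic k excluding q_v, n_{-v,k}^{w_{v,i}} counts assignments of the i-th word of q_v to topic k, n_{-v,k}^{(·)} counts assignments of all words to topic k, and W is the vocabulary size.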
S5: performing similarity calculation between the recognized speech topic and the current teaching content topic, wherein if the similarity calculation result is smaller than a specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
And the step S5 of performing similarity calculation between the recognized speech topic and the current teaching content topic, wherein if the similarity calculation result is smaller than the specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson, includes:
performing similarity calculation between the recognized speech topic and the current teaching content topic, the similarity calculation method being the cosine similarity method; if the similarity calculation result is smaller than the specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
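The S5 comparison can be sketched as follows; the topic vectors and the threshold value (0.8) are illustrative assumptions, since the patent does not specify them:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_attending(speech_topic_vec, teaching_topic_vec, threshold=0.8):
    # Below the threshold, the speech topic in the headset differs from the
    # teaching-content topic and the student is judged not to be attending
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold
```

Identical topic vectors give similarity 1.0; orthogonal vectors give 0.0 and trigger the "not attending" judgment.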
Example 2:
fig. 3 is a functional block diagram of a learning monitoring apparatus applied to an intelligent headset in the learning field according to an embodiment of the present invention, which can implement the learning monitoring means of the intelligent headset in embodiment 1.
The learning monitoring device 100 of the intelligent headset applied to the learning field of the present invention can be installed in an electronic device. According to the realized function, the learning monitoring device applied to the intelligent headset in the learning field may include a signal acquisition device 101, a voice recognition module 102 and a recognition monitoring device 103. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
The signal acquisition device 101 is used for acquiring an intelligent earphone audio signal, preprocessing the acquired intelligent earphone audio signal, and extracting the characteristics of the preprocessed intelligent earphone audio signal to obtain the extracted audio signal characteristics;
the speech recognition module 102 is configured to construct an improved end-to-end speech recognition model, and input the extracted audio signal characteristics into the model to obtain a speech recognition result;
and the recognition monitoring device 103 is used for performing word pair conversion on the recognized speech result, calculating the word pair compactness of the converted word pairs, constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, the model outputting the speech topic, and performing similarity calculation between the recognized speech topic and the current teaching content topic; if the similarity calculation result is smaller than a specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending.
In detail, when the modules in the learning monitoring apparatus 100 for an intelligent headset in the learning field according to the embodiment of the present invention are used, the same technical means as the learning monitoring means for an intelligent headset in the learning field described in fig. 1 above is adopted, and the same technical effects can be produced, which is not described herein again.
Example 3:
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring means applied to an intelligent headset in the learning field according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the program 12, but also to temporarily store data that has been output or will be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (programs 12 for learning monitoring, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 4 only shows an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions, which when executed in the processor 10, may implement:
collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
constructing an improved end-to-end voice recognition model, and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, and the model outputting the speech topic;
and performing similarity calculation between the recognized speech topic and the current teaching content topic, wherein if the similarity calculation result is smaller than a specified threshold, the speech topic in the headset differs from the teaching content topic, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 4, which is not repeated herein.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, apparatus, article, or method that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.
Claims (8)
1. A learning monitoring means for an intelligent headset in the learning field, the means comprising:
s1: collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
s2: constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result;
s3: performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
s4: constructing a dynamically weighted speech topic recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, and the model outputting the speech topic, wherein the model input is the speech recognition result based on word pair compactness and the model output is the speech topic, comprising the following steps:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: the word pairs in the compactness set G_d whose compactness is higher than the threshold form the word pair set G1, and the topic of each word pair in G1 is initialized, the word pair topic categories being the h topic categories in corpus A;
s42: setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max;
s43: at the (r+1)-th iteration, calculating the probability that any word pair q_v in the word pair set G1 belongs to a given topic:
wherein:
alpha, beta represents Dirichlet prior parameters;
the topic assigned to word pair q_v at the (r+1)-th iteration, drawn from the h topic categories in corpus A;
the number of word pairs in G1 assigned to the topic at the r-th iteration;
excluding word pair q_v, the number of times word pair q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times the second word of q_v has been assigned to the topic up to the (r+1)-th iteration;
excluding word pair q_v, the number of times all words have been assigned to the topic up to the (r+1)-th iteration;
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the word pair compactness of the speech recognition result into the model, the model outputting the speech topic, wherein the formula for calculating the speech topic of the speech recognition result is as follows:
wherein:
p(topic|G d g) represents the probability that the topic of the speech recognition result is topic, which belongs to [1, h ]]Representing h types of theme categories in the corpus A, and selecting the theme with the highest probability as a voice theme;
representing word pairs q v Two words in (1), number of occurrences in the speech recognition result, count G Representing the total number of words in the speech recognition result;
S5: perform similarity calculation between the recognized speech theme and the current teaching content theme; if the similarity calculation result is smaller than a specified threshold, the speech theme in the earphone differs from the teaching content theme, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
2. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S1 of collecting the audio signal of the intelligent earphone and preprocessing the collected audio signal comprises:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device; if the frequency of the collected signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the collected signal is lower than the cut-off frequency, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the collected signal, the higher the degree of attenuation; the audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] represents the timing information of the collected audio signal, t_0 represents the initial moment of audio signal collection, and t_L represents the cut-off moment of audio signal collection;
the cut-off frequency f_t is calculated as:

f_t = 1 / (2πRC)

wherein:

R represents the resistance value of the resistor in the high-pass filter;

C represents the capacitance of the capacitive element in the high-pass filter;
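The first-order RC high-pass cut-off above follows directly from R and C; the component values in the sketch below are illustrative, not from the patent:

```python
import math

def cutoff_frequency(r_ohms, c_farads):
    """First-order RC high-pass cut-off: f_t = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Example: R = 1.6 kOhm, C = 1 uF gives a cut-off of roughly 100 Hz,
# so content below ~100 Hz would be attenuated by the filter.
f_t = cutoff_frequency(1.6e3, 1e-6)
```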
an analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a digital signal, and the conversion process is as follows:

construct an impulse sequence p(t) of frequency f_s:

p(t) = Σ_(n=−∞)^(+∞) δ(t − n/f_s)

the number of sampling points in the converted signal is N, the first sampling point is x(1) and the last sampling point is x(N); sampling x(t) with p(t) gives x(n′) = x(n′/f_s), n′ ∈ [1, N], where x(n′) is the converted signal;
the signal acquisition device performs windowing processing on the converted signal x(n′), wherein the window function of the windowing processing is a Hamming window, and the windowing formula is:

x(n″) = x(n′) × w(n′)

wherein:

w(n′) = a − (1 − a) × cos(2πn′/(N − 1)) is the window function;

a represents the window function coefficient, which is set to 0.43;

x(n″) is the audio signal after windowing processing, and x(n″) is taken as the preprocessed audio signal of the intelligent earphone.
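The windowing step can be sketched as below; note that the claim fixes the coefficient a at 0.43, whereas the textbook Hamming value is 0.54, and the test signal here is an illustrative 50 Hz tone:

```python
import math

def hamming_window(N, a=0.43):
    """Generalized Hamming window w(n') = a - (1 - a) * cos(2*pi*n'/(N-1)).
    The claim sets a = 0.43 (the standard Hamming coefficient is 0.54)."""
    return [a - (1 - a) * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_signal(x, a=0.43):
    """x(n'') = x(n') * w(n'): element-wise product of signal and window."""
    w = hamming_window(len(x), a)
    return [xi * wi for xi, wi in zip(x, w)]

# Sample a 50 Hz tone at f_s = 1 kHz and apply the window
fs, N = 1000, 256
x = [math.sin(2 * math.pi * 50 * n / fs) for n in range(N)]
xw = window_signal(x)
```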
3. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 2, wherein the step S1 of performing feature extraction on the preprocessed audio signal of the intelligent earphone to obtain the extracted audio signal features comprises:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal features F(x(n″)), wherein the feature extraction process comprises the following steps:
S11: perform fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency domain characteristics of x(n″), wherein the formula of the fast Fourier transform processing is:

X(k) = Σ_(n″=1)^(N) x(n″) × e^(−j2πkn″/K), k = 1, 2, ..., K

wherein:

X(k) represents the result of the fast Fourier transform of x(n″) at point k;

K is the number of points of the fast Fourier transform, and j represents the imaginary unit, j² = −1;
S12: construct a filter bank having M filters, the frequency response of the m-th filter in the filter bank being f_m(k):

f_m(k) = 0, for k < f(m−1) or k > f(m+1)
f_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m)
f_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) < k ≤ f(m+1)

wherein:

f(m) represents the center frequency of the m-th filter;
S13: input the fast Fourier transform result into the filter bank, and the filter bank outputs the energy of the signal in the frequency domain:

En(m) = Σ_(k=1)^(K) |X(k)|² × f_m(k), m = 1, 2, ..., M

wherein:

En(m) represents the energy of the audio signal of the intelligent earphone in the frequency domain at the m-th filter;
S14: extract the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is:

C(L′) = Σ_(m=1)^(M) log(En(m)) × cos(πL′(m − 0.5)/M)

wherein:

L′ represents the feature order, L′ = 1, 2, ..., 12;

C(L′) represents the extracted L′-th order feature;

the feature that maximizes C(L′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
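Steps S11-S14 (transform, filter bank, energy, cepstral features) can be sketched end to end; the filter spacing, signal, and sizes below are illustrative assumptions (a production pipeline would use an FFT and mel-spaced centre frequencies):

```python
import cmath
import math

def dft_mag(x, K):
    """K-point DFT magnitudes (a slow stand-in for the FFT of step S11)."""
    return [abs(sum(xn * cmath.exp(-2j * math.pi * k * n / K)
                    for n, xn in enumerate(x))) for k in range(K // 2)]

def tri_filter_bank(M, n_bins):
    """M triangular filters with evenly spaced centre bins (step S12).
    A real implementation would place the centres on the mel scale."""
    centres = [int((m + 1) * n_bins / (M + 2)) for m in range(M + 2)]
    bank = []
    for m in range(1, M + 1):
        lo, c, hi = centres[m - 1], centres[m], centres[m + 1]
        f = [0.0] * n_bins
        for k in range(lo, c):          # rising edge of the triangle
            f[k] = (k - lo) / (c - lo)
        for k in range(c, hi):          # falling edge of the triangle
            f[k] = (hi - k) / (hi - c)
        bank.append(f)
    return bank

def cepstral_features(x, M=8, K=128, n_ceps=12):
    """Steps S11-S14: DFT -> filter-bank log-energies -> DCT cepstra."""
    mag = dft_mag(x, K)
    bank = tri_filter_bank(M, len(mag))
    en = [math.log(sum(p * p * f for p, f in zip(mag, flt)) + 1e-10)
          for flt in bank]
    return [sum(en[m] * math.cos(math.pi * l * (m + 0.5) / M)
                for m in range(M)) for l in range(1, n_ceps + 1)]

x = [math.sin(2 * math.pi * 12 * n / 128) for n in range(128)]
feats = cepstral_features(x)
```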
4. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S2 of constructing the improved end-to-end speech recognition model comprises:
constructing an improved end-to-end speech recognition model, wherein the model input is the extracted audio signal features and the model output is a speech recognition result; the improved end-to-end speech recognition model comprises a convolution layer, a coding layer, and a decoding layer: the convolution layer is used for receiving the audio signal features to be recognized and performing convolution processing on them, the coding layer is used for encoding the convolved features using gated recurrent units, and the decoding layer is used for decoding the encoding result, the decoding result being the speech recognition result; the process of performing speech recognition on the audio signal features using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (-) represents a convolution operation;
f1 represents a convolution feature;
S22: input the convolution features into the encoding layer based on gated recurrent units, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gated recurrent units, each set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is:

L = Σ_x (a_(1,x) + a_(2,x) + a_(3,x)) / N_x

wherein:

N_x represents the number of characters in the true speech text;

a_(1,x) represents the number of substituted characters in the speech recognition result output by the model, a_(2,x) represents the number of deleted characters in the speech recognition result output by the model, and a_(3,x) represents the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F (x (n ")) are input into a model, resulting in a speech recognition result y (x (n")).
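The training objective of claim 4 counts substituted, deleted, and inserted characters against the true text, i.e. a character error rate; a minimal sketch of those counts via a Levenshtein alignment (the sample strings are illustrative):

```python
def edit_counts(ref, hyp):
    """Substitutions, deletions, insertions turning ref into hyp (Levenshtein)."""
    R, H = len(ref), len(hyp)
    # d[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    d = [[None] * (H + 1) for _ in range(R + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)          # delete everything
    for j in range(1, H + 1):
        d[0][j] = (j, 0, 0, j)          # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                cands = [(d[i - 1][j - 1][0], d[i - 1][j - 1], (0, 0, 0))]
            else:                       # substitution
                cands = [(d[i - 1][j - 1][0] + 1, d[i - 1][j - 1], (1, 0, 0))]
            cands.append((d[i - 1][j][0] + 1, d[i - 1][j], (0, 1, 0)))  # deletion
            cands.append((d[i][j - 1][0] + 1, d[i][j - 1], (0, 0, 1)))  # insertion
            cost, prev, (s, dl, ins) = min(cands, key=lambda c: c[0])
            d[i][j] = (cost, prev[1] + s, prev[2] + dl, prev[3] + ins)
    return d[R][H][1:]

def error_rate(ref, hyp):
    """(a1 + a2 + a3) / N_x: per-utterance character error rate."""
    s, dl, ins = edit_counts(ref, hyp)
    return (s + dl + ins) / len(ref)

rate = error_rate("hello world", "helo wurld")  # one deletion + one substitution
```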
5. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S3 of performing word pair conversion on the recognized speech result to obtain converted word pairs comprises:
constructing corpora of different topic categories, wherein each corpus comprises all words in its topic category, and combining the constructed corpora of different topic categories into a corpus A = {topic_1, topic_2, ..., topic_h}, wherein topic_h is the corpus of the h-th topic category; all words in the corpus A are encoded by the one-hot encoding method according to their order in corpus A, obtaining the encoding result set B of corpus A;
based on the word encoding result of the corpus A, performing word pair conversion on the recognized voice result y (x (n ")), wherein the word pair conversion process comprises the following steps:
S31: take the corpus A as the word segmentation dictionary and let z be the maximum length of the words in the dictionary; intercept the first z characters of the speech recognition result and match the intercepted text against the words in the dictionary; if the matching succeeds, the matching result is taken as the segmentation result of the intercepted text, and this step is repeated on the remaining text of the speech recognition result, giving the segmentation result w = {w_1, w_2, ..., w_(n_w)}, where w_n is the n-th word of the speech recognition result and n_w is the total number of words of the speech recognition result; if the matching fails, the first z − 1 characters are intercepted and matched again;
S32: any two words w_n1, w_n2 in the segmentation result form a word pair (w_n1, w_n2), wherein n1 > n2 and n1, n2 ∈ [1, n_w];

S33: the set of converted word pairs of the speech recognition result is obtained as G = {(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
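Steps S31-S33 can be sketched as forward maximum matching followed by pair enumeration; the dictionary and text below are illustrative assumptions:

```python
def max_match(text, dictionary):
    """S31: forward maximum-match segmentation against a word dictionary."""
    z = max(len(w) for w in dictionary)   # maximum word length in the dictionary
    words, i = [], 0
    while i < len(text):
        # try the longest slice first, shrinking by one character on failure
        for length in range(min(z, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)       # single characters fall through
                i += length
                break
    return words

def word_pairs(words):
    """S32-S33: every pair (w_n1, w_n2) with n1 > n2."""
    return [(words[n1], words[n2])
            for n1 in range(1, len(words)) for n2 in range(n1)]

seg = max_match("machinelearningmodel", {"machine", "learning", "model", "earn"})
pairs = word_pairs(seg)
```

Note how the longest-first match prevents "earn" from splitting "learning", which is the point of taking z as the maximum dictionary word length.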
6. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 5, wherein the step S3 of calculating the word pair compactness of the converted word pairs comprises:
traverse the encoding result set B to select the encoding b_n of each word w_n in the segmentation result, obtaining the encoding result set of the speech recognition result, where b_n is the encoding result of w_n, and obtaining the encoding result set of the speech recognition word pairs G_b = {(b_n1, b_n2) | n1 > n2, n1, n2 ∈ [1, n_w]};
calculate the word pair compactness of the converted word pairs in the speech recognition result:

dis(w_n1, w_n2) = (b_n1 × b_n2^T) / (|b_n1| × |b_n2|)

wherein:

T represents transposition;

the word pair compactness set of the speech recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
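The compactness of claim 6 is a cosine-style score over word encodings. Note that for strictly one-hot codes the dot product of two distinct words is zero, so the sketch below uses small dense vectors purely to illustrate the form; the vectors are invented for the example:

```python
import math

def compactness(b1, b2):
    """dis(w_n1, w_n2) = b_n1 . b_n2^T / (|b_n1| * |b_n2|): cosine of two codes."""
    dot = sum(x * y for x, y in zip(b1, b2))
    norm = math.sqrt(sum(x * x for x in b1)) * math.sqrt(sum(x * x for x in b2))
    return dot / norm if norm else 0.0

# Illustrative 3-dimensional codes: the first pair points in similar
# directions (compact), the second pair does not.
close = compactness([0.9, 0.1, 0.2], [0.8, 0.2, 0.1])
far = compactness([0.9, 0.1, 0.2], [0.1, 0.9, 0.8])
```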
7. The learning monitoring method of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S5 of performing similarity calculation between the recognized speech theme and the current teaching content theme, judging that the student is not attending the lesson when the similarity calculation result is smaller than a specified threshold (the speech theme in the earphone then differing from the teaching content theme), and otherwise that the student is attending the lesson, comprises:
performing similarity calculation between the recognized speech theme and the current teaching content theme, wherein the similarity calculation method is the cosine similarity method; if the similarity calculation result is smaller than the specified threshold, the speech theme in the earphone differs from the teaching content theme, indicating that the student is not attending the lesson; otherwise, the student is attending the lesson.
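The attendance decision of step S5 can be sketched as a cosine-similarity threshold test; the topic vectors and the 0.5 threshold below are illustrative, as the claims do not fix a threshold value:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_attending(speech_topic_vec, teaching_topic_vec, threshold=0.5):
    """S5: below the threshold, the earphone topic differs from the lesson topic."""
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold

# Illustrative topic distributions over 3 hypothetical topic categories
attending = is_attending([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
distracted = is_attending([0.7, 0.2, 0.1], [0.05, 0.15, 0.8])
```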
8. A learning monitoring device for intelligent headsets applied in the learning field, the device comprising:
the signal acquisition device is used for acquiring the audio signal of the intelligent earphone, preprocessing the acquired audio signal of the intelligent earphone and extracting the characteristics of the preprocessed audio signal of the intelligent earphone to obtain the extracted audio signal characteristics;
the voice recognition module is used for constructing an improved end-to-end voice recognition model and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
the recognition monitoring device is used for performing word pair conversion on the recognized speech result, calculating the word pair compactness of the converted word pairs, constructing a dynamic weighted speech theme recognition model based on word pair compactness, inputting the word pair compactness of the speech recognition result into the model, which outputs the speech theme, and performing similarity calculation between the recognized speech theme and the current teaching content theme; if the similarity calculation result is smaller than a specified threshold, the speech theme in the earphone differs from the teaching content theme, indicating that the student is not attending the lesson, otherwise the student is attending the lesson; thereby realizing the learning monitoring method applied to intelligent earphones in the learning field as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210994128.9A CN115376499B (en) | 2022-08-18 | 2022-08-18 | Learning monitoring method of intelligent earphone applied to learning field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210994128.9A CN115376499B (en) | 2022-08-18 | 2022-08-18 | Learning monitoring method of intelligent earphone applied to learning field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115376499A true CN115376499A (en) | 2022-11-22 |
CN115376499B CN115376499B (en) | 2023-07-28 |
Family
ID=84064840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210994128.9A Active CN115376499B (en) | 2022-08-18 | 2022-08-18 | Learning monitoring method of intelligent earphone applied to learning field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115376499B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609672A (en) * | 2009-07-21 | 2009-12-23 | 北京邮电大学 | A kind of speech recognition semantic confidence feature extracting methods and device |
CN103680498A (en) * | 2012-09-26 | 2014-03-26 | 华为技术有限公司 | Speech recognition method and speech recognition equipment |
CN108836362A (en) * | 2018-04-03 | 2018-11-20 | 江苏理工学院 | It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state |
CN110415697A (en) * | 2019-08-29 | 2019-11-05 | 的卢技术有限公司 | A kind of vehicle-mounted voice control method and its system based on deep learning |
CN111105796A (en) * | 2019-12-18 | 2020-05-05 | 杭州智芯科微电子科技有限公司 | Wireless earphone control device and control method, and voice control setting method and system |
KR102228575B1 (en) * | 2020-11-06 | 2021-03-17 | (주)디라직 | Headset system capable of monitoring body temperature and online learning |
CN112863518A (en) * | 2021-01-29 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Method and device for voice data theme recognition |
CN113539268A (en) * | 2021-01-29 | 2021-10-22 | 南京迪港科技有限责任公司 | End-to-end voice-to-text rare word optimization method |
CN113887883A (en) * | 2021-09-13 | 2022-01-04 | 淮阴工学院 | Course teaching evaluation implementation method based on voice recognition technology |
CN114596839A (en) * | 2022-03-03 | 2022-06-07 | 网络通信与安全紫金山实验室 | End-to-end voice recognition method, system and storage medium |
CN114662484A (en) * | 2022-03-16 | 2022-06-24 | 平安科技(深圳)有限公司 | Semantic recognition method and device, electronic equipment and readable storage medium |
-
2022
- 2022-08-18 CN CN202210994128.9A patent/CN115376499B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609672A (en) * | 2009-07-21 | 2009-12-23 | 北京邮电大学 | A kind of speech recognition semantic confidence feature extracting methods and device |
CN103680498A (en) * | 2012-09-26 | 2014-03-26 | 华为技术有限公司 | Speech recognition method and speech recognition equipment |
CN108836362A (en) * | 2018-04-03 | 2018-11-20 | 江苏理工学院 | It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state |
CN110415697A (en) * | 2019-08-29 | 2019-11-05 | 的卢技术有限公司 | A kind of vehicle-mounted voice control method and its system based on deep learning |
CN111105796A (en) * | 2019-12-18 | 2020-05-05 | 杭州智芯科微电子科技有限公司 | Wireless earphone control device and control method, and voice control setting method and system |
KR102228575B1 (en) * | 2020-11-06 | 2021-03-17 | (주)디라직 | Headset system capable of monitoring body temperature and online learning |
CN112863518A (en) * | 2021-01-29 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Method and device for voice data theme recognition |
CN113539268A (en) * | 2021-01-29 | 2021-10-22 | 南京迪港科技有限责任公司 | End-to-end voice-to-text rare word optimization method |
CN113887883A (en) * | 2021-09-13 | 2022-01-04 | 淮阴工学院 | Course teaching evaluation implementation method based on voice recognition technology |
CN114596839A (en) * | 2022-03-03 | 2022-06-07 | 网络通信与安全紫金山实验室 | End-to-end voice recognition method, system and storage medium |
CN114662484A (en) * | 2022-03-16 | 2022-06-24 | 平安科技(深圳)有限公司 | Semantic recognition method and device, electronic equipment and readable storage medium |
Non-Patent Citations (2)
Title |
---|
YANG Huanzheng: "Design and Implementation of a Chinese Speech Recognition Model Based on Deep Learning", Journal of Hunan Post and Telecommunication College *
ZHAO Congjian; LEI Juyang; LI Mingming: "A Voice Sign-in System Based on Unsupervised Learning" *
Also Published As
Publication number | Publication date |
---|---|
CN115376499B (en) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
CN109859772B (en) | Emotion recognition method, emotion recognition device and computer-readable storage medium | |
CN112185348B (en) | Multilingual voice recognition method and device and electronic equipment | |
CN107369439B (en) | Voice awakening method and device | |
CN107103903A (en) | Acoustic training model method, device and storage medium based on artificial intelligence | |
CN110600033B (en) | Learning condition evaluation method and device, storage medium and electronic equipment | |
CN111261162B (en) | Speech recognition method, speech recognition apparatus, and storage medium | |
CN110148400A (en) | The pronunciation recognition methods of type, the training method of model, device and equipment | |
CN111274797A (en) | Intention recognition method, device and equipment for terminal and storage medium | |
CN110459205A (en) | Audio recognition method and device, computer can storage mediums | |
CN110164447A (en) | A kind of spoken language methods of marking and device | |
CN111694940A (en) | User report generation method and terminal equipment | |
CN108595406B (en) | User state reminding method and device, electronic equipment and storage medium | |
CN111651497A (en) | User label mining method and device, storage medium and electronic equipment | |
CN107274903A (en) | Text handling method and device, the device for text-processing | |
CN109800309A (en) | Classroom Discourse genre classification methods and device | |
CN113420556A (en) | Multi-mode signal based emotion recognition method, device, equipment and storage medium | |
CN113327586A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN113807103A (en) | Recruitment method, device, equipment and storage medium based on artificial intelligence | |
CN114155832A (en) | Speech recognition method, device, equipment and medium based on deep learning | |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium | |
CN112489628B (en) | Voice data selection method and device, electronic equipment and storage medium | |
CN109545226A (en) | A kind of audio recognition method, equipment and computer readable storage medium | |
CN112201253A (en) | Character marking method and device, electronic equipment and computer readable storage medium | |
CN115376499A (en) | Learning monitoring means applied to intelligent earphone in learning field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20230726 Address after: Room 102, Building 1, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province, 523000 Applicant after: DONGGUAN JIEXUN ELECTRONIC TECHNOLOGY Co.,Ltd. Address before: 523000 Building 2, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province Applicant before: Dongguan Leyi Electronic Technology Co.,Ltd. |