CN115376499A - Learning monitoring means applied to intelligent earphone in learning field - Google Patents

Learning monitoring means applied to intelligent earphone in learning field Download PDF

Info

Publication number
CN115376499A
CN115376499A CN202210994128.9A CN202210994128A CN115376499A CN 115376499 A CN115376499 A CN 115376499A CN 202210994128 A CN202210994128 A CN 202210994128A CN 115376499 A CN115376499 A CN 115376499A
Authority
CN
China
Prior art keywords
theme
word
audio signal
result
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210994128.9A
Other languages
Chinese (zh)
Other versions
CN115376499B (en
Inventor
刘其军
余红云
陆佳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Jiexun Electronic Technology Co ltd
Original Assignee
Dongguan Leyi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Leyi Electronic Technology Co ltd filed Critical Dongguan Leyi Electronic Technology Co ltd
Priority to CN202210994128.9A priority Critical patent/CN115376499B/en
Publication of CN115376499A publication Critical patent/CN115376499A/en
Application granted granted Critical
Publication of CN115376499B publication Critical patent/CN115376499B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention relates to the technical field of learning monitoring, and discloses a learning monitoring means of an intelligent earphone applied to the learning field, which comprises the following steps: extracting the characteristics of the preprocessed intelligent earphone audio signals; constructing an improved end-to-end speech recognition model, and inputting the extracted audio signal characteristics into the model; performing word pair conversion on the recognized voice result, and calculating the word pair compactness of the converted word pair; constructing a dynamic weighted speech topic recognition model based on the word pair compactness, and inputting the word compactness of a speech recognition result into the model; and judging the theme of the voice theme obtained by recognition and the theme of the current teaching content. The method mainly inspects the word pairs with higher word pair compactness in the voice recognition content, the word pairs with higher compactness have higher theme recognition weight, and dynamically considers the theme selection probability of other word pairs in the theme recognition process, thereby realizing more accurate voice theme recognition based on global theme recognition.

Description

Learning monitoring means applied to intelligent earphone in learning field
Technical Field
The invention relates to the technical field of learning monitoring, in particular to a learning monitoring means of an intelligent earphone applied to the learning field.
Background
Along with the popularization of online teaching, more and more students use the earphone to study in class, but because a computer can play a plurality of audios and videos synchronously, the students may listen to the audio in class on the surface, but actually listen to other audio, and the teacher cannot effectively judge through the class listening pictures of the students in the online class. CN111383659A provides a distributed voice monitoring method, apparatus, system, storage medium and device, and obtains audio stream data belonging to the same machine room; acquiring audio data to be audited from the audio stream data according to a preset trial pushing strategy; inputting the audio data to be audited into a pre-trained audio recognition model to obtain a predicted value corresponding to the audio recognition model; and generating an audio machine examination result according to the predicted value. The method can realize high input-output ratio, high coverage, low delay, high recognition rate and high-efficiency voice monitoring and auditing, and can meet the audio content monitoring requirement of high-activity voice social application in a multi-operator mixed networking deployment network environment. CN107292271A provides a learning monitoring method, apparatus and electronic device, and the method includes: acquiring a class image of a classroom student; identifying the images in class, and acquiring feature data of the students in class, wherein the feature data comprises at least one of the following data: the facial feature data of the classroom student, the visual feature data of the classroom student and the limb feature data of the classroom student; and determining the class state of the classroom student according to the characteristic data of the classroom student. The system can effectively and accurately monitor the class attendance of students who study through a computer and a network in a classroom, and provides effective reference for subsequent study and teaching so as to further improve the study or teaching process. Although the existing learning monitoring method can monitor the class state of the student based on the feature data of the student, the following two problems still exist: firstly, the existing audio monitoring method can only monitor whether the audio violates rules, and whether the playing content of the earphone is consistent with the theme of the content of classroom teaching in class needs to be judged; secondly, the traditional method does not monitor the class state according to the auditory characteristics. To solve the problem, the patent provides an intelligent earphone learning monitoring method applied to the learning field.
Disclosure of Invention
In view of this, the invention provides a learning monitoring means for an intelligent headset applied in the learning field, and aims to (1) use a gate control circulation unit to replace an LSTM part in a traditional speech recognition model, wherein the gate control circulation unit has less parameter quantity and calculation quantity and can complete speech recognition more quickly; (2) And in the process of topic identification, topic selection probabilities of other word pairs are dynamically considered, and more accurate speech topic identification is realized based on global topic identification.
The invention provides a learning monitoring means of an intelligent earphone applied to the learning field, which comprises the following steps:
s1: collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
s2: constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result;
s3: performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
s4: constructing a dynamic weighted speech theme recognition model based on the word pair compactness, inputting the word compactness of a speech recognition result into the model, outputting the speech theme by the model, inputting the speech recognition result based on the word pair compactness by the model, and outputting the speech theme;
s5: and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, which indicates that the student does not attend the course, otherwise, the student is attending the course.
As a further improvement of the method of the invention:
optionally, the acquiring an audio signal of the smart headset in the step S1, and preprocessing the acquired audio signal of the smart headset includes:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device, if the frequency of the acquired signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and has no attenuation effect on the signal, if the frequency of the acquired signal is lower than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is large and attenuates the signal, the lower the frequency of the acquired signal is, the higher the attenuation degree is, the audio signal passing through the high-pass filter is x (t), wherein t is e [ t [ ] [ t ] ] 0 ,t L ]Representing timing information of the acquired audio signal, t 0 Representing the initial moment of audio signal acquisition, t L Representing a cut-off time of audio signal acquisition;
said cut-off frequency f t The calculation formula of (c) is:
Figure BDA0003804861620000021
wherein:
r represents a resistance value of a resistor in the high-pass filter;
c represents the capacitance of the capacitive element in the high-pass filter;
a digital-to-analog converter in the signal receiving device converts the collected audio signal x (t) into an analog signal, and the conversion process of the analog signal is as follows:
build up of frequency f s Impulse sequence p (t) of (1):
Figure BDA0003804861620000022
Figure BDA0003804861620000023
conversion of an audio signal x (t) into an analog signal on the basis of an impulse sequence p (t)
Figure BDA0003804861620000024
Figure BDA0003804861620000025
The number of sampling points in the analog signal is N, so that the analog signal is
Figure BDA0003804861620000026
X (1) and the last sample point is x (N), then the analog signal is obtained
Figure BDA0003804861620000027
x (n') is the converted analog signal;
the signal acquisition device carries out windowing processing on the converted analog signal x (n'), wherein a window function of the windowing processing is a Hamming window, and a windowing processing formula is as follows:
x(n″)=x(n′)×w(n′)
Figure BDA0003804861620000028
wherein:
w (n') is a window function;
a represents a window function coefficient, which is set to 0.43;
and x (n ') is the audio signal after windowing processing, and x (n') is taken as the audio signal of the intelligent earphone after preprocessing.
Optionally, the performing feature extraction on the preprocessed audio signal of the smart headset in the step S1 to obtain the extracted audio signal feature includes:
performing feature extraction on the preprocessed intelligent earphone audio signal x (n '), so as to obtain an extracted audio signal feature F (x (n'), wherein the feature extraction process comprises the following steps:
s11: performing fast Fourier transform processing on the preprocessed intelligent earphone audio signal x (n ') to obtain the frequency domain characteristics of x (n'), wherein the formula of the fast Fourier transform processing is as follows:
Figure BDA0003804861620000031
wherein:
x (k) represents the result of the fast Fourier transform of X (n') at point k;
k is the number of points of the fast Fourier transform, j represents the unit of imaginary number, j 2 =-1;
S12: constructing a filter bank having M filters, the mth filter in the filter bank having a frequency response of f m (k):
Figure BDA0003804861620000032
Figure BDA0003804861620000033
Wherein:
f (m) represents the center frequency of the mth filter;
s13: inputting the fast Fourier transform result into a filter bank, wherein the filter bank outputs the energy of a signal frequency domain:
Figure BDA0003804861620000034
wherein:
en represents the energy of the audio signal of the smart headset in the frequency domain;
s14: extracting the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is as follows:
Figure BDA0003804861620000035
wherein:
l 'represents the characteristic order, L' =1, 2., 12;
c (L') represents the extracted L-order features;
the feature that maximizes C (L') is selected as the extracted smart headphone audio signal feature F (x (n ")).
Optionally, the constructing an improved end-to-end speech recognition model in step S2 includes:
constructing an improved end-to-end voice recognition model, wherein the model input is the audio signal characteristics obtained by extraction, and the model output is a voice recognition result, the improved end-to-end voice recognition model comprises a convolution layer, a coding layer and a decoding layer, the convolution layer is used for receiving the audio signal characteristics to be recognized and carrying out convolution processing on the audio signal characteristics, the coding layer is used for carrying out coding processing on the characteristics after the convolution processing by utilizing a gating circulation unit, the decoding layer is used for decoding the coding processing result, and the decoding result is the voice recognition result; the process of performing speech recognition on the audio signal features by using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (-) represents a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution characteristic into an encoding layer based on a gating cycle unit, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gating cycle units, each gating cycle unit is set to be bidirectional recursion, and the output result of the last gating cycle unit is used as the encoding result of the audio signal characteristic;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is:
Figure BDA0003804861620000041
wherein:
N x the number of characters representing the real speech text;
a 1,x number of alternative characters in speech recognition result representing output of model, a 2,x Number of deleted characters in speech recognition result representing model output, a 3,x Representing the number of inserted characters in the voice recognition result output by the model;
the extracted audio signal features F (x (n ″)) are input into a model, resulting in a speech recognition result y (x (n ″)).
Optionally, the performing word pair conversion on the recognized speech result in the step S3 to obtain a converted word pair includes:
constructing corpora of different topic categories, wherein each corpus comprises all words in the topic category, and combining the constructed corpora of different topic categories into a corpus A, wherein the corpus A = { topic = 1 ,topic 2 ,...,topic h Therein topic, of h The method comprises the steps that a corpus is a h-th topic type corpus, all words in the corpus A are coded by a single hot coding method according to the sequence of the words in the corpus A, and a coding result set B of the corpus A is obtained;
in the embodiment of the invention, the distances of the single-hot coding results of the words in the same subject category are close;
based on the word encoding result of the corpus a, performing word pair conversion on the recognized speech result y (x (n ")), wherein the word pair conversion process comprises the following steps:
s31: taking the corpus A as a word segmentation dictionary, calculating the maximum length of words in the word segmentation dictionary to be z, and intercepting a voice recognition resultMatching the intercepted result with the words in the word segmentation dictionary, if the matching is successful, taking the matching result as the word segmentation result of the intercepted text, and repeating the step on the residual voice recognition result texts to obtain the word segmentation result of the voice recognition result
Figure BDA0003804861620000042
Wherein
Figure BDA0003804861620000043
Is the n-th of the speech recognition result w Word, n w The total number of words of the voice recognition result is represented, and if the matching is unsuccessful, the front z-1 bits are intercepted and matched again;
s32: result of word segmentation
Figure BDA0003804861620000044
Any two word segmentation results w n1 ,w n2 Form a set of word pairs (w) n1 ,w n2 ) Wherein n1 > n2, n1, n 2. Epsilon. [1, n ] w ];
S33: obtaining a set of conversion word pairs G = { (w) of speech recognition results n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
Optionally, the calculating the word pair compactness of the converted word pair in the step S3 includes:
traversing and selecting word segmentation result from coding result set B
Figure BDA0003804861620000045
Obtaining a coding result set of the speech recognition result
Figure BDA0003804861620000046
Wherein
Figure BDA0003804861620000047
Is composed of
Figure BDA0003804861620000048
And obtaining the coding result of the speech recognition result word pairResult set
Figure BDA0003804861620000049
Calculating the word pair compactness of the conversion word pair in the voice recognition result:
Figure BDA00038048616200000410
wherein:
t represents transposition;
the word pair compactness set in the voice recognition result is as follows: g d ={dis(w n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
Optionally, the constructing a dynamic weighted speech topic recognition model based on the word pair compactness in the step S4 includes:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: set G of word pair compactness d The compactness of Chinese word pair is higher than the threshold value
Figure BDA0003804861620000051
The word pairs form a word pair set G1, and the word pair subject in the word pair set G1 is initialized, wherein the word pair subject categories are h types of subject categories in a corpus A;
s42: setting the current iteration times of the model as r, setting the initial value of r as 0, and setting the maximum iteration times of the model as Max;
s43: when the (r + 1) th iteration is calculated, calculating any word pair q in the word pair set G1 v As a subject
Figure BDA0003804861620000052
Probability of (c):
Figure BDA0003804861620000053
Figure BDA0003804861620000054
Figure BDA0003804861620000055
Figure BDA0003804861620000056
Figure BDA0003804861620000057
wherein:
alpha, beta represents Dirichlet prior parameters;
Figure BDA0003804861620000058
representing the r +1 st iteration word pair q v Is subject of
Figure BDA0003804861620000059
Figure BDA00038048616200000510
Representing h topic categories in the corpus A;
Figure BDA00038048616200000511
denotes the word pair q divided in G1 at the r-th iteration v Is divided into themes
Figure BDA00038048616200000512
The number of word pairs of;
Figure BDA00038048616200000513
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v Is divided into topics
Figure BDA00038048616200000514
The number of times of (c);
Figure BDA00038048616200000515
representing removed word pairs q v By the (r + 1) th iteration, the word pair q v The second word in (1) is divided into topics
Figure BDA00038048616200000516
The number of times of (c);
Figure BDA00038048616200000517
representing removed word pairs q v By the r +1 th iteration, all words are divided into topics
Figure BDA00038048616200000518
The number of times of (c);
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the vocabulary compactness of the voice recognition result into a model, outputting a voice theme by the model, wherein the formula for calculating the voice theme of the voice recognition result is as follows:
Figure BDA00038048616200000519
wherein:
p(topic|G d g) represents the probability that the topic of the speech recognition result is topic, topic ∈ [1, h [ ]]Representing h types of theme categories in the corpus A, and selecting the theme with the highest probability as a voice theme;
Figure BDA00038048616200000520
representing word pairs q v Two words in (1), the number of occurrences in the speech recognition result, count G Representing the total number of words in the speech recognition result.
Optionally, in the step S5, similarity calculation is performed on the recognized voice theme and the current teaching content theme, if the similarity calculation structure is smaller than a specified threshold, it indicates that the voice theme in the earphone is different from the teaching content theme, which indicates that the student does not attend the lesson, otherwise, the method is attending the lesson, and includes:
and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein the similarity calculation method is a cosine similarity calculation method, if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, and the fact that the student does not attend the course is indicated, otherwise, the student attends the course.
In order to solve the above problems, the present invention further provides a learning monitoring apparatus for an intelligent headset applied in the learning field, wherein the apparatus comprises:
the signal acquisition device is used for acquiring the audio signal of the intelligent earphone, preprocessing the acquired audio signal of the intelligent earphone and extracting the characteristics of the preprocessed audio signal of the intelligent earphone to obtain the extracted audio signal characteristics;
the voice recognition module is used for constructing an improved end-to-end voice recognition model and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
and the recognition monitoring device is used for performing word pair conversion on the recognized voice result, calculating the word pair compactness of the converted word pair, constructing a dynamic weighted voice theme recognition model based on the word pair compactness, inputting the word compactness of the voice recognition result into the model, outputting a voice theme by the model, performing similarity calculation on the recognized voice theme and the current teaching content theme, and if the similarity calculation structure is smaller than a specified threshold value, indicating that the voice theme in the earphone is different from the teaching content theme, and indicating that the student does not attend the lesson, otherwise, attending the lesson.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the learning monitoring means applied to the intelligent headset in the learning field.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, where the at least one instruction is executed by a processor in an electronic device to implement the above learning monitoring means applied to an intelligent headset in the learning field.
Compared with the prior art, the invention provides a learning monitoring means applied to the intelligent headset in the learning field, and the technology has the following advantages:
firstly, the scheme provides a voice recognition model, wherein an improved end-to-end voice recognition model is constructed, the model is input as the extracted audio signal characteristics, and the model is output as a voice recognition result, the improved end-to-end voice recognition model comprises a convolution layer, a coding layer and a decoding layer, the convolution layer is used for receiving the audio signal characteristics to be recognized and carrying out convolution processing on the audio signal characteristics, the coding layer is used for carrying out coding processing on the characteristics after the convolution processing by using a gate control circulation unit, the decoding layer is used for decoding the coding processing result, and the decoding result is the voice recognition result; the process of performing speech recognition on the audio signal features by using the end-to-end speech recognition model comprises the following steps: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein: f represents the audio signal characteristics input to the end-to-end speech recognition model; conv (-) represents a convolution operation; f1 represents a convolution feature; inputting the convolution characteristic into an encoding layer based on a gating cycle unit, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gating cycle units, each gating cycle unit is set to be bidirectional recursion, and the output result of the last gating cycle unit is used as the encoding result of the audio signal characteristic; the decoding layer outputs the probability P (y | x) of the decoding result y corresponding to each section of the coding result x by utilizing a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result; the training objective function of the end-to-end speech recognition model is as follows:
Figure BDA0003804861620000061
wherein: n is a radical of x The number of characters representing the real speech text; a is a 1,x Number of alternative characters in speech recognition result representing output of model, a 2,x Representing the number of deleted characters in the speech recognition result output by the model, a 3,x Representing the number of inserted characters in the voice recognition result output by the model; the extracted audio signal features F (x (n ")) are input into a model, resulting in a speech recognition result y (x (n")). According to the improved model, the gating circulation unit is used for replacing an LSTM part in a traditional speech recognition model, the gating circulation unit has less parameter quantity and calculation quantity, and speech recognition can be completed more quickly.
Meanwhile, the scheme provides a topic recognition model, a dynamic weighted speech topic recognition model based on the word pair compactness is constructed, the model is input as a speech recognition result based on the word pair compactness, and is output as a speech topic, and the topic recognition process of the dynamic weighted speech topic recognition model is as follows: set G of word pair compactness d The compactness of Chinese word pair is higher than the threshold value
Figure BDA0003804861620000062
The word pairs form a word pair set G1, and the word pair subject in the word pair set G1 is initialized, wherein the word pair subject categories are h types of subject categories in a corpus A; setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max; when the (r + 1) th iteration is calculated, calculating any word pair q in the word pair set G1 v As a subject
Figure BDA0003804861620000071
Probability of (c):
Figure BDA0003804861620000072
Figure BDA0003804861620000073
Figure BDA0003804861620000074
Figure BDA0003804861620000075
Figure BDA0003804861620000076
wherein: alpha, beta represents Dirichlet prior parameters;
Figure BDA0003804861620000077
representing the r +1 th iteration word pair q v The subject matter of is
Figure BDA0003804861620000078
Figure BDA0003804861620000079
Representing h topic categories in the corpus A;
Figure BDA00038048616200000710
indicates that at the r-th iteration, the word pair q is divided in G1 v Is divided into themes
Figure BDA00038048616200000711
The number of word pairs of;
Figure BDA00038048616200000712
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v Is divided into topics
Figure BDA00038048616200000713
The number of times of (c);
Figure BDA00038048616200000714
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v The second word in (1) is divided into topics
Figure BDA00038048616200000715
The number of times of (c);
Figure BDA00038048616200000716
representing removed word pairs q v By the r +1 th iteration, all words are divided into topics
Figure BDA00038048616200000717
The number of times of (c); judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step; inputting the vocabulary compactness of the voice recognition result into a model, outputting a voice theme by the model, wherein the formula for calculating the voice theme of the voice recognition result is as follows:
Figure BDA00038048616200000718
wherein: p (topic | G) d G) represents the probability that the topic of the speech recognition result is topic, which belongs to [1, h ]]Representing h types of theme categories in the corpus A, and selecting the theme with the highest probability as a voice theme;
Figure BDA00038048616200000719
representing word pairs q v Two words in the speech recognition resultNumber of occurrences, count G Representing the total number of words in the speech recognition result. And performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein the similarity calculation method is a cosine similarity calculation method, if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, and the fact that the student does not attend the course is indicated, otherwise, the student attends the course. According to the scheme, a dynamic weighted speech topic recognition model is constructed based on word pair compactness, a traditional model does not distinguish core word pairs and non-core word pairs, all word pairs are considered uniformly, so that core words representing text topic information are ignored, and topic recognition accuracy is influenced.
Drawings
Fig. 1 is a schematic flowchart of a learning monitoring means of an intelligent headset applied in the learning field according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of one step of the embodiment of FIG. 1;
fig. 3 is a functional block diagram of a learning monitoring apparatus for an intelligent headset applied in the learning field according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring means applied to an intelligent headset in the learning field according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a learning monitoring means applied to an intelligent headset in the learning field. The execution subject of the learning monitoring means applied to the smart headset in the learning field includes, but is not limited to, at least one of electronic devices such as a server and a terminal that can be configured to execute the method provided by the embodiment of the present application. In other words, the learning monitoring means applied to the smart headset in the learning field may be executed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
s1: the method comprises the steps of collecting intelligent earphone audio signals, preprocessing the collected intelligent earphone audio signals, and extracting features of the preprocessed intelligent earphone audio signals to obtain extracted audio signal features.
Gather intelligent earphone audio signal in the S1 step to carry out the preliminary treatment to the intelligent earphone audio signal who gathers, include:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device, if the frequency of the acquired signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is smaller and has no attenuation effect on the signal, if the frequency of the acquired signal is lower than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is larger and attenuates the signal, the lower the frequency of the acquired signal is, the higher the attenuation degree is, the audio signal passing through the high-pass filter is x (t), wherein t belongs to [ t ∈ [ t ] t 0 ,t L ]Representing timing information of the acquired audio signal, t 0 Representing the initial moment of audio signal acquisition, t L Representing a cut-off time of audio signal acquisition;
said cut-off frequency f t The calculation formula of (c) is:
Figure BDA0003804861620000081
wherein:
r represents a resistance value of a resistor in the high-pass filter;
c represents the capacitance of the capacitive element in the high-pass filter;
a digital-to-analog converter in the signal receiving device converts the collected audio signal x (t) into an analog signal, and the conversion process of the analog signal is as follows:
build up frequency of f s Impulse sequence p (t):
Figure BDA0003804861620000082
Figure BDA0003804861620000083
conversion of an audio signal x (t) into an analog signal on the basis of an impulse sequence p (t)
Figure BDA0003804861620000084
Figure BDA0003804861620000085
The number of sampling points in the analog signal is N, so that the analog signal is
Figure BDA0003804861620000086
X (1) and the last sample point is x (N), then the analog signal is obtained
Figure BDA0003804861620000087
x (n') is the converted analog signal;
the signal acquisition device carries out windowing processing on the converted analog signal x (n'), wherein a window function of the windowing processing is a Hamming window, and a windowing processing formula is as follows:
x(n″)=x(n′)×w(n′)
Figure BDA0003804861620000091
wherein:
w (n') is a window function;
a represents a window function coefficient, which is set to 0.43;
and x (n ') is the audio signal after windowing processing, and x (n') is taken as the audio signal of the intelligent earphone after preprocessing.
And in the step S1, feature extraction is carried out on the preprocessed intelligent earphone audio signal to obtain the extracted audio signal features, and the method comprises the following steps:
performing feature extraction on the preprocessed audio signal x (n ") of the smart headset to obtain an extracted audio signal feature F (x (n")), specifically, referring to fig. 2, the feature extraction process includes:
s11: carrying out fast Fourier transform processing on the preprocessed intelligent earphone audio signal x (n ') to obtain the frequency domain characteristics of x (n'), wherein the formula of the fast Fourier transform processing is as follows:
Figure BDA0003804861620000092
wherein:
x (k) represents the result of fast Fourier transform of X (n') at k points;
k is the number of points of the fast Fourier transform, j represents the unit of imaginary number, j 2 =-1;
S12: constructing a filter bank having M filters, the mth filter in the filter bank having a frequency response of f m (k):
Figure BDA0003804861620000093
Figure BDA0003804861620000094
Wherein:
f (m) represents the center frequency of the mth filter;
s13: inputting the result of the fast Fourier transform into a filter bank, and outputting the energy of a signal frequency domain by the filter bank:
Figure BDA0003804861620000095
wherein:
en represents the energy of the audio signal of the smart headset in the frequency domain;
extracting the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is as follows:
Figure BDA0003804861620000096
wherein:
l 'represents the characteristic order, L' =1, 2., 12;
c (L') represents the extracted L-order features;
the feature that maximizes C (L') is selected as the extracted smart headphone audio signal feature F (x (n ")).
S2: and constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result.
The step S2 of constructing an improved end-to-end speech recognition model includes:
constructing an improved end-to-end voice recognition model, wherein the model input is the audio signal characteristics obtained by extraction, and the model output is a voice recognition result, the improved end-to-end voice recognition model comprises a convolution layer, a coding layer and a decoding layer, the convolution layer is used for receiving the audio signal characteristics to be recognized and carrying out convolution processing on the audio signal characteristics, the coding layer is used for carrying out coding processing on the characteristics after the convolution processing by utilizing a gating circulation unit, the decoding layer is used for decoding the coding processing result, and the decoding result is the voice recognition result; the process of performing speech recognition on the audio signal features by using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (·) denotes a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution characteristic into an encoding layer based on a gating cycle unit, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gating cycle units, each gating cycle unit is set to be bidirectional recursion, and the output result of the last gating cycle unit is used as the encoding result of the audio signal characteristic;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is as follows:
Figure BDA0003804861620000101
wherein:
N x the number of characters representing the real speech text;
a 1,x number of alternative characters in speech recognition result representing output of model, a 2,x Representing the number of deleted characters in the speech recognition result output by the model, a 3,x Representing the number of inserted characters in the voice recognition result output by the model;
the extracted audio signal features F (x (n ″)) are input into a model, resulting in a speech recognition result y (x (n ″)).
S3: and performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair.
In the step S3, performing word pair conversion on the recognized speech result to obtain a converted word pair, including:
constructing corpora of different topic categories, wherein each corpus comprises all words in the topic category, and combining the constructed corpora of different topic categories into a corpus A, wherein the corpus A = { topic = 1 ,topic 2 ,...,topic h Wherein topic h The method comprises the steps that a corpus of the h-th theme category is adopted, all words in the corpus A are coded by a single-hot coding method according to the sequence of the words in the corpus A, and a coding result set B of the corpus A is obtained;
in the embodiment of the invention, the distances of the one-hot coding results of the words of the same subject category are similar;
based on the word encoding result of the corpus A, performing word pair conversion on the recognized voice result y (x (n ")), wherein the word pair conversion process comprises the following steps:
s31: taking the corpus A as a word segmentation dictionary, calculating the maximum length of words in the word segmentation dictionary to be z, intercepting the front z position of a voice recognition result, matching the intercepted result with the words in the word segmentation dictionary, taking the matching result as the word segmentation result of the intercepted text if the matching is successful, and repeating the step on the rest voice recognition result texts to obtain the word segmentation result of the voice recognition result
Figure BDA0003804861620000102
Wherein
Figure BDA0003804861620000103
Is the nth of the speech recognition result w Word, n w Representing the total number of words of the voice recognition result, and if the matching is unsuccessful, intercepting the front z-1 bits for re-matching;
s32: result of word segmentation
Figure BDA0003804861620000104
Any two word segmentation knotsFruit w n1 ,w n2 Form a set of word pairs (w) n1 ,w n2 ) Wherein n1 > n2, n1, n 2. Epsilon. [1, n ] w ];
S33: obtaining a set of conversion word pairs G = { (w) of speech recognition results n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
And the step S3 of calculating the word pair compactness of the converted word pair comprises the following steps:
traversing and selecting word segmentation result from coding result set B
Figure BDA0003804861620000111
Obtaining a coding result set of speech recognition results
Figure BDA0003804861620000112
Wherein
Figure BDA0003804861620000113
Is composed of
Figure BDA0003804861620000114
And obtaining a coding result set G of the speech recognition result word pairs b ={(b n1 ,b n2 )|n1>n2,n1,n2∈[1,n w ]};
Calculating the word pair compactness of the conversion word pair in the voice recognition result:
Figure BDA0003804861620000115
wherein:
t represents transposition;
the word pair compactness set in the voice recognition result is as follows: g d ={dis(w n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
S4: and constructing a dynamic weighted speech theme recognition model based on the word pair compactness, inputting the word compactness of the speech recognition result into the model, outputting the speech theme by the model, inputting the speech recognition result based on the word pair compactness by the model, and outputting the speech theme.
And in the step S4, a dynamic weighted speech topic recognition model is constructed based on the word pair compactness, and the method comprises the following steps:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: set G of word pair compactness d The compactness of Chinese word pair is higher than the threshold value
Figure BDA00038048616200001123
The word pairs form a word pair set G1, and the word pair subject in the word pair set G1 is initialized, wherein the word pair subject categories are h types of subject categories in a corpus A;
s42: setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max;
s43: when the (r + 1) th iteration is calculated, any word pair q in the word pair set G1 is calculated v As a subject
Figure BDA0003804861620000116
Probability of (c):
Figure BDA0003804861620000117
Figure BDA0003804861620000118
Figure BDA0003804861620000119
Figure BDA00038048616200001110
Figure BDA00038048616200001111
wherein:
alpha, beta represents Dirichlet prior parameters;
Figure BDA00038048616200001112
representing the r +1 th iteration word pair q v The subject matter of is
Figure BDA00038048616200001113
Figure BDA00038048616200001114
Representing h topic categories in the corpus A;
Figure BDA00038048616200001115
denotes the word pair q divided in G1 at the r-th iteration v Is divided into themes
Figure BDA00038048616200001116
The number of word pairs of;
Figure BDA00038048616200001117
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v Is divided into topics
Figure BDA00038048616200001118
The number of times of (c);
Figure BDA00038048616200001119
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v The second word in (b) is divided into topics
Figure BDA00038048616200001120
The number of times of (c);
Figure BDA00038048616200001121
representing removed word pairs q v By the r +1 th iteration, all words are divided into topics
Figure BDA00038048616200001122
The number of times of (c);
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the vocabulary compactness of the voice recognition result into a model, outputting a voice theme by the model, wherein the formula for calculating the voice theme of the voice recognition result is as follows:
Figure BDA0003804861620000121
wherein:
p(topic|G d g) represents the probability that the topic of the speech recognition result is topic, topic ∈ [1, h [ ]]Representing h types of theme categories in the corpus A, and selecting the theme with the highest probability as a voice theme;
Figure BDA0003804861620000122
representing word pairs q v Two words in (1), number of occurrences in the speech recognition result, count G Representing the total number of words in the speech recognition result.
S5: and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, which indicates that the student does not attend the course, otherwise, the student is attending the course.
And S5, performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, which indicates that the student does not attend the course, otherwise, the method comprises the following steps:
and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein the similarity calculation method is a cosine similarity calculation method, if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, and the fact that the student does not attend the course is indicated, otherwise, the student attends the course.
Example 2:
fig. 3 is a functional block diagram of a learning monitoring apparatus applied to an intelligent headset in the learning field according to an embodiment of the present invention, which can implement the learning monitoring means of the intelligent headset in embodiment 1.
The learning monitoring device 100 of the intelligent headset applied to the learning field of the present invention can be installed in an electronic device. According to the realized function, the learning monitoring device applied to the intelligent headset in the learning field may include a signal acquisition device 101, a voice recognition module 102 and a recognition monitoring device 103. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
The signal acquisition device 101 is used for acquiring an intelligent earphone audio signal, preprocessing the acquired intelligent earphone audio signal, and extracting the characteristics of the preprocessed intelligent earphone audio signal to obtain the extracted audio signal characteristics;
the speech recognition module 102 is configured to construct an improved end-to-end speech recognition model, and input the extracted audio signal characteristics into the model to obtain a speech recognition result;
and the recognition monitoring device 103 is used for performing word pair conversion on the recognized voice result, calculating the word pair compactness of the converted word pair, constructing a dynamic weighted voice theme recognition model based on the word pair compactness, inputting the word compactness of the voice recognition result into the model, outputting a voice theme by the model, performing similarity calculation on the recognized voice theme and the current teaching content theme, and if the similarity calculation structure is smaller than a specified threshold value, indicating that the voice theme in the earphone is different from the teaching content theme, indicating that the student does not attend the lesson, otherwise, attending the lesson.
In detail, when the modules in the learning monitoring apparatus 100 for an intelligent headset in the learning field according to the embodiment of the present invention are used, the same technical means as the learning monitoring means for an intelligent headset in the learning field described in fig. 1 above is adopted, and the same technical effects can be produced, which is not described herein again.
Example 3:
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring means applied to an intelligent headset in the learning field according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the program 12, but also to temporarily store data that has been output or will be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (programs 12 for learning monitoring, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 4 only shows an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-Fl interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions, which when executed in the processor 10, may implement:
collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
constructing an improved end-to-end voice recognition model, and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
constructing a dynamic weighted speech topic recognition model based on the word pair compactness, inputting the word compactness of a speech recognition result into the model, and outputting a speech topic by the model;
and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, which indicates that the student does not attend the teaching, otherwise, the student is attending the teaching.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 4, which is not repeated herein.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one of 8230, and" comprising 8230does not exclude the presence of additional like elements in a process, apparatus, article, or method comprising the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims (8)

1. A learning monitoring means for an intelligent headset in the learning field, the means comprising:
s1: collecting an intelligent earphone audio signal, preprocessing the collected intelligent earphone audio signal, and performing feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
s2: constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model to obtain a voice recognition result, wherein the model is input as the extracted audio signal characteristics, and is output as the voice recognition result;
s3: performing word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair;
s4: the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, inputting the word compactness of a speech recognition result into the model, outputting a speech theme by the model, inputting the speech recognition result based on the word pair compactness by the model, and outputting the speech theme, wherein the method comprises the following steps:
the method comprises the following steps of constructing a dynamic weighted speech theme recognition model based on word pair compactness, wherein the model input is a speech recognition result based on the word pair compactness, and the model output is a speech theme, and the theme recognition process of the dynamic weighted speech theme recognition model comprises the following steps:
s41: set G of word pair compactness d The compactness of Chinese word pair is higher than the threshold value
Figure FDA0003804861610000011
The word pairs form a word pair set G1, and the word pair subject in the word pair set G1 is initialized, wherein the word pair subject categories are h types of subject categories in a corpus A;
s42: setting the current iteration number of the model as r, setting the initial value of r as 0, and setting the maximum iteration number of the model as Max;
s43: when the (r + 1) th iteration is calculated, calculating any word pair q in the word pair set G1 v As a subject
Figure FDA0003804861610000012
Probability of (c):
Figure FDA0003804861610000013
Figure FDA0003804861610000014
Figure FDA0003804861610000015
Figure FDA0003804861610000016
Figure FDA0003804861610000017
wherein:
alpha, beta represents Dirichlet prior parameters;
Figure FDA0003804861610000018
representing the r +1 th iteration word pair q v The subject matter of is
Figure FDA0003804861610000019
Representing h topic categories in the corpus A;
Figure FDA00038048616100000110
denotes the word pair q divided in G1 at the r-th iteration v Is divided into themes
Figure FDA00038048616100000111
The number of word pairs of;
Figure FDA00038048616100000112
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v Is divided into topics
Figure FDA00038048616100000113
The number of times of (c);
Figure FDA00038048616100000114
representing removed word pairs q v Up to the (r + 1) th iteration, word pair q v The second word in (b) is divided into topics
Figure FDA00038048616100000115
The number of times of (c);
Figure FDA00038048616100000116
representing removed word pairs q v By the r +1 th iteration, all words are divided into topics
Figure FDA00038048616100000117
The number of times of (c);
s44: judging whether r +1 is greater than or equal to Max, if r +1 is greater than or equal to Max, outputting the probability of the subject categories to which different word pairs in the r +1 th iteration word pair set G1 belong, and if not, making r = r +1, and returning to the step S43;
inputting the vocabulary compactness of the voice recognition result into a model, outputting a voice theme by the model, wherein the formula for calculating the voice theme of the voice recognition result is as follows:
Figure FDA0003804861610000021
wherein:
p(topic|G d g) represents the probability that the topic of the speech recognition result is topic, which belongs to [1, h ]]Representing h types of theme categories in the corpus A, and selecting the theme with the highest probability as a voice theme;
Figure FDA0003804861610000029
representing word pairs q v Two words in (1), number of occurrences in the speech recognition result, count G Representing the total number of words in the speech recognition result;
s5: and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, which indicates that the student does not attend the teaching, otherwise, the student is attending the teaching.
2. The learning monitoring means of an intelligent headset applied in the learning field as claimed in claim 1, wherein the step S1 of collecting audio signals of the intelligent headset and preprocessing the collected audio signals of the intelligent headset comprises:
the method comprises the following steps of acquiring audio signals in the intelligent earphone by utilizing a built-in signal acquisition device in the intelligent earphone, and preprocessing the acquired audio signals of the intelligent earphone, wherein the preprocessing process comprises the following steps:
a high-pass filter is arranged in the signal acquisition device, if the frequency of the acquired signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and has no attenuation effect on the signal, if the frequency of the acquired signal is lower than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is large and attenuates the signal, the lower the frequency of the acquired signal is, the higher the attenuation degree is, the audio signal passing through the high-pass filter is x (t), wherein t is e [ t [ ] [ t ] ] 0 ,t L ]Representing timing information of the acquired audio signal, t 0 Representing the initial moment of audio signal acquisition, t L Representing a cut-off time of audio signal acquisition;
said cut-off frequency f t The calculation formula of (c) is:
Figure FDA0003804861610000022
wherein:
r represents a resistance value of a resistor in the high-pass filter;
c represents the capacitance of the capacitive element in the high-pass filter;
a digital-to-analog converter in the signal receiving device converts the collected audio signal x (t) into an analog signal, and the conversion process of the analog signal is as follows:
build up of frequency f s Impulse sequence p (t) of (1):
Figure FDA0003804861610000023
Figure FDA0003804861610000024
conversion of an audio signal x (t) into an analog signal based on an impulse sequence p (t)
Figure FDA0003804861610000025
Figure FDA0003804861610000026
The number of sampling points in the analog signal is N, so that the analog signal is
Figure FDA0003804861610000027
X (1) and the last sample point is x (N), then the analog signal is obtained
Figure FDA0003804861610000028
n′∈[1,N]X (n') is a converted analog signal;
the signal acquisition device carries out windowing processing on the converted analog signal x (n'), wherein the window function of the windowing processing is a Hamming window, and the windowing processing formula is as follows:
x(n″)=x(n′)×w(n′)
Figure FDA0003804861610000031
wherein:
w (n') is a window function;
a represents a window function coefficient, which is set to 0.43;
and x (n ') is the audio signal after windowing processing, and x (n') is taken as the audio signal of the intelligent earphone after preprocessing.
3. The learning monitoring means of the intelligent earphone applied in the learning field as claimed in claim 2, wherein the step S1 of extracting the features of the audio signal of the intelligent earphone after the preprocessing to obtain the extracted features of the audio signal comprises:
performing feature extraction on the preprocessed intelligent earphone audio signal x (n '), so as to obtain an extracted audio signal feature F (x (n'), wherein the feature extraction process comprises the following steps:
s11: performing fast Fourier transform processing on the preprocessed intelligent earphone audio signal x (n ') to obtain the frequency domain characteristics of x (n'), wherein the formula of the fast Fourier transform processing is as follows:
Figure FDA0003804861610000032
wherein:
x (k) represents the result of fast Fourier transform of X (n') at k points;
k is the number of points of the fast Fourier transform, j represents the unit of imaginary number, j 2 =-1;
S12: constructing a filter bank having M filters, the mth filter in the filter bank having a frequency response of f m (k):
Figure FDA0003804861610000033
Figure FDA0003804861610000034
Wherein:
f (m) represents the center frequency of the mth filter;
s13: inputting the fast Fourier transform result into a filter bank, wherein the filter bank outputs the energy of a signal frequency domain:
Figure FDA0003804861610000035
wherein:
en represents the energy of the audio signal of the smart headset in the frequency domain;
s14: extracting the features of the audio signal of the intelligent earphone, wherein the feature extraction formula is as follows:
Figure FDA0003804861610000036
wherein:
l 'represents the characteristic order, L' =1, 2., 12;
c (L') represents the extracted L-order features;
the feature that maximizes C (L') is selected as the extracted smart headphone audio signal feature F (x (n ″)).
4. The learning monitoring tool applied to intelligent headsets in the learning field as claimed in claim 1, wherein the step S2 of constructing the improved end-to-end speech recognition model comprises:
constructing an improved end-to-end voice recognition model, wherein the model input is the audio signal characteristics obtained by extraction, and the model output is a voice recognition result, the improved end-to-end voice recognition model comprises a convolution layer, a coding layer and a decoding layer, the convolution layer is used for receiving the audio signal characteristics to be recognized and carrying out convolution processing on the audio signal characteristics, the coding layer is used for carrying out coding processing on the characteristics after the convolution processing by utilizing a gate control circulation unit, the decoding layer is used for decoding the coding processing result, and the decoding result is the voice recognition result; the process of performing speech recognition on the audio signal features by using the end-to-end speech recognition model comprises the following steps:
s21: the convolution layer performs convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
f represents the audio signal characteristics input to the end-to-end speech recognition model;
conv (-) represents a convolution operation;
f1 represents a convolution feature;
s22: inputting the convolution characteristics into an encoding layer based on gated cyclic units, wherein the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gated cyclic units, each gated cyclic unit is set to be bidirectional recursion, and the output result of the last gated cyclic unit is used as the encoding result of the audio signal characteristics;
s23: the decoding layer outputs the probability P (y | x) of a decoding result y corresponding to each section of the coding result x by using a softmax function, and selects y with the maximum probability as the decoding result of the coding result x, wherein the decoding result is a voice recognition result;
the training objective function of the end-to-end speech recognition model is as follows:
Figure FDA0003804861610000041
wherein:
N x words representing true phonetic textA symbol number;
a 1,x number of alternative characters in speech recognition result representing output of model, a 2,x Representing the number of deleted characters in the speech recognition result output by the model, a 3,x Representing the number of inserted characters in the voice recognition result output by the model;
the extracted audio signal features F (x (n ")) are input into a model, resulting in a speech recognition result y (x (n")).
5. The learning monitoring means of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S3 of performing word pair conversion on the recognized speech result to obtain a converted word pair comprises:
constructing corpora of different topic categories, wherein each corpus comprises all words in the topic category, and combining the constructed corpora of different topic categories into a corpus A, wherein the corpus A = { topic = 1 ,topic 2 ,...,topic h Wherein topic h The method comprises the steps that a corpus is a h-th topic type corpus, all words in the corpus A are coded by a single hot coding method according to the sequence of the words in the corpus A, and a coding result set B of the corpus A is obtained;
based on the word encoding result of the corpus A, performing word pair conversion on the recognized voice result y (x (n ")), wherein the word pair conversion process comprises the following steps:
s31: taking the corpus A as a word segmentation dictionary, calculating the maximum length of words in the word segmentation dictionary to be z, intercepting the front z position of a voice recognition result, matching the intercepted result with the words in the word segmentation dictionary, taking the matching result as the word segmentation result of the intercepted text if the matching is successful, and repeating the step on the rest voice recognition result texts to obtain the word segmentation result of the voice recognition result
Figure FDA0003804861610000042
Wherein
Figure FDA0003804861610000043
Is the nth of the speech recognition result w Word, n w The total number of words of the voice recognition result is represented, and if the matching is unsuccessful, the front z-1 bits are intercepted and matched again;
s32: result of word segmentation
Figure FDA0003804861610000044
Any two word segmentation results w n1 ,w n2 Form a set of word pairs (w) n1 ,w n2 ) Wherein n1 > n2, n1, n 2. Epsilon. [1, n ] w ];
S33: obtaining a set of conversion word pairs G = { (w) of speech recognition results n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
6. The learning monitoring means of the intelligent earphone applied in the learning field as claimed in claim 5, wherein the step S3 of calculating the word pair compactness of the converted word pair comprises:
traversing and selecting word segmentation result from coding result set B
Figure FDA0003804861610000045
Obtaining a coding result set of speech recognition results
Figure FDA0003804861610000051
Wherein
Figure FDA0003804861610000052
Is composed of
Figure FDA0003804861610000053
And obtaining a coding result set G of the speech recognition result word pairs b ={(b n1 ,b n2 )|n1>n2,n1,n2∈[1,n w ]};
Calculating the word pair compactness of the conversion word pair in the voice recognition result:
Figure FDA0003804861610000054
wherein:
t represents transposition;
the word pair compactness set in the voice recognition result is as follows: g d ={dis(w n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
7. The learning monitoring means of an intelligent earphone applied in the learning field as claimed in claim 1, wherein the step S5 is to perform similarity calculation between the recognized voice theme and the current teaching content theme, and if the similarity calculation structure is smaller than a specified threshold, it indicates that the voice theme in the earphone is different from the teaching content theme, which indicates that the student is not listening to the teaching, otherwise, the learning monitoring means of the intelligent earphone includes:
and performing similarity calculation on the recognized voice theme and the current teaching content theme, wherein the similarity calculation method is a cosine similarity calculation method, if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, and the fact that the student does not attend the course is indicated, otherwise, the student attends the course.
8. A learning monitoring device for intelligent headsets applied in the learning field, the device comprising:
the signal acquisition device is used for acquiring the audio signal of the intelligent earphone, preprocessing the acquired audio signal of the intelligent earphone and extracting the characteristics of the preprocessed audio signal of the intelligent earphone to obtain the extracted audio signal characteristics;
the voice recognition module is used for constructing an improved end-to-end voice recognition model and inputting the extracted audio signal characteristics into the model to obtain a voice recognition result;
the recognition monitoring device is used for performing word pair conversion on a recognized voice result, calculating the word pair compactness of the converted word pair, constructing a dynamic weighted voice theme recognition model based on the word pair compactness, inputting the word compactness of the voice recognition result into the model, outputting a voice theme by the model, performing similarity calculation on the recognized voice theme and a current teaching content theme, indicating that the voice theme in the earphone is different from the teaching content theme if the similarity calculation structure is smaller than a specified threshold value, and indicating that a student does not attend a lesson, or else attending a lesson, so as to realize the learning monitoring means applied to the intelligent earphone in the learning field as claimed in claims 1-7.
CN202210994128.9A 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field Active CN115376499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210994128.9A CN115376499B (en) 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210994128.9A CN115376499B (en) 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field

Publications (2)

Publication Number Publication Date
CN115376499A true CN115376499A (en) 2022-11-22
CN115376499B CN115376499B (en) 2023-07-28

Family

ID=84064840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210994128.9A Active CN115376499B (en) 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field

Country Status (1)

Country Link
CN (1) CN115376499B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (en) * 2009-07-21 2009-12-23 北京邮电大学 A kind of speech recognition semantic confidence feature extracting methods and device
CN103680498A (en) * 2012-09-26 2014-03-26 华为技术有限公司 Speech recognition method and speech recognition equipment
CN108836362A (en) * 2018-04-03 2018-11-20 江苏理工学院 It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state
CN110415697A (en) * 2019-08-29 2019-11-05 的卢技术有限公司 A kind of vehicle-mounted voice control method and its system based on deep learning
CN111105796A (en) * 2019-12-18 2020-05-05 杭州智芯科微电子科技有限公司 Wireless earphone control device and control method, and voice control setting method and system
KR102228575B1 (en) * 2020-11-06 2021-03-17 (주)디라직 Headset system capable of monitoring body temperature and online learning
CN112863518A (en) * 2021-01-29 2021-05-28 深圳前海微众银行股份有限公司 Method and device for voice data theme recognition
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method
CN113887883A (en) * 2021-09-13 2022-01-04 淮阴工学院 Course teaching evaluation implementation method based on voice recognition technology
CN114596839A (en) * 2022-03-03 2022-06-07 网络通信与安全紫金山实验室 End-to-end voice recognition method, system and storage medium
CN114662484A (en) * 2022-03-16 2022-06-24 平安科技(深圳)有限公司 Semantic recognition method and device, electronic equipment and readable storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (en) * 2009-07-21 2009-12-23 北京邮电大学 A kind of speech recognition semantic confidence feature extracting methods and device
CN103680498A (en) * 2012-09-26 2014-03-26 华为技术有限公司 Speech recognition method and speech recognition equipment
CN108836362A (en) * 2018-04-03 2018-11-20 江苏理工学院 It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state
CN110415697A (en) * 2019-08-29 2019-11-05 的卢技术有限公司 A kind of vehicle-mounted voice control method and its system based on deep learning
CN111105796A (en) * 2019-12-18 2020-05-05 杭州智芯科微电子科技有限公司 Wireless earphone control device and control method, and voice control setting method and system
KR102228575B1 (en) * 2020-11-06 2021-03-17 (주)디라직 Headset system capable of monitoring body temperature and online learning
CN112863518A (en) * 2021-01-29 2021-05-28 深圳前海微众银行股份有限公司 Method and device for voice data theme recognition
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method
CN113887883A (en) * 2021-09-13 2022-01-04 淮阴工学院 Course teaching evaluation implementation method based on voice recognition technology
CN114596839A (en) * 2022-03-03 2022-06-07 网络通信与安全紫金山实验室 End-to-end voice recognition method, system and storage medium
CN114662484A (en) * 2022-03-16 2022-06-24 平安科技(深圳)有限公司 Semantic recognition method and device, electronic equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨焕峥;: "基于深度学习的中文语音识别模型设计与实现", 湖南邮电职业技术学院学报 *
赵从健;雷菊阳;李明明;: "基于无监督学习的语音签到系统" *

Also Published As

Publication number Publication date
CN115376499B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
CN112185348B (en) Multilingual voice recognition method and device and electronic equipment
CN107369439B (en) Voice awakening method and device
CN107103903A (en) Acoustic training model method, device and storage medium based on artificial intelligence
CN110600033B (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN111261162B (en) Speech recognition method, speech recognition apparatus, and storage medium
CN110148400A (en) The pronunciation recognition methods of type, the training method of model, device and equipment
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN110459205A (en) Audio recognition method and device, computer can storage mediums
CN110164447A (en) A kind of spoken language methods of marking and device
CN111694940A (en) User report generation method and terminal equipment
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN111651497A (en) User label mining method and device, storage medium and electronic equipment
CN107274903A (en) Text handling method and device, the device for text-processing
CN109800309A (en) Classroom Discourse genre classification methods and device
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
CN113807103A (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
CN112489628B (en) Voice data selection method and device, electronic equipment and storage medium
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN112201253A (en) Character marking method and device, electronic equipment and computer readable storage medium
CN115376499A (en) Learning monitoring means applied to intelligent earphone in learning field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TA01 Transfer of patent application right

Effective date of registration: 20230726

Address after: Room 102, Building 1, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province, 523000

Applicant after: DONGGUAN JIEXUN ELECTRONIC TECHNOLOGY Co.,Ltd.

Address before: 523000 Building 2, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province

Applicant before: Dongguan Leyi Electronic Technology Co.,Ltd.

TA01 Transfer of patent application right