CN115376499B - Learning monitoring method of intelligent earphone applied to learning field - Google Patents

Learning monitoring method of intelligent earphone applied to learning field

Info

Publication number
CN115376499B
CN115376499B
Authority
CN
China
Prior art keywords
voice
word pair
audio signal
result
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210994128.9A
Other languages
Chinese (zh)
Other versions
CN115376499A (en)
Inventor
刘其军
余红云
陆佳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Jiexun Electronic Technology Co ltd
Original Assignee
Dongguan Leyi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Leyi Electronic Technology Co ltd filed Critical Dongguan Leyi Electronic Technology Co ltd
Priority to CN202210994128.9A priority Critical patent/CN115376499B/en
Publication of CN115376499A publication Critical patent/CN115376499A/en
Application granted granted Critical
Publication of CN115376499B publication Critical patent/CN115376499B/en


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 - Speech classification or search
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention relates to the technical field of learning monitoring and discloses a learning monitoring method of an intelligent earphone applied to the learning field, comprising the following steps: extracting features from the preprocessed intelligent earphone audio signal; constructing an improved end-to-end speech recognition model and inputting the extracted audio signal features into the model; converting the recognized speech result into word pairs and calculating the word pair compactness of the converted word pairs; constructing a dynamically weighted speech topic recognition model based on word pair compactness and inputting the word pair compactness of the speech recognition result into the model; and comparing the recognized speech topic with the topic of the current lecture content. The method focuses on the word pairs with higher compactness in the speech recognition content: such word pairs receive higher topic recognition weights, the topic selection probabilities of the other word pairs are considered dynamically during topic recognition, and more accurate speech topic recognition is achieved on the basis of global topic recognition.

Description

Learning monitoring method of intelligent earphone applied to learning field
Technical Field
The invention relates to the technical field of learning monitoring, in particular to a learning monitoring method of an intelligent earphone applied to the learning field.
Background
With the popularization of online teaching, more and more students wear headphones during class. However, because a computer can play several audio and video streams at the same time, a student may appear to be listening to the lecture audio while actually listening to something else, and in an online class the teacher cannot effectively judge this from the student's on-screen behaviour.

CN111383659A provides a distributed voice monitoring method, device, system, storage medium and equipment: audio stream data belonging to the same machine room is obtained; audio data to be checked is collected from the audio stream data according to a preset inspection strategy; the audio data to be checked is input into a pre-trained audio recognition model to obtain the corresponding predicted value; and a machine review result for the audio is generated from the predicted value. The method achieves voice monitoring and review with a high input-output ratio, high coverage, low latency, high recognition rate and high efficiency, and can meet the audio content monitoring requirements of highly active social voice applications in a multi-operator hybrid network deployment environment.

CN107292271A provides a learning monitoring method, a learning monitoring device and an electronic device. The method includes: acquiring an in-class image of a classroom student; recognizing the image and obtaining feature data of the student, the feature data including at least one of facial feature data, visual feature data and limb feature data of the student; and determining the student's class-attendance state from the feature data. The method can effectively and accurately monitor, within the class, the attendance of students who learn through computers and networks, and provides an effective reference for subsequent learning and teaching so as to further improve the learning or teaching process.

Although the existing learning monitoring methods can monitor students' class state based on student feature data, two problems remain: first, existing audio monitoring methods can only judge whether the audio violates rules, whereas in class it is necessary to judge whether the content played in the earphone is consistent with the topic of the lecture content; second, traditional methods do not monitor the class state from auditory features. To address these problems, this patent proposes a learning monitoring method of an intelligent earphone applied to the learning field.
Disclosure of Invention
In view of the above, the invention provides a learning monitoring method of an intelligent earphone applied to the learning field, which aims to: (1) replace the LSTM part of a traditional speech recognition model with gated recurrent units; a gated recurrent unit has fewer parameters and a smaller amount of computation, so speech recognition can be completed faster; (2) construct a dynamically weighted speech topic recognition model based on word pair compactness; the model focuses on the word pairs of the speech recognition content with higher compactness, giving a bias towards core word pairs, so that word pairs with higher compactness carry higher topic recognition weights, while the topic selection probabilities of the other word pairs are considered dynamically during topic recognition, and more accurate speech topic recognition is achieved on the basis of global topic recognition.
The invention provides a learning monitoring method of an intelligent earphone applied to the learning field, which comprises the following steps:
S1: collect the intelligent earphone audio signal, preprocess the collected intelligent earphone audio signal, and extract features from the preprocessed intelligent earphone audio signal to obtain the extracted audio signal features;
S2: construct an improved end-to-end speech recognition model and input the extracted audio signal features into the model to obtain a speech recognition result, where the input of the model is the extracted audio signal features and the output is the speech recognition result;
S3: perform word pair conversion on the recognized speech result to obtain converted word pairs, and calculate the word pair compactness of the converted word pairs;
S4: construct a dynamically weighted speech topic recognition model based on word pair compactness, input the word pair compactness of the speech recognition result into the model, and let the model output the speech topic, where the input of the model is the speech recognition result based on word pair compactness and the output is the speech topic;
S5: perform similarity calculation between the recognized speech topic and the topic of the current lecture content; if the similarity calculation result is smaller than a specified threshold, the speech topic in the earphone differs from the lecture content topic, indicating that the student is not listening to the lecture; otherwise, the student is listening to the lecture.
As a further improvement of the present invention:
Optionally, in step S1, collecting the intelligent earphone audio signal and preprocessing the collected intelligent earphone audio signal includes:
acquiring the audio signal in the intelligent earphone by using a signal acquisition device arranged in the intelligent earphone, and preprocessing the collected intelligent earphone audio signal, the preprocessing flow being as follows:
a high-pass filter is arranged in the signal acquisition device; if the frequency of the collected signal is higher than the cutoff frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the collected signal is lower than the cutoff frequency of the high-pass filter, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the collected signal, the stronger the attenuation; the audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] denotes the time sequence information of the collected audio signal, t_0 denotes the start time of audio signal acquisition and t_L denotes the cut-off time of audio signal acquisition;
the cutoff frequency f_t of the first-order RC high-pass filter is calculated as:
f_t = 1/(2πRC)
wherein:
R denotes the resistance value of the resistor in the high-pass filter;
C denotes the capacitance of the capacitive element in the high-pass filter;
the analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a discrete signal, the conversion flow being as follows:
construct an impulse sequence p(t) with sampling frequency f_s;
sample the audio signal x(t) with the impulse sequence p(t) to obtain the discrete signal;
the number of sampling points in the discrete signal is N, so the first sampling point of the discrete signal is x(1) and the last sampling point is x(N); x(n′) with n′ ∈ [1, N] is the converted discrete signal;
the signal acquisition device applies windowing to the converted signal x(n′), the window function of the windowing being a Hamming window, and the windowing formula being:
x(n″) = x(n′) × w(n′)
wherein:
w(n′) is the window function;
a denotes the window function coefficient, which is set to 0.43;
x(n″) is the windowed audio signal, and x(n″) is taken as the preprocessed intelligent earphone audio signal.
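As an illustration of the preprocessing chain above (high-pass filtering, sampling and Hamming-style windowing), the following minimal Python sketch assumes NumPy and SciPy are available; the sampling rate, the R and C values and the generalised-Hamming window form are illustrative assumptions, while the window coefficient a = 0.43 follows the description.

```python
import numpy as np
from scipy import signal

def preprocess(audio: np.ndarray, fs: float, R: float = 1.6e3, C: float = 1e-6) -> np.ndarray:
    """Sketch of the preprocessing flow: high-pass filter, then window the earphone audio."""
    # Cutoff frequency of a first-order RC high-pass filter: f_t = 1 / (2 * pi * R * C)
    f_t = 1.0 / (2.0 * np.pi * R * C)
    # Digital first-order high-pass used as a stand-in for the analogue RC stage
    b, a = signal.butter(1, f_t / (fs / 2.0), btype="highpass")
    filtered = signal.lfilter(b, a, audio)

    # Generalised Hamming-style window with coefficient a = 0.43, as stated in the text
    N = len(filtered)
    n = np.arange(N)
    coeff = 0.43
    w = (1.0 - coeff) - coeff * np.cos(2.0 * np.pi * n / (N - 1))
    return filtered * w   # x(n'') = x(n') * w(n'), the preprocessed audio signal
```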
Optionally, in step S1, performing feature extraction on the preprocessed intelligent earphone audio signal to obtain the extracted audio signal features includes:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal feature F(x(n″)), the feature extraction flow being as follows:
S11: perform fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency-domain characteristics of x(n″), the fast Fourier transform processing formula being as follows:
wherein:
X(k) denotes the result of the fast Fourier transform of x(n″) at point k;
K is the number of points of the fast Fourier transform, j denotes the imaginary unit, j² = -1;
S12: construct a filter bank containing M filters, the frequency response of the m-th filter in the filter bank being f_m(k):
wherein:
f(m) denotes the center frequency of the m-th filter;
S13: input the fast Fourier transform result into the filter bank, which outputs the energy of the signal in the frequency domain:
wherein:
En denotes the energy of the intelligent earphone audio signal in the frequency domain;
S14: extract the features of the intelligent earphone audio signal, the feature extraction formula being as follows:
wherein:
L′ denotes the feature order, L′ = 1, 2, ..., 12;
C(L′) denotes the extracted L′-th order feature;
the feature that maximizes C(L′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
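A rough Python sketch of the feature-extraction flow S11 to S14 (FFT, a bank of M filters, frequency-domain energy, and twelve low-order features) follows. Since the filter shapes and the S14 formula are not reproduced above, a conventional triangular mel-style filter bank and DCT-style coefficients are used here as stand-ins; all names and parameter values are illustrative.

```python
import numpy as np

def extract_features(x: np.ndarray, fs: float, K: int = 512, M: int = 26, L: int = 12) -> np.ndarray:
    """Illustrative stand-in for S11-S14: FFT -> filter bank -> log energy -> low-order features."""
    # S11: K-point FFT of the preprocessed signal
    power = np.abs(np.fft.rfft(x, n=K)) ** 2

    # S12: triangular filter bank with M filters on a mel-like scale (assumed shape)
    mel = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), M + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((K / 2 + 1) * hz / (fs / 2)).astype(int)
    fbank = np.zeros((M, K // 2 + 1))
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # S13: energy En of the signal in each frequency band
    energy = np.log(fbank @ power + 1e-10)

    # S14: keep the first L cepstral-style coefficients C(1)..C(L)
    return np.array([
        np.sum(energy * np.cos(np.pi * l * (np.arange(M) + 0.5) / M)) for l in range(1, L + 1)
    ])
```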
Optionally, constructing the improved end-to-end speech recognition model in step S2 includes:
constructing an improved end-to-end speech recognition model whose input is the extracted audio signal features and whose output is the speech recognition result; the improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer, where the convolution layer receives the audio signal features to be recognized and applies convolution processing to them, the encoding layer encodes the convolved features using gated recurrent units, and the decoding layer decodes the encoding result, the decoding result being the speech recognition result; the flow of performing speech recognition on the audio signal features with the end-to-end speech recognition model is as follows:
S21: the convolution layer applies convolution processing to the input features, the convolution formula being:
F1 = Conv(F)
wherein:
F denotes the audio signal features input to the end-to-end speech recognition model;
Conv(·) denotes the convolution operation;
F1 denotes the convolution features;
S22: the convolution features are input into an encoding layer based on gated recurrent units; the activation function of the encoding layer is GeLU, the encoding layer contains 5 gated recurrent units, each gated recurrent unit is set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features;
S23: the decoding layer uses a softmax function to output the probability P(y|x) of the decoding result y corresponding to each segment of the encoding result x, and the y with the largest probability is selected as the decoding result of the encoding result x, the decoding result being the speech recognition result;
the training objective function of the end-to-end speech recognition model is as follows:
wherein:
N_x denotes the number of characters of the real speech text;
a_{1,x} denotes the number of substituted characters in the speech recognition result output by the model, a_{2,x} denotes the number of deleted characters in the speech recognition result output by the model, and a_{3,x} denotes the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F(x(n″)) are input into the model to obtain the speech recognition result y(x(n″)).
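A minimal PyTorch sketch of the improved end-to-end recogniser described above: a convolution layer, an encoding layer of five bidirectional gated recurrent units with GeLU activation, and a softmax decoding layer. The layer sizes, the vocabulary size and the use of a single `nn.GRU` with `num_layers=5` are assumptions made for illustration; the training objective built from substitution, deletion and insertion counts is not implemented here.

```python
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    """Illustrative conv + bidirectional-GRU encoder + softmax decoder (hypothetical sizes)."""

    def __init__(self, feat_dim: int = 12, hidden: int = 256, vocab: int = 5000):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)   # convolution layer
        self.act = nn.GELU()                                                # GeLU activation
        self.encoder = nn.GRU(hidden, hidden, num_layers=5,                 # 5 gated recurrent units
                              batch_first=True, bidirectional=True)         # bidirectional recursion
        self.decoder = nn.Linear(2 * hidden, vocab)                         # decoding layer

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); F1 = Conv(F)
        f1 = self.act(self.conv(feats.transpose(1, 2))).transpose(1, 2)
        enc, _ = self.encoder(f1)                        # output of the last GRU layer
        return torch.softmax(self.decoder(enc), dim=-1)  # P(y | x) for each encoded segment

# The y with the largest probability is taken as the decoding (recognition) result:
# probs = EndToEndASR()(torch.randn(1, 100, 12)); y = probs.argmax(dim=-1)
```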
Optionally, in step S3, performing word pair conversion on the recognized speech result to obtain converted word pairs includes:
constructing corpora of different topic categories, each corpus containing all the words under its topic category, and merging the constructed corpora of the different topic categories into a corpus A, where A = {topic_1, topic_2, ..., topic_h} and topic_h is the corpus of the h-th topic category; all words in the corpus A are encoded with one-hot encoding in the order of the words in the corpus A, obtaining the encoding result set B of the corpus A;
in the embodiment of the invention, the one-hot encoding results of words of the same topic category are close to each other in distance;
based on the word encoding results of the corpus A, word pair conversion is performed on the recognized speech result y(x(n″)), the word pair conversion flow being as follows:
S31: take the corpus A as the segmentation dictionary and compute the maximum word length z in the segmentation dictionary; take the first z characters of the speech recognition result and match them against the words in the segmentation dictionary; if the match succeeds, take the matched result as the segmentation result of the intercepted text; if the match fails, take the first z-1 characters and match again; repeat these steps on the remaining text of the speech recognition result to obtain the segmentation result {w_1, w_2, ..., w_{n_w}} of the speech recognition result, where w_{n_w} is the n_w-th word of the speech recognition result and n_w denotes the total number of words of the speech recognition result;
S32: form a group of word pairs (w_n1, w_n2) from any two segmentation results w_n1, w_n2 of the segmentation result, where n1 > n2 and n1, n2 ∈ [1, n_w];
S33: obtain the converted word pair set G = {(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]} of the speech recognition result.
Optionally, calculating the word pair compactness of the converted word pairs in step S3 includes:
traversing the encoding result set B to select the encoding result of each segmented word, obtaining the encoding result set {b_1, b_2, ..., b_{n_w}} of the speech recognition result, where b_n is the encoding result of w_n, and obtaining the encoding result set of the word pairs of the speech recognition result G_b = {(b_n1, b_n2) | n1 > n2, n1, n2 ∈ [1, n_w]};
calculating the word pair compactness dis(w_n1, w_n2) of the converted word pairs in the speech recognition result:
wherein:
T denotes the transpose;
the word pair compactness set of the speech recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
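The compactness formula itself is not reproduced above (only the transpose T is named), so the sketch below simply treats dis(w_n1, w_n2) as a similarity between the two words' encoding vectors, using cosine similarity as an assumed stand-in.

```python
import numpy as np

def pair_compactness(b1: np.ndarray, b2: np.ndarray) -> float:
    """Assumed stand-in for dis(w_n1, w_n2): similarity of the two encoding vectors b_n1, b_n2."""
    return float(b1 @ b2.T) / (np.linalg.norm(b1) * np.linalg.norm(b2) + 1e-12)

# G_d = {dis(w_n1, w_n2)} is then the set of compactness values over all word pairs in G
```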
Optionally, in step S4, constructing the dynamically weighted speech topic recognition model based on word pair compactness includes:
constructing a dynamically weighted speech topic recognition model based on word pair compactness, the input of the model being the speech recognition result based on word pair compactness and the output being the speech topic; the topic recognition flow of the dynamically weighted speech topic recognition model is as follows:
S41: select from the word pair compactness set G_d the word pairs whose compactness is higher than a specified threshold to form the word pair set G1; the candidate topic categories of these word pairs are the h topic categories in the corpus A;
S42: set the current iteration number of the model to r, with initial value r = 0, and set the maximum iteration number of the model to Max;
S43: at the (r+1)-th iteration, compute for any word pair q_v in the word pair set G1 the probability that its topic is a given topic category:
wherein:
α, β denote the Dirichlet prior parameters;
the topic of the word pair q_v at the (r+1)-th iteration takes values among the h topic categories in the corpus A;
the count terms denote, respectively: the number of word pairs in G1 assigned to that topic at the r-th iteration; with the word pair q_v removed, the number of times up to the (r+1)-th iteration that the first word of q_v has been assigned to that topic; with q_v removed, the number of times up to the (r+1)-th iteration that the second word of q_v has been assigned to that topic; and, with q_v removed, the number of times up to the (r+1)-th iteration that all words have been assigned to that topic;
S44: judge whether r+1 is greater than or equal to Max; if r+1 is greater than or equal to Max, output the probabilities of the topic categories to which the different word pairs in the word pair set G1 belong after the (r+1)-th iteration; otherwise set r = r+1 and return to step S43;
the word pair compactness of the speech recognition result is input into the model and the model outputs the speech topic, the formula for computing the speech topic of the speech recognition result being as follows:
wherein:
p(topic|G_d, G) denotes the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] denoting the h topic categories in the corpus A; the topic with the highest probability is selected as the speech topic;
count_{q_v} denotes the number of occurrences of the two words of the word pair q_v in the speech recognition result, and count_G denotes the total number of words in the speech recognition result.
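Because the sampling formula in S43 is only described through its count terms, the following Python sketch is a simplified rendition of the iterative procedure S41 to S44: each high-compactness word pair is re-assigned to one of the h topics for Max iterations, with counts smoothed by the Dirichlet priors alpha and beta. The exact proportionality and all names are assumptions made for illustration.

```python
import random
from collections import Counter

def iterate_topics(pairs: list, h: int, alpha: float = 0.1, beta: float = 0.01,
                   max_iter: int = 50) -> dict:
    """Simplified stand-in for S41-S44: iteratively re-draw each word pair's topic."""
    assign = {q: random.randrange(h) for q in pairs}            # iteration r = 0
    for _ in range(max_iter):                                    # stop once r + 1 >= Max
        pair_n = Counter(assign.values())                        # word pairs per topic
        word_n = Counter((t, w) for q, t in assign.items() for w in q)
        for q in pairs:
            cur = assign[q]
            weights = [
                (pair_n[t] - (cur == t) + alpha)                 # pairs in topic t, excluding q
                * (word_n[(t, q[0])] - (cur == t) + beta)        # first word of q in topic t
                * (word_n[(t, q[1])] - (cur == t) + beta)        # second word of q in topic t
                for t in range(h)
            ]
            assign[q] = random.choices(range(h), weights=weights)[0]
    return assign   # topic category assigned to each high-compactness word pair
```

The topic of the whole recognition result is then obtained, as in the formula above, by weighting each word pair's topic probability with its occurrence counts count_{q_v} relative to count_G and selecting the most probable topic.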
Optionally, in step S5, performing similarity calculation between the recognized speech topic and the topic of the current lecture content, where a similarity calculation result smaller than a specified threshold indicates that the speech topic in the earphone differs from the lecture content topic and that the student is not listening to the lecture, and otherwise the student is listening, includes:
performing similarity calculation between the recognized speech topic and the topic of the current lecture content, the similarity calculation method being the cosine similarity algorithm; if the similarity calculation result is smaller than the specified threshold, the speech topic in the earphone differs from the lecture content topic, indicating that the student is not listening to the lecture; otherwise the student is listening to the lecture.
In order to solve the above problems, the present invention further provides a learning monitoring device applied to an intelligent earphone in the learning field, which is characterized in that the device includes:
the signal acquisition device is used for acquiring the intelligent earphone audio signals, preprocessing the acquired intelligent earphone audio signals, and extracting the characteristics of the preprocessed intelligent earphone audio signals to obtain the extracted audio signal characteristics;
The voice recognition module is used for constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model, and obtaining a voice recognition result;
the recognition monitoring device is used for performing word pair conversion on the recognized voice result, calculating the word pair compactness of the converted word pairs, constructing a dynamically weighted voice topic recognition model based on the word pair compactness, inputting the word pair compactness of the voice recognition result into the model so that the model outputs the voice topic, and performing similarity calculation between the recognized voice topic and the current teaching content topic; if the similarity calculation result is smaller than a specified threshold, the voice topic in the earphone differs from the teaching content topic, indicating that the student does not listen to the class, and otherwise the student listens to the class.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the above learning monitoring method of the intelligent earphone applied to the learning field.
In order to solve the above problems, the present invention further provides a computer readable storage medium having at least one instruction stored therein, the at least one instruction being executed by a processor in an electronic device to implement the above learning monitoring method of the intelligent earphone applied to the learning field.
Compared with the prior art, the invention provides a learning monitoring method of an intelligent earphone applied to the learning field, and the technology has the following advantages:
Firstly, this scheme proposes a speech recognition model: an improved end-to-end speech recognition model is constructed whose input is the extracted audio signal features and whose output is the speech recognition result. The improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer: the convolution layer receives the audio signal features to be recognized and applies convolution processing to them, the encoding layer encodes the convolved features using gated recurrent units, and the decoding layer decodes the encoding result, the decoding result being the speech recognition result. The flow of performing speech recognition on the audio signal features with the end-to-end speech recognition model is: the convolution layer applies convolution processing to the input features, the convolution formula being:
F1 = Conv(F)
wherein: F denotes the audio signal features input to the end-to-end speech recognition model; Conv(·) denotes the convolution operation; F1 denotes the convolution features. The convolution features are input into an encoding layer based on gated recurrent units; the activation function of the encoding layer is GeLU, the encoding layer contains 5 gated recurrent units, each gated recurrent unit is set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features. The decoding layer uses a softmax function to output the probability P(y|x) of the decoding result y corresponding to each segment of the encoding result x, and the y with the largest probability is selected as the decoding result of the encoding result x, the decoding result being the speech recognition result. The training objective function of the end-to-end speech recognition model is as follows:
wherein: N_x denotes the number of characters of the real speech text; a_{1,x} denotes the number of substituted characters in the speech recognition result output by the model, a_{2,x} the number of deleted characters, and a_{3,x} the number of inserted characters. The extracted audio signal features F(x(n″)) are input into the model to obtain the speech recognition result y(x(n″)). The improved model of this scheme replaces the LSTM part of the traditional speech recognition model with gated recurrent units, which have fewer parameters and less computation, so speech recognition can be completed faster.
Meanwhile, this scheme proposes a topic recognition model: a dynamically weighted speech topic recognition model based on word pair compactness is constructed, whose input is the speech recognition result based on word pair compactness and whose output is the speech topic. The topic recognition flow of the dynamically weighted speech topic recognition model is: select from the word pair compactness set G_d the word pairs whose compactness is higher than a specified threshold to form the word pair set G1, whose candidate topic categories are the h topic categories in the corpus A; set the current iteration number of the model to r, with initial value r = 0, and set the maximum iteration number to Max; at the (r+1)-th iteration, compute for any word pair q_v in the word pair set G1 the probability that its topic is a given topic category:
wherein: α, β denote the Dirichlet prior parameters; the topic of the word pair q_v at the (r+1)-th iteration takes values among the h topic categories in the corpus A; the count terms denote, respectively, the number of word pairs in G1 assigned to that topic at the r-th iteration, and, with the word pair q_v removed, the number of times up to the (r+1)-th iteration that the first word of q_v, the second word of q_v, and all words have been assigned to that topic. Judge whether r+1 is greater than or equal to Max; if r+1 is greater than or equal to Max, output the probabilities of the topic categories to which the different word pairs in G1 belong after the (r+1)-th iteration; otherwise set r = r+1 and return to the previous step. The word pair compactness of the speech recognition result is input into the model and the model outputs the speech topic, the formula for computing the speech topic of the speech recognition result being as follows:
wherein: p(topic|G_d, G) denotes the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] denoting the h topic categories in the corpus A, and the topic with the highest probability is selected as the speech topic; count_{q_v} denotes the number of occurrences of the two words of the word pair q_v in the speech recognition result and count_G denotes the total number of words in the speech recognition result. Similarity calculation is performed between the recognized speech topic and the topic of the current lecture content using the cosine similarity algorithm; if the similarity calculation result is smaller than the specified threshold, the speech topic in the earphone differs from the lecture content topic, indicating that the student is not listening to the lecture, and otherwise the student is listening. This scheme constructs a dynamically weighted speech topic recognition model based on word pair compactness. Traditional models do not distinguish core word pairs from non-core word pairs and treat all word pairs uniformly, so the core words carrying the topic information of the text are ignored and topic recognition accuracy suffers. The proposed model focuses on the word pairs of the speech recognition content with higher compactness, giving a bias towards core word pairs: word pairs with higher compactness carry higher topic recognition weights, the topic selection probabilities of the other word pairs are considered dynamically during topic recognition, and more accurate speech topic recognition is achieved on the basis of global topic recognition.
Drawings
Fig. 1 is a schematic flow chart of a learning monitoring method applied to an intelligent earphone in the learning field according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating one of the steps in the embodiment of FIG. 1;
fig. 3 is a functional block diagram of a learning monitoring device applied to an intelligent earphone in the learning field according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device for implementing the learning monitoring method applied to an intelligent earphone in the learning field according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a learning monitoring method of an intelligent earphone applied to the learning field. The execution subject of the learning monitoring method applied to the intelligent earphone in the learning field includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the application. In other words, the learning monitoring method applied to the intelligent earphone in the learning field may be executed by software or hardware installed in a terminal device or a server device, where the software may be a blockchain platform. The server side includes, but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
s1: and acquiring the intelligent earphone audio signals, preprocessing the acquired intelligent earphone audio signals, and extracting the characteristics of the preprocessed intelligent earphone audio signals to obtain the extracted audio signal characteristics.
In step S1, the intelligent earphone audio signal is collected and the collected intelligent earphone audio signal is preprocessed, which includes:
acquiring the audio signal in the intelligent earphone by using a signal acquisition device arranged in the intelligent earphone, and preprocessing the collected intelligent earphone audio signal, the preprocessing flow being as follows:
a high-pass filter is arranged in the signal acquisition device; if the frequency of the collected signal is higher than the cutoff frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the collected signal is lower than the cutoff frequency of the high-pass filter, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the collected signal, the stronger the attenuation; the audio signal passing through the high-pass filter is x(t), where t ∈ [t_0, t_L] denotes the time sequence information of the collected audio signal, t_0 denotes the start time of audio signal acquisition and t_L denotes the cut-off time of audio signal acquisition;
the cutoff frequency f_t of the first-order RC high-pass filter is calculated as:
f_t = 1/(2πRC)
wherein:
R denotes the resistance value of the resistor in the high-pass filter;
C denotes the capacitance of the capacitive element in the high-pass filter;
the analog-to-digital converter in the signal acquisition device converts the collected audio signal x(t) into a discrete signal, the conversion flow being as follows:
construct an impulse sequence p(t) with sampling frequency f_s;
sample the audio signal x(t) with the impulse sequence p(t) to obtain the discrete signal;
the number of sampling points in the discrete signal is N, so the first sampling point of the discrete signal is x(1) and the last sampling point is x(N); x(n′) with n′ ∈ [1, N] is the converted discrete signal;
the signal acquisition device applies windowing to the converted signal x(n′), the window function of the windowing being a Hamming window, and the windowing formula being:
x(n″) = x(n′) × w(n′)
wherein:
w(n′) is the window function;
a denotes the window function coefficient, which is set to 0.43;
x(n″) is the windowed audio signal, and x(n″) is taken as the preprocessed intelligent earphone audio signal.
In step S1, feature extraction is performed on the preprocessed intelligent earphone audio signal to obtain the extracted audio signal features, which includes:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal feature F(x(n″)); in detail, referring to fig. 2, the feature extraction flow is as follows:
S11: perform fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency-domain characteristics of x(n″), the fast Fourier transform processing formula being as follows:
wherein:
X(k) denotes the result of the fast Fourier transform of x(n″) at point k;
K is the number of points of the fast Fourier transform, j denotes the imaginary unit, j² = -1;
S12: construct a filter bank containing M filters, the frequency response of the m-th filter in the filter bank being f_m(k):
wherein:
f(m) denotes the center frequency of the m-th filter;
S13: input the fast Fourier transform result into the filter bank, which outputs the energy of the signal in the frequency domain:
wherein:
En denotes the energy of the intelligent earphone audio signal in the frequency domain;
S14: extract the features of the intelligent earphone audio signal, the feature extraction formula being as follows:
wherein:
L′ denotes the feature order, L′ = 1, 2, ..., 12;
C(L′) denotes the extracted L′-th order feature;
the feature that maximizes C(L′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
S2: construct an improved end-to-end speech recognition model and input the extracted audio signal features into the model to obtain a speech recognition result; the input of the model is the extracted audio signal features and the output is the speech recognition result.
Constructing the improved end-to-end speech recognition model in step S2 includes:
constructing an improved end-to-end speech recognition model whose input is the extracted audio signal features and whose output is the speech recognition result; the improved end-to-end speech recognition model comprises a convolution layer, an encoding layer and a decoding layer, where the convolution layer receives the audio signal features to be recognized and applies convolution processing to them, the encoding layer encodes the convolved features using gated recurrent units, and the decoding layer decodes the encoding result, the decoding result being the speech recognition result; the flow of performing speech recognition on the audio signal features with the end-to-end speech recognition model is as follows:
S21: the convolution layer applies convolution processing to the input features, the convolution formula being:
F1 = Conv(F)
wherein:
F denotes the audio signal features input to the end-to-end speech recognition model;
Conv(·) denotes the convolution operation;
F1 denotes the convolution features;
S22: the convolution features are input into an encoding layer based on gated recurrent units; the activation function of the encoding layer is GeLU, the encoding layer contains 5 gated recurrent units, each gated recurrent unit is set to bidirectional recursion, and the output of the last gated recurrent unit is taken as the encoding result of the audio signal features;
S23: the decoding layer uses a softmax function to output the probability P(y|x) of the decoding result y corresponding to each segment of the encoding result x, and the y with the largest probability is selected as the decoding result of the encoding result x, the decoding result being the speech recognition result;
the training objective function of the end-to-end speech recognition model is as follows:
wherein:
N_x denotes the number of characters of the real speech text;
a_{1,x} denotes the number of substituted characters in the speech recognition result output by the model, a_{2,x} denotes the number of deleted characters in the speech recognition result output by the model, and a_{3,x} denotes the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F(x(n″)) are input into the model to obtain the speech recognition result y(x(n″)).
S3: and carrying out word pair conversion on the recognized voice result to obtain a converted word pair, and calculating the word pair compactness of the converted word pair.
And step S3, performing word pair conversion on the recognized voice result to obtain converted word pairs, wherein the step S comprises the following steps:
constructing corpuses of different topic categories, wherein each corpus contains all words under the topic category, and combining the constructed corpuses of different topic categories into a corpus A, wherein the corpus A= { topic 1 ,topic 2 ,...,topic h }, wherein topic h The method comprises the steps of encoding all words in a corpus A by using a single-hot encoding method according to the sequence of the words in the corpus A for the corpus of the h topic class to obtain an encoding result set B of the corpus A;
in the embodiment of the invention, the single-heat coding result distances of words with the same theme class are similar;
based on word encoding results of the corpus A, word pair conversion is carried out on the recognized voice results y (x (n ")), wherein the word pair conversion flow is as follows:
s31: taking the corpus A as a word segmentation dictionary, calculating the maximum length of words in the word segmentation dictionary as z, intercepting the front z bits of a voice recognition result, matching the intercepted result with words in the word segmentation dictionary, taking the matched result as the word segmentation result of the intercepted text if the matching is successful, and repeating the steps on the rest voice recognition result text to obtain the word segmentation result of the voice recognition result Wherein->N-th of speech recognition result w Words, n w The total number of words representing the voice recognition result is intercepted and the matching is carried out again by the z-1 bit before the interception if the matching is unsuccessful;
s32: will word the resultAny two word segmentation results wn1, wn2 form a group of word pairs (w n1 ,w n2 ) Wherein n1 > n2, n1, n2 ε [1, n ] w ];
S33: converted word pair set G= { (w) for obtaining voice recognition result n1 ,w n2 )|n1>n2,n1,n2∈[1,n w ]}。
In step S3, calculating the word pair compactness of the converted word pairs includes:
traversing the encoding result set B to select the encoding result of each segmented word, obtaining the encoding result set {b_1, b_2, ..., b_{n_w}} of the speech recognition result, where b_n is the encoding result of w_n, and obtaining the encoding result set of the word pairs of the speech recognition result G_b = {(b_n1, b_n2) | n1 > n2, n1, n2 ∈ [1, n_w]};
calculating the word pair compactness dis(w_n1, w_n2) of the converted word pairs in the speech recognition result:
wherein:
T denotes the transpose;
the word pair compactness set of the speech recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
S4: a dynamic weighted voice theme recognition model is built based on the word pair compactness, the word pair compactness of a voice recognition result is input into the model, the model outputs a voice theme, the model input is the voice recognition result based on the word pair compactness, and the model output is the voice theme.
In the step S4, a dynamic weighted voice theme recognition model is constructed based on the word pair compactness, and the method comprises the following steps:
A dynamic weighted voice theme recognition model based on word pair compactness is constructed, the model is input into a voice recognition result based on word pair compactness, the model is output into a voice theme, and the theme recognition flow of the dynamic weighted voice theme recognition model is as follows:
s41: gathering word pairs into a compact set G d The compactness of the Chinese word pair is higher than the threshold valueThe word pair topic category is h topic categories in the corpus A;
s42: setting the current iteration number of the model as r, the initial value of r as 0, and the maximum iteration number of the model as Max;
s43: calculation of the (r+1) th timeWhen iterating, calculating any word pair q in the word pair set G1 v Is the subject ofProbability of (2):
wherein:
alpha, beta represent dirichlet a priori parameters;
represents the (r+1) th iteration word pair q v Subject of->Representing h topic categories in the corpus A;
represents the word pair q in G1 at the r-th iteration v The coating is divided into themes->Word pair number of (a);
representing the removal word pair q v By stopping to the (r+1) th iteration, the word pair q v The first word in (1) is divided into topic->Is a number of times (1);
representing the removal word pair q v By stopping to the (r+1) th iteration, the word pair q v The second word of (a) is divided into the topics +. >Is a number of times (1); />
Representing the removal word pair q v By the r+1st iteration, all words are split into topics +.>Is a number of times (1);
s44: judging whether r+1 is greater than or equal to Max, if r+1 is greater than or equal to Max, outputting the probability of the subject class of different word pairs in the word pair set G1 after the r+1 iteration, otherwise, making r=r+1, and returning to the step S43;
inputting word pair compactness of a voice recognition result into a model, outputting a voice theme by the model, wherein the formula for calculating the voice theme of the voice recognition result is as follows:
wherein:
p(topic|G d g) represents the probability that the topic of the speech recognition result is topic, topic E [1, h]The h topic categories in the corpus A are represented, and topics with the highest probability are selected as voice topics;
representation word pair q v The number of occurrences of two words in the speech recognition result, count G Representing the total number of words in the speech recognition result.
S5: and performing similarity calculation on the identified voice theme and the current teaching content theme, if the similarity calculation structure is smaller than a specified threshold value, indicating that the voice theme in the earphone is different from the teaching content theme, indicating that the student does not listen to the class, otherwise, listening to the class.
And S5, performing similarity calculation on the identified voice theme and the current teaching content theme, if the similarity calculation structure is smaller than a specified threshold value, indicating that the voice theme in the earphone is different from the teaching content theme, indicating that the student does not listen to the class, otherwise, listening to the class, and comprising the following steps:
And performing similarity calculation on the identified voice theme and the current teaching content theme, wherein the similarity calculation method is a cosine similarity algorithm, if the similarity calculation structure is smaller than a specified threshold value, the voice theme in the earphone is different from the teaching content theme, and the fact that the students do not listen to the classes is indicated, otherwise, the students listen to the classes.
Example 2:
As shown in fig. 3, which is a functional block diagram of the learning monitoring device of the intelligent earphone applied to the learning field according to an embodiment of the present invention, the device can implement the learning monitoring method of the intelligent earphone in embodiment 1.
The learning monitoring device 100 applied to the intelligent earphone in the learning field can be installed in an electronic device. According to the functions implemented, the learning monitoring device applied to the intelligent earphone in the learning field may include a signal acquisition device 101, a voice recognition module 102 and a recognition monitoring device 103. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
The signal acquisition device 101 is configured to acquire an intelligent earphone audio signal, perform preprocessing on the acquired intelligent earphone audio signal, and perform feature extraction on the preprocessed intelligent earphone audio signal to obtain extracted audio signal features;
The speech recognition module 102 is configured to construct an improved end-to-end speech recognition model, and input the extracted audio signal features into the model to obtain a speech recognition result;
the recognition monitoring device 103 is configured to perform word pair conversion on the recognized voice result, calculate the word pair compactness of the converted word pairs, construct a dynamically weighted voice topic recognition model based on the word pair compactness, input the word pair compactness of the voice recognition result into the model so that the model outputs the voice topic, and perform similarity calculation between the recognized voice topic and the current teaching content topic; if the similarity calculation result is smaller than a specified threshold, the voice topic in the earphone differs from the teaching content topic, indicating that the student does not listen to the class, and otherwise the student listens to the class.
In detail, the modules in the learning monitoring device 100 of the intelligent earphone applied to the learning field in the embodiment of the present invention adopt the same technical means as the learning monitoring method of the intelligent earphone applied to the learning field described with reference to fig. 1, and can produce the same technical effects, which are not repeated here.
Example 3:
fig. 4 is a schematic structural diagram of an electronic device for implementing a learning monitoring method applied to an intelligent earphone in the learning field according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of the program 12, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective parts of the entire electronic device using various interfaces and lines, executes or executes programs or modules (a program 12 for learning monitoring, etc.) stored in the memory 11, and invokes data stored in the memory 11 to perform various functions of the electronic device 1 and process data.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 4 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
collecting intelligent earphone audio signals, preprocessing the collected intelligent earphone audio signals, and extracting features of the preprocessed intelligent earphone audio signals to obtain extracted audio signal features;
an improved end-to-end voice recognition model is built, the extracted audio signal characteristics are input into the model, and a voice recognition result is obtained;
word pair conversion is carried out on the recognized voice result to obtain converted word pairs, and the word pair compactness of the converted word pairs is calculated;
constructing a dynamic weighted voice theme recognition model based on the word pair compactness, inputting the word pair compactness of the voice recognition result into the model, and outputting the voice theme by the model;
and performing similarity calculation on the identified voice theme and the current teaching content theme; if the similarity calculation result is smaller than a specified threshold, the voice theme in the earphone differs from the teaching content theme, indicating that the student is not listening to the class; otherwise, the student is listening to the class.
Specifically, the specific implementation method of the above instruction by the processor 10 may refer to descriptions of related steps in the corresponding embodiments of fig. 1 to 4, which are not repeated herein.
It should be noted that the foregoing reference numerals of the embodiments of the present invention are merely for description and do not represent the advantages or disadvantages of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises that element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, and of course may also be implemented by means of hardware, although in many cases the former is preferred. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present invention.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the contents of this description, whether employed directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present invention.

Claims (8)

1. A learning monitoring method of an intelligent earphone applied to the learning field, characterized by comprising the following steps:
S1: collecting intelligent earphone audio signals, preprocessing the collected intelligent earphone audio signals, and extracting features of the preprocessed intelligent earphone audio signals to obtain extracted audio signal features;
S2: an improved end-to-end voice recognition model is built and the extracted audio signal features are input into the model to obtain a voice recognition result, wherein the input of the model is the extracted audio signal features and the output is the voice recognition result;
S3: word pair conversion is carried out on the recognized voice result to obtain converted word pairs, and the word pair compactness of the converted word pairs is calculated;
S4: constructing a dynamic weighted voice topic recognition model based on word pair compactness, inputting the word pair compactness of the voice recognition result into the model, and obtaining the voice topic output by the model, wherein the input of the model is the voice recognition result represented by word pair compactness and the output is the voice topic, comprising the following steps:
a dynamic weighted voice topic recognition model based on word pair compactness is constructed, whose input is the voice recognition result represented by word pair compactness and whose output is the voice topic; the topic recognition flow of the dynamic weighted voice topic recognition model is as follows:
S41: gather the word pairs whose compactness in the word pair compactness set G_d is higher than a threshold into a word pair set G1; the word pair topic categories are the h topic categories in the corpus A;
s42: setting the current iteration number of the model as r, the initial value of r as 0, and the maximum iteration number of the model as Max;
s43: when computing the (r+1) th iteration, computing any word pair q in the word pair set G1 v Is the subject ofProbability of (2):
wherein:
α, β represent the Dirichlet prior parameters;
the topic of the word pair q_v at the (r+1)-th iteration is one of the h topic categories in the corpus A;
one count represents the number of word pairs in G1 assigned to the topic at the r-th iteration;
two further counts represent, with the word pair q_v removed, the number of times each of the two words of q_v has been assigned to the topic up to the (r+1)-th iteration;
the last count represents, with the word pair q_v removed, the number of times all words have been assigned to the topic up to the (r+1)-th iteration;
S44: judge whether r+1 is greater than or equal to Max; if r+1 is greater than or equal to Max, output the topic category probabilities of the different word pairs in the word pair set G1 after the (r+1)-th iteration; otherwise, set r = r+1 and return to step S43;
inputting the word pair compactness of the voice recognition result into the model, the model outputs the voice topic, wherein the formula for calculating the voice topic of the voice recognition result is as follows:
wherein:
p(topic|G_d, G) represents the probability that the topic of the speech recognition result is topic, with topic ∈ [1, h] representing the h topic categories in the corpus A; the topic with the highest probability is selected as the voice topic;
a count term represents the number of occurrences of the two words of the word pair q_v in the speech recognition result, and count_G represents the total number of words in the speech recognition result;
S5: performing similarity calculation on the identified voice topic and the current teaching content topic; if the similarity calculation result is smaller than a specified threshold, the voice topic in the earphone differs from the teaching content topic, indicating that the student is not listening to the class; otherwise, the student is listening to the class.
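For illustration of steps S41-S44 only, the following Python sketch implements a collapsed Gibbs sampler over word pairs that uses the quantities defined above (the Dirichlet priors α and β, the per-topic word pair counts, and the per-topic word counts with the current word pair removed). The sampling formula of the claim is an image not reproduced in this text, so the exact update used below, and all function and variable names, are assumptions rather than the patent's own formula.

import random
from collections import defaultdict

# Minimal sketch of steps S41-S44, assuming a standard collapsed-Gibbs update
# over word pairs (biterm-style), built only from the quantities the claim
# defines: alpha, beta, per-topic word pair counts, and per-topic word counts.
def gibbs_word_pair_topics(word_pairs, h, alpha=0.1, beta=0.01, max_iter=50, seed=0):
    rng = random.Random(seed)
    vocab = {w for pair in word_pairs for w in pair}
    z = [rng.randrange(h) for _ in word_pairs]            # current topic of each pair
    n_pair = [0] * h                                      # pairs assigned to each topic
    n_word = [defaultdict(int) for _ in range(h)]         # word counts per topic
    n_total = [0] * h                                     # total word count per topic
    for (w1, w2), k in zip(word_pairs, z):
        n_pair[k] += 1
        n_word[k][w1] += 1; n_word[k][w2] += 1; n_total[k] += 2
    for _ in range(max_iter):                             # S42-S44 iteration loop
        for v, (w1, w2) in enumerate(word_pairs):
            k = z[v]                                       # remove word pair q_v from the counts
            n_pair[k] -= 1; n_word[k][w1] -= 1; n_word[k][w2] -= 1; n_total[k] -= 2
            weights = []
            for t in range(h):                             # S43: conditional topic probability
                p = (n_pair[t] + alpha) \
                    * (n_word[t][w1] + beta) * (n_word[t][w2] + beta) \
                    / ((n_total[t] + len(vocab) * beta) ** 2)
                weights.append(p)
            k = rng.choices(range(h), weights=weights)[0]  # resample the topic of q_v
            z[v] = k
            n_pair[k] += 1; n_word[k][w1] += 1; n_word[k][w2] += 1; n_total[k] += 2
    return z, n_pair, n_word

if __name__ == "__main__":
    pairs = [("speech", "model"), ("speech", "lesson"), ("game", "score"), ("game", "play")]
    z, n_pair, _ = gibbs_word_pair_topics(pairs, h=2, max_iter=20)
    print(z, n_pair)

After sampling, the topic of a whole recognition result can then be taken as the topic category with the highest aggregated probability, weighted by how often each word pair occurs in the recognized text, in the spirit of the formula described after S44.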
2. The method for learning and monitoring the intelligent earphone applied to the learning field as claimed in claim 1, wherein the step S1 of collecting the intelligent earphone audio signal and preprocessing the collected intelligent earphone audio signal comprises the following steps:
the audio signal in the intelligent earphone is acquired by utilizing a signal acquisition device arranged in the intelligent earphone, the acquired audio signal of the intelligent earphone is preprocessed, and the preprocessing flow is as follows:
The high-pass filter is arranged in the signal acquisition device. If the frequency of the acquired signal is higher than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is small and the signal is not attenuated; if the frequency of the acquired signal is lower than the cut-off frequency of the high-pass filter, the capacitive reactance of the capacitive element is large and the signal is attenuated, and the lower the frequency of the acquired signal, the higher the degree of attenuation. The audio signal passing through the high-pass filter is x(t), wherein t ∈ [t_0, t_L] represents the time sequence information of the acquired audio signal, t_0 represents the initial time of audio signal acquisition, and t_L represents the cut-off time of audio signal acquisition;
the calculation formula of the cut-off frequency f_t is as follows:
wherein:
R represents the resistance value of the resistor in the high-pass filter;
C represents the capacitance of the capacitive element in the high-pass filter;
the digital-to-analog converter in the signal receiving device converts the collected audio signal x (t) into an analog signal, and the conversion flow of the analog signal is as follows:
construction frequency f s Impulse sequence p (t):
the audio signal x(t) is converted into the analog signal based on the impulse sequence p(t);
the number of sampling points in the analog signal is N, so that the first sampling point of the analog signal is x(1) and the last sampling point is x(N); the analog signal is then x(n′), n′ ∈ [1, N], where x(n′) is the converted analog signal;
the signal acquisition device performs windowing processing on the converted analog signal x (n'), wherein a window function of the windowing processing is a Hamming window, and a windowing processing formula is as follows:
x(n″)=x(n′)×w(n′)
wherein:
w (n') is a window function;
a represents a window function coefficient, which is set to 0.43;
x (n ") is the windowed audio signal, and x (n") is taken as the pre-processed smart earphone audio signal.
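As a reading aid for claim 2 only, the following Python sketch walks through the preprocessing chain in simplified form: the cut-off frequency of a first-order RC high-pass filter, idealised sampling of the captured sequence, and a Hamming-type window using the coefficient 0.43 named above. The claim's own cut-off, impulse-sequence and window formulas are images not reproduced in this text, so the concrete formulas below (including f_t = 1/(2πRC) and the window shape) are standard textbook forms assumed for illustration, and all names are hypothetical.

import math

def cutoff_frequency(R, C):
    # Standard first-order RC high-pass cut-off, f_t = 1 / (2*pi*R*C).
    return 1.0 / (2.0 * math.pi * R * C)

def sample_signal(x_t, step):
    # Idealised impulse-train sampling: keep every step-th value of the
    # captured sequence x(t), giving the N sampling points x(1)..x(N).
    return x_t[::step]

def hamming_like_window(N, a=0.43):
    # Generalised Hamming-type window using the coefficient a = 0.43 named in
    # the claim; the claim's exact window formula is not reproduced here, so
    # this particular form is an assumption.
    return [a - (1.0 - a) * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def preprocess(x_t, step=4, a=0.43):
    x_sampled = sample_signal(x_t, step)
    w = hamming_like_window(len(x_sampled), a)
    return [s * wn for s, wn in zip(x_sampled, w)]   # x(n'') = x(n') * w(n')

if __name__ == "__main__":
    raw = [math.sin(0.05 * t) for t in range(400)]   # stand-in for the earphone signal
    print(cutoff_frequency(R=1.0e4, C=1.0e-7))        # about 159 Hz
    print(preprocess(raw)[:5])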
3. The learning monitoring method for intelligent headphones applied to the learning field according to claim 2, wherein the step S1 of extracting features of the preprocessed intelligent headphones audio signal to obtain extracted audio signal features comprises:
performing feature extraction on the preprocessed intelligent earphone audio signal x(n″) to obtain the extracted audio signal feature F(x(n″)), wherein the feature extraction flow is as follows:
S11: performing fast Fourier transform processing on the preprocessed intelligent earphone audio signal x(n″) to obtain the frequency domain characteristics of x(n″), wherein the fast Fourier transform processing formula is as follows:
wherein:
X(k) represents the result of the fast Fourier transform of x(n″) at point k;
K is the number of points of the fast Fourier transform, and j represents the imaginary unit, j² = −1;
S12: constructing a filter bank with M filters, wherein the frequency response of the m-th filter in the filter bank is f_m(k):
wherein:
f(m) represents the center frequency of the m-th filter;
s13: inputting the fast fourier transform result into a filter bank, the filter bank outputting energy of a signal frequency domain:
wherein:
en represents the energy of the intelligent earphone audio signal in the frequency domain;
s14: extracting characteristics of an intelligent earphone audio signal, wherein the characteristic extraction formula is as follows:
wherein:
L′ represents the feature order, L′ = 1, 2, ..., 12;
C(L′) represents the extracted L′-th order feature;
the feature that maximizes C(L′) is selected as the extracted intelligent earphone audio signal feature F(x(n″)).
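For claim 3, the following Python sketch shows a conventional cepstral feature chain consistent with the quantities defined above (a K-point Fourier transform, a bank of M filters, frequency-domain energies En, and twelve cepstral coefficients C(L′)). The claim's filter response and feature formulas are images not reproduced in this text, so the triangular filter shape and the log-energy DCT used below are standard MFCC-style assumptions, not the patent's exact expressions, and all names are hypothetical.

import math

def dft(x):
    # K-point discrete Fourier transform of the windowed signal x(n'');
    # a direct DFT is used here for clarity instead of a fast implementation.
    K = len(x)
    return [sum(x[n] * complex(math.cos(-2 * math.pi * k * n / K),
                               math.sin(-2 * math.pi * k * n / K))
                for n in range(K)) for k in range(K)]

def triangular_filterbank(K, M):
    # M triangular filters spread over the K/2 usable frequency bins; the
    # claim's exact frequency response f_m(k) is not reproduced here, so a
    # plain triangular (mel-style) response is assumed.
    edges = [int((K // 2) * (m / (M + 1))) for m in range(M + 2)]
    bank = []
    for m in range(1, M + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        filt = [0.0] * (K // 2)
        for k in range(lo, mid):
            filt[k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            filt[k] = (hi - k) / max(hi - mid, 1)
        bank.append(filt)
    return bank

def cepstral_features(x, M=26, orders=12):
    X = dft(x)
    power = [abs(v) ** 2 for v in X[:len(x) // 2]]
    energies = [max(sum(p * f for p, f in zip(power, filt)), 1e-10)
                for filt in triangular_filterbank(len(x), M)]
    log_en = [math.log(e) for e in energies]
    # DCT of the log filter-bank energies gives the coefficients C(L'), L' = 1..12.
    return [sum(log_en[m] * math.cos(math.pi * L * (m + 0.5) / M) for m in range(M))
            for L in range(1, orders + 1)]

if __name__ == "__main__":
    frame = [math.sin(0.3 * n) for n in range(256)]    # stand-in windowed frame
    C = cepstral_features(frame)
    print(max(range(len(C)), key=lambda i: C[i]) + 1)  # order that maximises C(L')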
4. The learning monitoring method for intelligent headphones applied to the learning field as claimed in claim 1, wherein the step S2 of constructing an improved end-to-end speech recognition model comprises:
an improved end-to-end voice recognition model is constructed, wherein the input of the model is the extracted audio signal features and the output is a voice recognition result; the improved end-to-end voice recognition model comprises a convolution layer, an encoding layer and a decoding layer, wherein the convolution layer is used for receiving the audio signal features to be recognized and carrying out convolution processing on them, the encoding layer is used for encoding the convolved features by using gated recurrent units, and the decoding layer is used for decoding the encoding result, the decoding result being the voice recognition result; the process of carrying out voice recognition on the audio signal features by using the end-to-end voice recognition model comprises the following steps:
S21: the convolution layer carries out convolution processing on the input features, and the convolution processing formula is as follows:
F1=Conv(F)
wherein:
F represents the audio signal features input to the end-to-end speech recognition model;
Conv(·) represents a convolution operation;
F1 represents the convolution feature;
S22: the convolution feature is input into an encoding layer based on gated recurrent units; the activation function of the encoding layer is GeLU, the encoding layer comprises 5 gated recurrent units, each gated recurrent unit is set to be bidirectionally recursive, and the output of the last gated recurrent unit is used as the encoding result of the audio signal feature;
S23: the decoding layer uses a softmax function to output the probability P(y|x) of a decoding result y corresponding to each segment of encoding result x, and selects the y with the largest probability as the decoding result of the encoding result x, the decoding result being the voice recognition result;
the training objective function of the end-to-end voice recognition model is as follows:
wherein:
N_x represents the number of characters of the real voice text;
a_{1,x} represents the number of substituted characters, a_{2,x} represents the number of deleted characters, and a_{3,x} represents the number of inserted characters in the speech recognition result output by the model;
the extracted audio signal features F(x(n″)) are input into the model to obtain a speech recognition result y(x(n″)).
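For claim 4, the following PyTorch-style sketch mirrors the named architecture: a convolution layer, an encoding layer of five bidirectional gated recurrent units, and a softmax decoding layer. The hidden sizes, the character vocabulary size, and the placement of the GeLU activation (a framework GRU exposes no configurable internal activation, so GeLU is applied to the convolution output here) are illustrative assumptions, as is the plain character-error-rate quantity shown for the training objective, whose exact formula is an image not reproduced in this text.

import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    # Sketch of the claimed architecture: convolution layer -> 5 bidirectional
    # GRU layers -> softmax decoding layer. Feature, hidden and vocabulary
    # sizes are illustrative assumptions, not taken from the patent.
    def __init__(self, feat_dim=12, hidden=128, vocab_size=4000):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.gru = nn.GRU(hidden, hidden, num_layers=5,
                          bidirectional=True, batch_first=True)
        self.decoder = nn.Linear(2 * hidden, vocab_size)

    def forward(self, features):                               # (batch, time, feat_dim)
        f1 = self.act(self.conv(features.transpose(1, 2)))     # F1 = Conv(F)
        encoded, _ = self.gru(f1.transpose(1, 2))              # encoding layer output
        logits = self.decoder(encoded)
        return torch.log_softmax(logits, dim=-1)               # P(y|x); argmax gives y

def character_error_rate(substitutions, deletions, insertions, n_chars):
    # The claim's objective counts substituted, deleted and inserted characters
    # against the N_x characters of the reference text; a plain character error
    # rate over those counts is assumed here.
    return (substitutions + deletions + insertions) / max(n_chars, 1)

if __name__ == "__main__":
    model = EndToEndASR()
    x = torch.randn(2, 100, 12)                # two utterances, 100 frames, 12 features
    print(model(x).shape)                      # torch.Size([2, 100, 4000])
    print(character_error_rate(3, 1, 2, 50))   # 0.12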
5. The learning monitoring method of the intelligent earphone applied to the learning field as claimed in claim 1, wherein the step S3 of performing word pair conversion on the recognized voice result to obtain the converted word pair includes:
constructing corpora of different topic categories, wherein each corpus contains all words under its topic category; combining the constructed corpora of the different topic categories into a corpus A, where corpus A = {topic_1, topic_2, ..., topic_h} and topic_h is the corpus of the h-th topic category; encoding all words in the corpus A by a one-hot encoding method according to the order of the words in the corpus A to obtain an encoding result set B of the corpus A;
based on the word encoding results of the corpus A, word pair conversion is carried out on the recognized voice result y(x(n″)), wherein the word pair conversion flow is as follows:
S31: the corpus A is used as a word segmentation dictionary, and the maximum length of the words in the word segmentation dictionary is calculated as z; the first z characters of the voice recognition result are intercepted and matched against the words in the word segmentation dictionary; if the matching is successful, the matching result is used as the word segmentation result of the intercepted text, and these steps are repeated on the remaining text of the voice recognition result to obtain the word segmentation results of the voice recognition result, wherein w_{n_w} denotes the n_w-th word of the voice recognition result and n_w represents the total number of words of the voice recognition result; if the matching is unsuccessful, the first z−1 characters are intercepted and the matching is carried out again;
S32: any two word segmentation results w_n1, w_n2 form a group of word pairs (w_n1, w_n2), wherein n1 > n2 and n1, n2 ∈ [1, n_w];
S33: the converted word pair set of the voice recognition result is obtained as G = {(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
6. The learning monitoring method for the intelligent earphone applied to the learning field according to claim 5, wherein the step S3 of calculating the word pair compactness of the converted word pair comprises:
traversing the encoding result set B to select the encoding of each word segmentation result, the encoding result set of the speech recognition result is obtained, wherein each element is the encoding result of the corresponding word segmentation result; the encoding result set of the word pairs of the voice recognition result is then obtained as G_b = {(b_n1, b_n2) | n1 > n2, n1, n2 ∈ [1, n_w]};
Calculating word pair compactness of the converted word pairs in the voice recognition result:
wherein:
T represents the transpose;
the word pair compactness set of the voice recognition result is: G_d = {dis(w_n1, w_n2) | n1 > n2, n1, n2 ∈ [1, n_w]}.
7. The learning monitoring method of the intelligent earphone applied to the learning field as claimed in claim 1, wherein the step S5 of performing similarity calculation on the identified voice topic and the current teaching content topic, in which a similarity calculation result smaller than a specified threshold indicates that the voice topic in the earphone differs from the teaching content topic and that the student is not listening to the class, and otherwise that the student is listening to the class, comprises:
performing similarity calculation on the identified voice topic and the current teaching content topic, wherein the similarity calculation method is a cosine similarity algorithm; if the similarity calculation result is smaller than the specified threshold, the voice topic in the earphone differs from the teaching content topic, indicating that the student is not listening to the class; otherwise, the student is listening to the class.
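For claim 7, the following Python sketch applies plain cosine similarity and the threshold test described above. The claim does not state which vector representation of the two topics is compared, so the topic-probability vectors and the threshold value used here are assumptions for illustration only.

import math

def cosine_similarity(u, v):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_listening(speech_topic_vec, teaching_topic_vec, threshold=0.6):
    # S5: below the threshold, the earphone topic differs from the teaching
    # topic, which the patent treats as the student not listening to the class.
    return cosine_similarity(speech_topic_vec, teaching_topic_vec) >= threshold

if __name__ == "__main__":
    # Toy topic-probability vectors over h = 4 topic categories; both the vector
    # representation and the threshold are assumptions, not the patent's values.
    speech = [0.70, 0.10, 0.15, 0.05]
    lesson = [0.65, 0.20, 0.10, 0.05]
    print(is_listening(speech, lesson))   # True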
8. A learning monitoring device of an intelligent earphone applied to the learning field, characterized in that the device comprises:
the signal acquisition device is used for acquiring the intelligent earphone audio signals, preprocessing the acquired intelligent earphone audio signals, and extracting the characteristics of the preprocessed intelligent earphone audio signals to obtain the extracted audio signal characteristics;
the voice recognition module is used for constructing an improved end-to-end voice recognition model, inputting the extracted audio signal characteristics into the model, and obtaining a voice recognition result;
the recognition monitoring device is used for carrying out word pair conversion on the recognized voice result, calculating the word pair compactness of the converted word pairs, constructing a dynamic weighted voice topic recognition model based on the word pair compactness, inputting the word pair compactness of the voice recognition result into the model, and obtaining the voice topic output by the model; it then performs similarity calculation on the recognized voice topic and the current teaching content topic, and if the similarity calculation result is smaller than a specified threshold, the voice topic in the earphone differs from the teaching content topic, indicating that the student is not listening to the class, and otherwise that the student is listening to the class, so as to realize the learning monitoring method of the intelligent earphone applied to the learning field according to any one of claims 1-7.
CN202210994128.9A 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field Active CN115376499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210994128.9A CN115376499B (en) 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210994128.9A CN115376499B (en) 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field

Publications (2)

Publication Number Publication Date
CN115376499A CN115376499A (en) 2022-11-22
CN115376499B true CN115376499B (en) 2023-07-28

Family

ID=84064840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210994128.9A Active CN115376499B (en) 2022-08-18 2022-08-18 Learning monitoring method of intelligent earphone applied to learning field

Country Status (1)

Country Link
CN (1) CN115376499B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (en) * 2009-07-21 2009-12-23 北京邮电大学 A kind of speech recognition semantic confidence feature extracting methods and device
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680498A (en) * 2012-09-26 2014-03-26 华为技术有限公司 Speech recognition method and speech recognition equipment
CN108836362A (en) * 2018-04-03 2018-11-20 江苏理工学院 It is a kind of for supervising the household monitoring apparatus and its measure of supervision of child's learning state
CN110415697A (en) * 2019-08-29 2019-11-05 的卢技术有限公司 A kind of vehicle-mounted voice control method and its system based on deep learning
CN111105796A (en) * 2019-12-18 2020-05-05 杭州智芯科微电子科技有限公司 Wireless earphone control device and control method, and voice control setting method and system
KR102228575B1 (en) * 2020-11-06 2021-03-17 (주)디라직 Headset system capable of monitoring body temperature and online learning
CN112863518B (en) * 2021-01-29 2024-01-09 深圳前海微众银行股份有限公司 Method and device for recognizing voice data subject
CN113887883A (en) * 2021-09-13 2022-01-04 淮阴工学院 Course teaching evaluation implementation method based on voice recognition technology
CN114662484A (en) * 2022-03-16 2022-06-24 平安科技(深圳)有限公司 Semantic recognition method and device, electronic equipment and readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (en) * 2009-07-21 2009-12-23 北京邮电大学 A kind of speech recognition semantic confidence feature extracting methods and device
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Chinese Speech Recognition Model Based on Deep Learning; Yang Huanzheng; Journal of Hunan Post and Telecommunication College (Issue 03); pp. 24-27 *

Also Published As

Publication number Publication date
CN115376499A (en) 2022-11-22

Similar Documents

Publication Publication Date Title
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
CN111915148B (en) Classroom teaching evaluation method and system based on information technology
CN107193973A (en) The field recognition methods of semanteme parsing information and device, equipment and computer-readable recording medium
CN110600033A (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN111651497A (en) User label mining method and device, storage medium and electronic equipment
CN108962231A (en) A kind of method of speech classification, device, server and storage medium
CN109800309A (en) Classroom Discourse genre classification methods and device
CN113807103A (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
CN107195312B (en) Method and device for determining emotion releasing mode, terminal equipment and storage medium
CN114722199A (en) Risk identification method and device based on call recording, computer equipment and medium
CN112489628B (en) Voice data selection method and device, electronic equipment and storage medium
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN115376499B (en) Learning monitoring method of intelligent earphone applied to learning field
CN115240696B (en) Speech recognition method and readable storage medium
CN113658596A (en) Semantic identification method and semantic identification device
CN113314099B (en) Method and device for determining confidence coefficient of speech recognition
CN113555026B (en) Voice conversion method, device, electronic equipment and medium
CN115050350A (en) Label checking method and related device, electronic equipment and storage medium
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TA01 Transfer of patent application right

Effective date of registration: 20230726

Address after: Room 102, Building 1, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province, 523000

Applicant after: DONGGUAN JIEXUN ELECTRONIC TECHNOLOGY Co.,Ltd.

Address before: 523000 Building 2, No. 16, Yunlian Fifth Street, Dalang Town, Dongguan City, Guangdong Province

Applicant before: Dongguan Leyi Electronic Technology Co.,Ltd.

TA01 Transfer of patent application right