CN115641763B - Memory recitation auxiliary system - Google Patents

Memory recitation auxiliary system

Info

Publication number: CN115641763B
Authority: CN (China)
Prior art keywords: voice, screen, model, memory, color
Legal status: Active (granted)
Application number: CN202211106551.7A
Other languages: Chinese (zh)
Other versions: CN115641763A
Inventors: 张鹏, 钟亚玲
Current Assignee: Zhongnan Xunzhi Technology Co., Ltd.
Original Assignee: Zhongnan Xunzhi Technology Co., Ltd.
Application filed by Zhongnan Xunzhi Technology Co., Ltd.; priority to CN202211106551.7A
Publication of application CN115641763A; application granted; publication of grant CN115641763B

Abstract

The invention discloses a memory recitation assisting system comprising a seat, characterized in that: the seat is fixedly connected with a group of rear loudspeakers; the seat is fixedly connected with symmetrical square plates; the square plates are fixedly connected with a group of ear-side loudspeakers; the seat is fixedly connected with an L-shaped plate; the L-shaped plate is fixedly connected with a steering engine; an output shaft of the steering engine is fixedly connected with a circular plate; the circular plate is fixedly connected with a group of circular rods; and the circular rods are fixedly connected with a curved screen. The invention relates to the field of memory auxiliary equipment, in particular to a memory recitation auxiliary system, and aims to provide a memory recitation assisting system that conveniently assists memorization and recitation.

Description

Memory recitation auxiliary system
Technical Field
The invention relates to the field of memory auxiliary equipment, in particular to a memory recitation auxiliary system.
Background
Recitation is a special method of memorization. It requires a person to commit objective material to memory in a fixed order, without dividing it into major and minor parts. Recitation generally takes two forms: mechanical recitation and comprehension-based recitation. Mechanical recitation is a method of memorizing simply by repeating the words without understanding the meaning of the material; it relies on mechanical memory. In general, young children (e.g., pupils) find it easy to recite in this way even though they do not understand the meaning of the material, because their mechanical memory is well developed.
At present, there is a lack of devices that combine text display with voice prompts to assist a user in memorization.
Disclosure of Invention
The invention aims to provide a memory recitation assisting system that conveniently assists memorization and recitation.
The invention adopts the following technical scheme to realize the aim of the invention:
a memory recitation assistance system comprising a seat, characterized in that: the seat 1 is fixedly connected with a group of rear loudspeakers 2, the seat 1 is fixedly connected with symmetrical square plates 3, the square plates 3 are fixedly connected with a group of ear-side loudspeakers 4, the seat 1 is fixedly connected with an L-shaped plate 9, the L-shaped plate 9 is fixedly connected with a steering engine 8, an output shaft of the steering engine 8 is fixedly connected with a circular plate 7, the circular plate 7 is fixedly connected with a group of circular rods 6, and the circular rods 6 are fixedly connected with a curved screen 5;
the method also comprises the following steps:
Step one: the things to be recited or memorized, such as words, are displayed at varying positions on the curved screen 5, and the steering engine 8 is controlled to rotate so that the curved screen 5 swings and the positions of the things to be recited or memorized change continuously;
Step two: word positions are identified by a convolutional neural network (CNN) algorithm;
To recognize the word positions with the CNN, the one-stage target detection algorithm YOLO is selected: it needs no region-proposal stage, directly generates the class probability and position coordinates of an object, and obtains the final detection result in a single pass, so it offers high detection speed and efficiency. The YOLOv4 algorithm keeps the original YOLO detection architecture while adopting the best optimization strategies from the CNN field in recent years, with improvements of varying degree in data processing, the backbone network, network training, the activation function, the loss function and so on, giving better accuracy and efficiency. The network structure of YOLOv4 can be divided into the Input, Backbone, Neck and Head modules;
Step 2-1: Input performs data augmentation on the input data, and Mosaic data augmentation is adopted. Mosaic evolved from CutMix: CutMix combines two pictures, while Mosaic splices four pictures that are randomly scaled, randomly cropped and randomly arranged, so a very rich data set is obtained at once; the different words appearing at different positions of the screen are used as the input data (a minimal sketch of the splicing step is given below);
Step 2-2: the Backbone is upgraded in YOLOv4 to CSPDarknet53. CSPNet stands for Cross Stage Partial Networks, i.e. a cross-stage partial network. CSPNet addresses the problem of duplicated gradient information during network optimization in the backbones of other large convolutional neural network frameworks, integrating the gradient changes into the feature map from beginning to end; this reduces the parameter count and FLOPS of the model, guarantees inference speed and accuracy, and reduces the model size. CSPNet is in fact based on the idea of DenseNet: it copies the feature map of the base layer and sends one copy through the dense block to the next stage, thereby separating out the feature map of the base layer. This effectively alleviates the vanishing-gradient problem (the problem that the lost signal is hard to back-propagate through a very deep network), supports feature propagation, encourages the network to reuse features, and reduces the number of network parameters;
step two, three: mish is chosen here as the activation function, which is a very similar activation function to ReLU and Swish, the formula is as follows:
y=x*tanh(ln(1+ex))
the Mish function is a smooth curve, which allows better information to go deep into the neural network, thereby obtaining better accuracy and generalization; not completely truncated at negative values, allowing a relatively small negative gradient inflow;
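A minimal NumPy sketch of the Mish activation defined by the formula above (illustrative only):

    import numpy as np

    def mish(x):
        # y = x * tanh(ln(1 + e^x)); log1p(exp(x)) is the softplus term
        return x * np.tanh(np.log1p(np.exp(x)))

    print(mish(np.array([-2.0, 0.0, 2.0])))   # small negative inputs keep a small negative output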
Step 2-4: the Neck of YOLOv4 uses PANet (Path Aggregation Network) instead of FPN for parameter aggregation, so as to suit target detection at different scales. The fusion method used in the original PANet is addition; the YOLOv4 algorithm changes the fusion method from addition to concatenation, another way of fusing feature maps;
Step 2-5: the final YOLO prediction layers of YOLOv3 are reused in YOLOv4; after passing through the Neck described above, three YOLO layers are obtained:
(1) first YOLO layer: feature map 76x76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;
(2) second YOLO layer: feature map 38x38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;
(3) third YOLO layer: feature map 19x19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;
Step 2-6: YOLOv4 also innovates on the bounding-box regression loss, adopting CIoU loss for regression prediction so that the prediction boxes are obtained faster and more accurately (a sketch of the CIoU loss is given below);
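For illustration, a minimal Python sketch of the CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2); this is a generic formulation of CIoU, not code taken from YOLOv4:

    import math

    def ciou_loss(box1, box2):
        # CIoU = IoU - d^2/c^2 - alpha*v ; the loss is 1 - CIoU.
        x1, y1, x2, y2 = box1
        X1, Y1, X2, Y2 = box2
        iw = max(0.0, min(x2, X2) - max(x1, X1))
        ih = max(0.0, min(y2, Y2) - max(y1, Y1))
        inter = iw * ih
        area1 = (x2 - x1) * (y2 - y1)
        area2 = (X2 - X1) * (Y2 - Y1)
        iou = inter / (area1 + area2 - inter + 1e-9)
        # squared distance between the two box centres
        d2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
        # squared diagonal of the smallest enclosing box
        c2 = (max(x2, X2) - min(x1, X1)) ** 2 + (max(y2, Y2) - min(y1, Y1)) ** 2 + 1e-9
        # aspect-ratio consistency term
        v = (4 / math.pi ** 2) * (math.atan((X2 - X1) / (Y2 - Y1)) - math.atan((x2 - x1) / (y2 - y1))) ** 2
        alpha = v / (1 - iou + v + 1e-9)
        return 1 - (iou - d2 / c2 - alpha * v)

    print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))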
step two, seven: training a model, wherein a training data set under dark is COCO, firstly, making a data set, creating a yolov4 folder at the beginning of training, adding yolov4.Cfg, coco.data and coco.names, creating a backup folder under the yolov4 folder for storing intermediate weights, and beginning to execute training instructions-/darknet detector train;
Step two, eight: testing the trained model to obtain a detection effect, outputting the position of a word in a screen, namely (x, y) coordinates, carrying out iterative training for multiple times according to errors, and improving the precision of the model;
step three: the rear side loudspeaker (2) and the ear side loudspeaker (4) respectively emit sounds with different decibels, the sounds heard by the left ear and the right ear are overlapped with the eye seeing positions, the viewing and the position are combined to deepen the memory, and the viewing and the position of the changing position are combined to concentrate the attention at the same position.
As a further limitation of the present technical solution, the method further comprises the following steps:
step four: a recording device is arranged near the mouth of a user, and the voice of the user is recognized through a voice recognition algorithm;
The speech recognition algorithm converts a segment of the speech signal into the corresponding text information; the main flow of the system consists of four parts: feature extraction -> acoustic model -> language model -> dictionary and decoding;
Step 4-1: preprocessing. To extract features more effectively, the collected sound signal often needs audio preprocessing such as filtering and framing, so that the audio signal to be analyzed is properly extracted from the original signal;
The silence at the head and tail is cut off to reduce interference with the subsequent steps; this silence-removal operation is generally called VAD. The sound is then divided into frames, i.e. cut into small segments, each of which is called a frame; framing is implemented with a moving window function, and the frames are not cut apart abruptly but generally overlap;
(1) The codec addresses frequency aliasing: before the analog signal is discretized and sampled, a low-pass filter removes the frequency components above 1/2 of the sampling frequency. In the design of an actual instrument, the cut-off frequency (fc) of the low-pass filter is:
cut-off frequency (fc) = sampling frequency (fs) / 2.56
(2) Pre-emphasis emphasizes the high-frequency part of the speech, removes the influence of lip radiation and increases the high-frequency resolution of the speech. Above roughly 800 Hz the spectrum falls off at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component; for this reason the high-frequency part of the speech signal is boosted before analysis;
Pre-emphasis is typically implemented with a high-pass digital filter as the transfer function, where a is the pre-emphasis coefficient and 0.9 < a < 1.0. Let the speech sample value at time n be x(n); the result after pre-emphasis is y(n) = x(n) - a*x(n-1), with a = 0.97 usually taken, and the transfer function is
H(z) = 1 - a*z^(-1)
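A minimal NumPy sketch of the pre-emphasis filter y(n) = x(n) - a*x(n-1) with a = 0.97 (illustrative only; the first sample is passed through unchanged):

    import numpy as np

    def pre_emphasis(x, a=0.97):
        # y(n) = x(n) - a * x(n-1), i.e. filtering with H(z) = 1 - a*z^(-1)
        x = np.asarray(x, dtype=float)
        y = np.empty_like(x)
        y[0] = x[0]
        y[1:] = x[1:] - a * x[:-1]
        return y

    signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 100))
    print(pre_emphasis(signal)[:5])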
(3) Endpoint detection, also called voice activity detection (Voice Activity Detection, VAD), aims to distinguish speech regions from non-speech regions. Its purpose is to accurately locate the start point and end point of speech within noisy speech, remove the silent parts and the noise parts, and find the genuinely valid content of a piece of speech;
VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as a classifier, and model-based VAD;
Step 4-2: feature extraction converts the sound signal from the time domain to the frequency domain and provides suitable feature vectors for the acoustic model. The main algorithms include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC); the purpose is to turn each frame of the waveform into a multidimensional vector containing the sound information (a minimal sketch is given below);
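A minimal sketch of this step, assuming the librosa library is available; the 16 kHz rate, 25 ms window, 10 ms hop and 13 coefficients are common choices, not values fixed by the patent:

    import numpy as np
    import librosa

    sr = 16000
    y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # stand-in for one second of microphone input
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    print(mfcc.shape)   # (13, number_of_frames): one 13-dimensional vector per frame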
Step 4-3: the acoustic model AM computes, according to acoustic characteristics, the score of each feature vector on the acoustic features. It is obtained by training on speech data; its input is the feature vectors and its output is phoneme information;
Step 4-4: the language model LM computes the probability of the phrase sequences that may correspond to the sound signal according to linguistic theory; the probabilities of associations between individual characters or words are obtained by training on a large amount of text;
Step 4-5: dictionary: the correspondence between words (or characters) and phonemes; simply put, for Chinese it is the correspondence between pinyin and Chinese characters, and for English it is the correspondence between phonetic transcription and words;
Step 4-6: decoding: the audio data whose features have been extracted is converted into text output through the acoustic model and the dictionary;
Step five: the speech is compared with the memory content played on the screen, and recitation content that differs greatly is played multiple times. Content spoken relatively quietly (where the user's voice drops) is judged to be memorized poorly and is likewise played multiple times. During playback a combination of sound and silence is chosen, and recording and judging take place both while sound is played and during the silent parts;
Based on a threshold, the result of the speech recognition is compared with the content shown on the screen, the difference is computed and compared with the threshold to judge how much the content differs, and the content is played multiple times when the degree of difference exceeds the threshold; a threshold t, i.e. the acceptable degree of difference, is determined.
As a further limitation of the present technical solution, according to the language model LM in step four, the probability p of the association between the input sound signal and the word is calculated, with the word content on the screen itself counting as 1. Here the threshold t is set to 0.3 and y = 1 - p; when y is less than or equal to 0.3, the difference is small and the content is considered not to differ; when y is greater than 0.3, the difference is considered large (a minimal sketch of this rule is given below).
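A minimal sketch of this decision rule (illustrative only; p is the association probability from the language model, and the helper name need_replay is an assumption):

    def need_replay(p, t=0.3):
        # y = 1 - p measures the difference from the on-screen content (which counts as 1)
        y = 1 - p
        return y > t          # replay the content when the difference is considered large

    print(need_replay(0.9))   # False: y = 0.1 <= 0.3, no meaningful difference
    print(need_replay(0.5))   # True:  y = 0.5 >  0.3, play the content multiple times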
As a further limitation of the present technical solution, the method further comprises the following steps: Step six: a colored background is generated on the curved screen, the color matching the memory content as far as possible, to help deepen the memory;
To generate a colored background on the curved screen, the color model on which the algorithm is based must first be selected; a color model is a mathematical model used to represent colors;
Step 6-1: different color models suit different application scenarios; the RGB model is suitable for devices such as displays, so the RGB color model is selected to generate the representation of the different colors. RGB, commonly known as the three primary colors red (R), green (G) and blue (B), is the most widely used color model;
Step 6-2: in ordinary development and display work, colors are processed with this model, for example rgb(255, 0, 0) is red, rgb(0, 255, 0) is green and rgb(0, 0, 255) is blue. By varying the three red, green and blue color channels at different intensities and mixing them together, different colors are displayed. This is an additive color-mixing model: during superposition the brightness equals the sum of the brightnesses of the mixed colors, and the more that is mixed, the higher the brightness;
Step 6-3: a 24-bit type is selected, i.e. each color channel R, G, B is represented by 8 bits of data, 2^8 = 256, so each channel can represent a color value in the range 0-255, darkest at 0 and brightest at 255;
Step 6-4: according to the correlation between the memory content and the color, the color of the screen is determined, the color values are looked up to obtain the specific RGB value of each channel, and the screen color is set (a minimal sketch is given below).
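A minimal NumPy sketch of filling the screen with one 24-bit RGB color (the resolution and the example color are illustrative assumptions):

    import numpy as np

    def solid_background(width, height, rgb):
        # 24-bit color: one 8-bit value (0-255) per R, G, B channel
        frame = np.zeros((height, width, 3), dtype=np.uint8)
        frame[:, :] = rgb
        return frame

    bg = solid_background(1920, 720, (70, 130, 180))   # a looked-up RGB value for the current content
    print(bg.shape, bg[0, 0])                          # (720, 1920, 3) [ 70 130 180]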
As a further limitation of the present technical solution, the method further comprises the following steps: Step seven: backgrounds of different shapes are generated, so that adjacent memory contents are conveniently distinguished by their different shapes;
Step 7-1: several different shapes such as rectangles, circles, ellipses and polygons are selected and drawn cyclically in turn, ensuring that adjacent contents never share the same shape;
Step 7-2: the different shapes are drawn according to their parameters. For a rectangle the required parameters are its length and width: the position where the rectangle is to be drawn on the screen is set first, usually the coordinates (x, y) of its upper-left corner; once the length and width are determined, the rectangle is drawn on the screen by setting the pixels starting from (x, y) to specific color values according to the preset color;
Step 7-3: to draw a circle, the position (x, y) of the circle on the screen is determined and taken as the center; according to the radius r, the pixels around one full circle are set to a specific color and the circle is drawn;
Step 7-4: drawing an ellipse requires more parameters: the center (x, y) of the ellipse, the axis lengths (the long radius l and the short radius s) and the deflection angle must be determined, so ellipses at different angles can be drawn;
Step 7-5: to draw a polygon, the number of sides n must first be determined, and the number of vertices d_n is set according to the number of sides (d_n = n); the coordinates of each vertex (d_1, d_2, d_3, ..., d_n) are given randomly, the vertices are connected in turn, and all the pixels are set to a specific color, so polygons can be drawn on the screen (a drawing sketch is given below);
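A minimal drawing sketch using the Pillow library (an assumption; any raster drawing API would do). The rotated ellipse with a deflection angle is not shown, and all coordinates and colors are arbitrary examples:

    from PIL import Image, ImageDraw

    img = Image.new("RGB", (800, 400), (70, 130, 180))        # colored screen background
    draw = ImageDraw.Draw(img)
    draw.rectangle([40, 40, 200, 140], fill=(255, 215, 0))    # rectangle from (x, y) plus length and width
    draw.ellipse([260, 40, 360, 140], fill=(255, 99, 71))     # circle: bounding box of radius r around its centre
    draw.ellipse([400, 60, 560, 120], fill=(144, 238, 144))   # axis-aligned ellipse: long radius l, short radius s
    draw.polygon([(620, 140), (660, 40), (700, 70), (740, 140)], fill=(221, 160, 221))  # n vertices joined in turn
    img.save("background.png")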
The Mel cepstral coefficient extraction process comprises preprocessing, fast Fourier transform, Mel filter bank, logarithm operation, discrete cosine transform, dynamic feature extraction and other steps;
The fast Fourier transform (FFT) is the general name for the efficient, fast computational methods for computing the discrete Fourier transform (DFT) on a computer;
Two propositions of Fourier:
a periodic signal can be represented as a weighted sum of sinusoidal signals in harmonic relation;
a non-periodic signal can be represented as a weighted integral of sinusoidal signals;
Four forms: FS (continuous, periodic signal), FT (continuous, non-periodic signal), DFS (discrete, periodic signal), DTFT (discrete, non-periodic signal). The steps are as follows:
Step 1: the signal x is decomposed into two sub-signals, the even sample-point signal x[2n] and the odd sample-point signal x[2n+1];
Step 2: the two summation terms are understood as two DFTs of length N/2;
Step 3: the specific calculation process of the FFT;
for any k, N multiplications and N-1 additions are performed, so the direct DFT requires N^2 multiplications in total.
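A minimal recursive radix-2 FFT sketch following the even/odd decomposition above (illustrative only; the input length must be a power of two):

    import cmath

    def fft(x):
        # Split into even and odd sample points and combine two DFTs of length N/2.
        n = len(x)
        if n == 1:
            return list(x)
        even = fft(x[0::2])
        odd = fft(x[1::2])
        out = [0j] * n
        for k in range(n // 2):
            w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
            out[k] = even[k] + w
            out[k + n // 2] = even[k] - w
        return out

    print([round(abs(v), 3) for v in fft([1, 0, 1, 0, 1, 0, 1, 0])])   # energy only at bins 0 and 4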
The specific flow of the acoustic model AM comprises the following steps:
(1) The GMM speech recognition model recognizes speech and outputs text information. Each GMM (one for each of 0-9 and "o") is trained with the speech data corresponding to it; at test time, the whole utterance can only be framed, windowed and feature-extracted, after which the likelihood of each frame is computed on every GMM and summed at the end to obtain the final likelihood;
(2) The K-Means algorithm initializes the parameters of the GMM model. For a given sample set, the samples are divided into K clusters according to the distances between them, so that the points within a cluster are connected as tightly as possible while the distance between clusters is as large as possible;
Input: sample set D = {x1, x2, ..., xm}, number of clusters k = 5, maximum number of iterations N;
Output: cluster partition C = {C1, C2, ..., Ck};
1) Randomly select k samples from the data set D as the initial k centroid vectors {μ1, μ2, ..., μk};
2) For n = 1, 2, ..., N:
a) initialize the cluster partition Ct = ∅ for t = 1, 2, ..., k;
b) for i = 1, 2, ..., m, compute the distance d_ij between sample x_i and each centroid vector μ_j (j = 1, 2, ..., k); mark the j with the smallest d_ij as the class λi of x_i, and update C_λi = C_λi ∪ {x_i};
c) for j = 1, 2, ..., k, recompute the new centroid of all the sample points in Cj;
d) if none of the k centroid vectors has changed, go to step 3);
3) Output the cluster partition C = {C1, C2, ..., Ck};
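A minimal NumPy sketch of this K-Means loop (illustrative only; the synthetic sample set and k = 5 are assumptions for the demonstration):

    import numpy as np

    def kmeans(X, k=5, max_iter=100, seed=0):
        # 1) pick k random samples as initial centroids, 2) assign every sample to the
        # nearest centroid, 3) recompute centroids, stop when the centroids stop moving.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(max_iter):
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # m x k distances
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                            for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return labels, centroids

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in range(5)])   # five loose clusters
    labels, centroids = kmeans(X)
    print(np.round(centroids, 2))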
(3) The EM algorithm trains the GMM model. Given a GMM, the optimization goal is to find the mean vector, covariance matrix and mixing coefficient of each Gaussian component that maximize the likelihood function;
Initialize the parameters; E-step: compute the posterior probabilities with the current parameters; M-step: re-estimate the parameters using the posteriors; then recompute the likelihood function and repeat the E and M steps until the convergence condition is satisfied (a minimal sketch is given below).
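A minimal sketch of training one GMM per recognition unit with the EM algorithm, assuming scikit-learn is available; the synthetic 13-dimensional "MFCC frames" stand in for real training speech:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    frames = {"0": rng.normal(0.0, 1.0, (500, 13)),   # stand-in MFCC frames for the unit "0"
              "1": rng.normal(3.0, 1.0, (500, 13))}   # stand-in MFCC frames for the unit "1"

    # EM (fit) estimates the mean vector, covariance and mixing coefficient of each component.
    models = {d: GaussianMixture(n_components=4, covariance_type="diag", max_iter=100).fit(x)
              for d, x in frames.items()}

    test = rng.normal(3.0, 1.0, (80, 13))                       # frames of an unknown utterance
    scores = {d: m.score(test) for d, m in models.items()}      # mean log-likelihood per frame
    print(max(scores, key=scores.get))                          # -> "1"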
Threshold-based VAD: by extracting time-domain features (short-time energy, short-time zero-crossing rate, etc.) or frequency-domain features (MFCC, spectral entropy, etc.) and setting a reasonable threshold, speech is distinguished from non-speech (a minimal sketch is given below).
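A minimal sketch of such a threshold-based VAD using short-time energy only (the frame length, hop and the relative threshold of 10% are illustrative assumptions):

    import numpy as np

    def energy_vad(x, frame_len=400, hop=160, ratio=0.1):
        # A frame counts as speech when its short-time energy exceeds a threshold
        # set relative to the loudest frame of the recording.
        frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
        energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
        return energy > ratio * energy.max()

    sr = 16000
    sig = np.concatenate([np.zeros(sr // 2),                                   # leading silence
                          0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr),  # one second of tone
                          np.zeros(sr // 2)])                                  # trailing silence
    v = energy_vad(sig)
    print(v.mean())   # roughly 0.5: only the middle second is flagged as speech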
As a further limitation of the technical scheme, when the words are recited they enter from one side of the curved screen and move to the other side, and the word positions are identified by the CNN convolutional neural network algorithm. When a word rotates to the rear of the curved screen, the screen no longer displays it; the rear loudspeakers and the ear-side loudspeakers then emit sounds of different decibel levels according to the virtual position to which the word has moved, so that the loudness of the sounds simulates the word position and the spatial-position part of the brain is recruited to assist the memory; rotating forward and backward for several revolutions deepens the memory.
As a further limitation of the present solution, the number of rear loudspeakers in a group is at least 2, and the loudspeakers are symmetrically distributed about the symmetry axis of the seat.
As a further limitation of the present technical solution, the number of ear-side loudspeakers in a group is at least 2, and the two groups of ear-side loudspeakers are symmetrically distributed about the symmetry axis of the seat.
Compared with the prior art, the invention has the advantages and positive effects that:
1. With this device, the content to be memorized or recited is fed to the curved screen and displayed at changing positions, and the word position is identified by the CNN convolutional neural network algorithm. When a word rotates to the rear of the curved screen, the screen no longer displays it; the rear loudspeakers and the ear-side loudspeakers emit sounds of different decibel levels according to the virtual position to which the word has moved, the loudness of the sounds simulates the word position, the spatial-position part of the brain is recruited to assist the memory, and rotating forward and backward for several revolutions deepens the memory.
2. The device arranges a recording device near the user's mouth and recognizes the user's speech with a speech recognition algorithm. The speech is compared with the memory content played on the screen; recitation content that differs greatly is played multiple times, and content spoken relatively quietly (where the user's voice drops) is judged to be memorized poorly and is likewise played multiple times. During playback a combination of sound and silence is chosen, and recording and judging take place both while sound is played and during the silent parts.
3. The device is ingeniously designed: the intelligent equipment stimulates the user's brain through vision and sound, generates backgrounds with color and shape to deepen the visual stimulation, and deepens the auditory stimulation by comparing the user's speech with the original content, thereby helping the user to memorize and recite.
Drawings
Fig. 1 is a schematic perspective view of the present invention.
Fig. 2 is a schematic perspective view of a second embodiment of the present invention.
Fig. 3 is a schematic view of a partial perspective structure of the present invention.
Fig. 4 is a schematic diagram of a network structure of YOLOv4 according to the present invention.
Fig. 5 is a schematic diagram of the mel-frequency coefficient extraction process according to the present invention.
FIG. 6 is a color control schematic of the present invention.
In the figure: 1, seat; 2, rear loudspeaker; 3, square plate; 4, ear-side loudspeaker; 5, curved screen; 6, circular rod; 7, circular plate; 8, steering engine; 9, L-shaped plate.
Detailed Description
One embodiment of the present invention will be described in detail below with reference to the attached drawings, but it should be understood that the scope of the present invention is not limited by the embodiment.
The invention comprises a seat 1, wherein the seat 1 is fixedly connected with a group of rear loudspeakers 2, the seat 1 is fixedly connected with symmetrical square plates 3, the square plates 3 are fixedly connected with a group of ear-side loudspeakers 4, the seat 1 is fixedly connected with an L-shaped plate 9, the L-shaped plate 9 is fixedly connected with a steering engine 8, an output shaft of the steering engine 8 is fixedly connected with a circular plate 7, the circular plate 7 is fixedly connected with a group of circular rods 6, and the circular rods 6 are fixedly connected with a curved screen 5;
the method also comprises the following steps:
Step one: the things to be recited or memorized, such as words, are displayed at varying positions on the curved screen 5, and the steering engine 8 is controlled to rotate so that the curved screen 5 swings and the positions of the things to be recited or memorized change continuously;
Step two: word positions are identified by a convolutional neural network (CNN) algorithm;
To recognize the word positions with the CNN, the one-stage target detection algorithm YOLO is selected: it needs no region-proposal stage, directly generates the class probability and position coordinates of an object, and obtains the final detection result in a single pass, so it offers high detection speed and efficiency. The YOLOv4 algorithm keeps the original YOLO detection architecture while adopting the best optimization strategies from the CNN field in recent years, with improvements of varying degree in data processing, the backbone network, network training, the activation function, the loss function and so on, giving better accuracy and efficiency. The network structure of YOLOv4 can be divided into the Input, Backbone, Neck and Head modules;
Step 2-1: Input performs data augmentation on the input data, and Mosaic data augmentation is adopted. Mosaic evolved from CutMix: CutMix combines two pictures, while Mosaic splices four pictures that are randomly scaled, randomly cropped and randomly arranged, so a very rich data set is obtained at once; the different words appearing at different positions of the screen are used as the input data;
Step 2-2: the Backbone is upgraded in YOLOv4 to CSPDarknet53. CSPNet stands for Cross Stage Partial Networks, i.e. a cross-stage partial network. CSPNet addresses the problem of duplicated gradient information during network optimization in the backbones of other large convolutional neural network frameworks, integrating the gradient changes into the feature map from beginning to end; this reduces the parameter count and FLOPS of the model, guarantees inference speed and accuracy, and reduces the model size. CSPNet is in fact based on the idea of DenseNet: it copies the feature map of the base layer and sends one copy through the dense block to the next stage, thereby separating out the feature map of the base layer. This effectively alleviates the vanishing-gradient problem (the problem that the lost signal is hard to back-propagate through a very deep network), supports feature propagation, encourages the network to reuse features, and reduces the number of network parameters;
Step 2-3: Mish is chosen here as the activation function; it is very similar to ReLU and Swish, with the formula:
y = x * tanh(ln(1 + e^x))
The Mish function is a smooth curve, which lets information flow deeper into the neural network and thereby yields better accuracy and generalization; it is not completely truncated at negative values, allowing a small negative gradient to flow through;
Step 2-4: the Neck of YOLOv4 uses PANet (Path Aggregation Network) instead of FPN for parameter aggregation, so as to suit target detection at different scales. The fusion method used in the original PANet is addition; the YOLOv4 algorithm changes the fusion method from addition to concatenation, another way of fusing feature maps;
Step 2-5: the final YOLO prediction layers of YOLOv3 are reused in YOLOv4; after passing through the Neck described above, three YOLO layers are obtained:
(1) first YOLO layer: feature map 76x76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;
(2) second YOLO layer: feature map 38x38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;
(3) third YOLO layer: feature map 19x19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;
Step 2-6: YOLOv4 also innovates on the bounding-box regression loss, adopting CIoU loss for regression prediction so that the prediction boxes are obtained faster and more accurately;
Step 2-7: train the model. The training data set under darknet is COCO. First the data set is prepared; at the start of training a yolov4 folder is created, yolov4.cfg, coco.data and coco.names are added, a backup folder is created under the yolov4 folder for storing intermediate weights, and the training instruction ./darknet detector train is executed;
Step 2-8: test the trained model to obtain the detection effect and output the position of the word on the screen, i.e. its (x, y) coordinates; the training is iterated several times according to the error to improve the accuracy of the model;
Step three: the rear loudspeakers (2) and the ear-side loudspeakers (4) emit sounds of different decibel levels, so that the position heard by the left and right ears coincides with the position seen by the eyes; combining viewing with position deepens the memory, and combining viewing at changing positions with position keeps the attention concentrated on the same spot.
The method also comprises the following steps:
step four: a recording device is arranged near the mouth of a user, and the voice of the user is recognized through a voice recognition algorithm;
The speech recognition algorithm converts a segment of the speech signal into the corresponding text information; the main flow of the system consists of four parts: feature extraction -> acoustic model -> language model -> dictionary and decoding;
Step 4-1: preprocessing. To extract features more effectively, the collected sound signal often needs audio preprocessing such as filtering and framing, so that the audio signal to be analyzed is properly extracted from the original signal;
The silence at the head and tail is cut off to reduce interference with the subsequent steps; this silence-removal operation is generally called VAD. The sound is then divided into frames, i.e. cut into small segments, each of which is called a frame; framing is implemented with a moving window function, and the frames are not cut apart abruptly but generally overlap;
(1) The codec addresses frequency aliasing: before the analog signal is discretized and sampled, a low-pass filter removes the frequency components above 1/2 of the sampling frequency. In the design of an actual instrument, the cut-off frequency (fc) of the low-pass filter is:
cut-off frequency (fc) = sampling frequency (fs) / 2.56
(2) Pre-emphasis emphasizes the high-frequency part of the speech, removes the influence of lip radiation and increases the high-frequency resolution of the speech. Above roughly 800 Hz the spectrum falls off at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component; for this reason the high-frequency part of the speech signal is boosted before analysis;
Pre-emphasis is typically implemented with a high-pass digital filter as the transfer function, where a is the pre-emphasis coefficient and 0.9 < a < 1.0. Let the speech sample value at time n be x(n); the result after pre-emphasis is y(n) = x(n) - a*x(n-1), with a = 0.97 usually taken, and the transfer function is
H(z) = 1 - a*z^(-1)
(3) Endpoint detection, also called voice activity detection (Voice Activity Detection, VAD), aims to distinguish speech regions from non-speech regions. Its purpose is to accurately locate the start point and end point of speech within noisy speech, remove the silent parts and the noise parts, and find the genuinely valid content of a piece of speech;
VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as a classifier, and model-based VAD;
Step 4-2: feature extraction converts the sound signal from the time domain to the frequency domain and provides suitable feature vectors for the acoustic model. The main algorithms include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC); the purpose is to turn each frame of the waveform into a multidimensional vector containing the sound information;
Step 4-3: the acoustic model AM computes, according to acoustic characteristics, the score of each feature vector on the acoustic features. It is obtained by training on speech data; its input is the feature vectors and its output is phoneme information;
Step 4-4: the language model LM computes the probability of the phrase sequences that may correspond to the sound signal according to linguistic theory; the probabilities of associations between individual characters or words are obtained by training on a large amount of text;
Step 4-5: dictionary: the correspondence between words (or characters) and phonemes; simply put, for Chinese it is the correspondence between pinyin and Chinese characters, and for English it is the correspondence between phonetic transcription and words;
Step 4-6: decoding: the audio data whose features have been extracted is converted into text output through the acoustic model and the dictionary;
Step five: the speech is compared with the memory content played on the screen, and recitation content that differs greatly is played multiple times. Content spoken relatively quietly (where the user's voice drops) is judged to be memorized poorly and is likewise played multiple times. During playback a combination of sound and silence is chosen, and recording and judging take place both while sound is played and during the silent parts;
Based on a threshold, the result of the speech recognition is compared with the content shown on the screen, the difference is computed and compared with the threshold to judge how much the content differs, and the content is played multiple times when the degree of difference exceeds the threshold; a threshold t, i.e. the acceptable degree of difference, is determined.
According to the language model LM in step four, the probability p of the association between the input sound signal and the word is calculated, with the word content on the screen itself counting as 1. Here the threshold t is set to 0.3 and y = 1 - p; when y is less than or equal to 0.3, the difference is small and the content is considered not to differ; when y is greater than 0.3, the difference is considered large.
The method also comprises the following steps: Step six: a colored background is generated on the curved screen 5, the color matching the memory content as far as possible, to help deepen the memory;
To generate a colored background on the curved screen 5, the color model on which the algorithm is based must first be selected; a color model is a mathematical model used to represent colors;
Step 6-1: different color models suit different application scenarios; the RGB model is suitable for devices such as displays, so the RGB color model is selected to generate the representation of the different colors. RGB, commonly known as the three primary colors red (R), green (G) and blue (B), is the most widely used color model;
Step 6-2: in ordinary development and display work, colors are processed with this model, for example rgb(255, 0, 0) is red, rgb(0, 255, 0) is green and rgb(0, 0, 255) is blue. By varying the three red, green and blue color channels at different intensities and mixing them together, different colors are displayed. This is an additive color-mixing model: during superposition the brightness equals the sum of the brightnesses of the mixed colors, and the more that is mixed, the higher the brightness;
Step 6-3: a 24-bit type is selected, i.e. each color channel R, G, B is represented by 8 bits of data, 2^8 = 256, so each channel can represent a color value in the range 0-255, darkest at 0 and brightest at 255;
Step 6-4: according to the correlation between the memory content and the color, the color of the screen is determined, the color values are looked up to obtain the specific RGB value of each channel, and the screen color is set.
The method also comprises the following steps: Step seven: backgrounds of different shapes are generated, so that adjacent memory contents are conveniently distinguished by their different shapes;
Step 7-1: several different shapes such as rectangles, circles, ellipses and polygons are selected and drawn cyclically in turn, ensuring that adjacent contents never share the same shape;
Step 7-2: the different shapes are drawn according to their parameters. For a rectangle the required parameters are its length and width: the position where the rectangle is to be drawn on the screen is set first, usually the coordinates (x, y) of its upper-left corner; once the length and width are determined, the rectangle is drawn on the screen by setting the pixels starting from (x, y) to specific color values according to the preset color;
Step 7-3: to draw a circle, the position (x, y) of the circle on the screen is determined and taken as the center; according to the radius r, the pixels around one full circle are set to a specific color and the circle is drawn;
Step 7-4: drawing an ellipse requires more parameters: the center (x, y) of the ellipse, the axis lengths (the long radius l and the short radius s) and the deflection angle must be determined, so ellipses at different angles can be drawn;
Step 7-5: to draw a polygon, the number of sides n must first be determined, and the number of vertices d_n is set according to the number of sides (d_n = n); the coordinates of each vertex (d_1, d_2, d_3, ..., d_n) are given randomly, the vertices are connected in turn, and all the pixels are set to a specific color, so polygons can be drawn on the screen.
The Mel cepstral coefficient extraction process comprises preprocessing, fast Fourier transform, Mel filter bank, logarithm operation, discrete cosine transform, dynamic feature extraction and other steps;
The fast Fourier transform (FFT) is the general name for the efficient, fast computational methods for computing the discrete Fourier transform (DFT) on a computer;
Two propositions of Fourier:
a periodic signal can be represented as a weighted sum of sinusoidal signals in harmonic relation;
a non-periodic signal can be represented as a weighted integral of sinusoidal signals;
Four forms: FS (continuous, periodic signal), FT (continuous, non-periodic signal), DFS (discrete, periodic signal), DTFT (discrete, non-periodic signal). The steps are as follows:
Step 1: the signal x is decomposed into two sub-signals, the even sample-point signal x[2n] and the odd sample-point signal x[2n+1];
Step 2: the two summation terms are understood as two DFTs of length N/2;
Step 3: the specific calculation process of the FFT;
for any k, N multiplications and N-1 additions are performed, so the direct DFT requires N^2 multiplications in total.
The specific flow of the acoustic model AM comprises the following steps:
(1) The GMM speech recognition model recognizes speech and outputs text information. Each GMM (one for each of 0-9 and "o") is trained with the speech data corresponding to it; at test time, the whole utterance can only be framed, windowed and feature-extracted, after which the likelihood of each frame is computed on every GMM and summed at the end to obtain the final likelihood;
(2) The K-Means algorithm initializes the parameters of the GMM model. For a given sample set, the samples are divided into K clusters according to the distances between them, so that the points within a cluster are connected as tightly as possible while the distance between clusters is as large as possible;
Input: sample set D = {x1, x2, ..., xm}, number of clusters k = 5, maximum number of iterations N;
Output: cluster partition C = {C1, C2, ..., Ck};
1) Randomly select k samples from the data set D as the initial k centroid vectors {μ1, μ2, ..., μk};
2) For n = 1, 2, ..., N:
a) initialize the cluster partition Ct = ∅ for t = 1, 2, ..., k;
b) for i = 1, 2, ..., m, compute the distance d_ij between sample x_i and each centroid vector μ_j (j = 1, 2, ..., k); mark the j with the smallest d_ij as the class λi of x_i, and update C_λi = C_λi ∪ {x_i};
c) for j = 1, 2, ..., k, recompute the new centroid of all the sample points in Cj;
d) if none of the k centroid vectors has changed, go to step 3);
3) Output the cluster partition C = {C1, C2, ..., Ck};
(3) The EM algorithm trains the GMM model. Given a GMM, the optimization goal is to find the mean vector, covariance matrix and mixing coefficient of each Gaussian component that maximize the likelihood function;
Initialize the parameters; E-step: compute the posterior probabilities with the current parameters; M-step: re-estimate the parameters using the posteriors; then recompute the likelihood function and repeat the E and M steps until the convergence condition is satisfied.
Threshold-based VAD: by extracting time-domain features (short-time energy, short-time zero-crossing rate, etc.) or frequency-domain features (MFCC, spectral entropy, etc.) and setting a reasonable threshold, speech is distinguished from non-speech.
When the words are recited they enter from one side of the curved screen 5 and move to the other side of the curved screen 5, and the word positions are identified by the CNN convolutional neural network algorithm. When a word rotates to the rear of the curved screen 5, the screen no longer displays it; the rear loudspeakers 2 and the ear-side loudspeakers 4 then emit sounds of different decibel levels according to the virtual position to which the word has moved, so that the loudness of the sounds simulates the word position and the spatial-position part of the brain is recruited to assist the memory; rotating forward and backward for several revolutions deepens the memory.
The number of rear loudspeakers 2 in a group is at least 2, and they are symmetrically distributed about the symmetry axis of the seat 1.
The number of ear-side loudspeakers 4 in a group is at least 2, and the two groups of ear-side loudspeakers 4 are symmetrically distributed about the symmetry axis of the seat 1.
Embodiment one:
The things to be recited or memorized, such as words, are displayed at varying positions on the curved screen 5, and the steering engine 8 is controlled to rotate so that the curved screen 5 swings and the positions of the things to be recited or memorized change continuously;
word positions are identified by a convolutional neural network (CNN) algorithm;
the rear loudspeakers 2 and the ear-side loudspeakers 4 emit sounds of different decibel levels, so that the position heard by the left and right ears coincides with the position seen by the eyes; combining viewing with position deepens the memory, and combining viewing at changing positions with position keeps the attention concentrated on the same spot.
Embodiment two: this embodiment is further described on the basis of the first embodiment. When the words are recited they enter from one side of the curved screen 5 and move to the other side of the curved screen 5, and the word positions are identified by the CNN convolutional neural network algorithm. When a word rotates to the rear of the curved screen 5, the screen no longer displays it; the rear loudspeakers 2 and the ear-side loudspeakers 4 then emit sounds of different decibel levels according to the virtual position to which the word has moved, so that the loudness of the sounds simulates the word position and the spatial-position part of the brain is recruited to assist the memory; rotating forward and backward for several revolutions deepens the memory.
Embodiment three: this embodiment is further described on the basis of the first embodiment; a recording device is arranged near the user's mouth, and the user's speech is recognized by a speech recognition algorithm.
The speech is compared with the memory content played on the screen; recitation content that differs greatly is played multiple times, and content spoken relatively quietly (where the user's voice drops) is judged to be memorized poorly and is likewise played multiple times. During playback a combination of sound and silence is chosen, and recording and judging take place both while sound is played and during the silent parts.
Embodiment four: this embodiment is further described on the basis of the first, second or third embodiment, wherein a background with a color is generated in the curved screen 5, and the color accords with the memory content as much as possible, thereby assisting in deepening the memory.
Embodiment five: this embodiment is further described on the basis of the first, second or third embodiment; backgrounds of different shapes are generated, and adjacent memory contents are conveniently distinguished by the different shapes.
With this device, the content to be memorized or recited is fed to the curved screen 5 and displayed at changing positions, and the word position is identified by the CNN convolutional neural network algorithm. When a word rotates to the rear of the curved screen 5, the screen no longer displays it; the rear loudspeakers 2 and the ear-side loudspeakers 4 emit sounds of different decibel levels according to the virtual position to which the word has moved, the loudness of the sounds simulates the word position, the spatial-position part of the brain is recruited to assist the memory, and rotating forward and backward for several revolutions deepens the memory.
The device arranges a recording device near the user's mouth and recognizes the user's speech with a speech recognition algorithm. The speech is compared with the memory content played on the screen; recitation content that differs greatly is played multiple times, and content spoken relatively quietly (where the user's voice drops) is judged to be memorized poorly and is likewise played multiple times. During playback a combination of sound and silence is chosen, and recording and judging take place both while sound is played and during the silent parts.
The device is ingeniously designed: the intelligent equipment stimulates the user's brain through vision and sound, generates backgrounds with color and shape to deepen the visual stimulation, and deepens the auditory stimulation by comparing the user's speech with the original content, thereby helping the user to memorize and recite.
The above disclosure is merely illustrative of specific embodiments of the present invention, but the present invention is not limited thereto, and any variations that can be considered by those skilled in the art should fall within the scope of the present invention.

Claims (10)

1. A memory recitation assistance system comprising a seat (1), characterized in that:
the seat (1) is fixedly connected with a group of rear loudspeakers (2), the seat (1) is fixedly connected with symmetrical square plates (3), and the square plates (3) are fixedly connected with a group of ear-side loudspeakers (4);
the seat (1) is fixedly connected with an L-shaped plate (9), the L-shaped plate (9) is fixedly connected with a steering engine (8), an output shaft of the steering engine (8) is fixedly connected with a circular plate (7), the circular plate (7) is fixedly connected with a group of circular rods (6), and the circular rods (6) are fixedly connected with a curved screen (5);
the method also comprises the following steps:
Step one: the things to be recited or memorized are displayed at varying positions on the curved screen (5), and the steering engine (8) is controlled to rotate so that the curved screen (5) swings and the positions of the things to be recited or memorized change continuously;
Step two: word positions are identified by a convolutional neural network (CNN) algorithm;
To recognize the word positions with the CNN, the one-stage target detection algorithm YOLO is selected: it needs no region-proposal stage, directly generates the class probability and position coordinates of an object, and obtains the final detection result in a single pass, so it offers high detection speed and efficiency. The YOLOv4 algorithm keeps the original YOLO detection architecture while adopting the best optimization strategies from the CNN field in recent years, with improvements of varying degree in data processing, the backbone network, network training, the activation function, the loss function and so on, giving better accuracy and efficiency. The network structure of YOLOv4 can be divided into the Input, Backbone, Neck and Head modules;
Step 2-1: Input performs data augmentation on the input data, and Mosaic data augmentation is adopted. Mosaic evolved from CutMix: CutMix combines two pictures, while Mosaic splices four pictures that are randomly scaled, randomly cropped and randomly arranged, so a very rich data set is obtained at once; the different words appearing at different positions of the screen are used as the input data;
Step 2-2: the Backbone is upgraded in YOLOv4 to CSPDarknet53. CSPNet stands for Cross Stage Partial Networks, i.e. a cross-stage partial network. CSPNet addresses the problem of duplicated gradient information during network optimization in the backbones of other large convolutional neural network frameworks, integrating the gradient changes into the feature map from beginning to end; this reduces the parameter count and FLOPS of the model, guarantees inference speed and accuracy, and reduces the model size. CSPNet is in fact based on the idea of DenseNet: it copies the feature map of the base layer and sends one copy through the dense block to the next stage, thereby separating out the feature map of the base layer. This effectively alleviates the vanishing-gradient problem (the problem that the lost signal is hard to back-propagate through a very deep network), supports feature propagation, encourages the network to reuse features, and reduces the number of network parameters;
Step 2-3: Mish is chosen here as the activation function; it is very similar to ReLU and Swish, with the formula:
y = x * tanh(ln(1 + e^x))
Step 2-4: the Neck of YOLOv4 uses PANet (Path Aggregation Network) instead of FPN for parameter aggregation, so as to suit target detection at different scales. The fusion method used in the original PANet is addition; the YOLOv4 algorithm changes the fusion method from addition to concatenation, another way of fusing feature maps;
Step 2-5: the final YOLO prediction layers of YOLOv3 are reused in YOLOv4; after passing through the Neck described above, three YOLO layers are obtained:
(1) first YOLO layer: feature map 76x76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;
(2) second YOLO layer: feature map 38x38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;
(3) third YOLO layer: feature map 19x19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;
Step 2-6: YOLOv4 also innovates on the bounding-box regression loss, adopting CIoU loss for regression prediction so that the prediction boxes are obtained faster and more accurately;
Step 2-7: train the model. The training data set under darknet is COCO. First the data set is prepared; at the start of training a yolov4 folder is created, yolov4.cfg, coco.data and coco.names are added, a backup folder is created under the yolov4 folder for storing intermediate weights, and the training instruction ./darknet detector train is executed;
Step 2-8: test the trained model to obtain the detection effect and output the position of the word on the screen, i.e. its (x, y) coordinates; the training is iterated several times according to the error to improve the accuracy of the model;
Step three: the rear loudspeakers (2) and the ear-side loudspeakers (4) emit sounds of different decibel levels, so that the position heard by the left and right ears coincides with the position seen by the eyes; combining viewing with position deepens the memory, and combining viewing at changing positions with position keeps the attention concentrated on the same spot.
2. The memory recitation assistance system of claim 1, further comprising the steps of:
Step 4: A recording device is arranged near the mouth of the user, and the user's voice is recognised through a voice recognition algorithm;
The voice recognition algorithm converts a section of voice signal into the corresponding text information; the main flow of the system consists of four parts: feature extraction -> acoustic model -> language model -> dictionary and decoding;
Step 4.1: Preprocessing — in order to extract features more effectively, preprocessing work such as filtering and framing often needs to be carried out on the collected sound signal, so that the audio signal to be analysed is properly extracted from the original signal;
The silence at the head and the tail is cut off to reduce interference with the subsequent steps; this silence-removal operation is generally called VAD. The sound is then divided into frames, i.e. cut into small sections, each of which is called a frame; this is realised with a moving window function, and the frames are not cut apart rigidly but generally overlap (see the framing sketch below);
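A minimal sketch of overlapping framing with a Hamming window; the 25 ms frame length, 10 ms hop and 16 kHz sampling rate are common but assumed values:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping, windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160-sample hop -> 60% overlap
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop_len:i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames
```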
(1) The CODEC/acquisition stage handles frequency aliasing: a low-pass filter is used to filter out frequency components higher than 1/2 of the sampling frequency before the analog signal is discretised and collected; in the design of an actual instrument, the cut-off frequency (fc) of the low-pass filter is:
cut-off frequency (fc) = sampling frequency (fs) / 2.56
(2) Pre-emphasis: in order to emphasise the high-frequency part of the voice, remove the influence of lip radiation and increase the high-frequency resolution of the voice;
Pre-emphasis is typically implemented with a first-order high-pass digital filter as the transfer function, where a is the pre-emphasis coefficient, 0.9 < a < 1.0. Let the speech sample value at time n be x(n); the result after pre-emphasis is y(n) = x(n) - a·x(n-1), where a = 0.97 is taken. The transfer function is:
H(z) = 1 - a·z^(-1)
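The filter above can be sketched in a few lines; this is an illustrative implementation of y(n) = x(n) - a·x(n-1) with a = 0.97, not code from the patent:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """Apply the first-order high-pass filter H(z) = 1 - a*z^-1 to a 1-D signal."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                 # no previous sample exists at n = 0
    y[1:] = x[1:] - a * x[:-1]  # y(n) = x(n) - a * x(n-1)
    return y
```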
(3) Endpoint detection, also called voice activity detection (Voice Activity Detection, VAD), aims to distinguish voice from non-voice regions: to accurately locate the starting point and ending point of speech within noisy audio, remove the silent parts and the noise parts, and find the truly effective content of a section of speech;
VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as a classifier, and model-based VAD;
Step 4.2: Feature extraction — the sound signal is converted from the time domain to the frequency domain to provide the acoustic model with suitable feature vectors; the main algorithms include Linear Prediction Cepstrum Coefficients (LPCC) and Mel-Frequency Cepstrum Coefficients (MFCC), and the purpose is to turn each frame of the waveform into a multidimensional vector containing the sound information;
Step 4.3: The acoustic model AM calculates the score of each feature vector on the acoustic features according to acoustic characteristics; it is obtained by training on voice data, its input is the feature vector, and its output is phoneme information;
Step 4.4: The language model LM calculates the probability of the possible phrase sequences corresponding to the sound signal according to linguistic theory; the probabilities of association between individual characters or words are obtained by training on a large amount of text information;
Step 4.5: Dictionary — the correspondence between words (or characters) and phonemes; in short, for Chinese it is the correspondence between pinyin and Chinese characters, and for English it is the correspondence between phonetic transcription and words;
Step 4.6: Decoding — the audio data whose features have been extracted is turned into text output through the acoustic model and the dictionary;
Step 5: The recognised voice is compared with the memory content played on the screen; recitation content with a large difference is played multiple times, content spoken with relatively low volume is judged as not deeply memorised and is also played multiple times, and when playing, a combination of sound and silence is selected, recording and judging when sound is emitted and when there is silence;
Based on a threshold, the result of the voice recognition is compared with the content shown on the screen and the difference is measured against the threshold to judge the degree of discrepancy; when the degree of discrepancy exceeds the threshold, the content is played multiple times. A threshold t, i.e. the acceptable degree of difference, is determined: according to the language model LM in Step 4, the association probability p between the input sound signal and the word is calculated, the content of the word on the screen itself counting as 1. Here the threshold t is set to 0.3 and y = 1 - p; when y ≤ 0.3 the difference is considered negligible, and when y > 0.3 the difference is considered large (a small sketch of this decision follows).
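A minimal sketch of the y = 1 - p decision against the threshold t = 0.3; p is assumed to come from the language-model comparison in Step 4, and the replay count of 3 is an illustrative assumption:

```python
def judge_recitation(p, threshold=0.3, replay_times=3):
    """Return how many extra times the content should be played.

    p: association probability between the recognised speech and the on-screen
       word (the on-screen content itself counts as 1.0).
    """
    y = 1.0 - p
    if y <= threshold:
        return 0            # difference negligible: no replay needed
    return replay_times     # difference large: replay to deepen the memory
```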
3. The memory recitation assistance system of claim 2, further comprising the steps of: Step 6: generating a coloured background on the curved screen (5), the colour matching the memory content as far as possible to assist in deepening the memory;
Generating a coloured background on the curved screen (5) first requires selecting a colour model on which the algorithm is based; a colour model is a mathematical model used to represent colours.
Step 6.1: Different colour models suit different application scenarios; the RGB model is suitable for devices such as displays, so the RGB colour model is selected to generate the representation of different colours;
Step 6.2: In general display development, almost all colours are processed with this model; it displays different colours by varying the intensities of the red, green and blue colour channels and mixing them by superposition. In this additive mixing the brightness equals the combined brightness of the mixed colours: the more that is mixed in, the higher the brightness;
Step 6.3: A 24-bit type is selected, i.e. each colour channel R, G, B is represented by 8 bits of data, 2^8 = 256, so each channel can represent a colour value on a (0-255) scale, darkest at 0 and brightest at 255;
Step 6.4: According to the correlation between the memory content and colour, the colour scheme of the screen is determined, the colour values are looked up to obtain the specific RGB value of each channel, and the colour of the screen is set (a small sketch follows).
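A minimal sketch of filling the screen background with a looked-up 24-bit RGB value, using Pillow; the content-theme-to-colour table and the screen resolution are purely hypothetical examples, not values from the patent:

```python
from PIL import Image

# hypothetical lookup from memory-content theme to 24-bit RGB values
THEME_RGB = {
    "ocean": (30, 144, 255),
    "forest": (34, 139, 34),
    "desert": (237, 201, 175),
}

def make_background(theme, width=1920, height=1080):
    rgb = THEME_RGB.get(theme, (255, 255, 255))    # default to white if theme is unknown
    return Image.new("RGB", (width, height), rgb)  # solid-colour background image
```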
4. The memory recitation assistance system of claim 2, further comprising the steps of: Step 7: generating backgrounds of different shapes, with different shapes conveniently distinguishing adjacent pieces of memory content;
Step 7.1: A number of different shapes such as rectangles, circles, ellipses and polygons are selected and drawn cyclically in turn, ensuring that adjacent contents never share the same shape (a drawing sketch is given after Step 7.5);
Step 7.2: Drawing the different shapes — a rectangle is drawn according to its shape parameters, which are its length and width. First the position where the rectangle is to be drawn on the screen is set, generally the coordinates (x, y) of its upper-left corner; after the length and width are determined, the rectangle is drawn on the screen by setting the pixels starting from (x, y) to a specific colour value according to the preset colour;
Step 7.3: Drawing a circle — the position (x, y) of the circle on the screen is determined and taken as the centre; according to the radius r, the pixels within the circle are set to a specific colour, and the circle is drawn;
Step 7.4: Drawing an ellipse requires more parameters: the centre (x, y) of the ellipse, the axis lengths (long radius l and short radius s) and the deflection angle must be determined, so that ellipses at different angles can be drawn;
Step 7.5: Drawing a polygon — a polygon first requires the number of sides n to be determined; the number of vertices is set equal to n, the coordinates of the vertices (d1, d2, d3, ..., dn) are given randomly, the vertices are connected in turn, and all the enclosed pixels are set to a specific colour, so that a polygon can be drawn on the screen.
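A minimal sketch of cycling through the shapes with Pillow's ImageDraw; the sizes, colour, vertex ranges and the omission of the ellipse's deflection angle are illustrative simplifications, not the patent's drawing routine:

```python
import random
from PIL import Image, ImageDraw

SHAPES = ["rectangle", "circle", "ellipse", "polygon"]

def draw_shape(draw, shape, x, y, color):
    if shape == "rectangle":
        draw.rectangle([x, y, x + 200, y + 120], fill=color)      # (x, y) is the top-left corner
    elif shape == "circle":
        r = 60
        draw.ellipse([x - r, y - r, x + r, y + r], fill=color)    # circle: equal radii
    elif shape == "ellipse":
        l, s = 100, 50
        draw.ellipse([x - l, y - s, x + l, y + s], fill=color)    # axis-aligned; angle omitted
    else:
        n = random.randint(3, 8)                                  # polygon with n random vertices
        pts = [(x + random.randint(-80, 80), y + random.randint(-80, 80)) for _ in range(n)]
        draw.polygon(pts, fill=color)

img = Image.new("RGB", (1920, 1080), (255, 255, 255))
draw = ImageDraw.Draw(img)
for i, cx in enumerate(range(200, 1800, 400)):   # adjacent contents get different shapes
    draw_shape(draw, SHAPES[i % len(SHAPES)], cx, 540, (30, 144, 255))
```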
5. The memory recitation assistance system of claim 2, wherein: the Mel cepstrum coefficient (MFCC) extraction process comprises the steps of preprocessing, fast Fourier transform, Mel filter bank, logarithm, discrete cosine transform and dynamic feature extraction (a small end-to-end sketch is given at the end of this claim);
The fast Fourier transform (FFT) is the general term for a class of efficient, fast algorithms for computing the Discrete Fourier Transform (DFT) on a computer;
Two assertions of Fourier analysis:
a periodic signal can be represented as a weighted sum of harmonically related sinusoidal signals;
an aperiodic signal can be represented as a weighted integral of sinusoidal signals;
This gives the four classical representations: FS (Fourier Series), FT (Fourier Transform), DFS (Discrete Fourier Series) and DTFT (Discrete-Time Fourier Transform). The FFT derivation proceeds in the following steps:
Step 1: decompose the signal x into two sub-signals — the even-sample signal x[2n] and the odd-sample signal x[2n+1];
Step 2: the two resulting summation terms are understood as two DFTs of length N/2;
Step 3: the specific FFT calculation follows by applying this even/odd decomposition recursively;
by comparison, the direct DFT performs N multiplications and N-1 additions for each k, so it requires N^2 multiplications and N(N-1) additions in total.
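A minimal end-to-end sketch of the MFCC steps named in this claim (FFT power spectrum, Mel filter bank, logarithm, DCT); the FFT size, 26 filters and 13 kept coefficients are common but assumed values, and the delta (dynamic) features are omitted:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(frames, sample_rate=16000, n_fft=512, n_filters=26, n_ceps=13):
    """frames: (n_frames, frame_len) pre-emphasised, windowed frames."""
    spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft          # power spectrum per frame
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    energies = np.maximum(spectrum @ fbank.T, 1e-10)                     # Mel filter-bank energies
    log_energies = np.log(energies)                                      # logarithm step
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]   # DCT, keep first coefficients
```

Applied to the overlapping frames produced in the sketch of Step 4.1, this yields one 13-dimensional feature vector per frame.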
6. The memory recitation assistance system of claim 2, wherein: the specific flow of the acoustic model AM comprises the following steps:
(1) The GMM voice recognition model recognises voice and outputs text information. Each GMM (one for each of the words 0-9 and 'o') is trained with the voice data corresponding to it; at test time, the whole utterance is framed, windowed and feature-extracted, the likelihood of each frame is calculated on each GMM, and the final likelihood is obtained by summing over frames (a small sketch combining steps (1)-(3) is given at the end of this claim);
(2) The K-Means algorithm initialises the parameters of the GMM model: for a given sample set, the samples are divided into k clusters according to the distances between them, so that the points within a cluster are connected as tightly as possible while the distance between clusters is as large as possible;
Input: sample set D = {x1, x2, ..., xm}, number of clusters k = 5, maximum number of iterations N;
Output: cluster partition C = {C1, C2, ..., Ck};
1) Randomly select k samples from the data set D as the initial k centroid vectors: {μ1, μ2, ..., μk};
2) For n = 1, 2, ..., N:
a) initialise the cluster partition as Ct = ∅, t = 1, 2, ..., k;
b) for i = 1, 2, ..., m, calculate the distance dij = ||xi - μj||^2 between sample xi and each centroid vector μj (j = 1, 2, ..., k); mark the category λi of xi as the j with the smallest dij, and update Cλi = Cλi ∪ {xi};
c) for j = 1, 2, ..., k, recalculate the new centroid of all the sample points in Cj: μj = (1/|Cj|) Σ x, the sum taken over x ∈ Cj;
d) if none of the k centroid vectors has changed, go to step 3);
3) Output the cluster partition C = {C1, C2, ..., Ck} (a small sketch follows);
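A minimal sketch of the K-Means initialisation just described; k = 5 and the convergence test mirror the listed steps, while the empty-cluster guard and fixed seed are added illustrative details:

```python
import numpy as np

def kmeans(X, k=5, max_iter=100, seed=0):
    """Plain K-Means, used here only to initialise the GMM parameters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # 1) random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):                                   # 2) iterate at most N times
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                        # b) assign each sample to its nearest centroid
        new_centroids = np.array([X[new_labels == j].mean(axis=0) if np.any(new_labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # d) stop once centroids are stable
            labels = new_labels
            break
        centroids, labels = new_centroids, new_labels
    return labels, centroids                                    # 3) output the cluster partition
```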
(3) The EM algorithm trains the GMM model: given a GMM, the optimisation goal is to find the mean vector, covariance matrix and mixing coefficient of each Gaussian component that maximise the likelihood function;
Initialise the parameters; E step: calculate the posterior probabilities with the current parameters; M step: re-estimate the parameters using the posteriors; then recalculate the likelihood function and repeat the two steps until the convergence condition is satisfied (a small sketch follows).
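A minimal sketch of steps (1)-(3) using scikit-learn's GaussianMixture, which performs the K-Means initialisation and the EM iterations internally; the per-word vocabulary, 8 mixture components and diagonal covariances are assumptions for illustration:

```python
from sklearn.mixture import GaussianMixture

def train_word_gmms(features_per_word, n_components=8):
    """features_per_word: dict mapping a word (e.g. '0'-'9', 'o') to its
    (n_frames, n_dims) MFCC array; one GMM is fitted per word."""
    return {word: GaussianMixture(n_components=n_components,
                                  covariance_type='diag',
                                  max_iter=100).fit(feats)
            for word, feats in features_per_word.items()}

def recognise(frames, gmms):
    """Sum the per-frame log-likelihoods on each word GMM and pick the best word."""
    scores = {word: gmm.score_samples(frames).sum() for word, gmm in gmms.items()}
    return max(scores, key=scores.get)
```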
7. The memory recitation assistance system of claim 2, wherein: the threshold-based VAD distinguishes voice from non-voice by extracting time-domain or frequency-domain features and setting a reasonable threshold (an energy-threshold sketch follows).
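A minimal sketch of one possible threshold-based VAD using short-time energy; the -40 dB threshold relative to the loudest frame is an assumed, tunable value rather than a figure from the patent:

```python
import numpy as np

def energy_vad(frames, threshold_db=-40.0):
    """Mark each frame as speech (True) or non-speech (False) by short-time energy.

    frames: (n_frames, frame_len) array of windowed frames.
    """
    energy = np.sum(frames.astype(float) ** 2, axis=1) + 1e-12
    energy_db = 10.0 * np.log10(energy / energy.max())   # energy relative to the loudest frame
    return energy_db > threshold_db
```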
8. The memory recitation aid system of claim 1, wherein: when reciting words, a word enters from one side of the curved screen (5) and moves to the other side of the curved screen (5), the position of the word being identified by a CNN convolutional neural network algorithm; when the word rotates to behind the screen of the curved screen (5), the screen of the curved screen (5) no longer displays the word, and the rear-side loudspeaker (2) and the ear-side loudspeakers (4) emit sounds of different decibels according to the virtual moving position of the word, simulating the position of the word through the relative sound levels so as to mobilise the spatial-position part of the brain to assist memorisation; the word rotates forwards and backwards for many cycles to deepen the memory (a panning sketch follows).
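A minimal constant-power panning sketch for mapping the word's virtual horizontal position to different loudspeaker levels; the position range, the cosine/sine gain law and the decibel conversion are illustrative assumptions, not the patent's exact control law:

```python
import math

def pan_gains(position):
    """position in [-1, 1]: -1 = far left ear, 0 = centre, +1 = far right ear.

    Returns (left_gain, right_gain) with constant total power, plus the
    corresponding relative levels in decibels.
    """
    theta = (position + 1.0) * math.pi / 4.0          # map [-1, 1] -> [0, pi/2]
    left, right = math.cos(theta), math.sin(theta)
    to_db = lambda g: 20.0 * math.log10(max(g, 1e-6))
    return (left, right), (to_db(left), to_db(right))
```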
9. The memory recitation aid system of claim 6, wherein: the number of rear-side loudspeakers (2) is at least 2, and they are symmetrically distributed about the axis of symmetry of the seat (1).
10. The memory recitation aid system of claim 7, wherein: the number of ear-side loudspeakers (4) is at least 2, and the two ear-side loudspeakers (4) are symmetrically distributed about the axis of symmetry of the seat (1).
CN202211106551.7A 2022-09-12 2022-09-12 Memory recitation auxiliary system Active CN115641763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211106551.7A CN115641763B (en) 2022-09-12 2022-09-12 Memory recitation auxiliary system

Publications (2)

Publication Number Publication Date
CN115641763A CN115641763A (en) 2023-01-24
CN115641763B true CN115641763B (en) 2023-12-19

Family

ID=84943245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211106551.7A Active CN115641763B (en) 2022-09-12 2022-09-12 Memory recitation auxiliary system

Country Status (1)

Country Link
CN (1) CN115641763B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100988759B1 (en) * 2010-03-05 2010-10-20 재단법인 광양만권 유아이티연구소 System for learning english depending on a situation based on the recognizing position
RO131754A1 (en) * 2015-09-29 2017-03-30 Danimated Studio S.R.L. Method and device for learning calligraphic writing
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN107592451A (en) * 2017-08-31 2018-01-16 努比亚技术有限公司 A kind of multi-mode auxiliary photo-taking method, apparatus and computer-readable recording medium
CN108665769A (en) * 2018-05-11 2018-10-16 深圳市鹰硕技术有限公司 Network teaching method based on convolutional neural networks and device
WO2019214019A1 (en) * 2018-05-11 2019-11-14 深圳市鹰硕技术有限公司 Online teaching method and apparatus based on convolutional neural network
CN110599822A (en) * 2019-08-28 2019-12-20 湖南优美科技发展有限公司 Voice blackboard-writing display method, system and storage medium
EP3936079A1 (en) * 2020-07-10 2022-01-12 Spine Align, LLC Intraoperative alignment assessment system and method
CN111986667A (en) * 2020-08-17 2020-11-24 重庆大学 Voice robot control method based on particle filter algorithm
CN113076938A (en) * 2021-05-06 2021-07-06 广西师范大学 Deep learning target detection method combined with embedded hardware information
CN113505775A (en) * 2021-07-15 2021-10-15 大连民族大学 Manchu word recognition method based on character positioning
CN114463724A (en) * 2022-04-11 2022-05-10 南京慧筑信息技术研究院有限公司 Lane extraction and recognition method based on machine vision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a handwriting recognition system based on entropy power; 张敬林; 王旭智; 万旺根; 吴永亮; Electronic Design Engineering (电子设计工程), (03), pp. 1-3 *


Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
CN104756182B (en) Auditory attention clue is combined to detect for phone/vowel/syllable boundaries with phoneme posteriority score
CN110136698B (en) Method, apparatus, device and storage medium for determining mouth shape
US7280964B2 (en) Method of recognizing spoken language with recognition of language color
CN109800700A (en) A kind of underwater sound signal target classification identification method based on deep learning
US20140114655A1 (en) Emotion recognition using auditory attention cues extracted from users voice
JP2000501847A (en) Method and apparatus for obtaining complex information from speech signals of adaptive dialogue in education and testing
CN103503060A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
US11410642B2 (en) Method and system using phoneme embedding
CN116109455B (en) Language teaching auxiliary system based on artificial intelligence
CN107293290A (en) The method and apparatus for setting up Speech acoustics model
CN112802456A (en) Voice evaluation scoring method and device, electronic equipment and storage medium
Minematsu et al. Structural representation of the pronunciation and its use for CALL
CN115641763B (en) Memory recitation auxiliary system
CN108109610A (en) A kind of simulation vocal technique and simulation sonification system
CN107610720A (en) Pronounce inclined error detection method, apparatus, storage medium and equipment
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Zhao Analysis of music teaching in basic education integrating scientific computing visualization and computer music technology
CN116364096A (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
CN116913244A (en) Speech synthesis method, equipment and medium
CN112863486B (en) Voice-based spoken language evaluation method and device and electronic equipment
Ahmad et al. The Modeling of the Quranic Alphabets' Correct Pronunciation for Adults and Children Experts
CN117393000B (en) Synthetic voice detection method based on neural network and feature fusion
Küçükbay et al. Audio event detection using adaptive feature extraction scheme
CN109543063B (en) Lyric song matching degree analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant