CN115641763B - Memory recitation auxiliary system
- Publication number: CN115641763B (application CN202211106551.7A)
- Authority: CN (China)
- Prior art keywords: voice, screen, model, memory, color
- Legal status: Active
Abstract
The invention discloses a memory recitation assisting system, which comprises a seat and is characterized in that: the seat is fixedly connected with a group of rear-side horns; the seat is fixedly connected with symmetrical square plates; the square plates are fixedly connected with a group of ear-side horns; the seat is fixedly connected with an L-shaped plate; the L-shaped plate is fixedly connected with a steering engine; an output shaft of the steering engine is fixedly connected with a circular plate; the circular plate is fixedly connected with a group of circular rods; and the circular rods are fixedly connected with a curved screen. The invention relates to the field of memory auxiliary equipment, in particular to a memory recitation auxiliary system, and aims to provide a system which conveniently assists memory and recitation.
Description
Technical Field
The invention relates to the field of memory auxiliary equipment, in particular to a memory recitation auxiliary system.
Background
Recitation is a special method of memorization. It requires people to memorize material in a fixed order, without dividing it into major and minor aspects. Recitation is generally of two types: mechanical recitation and comprehension-based recitation. Mechanical recitation is a method of memorizing by simple repetition of the words, without understanding the meaning of the material; it relies on mechanical memory. In general, young children (e.g., primary school pupils) find such recitation easy, even though they do not understand the meaning of the material, because their mechanical memory is well developed.
At present, there is a lack of devices that assist the user in memorizing by combining text display with voice prompts.
Disclosure of Invention
The invention aims to provide a memory recitation assisting system which is convenient for assisting memory recitation.
The invention adopts the following technical scheme to realize the aim of the invention:
a memory recitation assistance system comprising a seat, characterized in that: the seat 1 is fixedly connected with a group of rear horns 2, the seat 1 is fixedly connected with symmetrical square plates 3, the square plates 3 are fixedly connected with a group of ear horns 4, the seat 1 is fixedly connected with an L-shaped plate 9, the L-shaped plate 9 is fixedly connected with a steering engine 8, an output shaft of the steering engine 8 is fixedly connected with a circular plate 7, the circular plate 7 is fixedly connected with a group of circular rods 6, and the circular rods 6 are fixedly connected with a curved surface screen 5;
the method also comprises the following steps:
step one: things needing to be recited or memorized, such as words, are displayed in a variable position on the curved screen 5, and the steering engine 8 is controlled to rotate, so that the curved screen 5 swings, and the positions of the things needing to be recited or memorized are changed continuously;
step two: identifying word positions according to a CNN-based convolutional neural network algorithm;
Word positions are recognized based on a CNN convolutional neural network algorithm. The One-Stage target detection algorithm YOLO is selected: it directly generates the class probability and position coordinate values of an object without a region-proposal stage, so the final detection result is obtained in a single pass, which gives the YOLOv4 algorithm high detection speed and efficiency. On the basis of the original YOLO target detection architecture, YOLOv4 adopts the best optimization strategies of the CNN field in recent years, with optimizations of various degrees to data processing, the backbone network, network training, the activation function, the loss function and so on, giving better precision and efficiency. The network structure of YOLOv4 can be divided into the Input, Backbone, Neck and Head modules;
Step 2.1: Input performs data enhancement on the input data; Mosaic data enhancement is adopted. Mosaic data enhancement evolved from CutMix data enhancement: CutMix uses two pictures for enhancement, while Mosaic extends this to four pictures, which are spliced together after random scaling, random cropping and random layout, so that a very rich data set can be obtained at once; different words appearing at different positions of the screen are used as the input data;
Step two: the backbox is upgraded once in the YOLOv4, namely CSPDarknet53, CSPNet is named Cross Stage Partial Networks, namely a cross-stage local network, the CSPNet solves the problem of gradient information repetition of network optimization in other large convolutional neural network frame backboxes, the gradient change is integrated into a feature map from head to tail, so that the parameter number and FLOPS value of a model are reduced, the reasoning speed and accuracy are ensured, the model size is reduced, the CSPNet is actually based on the thought of Densonet, the feature map of a base layer is copied, a copy is sent to the next stage through a dense block, so that the feature map of the base layer is separated, the problem of gradient disappearance (the problem that a lost signal is difficult to reversely push through a very deep network) can be effectively alleviated, the feature propagation is supported, the network reuse feature is encouraged, and the number of network parameters is reduced;
step two, three: mish is chosen here as the activation function, which is a very similar activation function to ReLU and Swish, the formula is as follows:
y=x*tanh(ln(1+ex))
the Mish function is a smooth curve, which allows better information to go deep into the neural network, thereby obtaining better accuracy and generalization; not completely truncated at negative values, allowing a relatively small negative gradient inflow;
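For illustration only (not part of the original disclosure), the Mish activation above can be written directly from its formula, for example with NumPy; the function name and test values are assumptions:

```python
import numpy as np

def mish(x):
    # Mish activation: y = x * tanh(ln(1 + e^x)), i.e. x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

# small usage example on a few sample pre-activations
print(mish(np.array([-2.0, 0.0, 2.0])))
```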
Step two, four: the method for fusion is characterized in that PANet (Path Aggregation Network) is used for replacing FPN by Neck, YOLOv4 to carry out parameter aggregation so as to be suitable for target detection of different levels, the method used in fusion is Addition, and the YOLOv4 algorithm changes the fusion method from Addition to connection, so that the method is a feature map fusion mode;
Step 2.5: The final YOLO prediction layers in YOLOv4 still use the YOLOv3 heads; it should be noted that, after passing through the Neck described above, YOLOv4 outputs predictions at three scales:
(1) First yolo layer: feature map 76 x 76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;
(2) Second yolo layer: feature map 38 x 38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;
(3) Third yolo layer: feature map 19 x 19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;
step two, six: the YOLOv4 also makes some innovations on Bounding box Regeression Loss, and regression prediction is performed by adopting CIOU_Loss, so that the speed and the precision of a prediction frame are higher;
step two, seven: training a model, wherein a training data set under dark is COCO, firstly, making a data set, creating a yolov4 folder at the beginning of training, adding yolov4.Cfg, coco.data and coco.names, creating a backup folder under the yolov4 folder for storing intermediate weights, and beginning to execute training instructions-/darknet detector train;
Step two, eight: testing the trained model to obtain a detection effect, outputting the position of a word in a screen, namely (x, y) coordinates, carrying out iterative training for multiple times according to errors, and improving the precision of the model;
Step three: The rear-side loudspeakers (2) and the ear-side loudspeakers (4) emit sounds of different decibels, so that the position heard by the left and right ears coincides with the position the eyes see; combining viewing with position deepens the memory, and viewing at a changing position keeps the attention concentrated on the same place.
As a further limitation of the present technical solution, the method further comprises the following steps:
step four: a recording device is arranged near the mouth of a user, and the voice of the user is recognized through a voice recognition algorithm;
The voice recognition algorithm converts a segment of speech signal into the corresponding text information; the main flow of the system consists of four parts: feature extraction -> acoustic model -> language model -> dictionary and decoding;
Step 4.1: Preprocessing. In order to extract features more effectively, audio preprocessing such as filtering and framing often needs to be carried out on the collected sound signal, so that the audio signal to be analyzed is properly extracted from the original signal;
the silence at the head and tail is cut off to reduce interference with the subsequent steps; this silence-removal operation is generally called VAD. The sound is then divided into frames, i.e. cut into small segments, each of which is called a frame; this is implemented with a moving window function, and the frames are not simply cut apart but generally overlap;
(1) Anti-aliasing filtering in the CODEC: to avoid frequency aliasing, a low-pass filter is used to filter out frequency components above 1/2 of the sampling frequency before the analog signal is discretized and collected; in practical instrument design, the cut-off frequency (fc) of the low-pass filter is:
cut-off frequency (fc) = sampling frequency (fs) / 2.56
(2) Pre-emphasis: to emphasize the high-frequency part of the speech, remove the influence of lip radiation and increase the high-frequency resolution of the speech. Above roughly 800 Hz the high-frequency end attenuates at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component; for this reason the high-frequency part of the speech signal is boosted before analysis;
pre-emphasis is typically implemented as a high-pass digital filter whose transfer function is
H(z) = 1 - a*z^(-1)
where a is the pre-emphasis coefficient, 0.9 < a < 1.0. If the speech sample value at time n is x(n), the result after pre-emphasis is y(n) = x(n) - a*x(n-1), and a = 0.97 is typically taken;
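A brief illustrative sketch (not from the original text) of the pre-emphasis filter y(n) = x(n) - a*x(n-1) with a = 0.97, applied to a frame of samples:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    # y(n) = x(n) - a * x(n-1); the first sample is passed through unchanged
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y

# usage on a dummy 25 ms frame of 16 kHz speech samples
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
emphasized = pre_emphasis(frame)
```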
(3) Endpoint detection, also called Voice Activity Detection (VAD), aims to distinguish speech regions from non-speech regions; its purpose is to accurately locate the starting point and end point of speech within noisy speech, remove the silent parts and the noise parts, and find the truly effective content of a segment of speech;
VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as classifier, model VAD;
step four, two: feature extraction, converting a sound signal from a time domain to a frequency domain, providing an acoustic model with a proper feature vector, wherein a main algorithm comprises Linear Prediction Cepstrum Coefficient (LPCC) and mel cepstrum coefficient (MFCC), and the purpose is to change each frame of waveform into a multidimensional vector containing sound information;
Step 4.3: The acoustic model AM calculates the score of each feature vector on the acoustic features; it is obtained by training on speech data, its input is the feature vector, and its output is phoneme information;
and step four: the language model LM calculates the probability of the possible phrase sequence corresponding to the sound signal according to the theory of linguistic correlation, and the probability of the mutual association of single characters or words is obtained by training a large amount of text information;
step four, five: dictionary: the word or word corresponds to the phoneme, in short, chinese is the correspondence between phonetic transcription and Chinese character, english is the correspondence between phonetic transcription and word;
Step 4.6: Decoding: the audio data whose features have been extracted is converted into text output by means of the acoustic model and the dictionary;
Step five: comparing the voice with the memory content played by the screen, playing the recitation content with larger difference for multiple times, judging that the memory is not deep for playing for multiple times for the content with relatively low sound (the voice of the user is reduced), selecting the combination of sound and silence during playing, and recording and judging when the sound is sounded and the silence is taken together;
Based on a threshold, the speech recognition result is compared with the content entered on the screen, the difference is compared with the threshold, and the degree of difference of the content is judged; content whose difference exceeds the acceptable level is played back multiple times. A threshold t, i.e. the acceptable degree of difference, is determined.
As a further limitation of the present solution, according to the language model LM in step four, the probability p of association between the input sound signal and the word is calculated, the word content of the screen itself counting as 1. Here the threshold t is set to 0.3 and y = 1 - p; when y is less than or equal to 0.3, the difference is small and is regarded as no difference; when y is greater than 0.3, the difference is regarded as large.
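A minimal sketch of this threshold decision; the function name and the example probabilities are assumptions for illustration:

```python
def needs_replay(p, t=0.3):
    """p: language-model probability that the recognized speech matches the on-screen content
    (1.0 means identical). Returns True when the difference y = 1 - p exceeds the threshold t."""
    y = 1.0 - p
    return y > t

# usage: a match probability of 0.6 gives y = 0.4 > 0.3, so the content is replayed
print(needs_replay(0.6))   # True
print(needs_replay(0.85))  # False
```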
As a further limitation of the present technical solution, the method further comprises the following steps: step six: generating a background with a color in the curved screen, wherein the color accords with memory contents as much as possible, and assisting in deepening memory;
To generate a colored background on the curved screen, the color model on which the algorithm is based must first be selected; a color model is a mathematical model used to represent colors;
Step 6.1: Different color models have different application scenarios; the RGB model is suitable for devices such as displays, so the RGB color model is selected to generate the representation of different colors. RGB, commonly known as the three primary colors red (R), green (G) and blue (B), is the most widely used color model;
Step 6.2: In general development and display processing, colors are handled with this model, for example: rgb(255, 0, 0) is red, rgb(0, 255, 0) is green, and rgb(0, 0, 255) is blue. By varying the three red, green and blue color channels and mixing and superimposing them at different intensities, different colors are displayed. This is an additive color-mixing model: during superposition mixing, the brightness equals the sum of the brightnesses of the mixed colors, and the more is mixed, the higher the brightness;
Step 6.3: A 24-bit type is selected, i.e. each color channel R, G, B is represented by 8 bits of data, 2^8 = 256, so each channel can represent a color value in the range 0-255, darkest at 0 and brightest at 255;
Step 6.4: According to the correlation between the memory content and color, the color of the screen background is determined, the color value is looked up to obtain the specific RGB value of each channel, and the screen color is set.
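A small illustrative sketch of steps 6.1-6.4; the mapping from memory content to color and the use of the Pillow library for rendering are assumptions, not part of the original disclosure:

```python
from PIL import Image

# hypothetical lookup table from a memory-content category to a 24-bit RGB value
CONTENT_COLORS = {
    "ocean":  (0, 0, 255),    # blue background for sea-related words
    "forest": (0, 255, 0),    # green background for plant-related words
    "fire":   (255, 0, 0),    # red background for heat-related words
}

def make_background(category, size=(1920, 480)):
    """Create a solid-color background image matching the memory content."""
    rgb = CONTENT_COLORS.get(category, (255, 255, 255))  # default to white
    return Image.new("RGB", size, rgb)

background = make_background("ocean")
```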
As a further limitation of the present technical solution, the method further comprises the following steps: step seven: generating different-shape backgrounds, wherein adjacent memory contents are conveniently distinguished by using different shapes;
seventhly, step seven: selecting a plurality of different shapes such as rectangle, circle, ellipse, polygon and the like, and sequentially and circularly drawing to ensure that adjacent contents cannot be in the same shape;
seventhly, step two: drawing different shapes, namely drawing the rectangle according to the parameters of the shape, wherein the parameters required by the rectangle are length and width, firstly, setting the position of the rectangle to be drawn on a screen, generally setting the left upper corner coordinates (x, y) of the rectangle, drawing the rectangle on the screen after determining the length and width, and setting the pixel points starting from x and y as specific color values according to preset colors to obtain the rectangle;
seventhly, step seven: drawing a circle, namely determining the position (x, y) of the circle on a screen, taking the position as the center of a circle, setting the pixel value of the circle for a circle as a specific color according to the radius r, and drawing the circle;
Seventhly, four steps: the ellipse drawing requires more parameters, and the circle center (x, y) of the ellipse, the length of the shaft, the long radius l, the short radius s and the deflection angle are required to be determined, so that ellipses with different angles can be drawn;
seventhly, the steps are as follows: drawing a polygon, wherein the polygon firstly needs to determine the number n of sides of the polygon, and the number d of top points is set according to the number n of sides n =n, randomly giving the coordinates of each vertex (d 1 ,d 2 ,d 3 ,...,d n ) The vertexes are connected in turn, all pixel points are set to be of specific colors, polygons can be drawn on the screen,
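An illustrative sketch of steps 7.1-7.5 using the Pillow drawing API; the library choice, canvas size and colors are assumptions, since the patent does not name a drawing toolkit:

```python
import math
import random
from PIL import Image, ImageDraw

canvas = Image.new("RGB", (800, 400), (255, 255, 255))
draw = ImageDraw.Draw(canvas)
shapes = ["rectangle", "circle", "ellipse", "polygon"]

def draw_shape(kind, x, y, color=(200, 220, 255)):
    if kind == "rectangle":                       # upper-left corner (x, y), then width/height
        draw.rectangle([x, y, x + 120, y + 60], fill=color)
    elif kind == "circle":                        # center (x, y), radius r
        r = 40
        draw.ellipse([x - r, y - r, x + r, y + r], fill=color)
    elif kind == "ellipse":                       # center (x, y), semi-axes l and s
        l, s = 70, 35
        draw.ellipse([x - l, y - s, x + l, y + s], fill=color)
    else:                                         # polygon: n vertices placed around (x, y)
        n = random.randint(3, 6)
        pts = [(x + 50 * math.cos(2 * math.pi * i / n),
                y + 50 * math.sin(2 * math.pi * i / n)) for i in range(n)]
        draw.polygon(pts, fill=color)

# cycle through the shapes so that adjacent items never share a shape
for i in range(4):
    draw_shape(shapes[i % len(shapes)], 100 + i * 180, 200)
```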
The Mel-Frequency Cepstrum Coefficient (MFCC) extraction process comprises preprocessing, fast Fourier transform, Mel filter bank, logarithm operation, discrete cosine transform, dynamic feature extraction and other steps;
The Fast Fourier Transform (FFT) is the general term for efficient, fast algorithms for computing the Discrete Fourier Transform (DFT) on a computer;
two assertions of Fourier:
a periodic signal can be represented as a weighted sum of sinusoidal signals in harmonic relation;
a non-periodic signal can be represented as a weighted integral of sinusoidal signals;
four forms: FS (continuous, periodic signal), FT (continuous, non-periodic signal), DFS (discrete, periodic signal), DTFT (discrete, non-periodic signal). The FFT steps are as follows:
Step 1: The signal x is decomposed into two sub-signals: the even-sample-point signal x[2n] and the odd-sample-point signal x[2n+1];
Step 2: The two summation terms are understood as two DFTs of length N/2;
Step 3: The FFT computation then proceeds recursively on the sub-signals;
for each k the direct DFT performs N multiplications and N-1 additions, so the direct DFT requires N^2 multiplications in total, which the FFT avoids.
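A compact recursive radix-2 FFT sketch illustrating the even/odd decomposition above; it assumes the length N is a power of two (in practice a library routine such as numpy.fft.fft would be used):

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return x
    even = fft(x[0::2])                     # DFT of the even-sample signal x[2n]
    odd = fft(x[1::2])                      # DFT of the odd-sample signal x[2n+1]
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])

# usage: 8-point transform of a simple ramp
print(fft([complex(v) for v in range(8)]))
```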
The specific flow of the acoustic model AM comprises the following steps:
(1) The GMM speech recognition model recognizes speech and outputs text information; each GMM (for the classes 0-9 and 'o') is trained with the speech data corresponding to it. At test time, the whole utterance is framed, windowed and feature-extracted, the likelihood of each frame is calculated on each GMM, and the final likelihood is obtained by summation;
(2) The K-Means algorithm initializes the parameters of the GMM model: for a given sample set, the samples are divided into K clusters according to the distances between them, so that points within a cluster are connected as tightly as possible while the distance between clusters is as large as possible;
input: sample set d= { x1, x2,.. cluster tree k=5, maximum iteration number N;
and (3) outputting: cluster partition c= { C1, C2,..ck };
1) Randomly selecting k samples from the data set D as the initial k centroid vectors: { μ1, μ2,., μk };
2) For n=1, 2,..
a) Initializing cluster partition C tot=1,2...k
b) For i=1, 2..m, sample X was calculated i And the respective centroid vector mu j Distance of (j=1, 2,., k):
x is to be i The minimum mark is d ij The corresponding class λi is updated at this time with cλi=cλi { xi }
c) For j=1, 2..k, recalculate the new centroid for all sample points in Cj
d) If all k centroid vectors have not changed, go to step 3)
3) Output cluster partition c= { C1, C2,..
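An illustrative NumPy sketch of the K-Means procedure above, used here only to obtain initial centroids for the GMM; the random placeholder data stands in for real MFCC vectors:

```python
import numpy as np

def kmeans(X, k=5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]        # 1) random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):                                   # 2) iterate at most N times
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # b) assign each x_i to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # d) stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels                                    # 3) cluster partition via labels

features = np.random.randn(500, 13)                             # placeholder MFCC-like vectors
centroids, labels = kmeans(features)
```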
(3) The EM algorithm trains the GMM: given a GMM, the optimization goal is to find the mean vector, covariance matrix and mixing coefficient of each Gaussian component that maximize the likelihood function;
initialize the parameters; E-step: calculate the posterior probabilities with the current parameters; M-step: re-estimate the parameters using the posteriors; then recalculate the likelihood function and repeat the two steps above until the convergence condition is satisfied.
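A brief sketch of K-Means-initialized EM training of the GMM; the use of scikit-learn's GaussianMixture, the component count and the placeholder features are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

features = np.random.randn(500, 13)            # placeholder frames x MFCC dimensions

# EM training of a GMM whose parameters are initialized by K-Means (init_params="kmeans")
gmm = GaussianMixture(n_components=5, covariance_type="diag",
                      init_params="kmeans", max_iter=100, random_state=0)
gmm.fit(features)

# per-frame log-likelihoods; their sum is the utterance score used for recognition
frame_loglik = gmm.score_samples(features)
print(frame_loglik.sum())
```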
Threshold-based VAD: by extracting the characteristics of time domain (short-time energy, short-time zero crossing rate and the like) or frequency domain (MFCC, spectral entropy and the like), the aim of distinguishing voice from non-voice is achieved by reasonably setting a threshold.
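A minimal threshold-based VAD sketch using short-time energy; the frame length, hop and threshold factor are illustrative assumptions:

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, factor=2.0):
    """Mark a frame as speech (True) when its short-time energy exceeds
    `factor` times the median frame energy (a crude noise-floor estimate)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    return energy > factor * np.median(energy)

# usage on a dummy 16 kHz signal: a second of faint noise followed by a louder "voiced" second
sig = np.concatenate([0.01 * np.random.randn(16000),
                      0.5 * np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)])
print(int(energy_vad(sig).sum()), "speech frames detected")
```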
As a further limitation of the technical scheme, when the words are being recited, a word enters from one side of the curved screen and moves to the other side; its position is identified by the CNN convolutional neural network algorithm. When the word rotates past the rear of the curved screen, the screen no longer displays it, and the rear-side loudspeakers and the ear-side loudspeakers emit sounds of different decibels according to the virtual moving position of the word; the word position is simulated by the pitch and loudness of the sound, mobilizing the spatial-position part of the brain to assist memory, and the word is rotated forward and backward for many revolutions to deepen the memory.
As a further limitation of the present solution, the number of the rear horns in a group is at least 2, and the horns are symmetrically distributed about the symmetry axis of the seat.
As a further limitation of the present technical solution, the number of the ear horns in one group is at least 2, and the two groups of ear horns are symmetrically distributed about the symmetry axis of the seat.
Compared with the prior art, the invention has the advantages and positive effects that:
1. With this device, the content to be memorized or recited is input to the curved screen and displayed at changing positions; the word position is identified by the CNN convolutional neural network algorithm. When a word rotates past the rear of the curved screen, the screen no longer displays it, and the rear-side loudspeakers and the ear-side loudspeakers emit sounds of different decibels according to the virtual moving position of the word; the word position is simulated by the pitch and loudness of the sound, mobilizing the spatial-position part of the brain to assist memory, and the word is rotated forward and backward for many revolutions to deepen the memory.
2. The device is provided with a recording device near the mouth of the user, and the user's speech is recognized by a speech recognition algorithm. The speech is compared with the memory content played on the screen; recitation content with a large difference is played back multiple times, and content spoken relatively quietly (where the user's voice drops) is judged to be not deeply memorized and is also played back multiple times. During playback, a combination of sound and silence is selected, and the user's recitation is recorded and judged both while the sound is playing and during the silence.
3. The device is cleverly designed: the intelligent equipment stimulates the user's brain through vision and sound, generates backgrounds with color and shape to deepen the visual stimulation, and deepens the auditory stimulation by comparing the user's voice with the original content, thereby assisting the user in memorizing and reciting.
Drawings
Fig. 1 is a schematic perspective view of the present invention.
Fig. 2 is a schematic perspective view of a second embodiment of the present invention.
Fig. 3 is a schematic view of a partial perspective structure of the present invention.
Fig. 4 is a schematic diagram of a network structure of YOLOv4 according to the present invention.
Fig. 5 is a schematic diagram of the mel-frequency coefficient extraction process according to the present invention.
FIG. 6 is a color control schematic of the present invention.
In the figure: 1. the seat, 2, the rear side loudspeaker, 3, square board, 4, ear side loudspeaker, 5, curved surface screen, 6, round rod, 7, plectane, 8, steering wheel, 9, L shaped plate.
Detailed Description
One embodiment of the present invention will be described in detail below with reference to the attached drawings, but it should be understood that the scope of the present invention is not limited by the embodiment.
The invention comprises a seat 1, wherein the seat 1 is fixedly connected with a group of rear horns 2, the seat 1 is fixedly connected with symmetrical square plates 3, the square plates 3 are fixedly connected with a group of ear horns 4, the seat 1 is fixedly connected with an L-shaped plate 9, the L-shaped plate 9 is fixedly connected with a steering engine 8, an output shaft of the steering engine 8 is fixedly connected with a circular plate 7, the circular plate 7 is fixedly connected with a group of circular rods 6, and the circular rods 6 are fixedly connected with a curved surface screen 5;
the method also comprises the following steps:
step one: things needing to be recited or memorized, such as words, are displayed in a variable position on the curved screen 5, and the steering engine 8 is controlled to rotate, so that the curved screen 5 swings, and the positions of the things needing to be recited or memorized are changed continuously;
step two: identifying word positions according to a CNN-based convolutional neural network algorithm;
Word positions are recognized based on a CNN convolutional neural network algorithm. The One-Stage target detection algorithm YOLO is selected: it directly generates the class probability and position coordinate values of an object without a region-proposal stage, so the final detection result is obtained in a single pass, which gives the YOLOv4 algorithm high detection speed and efficiency. On the basis of the original YOLO target detection architecture, YOLOv4 adopts the best optimization strategies of the CNN field in recent years, with optimizations of various degrees to data processing, the backbone network, network training, the activation function, the loss function and so on, giving better precision and efficiency. The network structure of YOLOv4 can be divided into the Input, Backbone, Neck and Head modules;
Step 2.1: Input performs data enhancement on the input data; Mosaic data enhancement is adopted. Mosaic data enhancement evolved from CutMix data enhancement: CutMix uses two pictures for enhancement, while Mosaic extends this to four pictures, which are spliced together after random scaling, random cropping and random layout, so that a very rich data set can be obtained at once; different words appearing at different positions of the screen are used as the input data;
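An illustrative sketch of the Mosaic augmentation described in step 2.1; the image sizes, the source of the four screen captures and the crude resampling are assumptions, and bounding-box handling is omitted for brevity:

```python
import random
import numpy as np

def mosaic(imgs, out_size=608):
    """Stitch four HxWx3 screen captures into one mosaic training image."""
    assert len(imgs) == 4
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = random.randint(out_size // 4, 3 * out_size // 4)   # random split point
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    cells = [(0, 0, cx, cy), (cx, 0, out_size, cy),
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x0, y0, x1, y1) in zip(imgs, cells):
        h, w = y1 - y0, x1 - x0
        # crude "random scale + crop": nearest-neighbour resample of the source image
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y0:y1, x0:x1] = img[ys][:, xs]
    return canvas

four = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
sample = mosaic(four)
```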
step two: the backbox is upgraded once in the YOLOv4, namely CSPDarknet53, CSPNet is named Cross Stage Partial Networks, namely a cross-stage local network, the CSPNet solves the problem of gradient information repetition of network optimization in other large convolutional neural network frame backboxes, the gradient change is integrated into a feature map from head to tail, so that the parameter number and FLOPS value of a model are reduced, the reasoning speed and accuracy are ensured, the model size is reduced, the CSPNet is actually based on the thought of Densonet, the feature map of a base layer is copied, a copy is sent to the next stage through a dense block, so that the feature map of the base layer is separated, the problem of gradient disappearance (the problem that a lost signal is difficult to reversely push through a very deep network) can be effectively alleviated, the feature propagation is supported, the network reuse feature is encouraged, and the number of network parameters is reduced;
Step two, three: mish is chosen here as the activation function, which is a very similar activation function to ReLU and Swish, the formula is as follows:
y=x*tanh(ln(1+ex))
the Mish function is a smooth curve, which allows better information to go deep into the neural network, thereby obtaining better accuracy and generalization; not completely truncated at negative values, allowing a relatively small negative gradient inflow;
step two, four: the method for fusion is characterized in that PANet (Path Aggregation Network) is used for replacing FPN by Neck, YOLOv4 to carry out parameter aggregation so as to be suitable for target detection of different levels, the method used in fusion is Addition, and the YOLOv4 algorithm changes the fusion method from Addition to connection, so that the method is a feature map fusion mode;
Step 2.5: The final YOLO prediction layers in YOLOv4 still use the YOLOv3 heads; it should be noted that, after passing through the Neck described above, YOLOv4 outputs predictions at three scales:
(1) First yolo layer: feature map 76 x 76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;
(2) Second yolo layer: feature map 38 x 38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;
(3) Third yolo layer: feature map 19 x 19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;
step two, six: the YOLOv4 also makes some innovations on Bounding box Regeression Loss, and regression prediction is performed by adopting CIOU_Loss, so that the speed and the precision of a prediction frame are higher;
Step two, seven: training a model, wherein a training data set under dark is COCO, firstly, making a data set, creating a yolov4 folder at the beginning of training, adding yolov4.Cfg, coco.data and coco.names, creating a backup folder under the yolov4 folder for storing intermediate weights, and beginning to execute training instructions-/darknet detector train;
step two, eight: testing the trained model to obtain a detection effect, outputting the position of a word in a screen, namely (x, y) coordinates, carrying out iterative training for multiple times according to errors, and improving the precision of the model;
Step three: The rear-side loudspeakers (2) and the ear-side loudspeakers (4) emit sounds of different decibels, so that the position heard by the left and right ears coincides with the position the eyes see; combining viewing with position deepens the memory, and viewing at a changing position keeps the attention concentrated on the same place.
The method also comprises the following steps:
step four: a recording device is arranged near the mouth of a user, and the voice of the user is recognized through a voice recognition algorithm;
The voice recognition algorithm converts a segment of speech signal into the corresponding text information; the main flow of the system consists of four parts: feature extraction -> acoustic model -> language model -> dictionary and decoding;
Step 4.1: Preprocessing. In order to extract features more effectively, audio preprocessing such as filtering and framing often needs to be carried out on the collected sound signal, so that the audio signal to be analyzed is properly extracted from the original signal;
the silence at the head and tail is cut off to reduce interference with the subsequent steps; this silence-removal operation is generally called VAD. The sound is then divided into frames, i.e. cut into small segments, each of which is called a frame; this is implemented with a moving window function, and the frames are not simply cut apart but generally overlap;
(1) Anti-aliasing filtering in the CODEC: to avoid frequency aliasing, a low-pass filter is used to filter out frequency components above 1/2 of the sampling frequency before the analog signal is discretized and collected; in practical instrument design, the cut-off frequency (fc) of the low-pass filter is:
cut-off frequency (fc) = sampling frequency (fs) / 2.56
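An illustrative sketch of the anti-aliasing step using a SciPy Butterworth low-pass filter; the filter order and sampling rate are assumptions, while fc follows the fs/2.56 rule above:

```python
import numpy as np
from scipy.signal import butter, lfilter

def anti_alias(x, fs):
    """Low-pass filter a signal before decimation, cutting off at fs / 2.56."""
    fc = fs / 2.56
    b, a = butter(4, fc, btype="low", fs=fs)   # 4th-order Butterworth low-pass
    return lfilter(b, a, x)

# usage: filter one second of a noisy 48 kHz capture before resampling it to 16 kHz
raw = np.random.randn(48000)
clean = anti_alias(raw, fs=48000)
```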
(2) Pre-emphasis: to emphasize the high-frequency part of the speech, remove the influence of lip radiation and increase the high-frequency resolution of the speech. Above roughly 800 Hz the high-frequency end attenuates at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component; for this reason the high-frequency part of the speech signal is boosted before analysis;
pre-emphasis is typically implemented as a high-pass digital filter whose transfer function is
H(z) = 1 - a*z^(-1)
where a is the pre-emphasis coefficient, 0.9 < a < 1.0. If the speech sample value at time n is x(n), the result after pre-emphasis is y(n) = x(n) - a*x(n-1), and a = 0.97 is typically taken;
(3) Endpoint detection, also called Voice Activity Detection (VAD), aims to distinguish speech regions from non-speech regions; its purpose is to accurately locate the starting point and end point of speech within noisy speech, remove the silent parts and the noise parts, and find the truly effective content of a segment of speech;
VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as classifier, model VAD;
step four, two: feature extraction, converting a sound signal from a time domain to a frequency domain, providing an acoustic model with a proper feature vector, wherein a main algorithm comprises Linear Prediction Cepstrum Coefficient (LPCC) and mel cepstrum coefficient (MFCC), and the purpose is to change each frame of waveform into a multidimensional vector containing sound information;
Step 4.3: The acoustic model AM calculates the score of each feature vector on the acoustic features; it is obtained by training on speech data, its input is the feature vector, and its output is phoneme information;
And step four: the language model LM calculates the probability of the possible phrase sequence corresponding to the sound signal according to the theory of linguistic correlation, and the probability of the mutual association of single characters or words is obtained by training a large amount of text information;
step four, five: dictionary: the word or word corresponds to the phoneme, in short, chinese is the correspondence between phonetic transcription and Chinese character, english is the correspondence between phonetic transcription and word;
Step 4.6: Decoding: the audio data whose features have been extracted is converted into text output by means of the acoustic model and the dictionary;
step five: comparing the voice with the memory content played by the screen, playing the recitation content with larger difference for multiple times, judging that the memory is not deep for playing for multiple times for the content with relatively low sound (the voice of the user is reduced), selecting the combination of sound and silence during playing, and recording and judging when the sound is sounded and the silence is taken together;
Based on a threshold, the speech recognition result is compared with the content entered on the screen, the difference is compared with the threshold, and the degree of difference of the content is judged; content whose difference exceeds the acceptable level is played back multiple times. A threshold t, i.e. the acceptable degree of difference, is determined.
According to the language model LM in step four, the probability p of association between the input sound signal and the word is calculated, the word content of the screen itself counting as 1. Here the threshold t is set to 0.3 and y = 1 - p; when y is less than or equal to 0.3, the difference is small and is regarded as no difference; when y is greater than 0.3, the difference is regarded as large.
The method also comprises the following steps: step six: generating a background with a color in the curved screen 5, wherein the color accords with memory contents as much as possible and assists in deepening memory;
To generate a colored background on the curved screen 5, the color model on which the algorithm is based must first be selected; a color model is a mathematical model used to represent colors;
Step 6.1: Different color models have different application scenarios; the RGB model is suitable for devices such as displays, so the RGB color model is selected to generate the representation of different colors. RGB, commonly known as the three primary colors red (R), green (G) and blue (B), is the most widely used color model;
Step 6.2: In general development and display processing, colors are handled with this model, for example: rgb(255, 0, 0) is red, rgb(0, 255, 0) is green, and rgb(0, 0, 255) is blue. By varying the three red, green and blue color channels and mixing and superimposing them at different intensities, different colors are displayed. This is an additive color-mixing model: during superposition mixing, the brightness equals the sum of the brightnesses of the mixed colors, and the more is mixed, the higher the brightness;
Step 6.3: A 24-bit type is selected, i.e. each color channel R, G, B is represented by 8 bits of data, 2^8 = 256, so each channel can represent a color value in the range 0-255, darkest at 0 and brightest at 255;
Step 6.4: According to the correlation between the memory content and color, the color of the screen background is determined, the color value is looked up to obtain the specific RGB value of each channel, and the screen color is set.
The method also comprises the following steps: step seven: generating different-shape backgrounds, wherein adjacent memory contents are conveniently distinguished by using different shapes;
seventhly, step seven: several different shapes such as rectangle, circle, ellipse, polygon, etc. are selected. Sequentially and circularly drawing to ensure that adjacent contents cannot be in the same shape;
seventhly, step two: drawing different shapes, namely drawing the rectangle according to the parameters of the shape, wherein the parameters required by the rectangle are length and width, firstly, setting the position of the rectangle to be drawn on a screen, generally setting the left upper corner coordinates (x, y) of the rectangle, drawing the rectangle on the screen after determining the length and width, and setting the pixel points starting from x and y as specific color values according to preset colors to obtain the rectangle;
seventhly, step seven: drawing a circle, namely determining the position (x, y) of the circle on a screen, taking the position as the center of a circle, setting the pixel value of the circle for a circle as a specific color according to the radius r, and drawing the circle;
Seventhly, four steps: the drawing of an ellipse requires many parameters, and the center (x, y) of the ellipse, the length of the axis, the long radius l and the short radius s, and the deflection angle need to be determined. Thus, ellipses with different angles can be drawn;
seventhly, the steps are as follows: drawing a polygon, wherein the polygon firstly needs to determine the number n of sides of the polygon, and the number d of top points is set according to the number n of sides n =n, randomly giving the coordinates of each vertex (d 1 ,d 2 ,d 3 ,...,d n ) And connecting the vertexes in turn, and setting all pixel points to be of specific colors, so that polygons can be drawn on the screen.
The Mel-Frequency Cepstrum Coefficient (MFCC) extraction process comprises preprocessing, fast Fourier transform, Mel filter bank, logarithm operation, discrete cosine transform, dynamic feature extraction and other steps;
The Fast Fourier Transform (FFT) is the general term for efficient, fast algorithms for computing the Discrete Fourier Transform (DFT) on a computer;
two assertions of Fourier:
a periodic signal can be represented as a weighted sum of sinusoidal signals in harmonic relation;
a non-periodic signal can be represented as a weighted integral of sinusoidal signals;
four forms: FS (continuous, periodic signal), FT (continuous, non-periodic signal), DFS (discrete, periodic signal), DTFT (discrete, non-periodic signal). The FFT steps are as follows:
Step 1: The signal x is decomposed into two sub-signals: the even-sample-point signal x[2n] and the odd-sample-point signal x[2n+1];
Step 2: The two summation terms are understood as two DFTs of length N/2;
Step 3: The FFT computation then proceeds recursively on the sub-signals;
for each k the direct DFT performs N multiplications and N-1 additions, so the direct DFT requires N^2 multiplications in total, which the FFT avoids.
The specific flow of the acoustic model AM comprises the following steps:
(1) The GMM speech recognition model recognizes speech and outputs text information; each GMM (for the classes 0-9 and 'o') is trained with the speech data corresponding to it. At test time, the whole utterance is framed, windowed and feature-extracted, the likelihood of each frame is calculated on each GMM, and the final likelihood is obtained by summation;
(2) The K-Means algorithm initializes the parameters of the GMM model: for a given sample set, the samples are divided into K clusters according to the distances between them, so that points within a cluster are connected as tightly as possible while the distance between clusters is as large as possible;
input: sample set d= { x1, x2,.. cluster tree k=5, maximum iteration number N;
and (3) outputting: cluster partition c= { C1, C2,..ck };
1) Randomly selecting k samples from the data set D as the initial k centroid vectors: { μ1, μ2,., μk };
2) For n=1, 2,..
a) Initializing cluster partition C tot=1,2...k
b) For i=1, 2..m, sample X was calculated i And the respective centroid vector mu j Distance of (j=1, 2,., k):
x is to be i Minimal marked asd ij The corresponding class λi is updated at this time with cλi=cλi { xi }
c) For j=1, 2..k, recalculate the new centroid for all sample points in Cj
d) If all k centroid vectors have not changed, go to step 3)
3) Output cluster partition c= { C1, C2,..
(3) The EM algorithm trains the GMM: given a GMM, the optimization goal is to find the mean vector, covariance matrix and mixing coefficient of each Gaussian component that maximize the likelihood function;
initialize the parameters; E-step: calculate the posterior probabilities with the current parameters; M-step: re-estimate the parameters using the posteriors; then recalculate the likelihood function and repeat the two steps above until the convergence condition is satisfied.
Threshold-based VAD: by extracting the characteristics of time domain (short-time energy, short-time zero crossing rate and the like) or frequency domain (MFCC, spectral entropy and the like), the aim of distinguishing voice from non-voice is achieved by reasonably setting a threshold.
When the words are being recited, a word enters from one side of the curved screen 5 and moves to the other side; its position is identified by the CNN convolutional neural network algorithm. When the word rotates past the rear of the curved screen 5, the screen no longer displays it, and the rear-side loudspeakers 2 and the ear-side loudspeakers 4 emit sounds of different decibels according to the virtual moving position of the word; the word position is simulated by the pitch and loudness of the sound, mobilizing the spatial-position part of the brain to assist memory, and the word is rotated forward and backward for many revolutions to deepen the memory.
The number of the rear horns 2 in a group is at least 2, and the rear horns are symmetrically distributed about the symmetry axis of the seat 1.
The number of the ear-side horns 4 in one group is at least 2, and the two groups of the ear-side horns 4 are symmetrically distributed relative to the symmetry axis of the seat 1.
Embodiment one:
things needing to be recited or memorized, such as words, are displayed in a variable position on the curved surface screen 5, and the steering engine 8 is controlled to rotate, so that the curved surface screen 5 swings, and the position of the things needing to be recited or memorized is changed continuously;
identifying word positions according to a CNN-based convolutional neural network algorithm;
The rear-side loudspeakers 2 and the ear-side loudspeakers 4 emit sounds of different decibels, so that the position heard by the left and right ears coincides with the position the eyes see; combining viewing with position deepens the memory, and viewing at a changing position keeps the attention concentrated on the same place.
Embodiment two: This embodiment further develops the first embodiment. When the words are being recited, a word enters from one side of the curved screen 5 and moves to the other side; its position is identified by the CNN convolutional neural network algorithm. When the word rotates past the rear of the curved screen 5, the screen no longer displays it, and the rear-side loudspeakers 2 and the ear-side loudspeakers 4 emit sounds of different decibels according to the virtual moving position of the word; the word position is simulated by the pitch and loudness of the sound, mobilizing the spatial-position part of the brain to assist memory, and the word is rotated forward and backward for many revolutions to deepen the memory.
Embodiment three: This embodiment further develops the first embodiment. A recording device is disposed near the mouth of the user, and the user's speech is recognized by a speech recognition algorithm.
The speech is compared with the memory content played on the screen; recitation content with a large difference is played back multiple times, and content spoken relatively quietly (where the user's voice drops) is judged to be not deeply memorized and is also played back multiple times. During playback, a combination of sound and silence is selected, and the user's recitation is recorded and judged both while the sound is playing and during the silence.
Embodiment four: this embodiment is further described on the basis of the first, second or third embodiment, wherein a background with a color is generated in the curved screen 5, and the color accords with the memory content as much as possible, thereby assisting in deepening the memory.
Fifth embodiment: the embodiment is further described on the basis of the first embodiment, the second embodiment or the third embodiment, the background with different shapes is generated, and adjacent memory contents are conveniently distinguished by using the different shapes.
With this device, the content to be memorized or recited is input to the curved screen 5 and displayed at changing positions on the curved screen 5; the word position is identified by the CNN convolutional neural network algorithm. When a word rotates past the rear of the screen 5, the screen no longer displays it, and the rear-side loudspeakers 2 and the ear-side loudspeakers 4 emit sounds of different decibels according to the virtual moving position of the word; the word position is simulated by the pitch and loudness of the sound, mobilizing the spatial-position part of the brain to assist memory, and the word is rotated forward and backward for many revolutions to deepen the memory.
The device is provided with a recording device near the mouth of the user, and the user's speech is recognized by a speech recognition algorithm. The speech is compared with the memory content played on the screen; recitation content with a large difference is played back multiple times, and content spoken relatively quietly (where the user's voice drops) is judged to be not deeply memorized and is also played back multiple times. During playback, a combination of sound and silence is selected, and the user's recitation is recorded and judged both while the sound is playing and during the silence.
The device is cleverly designed: the intelligent equipment stimulates the user's brain through vision and sound, generates backgrounds with color and shape to deepen the visual stimulation, and deepens the auditory stimulation by comparing the user's voice with the original content, thereby assisting the user in memorizing and reciting.
The above disclosure is merely illustrative of specific embodiments of the present invention, but the present invention is not limited thereto, and any variations that can be considered by those skilled in the art should fall within the scope of the present invention.
Claims (10)
1. A memory recitation assistance system comprising a seat (1), characterized in that:
the seat (1) is fixedly connected with a group of rear-side horns (2), the seat (1) is fixedly connected with symmetrical square plates (3), and the square plates (3) are fixedly connected with a group of ear-side horns (4);
the seat (1) is fixedly connected with an L-shaped plate (9), the L-shaped plate (9) is fixedly connected with a steering engine (8), an output shaft of the steering engine (8) is fixedly connected with a circular plate (7), the circular plate (7) is fixedly connected with a group of circular rods (6), and the circular rods (6) are fixedly connected with a curved surface screen (5);
the method also comprises the following steps:
step one: the method comprises the steps that things needing to be recited or memorized are displayed in a variable position on the curved surface screen (5), and the steering engine (8) is controlled to rotate, so that the curved surface screen (5) swings, and the positions of the things needing to be recited or memorized are changed continuously;
Step two: identifying word positions according to a CNN-based convolutional neural network algorithm;
Word positions are recognized based on a CNN convolutional neural network algorithm. The One-Stage target detection algorithm YOLO is selected: it directly generates the class probability and position coordinate values of an object without a region-proposal stage, so the final detection result is obtained in a single pass, which gives the YOLOv4 algorithm high detection speed and efficiency. On the basis of the original YOLO target detection architecture, YOLOv4 adopts the best optimization strategies of the CNN field in recent years, with optimizations of various degrees to data processing, the backbone network, network training, the activation function, the loss function and so on, giving better precision and efficiency. The network structure of YOLOv4 can be divided into the Input, Backbone, Neck and Head modules;
Step 2.1: Input performs data enhancement on the input data; Mosaic data enhancement is adopted. Mosaic data enhancement evolved from CutMix data enhancement: CutMix uses two pictures for enhancement, while Mosaic extends this to four pictures, which are spliced together after random scaling, random cropping and random layout, so that a very rich data set can be obtained at once; different words appearing at different positions of the screen are used as the input data;
Step two: the backbox is upgraded once in the YOLOv4, namely CSPDarknet53, CSPNet is named Cross Stage Partial Networks, namely a cross-stage local network, the CSPNet solves the problem of gradient information repetition of network optimization in other large convolutional neural network frame backboxes, the gradient change is integrated into a feature map from head to tail, so that the parameter number and FLOPS value of a model are reduced, the reasoning speed and accuracy are ensured, the model size is reduced, the CSPNet is actually based on the thought of Densonet, the feature map of a base layer is copied, a copy is sent to the next stage through a dense block, so that the feature map of the base layer is separated, the problem of gradient disappearance (the problem that a lost signal is difficult to reversely push through a very deep network) can be effectively alleviated, the feature propagation is supported, the network reuse feature is encouraged, and the number of network parameters is reduced;
Step 2.3: Mish is chosen here as the activation function; it is an activation function very similar to ReLU and Swish, with the following formula:
y = x * tanh(ln(1 + e^x))
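For illustration only (not part of the claimed system), a minimal NumPy sketch of the Mish activation above; the function name is our own:

```python
import numpy as np

def mish(x):
    # Mish activation: y = x * tanh(ln(1 + e^x)) = x * tanh(softplus(x)).
    # np.log1p(np.exp(x)) is the softplus term; for very large x this can overflow,
    # so a production implementation would use a numerically stable softplus.
    return x * np.tanh(np.log1p(np.exp(x)))

# Mish behaves like ReLU for large positive x and stays smooth around zero.
print(mish(np.array([-2.0, 0.0, 2.0])))
```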
Step 2.4: the Neck of YOLOv4 uses PANet (Path Aggregation Network) instead of FPN for parameter aggregation, so that it suits target detection at different levels; where the original method fuses features by addition, the YOLOv4 algorithm changes the fusion method from addition to concatenation, another way of fusing feature maps;
Step 2.5: the YOLOv3 head is used for the final YOLO prediction layers in YOLOv4; after passing through the Neck described above, YOLOv4 outputs three YOLO layers:
(1) First YOLO layer: feature map 76 × 76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;
(2) Second YOLO layer: feature map 38 × 38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;
(3) Third YOLO layer: feature map 19 × 19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;
Step 2.6: YOLOv4 also innovates on the bounding-box regression loss, adopting CIoU loss for regression prediction, which gives the prediction boxes better speed and accuracy;
Step 2.7: train the model; the training data set under darknet is COCO. First prepare the data set: at the start of training create a yolov4 folder, add yolov4.cfg, coco.data and coco.names, create a backup folder under the yolov4 folder for storing intermediate weights, and then start training by executing ./darknet detector train;
Step 2.8: test the trained model to obtain the detection effect and output the position of a word on the screen, i.e. its (x, y) coordinates; iterate the training several times according to the errors to improve the accuracy of the model;
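As a hedged illustration of steps 2.7-2.8 (not part of the claim), a trained darknet YOLOv4 model can be loaded with OpenCV's DNN module to obtain the (x, y) position of a word in a screen image; the file names and the confidence threshold below are assumptions:

```python
import cv2
import numpy as np

# Assumed files produced by the training step: yolov4.cfg and a weights file in backup/.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "backup/yolov4_final.weights")
layer_names = net.getUnconnectedOutLayersNames()

frame = cv2.imread("screen_capture.jpg")           # screenshot of the curved screen (assumed)
h, w = frame.shape[:2]
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(layer_names)

# Each detection row: [cx, cy, bw, bh, objectness, class scores...], coordinates normalized.
for out in outputs:
    for det in out:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(scores[class_id])
        if conf > 0.5:
            x, y = int(det[0] * w), int(det[1] * h)   # word centre in screen pixels
            print("word position:", (x, y), "confidence:", conf)
```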
Step three: the rear-side speakers (2) and the ear-side speakers (4) emit sounds of different decibel levels, so that the position of the sound heard by the left and right ears coincides with the position the eyes are looking at; combining sight with spatial position deepens memory, and combining sight with the changing position keeps attention concentrated on the same spot.
2. The memory recitation assistance system of claim 1, further comprising the steps of:
Step four: a recording device is arranged near the user's mouth, and the user's speech is recognized by a speech recognition algorithm;
the speech recognition algorithm converts a segment of a voice signal into the corresponding text information; the main pipeline consists of four parts: feature extraction -> acoustic model -> language model -> dictionary and decoding;
Step 4.1: preprocessing; to extract features more effectively, the collected sound signal usually needs audio preprocessing such as filtering and framing, so that the signal to be analyzed is properly extracted from the raw signal;
the silence at the head and tail is cut off to reduce interference with the subsequent steps; this silence removal is generally called VAD. The sound is then divided into frames, i.e. cut into small segments, each called a frame; this is implemented with a sliding window function, and adjacent frames generally overlap rather than being cut apart;
(1) The codec is used to avoid frequency aliasing: before the analog signal is discretized and sampled, a low-pass filter removes frequency components above 1/2 of the sampling frequency; in the design of an actual instrument, the cut-off frequency (fc) of the low-pass filter is:
cut-off frequency (fc) = sampling frequency (fs) / 2.56
(2) Pre-emphasis: to emphasize the high-frequency part of the speech, remove the influence of lip radiation, and increase the high-frequency resolution of the speech;
pre-emphasis is typically implemented with a high-pass digital filter whose transfer function is H(z) = 1 - a·z^(-1), where a is the pre-emphasis coefficient, 0.9 < a < 1.0 (here a = 0.97); letting x(n) be the speech sample value at time n, the result after pre-emphasis is y(n) = x(n) - a·x(n-1);
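A minimal sketch of the pre-emphasis filter above with a = 0.97, assuming the samples are already loaded into a NumPy array:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    # y(n) = x(n) - a * x(n-1); the first sample is kept unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])

signal = np.array([0.1, 0.3, 0.2, -0.1, 0.05])
print(pre_emphasis(signal))
```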
(3) Endpoint detection, also called Voice Activity Detection (VAD), aims to distinguish speech from non-speech regions: it accurately locates the starting and ending points of speech in a noisy signal, removes the silent parts and the noise parts, and finds the genuinely useful content of the speech;
VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as a classifier, and model-based VAD;
Step 4.2: feature extraction converts the sound signal from the time domain to the frequency domain and provides the acoustic model with suitable feature vectors; the main algorithms include Linear Prediction Cepstrum Coefficients (LPCC) and Mel-Frequency Cepstrum Coefficients (MFCC); the purpose is to turn each frame of the waveform into a multidimensional vector containing the sound information;
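As an illustrative sketch of step 4.2 (one common way to obtain per-frame MFCC vectors, not necessarily the method used here), the librosa library can compute the features directly; the file name and sampling rate are assumptions:

```python
import librosa

# Load the recording (resampled to 16 kHz) and compute 13 MFCCs per frame.
y, sr = librosa.load("recitation.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfcc.shape)
```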
Step 4.3: the acoustic model AM computes a score for each feature vector on the acoustic features; it is obtained by training on speech data, its input is the feature vector, and its output is the phoneme information;
Step 4.4: the language model LM computes the probability of the possible word sequences corresponding to the sound signal according to linguistic theory; the probabilities of association between individual characters or words are obtained by training on a large amount of text;
Step 4.5: dictionary: the correspondence between words or characters and phonemes; in short, for Chinese it is the correspondence between pinyin and Chinese characters, and for English it is the correspondence between phonetic transcription and words;
Step 4.6: decoding: the audio data whose features have been extracted is turned into text output via the acoustic model and the dictionary;
Step five: the speech is compared with the memory content shown on the screen; recitation content with a large difference is played several times, content recited with a relatively low voice is judged to be poorly memorized and is also played several times, and a combination of sound and silence is chosen during playback, with the moments of sound and silence recorded and judged;
Based on a threshold, the difference between the speech recognition result and the content shown on the screen is compared with the threshold to judge the degree of difference, and the content is played several times when the difference exceeds the threshold; a threshold t, i.e. the acceptable degree of difference, is determined,
according to the language model LM in step four, by calculating the association probability p between the input sound signal and the word; taking the word on the screen itself as 1, the threshold t is set to 0.3 and y = 1 - p; when y ≤ 0.3 the difference is considered negligible, and when y > 0.3 the difference is considered large.
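A minimal sketch of the threshold rule above, assuming p is the association probability returned by the language model:

```python
def needs_replay(p, t=0.3):
    # y = 1 - p; replay the content when the difference y exceeds the threshold t.
    y = 1.0 - p
    return y > t

print(needs_replay(0.9))   # small difference -> False, no replay
print(needs_replay(0.5))   # large difference -> True, replay several times
```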
3. The memory recitation assistance system of claim 2, further comprising the steps of: Step six: generating a colored background on the curved screen (5), the color matching the memory content as closely as possible to help deepen memory;
generating a colored background on said curved screen (5) first requires selecting a color model for the algorithm; a color model is a mathematical model used to represent colors;
Step 6.1: different color models suit different application scenarios; the RGB model suits devices such as displays, so the RGB color model is selected to generate the different colors;
Step 6.2: in ordinary development and display processing, almost all colors are handled with this model; it produces different colors by varying the intensities of the red, green and blue channels and additively mixing them, and in this additive mixing the brightness equals the combined brightness of the mixed colors, so the more brightness is mixed in, the brighter the result;
Step 6.3: a 24-bit format is selected, i.e. each color channel R, G, B is represented by 8 bits of data; 2^8 = 256, so each channel can represent a color value on a 0-255 scale, darkest at 0 and brightest at 255;
Step 6.4: according to the correlation between the memory content and the color, the color scheme of the screen is determined; the specific RGB value of each channel is obtained by looking up a color table, and the screen color is set accordingly.
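An illustrative sketch of steps 6.3-6.4: a 24-bit RGB background color is chosen by setting each 8-bit channel to a value in 0-255; the mapping from memory content to color below is hypothetical:

```python
# Hypothetical mapping from memory-content category to 24-bit RGB background colors.
COLOR_TABLE = {
    "ocean_vocabulary": (0, 105, 148),    # blue background for sea-related words
    "plant_vocabulary": (34, 139, 34),    # green background for plant-related words
}

def background_color(category):
    r, g, b = COLOR_TABLE.get(category, (255, 255, 255))   # default: white
    assert all(0 <= c <= 255 for c in (r, g, b))            # each channel uses 8 bits
    return r, g, b

print(background_color("ocean_vocabulary"))
```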
4. The memory recitation assistance system of claim 2, further comprising the steps of: Step seven: generating backgrounds of different shapes, so that adjacent memory contents are easily distinguished by shape;
Step 7.1: several different shapes such as rectangles, circles, ellipses and polygons are selected and drawn in a repeating cycle, ensuring that adjacent contents never share the same shape;
Step 7.2: the different shapes are drawn from their parameters. For a rectangle, the required parameters are length and width: first the position where the rectangle is to be drawn on the screen is set, usually its upper-left corner coordinates (x, y); once the length and width are determined, the rectangle is drawn on the screen by setting the pixels starting from (x, y) to the specific value of the preset color, yielding the rectangle;
Step 7.3: to draw a circle, its position (x, y) on the screen is determined as the center; according to the radius r, the pixels on the circle are set to a specific color and the circle is drawn;
Step 7.4: drawing an ellipse requires more parameters: the center (x, y), the axis lengths (long radius l and short radius s) and the deflection angle must be determined, so ellipses at different angles can be drawn;
Step 7.5: to draw a polygon, the number of sides n is first determined and the number of vertices is set accordingly (d_n = n); the coordinates of the vertices (d_1, d_2, d_3, ..., d_n) are given randomly, the vertices are connected in turn, and all the pixels are set to a specific color, so polygons can be drawn on the screen.
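A hedged sketch of steps 7.1-7.5 using the Pillow imaging library (one possible way to rasterize the shapes; the canvas size, positions and colors are illustrative):

```python
from PIL import Image, ImageDraw
import math, random

img = Image.new("RGB", (800, 600), (255, 255, 255))
draw = ImageDraw.Draw(img)

# Rectangle: top-left corner (x, y) plus length and width.
draw.rectangle([50, 50, 50 + 200, 50 + 100], fill=(200, 220, 255))

# Circle: centre (x, y) and radius r, drawn via its bounding box.
x, y, r = 400, 150, 60
draw.ellipse([x - r, y - r, x + r, y + r], fill=(255, 220, 200))

# Polygon: n vertices placed at random radii around a centre and connected in order.
n, cx, cy = 5, 600, 400
vertices = [(cx + random.randint(40, 90) * math.cos(2 * math.pi * i / n),
             cy + random.randint(40, 90) * math.sin(2 * math.pi * i / n)) for i in range(n)]
draw.polygon(vertices, fill=(220, 255, 200))

img.save("background_shapes.png")
```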
5. The memory recitation assistance system of claim 2, wherein: the Mel cepstrum coefficient extraction process comprises preprocessing, fast Fourier transform, Mel filter bank, logarithm operation, discrete cosine transform, dynamic feature extraction and the like;
the fast Fourier transform (FFT) is the general term for efficient, fast methods of computing the Discrete Fourier Transform (DFT) with a computer;
two Fourier propositions:
a periodic signal can be represented as a weighted sum of harmonically related sinusoidal signals;
a non-periodic signal can be represented as a weighted integral of sinusoidal signals;
four forms: FS, FT, DFS, DTFT; the FFT proceeds as follows:
Step 1: the signal x is decomposed into two sub-signals, the even sample-point signal x[2n] and the odd sample-point signal x[2n+1];
Step 2: the two summation terms are understood as two DFTs of length N/2;
Step 3: a specific calculation process of FFT;
for each value of k, N multiplications and N-1 additions are performed, so the DFT requires N² multiplications and N·(N-1) additions in total.
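An illustrative sketch of the radix-2 decomposition described in steps 1-2 (a recursive FFT for lengths that are powers of two), shown only to make the even/odd split concrete:

```python
import numpy as np

def fft_radix2(x):
    # Recursive radix-2 FFT: split x into even samples x[2n] and odd samples x[2n+1],
    # compute two DFTs of length N/2, then combine them with twiddle factors.
    N = len(x)
    if N == 1:
        return x
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.rand(8)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))   # True: matches NumPy's FFT
```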
6. The memory recitation assistance system of claim 2, wherein: the specific flow of the acoustic model AM comprises the following steps:
The GMM speech recognition model recognizes speech and outputs text information; each GMM (for the classes 0-9 and 'o') is trained with its own corresponding speech data. At test time, the whole utterance can only be framed, windowed and have its features extracted; the likelihood of each frame is then computed on each GMM, and the final likelihood is obtained by summation;
(2) The K-Means algorithm initializes the parameters of the GMM model: for a given sample set, the samples are divided into K clusters according to the distances between them, so that points within a cluster are connected as tightly as possible while the distance between clusters is as large as possible;
Input: sample set D = {x1, x2, ..., xm}, number of clusters k = 5, maximum number of iterations N;
Output: cluster partition C = {C1, C2, ..., Ck};
1) Randomly select k samples from the data set D as the initial k centroid vectors {μ1, μ2, ..., μk};
2) For n = 1, 2, ..., N:
a) Initialize the cluster partition as Ct = ∅ for t = 1, 2, ..., k;
b) For i = 1, 2, ..., m, compute the distance d_ij between sample x_i and each centroid vector μ_j (j = 1, 2, ..., k); assign x_i to the class λi with the smallest d_ij and update C_λi = C_λi ∪ {x_i};
c) For j = 1, 2, ..., k, recalculate the new centroid μ_j as the mean of all sample points in C_j;
d) If all k centroid vectors have not changed, go to step 3)
3) Output the cluster partition C = {C1, C2, ..., Ck};
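A minimal NumPy sketch of the K-Means loop in steps 1)-3), used only to make the assignment and centroid update concrete (k and the data are illustrative; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k=5, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]          # step 1): random initial centroids
    for _ in range(n_iter):                                       # step 2): iterate at most N times
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                 # assign each sample to its nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # step d): stop when centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids                                      # step 3): cluster partition

X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(X, k=5)
print(centroids)
```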
(3) The EM algorithm trains the GMM model: given a GMM model, the optimization objective is to find the mean vector, covariance matrix and mixing coefficient of each Gaussian component that maximize the likelihood function;
initialize the parameters; E step: compute the posterior probabilities with the current parameters; M step: re-estimate the parameters with the posteriors; then recompute the likelihood function and repeat the E and M steps until the convergence condition is satisfied.
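For illustration only, the EM training of a GMM (initialization, E step, M step, likelihood check) is available off-the-shelf in scikit-learn; the feature matrix below is a stand-in for the per-frame MFCC vectors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

features = np.random.default_rng(0).normal(size=(500, 13))   # stand-in MFCC frames
gmm = GaussianMixture(n_components=5, covariance_type="diag", max_iter=100)
gmm.fit(features)                       # runs EM: E step and M step repeated until convergence
print(gmm.score(features))              # average log-likelihood per frame
print(gmm.weights_, gmm.means_.shape)   # mixing coefficients and mean vectors
```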
7. The memory recitation assistance system of claim 2, wherein: the threshold-based VAD distinguishes speech from non-speech by extracting time-domain or frequency-domain features and setting a reasonable threshold.
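A minimal energy-threshold VAD sketch (one possible threshold-based variant, not the claimed implementation), assuming framed samples and an empirically chosen threshold:

```python
import numpy as np

def threshold_vad(frames, energy_threshold=0.01):
    # Mark a frame as speech when its short-time energy exceeds the threshold.
    energies = np.mean(frames ** 2, axis=1)
    return energies > energy_threshold

frames = np.random.default_rng(0).normal(scale=0.005, size=(10, 400))
frames[3:6] += np.random.default_rng(1).normal(scale=0.2, size=(3, 400))  # louder "speech" frames
print(threshold_vad(frames))
```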
8. The memory recitation aid system of claim 1, wherein: when words are being recited, a word enters from one side of the curved screen (5) and moves to the other side of the curved screen (5), and its position is identified with the CNN convolutional neural network algorithm; when the word rotates to behind the screen of the curved screen (5), the screen of the curved screen (5) no longer displays it, and the rear-side speakers (2) and ear-side speakers (4) emit sounds of different decibel levels according to the word's virtual moving position, simulating the word's position through the level of the sound, so that the spatial-position part of the brain is engaged to assist memory; the word circles forward and backward for many revolutions, deepening the memory.
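As an illustrative sketch of this claim: constant-power stereo panning is one standard way to place a virtual sound between two speakers; the mapping from the word's virtual screen position to the pan value is an assumption:

```python
import numpy as np

def speaker_gains(pan):
    # pan in [-1, 1]: -1 = fully left speaker, +1 = fully right speaker.
    # Constant-power panning keeps the perceived loudness roughly constant while moving.
    theta = (pan + 1) * np.pi / 4
    return np.cos(theta), np.sin(theta)      # (left gain, right gain)

# As a word "circles" behind the screen, sweep its virtual position from left to right.
for pan in np.linspace(-1, 1, 5):
    left, right = speaker_gains(pan)
    print(f"pan={pan:+.1f}  left={left:.2f}  right={right:.2f}")
```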
9. The memory recitation aid system of claim 6, wherein: the number of rear-side speakers (2) is at least 2, and they are symmetrically distributed about the axis of symmetry of the seat (1).
10. The memory recitation aid system of claim 7, wherein: the number of ear-side speakers (4) is at least 2, and the two ear-side speakers (4) are symmetrically distributed about the axis of symmetry of the seat (1).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211106551.7A CN115641763B (en) | 2022-09-12 | 2022-09-12 | Memory recitation auxiliary system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115641763A CN115641763A (en) | 2023-01-24 |
CN115641763B true CN115641763B (en) | 2023-12-19 |
Family
ID=84943245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211106551.7A Active CN115641763B (en) | 2022-09-12 | 2022-09-12 | Memory recitation auxiliary system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115641763B (en) |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100988759B1 (en) * | 2010-03-05 | 2010-10-20 | 재단법인 광양만권 유아이티연구소 | System for learning english depending on a situation based on the recognizing position |
RO131754A1 (en) * | 2015-09-29 | 2017-03-30 | Danimated Studio S.R.L. | Method and device for learning calligraphic writing |
CN107066973A (en) * | 2017-04-17 | 2017-08-18 | 杭州电子科技大学 | A kind of video content description method of utilization spatio-temporal attention model |
CN107592451A (en) * | 2017-08-31 | 2018-01-16 | 努比亚技术有限公司 | A kind of multi-mode auxiliary photo-taking method, apparatus and computer-readable recording medium |
CN108665769A (en) * | 2018-05-11 | 2018-10-16 | 深圳市鹰硕技术有限公司 | Network teaching method based on convolutional neural networks and device |
WO2019214019A1 (en) * | 2018-05-11 | 2019-11-14 | 深圳市鹰硕技术有限公司 | Online teaching method and apparatus based on convolutional neural network |
CN110599822A (en) * | 2019-08-28 | 2019-12-20 | 湖南优美科技发展有限公司 | Voice blackboard-writing display method, system and storage medium |
EP3936079A1 (en) * | 2020-07-10 | 2022-01-12 | Spine Align, LLC | Intraoperative alignment assessment system and method |
CN111986667A (en) * | 2020-08-17 | 2020-11-24 | 重庆大学 | Voice robot control method based on particle filter algorithm |
CN113076938A (en) * | 2021-05-06 | 2021-07-06 | 广西师范大学 | Deep learning target detection method combined with embedded hardware information |
CN113505775A (en) * | 2021-07-15 | 2021-10-15 | 大连民族大学 | Manchu word recognition method based on character positioning |
CN114463724A (en) * | 2022-04-11 | 2022-05-10 | 南京慧筑信息技术研究院有限公司 | Lane extraction and recognition method based on machine vision |
Non-Patent Citations (1)
Title |
---|
Design of a handwriting recognition system based on entropy power; Zhang Jinglin; Wang Xuzhi; Wan Wanggen; Wu Yongliang; Electronic Design Engineering (03); pp. 1-3 *
Also Published As
Publication number | Publication date |
---|---|
CN115641763A (en) | 2023-01-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||