CN112232127A - Intelligent speech training system and method - Google Patents
- Publication number: CN112232127A
- Application number: CN202010961200.9A
- Authority
- CN
- China
- Prior art keywords: image, sound, signal, speaker, speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/174 — Facial expression recognition
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G10L21/0208 — Noise filtering (speech enhancement)
- G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
Abstract
The invention discloses an intelligent speech training system. A human body biological state sensor collects the gesture actions and sound information of a speaker; an image processing module processes the collected gesture actions to obtain clean image feature signals; a sound processing module converts the speaker's sound signal into an electrical signal and performs noise reduction and filtering; a central processing module compares the image feature signals and the sound electrical signal with pre-stored standard gesture actions and speech sounds and gives improvement suggestions, which are output through an output module and delivered to the speaker for timely improvement. The invention solves the problems that existing speech training is not standardized, cannot provide standard guidance, has poor training effect, and relies excessively on experienced instructors.
Description
Technical Field
The invention relates to the field of speech training, in particular to an intelligent speech training system and method.
Background
In traditional speech instruction, a comprehensive judgment can only be made by manually watching the speaker's speech state, observing his or her posture, and assessing the volume and emotional intensity of the voice. Moreover, different instructors hold different opinions, so no uniform standard can be formed and deviations easily occur.
With the continuous development of science and technology and the introduction of information technology, computer technology and artificial intelligence, robotics research has gradually moved beyond the industrial field and expanded into medical treatment, health care, the home, entertainment, the service industry and other fields. People's expectations of robots have likewise risen from simple, repetitive mechanical actions to intelligent robots capable of anthropomorphic question answering, autonomy and interaction with other robots, and human-computer interaction has become an important factor in the development of intelligent robots. Having a robot collect the speech state of a speaker and compare it with a standard state has therefore become a development trend in speech training instruction.
Disclosure of Invention
Therefore, the invention provides an intelligent speech training system and method, aiming to solve the problems that existing speech training is not standardized, cannot provide standard guidance, has poor training effect, and relies excessively on experienced instructors.
In order to achieve the above purpose, the invention provides the following technical solutions:
According to the first aspect of the invention, an intelligent speech training system is disclosed. A human body biological state sensor collects the gesture actions and sound information of a speaker; an image processing module processes the collected gesture actions to obtain clean image feature signals; a sound processing module converts the speaker's sound signal into an electrical signal and performs noise reduction and filtering; a central processing module compares the image feature signals and the sound electrical signal with pre-stored standard gesture actions and speech sounds and gives improvement suggestions, which are output through an output module and delivered to the speaker for timely improvement.
Further, the human body biological state sensor comprises a posture detector and a sound collection device. The posture detector is mounted horizontally on a stand 2-3 m in front of the speaker, the stand being 1.6 m high; the sound collection device is attached to the speaker's clothes or to the podium. The posture detector captures the speaker's facial expressions and limb actions, and the sound collection device captures the speaker's sound signal.
Further, the image processing module is connected to the posture detector and performs image signal extraction, image signal preprocessing, image signal feature extraction, direction analysis and intelligent tracking, and encoded storage of image information on the collected facial expression and limb action images.
Further, the image processing module performs noise reduction and feature enhancement on the recorded image signals to obtain relatively clean image feature vectors that satisfy the conditions for subsequent feature extraction. Image signal extraction and preprocessing are mainly performed by the front-end camera equipment. Feature extraction pulls from the image sequence the feature information usable for target tracking. Direction analysis and intelligent tracking recognize the direction of a given feature to judge its range and frequency of activity, while intelligent tracking also records the motion trajectory of the feature target. Encoded storage compresses the analyzed and tracked image information by encoding it, reducing the stored information volume and making later retrieval by subsequent systems convenient.
Further, the sound collection device sends the collected sound information of the speaker to the sound processing module, which converts the sound signal into an electrical signal and obtains the loudness, frequency, content, duration and the interval between syllables; the electrical signal is filtered and denoised by a filter and amplified by an amplifier to obtain a clean electrical signal free of clutter interference.
Further, the central processing module contains a storage unit, a training unit and a comparison unit. The storage unit stores the image signals and the sound electrical signals; the training unit is trained with standard action image signals and corresponding standard sound electrical signals for a large number of speech articles; the comparison unit compares the collected image and sound signals with the standard action image signals and the corresponding standard sound electrical signals.
Further, the training unit learns from the actions and sounds corresponding to lecture materials prepared in advance; different types of lecture material correspond to different gesture actions and speech-sound emotions, and after training the unit can output standard guidance actions and sounds for different lecture materials.
Further, the comparison unit compares the collected actual image signals and sound electrical signals with the standard image signals and sound electrical signals, labels the places that differ, points out the differences, gives corresponding improvement suggestions, and transmits them to the output module for output.
Further, the output module sends the specific improvement suggestions to a display screen in front of the speaker and, through a Bluetooth module, sends the related voice prompts to a Bluetooth headset worn by the speaker.
According to a second aspect of the present invention, a method for intelligent speech training is disclosed, the method comprising:
collecting the facial expressions and limb actions of the speaker with a posture detector, and collecting the speaker's sound signal with a sound collection device;
performing, through an image processing module connected to the posture detector, image signal extraction, preprocessing and feature extraction on the collected facial expression and limb action images, together with direction analysis, intelligent tracking and encoded storage of the image information;
converting the sound signal into an electrical signal through the sound processing module, filtering and denoising it with a filter, and amplifying it with an amplifier to obtain a clean electrical signal free of clutter interference;
training a training unit in the central processing module in advance with a large amount of data, so that after training it can output standard guidance actions and sounds for different lecture materials;
comparing the collected actual image signals and sound electrical signals with the standard ones, and giving improvement suggestions for the places that differ;
delivering the improvement suggestions through the display screen in front of the speaker and the Bluetooth headset, so that the speaker receives the relevant suggestions and adjusts his or her speech state in training, making the speech more standard and raising the speech level.
The invention has the following advantages:
The invention discloses an intelligent speech training system and method. The gesture actions and sound signals of a speaker are collected; a central processing module compares the processed gesture actions and sound signals with standard actions and sounds, labels the places that differ, and outputs improvement suggestions, which are fed back to the speaker through a display screen and a Bluetooth headset. The speaker adjusts according to the given suggestions, which provides effective guidance, improves the speaker's level, and allows a personalized guidance scheme to be formed for different lecture contents.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings in the following description are merely exemplary, and those of ordinary skill in the art can derive other embodiments from them without inventive effort.
The structures, ratios and sizes shown in this specification are used only to match the disclosed contents for the understanding of those skilled in the art, and do not limit the conditions under which the invention can be implemented; any structural modification, change of ratio or adjustment of size that does not affect the effects achievable by the invention still falls within the scope covered by the disclosed technical contents.
Fig. 1 is a flowchart of an intelligent speech training system according to an embodiment of the present invention;
fig. 2 is a schematic diagram of hardware connection of an intelligent speech training system according to an embodiment of the present invention;
Detailed Description
The present invention is described below in terms of particular embodiments; other advantages and effects of the invention will become readily apparent to those skilled in the art from this disclosure. It should be understood that the described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
Example 1
This embodiment discloses an intelligent speech training system. The system uses a human body biological state sensor to collect the gesture actions and sound information of a speaker; the collected gesture actions are processed by an image processing module to obtain clean image feature signals; a sound processing module converts the speaker's sound signal into an electrical signal and performs noise reduction and filtering; a central processing module compares the image feature signals and the sound electrical signal with pre-stored standard gesture actions and speech sounds and gives improvement suggestions, which are output through an output module and delivered to the speaker for timely improvement.
The human body biological state sensor comprises a posture detector and a sound collection device. The posture detector is mounted horizontally on a stand 2-3 m in front of the speaker, the stand being 1.6 m high; the sound collection device is attached to the speaker's clothes or to the podium. The posture detector captures the speaker's facial expressions and limb actions, and the sound collection device captures the speaker's sound signal. The information collected by the posture detector includes facial expressions such as smiling, anger, joy, sadness and excitement, and body actions such as waving, clenching a fist and clapping. The sound collection device collects the speech content of the speaker; the sound state includes rousing, low, gentle, cheerful and the like. The collected information is sent to a memory for storage.
The image processing module is connected to the posture detector and performs image signal extraction, image signal preprocessing, image signal feature extraction, direction analysis and intelligent tracking, and encoded storage of image information on the collected facial expression and limb action images. The module performs noise reduction and feature enhancement on the recorded image signals to obtain relatively clean image feature vectors that satisfy the conditions for subsequent feature extraction. Image signal extraction and preprocessing are mainly performed by the front-end camera equipment. Feature extraction pulls from the image sequence the feature information usable for target tracking. Direction analysis and intelligent tracking recognize the direction of a given feature to judge its range and frequency of activity, while intelligent tracking also records the motion trajectory of the feature target. Encoded storage compresses the analyzed and tracked image information by encoding it, reducing the stored information volume and making later retrieval by subsequent systems convenient.
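The noise-reduction and feature-extraction stages described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: a mean filter stands in for the noise reduction, and the feature vector (mean intensity, variance, gradient energy) is a hypothetical choice of trackable features.

```python
import numpy as np

def denoise(frame: np.ndarray, k: int = 3) -> np.ndarray:
    """k x k mean filter as a stand-in for the module's noise reduction."""
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    out = np.empty_like(frame, dtype=float)
    for i in range(frame.shape[0]):
        for j in range(frame.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def feature_vector(frame: np.ndarray) -> np.ndarray:
    """Crude illustrative features: mean intensity, variance, gradient energy."""
    gy, gx = np.gradient(frame)
    return np.array([frame.mean(), frame.var(), (gx ** 2 + gy ** 2).mean()])
```

In practice the denoised frames would feed the tracking stage; here the point is only that smoothing reduces pixel noise before features are taken.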
After collecting the speaker's sound information, the sound collection device sends it to the sound processing module to build an acoustic model, whose purpose is to provide an effective way to compute the distance between the feature-vector sequence of the speech and each pronunciation template. Acoustic model design is closely tied to the characteristics of speech pronunciation. The size of the acoustic model unit (a word, semi-syllable or phoneme model) strongly influences the amount of training data required, the system recognition rate and the system's flexibility, and must be chosen according to the characteristics of the language and the vocabulary size of the recognition system. The commonly used units are initials, finals, syllables or words, selected according to the purpose of the implementation. Mandarin Chinese has 412 base syllables (including neutral-tone syllables) and 1282 tonal syllables, so words are often chosen as units for small-vocabulary isolated-word recognition, syllables or initials and finals for large-vocabulary recognition, and initial-final modeling for continuous speech recognition because of coarticulation effects. The common statistics-based speech recognition model is the HMM λ = (N, M, π, A, B); the related theory covers structure selection of the model, initialization, re-estimation of the model parameters, and the corresponding recognition algorithms.
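For the HMM λ = (N, M, π, A, B) mentioned above, the core likelihood computation can be illustrated with the forward algorithm. This is a textbook sketch, not the patent's recognizer; any concrete parameter values used with it would be hypothetical.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(obs | lambda) for a discrete HMM.

    pi  -- (N,) initial state probabilities
    A   -- (N, N) state transition matrix
    B   -- (N, M) emission matrix over M observation symbols
    obs -- sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction over time steps
    return float(alpha.sum())          # termination: sum over final states
```

A quick sanity check is to compare the result against brute-force summation over all state paths, which must give the same probability.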
The sound processing module converts the sound signal into an electrical signal; obtains the loudness, frequency, content, duration and the interval between syllables; filters and denoises the electrical signal with a filter; and amplifies it with an amplifier to obtain a clean electrical signal free of clutter interference.
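The filtering and measurement steps can be sketched in software. In this illustration a moving-average low-pass filter stands in for the hardware filter and amplifier chain, and the sampling rate is an assumed value; loudness is taken as RMS and frequency as the dominant FFT bin.

```python
import numpy as np

FS = 8000  # assumed sampling rate in Hz

def lowpass(signal: np.ndarray, k: int = 5) -> np.ndarray:
    """Moving-average low-pass filter standing in for the hardware filter."""
    kernel = np.ones(k) / k
    return np.convolve(signal, kernel, mode="same")

def analyze(signal: np.ndarray, fs: int = FS):
    """Return (RMS loudness, dominant frequency in Hz) of the cleaned signal."""
    rms = np.sqrt(np.mean(signal ** 2))
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return rms, freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
```

Feeding a noisy tone through `lowpass` and then `analyze` recovers the tone's frequency, which is the essence of the "clean electrical signal" the module produces.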
The central processing module contains a storage unit, a training unit and a comparison unit. The storage unit stores the image signals and the sound electrical signals; the training unit uses a large number of standard action image signals and corresponding standard sound electrical signals for different lecture articles; the comparison unit compares the collected image and sound signals with the standard action image signals and the corresponding standard sound electrical signals.
The training unit may adopt a convolutional neural network model. It learns from the actions and sounds corresponding to lecture materials prepared in advance; different types of lecture material correspond to different gesture actions and speech-sound emotions, and after training the unit can output standard guidance actions and sounds for different lecture materials. The training data come from the body actions, facial expressions, speech sound changes and emotion changes of speech experts in the industry for different lecture materials. After extensive training, the unit can output the standard gesture actions and speech sounds corresponding to a given speech script.
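The patent names a convolutional neural network for the training unit; as a much simpler stand-in, the sketch below reduces each lecture-material category to the mean of its expert feature vectors and returns that mean as the "standard guidance" template. The class name, category labels and feature layout are all hypothetical.

```python
import numpy as np

class GuidanceModel:
    """Toy stand-in for the training unit: each lecture-material category is
    summarized by the mean of its expert feature vectors, which then serves
    as the standard guidance template for that category."""

    def __init__(self):
        self.templates = {}

    def fit(self, examples):
        """examples: dict mapping category name -> list of feature vectors."""
        for category, vectors in examples.items():
            self.templates[category] = np.mean(vectors, axis=0)

    def standard_for(self, category):
        """Return the standard template for the given material category."""
        return self.templates[category]
```

A real system would learn far richer mappings from script to action and sound, but the lookup structure — material type in, standard template out — is the same.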
The comparison unit compares the collected actual image signals and sound electrical signals with the standard ones. For images, the comparison covers action amplitude, gesture, the timing, period and frequency of actions, and changes of facial expression, down to the micro-movements of the facial features; for sound, the comparison covers tone, frequency, amplitude, loudness, emotional intensity and the like. Places that differ are labeled, the differences are pointed out, corresponding improvement suggestions are given, and the suggestions are transmitted to the output module for output.
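The comparison logic — measure each feature of the actual performance against the standard, flag the places that differ, and phrase a suggestion — can be sketched as below. The feature names, the relative-deviation measure and the 15% tolerance are illustrative assumptions (the sketch also assumes nonzero standard values).

```python
# Hypothetical feature order shared by the actual and standard vectors.
FEATURES = ["amplitude", "tempo", "loudness", "pitch"]

def compare(actual, standard, tol=0.15):
    """Flag features deviating from the standard by more than tol (relative)
    and return a human-readable improvement suggestion for each one."""
    suggestions = []
    for name, a, s in zip(FEATURES, actual, standard):
        dev = (a - s) / abs(s)  # signed relative deviation from the standard
        if abs(dev) > tol:
            direction = "reduce" if dev > 0 else "increase"
            suggestions.append(f"{direction} {name} (off by {dev:+.0%})")
    return suggestions
```

The returned strings correspond to the "labels the places that differ and gives improvement suggestions" step; the output module would then route them to the display and headset.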
The output module sends the specific improvement suggestions to a display screen in front of the speaker and, through a Bluetooth module, sends the related voice prompts to a Bluetooth headset worn by the speaker. The display screen prompts the speaker with text about the places needing attention and improvement, while the Bluetooth headset conveys the related suggestions and instructions by voice. The speaker can thus adjust the speech state in time and train continuously, raising the speech level.
Example 2
The embodiment discloses an intelligent speech training method, which comprises the following steps:
collecting the facial expressions and limb actions of the speaker with a posture detector, and collecting the speaker's sound signal with a sound collection device;
performing, through an image processing module connected to the posture detector, image signal extraction, preprocessing and feature extraction on the collected facial expression and limb action images, together with direction analysis, intelligent tracking and encoded storage of the image information;
converting the sound signal into an electrical signal through the sound processing module, filtering and denoising it with a filter, and amplifying it with an amplifier to obtain a clean electrical signal free of clutter interference;
training a training unit in the central processing module in advance with a large amount of data, so that after training it can output standard guidance actions and sounds for different lecture materials;
comparing the collected actual image signals and sound electrical signals with the standard ones, and giving improvement suggestions for the places that differ;
delivering the improvement suggestions through the display screen in front of the speaker and the Bluetooth headset, so that the speaker receives the relevant suggestions and adjusts his or her speech state in training, making the speech more standard and raising the speech level.
Although the invention has been described in detail above with reference to general descriptions and specific embodiments, it will be apparent to those skilled in the art that modifications or improvements can be made on the basis of the invention. Accordingly, such modifications and improvements are intended to fall within the scope of the claimed invention.
Claims (10)
1. An intelligent speech training system, characterized in that the system uses a human body biological state sensor to collect the gesture actions and sound information of a speaker; the collected gesture actions are processed by an image processing module to obtain clean image feature signals; a sound processing module converts the speaker's sound signal into an electrical signal and performs noise reduction and filtering; a central processing module compares the image feature signals and the sound electrical signal with pre-stored standard gesture actions and speech sounds and gives improvement suggestions, which are output through an output module and delivered to the speaker for timely improvement.
2. The intelligent speech training system of claim 1, wherein the human body biological state sensor comprises a posture detector and a sound collection device; the posture detector is mounted horizontally on a stand 2-3 m in front of the speaker, the stand being 1.6 m high; the sound collection device is attached to the speaker's clothes or to the podium; the posture detector captures the speaker's facial expressions and limb actions, and the sound collection device captures the speaker's sound signal.
3. The intelligent speech training system of claim 1, wherein the image processing module is connected to the posture detector and performs image signal extraction, image signal preprocessing, image signal feature extraction, direction analysis and intelligent tracking, and encoded storage of image information on the collected facial expression and limb action images.
4. The intelligent speech training system of claim 3, wherein the image processing module performs noise reduction and feature enhancement on the recorded image signals to obtain relatively clean image feature vectors that satisfy the conditions for subsequent feature extraction; image signal extraction and preprocessing are mainly performed by the front-end camera equipment; feature extraction pulls from the image sequence the feature information usable for target tracking; direction analysis and intelligent tracking recognize the direction of a given feature to judge its range and frequency of activity, while intelligent tracking also records the motion trajectory of the feature target; and encoded storage compresses the analyzed and tracked image information by encoding it, reducing the stored information volume and making later retrieval by subsequent systems convenient.
5. The intelligent speech training system according to claim 1, wherein the sound collection device collects the speaker's voice and sends it to the sound processing module; the sound processing module converts the sound signal into an electrical signal and extracts the loudness, frequency, content and duration of the speech and the time interval between successive syllables; the electrical signal is filtered by a filter to reduce noise and amplified by an amplifier, yielding a clean electrical signal with spurious interference removed.
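Claim 5's analysis and cleanup steps can be sketched with numpy: RMS as a stand-in for loudness, the FFT peak for the dominant frequency, and a moving-average low-pass filter plus gain standing in for the claimed filter and amplifier (the patent does not specify the filter type; these choices are illustrative):

```python
import numpy as np

def analyze(signal, rate):
    """Extract loudness (RMS) and dominant frequency (FFT peak bin) from a mono signal."""
    loudness = float(np.sqrt(np.mean(signal ** 2)))
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    return loudness, float(freqs[np.argmax(spectrum)])

def clean(signal, gain=2.0, k=5):
    """Moving-average low-pass filter followed by amplification."""
    kernel = np.ones(k) / k
    return gain * np.convolve(signal, kernel, mode="same")
```

For a 50 Hz sine sampled at 1 kHz, `analyze` reports a dominant frequency of 50 Hz and an RMS near 1/sqrt(2).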
6. The intelligent speech training system according to claim 1, wherein the central processing module comprises a memory unit, a training unit and a comparison unit; the memory unit stores the image signals and sound electrical signals; the training unit is trained on a plurality of standard action image signals and the corresponding standard sound electrical signals for given speech articles; and the comparison unit compares the collected image signals and sound electrical signals against the standard action image signals and corresponding standard sound electrical signals.
7. The intelligent speech training system according to claim 6, wherein the training unit learns from actions and sounds prepared in advance for corresponding speech materials, different types of speech material corresponding to different gestures and vocal emotions; after training is complete, the unit can output standard guidance actions and sounds for different speech materials.
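After training, claim 7's unit behaves like a lookup from material type to standard guidance. A toy dictionary standing in for the trained model; the categories and guidance strings are invented for illustration and do not appear in the patent:

```python
# Hypothetical material-type -> guidance table standing in for the trained unit.
STANDARD_GUIDANCE = {
    "motivational": {"gesture": "open arms, raised chin", "tone": "energetic, rising pitch"},
    "technical":    {"gesture": "measured hand counts",   "tone": "even pace, neutral pitch"},
    "eulogy":       {"gesture": "stillness, lowered gaze", "tone": "soft, slow"},
}

def guidance_for(material_type):
    """Return the standard action/sound guidance for a speech material type,
    falling back to a neutral style for unknown types."""
    return STANDARD_GUIDANCE.get(material_type, STANDARD_GUIDANCE["technical"])
```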
8. The intelligent speech training system of claim 6, wherein the comparison unit compares the collected actual image and sound electrical signals with the standard image and sound electrical signals, marks the points of difference, generates a corresponding improvement suggestion for each, and transmits the improvement suggestions to the output module for output.
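The comparison step of claim 8 amounts to diffing actual against standard feature values and turning each out-of-tolerance dimension into a suggestion. A sketch with an invented tolerance and message format (the patent specifies neither):

```python
def compare(actual, standard, names, tol=0.1):
    """Flag feature dimensions where |actual - standard| exceeds tol and
    return a per-feature improvement suggestion (wording is illustrative)."""
    suggestions = []
    for a, s, name in zip(actual, standard, names):
        if abs(a - s) > tol:
            direction = "increase" if a < s else "reduce"
            suggestions.append(f"{direction} {name} (yours {a:.2f}, standard {s:.2f})")
    return suggestions
```

Features within tolerance produce no output, so only genuine points of difference reach the output module.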
9. The intelligent speech training system of claim 8, wherein the output module sends specific improvement suggestions to a display screen in front of the speaker and, via a Bluetooth module, sends related voice prompts to a Bluetooth headset worn by the speaker.
10. An intelligent speech training method, characterized by comprising the following steps:
collecting the speaker's facial expressions and body movements with a posture detector, and collecting the speaker's sound signal with a sound collection device;
connecting the image processing module to the posture detector to perform image signal extraction, image signal preprocessing, image signal feature extraction, direction analysis and intelligent tracking, and image information encoding and storage on the collected facial expression and body movement images;
converting, in the sound processing module, the sound signal into an electrical signal, filtering it with a filter to reduce noise, and amplifying it with an amplifier to obtain a clean electrical signal with spurious interference removed;
training a training unit in the central processing module in advance on a large amount of data, so that after training it can output standard guidance actions and sounds for different speech materials;
comparing the collected actual image and sound electrical signals with the standard image and sound electrical signals, and giving improvement suggestions for the points of difference;
delivering the improvement suggestions through the display screen in front of the speaker and the Bluetooth headset, so that the speaker receives the relevant prompts, adjusts his or her speaking state accordingly, and the speech becomes more natural and the level of delivery improves.
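The method steps of claim 10 can be strung together as one loop iteration. Everything below is a deliberately simplified placeholder (frame mean for expression, mean absolute amplitude for loudness, invented standard values and threshold), showing only the claimed sequence: process image and sound, compare against the trained standard, emit suggestions for the output devices:

```python
def run_session(frames, audio):
    """One training-loop iteration following the claimed method steps.
    frames: list of flattened pixel lists; audio: list of samples."""
    # Stand-in for image feature extraction: mean brightness per frame.
    features = [sum(f) / len(f) for f in frames]
    expr = sum(features) / len(features)
    # Stand-in for sound processing: mean absolute amplitude as loudness.
    loudness = sum(abs(x) for x in audio) / len(audio)
    # Stand-in for the pre-trained unit's standard guidance values.
    standard = {"expression": 0.5, "loudness": 0.6}
    suggestions = []
    if abs(expr - standard["expression"]) > 0.1:
        suggestions.append("adjust facial expression/body action")
    if abs(loudness - standard["loudness"]) > 0.1:
        suggestions.append("adjust speaking volume")
    return suggestions  # would be routed to the display screen and Bluetooth headset
```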
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010961200.9A CN112232127A (en) | 2020-09-14 | 2020-09-14 | Intelligent speech training system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112232127A true CN112232127A (en) | 2021-01-15 |
Family
ID=74116213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010961200.9A Pending CN112232127A (en) | 2020-09-14 | 2020-09-14 | Intelligent speech training system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112232127A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714248A (en) * | 2013-12-23 | 2014-04-09 | 青岛优维奥信息技术有限公司 | Training system for competitive speech |
CN204889399U (en) * | 2015-08-18 | 2015-12-23 | 蒋彬 | Intelligence body -building mirror |
CN106847263A (en) * | 2017-01-13 | 2017-06-13 | 科大讯飞股份有限公司 | Speech level evaluation method and apparatus and system |
CN106997243A (en) * | 2017-03-28 | 2017-08-01 | 北京光年无限科技有限公司 | Speech scene monitoring method and device based on intelligent robot |
CN206619289U (en) * | 2017-02-24 | 2017-11-07 | 绥化学院 | A kind of broadcaster's speech training device |
CN206991571U (en) * | 2017-05-17 | 2018-02-09 | 咸阳师范学院 | A kind of sound comparator |
CN108322865A (en) * | 2017-12-28 | 2018-07-24 | 广州华夏职业学院 | A kind of teaching private classroom speaker unit and application method |
CN108921284A (en) * | 2018-06-15 | 2018-11-30 | 山东大学 | Interpersonal interactive body language automatic generation method and system based on deep learning |
CN209962447U (en) * | 2019-03-26 | 2020-01-17 | 共赢时代有限公司 | Speech training device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113257246A (en) * | 2021-04-19 | 2021-08-13 | 歌尔股份有限公司 | Prompting method, device, equipment, system and storage medium |
CN113257246B (en) * | 2021-04-19 | 2023-03-14 | 歌尔股份有限公司 | Prompting method, device, equipment, system and storage medium |
CN113411252A (en) * | 2021-06-22 | 2021-09-17 | 邓润阳 | Speech platform and speech method |
CN115629894A (en) * | 2022-12-21 | 2023-01-20 | 深圳市人马互动科技有限公司 | Speech prompting method and related device |
CN117787921A (en) * | 2024-02-27 | 2024-03-29 | 北京烽火万家科技有限公司 | Intelligent education training management method and identity anti-counterfeiting method for intelligent education training |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112232127A (en) | Intelligent speech training system and method | |
Ramakrishnan et al. | Speech emotion recognition approaches in human computer interaction | |
CN103996155A (en) | Intelligent interaction and psychological comfort robot service system | |
US10878818B2 (en) | Methods and apparatus for silent speech interface | |
Rosen et al. | Automatic speech recognition and a review of its functioning with dysarthric speech | |
Hennecke et al. | Visionary speech: Looking ahead to practical speechreading systems | |
CN108805087A (en) | Semantic temporal fusion association based on multi-modal Emotion identification system judges subsystem | |
Kandali et al. | Emotion recognition from Assamese speeches using MFCC features and GMM classifier | |
CN110931111A (en) | Autism auxiliary intervention system and method based on virtual reality and multi-mode information | |
CN103413113A (en) | Intelligent emotional interaction method for service robot | |
JPH08339446A (en) | Interactive system | |
CN114121006A (en) | Image output method, device, equipment and storage medium of virtual character | |
Freitas et al. | An introduction to silent speech interfaces | |
CN110444189A (en) | One kind is kept silent communication means, system and storage medium | |
Siriwardena et al. | The secret source: Incorporating source features to improve acoustic-to-articulatory speech inversion | |
Meltzner et al. | Speech recognition for vocalized and subvocal modes of production using surface EMG signals from the neck and face. | |
US20240220811A1 (en) | System and method for using gestures and expressions for controlling speech applications | |
Kim et al. | Preliminary test of a wireless magnetic tongue tracking system for silent speech interface | |
Ye et al. | Attention bidirectional LSTM networks based mime speech recognition using sEMG data | |
Kim et al. | Multiview Representation Learning via Deep CCA for Silent Speech Recognition. | |
Li et al. | Interpreting sign components from accelerometer and sEMG data for automatic sign language recognition | |
CN111627444A (en) | Chat system based on artificial intelligence | |
Freitas et al. | Multimodal silent speech interface based on video, depth, surface electromyography and ultrasonic doppler: Data collection and first recognition results | |
Zhao et al. | Realizing speech to gesture conversion by keyword spotting | |
CN109822587B (en) | Control method for head and neck device of voice diagnosis guide robot for factory and mine hospitals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||