CN115910036A - AI (artificial intelligence) interaction-based 3D virtual image auditory speech training method - Google Patents

AI (artificial intelligence) interaction-based 3D virtual image auditory speech training method

Info

Publication number
CN115910036A
CN115910036A (application CN202211106912.8A)
Authority
CN
China
Prior art keywords
user
sentence
module
frequency
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211106912.8A
Other languages
Chinese (zh)
Inventor
蔡希睿
克里斯多夫.丁.肖
安德鲁-彼得·莱恩
刘焱
陈浩强
张�成
林夕园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Beifei Technology Co ltd
Original Assignee
Yunnan Beifei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Beifei Technology Co ltd filed Critical Yunnan Beifei Technology Co ltd
Priority to CN202211106912.8A priority Critical patent/CN115910036A/en
Publication of CN115910036A publication Critical patent/CN115910036A/en
Pending legal-status Critical Current


Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a 3D virtual image auditory speech training method based on AI interaction, a new English pronunciation learning method that combines auditory training with body-movement stimulation and oral (speaking) practice. By combining the intonation auditory method with a 3D virtual image and AI speech recognition and evaluation, the method can effectively improve the pronunciation level of English learners, compensates for insufficient pronunciation teaching in the classroom and for teachers with poor pronunciation, and solves the timeliness problem of classroom teaching; more importantly, it is a positive innovation for promoting educational fairness and the common, balanced development of education.

Description

AI (artificial intelligence) interaction-based 3D virtual image auditory speech training method
Technical Field
The invention relates to the technical field of modern education, in particular to a 3D virtual image auditory speech training method based on AI interaction.
Background
The first step in learning English is to learn correct pronunciation, because speech is how a speaker expresses ideas, and correct pronunciation makes the speaker easy to understand. A large body of research has shown that, at the initial stage of English learning, the outcome of pronunciation learning directly affects all subsequent English learning. In practical English teaching, however, teachers often give priority to vocabulary and grammar and neglect pronunciation teaching and practice; especially when English teachers themselves lack good pronunciation and pronunciation-teaching experience, learners develop faulty pronunciation, which affects their later learning.
Disclosure of Invention
The invention aims to solve at least the above problems in existing education and provides a 3D virtual image auditory speech training method based on AI interaction. The method combines auditory training with body-movement stimulation and oral (speaking) practice to form a new English pronunciation learning method, and combines the intonation auditory method with a 3D virtual image and AI speech recognition and evaluation. It can effectively improve the pronunciation level of English learners, compensates for insufficient pronunciation teaching in the classroom and for teachers with poor pronunciation, and solves the timeliness problem of classroom teaching. More importantly, because the method can run on a mobile intelligent terminal, it is a positive innovation for promoting educational fairness and the common, balanced development of education.
In order to achieve the purpose, the invention provides the following technical scheme:
A 3D virtual image auditory speech training method based on AI interaction, characterized in that the method comprises the following steps:
S1, inputting a speech signal and applying low-pass filtering to the original speech to obtain low-frequency speech, i.e., retaining only the audio frequencies below 300 Hz. This low-frequency band preserves the prosodic features of the utterance, including stress, rhythm, loudness and intonation. The higher frequencies needed to identify individual words are thus removed, while the low-frequency speech signal that preserves prosody effectively reduces the learner's semantic and syntactic processing load and frees more attention resources for other cognitive processes;
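The filtering in S1 can be realized with a standard digital low-pass filter. The following is a minimal sketch, assuming WAV input files and using SciPy's Butterworth filter; the file names, filter order and helper name are illustrative and are not taken from the patent.

```python
# Minimal sketch of the S1 step: keep only the prosody-carrying band below
# 300 Hz. File names, filter order and function name are illustrative.
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def lowpass_300hz(in_path: str, out_path: str, cutoff_hz: float = 300.0) -> None:
    audio, sr = sf.read(in_path)                  # samples and sample rate
    # 8th-order Butterworth low-pass, applied forward and backward
    # (zero-phase) so the prosodic timing of the sentence is not shifted.
    sos = butter(8, cutoff_hz, btype="low", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, audio, axis=0)    # axis=0 also handles stereo
    sf.write(out_path, filtered, sr)

# Example (hypothetical file names):
# lowpass_300hz("can_i_have_soup.wav", "can_i_have_soup_lp300.wav")
```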
S2, dividing the processed recorded speech signal into eight sentence types, namely affirmative declarative, negative declarative, general question, special question, alternative question, tag question, imperative and exclamatory sentences, to obtain a graded sentence library, and training the user continuously for no less than 30 s. Because the training sentence audio has been low-pass filtered, speech parameters such as rhythm, intonation, pitch, tension, pause, duration and loudness are highlighted, which strengthens the user's bodily perception of the language signal. During training, the integrated training of proprioception, hearing and vision develops the user's brain neural pathways to the greatest extent and expands the learner's potential; in particular, coordinating body movement with the rhythm of the speech signal improves the learner's motor ability, spatial orientation and memory span, achieving the coordinated development of proprioception, hearing and speaking;
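As an illustration only, the graded sentence library of S2 could be organized as a simple mapping from sentence type to graded entries; the class and field names below are assumptions for the sketch, not structures named in the patent.

```python
# Illustrative organization of the S2 graded sentence library; the eight type
# labels follow the text above, the fields are assumed for the sketch.
from dataclasses import dataclass

SENTENCE_TYPES = [
    "affirmative_declarative", "negative_declarative",
    "general_question", "special_question",
    "alternative_question", "tag_question",
    "imperative", "exclamatory",
]

@dataclass
class TrainingSentence:
    text: str             # e.g. "Can I have soup?"
    sentence_type: str    # one of SENTENCE_TYPES
    level: int            # difficulty grade within the library
    audio_lowpass: str    # path to the 300 Hz low-pass filtered recording
    audio_original: str   # path to the unfiltered recording

# The graded library is then a mapping from sentence type to its sentences.
library: dict[str, list[TrainingSentence]] = {t: [] for t in SENTENCE_TYPES}
```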
S3, creating the virtual image: a virtual character is created in Unity 3D, the character's skeleton is matched to the skeleton structure predefined by Mecanim, an Animator component is created for each action, arm-waving, rotating and beating animation effects are built, each action is triggered by a controller instruction, and corresponding parameters are set for each action effect to control its motion amplitude;
S4, integrating the graded sentence library with the corresponding 3D virtual image. The 3D virtual image has two motion forms: in one the arm floats up and down, and in the other the two hands open and close. The highest and lowest points of the fundamental-frequency values in the audio are obtained; the highest point of the arm wave, or the widest opening of the two hands, is set to correspond to the highest fundamental-frequency value in the audio, and the lowest point of the arm wave, or the narrowest closing of the two hands, is set to correspond to the lowest fundamental-frequency value. Between these extremes the motion amplitude of the 3D virtual character is set as amplitude% = 100% x (current frequency - minimum frequency) / (maximum frequency - minimum frequency), the amplitude lying between 0 and 100;
When the formal animation is played, the user can choose either of the two motion forms, or the system randomly displays one of them; the current time is acquired in real time, and the 3D virtual image animation is displayed according to the time and the calculated motion amplitude; when an undefined value is encountered, the animation does not change until a valid frequency parameter is obtained;
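The amplitude mapping of S4, including the rule for undefined values, follows directly from the formula above. The sketch below assumes the fundamental-frequency track is a list of values in Hz with None or NaN marking undefined frames; the function name is illustrative.

```python
# Sketch of the S4 mapping: the F0 range of a sentence is mapped linearly to
# 0-100 % of the virtual character's motion range; undefined frames hold the
# last valid amplitude until a valid frequency parameter arrives.
import math

def f0_to_amplitudes(f0_track, f_min=None, f_max=None):
    voiced = [f for f in f0_track if f is not None and not math.isnan(f)]
    if not voiced:
        return [0.0] * len(f0_track)
    f_min = min(voiced) if f_min is None else f_min
    f_max = max(voiced) if f_max is None else f_max
    span = max(f_max - f_min, 1e-6)              # guard against a flat contour

    amplitudes, last = [], 0.0
    for f in f0_track:
        if f is None or math.isnan(f):           # undefined value: hold the pose
            amplitudes.append(last)
            continue
        amp = 100.0 * (f - f_min) / span         # amplitude% per the formula above
        last = min(max(amp, 0.0), 100.0)         # clamp to the 0-100 range
        amplitudes.append(last)
    return amplitudes
```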
S5, playing the speech sentence by sentence. A band separator isolates the two-channel low-frequency band from 0 Hz to 300 Hz, which contains the fundamental frequency (F0) that determines pitch; it is the fundamental frequency that determines the pitch changes of intonation rise and fall. Based on the fundamental-frequency curve of each sentence, the method sets the animation limits and motion trajectory of the 3D image so as to guide the user to perform appropriate body rhythm following the intonation rise and fall of the sentence;
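For S5, the fundamental-frequency contour that drives the animation limits can be estimated as sketched below. librosa's pYIN estimator is used purely for illustration, since the patent does not name a specific F0 algorithm, and the 65-300 Hz search range is an assumption.

```python
# Sketch of the S5 analysis: estimate the F0 contour of a (low-pass filtered)
# sentence recording; frames with no detectable pitch come back as NaN.
import librosa

def f0_contour(path: str, fmin: float = 65.0, fmax: float = 300.0):
    y, sr = librosa.load(path, sr=None, mono=True)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    times = librosa.times_like(f0, sr=sr)        # frame times in seconds
    return times, f0                             # f0 is NaN where undefined
```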
S6, dividing the words in the graded sentence library into high-pitch and low-pitch words according to their pitch level, so as to display word levels. The user listens to the low-pass-filtered audio material while watching the 3D virtual image perform body movements; in this step the user imitates the low-pass-filtered speech and performs the body movements shown by the 3D virtual image. The 3D virtual image takes the rhythm of the sentence as its beat, and the animated character matches the stress, rhythm and intonation of the sentence with appropriate body rhythm, for example the arm floating up and down or the two hands opening and closing as the pitch of the sentence rises and falls, so the user imitates the low-pass-filtered speech while performing the body rhythm;
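One simple way to realize the high/low word grading of S6 is to compare each word's mean F0 with the sentence-level median. The sketch below assumes word time boundaries from a forced aligner; this particular heuristic is an assumption rather than a rule stated in the patent.

```python
# Sketch of the S6 word grading: label each word "high" or "low" by comparing
# its mean F0 with the sentence median. Word boundaries are assumed given.
import numpy as np

def label_words(times, f0, word_spans):
    """word_spans: list of (word, start_s, end_s); times/f0 from the F0 step."""
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    sentence_median = np.nanmedian(f0)
    labels = []
    for word, start, end in word_spans:
        mask = (times >= start) & (times <= end)
        word_f0 = np.nanmean(f0[mask]) if mask.any() else np.nan
        labels.append((word, "high" if word_f0 >= sentence_median else "low"))
    return labels
```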
S7, AI evaluation: the user's pronunciation is evaluated by an AI evaluation module, which comprises a display module for intonation, action and their combination; a scoring module that scores the intonation, the action and their combined content; a marking module that marks the scoring result; a result output module; and an evaluation result and suggestion module.
Preferably, the system comprises a data acquisition module for acquiring speech data and motion data of a plurality of users; the speech data are collected by a speech acquisition device and include the 3D virtual image viewed from the user's perspective; the motion data are collected by a motion acquisition device and include vector data generated by the user's movement;
a scene generation module for selecting and generating a training scene from a plurality of preset training scenes in response to the user's action form;
a role matching module for determining, from the user's speech data, the content the user is viewing within the training scene, and matching the corresponding hand action for the user from the training scene according to the viewed content;
an action generation module for generating the actions corresponding to the users' roles in the training scene according to the users' motion data;
a panoramic integration module for integrating the users' roles and actions and outputting them in a panoramic mode;
wherein outputting in the panoramic mode comprises: determining the aspect ratio of the panoramic mode according to the users' degree of dispersion during panoramic integration; or, during panoramic integration, extracting videos containing the users from the scene and splicing those videos;
and a feedback module for judging whether the difference between an action and the reference action is within a preset range and generating feedback information when it is not; the feedback information includes reminding the user of the improper current movement by vibration, specifically by activating the vibration patch at the body part whose action falls outside the preset range.
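A minimal sketch of the feedback module's threshold check is given below. The vector layout (a flat array of joint values), the tolerance value and the returned fields are assumptions used only to illustrate the comparison against a preset range.

```python
# Sketch of the feedback check: compare the user's motion vector with the
# reference action and, when the deviation leaves the preset range, report
# which component drifted most so the matching vibration patch can be fired.
import numpy as np

def check_action(user_vec, reference_vec, tolerance=0.15):
    user_vec = np.asarray(user_vec, dtype=float)
    reference_vec = np.asarray(reference_vec, dtype=float)
    diff = user_vec - reference_vec
    deviation = float(np.linalg.norm(diff))
    if deviation <= tolerance:                   # within the preset range
        return True, None
    worst = int(np.argmax(np.abs(diff)))         # most deviating joint index
    return False, {"vibrate": True, "joint_index": worst, "deviation": deviation}
```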
Preferably, the data acquisition module includes:
the identification unit is used for generating a corresponding user identification number for each user;
a matching unit for matching the user identification number to a corresponding eye movement acquiring device and motion acquiring device;
an acquisition unit configured to generate eye movement data and motion data of each user by the eye movement acquisition device and the motion acquisition device.
Preferably, the data obtaining module is further configured to obtain audio data of a plurality of users; the panoramic integration module is further used for integrating the audio data with the role corresponding to the user; the interactive training system further comprises:
and the audio playing module is used for receiving the 3D virtual image or the hand action of the user and playing corresponding audio data.
Preferably, the data obtaining module is further configured to obtain audio data of a plurality of users;
and an audio adjusting module for integrating the users' roles, actions and audio data, adjusting the playback volume according to the distance of each role in the panoramic mode, and storing the data.
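The distance-based volume adjustment could follow a simple attenuation law such as the inverse-distance rule sketched below; the reference distance, minimum gain and 1/d law are assumptions for illustration, not values given in the patent.

```python
# Sketch of the audio adjusting rule: characters farther away in the panoramic
# scene play back more quietly. The 1/d attenuation and constants are assumed.
def volume_for_distance(distance_m: float, ref_distance_m: float = 1.0,
                        min_gain: float = 0.05) -> float:
    """Return a playback gain between min_gain and 1.0."""
    if distance_m <= ref_distance_m:
        return 1.0
    gain = ref_distance_m / distance_m      # simple inverse-distance falloff
    return max(gain, min_gain)
```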
Preferably, the stored data is subjected to detection scoring through AI evaluation.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention discloses a new English pronunciation learning method that combines the intonation auditory method, a 3D virtual image and AI speech recognition and evaluation. It is convenient to operate, can effectively improve the pronunciation level of English learners, compensates for insufficient pronunciation teaching in the classroom and for teachers with poor pronunciation, and solves the timeliness problem of classroom teaching.
Drawings
The invention is further illustrated with reference to the following figures and examples:
FIG. 1 is a low pass filtered audio spectrogram;
FIG. 2 is a graph of an unfiltered audio spectrogram;
FIG. 3 is a schematic diagram of the declarative sentence pattern map;
FIG. 4 is a schematic diagram of the general question pattern map;
FIG. 5 is a schematic diagram of the special question pattern map;
FIG. 6 is a schematic diagram of the imperative sentence pattern map;
FIG. 7 is a schematic diagram of the negative declarative sentence pattern map;
FIG. 8 is a schematic diagram of the tag question pattern map;
FIG. 9 is a schematic diagram of the alternative question pattern map;
FIG. 10 is a schematic diagram of the exclamatory sentence pattern map.
Detailed Description
The invention provides a technical scheme: an AI interaction-based 3D virtual image auditory speech training method, characterized in that the method comprises the following steps:
S1, inputting a speech signal and applying low-pass filtering to the original speech to obtain low-frequency speech, i.e., retaining only the audio frequencies below 300 Hz. This low-frequency band preserves the prosodic features of the utterance, including stress, rhythm, loudness and intonation. The higher frequencies needed to identify individual words are thus removed, while the low-frequency speech signal that preserves prosody effectively reduces the learner's semantic and syntactic processing load and frees more attention resources for other cognitive processes.
S2, dividing the processed recorded speech signal into eight sentence types, namely affirmative declarative, negative declarative, general question, special question, alternative question, tag question, imperative and exclamatory sentences, to obtain a graded sentence library, and training the user continuously for no less than 30 s. Because the training sentence audio has been low-pass filtered, speech parameters such as rhythm, intonation, pitch, tension, pause, duration and loudness are highlighted, which strengthens the user's bodily perception of the language signal. During training, the integrated training of proprioception, hearing and vision develops the user's brain neural pathways to the greatest extent and expands the learner's potential; in particular, coordinating body movement with the rhythm of the speech signal improves the learner's motor ability, spatial orientation and memory span, achieving the coordinated development of proprioception, hearing and speaking;
S3, creating the virtual image: a virtual character is created in Unity 3D, the character's skeleton is matched to the skeleton structure predefined by Mecanim, an Animator component is created for each action, arm-waving, rotating and beating animation effects are built, each action is triggered by a controller instruction, and corresponding parameters are set for each action effect to control its motion amplitude;
S4, integrating the graded sentence library with the corresponding 3D virtual image. The 3D virtual image has two motion forms: in one the arm floats up and down, and in the other the two hands open and close. The highest and lowest points of the fundamental-frequency values in the audio are obtained; the highest point of the arm wave, or the widest opening of the two hands, is set to correspond to the highest fundamental-frequency value in the audio, and the lowest point of the arm wave, or the narrowest closing of the two hands, is set to correspond to the lowest fundamental-frequency value. Between these extremes the motion amplitude of the 3D virtual character is set as amplitude% = 100% x (current frequency - minimum frequency) / (maximum frequency - minimum frequency), the amplitude lying between 0 and 100;
and when the audio is played, the current time is acquired in real time, and the 3D animation is displayed according to the time and the calculated action amplitude.
When an undefined value is encountered, the animation does not change until a valid frequency parameter is obtained.
The correspondence between playback time, fundamental frequency and motion amplitude is given in a table reproduced as image BDA0003841449930000061 in the original filing.
When the formal animation is played, the user can choose either of the two motion forms, or the system randomly displays one of them; the current time is acquired in real time, and the 3D virtual image animation is displayed according to the time and the calculated motion amplitude; when an undefined value is encountered, the animation does not change until a valid frequency parameter is obtained;
S5, playing the speech sentence by sentence. A band separator isolates the two-channel low-frequency band from 0 Hz to 300 Hz, which contains the fundamental frequency (F0) that determines pitch; it is the fundamental frequency that determines the pitch changes of intonation rise and fall. Based on the fundamental-frequency curve of each sentence, the method sets the animation limits and motion trajectory of the 3D image so as to guide the user to perform appropriate body rhythm following the intonation rise and fall of the sentence;
S6, dividing the words in the graded sentence library into high-pitch and low-pitch words according to their pitch level, so as to display word levels. The user listens to the low-pass-filtered audio material while watching the 3D virtual image perform body movements; in this step the user imitates the low-pass-filtered speech and performs the body movements shown by the 3D virtual image. The 3D virtual image takes the rhythm of the sentence as its beat, and the animated character matches the stress, rhythm and intonation of the sentence with appropriate body rhythm, for example the arm floating up and down or the two hands opening and closing as the pitch of the sentence rises and falls, so the user imitates the low-pass-filtered speech while performing the body rhythm;
S7, AI evaluation: the user's pronunciation is evaluated by an AI evaluation module, which comprises a display module for intonation, action and their combination; a scoring module that scores the intonation, the action and their combined content; a marking module that marks the scoring result; a result output module; and an evaluation result and suggestion module.
Preferably, the system comprises a data acquisition module for acquiring speech data and motion data of a plurality of users; the speech data are collected by a speech acquisition device and include the 3D virtual image viewed from the user's perspective; the motion data are collected by a motion acquisition device and include vector data generated by the user's movement;
a scene generation module for selecting and generating a training scene from a plurality of preset training scenes in response to the user's action form;
a role matching module for determining, from the user's speech data, the content the user is viewing within the training scene, and matching the corresponding hand action for the user from the training scene according to the viewed content;
an action generation module for generating the actions corresponding to the users' roles in the training scene according to the users' motion data;
a panoramic integration module for integrating the users' roles and actions and outputting them in a panoramic mode;
wherein outputting in the panoramic mode comprises: determining the aspect ratio of the panoramic mode according to the users' degree of dispersion during panoramic integration; or, during panoramic integration, extracting videos containing the users from the scene and splicing those videos;
and a feedback module for judging whether the difference between an action and the reference action is within a preset range and generating feedback information when it is not; the feedback information includes reminding the user of the improper current movement by vibration, specifically by activating the vibration patch at the body part whose action falls outside the preset range.
Preferably, the data acquisition module includes:
the identification unit is used for generating a corresponding user identification number for each user;
a matching unit for matching the user identification number to a corresponding eye movement acquiring device and motion acquiring device;
an acquisition unit configured to generate eye movement data and motion data of each user by the eye movement acquisition device and the motion acquisition device.
Preferably, the data obtaining module is further configured to obtain audio data of a plurality of users; the panoramic integration module is further used for integrating the audio data with the role corresponding to the user; the interactive training system further comprises:
and the audio playing module is used for receiving the 3D virtual image or the hand action of the user and playing corresponding audio data.
Preferably, the data obtaining module is further configured to obtain audio data of a plurality of users;
and an audio adjusting module for integrating the users' roles, actions and audio data, adjusting the playback volume according to the distance of each role in the panoramic mode, and storing the data.
Preferably, the stored data is subjected to detection scoring by AI evaluation.
The training method trains the user on the basis of the intonation auditory method, a method that takes hearing as its foundation and develops spoken language through binaural listening. The intonation auditory method is rooted in the neural plasticity of the brain: through the integrated training of proprioception, hearing and vision, it lets learners develop brain neural pathways to the greatest extent and thereby expands their learning potential. The combination of multiple senses connects the learner's auditory cortex with the related vestibular perception system and motor regions (especially those for speaking), that is, it connects the brain, the body and the vocal organs; combining auditory training with body-movement (vestibular) stimulation and oral (speaking) practice recombines and strengthens the brain's neural pathway connections and improves the learning effect, as shown in FIG. 1. The method regards speech perception and spoken-language production as a multi-sensory, whole-body experience and develops the vestibular system, body movement and vocalization synchronously, so as to make maximum use of neural plasticity to reorganize the brain's neural pathways and improve listening and speaking. The intonation auditory method first achieved good results in clinical practice, improving the listening and speaking skills of children and adults with hearing loss; it can also markedly improve the level of foreign-language learners. Studies of learners of French, Chinese, English and other foreign languages show that the intonation auditory method effectively improves French phonemes, Chinese tones, comprehensive spoken-English skills, English listening ability, pronunciation correction and phonological working memory.
The key to the success of the intonation auditory method in speech therapy and foreign-language learning is its emphasis on prosody and speaking patterns, since both are the basis of listening and speaking skills. With low-frequency speech patterns, the vestibule and cochlea are stimulated by prosodic and intonation changes, and both are particularly sensitive to the prosody of speech. This is because the development of the auditory organs begins with the perception of low-frequency sounds (speech rhythms) in the womb, where infants start to develop proprioceptive memory; this provides the basis for the later development of auditory memory, that is, the path from "sensing" to "hearing". In speech, the vestibular and auditory systems are sensitive to prosodic signals with frequencies below 300 Hz. Therefore, the speech signal is passed through a low-pass filter and the low frequencies below 300 Hz are preserved, which preserves the prosodic features of the utterance, including stress, rhythm, loudness and intonation. The higher frequencies needed to recognize words are removed, while the low-frequency speech signal that preserves speech prosody effectively reduces the learner's semantic and syntactic processing load and frees more attention resources for other cognitive processes. The main functions of the vestibular system are sensing body movement and gravity. In the peripheral system, the vestibule is part of the inner ear and connects to the cochlea; hearing develops from vestibular sensation, and the two complement each other when speech is sensed and heard. The importance of the vestibular system lies in its being a complex system that organizes all the senses and gives rise to spatial perception. The input of all sensory information needs to be integrated through the body, which is why vestibular perception must be trained and stimulated. Only when speech is uniformly and coordinately matched with the current vestibular perception and body movement do neurons, thanks to neural plasticity, develop new synapses and connect with other neurons to the greatest extent; speech information received through proprioception and the vestibular end organs then promotes language development and learning.
In conclusion, the intonation auditory method can effectively promote the language learning of English learners: sensory information is provided to the brain through the body (vestibular system) and the ears (auditory system) as the basis for brain information processing and spoken-language expression, and vestibular training and body movement accompanied by vocalization actively improve the learner's motor ability, spatial orientation and memory span, achieving the coordinated development of proprioception, hearing and speaking. In short, the intonation auditory method lets an English learner listen to speech while effectively achieving spoken-language control with the help of body movements. The listening material used by the method consists of short English sentences covering eight sentence patterns: affirmative declarative, negative declarative, general question, special question, alternative question, tag question, imperative and exclamatory sentences. According to Gimson's Pronunciation of English, declarative sentences, special questions, imperative sentences and exclamatory sentences usually end with a falling tone; general questions end with a rising tone; in alternative questions the first option takes a rising tone and the last option a falling tone; and tag questions take either a falling tone (indicating the speaker expects the listener to agree) or a rising tone (not pressing the other party to accept the speaker's opinion, expressing a simple question), depending on the meaning expressed. The eight sentence patterns cover the different English intonation patterns; each pattern has 10 short sentences, 80 short sentences in total. All words in the sentences are selected from the 2000 words required by the Compulsory Education English Curriculum Standards (2022 edition). The speech material was recorded by two native English speakers, one male and one female, speaking naturally; it was recorded as 32-bit stereo at a 44.1 kHz sampling rate using Adobe Audition CC (version 11.1.0) and saved as audio files. Each short sentence lasts about 2000 ms and contains about 5 syllables, at roughly 140 words per minute. Each sentence was read and recorded separately by the two speakers, giving 160 recorded speech samples in total.
The low-pass filtering was also done using Adobe Audition CC (version 11.1.0). A band separator separates the two-channel low-frequency band from 0 Hz to 300 Hz. Low-frequency speech below 300 Hz contains the fundamental frequency (F0), which determines pitch; it is the fundamental frequency that determines the pitch changes of intonation rise and fall. Based on the fundamental-frequency curve of each sentence, the method sets the animation limits and motion trajectory of the 3D image so as to guide the user to perform appropriate body rhythm following the intonation rise and fall of the sentence.
In the AI speech evaluation technique, the input speech undergoes feature extraction; speech recognition is performed against the stored speech and text libraries using an acoustic model and a language model built with machine-learning algorithms; content analysis, pronunciation analysis and prosody analysis are then carried out, and machine scoring is performed according to an evaluation model trained by machine-learning algorithms on a manually labelled database, which effectively reduces the cost of manual evaluation.
The introduction of AI technology solves problems that human teachers cannot solve, thereby improving teaching efficiency.
Specifically, this is reflected in the following aspects:
a) AI processes low-frequency speech more accurately. Teachers rely on their own experience and level, which leads to differences in teaching, whereas AI responds precisely to the audio data, achieving standardization and reducing human error.
b) An AI teacher solves the timeliness problem, so learning can take place anytime and anywhere.
c) An AI teacher can conduct one-to-one training for many users at the same time; for oral teaching, a human teacher cannot solve the problem of teaching many people at once.
d) An AI teacher can give the user's pronunciation evaluation results and feedback in a timely manner. The scoring standard is formulated according to the oral-expression requirements of the China's Standards of English Language Ability and the group standard "Computer evaluation standard for English oral ability grade test" (T/CIIA 009-2021). The scoring points include word intonation, syllable stress, word stress, weak forms, linking, elision and tone. The evaluation results include a phoneme score, a vocabulary score, a sentence total score, a fluency score, a completeness score and a prosody score. The full score is 100: above 80 is excellent, above 60 is passing, and below 60 is failing. The AI evaluation standard is unified and can identify phoneme accuracy, stress accuracy and prosody accuracy; assessing these requires a teacher of sufficiently high level, and whereas the machine gives timely feedback, a teacher may need to listen many times to find all the problems.
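The grade bands above can be applied mechanically once the sub-scores are available. The sketch below assumes equal weighting of four of the sub-scores, which is an illustrative choice since the patent specifies neither weights nor boundary handling.

```python
# Sketch of mapping evaluation sub-scores to the grade bands described above
# (full score 100; above 80 excellent, above 60 passing, below 60 failing).
def overall_grade(phoneme, fluency, completeness, prosody):
    total = 0.25 * (phoneme + fluency + completeness + prosody)  # assumed weights
    if total >= 80:
        band = "excellent"
    elif total >= 60:
        band = "pass"
    else:
        band = "fail"
    return round(total, 1), band

# Example: overall_grade(85, 78, 90, 70) -> (80.8, 'excellent')
```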
The AI teacher can give each student's evaluation results and a curve showing how their pronunciation results change over time, so that students can carry out targeted training on key sentences and phonemes according to each evaluation result and the low-scoring items.
It should be noted that, as those skilled in the art will understand, all or part of the processes for implementing the interactive training method of the embodiments of the present invention can be completed by controlling the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium, such as the memory of a server, and executed by at least one processor in the server, and its execution can include the processes of the embodiments of the method described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
FIGS. 1 and 2 show, respectively, the 300 Hz low-pass-filtered audio spectrogram and the unfiltered audio spectrogram of the sentence "Can I have soup?". Because the low-pass-filtered sound is limited in bandwidth, in order to ensure the same sound intensity, the amplitudes of all low-pass-filtered speech signals are normalized to 100% and the amplitudes of all unfiltered speech signals are normalized to 70%. Each sentence therefore has two audio materials, a two-channel low-pass-filtered version and an unfiltered version, giving 320 audio materials in total.
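The loudness matching described above amounts to peak normalization of each clip to a target level, assuming that "normalized to 100%/70%" refers to peak amplitude; the helper below is a sketch with illustrative file names.

```python
# Sketch of the normalization: low-pass filtered clips to 100 % of full scale,
# unfiltered clips to 70 %, so both versions play at comparable intensity.
import numpy as np
import soundfile as sf

def normalize_peak(in_path: str, out_path: str, target: float) -> None:
    audio, sr = sf.read(in_path)
    peak = float(np.max(np.abs(audio)))
    if peak > 0:
        audio = audio * (target / peak)   # target = 1.0 for 100 %, 0.7 for 70 %
    sf.write(out_path, audio, sr)

# normalize_peak("soup_lp300.wav", "soup_lp300_norm.wav", target=1.0)
# normalize_peak("soup_raw.wav",   "soup_raw_norm.wav",   target=0.7)
```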
A data table is generated from two columns of parameters: the x-axis is time and the y-axis is frequency. An "undefined" entry means that no frequency was detected at that time point; the remaining data are plotted against time and frequency to construct a dot diagram, which can be seen to be a frequency curve with rising-and-falling characteristics.
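The dot diagram can be produced directly from the time/frequency table by simply skipping the undefined rows; the plotting sketch below uses matplotlib purely for illustration.

```python
# Sketch of the dot-diagram construction: time on the x-axis, frequency on the
# y-axis, with undefined (no-pitch) rows left unplotted, yielding the
# rising-and-falling pitch curves shown in FIGS. 3-10.
import matplotlib.pyplot as plt
import numpy as np

def plot_pitch_dots(times, f0, title="F0 contour"):
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    voiced = ~np.isnan(f0)                 # undefined rows carry no point
    plt.scatter(times[voiced], f0[voiced], s=8)
    plt.xlabel("time (s)")
    plt.ylabel("frequency (Hz)")
    plt.title(title)
    plt.show()
```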
As shown in FIGS. 3 to 10:
According to the characteristics of declarative sentences, a rising tone appears within the sentence and a falling tone at the end of the sentence. The example sentence describes the weather; its keyword "cold" is stressed and carries the rising tone, corresponding to the waveform segment at x-coordinates 56-110 in the figure, while the sentence-final falling tone corresponds to the waveform segment at x-coordinates 118-152.
According to the characteristics of general questions, the question words carry a rising tone and the sentence end carries a rising tone. In the general question "Can I have soup?", the question phrase "Can I" corresponds to the waveform segment at x-coordinates 0-32, and the keyword "soup" is stressed and rises in tone at the sentence end, corresponding to the waveform segment at x-coordinates 115-132.
According to the characteristics of special questions, the question word and the keyword carry a rising tone and the end of the sentence carries a falling tone. In the special-question example "What time is it?", the question word "what" corresponds to x-coordinates 0-16, the keyword "time", which is stressed and rises in tone, corresponds to x-coordinates 38-70, and the falling tone corresponds to x-coordinates 71-118.
According to the characteristics of imperative sentences, the imperative verb is stressed and the whole sentence takes a falling tone. In the imperative example, which concerns lunch, the stressed imperative verb "have" corresponds to the waveform segment at x-coordinates 0-18, and the falling tone over the whole sentence corresponds to x-coordinates 33-116.
According to the rising-and-falling characteristics of each sentence pattern and the fundamental-frequency values extracted from the audio, and in order to better help students train intonation through coordinated limb movement, the system provides 3D image animation that presents the rising and falling intonation of each spoken sentence. The 3D image displays the movement in animated form, providing guidance and helping students imitate it, so that the training method is carried out more effectively.
According to the characteristics of negative declarative sentences, the negative word and the keyword in the sentence are stressed and rise in tone. In the negative-declarative example "You didn't come to school.", the negative word "didn't" carries the emphasized meaning of the sentence and is stressed with a rising tone; the corresponding time interval is the waveform segment at x-coordinates 175-419. The keyword "school" is also stressed with a rising tone, corresponding to x-coordinates 1079-1387.
According to the characteristics of tag questions, the subject and the keyword are stressed and rise in tone. In the tag-question example "Jane has never been to America, has she?", the subject "Jane" is stressed and rises in tone, corresponding to x-coordinates 0-0.287; the keyword "America" is stressed and rises in tone, corresponding to x-coordinates 1.06-2; and the tag at the end of the sentence uses a rising tone to ask for the other party's opinion and confirm the speaker's judgment.
According to the characteristics of alternative questions, the question words and the keywords carry a rising tone. In the alternative-question example "Can you sing or dance?", the question phrase "Can you" rises in tone, corresponding to x-coordinates 5-509; the first option "sing" is stressed and rises in tone, corresponding to x-coordinates 707-960; the connective "or" falls in tone and is weakly read, corresponding to x-coordinates 135-165; and the second option "dance" is stressed and rises in tone again, corresponding to x-coordinates 183-236.
According to the characteristics of exclamatory sentences, the exclamation word and the keyword are stressed and rise in tone. In the exclamation example "What an interesting film!", the exclamation word "what" is stressed and rises in tone, corresponding to x-coordinates 0-454, and the sentence keyword "interesting film" is stressed and rises in tone, corresponding to x-coordinates 744-1773.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (6)

1. An AI-interaction-based 3D virtual image auditory speech training method, characterized in that the method comprises the following steps:
S1, inputting a speech signal and applying low-pass filtering to the original speech to obtain low-frequency speech, i.e., retaining only the audio frequencies below 300 Hz, which preserve the prosodic features of the utterance, including stress, rhythm, loudness and intonation; the higher frequencies needed to identify individual words are thus removed, while the low-frequency speech signal that preserves prosody effectively reduces the learner's semantic and syntactic processing load and frees more attention resources for other cognitive processes;
S2, dividing the processed input speech signal into eight sentence types, namely affirmative declarative, negative declarative, general question, special question, alternative question, tag question, imperative and exclamatory sentences, to obtain a graded sentence library, and training the user continuously for no less than 30 s; because the training sentence audio has been low-pass filtered, speech parameters such as rhythm, intonation, pitch, tension, pause, duration and loudness are highlighted, which strengthens the user's bodily perception of the language signal; during training, the integrated training of proprioception, hearing and vision develops the user's brain neural pathways to the greatest extent and expands the learner's potential; in particular, coordinating body movement with the rhythm of the speech signal improves the learner's motor ability, spatial orientation and memory span, achieving the coordinated development of proprioception, hearing and speaking;
S3, creating the virtual image: a virtual character is created in Unity 3D, the character's skeleton is matched to the skeleton structure predefined by Mecanim, an Animator component is created for each action, arm-waving, rotating and beating animation effects are built, each action is triggered by a controller instruction, and corresponding parameters are set for each action effect to control its motion amplitude;
S4, integrating the graded sentence library with the corresponding 3D virtual image, wherein the 3D virtual image has two motion forms, in one of which the arm floats up and down and in the other of which the two hands open and close; the highest and lowest points of the fundamental-frequency values in the audio are obtained, the highest point of the arm wave, or the widest opening of the two hands, is set to correspond to the highest fundamental-frequency value in the audio, the lowest point of the arm wave, or the narrowest closing of the two hands, is set to correspond to the lowest fundamental-frequency value, and between these extremes the motion amplitude of the 3D virtual character is set as amplitude% = 100% x (current frequency - minimum frequency) / (maximum frequency - minimum frequency), the amplitude lying between 0 and 100;
when the formal animation is played, the user can choose either of the two motion forms, or the system randomly displays one of them; the current time is acquired in real time, and the 3D virtual image animation is displayed according to the time and the calculated motion amplitude; when an undefined value is encountered, the animation does not change until a valid frequency parameter is obtained;
S5, playing the speech sentence by sentence, wherein a band separator isolates the two-channel low-frequency band from 0 Hz to 300 Hz, which contains the fundamental frequency (F0) that determines pitch, and it is the fundamental frequency that determines the pitch changes of intonation rise and fall; based on the fundamental-frequency curve of each sentence, the method sets the animation limits and motion trajectory of the 3D image so as to guide the user to perform appropriate body rhythm following the intonation rise and fall of the sentence;
S6, dividing the words in the graded sentence library into high-pitch and low-pitch words according to their pitch level and displaying the word levels; the user listens to the low-pass-filtered audio material while watching the 3D virtual image perform body movements, imitates the low-pass-filtered speech and performs the body movements shown by the 3D virtual image; the 3D virtual image takes the rhythm of the sentence as its beat, and the animated character matches the stress, rhythm and intonation of the sentence with appropriate body rhythm, such as the arm floating up and down or the two hands opening and closing as the pitch of the sentence rises and falls, so the user imitates the low-pass-filtered speech while performing the body rhythm;
S7, AI evaluation: the user's pronunciation is evaluated by an AI evaluation module, which comprises a display module for intonation, action and their combination; a scoring module that scores the intonation, the action and their combined content; a marking module that marks the scoring result; a result output module; and an evaluation result and suggestion module.
2. The AI interaction-based 3D virtual image auditory speech training method of claim 1, wherein the system comprises a data acquisition module for acquiring speech data and motion data of a plurality of users, the speech data being collected by a speech acquisition device and including the 3D virtual image viewed from the user's perspective, and the motion data being collected by a motion acquisition device and including vector data generated by the user's movement;
a scene generation module for selecting and generating a training scene from a plurality of preset training scenes in response to the user's action form;
a role matching module for determining, from the user's speech data, the content the user is viewing within the training scene, and matching the corresponding hand action for the user from the training scene according to the viewed content;
an action generation module for generating the actions corresponding to the users' roles in the training scene according to the users' motion data;
a panoramic integration module for integrating the users' roles and actions and outputting them in a panoramic mode;
wherein outputting in the panoramic mode comprises: determining the aspect ratio of the panoramic mode according to the users' degree of dispersion during panoramic integration; or, during panoramic integration, extracting videos containing the users from the scene and splicing those videos;
and a feedback module for judging whether the difference between an action and the reference action is within a preset range and generating feedback information when it is not; the feedback information includes reminding the user of the improper current movement by vibration, specifically by activating the vibration patch at the body part whose action falls outside the preset range.
3. The AI interaction-based 3D avatar auditory speech training method of claim 2, wherein the data acquisition module comprises:
the identification unit is used for generating a corresponding user identification number for each user;
a matching unit for matching the user identification number to a corresponding eye movement acquiring device and motion acquiring device;
an acquisition unit configured to generate eye movement data and motion data of each user by the eye movement acquisition device and the motion acquisition device.
4. The AI interaction-based 3D avatar auditory speech training method of claim 3, wherein the data acquisition module is further configured to acquire audio data of a plurality of users; the panoramic integration module is further used for integrating the audio data with the role corresponding to the user; the interactive training system further comprises:
and the audio playing module is used for receiving the 3D virtual image or the hand action of the user and playing corresponding audio data.
5. The interactive training system as claimed in claim 3 or 4,
the data acquisition module is also used for acquiring audio data of a plurality of users;
and an audio adjusting module for integrating the users' roles, actions and audio data, adjusting the playback volume according to the distance of each role in the panoramic mode, and storing the data.
6. The interactive training system as claimed in claim 5, characterized in that the stored data are detected and scored by AI evaluation.
CN202211106912.8A 2022-09-09 2022-09-09 AI (artificial intelligence) interaction-based 3D virtual image auditory speech training method Pending CN115910036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211106912.8A CN115910036A (en) 2022-09-09 2022-09-09 AI (artificial intelligence) interaction-based 3D virtual image auditory speech training method


Publications (1)

Publication Number Publication Date
CN115910036A true CN115910036A (en) 2023-04-04

Family

ID=86490155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211106912.8A Pending CN115910036A (en) 2022-09-09 2022-09-09 AI (artificial intelligence) interaction-based 3D virtual image auditory speech training method

Country Status (1)

Country Link
CN (1) CN115910036A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination