CN113837907A - Man-machine interaction system and method for English teaching


Info

Publication number
CN113837907A
Authority
CN
China
Prior art keywords
english
user
data
semantic
video data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111128380.3A
Other languages
Chinese (zh)
Inventor
盛婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuchang University of Technology
Original Assignee
Wuchang University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuchang University of Technology filed Critical Wuchang University of Technology
Priority to CN202111128380.3A priority Critical patent/CN113837907A/en
Publication of CN113837907A publication Critical patent/CN113837907A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Educational Administration (AREA)
  • Human Computer Interaction (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Computer Security & Cryptography (AREA)
  • Educational Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention belongs to the technical field of data identification and discloses an English teaching human-computer interaction system and method. The system comprises a human-computer interaction unit, a semantic analysis unit and a cloud data center, and the method comprises the following steps: when the user selects the common interaction mode, the user's English spoken language video data is collected and sent to the semantic analysis unit for analysis to obtain the corresponding English reaction video data; when the user selects the intelligent interaction mode, the English spoken language video data is collected and sent to the cloud data center for intelligent correction and emotion analysis to obtain a corresponding user portrait, and the corresponding English reaction video data is obtained from the user portrait and the English spoken language video data; the human-computer interaction unit then broadcasts the corresponding English reaction video data. The invention solves the problems in the prior art of high labor cost in manual teaching, the lack of a spoken-English learning environment, the low degree of intelligence and limited functionality of English teaching equipment, and the poor real-time performance and interactivity of spoken-language training.

Description

Man-machine interaction system and method for English teaching
Technical Field
The invention belongs to the technical field of data identification, and particularly relates to a human-computer interaction system and a human-computer interaction method for English teaching.
Background
English is a global language, one of the most widely used official languages in the world, and the most effective tool for communicating with other countries, so the demand for English learning keeps growing with social development and economic progress. Current English teaching still largely relies on the traditional model of teachers lecturing in person from textbooks. This model requires a large number of English teachers, places high demands on teacher quality, is constrained by venue, and is therefore difficult to popularize and scale. Because of this lagging teaching model, a spoken-English learning environment is lacking and students' spoken English cannot be trained effectively. Existing English teaching equipment usually offers only simple audio playback and cannot meet students' requirements for real-time, interactive spoken-English training.
Disclosure of Invention
The invention aims to provide a man-machine interaction system for English teaching, so as to solve the problems in the prior art of high labor cost in manual teaching, the lack of a spoken-English learning environment, the low degree of intelligence and limited functionality of English teaching equipment, and the poor real-time performance and interactivity of spoken-language training.
The technical scheme adopted by the invention is as follows:
a man-machine interaction system for English teaching comprises a man-machine interaction unit, a semantic analysis unit and a cloud data center, wherein the man-machine interaction unit is respectively in communication connection with the semantic analysis unit and the cloud data center;
the human-computer interaction unit is used for acquiring user login information and English spoken language video data of a user, sending the user login information to the cloud data center for user verification, sending the acquired English spoken language audio data to the semantic analysis unit and the cloud data center for semantic analysis, and broadcasting corresponding English reaction video data;
the semantic analysis unit is used for extracting oral English audio data included by the oral English video data, identifying the oral English audio data to obtain corresponding oral English character data, analyzing the oral English character data to obtain semantic information, if the semantic information cannot be obtained, sending the corresponding oral English video data to the cloud data center for semantic analysis, obtaining corresponding English reaction video data according to the semantic information, and sending the English reaction video data to the human-computer interaction unit;
the cloud data center is used for analyzing the oral English video data to obtain semantic information and obtain corresponding user portrait, storing the obtained user portrait, the oral English video data, the semantic information and the user login information to a corresponding user database, obtaining corresponding English reaction video data according to the semantic information, and sending the English reaction video data to the human-computer interaction unit.
Furthermore, the human-computer interaction unit comprises a control module, a communication module, a microphone, an audio playing module, a touch display screen and a camera, wherein the communication module, the microphone, the audio playing module, the touch display screen and the camera are all in communication connection with the control module;
the semantic analysis unit comprises a microprocessor, a first audio extraction module, a first audio character conversion module, a first semantic identification module and a first reaction video acquisition module which are all in communication connection with the microprocessor, the microprocessor is in communication connection with the control module, and the microprocessor is provided with a first audio database, a first semantic database and a first reaction video database;
the cloud data center is provided with a data server, the data server comprises a central processing unit, an image recognition module, a lip language information acquisition module, a semantic completion module, a semantic emotion analysis module, an audio data correction module, a user image acquisition module, a second audio extraction module, a second audio character conversion module, a second semantic recognition module and a second reaction video acquisition module, and the central processing unit is provided with a second audio database, a second semantic database, a second reaction video database, a user database and a lip language database.
A man-machine interaction method for English teaching is based on an English teaching man-machine interaction system and comprises the following steps:
after the user verification is passed, performing English teaching man-machine interaction according to the interaction mode selected by the user;
the user selects a common interaction mode, the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the semantic analysis unit for analysis to obtain corresponding English reaction video data, and if the English spoken language video data cannot be analyzed by the semantic analysis unit, the current English spoken language video data is sent to the cloud data center for analysis;
the user selects an intelligent interaction mode, the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the cloud data center for intelligent correction and emotion analysis to obtain a corresponding user portrait, and corresponding English reaction video data are obtained according to the user portrait and the English spoken language video data;
and the human-computer interaction unit broadcasts corresponding English reaction video data.
Further, the user authentication is performed, which comprises the following steps:
the human-computer interaction unit collects user login information and sends the user login information to the cloud data center, if the user login information is abnormal data, user face image data of a current user are collected, the image recognition module carries out face identity verification according to the face image data of the current user, and otherwise the cloud data center judges whether the current user is a known user according to the user login information;
if the current user face image data cannot be matched with the user database of the known user, returning verification failure information to the human-computer interaction unit, otherwise, connecting to the user database corresponding to the user image data and sending verification passing information to the human-computer interaction unit;
if the current user login information is a known user, connecting the corresponding user database and sending verification passing information to the man-machine interaction unit, otherwise, establishing a new user database according to the current user login information and sending the verification passing information to the man-machine interaction unit.
Further, acquiring English reaction video data in a common interaction mode, comprising the following steps:
the user selects a common interaction mode, and the human-computer interaction unit acquires English spoken language video data of the user and sends the English spoken language video data to the semantic analysis unit;
the first audio extraction module extracts oral English audio data in the oral English video data;
the first audio character conversion module converts the spoken English audio data into spoken English character data;
the first semantic recognition module calls a first semantic database according to the English grammar structure to perform semantic analysis on the spoken English character data to obtain corresponding semantic information;
if the first semantic recognition module cannot acquire corresponding semantic information obtained by performing semantic analysis on the spoken English character data, sending the corresponding spoken English video data to a cloud data center for semantic analysis;
and the first reaction video acquisition module calls the first reaction video database according to the semantic information to obtain corresponding English reaction video data.
Further, the method for acquiring English reaction video data in the intelligent interaction mode comprises the following steps:
the user selects an intelligent interaction mode, and the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the cloud data center;
the lip information acquisition module acquires lip information of the English spoken video data, and the second semantic recognition module calls a lip database according to the lip information to perform lip semantic analysis to obtain lip semantic information;
the audio data correction module calls an audio database to correct the oral English audio data to obtain corrected oral English audio data;
the second audio character conversion module converts the corrected spoken English audio data into spoken English character data, and the second semantic recognition module calls a second semantic database according to an English grammar structure to perform semantic analysis on the spoken English character data to obtain corresponding character semantic information;
when the text semantic information and the lip language semantic information match, the text semantic information is preferentially selected, and when they do not match, the lip language semantic information is preferentially selected;
the semantic completion module performs semantic completion on the character semantic information/lip language semantic information according to the English grammar structure to obtain completed semantic information;
the emotion analysis module carries out emotion analysis on the complemented semantic information to obtain user emotion information, and the user image acquisition module acquires a corresponding user portrait according to the spoken English video data and the user emotion information and updates the user portrait of a corresponding user database;
and the second reaction video acquisition module calls a second reaction video database according to the user portrait and the complemented semantic information to obtain corresponding English reaction video data.
Further, the lip language information acquisition method comprises the following steps:
the lip information acquisition module acquires lip images of continuous frames by performing frame interception on lip parts of the English spoken language video data;
performing image recognition according to a plurality of positioning points preset at the lip position of the current lip image to obtain lip angle data;
and traversing the lip images of all the frames, obtaining continuous lip angle change data according to all the lip angle data, and obtaining lip language information according to the continuous lip angle change data.
Further, acquiring corrected spoken English audio data, comprising the steps of:
the central processing unit establishes an audio matching model based on the neural network;
the audio data correction module marks continuous pronunciation nodes of the oral English audio data and cuts the oral English audio data into a plurality of correction audio segments according to the marks;
the audio data correction module calls pre-stored audio segments in an audio database according to the frequency to match a plurality of corrected audio segments to obtain matched pre-stored audio segments and corresponding audio segment semantics, and extracts abnormal audio segments with abnormal audio segment semantics according to the semantic relationship between an English grammar structure and adjacent audio segment semantics;
inputting an abnormal audio segment with abnormal audio segment semantics into an audio matching model, calling an audio database to search and match, obtaining a pre-stored audio segment with the highest frequency matching degree, and replacing the abnormal audio segment;
and if all the abnormal audio sections are replaced, obtaining corrected oral English audio data, and otherwise, repeating the correction steps.
Further, emotion analysis is performed to obtain a corresponding user portrait, and the method comprises the following steps:
the central processing unit establishes an emotion analysis model and a user image acquisition model based on the neural network;
the emotion analysis module extracts multi-mode combination characteristics of the complemented semantic information;
the emotion analysis model fully fuses the multi-mode combination features by using a hierarchical self-attention method to obtain an emotion fusion vector;
judging according to the emotion fusion vector to obtain user emotion information;
the user image acquisition model carries out image recognition on the English spoken language video data to obtain the facial features of the user, and the age and the gender of the user are obtained according to the facial features;
the user image acquisition model identifies the frequency and the tone of the English spoken language audio data to obtain the sound characteristics of the user, and the age, the gender and the tone of the user are obtained according to the sound characteristics;
the user image acquisition model performs keyword analysis on the spoken English character data to obtain keyword characteristics of the user, and the age and the gender of the user are obtained according to the keyword characteristics;
and the user image acquisition module performs comprehensive analysis according to the age, sex, tone and user emotion information of the user to obtain a user portrait, and updates the user portrait of the corresponding user database.
Further, the method for acquiring English reaction video data comprises the following steps:
the central processing unit establishes a response video matching model based on the neural network;
the reaction video matching model calls a second reaction video database to screen according to the complemented semantic information of the current oral English video data to obtain a plurality of alternative English reaction video data;
and screening the plurality of candidate English reaction video data by the reaction video matching model according to the current user portrait to obtain the final English reaction video data.
The invention has the beneficial effects that:
1) The English teaching man-machine interaction system provided by the invention removes the venue constraints on English teaching; by adopting artificial intelligence and automated system design, it avoids traditional manual teaching, reduces the demand for English teachers and the associated labor cost, and provides users with a spoken-English learning environment.
2) The English teaching man-machine interaction method provided by the invention collects the user's English spoken language video data, performs semantic analysis and matches the corresponding English reaction video data, realizing interaction between the user and an intelligent AI, raising the degree of intelligence, realistically simulating an English dialogue environment, and meeting the requirements for real-time, interactive spoken-English training; the method identifies and analyses the user's video data based on neural networks, improving practicality and data-identification efficiency, and matches the English reaction video data using emotion analysis and the user portrait, greatly improving the randomness and authenticity of dialogue training.
Other advantageous effects of the present invention will be further described in the detailed description.
Drawings
FIG. 1 is a block diagram of the human-computer interaction system for English teaching in the present invention.
FIG. 2 is a flowchart of a method for man-machine interaction in English teaching according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
as shown in fig. 1, the embodiment provides a human-computer interaction system for english teaching, which includes a human-computer interaction unit, a semantic analysis unit, and a cloud data center, where the human-computer interaction unit is in communication connection with the semantic analysis unit and the cloud data center, respectively;
the human-computer interaction unit is used for acquiring user login information and English spoken language video data of a user, sending the user login information to the cloud data center for user verification, sending the acquired English spoken language audio data to the semantic analysis unit and the cloud data center for semantic analysis, and broadcasting corresponding English reaction video data;
the semantic analysis unit is used for extracting oral English audio data included by the oral English video data, identifying the oral English audio data to obtain corresponding oral English character data, analyzing the oral English character data to obtain semantic information, if the semantic information cannot be obtained, sending the corresponding oral English video data to the cloud data center for semantic analysis, obtaining corresponding English reaction video data according to the semantic information, and sending the English reaction video data to the human-computer interaction unit;
the cloud data center is used for analyzing the oral English video data to obtain semantic information and obtain corresponding user portrait, storing the obtained user portrait, the oral English video data, the semantic information and the user login information to a corresponding user database, obtaining corresponding English reaction video data according to the semantic information, and sending the English reaction video data to the human-computer interaction unit.
The English teaching man-machine interaction system provided by the invention removes the venue constraints on English teaching; by adopting artificial intelligence and automated system design, it avoids traditional manual teaching, reduces the demand for English teachers and the associated labor cost, and provides users with a spoken-English learning environment.
Preferably, the human-computer interaction unit comprises a control module, a communication module, a microphone, an audio playing module, a touch display screen and a camera, wherein the communication module, the microphone, the audio playing module, the touch display screen and the camera are all in communication connection with the control module; the control module controls the communication module, the microphone, the audio playing module, the touch display screen and the camera to work, the microphone collects audio data of a user, the camera collects video data of the user, the touch display screen collects user login data of the user and displays reaction video data, the audio playing module broadcasts the reaction audio data, man-machine interaction is achieved, and the communication module achieves data transmission between the man-machine interaction unit and the cloud data center;
the semantic analysis unit comprises a microprocessor, a first audio extraction module, a first audio character conversion module, a first semantic identification module and a first reaction video acquisition module which are all in communication connection with the microprocessor, the microprocessor is in communication connection with the control module, and the microprocessor is provided with a first audio database, a first semantic database and a first reaction video database; the microprocessor controls the work of the first audio extraction module, the first audio character conversion module, the first semantic recognition module and the first reaction video acquisition module, and the first audio database, the first semantic database and the first reaction video database store corresponding data;
the cloud data center is provided with a data server, the data server comprises a central processing unit, an image recognition module, a lip language information acquisition module, a semantic completion module, a semantic emotion analysis module, an audio data correction module, a user image acquisition module, a second audio extraction module, a second audio character conversion module, a second semantic recognition module and a second reaction video acquisition module, and the central processing unit is provided with a second audio database, a second semantic database, a second reaction video database, a user database and a lip language database; the central processing unit controls the work of the image recognition module, the lip language information acquisition module, the semantic completion module, the semantic emotion analysis module, the audio data correction module, the user image acquisition module, the second audio extraction module, the second audio character conversion module, the second semantic recognition module and the second response video acquisition module, and the second audio database, the second semantic database, the second response video database, the user database and the lip language database store corresponding data.
Example 2:
the present embodiment is an improvement of the technical solution based on embodiment 1, and the difference from embodiment 1 is that:
a man-machine interaction method for English teaching is shown in figure 2, and based on an English teaching man-machine interaction system, the method comprises the following steps:
performing user authentication, comprising the steps of:
the human-computer interaction unit collects user login information and sends the user login information to the cloud data center, if the user login information is abnormal data, user face image data of a current user are collected, the image recognition module carries out face identity verification according to the face image data of the current user, and otherwise the cloud data center judges whether the current user is a known user according to the user login information;
if the current user face image data cannot be matched with the user database of the known user, returning verification failure information to the human-computer interaction unit, otherwise, connecting to the user database corresponding to the user image data and sending verification passing information to the human-computer interaction unit;
if the current user login information is a known user, connecting a corresponding user database and sending verification passing information to the man-machine interaction unit, otherwise, establishing a new user database according to the current user login information and sending the verification passing information to the man-machine interaction unit;
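For readability, the verification branching described above can be summarized in the short sketch below; the CloudDataCenter class, its fields and the face_match() callable are illustrative assumptions and are not components named in this embodiment.

```python
# Hypothetical sketch of the user-verification flow described above.
# The CloudDataCenter interface and face_match() routine are assumptions
# introduced purely for illustration.
from dataclasses import dataclass, field

@dataclass
class CloudDataCenter:
    user_databases: dict = field(default_factory=dict)   # login_id -> user database
    face_index: dict = field(default_factory=dict)        # login_id -> face template

    def verify(self, login_info, face_image=None, face_match=None):
        if not login_info:                                 # "abnormal" login data
            if face_image is None or face_match is None:
                return {"ok": False, "reason": "no usable credentials"}
            # Try to match the captured face image against known users.
            for user_id, template in self.face_index.items():
                if face_match(face_image, template):
                    return {"ok": True, "user_db": self.user_databases[user_id]}
            return {"ok": False, "reason": "face not recognized"}
        if login_info in self.user_databases:              # known user
            return {"ok": True, "user_db": self.user_databases[login_info]}
        # Unknown but well-formed login: create a new user database.
        self.user_databases[login_info] = {"portrait": None, "history": []}
        return {"ok": True, "user_db": self.user_databases[login_info]}
```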
after the user verification is passed, performing English teaching man-machine interaction according to the interaction mode selected by the user;
the user selects a common interaction mode, the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the semantic analysis unit for analysis to obtain corresponding English reaction video data, and the method comprises the following steps of:
the user selects a common interaction mode, and the human-computer interaction unit acquires English spoken language video data of the user and sends the English spoken language video data to the semantic analysis unit;
the first audio extraction module extracts oral English audio data in the oral English video data;
the first audio character conversion module converts the spoken English audio data into spoken English character data;
the first semantic recognition module calls a first semantic database according to the English grammar structure to perform semantic analysis on the spoken English character data to obtain corresponding semantic information;
if the first semantic recognition module cannot acquire corresponding semantic information obtained by performing semantic analysis on the spoken English character data, sending the corresponding spoken English video data to a cloud data center for semantic analysis;
the first reaction video acquisition module calls a first reaction video database according to the semantic information to obtain corresponding English reaction video data;
if the semantic analysis unit cannot analyze the oral English video data, sending the current oral English video data to a cloud data center for analysis;
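The common-mode pipeline described above (audio extraction, audio-to-text conversion, semantic lookup, reaction-video lookup, with a fallback to the cloud data center) can be sketched roughly as follows; the callables and dictionary-style databases are placeholders standing in for the first audio extraction module, first audio character conversion module, first semantic database and first reaction video database.

```python
# Illustrative common-mode pipeline; all arguments are placeholders.
def common_mode_interaction(video_data, extract_audio, speech_to_text,
                            semantic_db, reaction_db, cloud_analyze):
    audio = extract_audio(video_data)        # first audio extraction module
    text = speech_to_text(audio)             # first audio character conversion module
    semantics = semantic_db.get(text)        # first semantic database lookup
    if semantics is None:                    # local analysis failed
        return cloud_analyze(video_data)     # fall back to the cloud data center
    return reaction_db[semantics]            # first reaction video database lookup
```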
the user selects an intelligent interaction mode, the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the cloud data center to carry out intelligent correction and emotion analysis to obtain corresponding user portrait, and corresponding English reaction video data are obtained according to the user portrait and the English spoken language video data, and the method comprises the following steps:
the user selects an intelligent interaction mode, and the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the cloud data center;
the method for acquiring the lip information of the spoken English video data by the lip information acquisition module comprises the following steps:
the lip information acquisition module acquires lip images of continuous frames by performing frame interception on lip parts of the English spoken language video data;
performing image recognition according to a plurality of positioning points preset at the lip position of the current lip image to obtain lip angle data;
in this embodiment, the mouth corners of the lips, the midpoint where the upper lip meets the philtrum, the point where the lower lip meets the lower jaw, the midpoints between the philtrum point and the mouth corners, and the midpoints between the lower-jaw point and the mouth corners are set as the positioning points, and the more positioning points are set, the more accurate the acquired lip angle data is;
traversing the lip images of all the frames, obtaining continuous lip angle change data according to all the lip angle data, and obtaining lip language information according to the continuous lip angle change data;
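The lip-angle computation implied by the above steps can be sketched as follows; the landmark names and the angle definition (the angle at each mouth corner between the philtrum and lower-jaw anchors) are assumptions introduced for illustration, since the embodiment does not fix a formula.

```python
# Hypothetical lip-angle extraction from per-frame lip landmarks.
# Each frame is a dict mapping a named anchor point (e.g. "left_corner",
# "right_corner", "philtrum", "jaw") to (x, y) pixel coordinates.
import math

def lip_angles(frame_landmarks):
    """Angles (degrees) at each mouth corner between the philtrum and jaw anchors."""
    def angle(vertex, a, b):
        v1 = (a[0] - vertex[0], a[1] - vertex[1])
        v2 = (b[0] - vertex[0], b[1] - vertex[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n1, n2 = math.hypot(*v1), math.hypot(*v2)
        return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2 + 1e-9)))))
    left = angle(frame_landmarks["left_corner"],
                 frame_landmarks["philtrum"], frame_landmarks["jaw"])
    right = angle(frame_landmarks["right_corner"],
                  frame_landmarks["philtrum"], frame_landmarks["jaw"])
    return left, right

def lip_angle_sequence(frames_landmarks):
    """Traverse all frames and return the continuous lip-angle change data."""
    return [lip_angles(f) for f in frames_landmarks]
```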
the second semantic recognition module calls a lip language database according to the lip language information to perform lip language semantic analysis to obtain lip language semantic information;
the audio data correction module calls an audio database to correct the oral English audio data to obtain corrected oral English audio data, and the method comprises the following steps of:
the central processing unit establishes an audio matching model based on the neural network;
the audio data correction module marks continuous pronunciation nodes of the oral English audio data and cuts the oral English audio data into a plurality of correction audio segments according to the marks;
the audio data correction module calls pre-stored audio segments in an audio database according to the frequency to match a plurality of corrected audio segments to obtain matched pre-stored audio segments and corresponding audio segment semantics, and extracts abnormal audio segments with abnormal audio segment semantics according to the semantic relationship between an English grammar structure and adjacent audio segment semantics;
inputting an abnormal audio segment with abnormal audio segment semantics into an audio matching model, calling an audio database to search and match, obtaining a pre-stored audio segment with the highest frequency matching degree, and replacing the abnormal audio segment;
if all abnormal audio segments are replaced, corrected oral English audio data are obtained, otherwise, the correction steps are repeated;
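A rough sketch of this correction loop is given below; the cosine similarity stands in for the "frequency matching degree" mentioned above, and the pre-stored segment store and the context check are illustrative assumptions rather than components specified by the embodiment.

```python
# Illustrative audio-correction loop. `segments` are feature vectors of the clips
# cut at the marked pronunciation nodes; `prestored` is a list of
# (reference vector, segment semantics) pairs from the audio database.
import numpy as np

def best_match(segment_vec, prestored):
    """Return (reference vector, semantics, similarity) of the closest pre-stored segment."""
    best = None
    for ref_vec, semantics in prestored:
        sim = float(np.dot(segment_vec, ref_vec) /
                    (np.linalg.norm(segment_vec) * np.linalg.norm(ref_vec) + 1e-9))
        if best is None or sim > best[2]:
            best = (ref_vec, semantics, sim)
    return best

def correct_segments(segments, prestored, fits_context):
    """Replace segments whose matched semantics do not fit the neighbouring semantics."""
    matched = [best_match(seg, prestored) for seg in segments]
    corrected = []
    for i, (seg, (ref_vec, semantics, _)) in enumerate(zip(segments, matched)):
        neighbours = [matched[j][1] for j in (i - 1, i + 1) if 0 <= j < len(matched)]
        if fits_context(semantics, neighbours):
            corrected.append(seg)        # semantics consistent with context: keep as-is
        else:
            corrected.append(ref_vec)    # abnormal segment: substitute best pre-stored clip
    return corrected
```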
the second audio character conversion module converts the corrected spoken English audio data into spoken English character data, and the second semantic recognition module calls a second semantic database according to an English grammar structure to perform semantic analysis on the spoken English character data to obtain corresponding character semantic information;
when the text semantic information and the lip language semantic information match, the text semantic information is preferentially selected, and when they do not match, the lip language semantic information is preferentially selected;
the semantic completion module performs semantic completion on the character semantic information/lip language semantic information according to the English grammar structure to obtain completed semantic information;
the semantic completion module uses a word segmentation tool to segment the text semantic information/lip language semantic information, obtains a word vector for each segmented word based on the English grammar structure, treats word vectors lacking head and tail coordinates as abnormal word vectors, and extracts frequently used keywords from the user database for completion;
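A minimal sketch of that completion step is shown below, under the assumption that an "abnormal" token is one whose word vector is missing or all-zero and that the most frequent keyword from the user database fills the gap; neither of these choices is fixed by the embodiment.

```python
# Hypothetical semantic-completion step: tokens with a missing or all-zero
# vector are replaced by the most frequent keyword from the user database.
from collections import Counter

def complete_semantics(tokens, vectors, user_keywords):
    """tokens: list[str]; vectors: list of vectors or None; user_keywords: list[str]."""
    fallback = Counter(user_keywords).most_common(1)
    fallback = fallback[0][0] if fallback else ""
    completed = []
    for tok, vec in zip(tokens, vectors):
        abnormal = vec is None or not any(vec)   # stands in for missing head/tail coordinates
        completed.append(fallback if abnormal else tok)
    return " ".join(w for w in completed if w)
```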
the emotion analysis module carries out emotion analysis on the complemented semantic information to obtain user emotion information, and the user image acquisition module acquires the corresponding user portrait according to the spoken English video data and the user emotion information and updates the user portrait in the corresponding user database, wherein the user portrait comprises the user's ID (identity), facial features, sound features, keyword features, emotional state, age, gender and tone;
the emotion analysis is carried out to obtain the corresponding user portrait, and the method comprises the following steps:
the central processing unit establishes an emotion analysis model and a user image acquisition model based on the neural network;
the emotion analysis module extracts multi-mode combination characteristics of the complemented semantic information;
the emotion analysis model fully fuses the multi-mode combination features by using a hierarchical self-attention method to obtain an emotion fusion vector;
judging according to the emotion fusion vector to obtain user emotion information, wherein the user emotion information comprises calm, fear, happiness, sadness, anger and excitement;
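One way to realize the hierarchical self-attention fusion described above is a self-attention pass within each modality followed by a second self-attention pass over the pooled modality vectors; the NumPy sketch below only illustrates that structure (identity projections, no learned weights) and is not the patented model.

```python
# Illustrative hierarchical self-attention fusion producing an emotion fusion vector.
# Each modality is an (n_tokens, d) array; a real model would learn Q/K/V projections.
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def hierarchical_fusion(modalities):
    """Level 1: attend within each modality; level 2: attend across pooled modalities."""
    pooled = [self_attention(m).mean(axis=0) for m in modalities]   # one vector per modality
    stacked = np.stack(pooled)                                      # (n_modalities, d)
    return self_attention(stacked).mean(axis=0)                     # emotion fusion vector

emotions = ["calm", "fear", "happy", "sad", "angry", "excited"]
# A downstream classifier (not shown) would map the fusion vector to one of these labels.
```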
the user image acquisition model carries out image recognition on the English spoken language video data to obtain the facial features of the user, and the age and the gender of the user are obtained according to the facial features;
the user image acquisition model identifies the frequency and the tone of the English spoken language audio data to obtain the sound characteristics of the user, and the age, the gender and the tone of the user are obtained according to the sound characteristics;
the user image acquisition model performs keyword analysis on the spoken English character data by using a TextRank algorithm to obtain the keyword characteristics of the user, and obtains the age and the gender of the user according to the keyword characteristics;
the user image acquisition module carries out comprehensive analysis according to the age, the gender and the tone of the user and the emotional information of the user to obtain a user portrait and updates the user portrait of a corresponding user database;
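The TextRank keyword analysis used for the portrait can be sketched as below; the window size, damping factor and iteration count are conventional defaults, not values taken from this embodiment.

```python
# Illustrative TextRank keyword extraction over the spoken-English text data.
# Window size 2, damping 0.85 and 30 iterations are conventional defaults (assumptions).
from collections import defaultdict

def textrank_keywords(words, top_k=5, window=2, damping=0.85, iters=30):
    graph = defaultdict(set)
    for i, w in enumerate(words):                      # co-occurrence graph within a window
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] != w:
                graph[w].add(words[j]); graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):                             # PageRank-style update
        score = {w: (1 - damping) + damping * sum(score[n] / len(graph[n]) for n in nbrs)
                 for w, nbrs in graph.items()}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

The age, gender, tone and emotion attributes obtained in the preceding steps would then be combined with such keyword characteristics into the user portrait record.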
the second reaction video acquisition module calls a second reaction video database according to the user portrait and the complemented semantic information to obtain corresponding English reaction video data, and the method comprises the following steps:
the central processing unit establishes a response video matching model based on the neural network;
the reaction video matching model calls the second reaction video database for screening according to the complemented semantic information of the current oral English video data to obtain a plurality of candidate English reaction video data; the complemented semantic information captures more accurately the spoken meaning expressed by the oral English video data;
the reaction video matching model screens the plurality of candidate English reaction video data according to the current user portrait to obtain the final English reaction video data; the corresponding English reaction video data is selected according to the current user's emotion information, specific age, gender and tone, which realizes a personally customized human-computer interaction scene and improves the intelligence of the method;
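The two-stage selection just described (screen candidates by the complemented semantic information, then rank them against the user portrait) could look like the sketch below; the scoring callables are placeholders, since the embodiment does not specify how either matching degree is computed.

```python
# Illustrative two-stage reaction-video selection. `videos` is a list of dicts with
# "semantics" and "profile" fields; the two score functions are assumptions.
def pick_reaction_video(videos, complemented_semantics, user_portrait,
                        semantic_score, portrait_score, top_n=5):
    # Stage 1: screen candidates whose semantics fit the complemented semantic information.
    candidates = sorted(videos,
                        key=lambda v: semantic_score(v["semantics"], complemented_semantics),
                        reverse=True)[:top_n]
    # Stage 2: rank the candidates by how well they fit the current user portrait
    # (age, gender, tone, emotional state).
    return max(candidates, key=lambda v: portrait_score(v["profile"], user_portrait))
```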
and the human-computer interaction unit broadcasts corresponding English reaction video data.
The English teaching man-machine interaction method provided by the invention collects the user's English spoken language video data, performs semantic analysis and matches the corresponding English reaction video data, realizing interaction between the user and an intelligent AI, raising the degree of intelligence, realistically simulating an English dialogue environment, and meeting the requirements for real-time, interactive spoken-English training; the method identifies and analyses the user's video data based on neural networks, improving practicality and data-identification efficiency, and matches the English reaction video data using emotion analysis and the user portrait, greatly improving the randomness and authenticity of dialogue training.
The present invention is not limited to the above-described alternative embodiments, and anyone may derive various other forms of products in light of the present invention. The above detailed description should not be construed as limiting the scope of the invention, which is defined by the claims; the description may be used in interpreting the claims.

Claims (10)

1. A human-computer interaction system for English teaching, characterized in that: the system comprises a human-computer interaction unit, a semantic analysis unit and a cloud data center, wherein the human-computer interaction unit is in communication connection with the semantic analysis unit and the cloud data center respectively;
the human-computer interaction unit is used for acquiring user login information and English spoken language video data of a user, sending the user login information to the cloud data center for user verification, sending the acquired English spoken language audio data to the semantic analysis unit and the cloud data center for semantic analysis, and broadcasting corresponding English reaction video data;
the semantic analysis unit is used for extracting oral English audio data included by the oral English video data, identifying the oral English audio data to obtain corresponding oral English character data, analyzing the oral English character data to obtain semantic information, if the semantic information cannot be obtained, sending the corresponding oral English video data to the cloud data center for semantic analysis, obtaining corresponding English reaction video data according to the semantic information, and sending the English reaction video data to the human-computer interaction unit;
the cloud data center is used for analyzing the oral English video data to obtain semantic information and obtain corresponding user portrait, storing the obtained user portrait, the oral English video data, the semantic information and the user login information to a corresponding user database, obtaining corresponding English reaction video data according to the semantic information, and sending the English reaction video data to the human-computer interaction unit.
2. The human-computer interaction system for English teaching of claim 1, wherein: the human-computer interaction unit comprises a control module, a communication module, a microphone, an audio playing module, a touch display screen and a camera, wherein the communication module, the microphone, the audio playing module, the touch display screen and the camera are all in communication connection with the control module;
the semantic analysis unit comprises a microprocessor, a first audio extraction module, a first audio character conversion module, a first semantic identification module and a first reaction video acquisition module, wherein the first audio extraction module, the first audio character conversion module, the first semantic identification module and the first reaction video acquisition module are all in communication connection with the microprocessor;
the cloud data center is provided with a data server, the data server comprises a central processing unit, an image identification module, a lip language information acquisition module, a semantic completion module, a semantic emotion analysis module, an audio data correction module, a user image acquisition module, a second audio extraction module, a second audio character conversion module, a second semantic identification module and a second reaction video acquisition module, and the central processing unit is provided with a second audio database, a second semantic database, a second reaction video database, a user database and a lip language database.
3. An English teaching man-machine interaction method based on the English teaching man-machine interaction system of claim 2, characterized in that: the method comprises the following steps:
after the user verification is passed, performing English teaching man-machine interaction according to the interaction mode selected by the user;
the user selects a common interaction mode, the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the semantic analysis unit for analysis to obtain corresponding English reaction video data, and if the English spoken language video data cannot be analyzed by the semantic analysis unit, the current English spoken language video data is sent to the cloud data center for analysis;
the user selects an intelligent interaction mode, the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the cloud data center for intelligent correction and emotion analysis to obtain a corresponding user portrait, and corresponding English reaction video data are obtained according to the user portrait and the English spoken language video data;
and the human-computer interaction unit broadcasts corresponding English reaction video data.
4. The English teaching human-computer interaction method of claim 3, wherein: performing user authentication, comprising the steps of:
the human-computer interaction unit collects user login information and sends the user login information to the cloud data center, if the user login information is abnormal data, user face image data of a current user are collected, the image recognition module carries out face identity verification according to the face image data of the current user, and otherwise the cloud data center judges whether the current user is a known user according to the user login information;
if the current user face image data cannot be matched with the user database of the known user, returning verification failure information to the human-computer interaction unit, otherwise, connecting to the user database corresponding to the user image data and sending verification passing information to the human-computer interaction unit;
if the current user login information is a known user, connecting the corresponding user database and sending verification passing information to the man-machine interaction unit, otherwise, establishing a new user database according to the current user login information and sending the verification passing information to the man-machine interaction unit.
5. The English teaching human-computer interaction method of claim 4, wherein: the method for acquiring English reaction video data in the common interactive mode comprises the following steps:
the user selects a common interaction mode, and the human-computer interaction unit acquires English spoken language video data of the user and sends the English spoken language video data to the semantic analysis unit;
the first audio extraction module extracts oral English audio data in the oral English video data;
the first audio character conversion module converts the spoken English audio data into spoken English character data;
the first semantic recognition module calls a first semantic database according to the English grammar structure to perform semantic analysis on the spoken English character data to obtain corresponding semantic information;
if the first semantic recognition module cannot acquire corresponding semantic information obtained by performing semantic analysis on the spoken English character data, sending the corresponding spoken English video data to a cloud data center for semantic analysis;
and the first reaction video acquisition module calls the first reaction video database according to the semantic information to obtain corresponding English reaction video data.
6. The English teaching human-computer interaction method of claim 5, wherein: the method for acquiring English reaction video data in the intelligent interactive mode comprises the following steps:
the user selects an intelligent interaction mode, and the human-computer interaction unit collects English spoken language video data of the user and sends the English spoken language video data to the cloud data center;
the lip information acquisition module acquires lip information of the English spoken video data, and the second semantic recognition module calls a lip database according to the lip information to perform lip semantic analysis to obtain lip semantic information;
the audio data correction module calls an audio database to correct the oral English audio data to obtain corrected oral English audio data;
the second audio character conversion module converts the corrected spoken English audio data into spoken English character data, and the second semantic recognition module calls a second semantic database according to an English grammar structure to perform semantic analysis on the spoken English character data to obtain corresponding character semantic information;
when the text semantic information and the lip language semantic information match, the text semantic information is preferentially selected, and when they do not match, the lip language semantic information is preferentially selected;
the semantic completion module performs semantic completion on the character semantic information/lip language semantic information according to the English grammar structure to obtain completed semantic information;
the emotion analysis module carries out emotion analysis on the complemented semantic information to obtain user emotion information, and the user image acquisition module acquires a corresponding user portrait according to the spoken English video data and the user emotion information and updates the user portrait of a corresponding user database;
and the second reaction video acquisition module calls a second reaction video database according to the user portrait and the complemented semantic information to obtain corresponding English reaction video data.
7. The English teaching human-computer interaction method of claim 6, wherein: the method for acquiring the lip language information comprises the following steps:
the lip information acquisition module acquires lip images of continuous frames by performing frame interception on lip parts of the English spoken language video data;
performing image recognition according to a plurality of positioning points preset at the lip position of the current lip image to obtain lip angle data;
and traversing the lip images of all the frames, obtaining continuous lip angle change data according to all the lip angle data, and obtaining lip language information according to the continuous lip angle change data.
8. The English teaching human-computer interaction method of claim 7, wherein: acquiring corrected oral English audio data, comprising the following steps:
the central processing unit establishes an audio matching model based on the neural network;
the audio data correction module marks continuous pronunciation nodes of the oral English audio data and cuts the oral English audio data into a plurality of correction audio segments according to the marks;
the audio data correction module calls pre-stored audio segments in an audio database according to the frequency to match a plurality of corrected audio segments to obtain matched pre-stored audio segments and corresponding audio segment semantics, and extracts abnormal audio segments with abnormal audio segment semantics according to the semantic relationship between an English grammar structure and adjacent audio segment semantics;
inputting an abnormal audio segment with abnormal audio segment semantics into an audio matching model, calling an audio database to search and match, obtaining a pre-stored audio segment with the highest frequency matching degree, and replacing the abnormal audio segment;
and if all the abnormal audio sections are replaced, obtaining corrected oral English audio data, and otherwise, repeating the correction steps.
9. The English teaching human-computer interaction method of claim 8, wherein: the emotion analysis is carried out to obtain the corresponding user portrait, and the method comprises the following steps:
the central processing unit establishes an emotion analysis model and a user image acquisition model based on the neural network;
the emotion analysis module extracts multi-mode combination characteristics of the complemented semantic information;
the emotion analysis model fully fuses the multi-mode combination features by using a hierarchical self-attention method to obtain an emotion fusion vector;
judging according to the emotion fusion vector to obtain user emotion information;
the user image acquisition model carries out image recognition on the English spoken language video data to obtain the facial features of the user, and the age and the gender of the user are obtained according to the facial features;
the user image acquisition model identifies the frequency and the tone of the English spoken language audio data to obtain the sound characteristics of the user, and the age, the gender and the tone of the user are obtained according to the sound characteristics;
the user image acquisition model performs keyword analysis on the spoken English character data to obtain keyword characteristics of the user, and the age and the gender of the user are obtained according to the keyword characteristics;
and the user image acquisition module performs comprehensive analysis according to the age, sex, tone and user emotion information of the user to obtain a user portrait, and updates the user portrait of the corresponding user database.
10. The human-computer interaction method for English teaching according to claim 9, wherein: the method for acquiring English reaction video data comprises the following steps:
the central processing unit establishes a response video matching model based on the neural network;
the reaction video matching model calls a second reaction video database to screen according to the complemented semantic information of the current oral English video data to obtain a plurality of alternative English reaction video data;
and screening the plurality of candidate English reaction video data by the reaction video matching model according to the current user portrait to obtain the final English reaction video data.
CN202111128380.3A 2021-09-26 2021-09-26 Man-machine interaction system and method for English teaching Withdrawn CN113837907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128380.3A CN113837907A (en) 2021-09-26 2021-09-26 Man-machine interaction system and method for English teaching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111128380.3A CN113837907A (en) 2021-09-26 2021-09-26 Man-machine interaction system and method for English teaching

Publications (1)

Publication Number Publication Date
CN113837907A true CN113837907A (en) 2021-12-24

Family

ID=78970369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128380.3A Withdrawn CN113837907A (en) 2021-09-26 2021-09-26 Man-machine interaction system and method for English teaching

Country Status (1)

Country Link
CN (1) CN113837907A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708642A (en) * 2022-05-24 2022-07-05 成都锦城学院 Business English simulation training device, system, method and storage medium
CN117275456A (en) * 2023-10-18 2023-12-22 南京龙垣信息科技有限公司 Intelligent listening and speaking training device supporting multiple languages


Similar Documents

Publication Publication Date Title
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN106331893B (en) Real-time caption presentation method and system
CN107657017A (en) Method and apparatus for providing voice service
CN107452379B (en) Dialect language identification method and virtual reality teaching method and system
CN107992195A (en) A kind of processing method of the content of courses, device, server and storage medium
CN110600033B (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN111259976B (en) Personality detection method based on multi-modal alignment and multi-vector characterization
CN113837907A (en) Man-machine interaction system and method for English teaching
CN111428175A (en) Micro-expression recognition-based online course recommendation method and related equipment
CN109933198B (en) Semantic recognition method and device
CN113592251B (en) Multi-mode integrated teaching state analysis system
CN111046148A (en) Intelligent interaction system and intelligent customer service robot
CN115052126A (en) Ultra-high definition video conference analysis management system based on artificial intelligence
CN113920534A (en) Method, system and storage medium for extracting video highlight
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
CN109961789A (en) One kind being based on video and interactive voice service equipment
CN114254096A (en) Multi-mode emotion prediction method and system based on interactive robot conversation
CN210516214U (en) Service equipment based on video and voice interaction
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
CN112309183A (en) Interactive listening and speaking exercise system suitable for foreign language teaching
CN111312211A (en) Dialect speech recognition system based on oversampling technology
CN111914777B (en) Method and system for identifying robot instruction in cross-mode manner
CN115410061B (en) Image-text emotion analysis system based on natural language processing
CN117174092B (en) Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis
CN117809682A (en) Server, display equipment and digital human interaction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20211224)