CN112599115A - Spoken language evaluation system and method thereof - Google Patents

Spoken language evaluation system and method thereof

Info

Publication number
CN112599115A
Authority
CN
China
Prior art keywords
student
voice
information
spoken language
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011299829.8A
Other languages
Chinese (zh)
Inventor
潘晨杰
蔡骋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University filed Critical Shanghai Dianji University
Priority to CN202011299829.8A priority Critical patent/CN112599115A/en
Publication of CN112599115A publication Critical patent/CN112599115A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/01 - Assessment or evaluation of speech recognition systems
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - characterised by the analysis technique
    • G10L 25/30 - characterised by the analysis technique using neural networks
    • G10L 25/48 - specially adapted for particular use
    • G10L 25/51 - specially adapted for particular use for comparison or discrimination
    • G10L 25/78 - Detection of presence or absence of voice signals

Abstract

The invention relates to a spoken language evaluation system and a method thereof. The system comprises a teacher client and a student client, each connected to a server. The teacher client is used by a teacher to manage student information, upload corpus data, upload teacher user information, upload teaching plan information and view the spoken language scores of all students. The student client is used by students to upload student user information, obtain corpus data and teaching plan information, upload the voice to be tested and view the current user's spoken language score. The server stores teacher and student user information, stores corpus data, and evaluates the accuracy, integrity and fluency of the voice to be tested to obtain the spoken language score corresponding to each student user. Compared with the prior art, the invention effectively improves interaction between teacher and students and evaluates the voice to be tested comprehensively and accurately, so that the teacher can promptly, completely and accurately obtain and manage the spoken language evaluation results of all students, and good spoken language teaching can follow.

Description

Spoken language evaluation system and method thereof
Technical Field
The invention relates to the technical field of voice recognition and evaluation, in particular to a spoken language evaluation system and a spoken language evaluation method.
Background
With the spread of emerging teaching modes such as online classrooms and cloud teaching, students can study courses at any time. When it comes to spoken language teaching, however, equipment, venue and class duration make it difficult to run reliable and effective oral tests, teachers cannot promptly, completely and accurately obtain and manage the spoken language learning results of all students, and interaction between teachers and students is poor. This has greatly limited the application of online spoken language teaching, and spoken language teaching cannot be carried out well.
At present there is plenty of client software that lets users practise spoken language on their own, but most of it only supports speech recognition and cannot evaluate the accuracy, integrity and fluency of the user's speech.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a spoken language evaluation system and method that can comprehensively evaluate the spoken language of student users and enable teachers to promptly, completely and accurately obtain and manage the spoken language evaluation results of all students, so that good spoken language teaching can be carried out afterwards.
The purpose of the invention is achieved by the following technical scheme: a spoken language evaluation system comprises a teacher client, student clients and a server, the teacher client and the student clients each being connected to the server; the teacher client is used by a teacher user to manage student information, upload corpus data, upload teacher user information, upload teaching plan information and view the spoken language scores of all students;
the student client is used by a student user to upload student user information, acquire corpus data and teaching plan information, upload the voice to be tested and view the current user's spoken language score;
the server is used for storing teacher user information and student user information, storing corpus data, and evaluating the accuracy, integrity and fluency of the voice to be tested to obtain the spoken language score corresponding to each student user.
Furthermore, a communication module, a database and a voice evaluation module are arranged in the server, the teacher client and the student clients are connected with the database through the communication module, the input end of the voice evaluation module is connected with the student clients through the communication module, the output end of the voice evaluation module is connected with the database, and the voice evaluation module is used for evaluating the accuracy, integrity and fluency of the voice to be tested from the student clients to obtain corresponding spoken scores;
the database is used for storing teacher user information and student user information, storing corpus data and teaching plan information from a teacher client and storing spoken language scores corresponding to the student user information.
Further, the communication module specifically adopts a two-layer C/S architecture, a Serverless architecture or a hybrid architecture, and the hybrid architecture includes a C/S architecture and a Serverless architecture.
Furthermore, the voice evaluation module comprises a preprocessing unit, a feature extraction unit, a mode matching unit and a post-processing unit which are sequentially connected, wherein the preprocessing unit is connected with the student client, the post-processing unit is connected with the database, and the preprocessing unit is used for filtering and framing the voice to be detected;
the feature extraction unit is used for extracting feature vectors from the preprocessed voice signals to be detected;
the pattern matching unit is used for identifying statement information corresponding to the voice to be detected;
and the post-processing unit obtains the accuracy, integrity and fluency scores corresponding to the voice to be detected, namely the spoken language scores, according to the voice to be detected and the corresponding statement information and corpus data thereof.
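For illustration only, the following Python sketch shows how the four units described above (preprocessing, feature extraction, pattern matching, post-processing) could be chained into a single evaluation pipeline. All function names and the placeholder bodies are hypothetical; the sketch only captures the data flow, not the actual implementation of the voice evaluation module.

```python
# Minimal sketch of the evaluation data flow (all names are hypothetical placeholders).
from dataclasses import dataclass
from typing import List

@dataclass
class SpokenScore:
    accuracy: float
    integrity: float
    fluency: float

def preprocess(raw_audio: List[float]) -> List[float]:
    """Filtering and framing of the voice to be tested (placeholder)."""
    return raw_audio

def extract_features(frames: List[float]) -> List[List[float]]:
    """Feature vectors from the preprocessed voice signal (placeholder)."""
    return [[sample] for sample in frames]

def match_pattern(features: List[List[float]]) -> str:
    """Sentence information recognised from the feature vectors (placeholder)."""
    return "recognised sentence"

def postprocess(sentence: str, reference: str) -> SpokenScore:
    """Compare the recognition result with the corpus data and score it (placeholder)."""
    accuracy = 1.0 if sentence == reference else 0.0
    return SpokenScore(accuracy=accuracy, integrity=1.0, fluency=1.0)

def evaluate(raw_audio: List[float], reference: str) -> SpokenScore:
    # preprocessing -> feature extraction -> pattern matching -> post-processing
    return postprocess(match_pattern(extract_features(preprocess(raw_audio))), reference)

print(evaluate([0.0, 0.1, -0.1], "recognised sentence"))
```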
Furthermore, a training model connected with the feature extraction unit is arranged in the pattern matching unit, and the training model is used for recognizing and outputting corresponding statement information according to the feature vector of the voice to be detected.
Further, the training model specifically adopts a model structure combining a hidden Markov model with a neural network.
Further, the training model comprises an acoustic model and a language model, wherein the acoustic model is obtained by training a large amount of voice data, the input of the acoustic model is a voice feature vector, and the output of the acoustic model is phoneme information;
the language model is obtained by training a large amount of text information, and the output is the probability that single characters or words are associated with each other.
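As a toy illustration of how the acoustic-model and language-model outputs described above can be combined, the sketch below scores candidate sentences by adding a (hypothetical) acoustic log-probability to an add-one-smoothed bigram language-model log-probability and keeps the most probable sentence. The tiny training corpus, the candidate sentences and their acoustic scores are invented; the patent only specifies a hidden-Markov-plus-neural-network structure, not this code.

```python
import math
from collections import defaultdict

# Toy bigram language model trained on a tiny text corpus (invented data).
corpus = ["i like spoken english", "i like english tests", "students like spoken english"]
bigram = defaultdict(lambda: defaultdict(int))
unigram = defaultdict(int)
for sent in corpus:
    words = ["<s>"] + sent.split()
    for prev, cur in zip(words, words[1:]):
        bigram[prev][cur] += 1
        unigram[prev] += 1

def lm_logprob(sentence: str) -> float:
    """Add-one smoothed bigram log-probability of a sentence."""
    words = ["<s>"] + sentence.split()
    vocab = len(unigram) + 1
    return sum(math.log((bigram[p][c] + 1) / (unigram[p] + vocab))
               for p, c in zip(words, words[1:]))

# Hypothetical acoustic log-probabilities for two candidate transcriptions.
candidates = {"i like spoken english": -12.0, "i like spoke in english": -11.5}

best = max(candidates, key=lambda s: candidates[s] + lm_logprob(s))
print("most probable sentence:", best)
```

In this toy example the language model outweighs the slightly better acoustic score of the misrecognised candidate, which is the role the language model plays in the system described above.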
A spoken language evaluating method comprises the following steps:
S1, the teacher client acquires the teacher user information, the student list, the teaching plan information and the corpus data, and transmits the acquired data to the server for storage;
S2, the student client acquires the student user information and transmits it to the server for storage;
S3, the student client acquires the corpus data from the server and outputs it to the student user;
S4, the student client receives the voice to be tested from the student user and transmits it to the server;
S5, the server evaluates the accuracy, integrity and fluency of the voice to be tested to obtain a spoken language score, stores the spoken language score in correspondence with the student user information, and outputs the spoken language score to the student client;
S6, the teacher client acquires from the server the spoken language scores corresponding to the students in the student list, so that the teacher user can view the students' spoken language learning results.
Furthermore, the corpus data includes test statement information and corresponding voice data.
Further, the specific process by which the server evaluates the accuracy, integrity and fluency of the voice to be tested in step S5 is as follows:
S51, preprocessing the voice to be tested: denoising the digital signal of the voice by filtering, and performing endpoint detection and segment segmentation on the digital signal to obtain the start point and end point of the effective voice (a minimal endpoint-detection sketch follows this list);
S52, extracting feature vectors capable of representing the voice characteristics from the preprocessed effective voice;
S53, inputting the feature vectors into an acoustic model to obtain the corresponding phoneme string, finding the start and end time of each phoneme and the boundaries between phonemes based on dynamic time warping, and obtaining the text information corresponding to the phonemes by dictionary matching;
S54, inputting the text information corresponding to each phoneme into a language model to obtain the sentence information with the highest probability for the voice to be tested;
S55, comparing the highest-probability sentence information with the test sentence information to generate an accuracy score;
and counting the pronunciation intervals and the pronunciation start and end points of the voice to be tested to generate a completeness score and a fluency score.
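Step S51 calls for endpoint detection and segment segmentation of the digitised voice. A common, simple way to locate the start and end of effective speech is short-time energy thresholding; the NumPy sketch below (with an invented test signal and a hand-picked threshold ratio) is only meant to illustrate that idea and is not taken from the patent.

```python
import numpy as np

def detect_endpoints(signal: np.ndarray, sr: int, frame_ms: int = 25, thresh_ratio: float = 0.1):
    """Return (start_sample, end_sample) of effective speech via short-time energy."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)           # energy per frame
    threshold = thresh_ratio * energy.max()      # hand-picked relative threshold
    voiced = np.where(energy > threshold)[0]
    if voiced.size == 0:
        return None
    return int(voiced[0] * frame_len), int((voiced[-1] + 1) * frame_len)

# Invented test signal: silence, a 440 Hz tone, then silence again.
sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t), np.zeros(sr // 2)])
print(detect_endpoints(signal, sr))  # prints (8000, 24000) for this signal
```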
Compared with the prior art, the invention has the following advantages:
according to the invention, through the arrangement of the teacher client and the student clients which are respectively connected with the server, the teacher user and the student users can access the server at two ends, so that the interactivity between the teacher user and the student users is improved, the student users can acquire the test corpus data uploaded by the teacher user from the server, and the teacher user can timely, completely and accurately acquire the spoken language evaluation scores of all students from the server, so that the effectiveness and reliability of network spoken language teaching tests are greatly improved, the time, place and equipment limitation is avoided, and the teacher user and the student users can independently complete corresponding operations at respective clients.
The server can realize data transmission with the teacher client and the student clients, correspondingly store data information uploaded by the teacher client and the student clients, and simultaneously can comprehensively evaluate the accuracy, integrity and fluency of the voice to be tested, thereby ensuring the comprehensiveness of spoken language evaluation.
Thirdly, preprocessing, feature extraction, pattern matching and post-processing are sequentially carried out on the voice to be tested, pattern matching is carried out on the voice to be tested based on the trained acoustic model and the trained language model, sentence information corresponding to the voice to be tested is obtained through recognition, finally the sentence information obtained through recognition is compared with the tested corpus data, accurate accuracy grading can be obtained, in addition, accurate integrity grading and smoothness grading can be further obtained through statistics of pronunciation intervals, pronunciation starting points and ending points of the voice to be tested, and therefore accuracy of a spoken language evaluation result is guaranteed.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic workflow diagram of a teacher client in an embodiment;
FIG. 3 is a schematic workflow diagram of a student client in an embodiment;
FIG. 4 is a schematic diagram of a workflow of the speech evaluation module in an embodiment;
FIG. 5 is a diagram of a communication module employing a C/S architecture;
FIG. 6 is a diagram of a communication module employing a Serverless architecture;
FIG. 7 is a diagram of a hybrid architecture for a communication module;
FIG. 8 is a schematic diagram illustrating a comparison between page loading speeds of the C/S architecture and the Serverless architecture;
FIG. 9 is a schematic flow chart of the method of the present invention;
The reference numerals in the figures are: 1, teacher client; 2, student client; 3, server; 31, communication module; 32, database; 33, voice evaluation module.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, a spoken language evaluation system includes a teacher client 1, a student client 2 and a server 3, the teacher client 1 and the student client 2 are respectively connected to the server 3, a communication module 31, a database 32 and a voice evaluation module 33 are provided in the server 3, the teacher client 1 and the student client 2 are both connected to the database 32 through the communication module 31, an input end of the voice evaluation module 33 is connected to the student client 2 through the communication module 31, and an output end of the voice evaluation module 33 is connected to the database 32.
The teacher client 1 is used by the teacher user to manage student information, upload corpus data, upload teacher user information, upload teaching plan information and view the spoken language scores of all students. As shown in fig. 2, the teacher client 1 mainly provides functions such as teacher login, class creation, class management, adding student information, managing student information, corpus management, drawing up teaching plans and viewing student scores, and involves the user information database, the voice corpus and the plan information database in the database 32, so that data can be exchanged smoothly with the student clients 2, which is efficient and saves time and labour;
the student client 2 is used by the student user to upload student user information, obtain corpus data and teaching plan information, upload the voice to be tested and view the current user's spoken language score. As shown in fig. 3, the student client 2 mainly provides the following functions: user login, language selection, test type and level selection, listening to the sample reading material, voice recording and testing, recording playback, result page display, plan viewing and history score viewing, and involves the user information database and the voice corpus in the database 32;
the server 3 is used for storing teacher user information and student user information, storing corpus data, and evaluating the accuracy, integrity and fluency of the voice to be tested to obtain the spoken language score corresponding to each student user. The database 32 in the server 3 stores the teacher user information and student user information, stores the corpus data and teaching plan information from the teacher client 1, and stores the spoken language scores corresponding to the student user information;
the voice evaluation module 33 in the server 3 is used to evaluate the accuracy, integrity and fluency of the voice to be tested from the student client 2 and obtain the corresponding spoken language score. The voice evaluation module 33 comprises a preprocessing unit, a feature extraction unit, a pattern matching unit and a post-processing unit connected in sequence; the preprocessing unit is connected to the student client 2, the post-processing unit is connected to the database 32, and the preprocessing unit filters and frames the voice to be tested;
the feature extraction unit extracts feature vectors from the preprocessed voice signal;
the pattern matching unit recognises the sentence information corresponding to the voice to be tested. A training model connected to the feature extraction unit is arranged in the pattern matching unit and recognises and outputs the corresponding sentence information from the feature vectors of the voice. The training model specifically adopts a structure combining a hidden Markov model with a neural network and comprises an acoustic model and a language model: the acoustic model is obtained by training on a large amount of voice data, its input being voice feature vectors and its output being phoneme information, while the language model is obtained by training on a large amount of text information and outputs the probability that single characters or words are associated with each other;
the post-processing unit obtains the accuracy, integrity and fluency scores corresponding to the voice to be tested, i.e. the spoken language score, from the voice to be tested and its corresponding sentence information and corpus data;
specifically, as shown in fig. 4, the voice evaluation module 33 processes the voice signal collected from the hardware; after denoising, feature extraction, construction of the acoustic model and the language model, pattern matching and similar processing, it produces the recognised text information and the related evaluation scores. The processing mainly comprises a preprocessing stage, a model construction stage and a decoding part. In the preprocessing stage the sound signal is filtered and framed, the endpoints of usable speech are detected and features are extracted; feature extraction converts the sound signal from the time domain to the frequency domain to provide suitable feature vectors for the acoustic model, in which the score of each feature vector on the acoustic features is calculated according to the acoustic characteristics. The language model calculates the probability of the possible word sequences corresponding to the sound signal according to linguistic theory. Finally the word sequence is decoded against the existing dictionary to obtain the text information with the highest probability, which is spliced and synthesised; in the scoring part, further data analysis is performed by combining the preprocessed information with the recognised text to obtain the accuracy, integrity and fluency scores.
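The paragraph above notes that feature extraction converts the sound signal from the time domain to the frequency domain so that suitable feature vectors can be handed to the acoustic model. One common way to do this is framing, Hamming windowing and a per-frame FFT magnitude spectrum, as in the NumPy sketch below; the actual features used by the system (for example MFCCs) are not specified in this document, so treat this purely as an illustration of the time-to-frequency step.

```python
import numpy as np

def frame_spectra(signal: np.ndarray, sr: int, frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Split a waveform into overlapping frames and return one magnitude spectrum per frame."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop_len)
    # Each row is a feature vector: the magnitude spectrum of one windowed frame.
    return np.array([np.abs(np.fft.rfft(signal[s:s + frame_len] * window)) for s in starts])

# Example with a synthetic 1-second signal sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
features = frame_spectra(np.sin(2 * np.pi * 300 * t), sr)
print(features.shape)  # (number_of_frames, frame_len // 2 + 1)
```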
In practical application, the communication module 31 in the server 3 may adopt a two-layer C/S architecture, a Serverless architecture or a hybrid architecture combining the two. As shown in fig. 5, when the two-layer C/S architecture is adopted the server is responsible for data management and the client completes the interaction with the user: the client connects to the server over the local area network, receives the user's request and submits it to the server to operate on the database; the server receives the client's request and returns the data, and the client processes the data and presents the result to the user. The server also provides thorough security protection, handles data-integrity operations and allows multiple clients to access it simultaneously. With this dual-end C/S architecture the server can be accessed from both the mobile client and the web client, making maximum use of the server's performance.
As shown in fig. 6, when the Serverless architecture is adopted, i.e. an architecture without a dedicated server, there is no need to purchase a server or to configure virtual or physical machines; computation is hosted in the cloud, so the user does not need to worry about security or about failures caused by server downtime. The Serverless architecture runs on triggers: business logic is triggered into execution, and once a cloud function is connected to the various cloud products or cloud services, events generated by those products or services trigger the business logic. With the Serverless architecture, once the mobile application has been developed all operations are performed on the mobile side and no additional server needs to be set up.
As shown in fig. 7, when the hybrid architecture is adopted the C/S architecture and the Serverless architecture are fused, and the database data is automatically synchronised by virtue of the dual-end interconnection, so that one of the architectures can keep operating without interruption while the other is under maintenance.
In this embodiment, the page loading speeds of the C/S architecture and the Serverless architecture were analysed and compared. As shown in fig. 8, because of its trigger mechanism the Serverless architecture responds somewhat more slowly than the C/S architecture: since its business logic is triggered into execution, the system must transmit variables in the order of the business logic during data transfer, and under the same page load, the more heavily the database is accessed the more obvious the delay of that architecture becomes. The C/S architecture is faster than the Serverless architecture by 41.6% in main page loading and by 20% in overall response. It is therefore proposed to use a hybrid architecture with the C/S architecture as the main structure and the Serverless architecture as an auxiliary means, which provides an emergency response mechanism while guaranteeing performance to the greatest extent.
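As a rough illustration of the proposed hybrid use of the two architectures (C/S as the main path, Serverless as a backup), the client-side sketch below first calls a conventional server endpoint and falls back to a cloud-function endpoint if the primary request fails. Both URLs, the payload and the timeout values are placeholders and are not taken from the patent.

```python
import requests

PRIMARY_URL = "https://example-cs-server.invalid/api/score"      # placeholder C/S endpoint
FALLBACK_URL = "https://example-cloudfn.invalid/score-function"  # placeholder cloud function

def fetch_score(student_id: str) -> dict:
    """Try the C/S server first; fall back to the Serverless endpoint on failure."""
    payload = {"student_id": student_id}
    try:
        resp = requests.post(PRIMARY_URL, json=payload, timeout=3)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Primary path unavailable (e.g. under maintenance): use the cloud function.
        resp = requests.post(FALLBACK_URL, json=payload, timeout=10)
        resp.raise_for_status()
        return resp.json()

if __name__ == "__main__":
    print(fetch_score("student-001"))
```

Putting the C/S endpoint first matches the conclusion above that the C/S architecture should be the main structure and the Serverless architecture an auxiliary means.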
When the system is applied in practice, the specific spoken language evaluation process, shown in fig. 9, comprises the following steps:
S1, the teacher client acquires the teacher user information, the student list, the teaching plan information and the corpus data, and transmits the acquired data to the server for storage;
S2, the student client acquires the student user information and transmits it to the server for storage;
S3, the student client acquires the corpus data from the server and outputs it to the student user, the corpus data comprising test sentence information and the corresponding voice data;
S4, the student client receives the voice to be tested from the student user and transmits it to the server;
S5, the server evaluates the accuracy, integrity and fluency of the voice to be tested to obtain a spoken language score, stores the spoken language score in correspondence with the student user information and outputs it to the student client; the specific evaluation process is as follows:
S51, preprocessing the voice to be tested: denoising the digital signal of the voice by filtering, and performing endpoint detection and segment segmentation on the digital signal to obtain the start point and end point of the effective voice;
S52, extracting feature vectors capable of representing the voice characteristics from the preprocessed effective voice;
S53, inputting the feature vectors into an acoustic model to obtain the corresponding phoneme string, finding the start and end time of each phoneme and the boundaries between phonemes based on dynamic time warping, and obtaining the text information corresponding to the phonemes by dictionary matching;
S54, inputting the text information corresponding to each phoneme into a language model to obtain the sentence information with the highest probability for the voice to be tested;
S55, comparing the highest-probability sentence information with the test sentence information to generate an accuracy score, and counting the pronunciation intervals and the pronunciation start and end points of the voice to be tested to generate a completeness score and a fluency score;
S6, the teacher client acquires from the server the spoken language scores corresponding to the students in the student list, so that the teacher user can view the students' spoken language learning results.
In this embodiment, when the student client was built, the same functions were implemented with two deployment modes, a conventional server and cloud development. Considering mobile-phone version compatibility, the mobile application was deployed quickly and effectively as a mini-program using the mini-program developer tool, and the various technologies of cloud development, including cloud storage, cloud functions and a cloud database, were used to implement a series of functions. Techniques such as data transmission, remote voice playback, remote database access and reading, file transfer, JSON data-string parsing, speech recognition, speech evaluation and phoneme judgement were adopted to put the speech evaluation model into use, so that the speech evaluation feedback of the system is synchronised in real time and the user has a good experience;
when the teacher client was built, a PHP web framework deployed on the server was adopted, with the database handling data management and operation. Domain-name resolution and the CA certificate required by the https protocol were used to ensure the security and applicability of the links, and the web pages were deployed on the PHP-based server to ensure execution efficiency and easy code maintenance. The teacher mainly uploads and publishes the text information and audio files needed for teaching, and the students receive the corresponding exercises in the program, so that interaction is effective;
considering that both the teacher side and the student side access the database, the database was set up on a server and an access port was designed in PHP. Programs send messages to the port, the result value is returned in a res field, the data is restructured as it is collected, and the JSON files are parsed and spliced so that the parameters required by each page can be displayed and accessed normally;
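The access port described above is implemented in PHP and returns its result in a res field. The standard-library Python sketch below mimics that contract (a small HTTP handler that answers every message with a JSON reply wrapped in a "res" field) purely to illustrate the message-in / res-out pattern; the port number and all field names other than "res" are invented.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PortHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON message sent by the client program.
        length = int(self.headers.get("Content-Length", 0))
        message = json.loads(self.rfile.read(length) or b"{}")
        # Reply with the result wrapped in a "res" field, as the clients expect.
        reply = json.dumps({"res": {"echo": message, "status": "ok"}}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    # Listen on an arbitrary example port; the real deployment uses a PHP endpoint instead.
    HTTPServer(("0.0.0.0", 8080), PortHandler).serve_forever()
```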
In order to ensure comprehensive and accurate spoken language evaluation, the voice to be tested is evaluated mainly by the speech evaluation module in the server:
1) firstly, inputting the voice of a learner, collecting clear and complete voice information, converting a voice analog signal collected from hardware into a digital signal, and storing the voice as an mp3 format file with a sampling rate of 22 kHz.
2) Preprocessing the collected files, denoising the collected digital signals, carrying out end point detection and sound segment segmentation, and detecting the initial point of the effective voice.
3) Feature extraction: and extracting characteristic parameters capable of representing the voice characteristics from the preprocessed effective voice through signal processing methods such as information prediction analysis and cepstrum analysis.
4) And training a model, wherein an Acoustic Model (AM) is obtained by training voice data, the input is a feature vector, the output is phoneme information, and a Language Model (LM) obtains the probability of the correlation of single characters or words by training a large amount of text information.
5) Pattern matching:
Acoustic model: a mapping from the voice features to the phonemes is obtained; according to the phonemes recognised by the acoustic model, the corresponding Chinese characters (or words) are found through a "dictionary" (for Chinese, the correspondence between pinyin and characters; for English, the correspondence between phonetic symbols and words). The dictionary thus builds a bridge between the acoustic model and the language model, and connecting the two yields the text information corresponding to the phonemes.
Language model: when the text information converted by the acoustic model is assembled into sentences, the collocation information between adjacent words in the context is used to compute the sentence with the maximum probability, so characters are converted into sentences automatically, the user does not need to choose manually, the problem of many characters sharing the same pronunciation is avoided, and the highest-probability sentence information corresponding to the voice to be tested is obtained.
6) Post-processing: after the best (highest-probability) recognition result has been obtained, the sentence generated from the user's voice to be tested is compared with the standard sentence (namely, the test corpus data) to produce an accuracy score, and the collected audio signal of the voice to be tested is used to count the pronunciation intervals and the pronunciation start and end points to produce a completeness score and a fluency score (a toy scoring sketch follows this list).
7) Finally, the result is encapsulated into a JSON byte stream and output.
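To make the post-processing step concrete, the sketch below computes a toy accuracy score by comparing the recognised sentence with the reference sentence, derives rough completeness and fluency scores from the pronunciation start/end points and pause lengths, and finally packages the result as a JSON byte stream as in step 7). The formulas are invented for illustration; the patent only states which quantities are compared or counted, not how they are turned into numbers.

```python
import difflib
import json
from typing import List

def accuracy_score(recognised: str, reference: str) -> float:
    """Word-level similarity between the recognised sentence and the test sentence."""
    return difflib.SequenceMatcher(None, recognised.split(), reference.split()).ratio()

def completeness_score(speech_start: float, speech_end: float, total_duration: float) -> float:
    """Share of the recording actually covered by effective speech (toy formula)."""
    return max(0.0, min(1.0, (speech_end - speech_start) / total_duration))

def fluency_score(pause_lengths: List[float], max_pause: float = 1.0) -> float:
    """Penalise long pauses between pronunciations (toy formula)."""
    if not pause_lengths:
        return 1.0
    penalty = sum(min(p, max_pause) for p in pause_lengths) / (max_pause * len(pause_lengths))
    return max(0.0, 1.0 - penalty)

def package_result(recognised: str, reference: str, start: float, end: float,
                   duration: float, pauses: List[float]) -> bytes:
    result = {
        "accuracy": round(accuracy_score(recognised, reference), 3),
        "completeness": round(completeness_score(start, end, duration), 3),
        "fluency": round(fluency_score(pauses), 3),
    }
    return json.dumps(result).encode("utf-8")  # JSON byte stream, as in step 7)

print(package_result("i like spoken english", "i like spoken english test",
                     start=0.4, end=4.8, duration=5.0, pauses=[0.3, 0.6]))
```

Running the example prints a small JSON object with the three scores, which is the kind of result payload the student client could display on its result page.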
In conclusion, a dual teacher-and-student workflow is designed that meets the needs of a bilingual student-side page and bilingual content, and the voice module and the evaluation module are merged into a single voice evaluation module, so that speech can be both recognised and evaluated.
The invention designs a software architecture scheme so that a suitable option can be chosen according to user requirements; database data is synchronised automatically by virtue of the dual-end interconnection, so that one architecture can stay deployed and running while the other is under maintenance, which meets the needs of system maintenance and overhaul. The response speed was analysed and studied, leading to the conclusion that different architectures should be chosen according to the user's actual needs.
The invention provides a voice evaluation module that combines speech recognition with parameter evaluation, simplifying the process and giving the module a clear definition, and it can obtain accurate and comprehensive spoken language evaluation results.

Claims (10)

1. A spoken language evaluation system, characterized by comprising a teacher client (1), student clients (2) and a server (3), the teacher client (1) and the student clients (2) each being connected to the server (3), wherein the teacher client (1) is used by a teacher user to manage student information, upload corpus data, upload teacher user information, upload teaching plan information and check the spoken language scores of all students;
the student client (2) is used for uploading student user information, acquiring corpus data and teaching plan information, uploading voice to be tested and checking the spoken score of the current user by a student user;
the server (3) is used for storing teacher user information and student user information, storing corpus data, evaluating the accuracy, integrity and fluency of the voice to be tested and obtaining spoken scores corresponding to the student users.
2. The spoken language evaluation system according to claim 1, wherein a communication module (31), a database (32) and a speech evaluation module (33) are arranged in the server (3), the teacher client (1) and the student client (2) are both connected to the database (32) through the communication module (31), an input end of the speech evaluation module (33) is connected to the student client (2) through the communication module (31), an output end of the speech evaluation module (33) is connected to the database (32), and the speech evaluation module (33) is configured to evaluate accuracy, integrity and fluency of speech to be tested from the student client (2) to obtain a corresponding spoken language score;
the database (32) is used for storing teacher user information and student user information, storing corpus data and teaching plan information from the teacher client (1), and storing spoken scores corresponding to the student user information.
3. The spoken language evaluation system of claim 1, wherein the communication module (31) is implemented with a two-layer C/S architecture, a Serverless architecture, or a hybrid architecture, and wherein the hybrid architecture comprises a C/S architecture and a Serverless architecture.
4. The spoken language evaluation system of claim 2, wherein the speech evaluation module (33) comprises a preprocessing unit, a feature extraction unit, a pattern matching unit, and a post-processing unit, which are connected in sequence, the preprocessing unit is connected to the student client (2), the post-processing unit is connected to the database (32), and the preprocessing unit is configured to filter and frame the speech to be tested;
the feature extraction unit is used for extracting feature vectors from the preprocessed voice signals to be detected;
the pattern matching unit is used for identifying statement information corresponding to the voice to be detected;
and the post-processing unit obtains the accuracy, integrity and fluency scores corresponding to the voice to be detected, namely the spoken language scores, according to the voice to be detected and the corresponding statement information and corpus data thereof.
5. The spoken language evaluation system of claim 4, wherein a training model connected to the feature extraction unit is disposed in the pattern matching unit, and the training model is configured to recognize and output corresponding sentence information according to a feature vector of the speech to be tested.
6. The spoken language evaluation system according to claim 5, wherein the training model specifically adopts a model structure combining a hidden Markov model with a neural network.
7. The spoken language evaluation system of claim 6, wherein the training model comprises an acoustic model and a language model, the acoustic model is obtained by training a large amount of speech data, the input of the acoustic model is speech feature vectors, and the output of the acoustic model is phoneme information;
the language model is obtained by training a large amount of text information, and the output is the probability that single characters or words are associated with each other.
8. A spoken language evaluation method using the spoken language evaluation system according to claim 1, comprising the steps of:
S1, the teacher client acquires teacher user information, a student list, teaching plan information and corpus data, and transmits the acquired information data to the server for storage;
S2, the student client obtains the student user information and transmits the student user information to the server for storage;
S3, the student client acquires the corpus data from the server and outputs the acquired corpus data to the student user;
S4, the student client receives the voice to be tested from the student user and transmits the voice to the server;
S5, the server evaluates the accuracy, integrity and fluency of the voice to be tested to obtain a spoken language score, stores the spoken language score corresponding to the student user information, and outputs the spoken language score to the student client;
and S6, the teacher client acquires the spoken language scores corresponding to the existing students in the student list from the server, so that the teacher user can check the spoken language learning results of the students.
9. The spoken language evaluation method of claim 8, wherein the corpus data includes test sentence information and corresponding speech data.
10. The spoken language evaluation method of claim 9, wherein the specific process by which the server evaluates the accuracy, integrity and fluency of the voice to be tested in step S5 is as follows:
S51, preprocessing the voice to be tested, specifically, denoising the digital signal of the voice to be tested by adopting a filtering method, and carrying out endpoint detection and sound segment segmentation on the digital signal of the voice to be tested to obtain a starting point and an end point of effective voice;
S52, extracting feature vectors capable of representing voice characteristics from the preprocessed effective voice;
S53, inputting the feature vectors into an acoustic model to obtain corresponding phoneme strings, finding out the start-stop time of each phoneme and the demarcation point between the phonemes based on a dynamic time warping method, and obtaining text information corresponding to the phonemes in a dictionary matching mode;
S54, inputting the text information corresponding to each phoneme into a language model to obtain sentence information with the highest probability corresponding to the voice to be tested;
S55, comparing the sentence information with the highest probability corresponding to the voice to be tested with the test sentence information to generate an accuracy score;
and counting the pronunciation interval, the pronunciation starting point and the pronunciation ending point of the voice to be tested, and generating a completeness score and a fluency score.
CN202011299829.8A 2020-11-19 2020-11-19 Spoken language evaluation system and method thereof Pending CN112599115A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011299829.8A CN112599115A (en) 2020-11-19 2020-11-19 Spoken language evaluation system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011299829.8A CN112599115A (en) 2020-11-19 2020-11-19 Spoken language evaluation system and method thereof

Publications (1)

Publication Number Publication Date
CN112599115A true CN112599115A (en) 2021-04-02

Family

ID=75183440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011299829.8A Pending CN112599115A (en) 2020-11-19 2020-11-19 Spoken language evaluation system and method thereof

Country Status (1)

Country Link
CN (1) CN112599115A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006125347A1 (en) * 2005-05-27 2006-11-30 Intel Corporation A homework assignment and assessment system for spoken language education and testing
CN105551328A (en) * 2016-01-28 2016-05-04 北京聚力互信教育科技有限公司 Language teaching coaching and study synchronization integration system on the basis of mobile interaction and big data analysis
CN109785698A (en) * 2017-11-13 2019-05-21 上海流利说信息技术有限公司 Method, apparatus, electronic equipment and medium for spoken language proficiency evaluation and test
CN110148320A (en) * 2019-06-13 2019-08-20 江苏海事职业技术学院 A kind of intelligent interaction type tutoring system for English teaching
CN111223340A (en) * 2020-01-08 2020-06-02 濮阳职业技术学院 Oral English teaching system
CN111915940A (en) * 2020-06-29 2020-11-10 厦门快商通科技股份有限公司 Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李俊等 (Li Jun et al.): "数字语音室的构建与技术应用" [Construction and technical application of the digital language laboratory], 《科技资讯》 (Science & Technology Information) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506572A (en) * 2021-07-08 2021-10-15 东北师范大学 Portable real-time feedback language learning system
CN117174117A (en) * 2023-11-02 2023-12-05 北京烽火万家科技有限公司 English learning-aiding system and method based on virtual digital person

Similar Documents

Publication Publication Date Title
CN101740024B (en) Method for automatic evaluation of spoken language fluency based on generalized fluency
CN106782603B (en) Intelligent voice evaluation method and system
CN110600033B (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN111193834B (en) Man-machine interaction method and device based on user sound characteristic analysis and electronic equipment
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN107886968B (en) Voice evaluation method and system
CN108597538B (en) Evaluation method and system of speech synthesis system
CN112599115A (en) Spoken language evaluation system and method thereof
JP2008158055A (en) Language pronunciation practice support system
Kyriakopoulos et al. A deep learning approach to assessing non-native pronunciation of English using phone distances
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN109584906A (en) Spoken language pronunciation evaluating method, device, equipment and storage equipment
CN111915940A (en) Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
JP2013088552A (en) Pronunciation training device
CN103035244B (en) Voice tracking method capable of feeding back loud-reading progress of user in real time
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
Bangalore et al. Balancing data-driven and rule-based approaches in the context of a multimodal conversational system
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
CN110956142A (en) Intelligent interactive training system
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
CN114241835A (en) Student spoken language quality evaluation method and device
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
CN116340489B (en) Japanese teaching interaction method and device based on big data
TWI743798B (en) Method and apparatus for chinese multiple speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210402)