CN111833853A - Voice processing method and device, electronic equipment and computer readable storage medium


Info

Publication number
CN111833853A
CN111833853A
Authority
CN
China
Prior art keywords
text
voice
score
speech
features
Prior art date
Legal status
Granted
Application number
CN202010630225.0A
Other languages
Chinese (zh)
Other versions
CN111833853B (en)
Inventor
林炳怀
王丽园
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010630225.0A
Publication of CN111833853A
Application granted
Publication of CN111833853B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination
    • G10L15/063 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L2015/0638 - Training; interactive procedures
    • G10L25/69 - Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this application disclose a voice processing method and a voice processing device. The method comprises the following steps: acquiring acoustic parameters and a recognition text obtained by performing recognition processing on a voice; extracting acoustic features of the voice from the acoustic parameters, and extracting text features of the voice from the recognition text; and, according to the score point type associated with the voice, inputting the acoustic features and the text features into the scoring prediction model matched with that score point type, so as to obtain the score value for the voice output by the scoring prediction model according to the acoustic features, the text features and the score point type, where different score point types are matched with different scoring prediction models. This technical scheme avoids scoring errors caused by a mismatch between the score point type associated with the voice and the scoring prediction model, and thereby improves the accuracy of voice scoring.

Description

Voice processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With ongoing research and progress in artificial intelligence, AI technology has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots and smart customer service, where it plays an increasingly important role.
In open spoken-language examination scenarios, intelligent scoring systems based on artificial intelligence have been developed so that spoken-test scores can be obtained conveniently. How to improve the scoring accuracy of such systems, however, remains a technical problem that practitioners in the field must continue to study.
Disclosure of Invention
In order to improve the accuracy of an intelligent scoring system for scoring spoken language of a user, embodiments of the application provide a voice processing method and apparatus, a method and apparatus for scoring a spoken language test, an electronic device, and a computer-readable storage medium.
The technical scheme adopted by this application is as follows:
a method of speech processing comprising: acquiring acoustic parameters and recognition texts obtained by performing recognition processing on voices; extracting acoustic features of the voice according to the acoustic parameters, and extracting text features of the voice according to the recognition text; according to the score point type associated with the voice, the acoustic feature and the text feature are input into a score prediction model matched with the score point type, score values output by the score prediction model according to the acoustic feature, the text feature and the score point type and aiming at the voice are obtained, and the score prediction models matched with different score point types are different.
A scoring method for a spoken language test, comprising: displaying an examination question on a spoken-test interface; recording the voice input for the examination question when an audio recording instruction is detected to be triggered; and displaying, in the spoken-test interface, a score value for the voice, where the score value is obtained by a scoring prediction model matched with the question type of the examination question scoring the voice according to the acoustic features and the text features of the voice and the question type of the examination question.
A voice processing apparatus, comprising: a recognition processing module, configured to acquire acoustic parameters and a recognition text obtained by performing recognition processing on a voice; a feature extraction module, configured to extract acoustic features of the voice from the acoustic parameters and text features of the voice from the recognition text; and a score acquisition module, configured to input the acoustic features and the text features, according to the score point type associated with the voice, into the scoring prediction model matched with that score point type, so as to obtain the score value for the voice output by the scoring prediction model according to the acoustic features, the text features and the score point type, where different score point types are matched with different scoring prediction models.
A scoring device for a spoken language test, comprising: an examination question display module, configured to display an examination question on a spoken-test interface; a voice recording module, configured to record the voice input for the examination question when an audio recording instruction is detected to be triggered; and a score display module, configured to display, in the spoken-test interface, a score value for the voice, where the score value is obtained by a scoring prediction model matched with the question type of the examination question scoring the voice according to the acoustic features and the text features of the voice and the question type of the examination question.
An electronic device comprising a processor and a memory, the memory having stored thereon computer readable instructions which, when executed by the processor, implement a speech processing method or a scoring method for a spoken test as described above.
A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to execute the voice processing method or the scoring method of a spoken test as described above.
In this technical scheme, because the voices associated with different score point types are matched with different scoring prediction models, the extracted acoustic features and text features are input into the scoring prediction model matched with the score point type associated with the voice. That model can therefore score the voice more accurately according to the features specific to that score point type, scoring errors caused by a mismatch between the score point type associated with the voice and the scoring prediction model are avoided, and the accuracy of voice scoring is improved.
In an open spoken-language examination scenario, the voice processing method provided by this application allows the scoring prediction model matched with the question type of an examination question to score the voice recorded for that question, so that the obtained score value takes into account the score points of the question type corresponding to the voice and is therefore more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a schematic diagram illustrating the structure of a speech processing model in accordance with an exemplary embodiment;
FIG. 3 is a diagram of acoustic parameters and acoustic features corresponding to an exemplary speech;
FIG. 4 is a diagram of recognized text and text features for an exemplary phonetic correspondence;
FIG. 5 is a flow diagram illustrating a method of speech processing according to an exemplary embodiment;
fig. 6 is a flow chart illustrating a method of scoring a spoken test according to an exemplary embodiment;
FIG. 7 is an interaction flow diagram illustrating a spoken test interface according to an exemplary embodiment;
FIG. 8 is a block diagram illustrating a speech processing apparatus according to an example embodiment;
fig. 9 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing and machine learning/deep learning.
For example, the key technologies of speech technology include automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel on the basis of speech technology is an important development direction for future human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Because this research involves natural language, i.e. the language people use every day, it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to keep improving their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning.
The application provides a voice processing method and device, a scoring method and device for a spoken language examination, an electronic device and a computer readable storage medium, and particularly relates to a voice processing technology and a machine learning technology in the field of artificial intelligence. The present application describes these methods, apparatuses, electronic devices, and computer-readable storage media in detail by the following embodiments.
It should also be noted that the term "plurality" as used herein is to be understood to mean at least two.
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment related to the present application. The implementation environment includes a terminal 100 and a server 200, and the terminal 100 and the server 200 communicate with each other through a wired or wireless network.
A client that scores voice runs in the terminal 100; when the terminal 100 obtains an input voice, it transmits the voice to the server 200 so that the server 200 scores the voice. For example, in an open spoken-test scenario, a spoken-test client runs in the terminal 100, and the terminal 100 obtains the input voice by recording the user's answer voice. The terminal 100 may be any electronic device capable of running such a client, such as a smart phone, a tablet, a laptop or a desktop computer, which is not limited here.
Multiple scoring prediction models are deployed in server 200, and different scoring prediction models are used to predict scoring values for speech corresponding to different scoring point types. After receiving the voice sent by the terminal 100, the server 200 performs recognition processing on the voice to obtain acoustic parameters and a recognition text, extracts acoustic features of the voice according to the acoustic parameters, extracts text features of the voice according to the recognition text, and then inputs the extracted acoustic features and text features into a scoring prediction model matched with a scoring point type associated with the current voice according to the scoring point type associated with the voice to obtain a scoring value output by the scoring prediction model for the current voice. The server 200 also transmits the obtained value of the credit of the voice to the terminal 100 to cause the terminal 100 to display the value of the credit.
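The server-side flow described in the preceding paragraph can be summarized in a short sketch. The function below is only an illustrative outline under assumed names; the recognizer, the feature extractors and the per-type model registry are placeholders, not components defined by this application:

```python
from typing import Callable, Dict, List, Tuple

def score_speech(
    audio: bytes,
    score_point_type: str,
    recognize: Callable[[bytes], Tuple[dict, str]],   # -> (acoustic parameters, recognition text)
    extract_acoustic: Callable[[dict], List[float]],  # acoustic feature vector
    extract_text: Callable[[str], List[float]],       # text feature vector
    models: Dict[str, object],                        # one scoring model per score point type
) -> float:
    """Score one voice with the model matched to its score point type."""
    acoustic_params, recognition_text = recognize(audio)
    acoustic_features = extract_acoustic(acoustic_params)
    text_features = extract_text(recognition_text)
    model = models[score_point_type]                  # type-matched scoring prediction model
    return float(model.predict([acoustic_features + text_features])[0])
```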
The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, big data and an artificial intelligence platform, which is not limited herein.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a structure of a speech processing model according to an exemplary embodiment. The speech processing model is based on artificial intelligence technology, and can be configured in the server 200 in the implementation environment shown in fig. 1 to score the speech transmitted by the terminal 100.
As shown in fig. 2, the speech processing model includes a speech recognition module 210, an intermediate layer module, and a score prediction module 250, and the intermediate layer module specifically includes an acoustic feature extraction module 220, a text post-processing module 230, and a text feature extraction module 240.
The speech recognition module 210 is configured with a speech recognition algorithm, and is configured to perform recognition processing on an input speech to obtain an acoustic parameter and a recognition text corresponding to the speech. The acoustic parameters are related parameters for describing acoustic characteristics of the speech, and may include, for example, parameters such as a pronunciation duration of each phoneme in the speech, a pronunciation total duration of the speech, a sustained pronunciation time period of each word in the speech, pronunciation intensity, and a lowest frequency in a sound wave corresponding to each word, which are not limited herein. The recognition text is a text sequence consisting of individual words in speech.
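As a purely illustrative reading of this parameter list, the acoustic parameters could be held in a structure such as the following; the field names are assumptions made for this sketch rather than a format prescribed by the application:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WordAcoustics:
    word: str
    start: float          # start of the word's sustained pronunciation period, in seconds
    end: float            # end of that period, in seconds
    intensity: float      # pronunciation intensity of the word
    min_frequency: float  # lowest frequency in the word's sound wave, in Hz

@dataclass
class AcousticParameters:
    total_duration: float                        # total pronunciation duration of the speech
    phoneme_durations: List[Tuple[str, float]]   # (phoneme, pronunciation duration)
    words: List[WordAcoustics]
```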
The middle layer module is used for performing feature extraction on the acoustic parameters and the recognition texts output by the speech recognition module 210 to obtain acoustic features and text features of the speech.
Specifically, the acoustic feature extraction module 220 is configured with one or more acoustic feature extraction algorithms, and is configured to perform feature extraction on the acoustic parameters of the speech output by the speech recognition module 210 to obtain the acoustic features of the speech. The acoustic feature of the speech is a feature expression obtained by extracting features from acoustic parameters of the speech, and may specifically include features such as pronunciation accuracy, pronunciation fluency, pronunciation prosody, and the like of the speech, which is not limited in this respect.
The text post-processing module 230 is configured with one or more text processing algorithms and is used to process the recognition text output by the speech recognition module 210, for example by adding punctuation marks and removing disfluent text components. Punctuation marks may be added to the recognition text based on grammar rules, or by a sequence labeling algorithm from natural language processing, which is not limited here. The disfluent text components contained in the recognition text may include filler words (modal particles), repeated words, words indicating a correction and words indicating a sentence restart.
The text feature extraction module 240 is configured with one or more text feature extraction algorithms and performs text feature extraction on the recognition text output by the text post-processing module 230 to obtain the text features of the speech. A text feature of the speech is a feature expression obtained by feature extraction on the recognition text of the speech, and may include, for example, keyword features, semantic features, pragmatic features and disfluency features of the speech, which is not limited here.
The keyword feature of the speech is a feature for representing a relationship between a keyword contained in the recognized text of the speech and a keyword contained in a standard text corresponding to the speech, and for example, in an open spoken language test scenario, the standard text corresponding to the speech refers to a standard answer text corresponding to an examination question. The disfluency characteristic is related to the proportion of disfluency text components contained in the recognized text of speech.
It should be noted that adding punctuation marks to the recognition text and removing the disfluent text components it contains makes it easier to extract keyword, semantic and grammatical features from the recognition text, so text feature extraction based on the recognition text with punctuation added and disfluent components removed yields more accurate keyword, semantic and pragmatic features of the speech.
In order to facilitate extracting the disfluency features of the speech, removal traces of the disfluent text components may also be left in the recognition text output by the text post-processing module 230, for example by marking each disfluent component with a special symbol, so that the disfluency features of the speech can be obtained from the components marked in the recognition text. For example, an exemplary recognition text output by the text post-processing module 230 is "My favorite sport [is] is [uh] swimming", where the marked "[is]" and "[uh]" both represent disfluent text components in the recognition text.
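Assuming the bracket-marking convention shown in the example, a small helper like the one below could separate the cleaned recognition text from the removal traces; the regular expression and function name are assumptions of this sketch, not part of the application:

```python
import re

MARK = re.compile(r"\[\s*([^\]]+?)\s*\]")   # disfluent components marked as "[...]"

def split_disfluencies(marked_text: str):
    """Return the cleaned text and the list of marked disfluent components."""
    disfluent = MARK.findall(marked_text)                              # e.g. ["is", "uh"]
    cleaned = re.sub(r"\s+", " ", MARK.sub(" ", marked_text)).strip()
    return cleaned, disfluent

clean, traces = split_disfluencies("My favorite sport [is] is [uh] swimming.")
# clean  -> "My favorite sport is swimming."
# traces -> ["is", "uh"]
```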
The score prediction module 250 is configured with a plurality of scoring prediction models, and different scoring prediction models are configured with scoring prediction algorithms for different score point types, so that different models are used to predict the score values of voices associated with different score point types. The scoring prediction algorithm configured in a scoring prediction model may, for example, be a machine-learning regression algorithm such as support vector regression, which is not limited here. The score point type associated with a voice indicates which score points need to be considered when scoring that voice. For example, in an open spoken-test scenario, the score point type associated with the voice recorded for an examination question corresponds to the question type of that question; because the score points differ between question types, scoring the voice according to its score point type makes the scoring accurate.
The score prediction module 250 takes the acoustic features of the speech output by the acoustic feature extraction module 220 and the text features of the speech output by the text feature extraction module 240 as input signals, and performs score prediction on the input signals through a score prediction model matched with the score point type associated with the speech to obtain a score value of the speech.
It should be noted that the score value for the speech output by the scoring prediction model is obtained by performing score prediction on the speech based on its acoustic features, its text features and the score point type associated with it. For example, in one embodiment, the scoring prediction model scores the speech separately in terms of text, pronunciation and the score point type associated with the speech, and obtains the final score value either by a weighted addition of these per-aspect scores or by inputting the per-aspect scores into another machine learning model that outputs the score value. In this way, each scoring prediction model can score the voice based on features that are unique to the score point type it matches.
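The weighted-addition variant mentioned above can be illustrated as follows; the aspect names and weight values are arbitrary assumptions used only to show how a per-type weighting could be applied:

```python
# Illustrative per-type weights; real weights would be learned or tuned, not fixed like this.
ASPECT_WEIGHTS = {
    "talking_on_the_picture": {"text": 0.6, "pronunciation": 0.4},
    "topic_expression":       {"text": 0.5, "pronunciation": 0.5},
}

def combine_aspect_scores(aspect_scores: dict, score_point_type: str) -> float:
    """Weighted addition of per-aspect scores for the matched score point type."""
    weights = ASPECT_WEIGHTS[score_point_type]
    return sum(weights[aspect] * aspect_scores[aspect] for aspect in weights)
```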
It should be further noted that the score point type associated with the speech may be determined according to the actual application scenario, which is not limited in this embodiment. Still taking the open spoken-test scenario as an example, the speech is specifically the user's answer voice, and the score point type associated with it corresponds to the question type of the corresponding examination question, which may include, for example, talking on the picture, topic expression and the like. For a voice corresponding to the talking-on-the-picture type, the scoring prediction model can determine the voice's score on its associated score point type according to the degree of matching between the text content reflected by the text features and the picture content; for a voice corresponding to the topic-expression type, the scoring prediction model can determine that score according to the degree of matching between the text content reflected by the text features and the topic content, or score according to the topic expression ability reflected by the acoustic features.
Each scoring prediction model is trained on the acoustic features and text features of voices corresponding to the plurality of score point types and on the score values set for those voices. The specific training process is as follows: the acoustic features, text features and score values of the voices of each score point type are input into the scoring prediction model matched with that score point type, so that the model keeps learning the relation between the input acoustic and text features and the input score values until the difference between the score values it predicts for the input features and the input score values is smaller than a set threshold.
The score values set for the voices of the multiple score point types are set with the score points of the respective types taken into consideration; in an open spoken-test scenario, for example, the set score values are the scores that teachers give each voice according to its question type. While learning the relation between the input acoustic and text features and the input score values, each scoring prediction model also learns the relation between the output score values and the score point type it matches, so that the trained scoring prediction model can output a score value for a voice according to the acoustic features and text features of the voice and the score point type associated with the voice.
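A minimal training sketch under these descriptions might look as follows, assuming one support vector regression model per score point type; scikit-learn, the error check mirroring the "difference smaller than a set threshold" criterion, and the data layout are all assumptions of this sketch:

```python
import numpy as np
from sklearn.svm import SVR

def train_scoring_models(samples: dict, threshold: float = 0.5) -> dict:
    """Train one scoring regressor per score point type.

    `samples` maps a score point type to (features, teacher_scores), where each
    feature row is the concatenation of one voice's acoustic and text features.
    """
    models = {}
    for score_point_type, (features, teacher_scores) in samples.items():
        X, y = np.asarray(features), np.asarray(teacher_scores)
        model = SVR().fit(X, y)
        mean_error = float(np.mean(np.abs(model.predict(X) - y)))
        if mean_error >= threshold:
            # In practice the features or hyper-parameters would be adjusted and
            # training repeated until the error falls below the set threshold.
            pass
        models[score_point_type] = model
    return models
```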
Therefore, in the speech processing model provided in this embodiment, the intermediate layer module is used to extract the common effective features of different types of speech, and the common effective features extracted by the intermediate layer module are input into the score prediction model matched with the score point type associated with the current speech, so as to predict the score value of the current speech based on the common effective features corresponding to the current speech. Because the scoring prediction model matched with the score point type associated with the current voice can score the voice based on the unique characteristics of the score point type associated with the current voice, scoring errors caused by mismatching of the scoring prediction model and the score point type of the current voice can be avoided, and the accuracy of voice scoring can be greatly improved based on the voice processing model provided by the embodiment.
In the open spoken-test scenario, using the voice processing model provided by this embodiment to score answer voices satisfies the different score point requirements of different question types, so more accurate score values are obtained for the answer voices.
In addition, in order to more clearly understand the structure of the speech processing model proposed in the present application, the following describes in detail the process of extracting the acoustic features and the text features of speech by using the speech processing model according to a specific embodiment.
As shown in fig. 3, in an exemplary embodiment, the acoustic parameters obtained by performing speech recognition on an exemplary speech include a sustained pronunciation time period, pronunciation strength, and lowest frequency in a sound wave corresponding to each word in the speech, and also include parameters such as pronunciation duration of each phoneme in the speech and total pronunciation duration of the speech (not shown in fig. 3).
The acoustic features obtained by feature extraction on the acoustic parameters of the voice include pronunciation accuracy, pronunciation fluency and pronunciation prosody. Pronunciation accuracy is evaluated at the three levels of the phonemes, words and sentences contained in the speech, where the words and sentences can be obtained by combining the phonemes contained in the speech. A sentence contained in the speech may be a long sentence or a short sentence; for example, a long sentence has complete semantics and usually corresponds to a sentence that ends with a period once punctuation has been added, while a short sentence corresponds to one continuous pronunciation segment, and a long sentence may consist of several short sentences.
Through speech recognition, a confidence score can be obtained for each phoneme in the speech; the confidence score represents the recognition accuracy of the phoneme, and from the product of the phoneme confidence scores, the confidence score of each word and of each sentence contained in the speech can be obtained accordingly. For example, if phoneme 1 and phoneme 2 are combined into word A, and the confidence score of phoneme 1 is 0.8 and that of phoneme 2 is 0.9, then the confidence score of word A is 0.72.
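The propagation of confidence scores from phonemes to words and sentences described here reduces to simple products, as in this small sketch reproducing the 0.8 x 0.9 = 0.72 example (function names are illustrative):

```python
from math import prod

def word_confidence(phoneme_confidences):
    return prod(phoneme_confidences)       # product of the word's phoneme confidences

def sentence_confidence(word_confidences):
    return prod(word_confidences)          # product of the sentence's word confidences

assert abs(word_confidence([0.8, 0.9]) - 0.72) < 1e-9
```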
Pronunciation fluency includes speech-rate characteristics during the pronunciation of the speech, for example the average speech rate, the average duration of the pronunciation segments and the average interval duration between pronunciation segments. A pronunciation segment in the speech consists of several continuously pronounced phonemes; for example, one word or several adjacent words are usually pronounced continuously, so the pronunciation period corresponding to that word or those adjacent words is taken as one pronunciation segment. As shown in fig. 3, the acoustic parameters show that the words "my" and "favorite" are pronounced continuously without a pause, so the pronunciation period corresponding to these two words can be regarded as one pronunciation segment, whose pronunciation duration is 0.4 seconds.
Specifically, the average speech rate of the speech can be calculated from the total pronunciation duration of the speech and the total number of phonemes in the speech; the average duration of the pronunciation segments can be calculated from the pronunciation duration of each pronunciation segment in the speech; and the average interval duration of the pronunciation segments can be calculated from the interval duration between two adjacent pronunciation segments. It should be noted that the interval duration between two adjacent pronunciation segments can also be obtained from the acoustic parameters.
Based on parameters such as the average speech rate of the speech, the average duration of the pronunciation segments, the average interval duration of the pronunciation segments, etc., the speech rate score and the pause score of the speech can be determined. For example, a standard speech rate may be preset, and the score corresponding to the standard speech rate is set to be 1, and the speech rate score of the speech may be calculated according to the proportional relationship between the average speech rate of the speech and the standard speech rate. The pause score of the speech can be obtained from the average duration of the pronunciation section and the average interval duration of the pronunciation section, if the average duration of the pronunciation section is longer, the pause score of the speech is higher, and if the average interval duration of the pronunciation section is shorter, the pause score of the speech is also higher.
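A hedged sketch of these rate and pause heuristics is given below; the standard speech rate value and the exact pause formula are assumptions, since the text only fixes the monotonic relations (longer pronunciation segments and shorter intervals give a higher pause score):

```python
def speech_rate_score(total_phonemes: int, total_duration: float,
                      standard_rate: float = 4.0) -> float:
    """Score the speech rate against an assumed standard rate (phonemes per second)."""
    rate = total_phonemes / total_duration
    # 1.0 at the standard rate, smaller the further the rate deviates from it.
    return min(rate, standard_rate) / max(rate, standard_rate)

def pause_score(segment_durations, interval_durations) -> float:
    """Higher when pronunciation segments are long and the pauses between them are short."""
    avg_segment = sum(segment_durations) / len(segment_durations)
    avg_interval = sum(interval_durations) / len(interval_durations)
    return avg_segment / (avg_segment + avg_interval)
```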
In some embodiments, the speech rate score and the pause score of the speech may also be predicted by a machine learning model; for example, parameters such as the total pronunciation duration of the speech, the pronunciation duration of each pronunciation segment and the interval duration between two adjacent pronunciation segments are input into the machine learning model, which then outputs the predicted speech rate score and pause score. The specific way of obtaining the speech rate score and the pause score is not limited here.
Pronunciation prosody covers the scoring of the pronunciation in terms of pronunciation rhythm, correctness of word stress within the sentence, sentence boundary tone and the like. For example, the acoustic parameters of the speech may be input into a prosody evaluation model, which scores the speech on pronunciation rhythm, correctness of word stress and sentence boundary tone according to the input acoustic parameters, thereby producing a rhythm score, a stress score and a boundary-tone score for the speech.
As shown in fig. 4, the recognition text obtained by performing speech recognition on an exemplary speech contains no punctuation marks and contains disfluent text components such as filler words, repeated words, words indicating a correction and words indicating a sentence restart. Punctuation marks therefore need to be added to the recognition text and the disfluent text components removed, while the removal traces of those components are retained, before the text features of the speech are extracted.
The extracted text features may include the disfluency, keyword, semantic and pragmatic features corresponding to the speech. The disfluency feature specifically comprises a text disfluency score, which can be determined from the ratio between the number of disfluent components contained in the recognition text and the total number of words it contains: the smaller the ratio, the smaller the proportion of disfluent components in the speech and therefore the smaller the text disfluency score. In one embodiment, the text disfluency score may also be obtained from a preset machine learning model.
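The ratio-based text disfluency score described above amounts to the following one-liner (the fallback for an empty text is an assumption of this sketch):

```python
def text_disfluency_score(num_disfluent_components: int, total_words: int) -> float:
    """Proportion of disfluent components among all words of the recognition text."""
    return num_disfluent_components / total_words if total_words else 0.0
```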
The keyword features are used for representing the features of the relationship between the keywords contained in the recognition text of the voice and the keywords contained in the standard text corresponding to the voice. In one embodiment, keyword extraction needs to be performed on the recognition text and the standard text corresponding to the voice to obtain keywords corresponding to the recognition text and keywords corresponding to the standard text, then the keywords corresponding to the standard text are used as standard results of the keyword extraction performed on the recognition text, keyword evaluation indexes corresponding to the recognition text are calculated, and the obtained keyword evaluation indexes are used as the keyword features of the voice. For example, the keyword evaluation index may include an accuracy rate and a recall rate.
For example, suppose keyword A and keyword B are extracted from the recognition text. If the keywords extracted from the standard text also contain keyword A but not keyword B, then keyword A is a true positive in the recognition text, while keyword B is a negative sample that was predicted as positive (a false positive). If a keyword C extracted from the standard text does not appear among the keywords of the recognition text, keyword C is a positive sample that was predicted as negative (a false negative). Based on these rules, the different types of keyword samples in the recognition text can be determined, and keyword features such as the precision and the recall can be calculated.
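With the standard-text keywords used as the reference result, the precision and recall described above can be computed as follows; this sketch simply follows the keyword A/B/C example:

```python
def keyword_precision_recall(recognized_keywords, standard_keywords):
    recognized, standard = set(recognized_keywords), set(standard_keywords)
    true_positives = recognized & standard      # e.g. keyword A
    # recognized - standard are false positives (e.g. keyword B);
    # standard - recognized are false negatives (e.g. keyword C).
    precision = len(true_positives) / len(recognized) if recognized else 0.0
    recall = len(true_positives) / len(standard) if standard else 0.0
    return precision, recall

# keyword_precision_recall({"A", "B"}, {"A", "C"}) -> (0.5, 0.5)
```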
Semantic features may include the topic features of the recognition text and its tf-idf (term frequency-inverse document frequency, a technique for assessing the importance of a word to a document in a document set or corpus) features, and the like. Illustratively, the topic features may be obtained by performing topic classification on the recognition text with a topic recognition model, and the tf-idf features may be obtained by feature extraction on the recognition text with a tf-idf feature extraction model, which is not limited here.
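For the tf-idf part, one common way to realize such a feature extractor is scikit-learn's TfidfVectorizer; fitting it on a corpus of recognition texts, as below, is an assumption about how the extractor would be prepared rather than a requirement of this application:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "my favorite sport is swimming",
    "i like playing basketball with my friends",
]
vectorizer = TfidfVectorizer().fit(corpus)
tfidf_features = vectorizer.transform(["my favorite sport is swimming"]).toarray()[0]
```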
The pragmatic features may include the number of word types, the number of sentence patterns and a grammatical accuracy score corresponding to the recognition text, so that the pragmatic features reflect the diversity of words, the diversity of sentence patterns and the grammatical accuracy of the recognition text. The word types in the recognition text can be distinguished by part of speech; for example, nouns, verbs and adjectives belong to different types. The sentence patterns may be common patterns such as subject-verb-object structures and the like. The pragmatic features of the recognition text can be obtained through a preset language model, which is not limited here.
Based on the above detailed description, the speech processing model provided in this application can extract a variety of acoustic features and text features of the speech, and all of these acoustic and text features are input, as the input signals for scoring the speech, into the scoring prediction model matched with the score point type associated with the speech, so that this matched scoring prediction model scores the speech accurately according to these acoustic features and text features.
FIG. 5 is a flow diagram illustrating a method of speech processing according to an example embodiment. The method is proposed based on the speech processing model shown in fig. 2, and may be applied to the implementation environment shown in fig. 1, for example, and is specifically executed by the server 200 in the implementation environment shown in fig. 1.
As shown in fig. 5, in an exemplary embodiment, the speech processing method at least includes the following steps:
step 310, obtaining acoustic parameters and recognition texts obtained by performing recognition processing on the voice.
As mentioned above, the acoustic parameters are related parameters for describing acoustic characteristics of the speech, and may include, for example and without limitation, pronunciation time lengths of respective phonemes in the speech, total pronunciation time lengths of the speech, sustained pronunciation time periods of respective words in the speech, pronunciation strengths, and lowest frequencies of sound waves corresponding to the respective words.
The recognition text is a text sequence composed of individual words in speech. The acoustic parameters and the recognition text corresponding to the speech can be obtained by specifically recognizing the speech by the speech recognition module 210 in the speech processing model shown in fig. 2.
However, it should be noted that, since the speech recognition module 210 recognizes the individual pronunciation units (such as phonemes and words) contained in the speech, the recognition text obtained contains no punctuation marks and may contain disfluent text components.
Step 330, extracting the acoustic features of the voice according to the acoustic parameters, and extracting the text features of the voice according to the recognized text.
As described above, the acoustic features are feature expressions obtained by feature extraction on acoustic parameters of the speech, and may include features such as pronunciation accuracy, pronunciation fluency, pronunciation prosody, and the like of the speech; the text features are feature expressions obtained by feature extraction of the recognized text of the voice, and may include keyword features, semantic features, pragmatic features, disfluency features, and the like of the voice, and the specific feature types of the acoustic features and the text features of the voice are not limited in this place.
The acoustic features of the speech may be obtained by performing feature extraction on the acoustic features of the speech by using the acoustic feature extraction module 220 in the speech processing model shown in fig. 2. The text feature of the speech can be specifically extracted from the recognized text of the speech by the text feature extraction module 240 in the speech processing model shown in fig. 2. The process of extracting the text features of the acoustic features of the speech may specifically refer to the content described in the embodiments corresponding to fig. 3 and fig. 4, and is not described herein again.
Since the recognition text of the speech obtained in step 310 contains no punctuation marks, which is detrimental to the text features extracted from it, in one embodiment punctuation marks are added to the recognition text before its text features are extracted, for example by the text post-processing module 230 in the speech processing model shown in fig. 2, so that the text features are extracted from the recognition text with punctuation marks added.
Likewise, the recognition text obtained in step 310 still contains disfluent text components, which is also detrimental to the text features extracted from it. Therefore, in another embodiment, before the text features are extracted, the disfluent text components are detected in the recognition text and removed from it, so that the text features are extracted from the recognition text with the disfluent components removed. Detecting the disfluent text components contained in the recognition text can likewise be performed by the text post-processing module 230 in the speech processing model shown in fig. 2.
In other embodiments, punctuation marks can be added to the recognition text and the disfluent text components it contains removed at the same time, so as to ensure the accuracy of text feature extraction on the recognition text to the greatest extent.
In addition, in order to extract the disfluency features of the speech, removal traces of the disfluent text components need to be left in the recognition text. For example, each detected disfluent component is marked with a special symbol; when the semantic, pragmatic or keyword features of the speech are extracted, the marked disfluent components are automatically ignored so as to ensure the accuracy of those text features.
And step 350, inputting the acoustic features and the text features, according to the score point type associated with the voice, into the scoring prediction model matched with that score point type, and obtaining the score value for the voice output by the scoring prediction model according to the acoustic features, the text features and the score point type, where different score point types are matched with different scoring prediction models.
As described above, the score point type associated with a voice indicates the score points that need to be considered when scoring that voice, and in an open spoken-test scenario it corresponds to the question type of the examination question. In this embodiment, according to the score point type associated with the voice, the scoring prediction model matched with that score point type can be determined from the scoring prediction module 250 of the speech processing model shown in fig. 2, so that the voice is scored by that model.
The scoring prediction module 250 shown in fig. 2 is configured with a plurality of scoring prediction models. In the process of scoring a voice according to its acoustic and text features, each scoring prediction model also scores on the basis of features unique to the score point type it matches, so that the different models can predict the score values of voices associated with different score point types.
In this embodiment, the acoustic features and the text features extracted in step 330 need to be input into a scoring prediction model matching the score point type associated with the current speech, so as to predict the score value of the current speech according to the input acoustic features and text features through the scoring prediction model matching the score point type associated with the current speech. Because the score value output by the score prediction model and aiming at the voice is obtained by performing score prediction on the voice based on the acoustic characteristics and the text characteristics of the voice and the score point type associated with the voice, the score error caused by mismatching of the score prediction model and the score point type associated with the current voice can be avoided, and the accurate scoring of the voice is realized.
As shown in fig. 6, another exemplary embodiment of the present application provides a scoring method for a spoken language test, which is also applicable to the implementation environment shown in fig. 1 and can be executed specifically by the terminal 100 in that environment. The scoring method for a spoken language test comprises at least the following steps:
step 410, displaying examination questions on a spoken language examination interface;
step 430, recording the voice input aiming at the test question when detecting that the audio recording instruction is triggered;
and step 450, displaying a scoring value aiming at the voice in the oral test interface, wherein the scoring value is obtained by scoring the voice according to the acoustic characteristics and the text characteristics of the voice and the question type of the examination question by a scoring prediction model matched with the question type of the examination question.
First, it should be noted that the method provided in this embodiment is specifically applied to an application scenario of an open spoken language examination, where a spoken language examination interface is a client interface displayed in a terminal, and an intelligent spoken language examination scenario can be implemented based on interaction between the spoken language examination interface and a user.
In the oral test interface on which the examination question is displayed, detecting that an audio recording instruction has been triggered indicates that the user has started an answer operation, so the voice input for the examination question, i.e. the user's answer voice, needs to be recorded.
And because the answer voice of the user is recorded aiming at the test questions displayed in the oral test interface, the recorded answer voice is associated with the question types of the test questions.
For the user's answer voice, the scoring prediction model matched with the question type of the current examination question scores the voice according to the acoustic features and the text features of the voice and that question type, so that the score value corresponding to the voice is obtained and displayed in the oral test interface, which completes the test for the current examination question. For the process of scoring the user's answer voice, please refer to the voice processing process described in the foregoing embodiments, which is not repeated here.
In addition, the scoring of the user's answer voice can be executed by the terminal on which the oral test interface runs; alternatively, after obtaining the answer voice, that terminal can send it to the server, the server performs voice processing on the answer voice to obtain a score value and returns it to the terminal, and the terminal displays the received score value in the oral test interface.
Referring to fig. 7, fig. 7 is an interaction flow diagram illustrating a spoken test interface according to an exemplary embodiment. Wherein, an examination question 1 is displayed in fig. 7(a), and the examination question 1 belongs to the topic expression type, when the user clicks the "start recording" button in fig. 7(a), the terminal detects a triggered audio recording instruction, and further records the answer voice for the examination question 1. When the terminal obtains the score value of the answer speech for the test question 1, the score value of the answer speech for the test question 1 is displayed in the oral test interface shown in fig. 7(b), so that the user can obtain the score of answering the test question 1.
After the answer to test question 1 is completed, the test continues with the next question. As shown in fig. 7(c), the oral test interface then displays test question 2, which belongs to the talking-on-the-picture type. Similarly, when the user clicks the "start recording" button in fig. 7(c), the terminal detects the triggered audio recording instruction and records the answer voice for test question 2. When the terminal obtains the score value of the answer voice for test question 2, it displays that score value in the oral test interface shown in fig. 7(d), and the user obtains the score for answering test question 2.
In this way, the user learns the test score promptly after recording the answer voice, which makes the spoken-test process more intelligent. Moreover, the test result displayed in the oral test interface matches the question type of the examination question, i.e. the score takes into account the fact that different question types have different score points, so the obtained test result is more accurate.
In addition, in order to prove the accuracy of the voice processing model provided by the application to voice scoring, the application adopts a plurality of examination questions of a talking-in-picture class and a plurality of examination questions of a topic expression class respectively to test the voice processing model.
Specifically, voices for answering a plurality of examination questions in the class of talking on the picture are collected in advance, voices for answering a plurality of examination questions in the class of topic expression are collected, a plurality of teachers score the voices based on the examination questions and the types of the examination questions, and the voices are automatically scored through a voice processing model provided by the application. By obtaining the consistency rate of various question types, the effect of the voice processing model for scoring the voices of various question types can be judged.
The consistency rate is understood as the proportion, out of the total number of voices, of those voices for which the difference between the score value output by the voice processing model and the teacher's score value lies within a set threshold. The higher the consistency rate, the better the scoring effect of the voice processing model and the more accurate the score values it outputs.
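Stated as code, the consistency rate used in this evaluation is simply the following proportion (the default threshold value is an assumption of this sketch):

```python
def consistency_rate(model_scores, teacher_scores, threshold: float = 0.5) -> float:
    """Proportion of voices whose model score is within `threshold` of the teacher score."""
    pairs = list(zip(model_scores, teacher_scores))
    within = sum(1 for model, teacher in pairs if abs(model - teacher) <= threshold)
    return within / len(pairs) if pairs else 0.0
```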
In addition, to demonstrate the influence of each processing module contained in the intermediate layer of the voice processing model on the scoring effect, the experiment separately tested automatic scoring with a voice processing model without the acoustic feature extraction module, a voice processing model without the text post-processing module, a voice processing model without the text feature extraction module, and a voice processing model containing the acoustic feature extraction module, the text post-processing module, and the text feature extraction module simultaneously. The consistency rates obtained are shown in Table 1 below:
TABLE 1 [the table content is provided as an image in the original publication and is not reproduced here; it lists the score consistency rate of each ablated voice processing model on each question type]
As can be seen from Table 1, the voice processing model containing the acoustic feature extraction module, the text post-processing module, and the text feature extraction module has the highest score consistency rate on every question type, and therefore the best scoring effect. Because the acoustic features and the text features of the speech are the main feature basis on which the voice processing model scores the speech, the score consistency rate of the model without the acoustic feature extraction module and of the model without the text feature extraction module is low for every question type; this shows that the acoustic features and text features of the speech are indispensable inputs to the score prediction model when it scores speech. The voice processing model without the text post-processing module achieves a relatively high score consistency rate on every question type, but there is still a gap compared with the model containing the acoustic feature extraction module, the text post-processing module, and the text feature extraction module simultaneously.
Based on the above experimental results, the voice processing model provided by the application can be adapted to scoring speech of different score point types, and the acoustic feature extraction module, the text post-processing module, and the text feature extraction module all play an important role in improving the accuracy of speech scoring.
FIG. 8 is a block diagram illustrating a speech processing apparatus according to an example embodiment. The speech processing apparatus is applicable to the implementation environment shown in fig. 1, and may be specifically configured in the server 200 in the implementation environment shown in fig. 1.
As shown in fig. 8, in an exemplary embodiment, the speech processing apparatus includes a recognition processing module 510, a feature extraction module 530, and a score acquisition module 550. The recognition processing module 510 is configured to acquire the acoustic parameters and the recognition text obtained by performing recognition processing on the speech. The feature extraction module 530 is configured to extract the acoustic features of the speech according to the acoustic parameters and to extract the text features of the speech according to the recognition text. The score acquisition module 550 is configured to input the acoustic features and the text features into a score prediction model matched with the score point type associated with the speech, and to obtain the score value for the speech output by the score prediction model according to the acoustic features, the text features, and the score point type; the score prediction models matched with different score point types are different.
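The following minimal sketch (an illustration only, with an assumed model interface rather than the patented implementation) shows how the score acquisition module's routing by score point type could look; models is assumed to map each score point type to its own trained score prediction model exposing a scikit-learn style predict method:

    def score_speech(acoustic_features, text_features, score_point_type, models):
        # Select the score prediction model matched with the score point type of the voice.
        model = models[score_point_type]
        # Concatenate acoustic features and text features into one input vector.
        features = list(acoustic_features) + list(text_features)
        # The model outputs the score value for the voice.
        return float(model.predict([features])[0])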
In another exemplary embodiment, the feature extraction module 530 includes a disfluency component detection unit and a disfluency component removal unit. The disfluency component detection unit is configured to detect disfluent text components in the recognition text. The disfluency component removal unit is configured to remove the disfluent text components contained in the recognition text and to extract the text features based on the recognition text from which the disfluent text components have been removed.
In another exemplary embodiment, the apparatus further includes a punctuation mark adding module configured to add punctuation marks to the recognition text, so that the text features are extracted based on the recognition text to which punctuation marks have been added.
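By way of illustration only, a heavily simplified, rule-based stand-in for the text post-processing described above might look as follows; the filler-word list is an assumption, and in the actual scheme disfluency detection and punctuation restoration would be performed by trained models rather than by these rules:

    FILLERS = {"uh", "um", "er", "hmm"}  # assumed filler words treated as disfluent components

    def remove_disfluencies(tokens):
        # Drop filler words and immediate word repetitions (e.g. "I I think" -> "I think").
        cleaned = []
        for tok in tokens:
            if tok.lower() in FILLERS:
                continue
            if cleaned and tok.lower() == cleaned[-1].lower():
                continue
            cleaned.append(tok)
        return cleaned

Text features would then be extracted from the cleaned (and, in the full scheme, re-punctuated) recognition text.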
In another exemplary embodiment, the feature extraction module 530 includes a first confidence determination unit and a second confidence determination unit. The first confidence determination unit is configured to determine, according to the acoustic parameters, the confidence of each phoneme contained in the speech. The second confidence determination unit is configured to combine the phonemes to obtain the phoneme sets contained in the speech, determine the confidence of each phoneme set based on the confidences of the phonemes, and take the confidence of each phoneme and the confidence of each phoneme set as the acoustic features of the speech.
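As a hedged sketch of this step (not the patented computation; averaging is merely one plausible way to combine phoneme confidences into a phoneme set confidence, and the names are illustrative), the two levels of confidence features could be assembled as follows:

    def confidence_features(phoneme_confidences, set_to_phoneme_indices):
        # phoneme_confidences: confidence of each phoneme, derived from the acoustic parameters.
        # set_to_phoneme_indices: maps each phoneme set (e.g. a word) to the indices of its phonemes.
        set_confidences = [
            sum(phoneme_confidences[i] for i in indices) / len(indices)
            for indices in set_to_phoneme_indices.values()
        ]
        # Both the per-phoneme and per-set confidences serve as acoustic features of the speech.
        return list(phoneme_confidences) + set_confidences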
In another exemplary embodiment, the feature extraction module 530 includes a duration parameter determination unit and a pronunciation fluency determination unit. The duration parameter determination unit is configured to determine the duration parameters of the speech during pronunciation according to the acoustic parameters. The pronunciation fluency determination unit is configured to determine the pronunciation fluency of the speech according to the duration parameters and to take the pronunciation fluency as an acoustic feature of the speech.
In another exemplary embodiment, the pronunciation fluency comprises the average speech rate of the speech, the average duration of the pronunciation segments, and the average interval duration between pronunciation segments; the pronunciation fluency determination unit comprises a time information determination subunit and a fluency feature determination subunit. The time information determination subunit is configured to determine, according to the duration parameters, the total pronunciation duration of the speech, the pronunciation duration of each pronunciation segment contained in the speech, and the interval duration between two adjacent pronunciation segments. The fluency feature determination subunit is configured to determine the average speech rate according to the total pronunciation duration and the total number of phonemes contained in the speech, determine the average pronunciation duration according to the pronunciation duration of each pronunciation segment, and determine the average interval duration of the pronunciation segments according to the interval duration between two adjacent pronunciation segments.
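A minimal sketch of these three fluency features, assuming the duration parameters have already been obtained from the acoustic parameters (the function and argument names are illustrative, not from the original disclosure):

    def fluency_features(total_duration, segment_durations, gap_durations, n_phonemes):
        # total_duration: total pronunciation duration of the speech, in seconds.
        # segment_durations: pronunciation duration of each pronunciation segment.
        # gap_durations: interval duration between each pair of adjacent pronunciation segments.
        # n_phonemes: total number of phonemes contained in the speech.
        avg_speech_rate = n_phonemes / total_duration
        avg_segment_duration = sum(segment_durations) / len(segment_durations)
        avg_gap_duration = sum(gap_durations) / len(gap_durations) if gap_durations else 0.0
        return avg_speech_rate, avg_segment_duration, avg_gap_duration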
In another exemplary embodiment, the acoustic features include pronunciation prosody corresponding to speech; the feature extraction module 530 includes a prosody determining unit, which is configured to input the acoustic parameters into the prosody evaluation model to obtain pronunciation prosody corresponding to the voice evaluated by the prosody evaluation model according to the acoustic parameters.
In another exemplary embodiment, the feature extraction module 530 includes a keyword extraction unit and an evaluation index calculation unit. The keyword extraction unit is used for respectively extracting keywords from the recognition text and the standard text corresponding to the voice to obtain keywords corresponding to the recognition text and keywords corresponding to the standard text. The evaluation index calculation unit is used for taking the keywords corresponding to the standard text as the standard result of keyword extraction of the recognition text, calculating the keyword evaluation index corresponding to the recognition text, and taking the obtained keyword evaluation index as the text feature of the voice.
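This passage does not fix the exact form of the keyword evaluation index; as one assumed instantiation, the precision, recall and F1 of the recognition-text keywords against the standard-text keywords could serve as such indexes:

    def keyword_evaluation(recognition_keywords, standard_keywords):
        # Treat the standard-text keywords as the reference result for keyword extraction.
        rec, std = set(recognition_keywords), set(standard_keywords)
        hits = len(rec & std)
        precision = hits / len(rec) if rec else 0.0
        recall = hits / len(std) if std else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1  # usable as keyword evaluation indexes, i.e. text features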
In another exemplary embodiment, the feature extraction module 530 includes a disfluency number determination unit and a disfluency ratio determination unit. The disfluency number determination unit is configured to determine the number of disfluent text components contained in the recognition text. The disfluency ratio determination unit is configured to determine the text features of the speech according to the ratio between the number of disfluent text components and the total number of words contained in the recognition text.
In another exemplary embodiment, the apparatus further comprises a parameter acquisition module and a model training module. The parameter acquisition module is configured to obtain, for voices corresponding to a plurality of score point types, the acoustic features and text features corresponding to each voice, and to obtain the score value set for each voice. The model training module is configured to input the acoustic features and text features corresponding to each voice and the score value set for that voice into the score prediction model matched with the score point type associated with the voice, so as to train the score prediction model matched with each score point type.
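A hedged sketch of this training step, pairing with the scoring sketch given earlier; the choice of regressor is an assumption, since this passage does not commit to a particular model family:

    from sklearn.ensemble import GradientBoostingRegressor

    def train_score_models(samples):
        # samples: iterable of (score_point_type, acoustic_features, text_features, score_value).
        grouped = {}
        for stype, acoustic, text, score in samples:
            X, y = grouped.setdefault(stype, ([], []))
            X.append(list(acoustic) + list(text))
            y.append(score)
        # Train one score prediction model per score point type.
        return {stype: GradientBoostingRegressor().fit(X, y)
                for stype, (X, y) in grouped.items()}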
In another exemplary embodiment, a scoring device for a spoken language examination is also provided. The scoring device for spoken language examination is applicable to the implementation environment shown in fig. 1, and may be specifically configured in the terminal 100 in the implementation environment shown in fig. 1.
The scoring device of the spoken language test comprises a test question display module, a voice recording module and a scoring display module. The test question display module is used for displaying test questions on a spoken test interface. The voice recording module is used for recording voice input aiming at the test questions when the voice recording instruction is triggered. The scoring display module is used for displaying a scoring value aiming at the voice in the oral test interface, wherein the scoring value is obtained by scoring the voice according to the acoustic characteristic and the text characteristic of the voice and the question type of the test question by a scoring prediction model matched with the question type of the test question.
It should be noted that the apparatus provided in the foregoing embodiment and the method provided in the foregoing embodiment belong to the same concept, and the specific manner in which each module and unit execute operations has been described in detail in the method embodiment, and is not described again here.
Embodiments of the present application also provide an electronic device including a processor and a memory, wherein the memory has stored thereon computer readable instructions, which when executed by the processor, implement the voice processing method or the scoring method of a spoken language test as described above.
Fig. 9 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
It should be noted that the electronic device is merely an example adapted to the application and should not be considered as limiting the scope of use of the application in any way. Nor should the electronic device be construed as having to rely on, or to include, one or more of the components of the exemplary electronic device shown in fig. 9.
As shown in fig. 9, in an exemplary embodiment, the electronic device includes a processing component 801, a memory 802, a power component 803, a multimedia component 804, an audio component 805, a sensor component 807, and a communication component 808. The above components are not all necessary, and the electronic device may add other components or reduce some components according to its own functional requirements, which is not limited in this embodiment.
The processing component 801 generally controls overall operation of the electronic device, such as operations associated with display, data communication, and log data processing. The processing component 801 may include one or more processors 809 to execute instructions to perform all or a portion of the above-described operations. Further, the processing component 801 may include one or more modules that facilitate interaction between the processing component 801 and other components. For example, the processing component 801 may include a multimedia module to facilitate interaction between the multimedia component 804 and the processing component 801.
The memory 802 is configured to store various types of data to support operation at the electronic device, examples of which include instructions for any application or method operating on the electronic device. The memory 802 has stored therein one or more modules configured to be executed by the one or more processors 809 to perform all or part of the steps of the speech processing method or scoring method of a spoken language test described in the above embodiments.
The power supply component 803 provides power to the various components of the electronic device. The power components 803 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for an electronic device.
The multimedia component 804 includes a screen that provides an output interface between the electronic device and the user. In some embodiments, the screen may include a TP (Touch Panel) and an LCD (Liquid Crystal Display). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 805 is configured to output and/or input audio signals. For example, the audio component 805 includes a microphone configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. In some embodiments, the audio component 805 also includes a speaker for outputting audio signals.
The sensor assembly 807 includes one or more sensors for providing various aspects of status assessment for the electronic device. For example, the sensor assembly 807 may detect an open/closed state of the electronic device, and may also detect a temperature change of the electronic device.
The communication component 808 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a Wireless network based on a communication standard, such as Wi-Fi (Wireless-Fidelity, Wireless network).
It will be appreciated that the configuration shown in fig. 9 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 9, or have different components than shown in fig. 9. Each of the components shown in fig. 9 may be implemented in hardware, software, or a combination thereof.
Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice processing method or the scoring method of a spoken test as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist separately without being incorporated in the electronic device.
The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of speech processing, comprising:
acquiring acoustic parameters and a recognition text obtained by performing recognition processing on speech;
extracting acoustic features of the speech according to the acoustic parameters, and extracting text features of the speech according to the recognition text;
according to the score point type associated with the speech, inputting the acoustic features and the text features into a score prediction model matched with the score point type, and obtaining a score value for the speech output by the score prediction model according to the acoustic features, the text features and the score point type, wherein the score prediction models matched with different score point types are different.
2. The method of claim 1, wherein extracting text features of the speech from the recognized text comprises:
detecting an illicit text component in the recognition text;
and removing the disfluency text components contained in the recognition text, and extracting the text features based on the recognition text from which the disfluency text components are removed.
3. The method according to claim 1 or 2, wherein before extracting the text features of the speech from the recognized text, the method further comprises:
punctuation marks are added in the recognition texts, so that the text features are extracted based on the recognition texts added with the punctuation marks.
4. The method of claim 1, wherein extracting the acoustic features of the speech from the acoustic parameters comprises:
determining the confidence of each phoneme contained in the voice according to the acoustic parameters;
and combining the phonemes to obtain the phoneme sets contained in the voice, determining the confidence of each phoneme set based on the confidence of each phoneme, and taking the confidence of each phoneme and the confidence of each phoneme set as the acoustic features of the voice.
5. The method of claim 1, wherein extracting the acoustic features of the speech from the acoustic parameters comprises:
determining a duration parameter of the voice in the pronunciation process according to the acoustic parameter;
and determining pronunciation fluency of the voice according to the duration parameter, and taking the pronunciation fluency as the acoustic feature of the voice.
6. The method of claim 5, wherein the pronunciation fluency comprises an average speech rate, an average duration of pronunciation segments, and an average interval duration of pronunciation segments of the speech; determining pronunciation fluency of the voice according to the duration parameter, comprising:
determining the total pronunciation duration of the voice, the pronunciation duration of each pronunciation section contained in the voice and the interval duration between two adjacent pronunciation sections according to the duration parameter;
determining the average speech speed according to the total pronunciation duration and the total number of phonemes in the speech, determining the average pronunciation duration of each pronunciation section according to the pronunciation duration of each pronunciation section, and determining the average interval duration of each pronunciation section according to the interval duration between two adjacent pronunciation sections.
7. The method of claim 1, wherein the acoustic features include a degree of prosody of pronunciation corresponding to the speech; extracting the acoustic features of the speech according to the acoustic parameters, comprising:
and inputting the acoustic parameters into a prosody evaluation model to obtain the pronunciation prosody degree corresponding to the speech, which is evaluated by the prosody evaluation model according to the acoustic parameters.
8. The method of claim 1, wherein extracting text features of the speech from the recognized text comprises:
respectively extracting keywords from the recognition text and the standard text corresponding to the voice to obtain keywords corresponding to the recognition text and keywords corresponding to the standard text;
and taking the keywords corresponding to the standard text as a standard result of extracting the keywords from the recognition text, calculating the keyword evaluation indexes corresponding to the recognition text, and taking the obtained keyword evaluation indexes as the text features of the voice.
9. The method of claim 1, wherein extracting text features of the speech from the recognized text comprises:
determining the number of disfluency text components contained in the recognition text;
determining text features of the speech based on a ratio between the number of disfluency text components and a total number of words contained in the recognized text.
10. The method of claim 1, further comprising:
aiming at voices corresponding to a plurality of score point types, obtaining acoustic features and text features corresponding to each voice, and obtaining score values set for each voice;
and inputting the acoustic features and the text features corresponding to the voices and the score values set for the voices into score prediction models matched with the score point types associated with the voices so as to train the score prediction models matched with the score point types associated with the voices.
11. The method of claim 1, wherein the speech is speech answered to an examination question in a spoken test, and wherein the speech is associated with a score point type corresponding to a question type of the examination question.
12. A scoring method for a spoken language examination, comprising:
displaying examination questions on a spoken language examination interface;
when an audio recording instruction is detected to be triggered, recording voice input aiming at the test question;
and displaying a scoring value aiming at the voice in the spoken language test interface, wherein the scoring value is obtained by scoring the voice according to the acoustic characteristics and the text characteristics of the voice and the question type of the examination question by a scoring prediction model matched with the question type of the examination question.
13. A speech processing apparatus, comprising:
the recognition processing module is used for acquiring acoustic parameters and recognition texts obtained by performing recognition processing on the voice;
the feature extraction module is used for extracting the acoustic features of the voice according to the acoustic parameters and extracting the text features of the voice according to the recognition text;
and the score acquisition module is used for inputting the acoustic features and the text features into score prediction models matched with the score types according to the score types associated with the voice, so that score values output by the score prediction models according to the acoustic features, the text features and the score types and aiming at the voice are obtained, and the score prediction models matched with different score types are different.
14. An electronic device, comprising:
a memory storing computer readable instructions;
a processor to read computer readable instructions stored by the memory to perform the method of any of claims 1-12.
15. A computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1-12.
CN202010630225.0A 2020-07-01 2020-07-01 Voice processing method and device, electronic equipment and computer readable storage medium Active CN111833853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010630225.0A CN111833853B (en) 2020-07-01 2020-07-01 Voice processing method and device, electronic equipment and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN111833853A true CN111833853A (en) 2020-10-27
CN111833853B CN111833853B (en) 2023-10-27

Family

ID=72899665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010630225.0A Active CN111833853B (en) 2020-07-01 2020-07-01 Voice processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111833853B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110040554A1 (en) * 2009-08-15 2011-02-17 International Business Machines Corporation Automatic Evaluation of Spoken Fluency
CN102354495A (en) * 2011-08-31 2012-02-15 中国科学院自动化研究所 Testing method and system of semi-opened spoken language examination questions
CN102509483A (en) * 2011-10-31 2012-06-20 苏州思必驰信息科技有限公司 Distributive automatic grading system for spoken language test and method thereof
CN103559894A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and system for evaluating spoken language
KR101609473B1 (en) * 2014-10-14 2016-04-05 충북대학교 산학협력단 System and method for automatic fluency evaluation of english speaking tests
CN109192224A (en) * 2018-09-14 2019-01-11 科大讯飞股份有限公司 A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing
CN110797010A (en) * 2019-10-31 2020-02-14 腾讯科技(深圳)有限公司 Question-answer scoring method, device, equipment and storage medium based on artificial intelligence


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466332A (en) * 2020-11-13 2021-03-09 阳光保险集团股份有限公司 Method and device for scoring speed, electronic equipment and storage medium
CN112466332B (en) * 2020-11-13 2024-05-28 阳光保险集团股份有限公司 Method and device for scoring speech rate, electronic equipment and storage medium
CN112738555A (en) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 Video processing method and device
CN112738555B (en) * 2020-12-22 2024-03-29 上海幻电信息科技有限公司 Video processing method and device
CN112863484A (en) * 2021-01-25 2021-05-28 中国科学技术大学 Training method of prosodic phrase boundary prediction model and prosodic phrase boundary prediction method
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113064994A (en) * 2021-03-25 2021-07-02 平安银行股份有限公司 Conference quality evaluation method, device, equipment and storage medium
CN113205729A (en) * 2021-04-12 2021-08-03 华侨大学 Foreign student-oriented speech evaluation method, device and system
CN115346421A (en) * 2021-05-12 2022-11-15 北京猿力未来科技有限公司 Spoken language fluency scoring method, computing device and storage medium
CN113823329A (en) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 Data processing method and computer device
CN113823329B (en) * 2021-07-30 2024-07-09 腾讯科技(深圳)有限公司 Data processing method and computer device
CN113744736B (en) * 2021-09-08 2023-12-08 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium
CN113744736A (en) * 2021-09-08 2021-12-03 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium
CN113539272A (en) * 2021-09-13 2021-10-22 腾讯科技(深圳)有限公司 Voice recognition method and device, storage medium and electronic equipment
CN114155831A (en) * 2021-12-06 2022-03-08 科大讯飞股份有限公司 Voice evaluation method, related equipment and readable storage medium

Also Published As

Publication number Publication date
CN111833853B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN111833853B (en) Voice processing method and device, electronic equipment and computer readable storage medium
Chen et al. Automated scoring of nonnative speech using the speechrater sm v. 5.0 engine
Narayanan et al. Behavioral signal processing: Deriving human behavioral informatics from speech and language
Johar Emotion, affect and personality in speech: The Bias of language and paralanguage
CN107133303A (en) Method and apparatus for output information
CN112307742A (en) Session type human-computer interaction spoken language evaluation method, device and storage medium
CN101551947A (en) Computer system for assisting spoken language learning
CN110675292A (en) Child language ability evaluation method based on artificial intelligence
Wang Detecting pronunciation errors in spoken English tests based on multifeature fusion algorithm
Han et al. [Retracted] The Modular Design of an English Pronunciation Level Evaluation System Based on Machine Learning
CN104572617A (en) Oral test answer deviation detection method and device
CN109697975B (en) Voice evaluation method and device
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
CN112233648A (en) Data processing method, device, equipment and storage medium combining RPA and AI
Shufang Design of an automatic english pronunciation error correction system based on radio magnetic pronunciation recording devices
Hönig Automatic assessment of prosody in second language learning
US20230298615A1 (en) System and method for extracting hidden cues in interactive communications
CN115905475A (en) Answer scoring method, model training method, device, storage medium and equipment
CN115116474A (en) Spoken language scoring model training method, scoring method, device and electronic equipment
Dielen Improving the Automatic Speech Recognition Model Whisper with Voice Activity Detection
CN114241835A (en) Student spoken language quality evaluation method and device
Zhang An automatic assessment method for spoken English based on multimodal feature fusion
Johnson et al. An Analysis of Large Language Models for African American English Speaking Children’s Oral Language Assessment
Desai et al. Virtual Assistant for Enhancing English Speaking Skills
CN115186083B (en) Data processing method, device, server, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant