CN106558252B - Spoken language practice method and device realized by computer
- Publication number: CN106558252B (application CN201510629522.2A)
- Authority: CN (China)
- Prior art keywords: information, user, session, input information, pronunciation
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/06—Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Machine Translation (AREA)
Abstract
The invention aims to provide a spoken language practice method and device implemented by a computer device. In the method, the computer device obtains voice input information of a user, the voice input information corresponding to a specific spoken conversation scenario; provides the user with response information corresponding to the voice input information within the spoken conversation scenario; and, when the session ends, provides the user with feedback information on the session, the feedback information comprising session evaluation information and/or session suggestion information for the user. Compared with the prior art, the method and device of the invention center on a specific spoken conversation scenario and provide response information corresponding to the user's voice input, thereby realizing one-to-one human-machine conversation.
Description
Technical Field
The invention relates to the field of human-computer interaction, and in particular to a spoken language practice method and device implemented by a computer device.
Background
With the development of speech processing and natural language processing technology, computer devices can understand human language with increasing accuracy, and voice input has become an important input mode in human-computer interaction. At the same time, foreign language learning has become an urgent need for many people, and spoken ability is among the most important components of foreign language ability: pronouncing purely and expressing oneself accurately are key indices for measuring foreign language proficiency.
At present, traditional spoken foreign language learning relies on imitation, reading aloud, simulated conversation with other students, and the like, or on one-to-one teaching by a hired teacher.
However, when practicing spoken language through imitation, reading aloud, or simulated dialogue with other students, it is difficult to identify the problems and defects in one's spoken ability, so that ability cannot be improved in a targeted manner. One-to-one teaching by a teacher does allow targeted practice, but it is costly and its effect depends on the teacher's ability.
Disclosure of Invention
The invention aims to provide a spoken language practice method and device implemented by a computer device.
According to an aspect of the present invention, there is provided a spoken language practice method implemented by a computer device, wherein the method comprises the steps of:
-obtaining voice input information of a user, the voice input information corresponding to a specific spoken conversation scenario;
- providing the user with response information corresponding to the voice input information within the spoken conversation scenario;
- when the session is ended, providing the user with feedback information on the session, the feedback information comprising session evaluation information and/or session suggestion information for the user.
According to another aspect of the present invention, there is also provided an apparatus for implementing spoken language practice in a computer device, wherein the apparatus comprises:
-means for obtaining voice input information of a user, the voice input information corresponding to a specific spoken conversation scenario;
- means for providing said user with response information corresponding to said voice input information within said spoken conversation scenario;
- means for providing the user with feedback information on the session when the session is ended, said feedback information comprising session evaluation information and/or session suggestion information for the user.
According to still another aspect of the present invention, there is also provided a spoken language practice system, wherein the system includes a user device and a network device, wherein the network device includes the aforementioned apparatus for implementing spoken language practice in a computer device according to another aspect of the present invention, and the user device includes:
-means for receiving voice input information of a user, the voice input information corresponding to a specific spoken conversation scenario;
- means for exchanging information with said network device, in particular for:
- sending the voice input information to the network device;
- receiving response information corresponding to the voice input information from the network device;
- receiving feedback information on the session from the network device, the feedback information comprising session evaluation information and/or session suggestion information for the user;
-means for feeding back to said user, in particular for:
-providing said response information to said user;
-providing feedback information to the user about this session.
Compared with the prior art, the method and device of the invention center on a specific spoken conversation scenario and provide response information corresponding to the user's voice input, thereby realizing one-to-one human-machine conversation.
In addition, during the conversation the invention can record and analyze the user's problems in pronunciation and semantic expression, and comment on the conversation after it ends, helping the user develop better spoken habits. The user can thus practice spoken foreign language alone, learn the deficiencies of his or her spoken language, and correct them, without one-to-one teaching.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a flow diagram of a method for spoken language practice in accordance with one embodiment of the present invention;
FIG. 2 is a schematic diagram of an apparatus for spoken language practice according to another embodiment of the present invention.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "computer device" or "computer" in this context refers to an intelligent electronic device that can execute predetermined processes such as numerical calculation and/or logic calculation by running predetermined programs or instructions, and may include a processor and a memory, wherein the predetermined processes are executed by the processor by executing program instructions prestored in the memory, or the predetermined processes are executed by hardware such as ASIC, FPGA, DSP, or a combination thereof. Computer devices include, but are not limited to, servers, personal computers, laptops, tablets, smart phones, and the like.
The computer equipment comprises user equipment and network equipment. Wherein the user equipment includes but is not limited to computers, smart phones, PDAs, etc.; the network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing (Cloud Computing) based Cloud consisting of a large number of computers or network servers, wherein Cloud Computing is one of distributed Computing, a super virtual computer consisting of a collection of loosely coupled computers. Wherein the computer device can be operated alone to implement the invention, or can be accessed to a network and implement the invention through interoperation with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
It should be noted that the user equipment, the network device, the network, etc. are only examples; other existing or future computer devices or networks, as applicable to the present invention, are also within its scope and are incorporated herein by reference.
The methods discussed below, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present invention. The present invention may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms "first," "second," etc. may be used herein to describe various apparatus, these apparatus should not be limited by these terms. These terms are used merely to distinguish one device from another. For example, a first apparatus may be referred to as a second apparatus, and similarly, a second apparatus may be referred to as a first apparatus, without departing from the scope of the example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that when an apparatus is referred to as being "connected" or "coupled" to another apparatus, it can be directly connected or coupled to the other apparatus or intervening apparatuses may be present. In contrast, when an apparatus is referred to as being "directly connected" or "directly coupled" to another apparatus, there are no intervening apparatuses present. Other words used to describe the relationship between apparatuses (e.g., "between" versus "directly between," "adjacent" versus "directly adjacent," etc.) should be interpreted in a similar manner.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In a general scenario, the scheme of the present invention is implemented by a network device. Specifically, the network equipment acquires voice input information input by a user, wherein the voice input information corresponds to a specific spoken language conversation scene; then, the network equipment provides the user with response information corresponding to the voice input information of the user in the spoken language conversation scene; then, when the session is ended, the network device provides the user with feedback information about the session, wherein the feedback information comprises session evaluation information and/or session suggestion information of the user.
When cooperating with user equipment, it is the user equipment that carries out the actual interaction with the user. Specifically, the user equipment receives the voice input information entered by the user and sends it to the network device; it receives the response information returned by the network device and presents it to the user; and it receives the feedback information returned by the network device and presents it to the user. The user equipment and network device performing these functions together form a spoken language practice system.
However, it should be understood by those skilled in the art that with the development of computer technology, especially with the improvement of computing/processing capability of user equipment such as smart phones, the present invention can also be implemented by user equipment, at least the user equipment can implement the spoken language practice method of the present invention in several specific spoken language conversation scenarios. For example, the user device stores one or more pieces of relevant data regarding specific spoken conversation scenarios, so that within these spoken conversation scenarios, the user device can implement the answering and feedback of the user's voice input information entirely locally.
For convenience of explanation, the spoken language practice method of the present invention is illustrated as implemented in a network device, however, it should be understood by those skilled in the art that the illustration is only for the purpose of illustrating the present invention and should not be construed as limiting the present invention in any way.
Furthermore, those skilled in the art will appreciate that the present invention is not limited to any particular language: it can be used for spoken language practice in any language, including but not limited to English, Japanese, French, German, Chinese, etc.
The present invention is described in further detail below with reference to the attached drawing figures.
FIG. 1 is a method flow diagram that particularly illustrates the process of a method of spoken language practice, in accordance with one embodiment of the present invention.
As shown in fig. 1, in step S1, the network device acquires voice input information input by the user, the voice input information corresponding to a specific spoken conversation scenario.
For example, first, at the user end, the user equipment receives voice input information entered by the user, the voice input information corresponding to a specific spoken conversation scenario. Specifically, the user may first select a particular spoken conversation scenario and then enter his or her voice input. The user equipment then sends the voice input information to the network device, so that the network device receives the user's voice input information and the corresponding spoken conversation scenario from the user equipment.
Preferably, the network device may also preprocess the voice input information to obtain character string information and pronunciation information contained in the voice input information.
For example, after receiving the voice input information of the user, the network device performs preprocessing, such as noise filtering, on the voice input information to generate secondary data of the voice input information, such as obtaining character string information and pronunciation information contained therein.
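A minimal sketch of such preprocessing, assuming a recognizer object that exposes transcribe and phonemes methods (both hypothetical stand-ins for any ASR engine); the energy-based noise gate is likewise an illustrative simplification, not the filter prescribed by the invention:

```python
import numpy as np

def denoise(samples: np.ndarray, threshold: float = 0.02) -> np.ndarray:
    """Crude noise gate: zero out samples whose amplitude falls below a
    threshold. A real system would use spectral subtraction or a model."""
    return np.where(np.abs(samples) < threshold, 0.0, samples)

def preprocess(samples: np.ndarray, recognizer) -> dict:
    """Filter noise, then extract the character string information and
    pronunciation information that later steps treat as secondary data."""
    clean = denoise(samples)
    return {
        "text": recognizer.transcribe(clean),    # character string information
        "phonemes": recognizer.phonemes(clean),  # pronunciation information
    }
```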
The environment in which the user practices spoken language is often not quiet, and even when it is, the voice input may still be disturbed by external electrical signals. Noise filtering therefore helps the voice input be captured effectively; it also improves the practice itself and the evaluation of the practice, avoiding situations such as erroneous scoring.
After preprocessing the user's voice input, the network device obtains the character string information and pronunciation information contained in it, which can be regarded as secondary data of the voice input. Besides filtering noise, the network device thereby converts the voice input into data that is easier to recognize, and provides the data support needed for the information processing in the subsequent steps.
In step S2, the network device provides the user with response information corresponding to his voice input information within the spoken conversation scenario. Specifically, in a specific spoken language conversation scenario, the network device may provide the user with response information corresponding to the voice input information of the user by performing semantic analysis or keyword matching on the voice input information of the user.
For example, the network device may perform semantic analysis on the user's voice input information, such as by keyword and context analysis, to generate corresponding response information. For another example, the network device matches the response database according to the keywords in the voice input information of the user, and generates corresponding response information according to a predetermined sentence template.
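As a sketch of the keyword-matching variant, one might keep a small response database of keyword-to-template pairs per scenario; the "restaurant" entries below are invented for illustration and are not from the patent:

```python
import re

# Illustrative response database for a hypothetical "restaurant" scenario:
# keyword -> predetermined sentence template.
RESPONSE_DB = {
    "menu": "Of course, here is our menu. Would you like a drink first?",
    "order": "Great choice! Anything else I can get for you?",
    "bill": "Certainly, I will bring your bill right away.",
}

def respond(user_text: str, scenario_db: dict = RESPONSE_DB) -> str:
    """Match keywords in the recognized text against the response database
    and return the predetermined sentence for the first keyword found."""
    words = set(re.findall(r"[a-z']+", user_text.lower()))
    for keyword, sentence in scenario_db.items():
        if keyword in words:
            return sentence
    return "I'm sorry, could you say that again?"

# e.g. respond("Could I see the menu, please?") returns the "menu" sentence
```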
It should be understood by those skilled in the art that the above-mentioned manner of obtaining the response information corresponding to the voice input information of the user is only an example, and the examples are only for the purpose of illustrating the present invention and should not be construed as any limitation to the present invention, and any other existing or future manner of obtaining the response information corresponding to the voice input information of the user may be incorporated by reference as applicable to the present invention.
In step S3, the network device provides the user with feedback information about the session when the session ends, where the feedback information includes session rating information and/or session suggestion information for the user.
For example, the network device may generate the session evaluation information and/or session suggestion information according to the number of sentences in the user's voice input during the session and the semantics of that input. The session evaluation information is, for example, an evaluation of the user's expressive ability in the specific contextual session, such as a score for the user's performance in the session; the session suggestion information is, for example, common example sentences for the scenario of the session. Such feedback indirectly reflects how well the user has mastered the conversation scenario: a user who can quickly express the intended content in few sentences has mastered the scenario well, and can then spend more time, as needed, on scenarios mastered less well. This saves the user's time and helps the user master each conversation scenario better.
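A toy illustration of such session evaluation, where reaching the conversational goal in few sentences scores high; the formula, the ideal count, and the fallback score are assumptions made for this sketch only:

```python
def session_score(num_sentences: int, goal_reached: bool,
                  ideal: int = 6, ceiling: int = 20) -> int:
    """Score a session out of 100: sentences beyond the ideal count are
    penalized linearly, reflecting slower expression of the content."""
    if not goal_reached:
        return 40  # the intended content was never expressed
    excess = max(0, num_sentences - ideal)
    penalty = min(60, 60 * excess // (ceiling - ideal))
    return 100 - penalty

# e.g. session_score(6, True) -> 100, session_score(13, True) -> 70
```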
Further, when the network device sends the feedback information to the user terminal, the user equipment may present it to the user in various forms. For example, it may read out the session score and the common example sentences for the scenario in voice form. It may also present the feedback visually, for example as text or as charts. Furthermore, the user equipment can display and play the feedback to the user at the same time.
After the conversation ends, the invention provides the user with feedback on the session, such as session evaluation information and/or session suggestion information. By reviewing this feedback, the user can conveniently and quickly learn his or her strengths and weaknesses, reinforce the former and improve the latter, and thus learn to express the intended content quickly and accurately.
According to a preferred example of the present invention, after the voice input information input by the user is obtained in step S1, the network device may further analyze the pronunciation of the voice input information to obtain corresponding pronunciation evaluation information and/or pronunciation suggestion information.
Here, the network device may analyze the pronunciation of the voice input information with reference to a standard pronunciation and generate corresponding pronunciation evaluation information and/or pronunciation suggestion information. The pronunciation evaluation information includes, for example, whether a standard pronunciation is satisfied, and the pronunciation suggestion information includes, for example, a standard pronunciation of a word whose pronunciation is not standard or a suggestion for improvement.
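A sketch of pronunciation evaluation against the standard reference, assuming phoneme sequences are available from the preprocessing step; the edit-distance alignment here is an illustrative stand-in for a real acoustic scoring model:

```python
from difflib import SequenceMatcher

def pronunciation_feedback(user_phones, standard_phones):
    """Align the user's phoneme sequence with the standard pronunciation;
    return a similarity score and the mismatched segments as suggestions."""
    m = SequenceMatcher(a=user_phones, b=standard_phones)
    score = round(100 * m.ratio())
    suggestions = [(user_phones[i1:i2], standard_phones[j1:j2])
                   for op, i1, i2, j1, j2 in m.get_opcodes() if op != "equal"]
    return score, suggestions

# e.g. pronunciation_feedback(["s", "i", "t"], ["s", "i:", "t"])
# -> (67, [(["i"], ["i:"])]): the short vowel should be the long "i:"
```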
By analyzing the pronunciation of the user's voice input, the invention can report both the good and the poor aspects of the user's pronunciation and give improvement suggestions, helping the user reinforce the strengths of his or her pronunciation while correcting its defects.
The pronunciation analysis step and the response step in step S2 are not strictly sequential, and may be executed in parallel or in any order.
Subsequently, in S3, the feedback information provided by the network device may further include pronunciation evaluation information and/or pronunciation suggestion information.
Further, the feedback information may also include comprehensive evaluation information determined based on the session evaluation information and the pronunciation evaluation information. For example, the comprehensive evaluation information may be their sum or average; or, with different weights assigned to the session evaluation information and the pronunciation evaluation information, their weighted sum or weighted average.
For the pronunciation suggestion information, the user terminal may play it to the user. For example, if the pronunciation suggestion information is the standard pronunciation of every sentence in the user's voice input, playing it back helps the user perceive comprehensively and intuitively the deficiencies in his or her pronunciation, intonation, and speed, and so improve his or her pronunciation quickly.
Alternatively, the pronunciation suggestion information may include the standard pronunciation only for the sentences that were pronounced non-standardly. By selecting the non-standard sentences in the user's voice input and playing back their standard pronunciation, the invention helps the user improve exactly where the pronunciation falls short.
Preferably, the pronunciation suggestion information may also be presented visually at the same time. For example, while playing the pronunciation of a sentence, the user device may display a speaking video or a diagram of the corresponding standard mouth shape. Showing the standard mouth-shape diagram or video while the sentence is played helps the user pronounce correctly: many sounds or letters are pronounced very similarly, and without a mouth-shape diagram or the like it is difficult for the user to learn the correct pronunciation.
For example, in English the sounds "i:" and "i" are so similar that they are difficult to distinguish by listening alone; but given a mouth-shape diagram together with a few pronunciation tips (for example, that both resemble the Chinese syllable "yi" (衣, "clothes"), with the former drawn out long and the latter cut short), the user can easily master the correct pronunciation.
According to another preferred example of the present invention, after acquiring the voice input information input by the user in step S1, the network device may further analyze the grammar and syntax of the voice input information to obtain corresponding grammar evaluation information and/or grammar suggestion information.
Here, the network device analyzes the grammar and syntax of the voice input and, where it does not meet grammatical norms or the wording is inaccurate, generates corresponding grammar evaluation information and/or grammar suggestion information. The grammar evaluation information indicates, for example, whether there are grammatical errors or inaccurate word choices; the grammar suggestion information includes, for example, improvement suggestions or example sentences for the grammar and words concerned.
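A minimal illustration of rule-based grammar checking, assuming a small set of hand-written rules; a real system would use a parser or a trained grammatical-error-correction model:

```python
import re

# Hand-written illustrative rules: pattern -> (diagnosis, suggestion).
GRAMMAR_RULES = [
    (r"\bhe don't\b", ("subject-verb agreement error", "say 'he doesn't'")),
    (r"\bmore better\b", ("double comparative", "say 'better'")),
]

def grammar_feedback(sentence: str):
    """Return grammar evaluation information (True if no errors were found)
    and grammar suggestion information (an improvement for each match)."""
    suggestions = [(diag, fix) for pattern, (diag, fix) in GRAMMAR_RULES
                   if re.search(pattern, sentence, re.IGNORECASE)]
    return len(suggestions) == 0, suggestions

# e.g. grammar_feedback("He don't like it") -> (False, [(diagnosis, fix)])
```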
By analyzing the grammar of the user's voice input, the invention can report both the good and the poor aspects of the user's grammar and give improvement suggestions, helping the user reinforce the strengths of his or her grammar while correcting its defects.
The grammar analysis step and the response step in step S2 are not strictly ordered, and may be executed in parallel or in any order.
Subsequently, in step S3, the feedback information provided by the network device may further include the grammar evaluation information and/or grammar suggestion information.
Further, the feedback information may also include comprehensive evaluation information determined based on the session evaluation information and the grammar evaluation information. For example, the comprehensive evaluation information may be their sum or average; or, with different weights assigned to the session evaluation information and the grammar evaluation information, their weighted sum or weighted average. For the grammar suggestion information, the user terminal may present it to the user, for example by showing the grammar point at the position of an error or example sentences of the relevant words. Providing example sentences helps the user deepen his or her knowledge of the conversation scenario and learn more grammatically correct sentences for it.
For example, in English, consider the tag question "I don't think your uncle likes the drama series, does he?" Many learners will answer "yes" when agreeing that the uncle dislikes watching TV dramas, but to express agreement here one should say "no"; specifically, a corresponding example sentence such as "No, he doesn't, but he sometimes watches the program." may be given as the answer.
The two aforementioned preferred examples can be further combined.
For example, according to still another preferred example of the present invention, after acquiring the voice input information input by the user at step S1, in addition to providing the user with response information corresponding to the voice input information thereof within the spoken conversation scenario according to the aforementioned step S2, the network device may analyze the pronunciation of the voice input information to obtain corresponding pronunciation evaluation information and/or pronunciation suggestion information, and analyze the syntax and syntax of the voice input information to obtain corresponding grammar evaluation information and/or grammar suggestion information. Subsequently, in S3, the feedback information provided by the network device includes the session evaluation information and/or the session suggestion information, the pronunciation evaluation information and/or the pronunciation suggestion information, and the grammar evaluation information and/or the grammar suggestion information.
The pronunciation analysis step, the grammar analysis step, and the response step in step S2 are not strictly ordered, and may be executed in parallel or in any order. The pronunciation analysis step and the grammar analysis step may be performed during the conversation or after it ends.
Further, the feedback information may also include comprehensive evaluation information determined based on the session evaluation information, the pronunciation evaluation information, and the grammar evaluation information. For example, the comprehensive evaluation information may be the sum or average of the three; or, with different weights assigned to each, their weighted sum or weighted average.
Specifically, the network device scores the session evaluation information, the pronunciation evaluation information, and the grammar evaluation information according to respective preset rules, averages the three scores to obtain a total score, and feeds back and displays the three scores together with the total. Scoring pronunciation, grammar, and scenario conversation separately lets the user know his or her level in each aspect, so that more time can be spent on the weaker aspects or the stronger ones reinforced deliberately; the total score helps the user judge whether the spoken practice for this conversation scenario needs to continue.
For example, if the user scores 50, 70, and 90 on the pronunciation evaluation, grammar evaluation, and session evaluation respectively, the total comprehensive score is 70. The user can then see clearly that the weakest aspect of this session was pronunciation and improve it in a targeted way, while the other two aspects, being above the passing line (60 points), can be relaxed appropriately.
The user equipment may aggregate the evaluation information and present it to the user in chart form. Charts are more intuitive than text: they help the user judge the practice result quickly, without spending time reading the feedback word by word and sentence by sentence, which saves the user's time and adds convenience.
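The comprehensive score in this example can be reproduced with a simple weighted average; the equal weights match the 50/70/90 -> 70 example above, while any other weighting is an assumption:

```python
def overall_score(pronunciation, grammar, session, weights=(1, 1, 1)):
    """Comprehensive evaluation information as a weighted average of the
    three scores; equal weights give the plain average of the example."""
    scores = (pronunciation, grammar, session)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Reproduces the example above: (50 + 70 + 90) / 3 = 70.0
assert overall_score(50, 70, 90) == 70.0
```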
In addition, the process shown in FIG. 1 may further include an early termination step: if the user's voice input deviates from the spoken conversation scenario, the session is ended. Then, in step S3, the network device provides the user with feedback information about the session when it ends.
Wherein the network device may determine that the user's voice input deviates from the spoken conversation scenario of the session when at least one of the following holds:
1) the speech input information cannot be recognized.
Here, the network device determines whether the voice input can be recognized; if not, it concludes that the input does not fit the session scenario and ends the session. Unrecognizable voice input suggests either that the user's pronunciation has serious problems or that the input is unrelated to the current scenario, in which case no effective conversation could take place in a real situation. Ending the session promptly and telling the user why the input could not be recognized helps the user practice effectively and discover what to improve, in pronunciation and elsewhere.
2) The semantics of the speech input information do not correspond to the spoken conversation scenario.
Here, the network device determines whether the semantics of the voice input fit the session scenario, for example by checking whether keywords in the input are related to the scenario; if not, it ends the session. This semantic check tells the user whether the ongoing conversation matches the invoked scenario, and ending a mismatched session avoids wasting the user's time on meaningless practice. Response information is matched and fed back only when the voice input fits the scenario, so the user trains exactly the conversation scenario he or she wants to practice, adapts to it quickly, and improves his or her expressive ability in it.
3) The number of sentences in the voice input exceeds the corresponding preset threshold.
Here, the network device determines whether the number of sentences in the voice input exceeds a preset threshold; if so, it concludes that the input does not fit the session scenario and ends the session. The rationale is that in many situations people must express themselves in a limited number of sentences, and too many sentences annoy the listener; moreover, an excessive sentence count indicates that the user cannot express the intended content quickly. The practice can then be deemed unsatisfactory, and ending the session promptly lets the user summarize, improve, and refine.
4) The number of repetitions of words or sentences in the voice input exceeds the corresponding preset threshold.
Here, the network device determines whether the number of repetitions of words or sentences in the voice input exceeds a preset threshold; if so, it concludes that the input does not fit the session scenario and ends the session. When a sentence is repeated many times, the user can be considered unable to carry the conversation forward, so continuing the meaningless conversation only wastes time. Ending the session at that point saves the user's time and lets the user address the deficiency while it is still fresh; if the session dragged on much longer, the user might well forget it. This helps the user recognize and remedy the shortcoming.
It will be understood by those skilled in the art that the preset threshold for the number of sentences and the preset threshold for the number of repetitions may be set to the same or different values, depending on the specific application; a combined sketch of the four checks follows.
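A sketch combining the four termination checks above; the threshold values and the on_topic flag (standing in for the semantic check) are illustrative assumptions:

```python
from collections import Counter
from typing import List, Optional

def should_terminate(recognized_text: Optional[str], on_topic: bool,
                     sentences: List[str],
                     max_sentences: int = 30, max_repeats: int = 3) -> bool:
    """End the session early when any of the four conditions holds."""
    if recognized_text is None:                 # 1) input not recognizable
        return True
    if not on_topic:                            # 2) semantics off-scenario
        return True
    if len(sentences) > max_sentences:          # 3) too many sentences
        return True
    counts = Counter(s.strip().lower() for s in sentences)
    if counts and max(counts.values()) > max_repeats:   # 4) repetition
        return True
    return False
```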
By analyzing whether the voice input deviates from the current spoken conversation scenario, the invention can quickly judge whether the user's conversation matches the scenario being practiced; if it deviates, the session is ended, letting the user know that the content needs adjusting and helping the user practice quickly and in a targeted way.
FIG. 2 is a schematic diagram of an apparatus according to another embodiment of the present invention, showing in particular an apparatus for spoken language practice (hereinafter the "spoken language practice device").
As shown in FIG. 2, the spoken language practice device 20 is installed in a network device and includes a voice input device 21, a scene analysis device 22, and a session feedback device 23; a structural sketch follows.
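A structural sketch of how the three means might be composed, under the assumption of simple acquire/respond/summarize interfaces; the method names are illustrative, not the patent's:

```python
class SpokenPracticeDevice:
    """Sketch of apparatus 20 composing its three constituent means."""

    def __init__(self, voice_input, scene_analyzer, session_feedback):
        self.voice_input = voice_input              # device 21
        self.scene_analyzer = scene_analyzer        # device 22
        self.session_feedback = session_feedback    # device 23

    def handle_turn(self, raw_audio, scenario):
        """One conversational turn: acquire the input, then respond."""
        info = self.voice_input.acquire(raw_audio)
        return self.scene_analyzer.respond(info, scenario)

    def end_session(self, session_log):
        """On session end, produce the evaluation/suggestion feedback."""
        return self.session_feedback.summarize(session_log)
```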
Specifically, the voice input device 21 acquires voice input information input by the user, the voice input information corresponding to a specific spoken conversation scenario.
For example, first, at a user end, a user device receives voice input information input by a user, the voice input information corresponding to a specific spoken conversation scenario. Specifically, the user may first select a particular spoken dialog scenario and then input his voice input information. Subsequently, the user device sends the voice input information of the user to the network device, so that the voice input device 21 receives the voice input information of the user and the corresponding spoken language conversation scene from the user device.
Preferably, the voice input information may also be preprocessed by the voice input device 21 or other specific devices (not shown) in the spoken language practice device 20 to obtain character string information and pronunciation information contained in the voice input information.
For example, the voice input device 21, after receiving the voice input information of the user, performs preprocessing, such as noise filtering, on the voice input information to generate secondary data of the voice input information, such as obtaining character string information and pronunciation information contained therein.
The environment in which the user practices spoken language is often not quiet, and even when it is, the voice input may still be disturbed by external electrical signals. Noise filtering therefore helps the voice input be captured effectively; it also improves the practice itself and the evaluation of the practice, avoiding situations such as erroneous scoring.
After preprocessing the user's voice input, the voice input device 21 obtains the character string information and pronunciation information contained in it, which can be regarded as secondary data of the voice input. Besides filtering noise, the network device thereby converts the voice input into data that is easier to recognize, and provides the data support needed for the information processing in the subsequent steps.
Subsequently, the scene analysis means 22 provides the user with response information corresponding to the voice input information thereof within the spoken conversation scene. Specifically, in a specific spoken language conversation scenario, the scenario analysis device 22 may provide the user with response information corresponding to the voice input information of the user by performing semantic analysis, keyword matching, or the like on the voice input information of the user.
For example, the scene analysis device 22 may perform semantic analysis on the user's voice input, such as keyword and context analysis, to generate corresponding response information. For another example, it may match the response database according to the keywords in the user's voice input and generate corresponding response information from a predetermined sentence template.
It should be understood by those skilled in the art that the above-mentioned manner of obtaining the response information corresponding to the voice input information of the user is only an example, and the examples are only for the purpose of illustrating the present invention and should not be construed as any limitation to the present invention, and any other existing or future manner of obtaining the response information corresponding to the voice input information of the user may be incorporated by reference as applicable to the present invention.
Then, when the session is ended, the session feedback device 23 provides the user with feedback information about the session, where the feedback information includes session evaluation information and/or session suggestion information for the user.
For example, the session feedback device 23 may generate the session evaluation information and/or session suggestion information according to the number of sentences in the user's voice input during the session and the semantics of that input. The session evaluation information is, for example, an evaluation of the user's expressive ability in the specific contextual session, such as a score for the user's performance in the session; the session suggestion information is, for example, common example sentences for the scenario of the session. Such feedback indirectly reflects how well the user has mastered the conversation scenario: a user who can quickly express the intended content in few sentences has mastered the scenario well, and can then spend more time, as needed, on scenarios mastered less well. This saves the user's time and helps the user master each conversation scenario better.
Alternatively, the session evaluation information and/or session suggestion information may also be generated by the scene analysis device 22, with the session feedback device 23 responsible only for information transfer with the user side.
Further, when the session feedback device 23 sends the feedback information to the user terminal, the user equipment may present it to the user in various forms. For example, it may read out the session score and the common example sentences for the scenario in voice form. It may also present the feedback visually, for example as text or as charts. Furthermore, the user equipment can display and play the feedback to the user at the same time.
After the conversation ends, the invention provides the user with feedback on the session, such as session evaluation information and/or session suggestion information. By reviewing this feedback, the user can conveniently and quickly learn his or her strengths and weaknesses, reinforce the former and improve the latter, and thus learn to express the intended content quickly and accurately.
According to a preferred embodiment of the present invention, the spoken language practice device 20 may further include a pronunciation analysis device (not shown). Specifically, after the voice input device 21 acquires the voice input information input by the user, the pronunciation analysis device may analyze the pronunciation of the voice input information to obtain corresponding pronunciation evaluation information and/or pronunciation suggestion information.
Here, the pronunciation analysis device may analyze the pronunciation of the voice input with reference to a standard pronunciation and generate corresponding pronunciation evaluation information and/or pronunciation suggestion information. The pronunciation evaluation information indicates, for example, whether the standard pronunciation is met; the pronunciation suggestion information includes, for example, the standard pronunciation of non-standardly pronounced words or suggestions for improvement.
By analyzing the pronunciation of the user's voice input, the invention can report both the good and the poor aspects of the user's pronunciation and give improvement suggestions, helping the user reinforce the strengths of his or her pronunciation while correcting its defects.
The pronunciation analysis operation performed by the pronunciation analysis device and the response operation performed by the scene analysis device 22 are not strictly sequential, and may be performed in parallel or in an arbitrary order.
Subsequently, the feedback information provided by the session feedback device 23 may further include the pronunciation evaluation information and/or pronunciation suggestion information.
Further, the feedback information may also include comprehensive evaluation information determined based on the session evaluation information and the pronunciation evaluation information. For example, the comprehensive evaluation information may be their sum or average; or, with different weights assigned to the session evaluation information and the pronunciation evaluation information, their weighted sum or weighted average.
For the pronunciation suggestion information, the user terminal may play it to the user. For example, if the pronunciation suggestion information is the standard pronunciation of every sentence in the user's voice input, playing it back helps the user perceive comprehensively and intuitively the deficiencies in his or her pronunciation, intonation, and speed, and so improve his or her pronunciation quickly.
Alternatively, the pronunciation suggestion information may include the standard pronunciation only for the sentences that were pronounced non-standardly. By selecting the non-standard sentences in the user's voice input and playing back their standard pronunciation, the invention helps the user improve exactly where the pronunciation falls short.
Preferably, the pronunciation suggestion information may also be presented visually at the same time. For example, while playing the pronunciation of a sentence, the user device may display a speaking video or a diagram of the corresponding standard mouth shape. Showing the standard mouth-shape diagram or video while the sentence is played helps the user pronounce correctly: many sounds or letters are pronounced very similarly, and without a mouth-shape diagram or the like it is difficult for the user to learn the correct pronunciation.
For example, in English the sounds "i:" and "i" are so similar that they are difficult to distinguish by listening alone; but given a mouth-shape diagram together with a few pronunciation tips (for example, that both resemble the Chinese syllable "yi" (衣, "clothes"), with the former drawn out long and the latter cut short), the user can easily master the correct pronunciation.
According to another preferred embodiment of the present invention, the spoken language practice device 20 may further include a grammar analysis device (not shown). Specifically, after the voice input device 21 acquires the voice input information input by the user, the grammar analysis device may analyze the grammar and syntax of the voice input information to obtain corresponding grammar evaluation information and/or grammar suggestion information.
Here, the grammar analysis device analyzes the grammar and syntax of the voice input and, where it does not meet grammatical norms or the wording is inaccurate, generates corresponding grammar evaluation information and/or grammar suggestion information. The grammar evaluation information indicates, for example, whether there are grammatical errors or inaccurate word choices; the grammar suggestion information includes, for example, improvement suggestions or example sentences for the grammar and words concerned.
By analyzing the grammar of the user's voice input, the invention can report both the good and the poor aspects of the user's grammar and give improvement suggestions, helping the user reinforce the strengths of his or her grammar while correcting its defects.
The grammar analysis operation performed by the grammar analysis device and the response operation performed by the scene analysis device 22 are not strictly ordered, and may be performed in parallel or in any order.
Subsequently, the feedback information provided by the session feedback device 23 may further include the grammar evaluation information and/or grammar suggestion information.
Further, the feedback information may also include comprehensive evaluation information determined based on the session evaluation information and the grammar evaluation information. For example, the comprehensive evaluation information may be their sum or average; or, with different weights assigned to the session evaluation information and the grammar evaluation information, their weighted sum or weighted average. For the grammar suggestion information, the user terminal may present it to the user, for example by showing the grammar point at the position of an error or example sentences of the relevant words. Providing example sentences helps the user deepen his or her knowledge of the conversation scenario and learn more grammatically correct sentences for it.
For example, in English, consider the tag question "I don't think your uncle likes the drama series, does he?" Many learners will answer "yes" when agreeing that the uncle dislikes watching TV dramas, but to express agreement here one should say "no"; specifically, a corresponding example sentence such as "No, he doesn't, but he sometimes watches the program." may be given as the answer.
The two aforementioned preferred examples can be further combined.
For example, according to still another preferred example of the present invention, the spoken language practice device 20 may further include both a pronunciation analysis device (not shown) and a grammar analysis device (not shown). Specifically, after the voice input device 21 acquires the voice input information input by the user and the scene analysis device 22 provides the user with response information corresponding to that input within the spoken conversation scenario, the pronunciation analysis device may analyze the pronunciation of the voice input to obtain corresponding pronunciation evaluation information and/or pronunciation suggestion information, and the grammar analysis device may analyze its grammar and syntax to obtain corresponding grammar evaluation information and/or grammar suggestion information. Subsequently, the feedback information provided by the session feedback device 23 includes the session evaluation information and/or session suggestion information, the pronunciation evaluation information and/or pronunciation suggestion information, and the grammar evaluation information and/or grammar suggestion information.
The pronunciation analysis operation performed by the pronunciation analysis device, the grammar analysis operation performed by the grammar analysis device, and the response operation performed by the scene analysis device 22 are not strictly ordered, and may be performed in parallel or in any order. The pronunciation analysis and grammar analysis operations may be performed during the conversation or after it ends.
Further, the feedback information may further include a comprehensive evaluation information, and the comprehensive evaluation information is determined based on the conversation evaluation information, the pronunciation evaluation information, and the grammar evaluation information. For example, the overall evaluation information may be the sum or average of the three. For another example, different weights are set for the conversation evaluation information, the pronunciation evaluation information, and the grammar evaluation information, so that the comprehensive evaluation information may be a weighted sum or a weighted average of the three.
Specifically, for example, the conversation feedback device 23 scores the conversation evaluation information, the pronunciation evaluation information and the grammar evaluation information according to respective preset rules, calculates the average of the three scores to obtain a total score, and presents the three scores and the total score as feedback. Scoring the three aspects of pronunciation, grammar and scene conversation lets the user know his or her level in each aspect, so that the user can spend more time on the weaker aspects or deliberately strengthen the stronger ones; the total score, in turn, helps the user judge whether to continue practicing the spoken language of this conversation scenario.
For example, if the user obtains scores of 50, 70 and 90 for the pronunciation evaluation information, the grammar evaluation information and the session evaluation information respectively, the total comprehensive score is 70. The user can then clearly see that the weakest aspect of the session is pronunciation and improve it in a targeted manner, while the other two aspects, which are above the pass line (60 points), can be relaxed appropriately.
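A minimal sketch of this scoring scheme follows (simple average by default, weighted average when weights are supplied; the function name and the 60-point pass line are illustrative assumptions, not prescribed by the patent):

```python
def comprehensive_score(pronunciation, grammar, session, weights=None):
    # Plain average of the three per-aspect scores, or a weighted
    # average when per-aspect weights are given.
    scores = (pronunciation, grammar, session)
    if weights is None:
        return sum(scores) / len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# The example from the description: scores of 50, 70 and 90 average
# to a total of 70, and pronunciation (50) is the only aspect below
# the 60-point pass line.
assert comprehensive_score(50, 70, 90) == 70
```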
After the evaluation information has been collected, the user equipment can present it to the user in chart form. A chart is more intuitive than plain text: it helps the user judge the practice result quickly without reading through the feedback word by word and sentence by sentence, which saves the user's time and is more convenient.
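For instance, a minimal sketch of such a chart presentation (matplotlib is used here purely as an illustrative choice; the patent does not prescribe any particular charting library):

```python
import matplotlib.pyplot as plt

def present_scores(scores):
    # Show the per-aspect scores as a bar chart so the user can
    # compare aspects at a glance instead of reading text.
    fig, ax = plt.subplots()
    ax.bar(list(scores), list(scores.values()))
    ax.axhline(60, linestyle="--", color="gray", label="pass line (60)")
    ax.set_ylim(0, 100)
    ax.set_ylabel("score")
    ax.legend()
    plt.show()

present_scores({"pronunciation": 50, "grammar": 70, "session": 90})
```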
In addition, the spoken language practice device 20 shown in FIG. 2 may also include an early termination device (not shown). If the voice input information of the user deviates from the spoken conversation scenario, the early termination device ends the current session in advance. Subsequently, the session feedback device 23 provides the user with feedback information about the session at the end of the session.
The early termination device may determine that the voice input information of the user deviates from the spoken conversation scenario when at least any one of the following is satisfied:
1) The speech input information cannot be recognized.
Here, the early termination device may determine whether the voice input information can be recognized; if not, it determines that the voice input information does not conform to the session scenario and ends the session. If the voice input information cannot be recognized, the user's pronunciation can be considered to have a serious problem, or the input is irrelevant to the current conversation scenario; in an actual conversation the user would not be able to communicate effectively at all. Ending the session in time and informing the user of the reason the input could not be recognized therefore helps the user practice effectively and identify where pronunciation and similar aspects need improvement.
2) The semantics of the speech input information do not correspond to the spoken conversation scenario.
Here, the early termination device may analyze whether the semantics of the voice input information conform to the session scenario, and terminate the session if they do not. The semantics may be judged by determining whether keywords in the voice input information are related to the session scenario. Through this semantic judgment, the user can be informed whether the ongoing conversation conforms to the selected conversation scenario; if not, the session is ended, avoiding meaningless practice that wastes the user's time. Response information corresponding to the voice input information is matched and fed back only when the input matches the scenario. In this way the user trains precisely on the conversation scenario he or she wants to practice, adapts to it more quickly, and improves his or her language expression ability in that scenario.
3) The number of sentences in the voice input information exceeds the corresponding preset threshold.
Here, the early termination device may determine whether the number of sentences in the voice input information exceeds a preset threshold; if so, it determines that the voice input information does not conform to the session scenario and ends the session. The significance of this check is that in many situations people must express themselves in a limited number of sentences, and speaking at excessive length annoys others. Moreover, an excessive sentence count indicates that the user cannot quickly express the intended content; the conversation practice can then be considered unqualified, and ending the session in time allows the user to summarize, improve and perfect his or her expression.
4) The number of repetitions of words or sentences in the voice input information exceeds the corresponding preset threshold.
Here, the early termination device may determine whether the number of repetitions of words or sentences in the voice input information exceeds a preset threshold; if so, it determines that the voice input information does not conform to the conversation scenario and ends the conversation. The significance of this check is that when a certain sentence is repeated many times, the user can be considered unable to carry the conversation forward, which is why the sentence is being repeated. Continuing such a meaningless conversation only wastes time, whereas ending it at that point saves the user's time and lets the user review the deficiencies in the conversation while they are still fresh; if the session were only ended after a long time, the user might well have forgotten them. Timely termination therefore helps the user identify and remedy those deficiencies.
It will be understood by those skilled in the art that the preset threshold for the number of sentences in the voice input information and the preset threshold for the number of repetitions may be set to the same or different values, depending on the specific application. A minimal sketch of the four checks follows.
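The sketch below assumes that unrecognizable input is represented as None, reduces semantic matching to simple keyword overlap, and uses illustrative thresholds; none of these choices are prescribed by the patent:

```python
def deviates_from_scenario(utterances, scenario_keywords,
                           max_sentences=20, max_repeats=3):
    # Returns a reason string if the session should end early,
    # otherwise None. `utterances` holds the recognized text of each
    # input sentence, with None standing for unrecognizable speech.
    if any(u is None for u in utterances):                    # check 1
        return "speech input could not be recognized"
    words = {w.lower() for u in utterances for w in u.split()}
    if not words & {k.lower() for k in scenario_keywords}:    # check 2
        return "semantics do not match the conversation scenario"
    if len(utterances) > max_sentences:                       # check 3
        return "sentence count exceeds the preset threshold"
    if any(utterances.count(u) > max_repeats
           for u in set(utterances)):                         # check 4
        return "a sentence is repeated more than the preset threshold"
    return None
```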
By analyzing whether the voice input information deviates from the current spoken conversation scenario, the invention can quickly judge whether the user's conversation content conforms to the scenario the user wants to practice. If it deviates, the session is ended, so that the user knows the conversation content needs to be adjusted; this helps the user practice spoken language quickly and in a targeted manner.
It is noted that the present invention may be implemented in software and/or in a combination of software and hardware; for example, the various devices of the invention may be implemented using Application Specific Integrated Circuits (ASICs) or any other similar hardware. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Likewise, the software programs of the present invention (including associated data structures) can be stored in a computer-readable recording medium, such as RAM, a magnetic or optical drive, a floppy disk, or the like. Further, some of the steps or functions of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform the various steps or functions.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims may also be implemented by a single unit or device through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
While exemplary embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the claims. The protection sought herein is as set forth in the claims below.
Claims (14)
1. A method of spoken language practice implemented by a computer device, wherein the method comprises the steps of:
-obtaining voice input information of a user, the voice input information corresponding to a specific spoken conversation scenario;
-providing the user with answer information corresponding to the speech input information within the spoken conversation scenario;
-when the session is over, providing the user with feedback information about the session, the feedback information including session rating information and/or session suggestion information for the user, the session rating information including a rating of the user's ability to express in the spoken session context, the session rating information and/or session suggestion information being generated based on the number of sentences of the user's voice input information in the session and the semantic meaning of the voice input information;
wherein the method further comprises:
-ending the present session if the speech input information deviates from the spoken session context, wherein the speech input information is determined to deviate from the spoken session context when at least any one of:
-the semantics of the speech input information do not correspond to the spoken conversation scenario;
-the number of sentences of the speech input information exceeds its corresponding preset threshold;
-the number of repetitions of a word in said speech input information exceeds its corresponding preset threshold.
2. The method of claim 1, wherein the feedback information further comprises pronunciation evaluation information and/or pronunciation suggestion information for the user;
wherein the method further comprises:
-analyzing the pronunciation of the speech input information to obtain corresponding pronunciation assessment information and/or pronunciation suggestion information.
3. The method of claim 2, wherein the feedback information further includes composite rating information for the user, the composite rating information being determined by the session rating information in combination with the pronunciation rating information.
4. The method of claim 1, wherein the feedback information further includes grammar evaluation information and/or grammar suggestion information for the user;
wherein the method further comprises:
- analyzing the grammar and syntax of the speech input information to obtain corresponding grammar evaluation information and/or grammar suggestion information.
5. The method of claim 4, wherein the feedback information further includes composite rating information for the user, the composite rating information determined by the session rating information in combination with the grammar rating information.
6. The method of claim 1, wherein the step of acquiring the voice input information of the user further comprises:
-preprocessing the speech input information to obtain string information and pronunciation information contained in the speech input information.
7. The method of any one of claims 1 to 6, wherein the feedback information for the session is presented to the user in a visual form.
8. An apparatus for implementing spoken language practice in a computer device, wherein the apparatus comprises:
-means for obtaining voice input information of a user, the voice input information corresponding to a specific spoken conversation scenario;
-means for providing said user with answer information corresponding to said speech input information within said spoken conversation scenario;
-means for providing the user with feedback information about the session when the session is ended, the feedback information comprising session rating information and/or session suggestion information for the user, the session rating information comprising a rating of the user's ability to express in the spoken session context, the session rating information and/or session suggestion information being generated from the number of sentences of the user's speech input information in the session and from an analysis of the semantics of the speech input information;
-means for ending the present conversation if the speech input information deviates from the spoken conversation scenario, wherein the speech input information is determined to deviate from the spoken conversation scenario when at least any one of:
-the semantics of the speech input information do not correspond to the spoken conversation scenario;
-the number of sentences of the speech input information exceeds its corresponding preset threshold;
-the number of repetitions of a word in said speech input information exceeds its corresponding preset threshold.
9. The apparatus of claim 8, wherein the feedback information further comprises pronunciation evaluation information and/or pronunciation suggestion information for the user;
wherein the apparatus further comprises:
-means for analyzing the pronunciation of the speech input information to obtain corresponding pronunciation assessment information and/or pronunciation suggestion information.
10. The apparatus of claim 9, wherein the feedback information further comprises composite rating information for the user, the composite rating information determined by the session rating information in combination with the pronunciation rating information.
11. The apparatus of claim 8, wherein the feedback information further comprises grammar evaluation information and/or grammar suggestion information for the user;
wherein the apparatus further comprises:
-means for parsing the syntax and syntax of said speech input information to obtain corresponding grammar evaluation information and/or grammar recommendation information.
12. The apparatus of claim 11, wherein the feedback information further comprises composite rating information for the user, the composite rating information determined by the session rating information in combination with the grammar rating information.
13. The apparatus of claim 8, wherein the means for obtaining voice input information of the user is further for:
-preprocessing the speech input information to obtain string information and pronunciation information contained in the speech input information.
14. The apparatus of any one of claims 8 to 13, wherein the feedback information for the session is presented to the user in a visual form.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510629522.2A CN106558252B (en) | 2015-09-28 | 2015-09-28 | Spoken language practice method and device realized by computer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106558252A CN106558252A (en) | 2017-04-05 |
CN106558252B true CN106558252B (en) | 2020-08-21 |
Family
ID=58416638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510629522.2A Active CN106558252B (en) | 2015-09-28 | 2015-09-28 | Spoken language practice method and device realized by computer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106558252B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133303A (en) * | 2017-04-28 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | Method and apparatus for output information |
CN107818795B (en) * | 2017-11-15 | 2020-11-17 | 苏州驰声信息科技有限公司 | Method and device for evaluating oral English |
CN110136719B (en) * | 2018-02-02 | 2022-01-28 | 上海流利说信息技术有限公司 | Method, device and system for realizing intelligent voice conversation |
CN109166594A (en) * | 2018-07-24 | 2019-01-08 | 北京搜狗科技发展有限公司 | A kind of data processing method, device and the device for data processing |
CN109035896B (en) * | 2018-08-13 | 2021-11-05 | 广东小天才科技有限公司 | A kind of oral language training method and learning equipment |
CN109493658A (en) * | 2019-01-08 | 2019-03-19 | 上海健坤教育科技有限公司 | Situated human-computer dialogue formula spoken language interactive learning method |
CN110008328A (en) * | 2019-04-04 | 2019-07-12 | 福建奇点时空数字科技有限公司 | A kind of method and apparatus that automatic customer service is realized by human-computer interaction technology |
CN112232083A (en) * | 2019-08-23 | 2021-01-15 | 上海松鼠课堂人工智能科技有限公司 | Man-machine conversation spoken language evaluation system |
CN115168641A (en) * | 2022-07-18 | 2022-10-11 | 东莞市步步高教育软件有限公司 | Scene dialogue training method and device, readable storage medium and terminal equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101042716A (en) * | 2006-07-13 | 2007-09-26 | 东莞市步步高教育电子产品有限公司 | Electric pet entertainment learning system and method thereof |
CN101366065A (en) * | 2005-11-30 | 2009-02-11 | 语文交流企业公司 | Interactive language education system and method |
CN103065626A (en) * | 2012-12-20 | 2013-04-24 | 中国科学院声学研究所 | Automatic grading method and automatic grading equipment for read questions in test of spoken English |
CN103714727A (en) * | 2012-10-06 | 2014-04-09 | 南京大五教育科技有限公司 | Man-machine interaction-based foreign language learning system and method thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7149690B2 (en) * | 1999-09-09 | 2006-12-12 | Lucent Technologies Inc. | Method and apparatus for interactive language instruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |