CN110136719B - Method, device and system for realizing intelligent voice conversation - Google Patents


Info

Publication number
CN110136719B
CN110136719B (application CN201810105481.0A)
Authority
CN
China
Prior art keywords
text
determining
language
preset
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810105481.0A
Other languages
Chinese (zh)
Other versions
CN110136719A (en)
Inventor
翁翔坚
林晖
刘翔
韩旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Liulishuo Information Technology Co ltd
Original Assignee
Shanghai Liulishuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Liulishuo Information Technology Co ltd filed Critical Shanghai Liulishuo Information Technology Co ltd
Priority to CN201810105481.0A priority Critical patent/CN110136719B/en
Publication of CN110136719A publication Critical patent/CN110136719A/en
Application granted granted Critical
Publication of CN110136719B publication Critical patent/CN110136719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a device and a system for realizing intelligent voice conversation, wherein the method comprises the following steps: receiving a voice signal recorded by a client; converting the voice signal into a voice text; determining semantics corresponding to the voice text; determining language logic corresponding to the semantics; determining a dialog text corresponding to the language logic; synthesizing an audio file corresponding to the dialog text; and sending the audio file to the client. By applying the embodiments of the invention, English learning time becomes flexible, the cost is low, the restrictions on user answers are small, and an intelligent human-computer interactive learning experience is provided for the user.

Description

Method, device and system for realizing intelligent voice conversation
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device and a system for realizing intelligent voice conversation.
Background
As people pay more and more attention to English learning, a growing number of English-learning institutions and English-learning software products have emerged.
Generally, people choose paid offline courses with foreign teachers for better spoken-language practice, but the schedule of such offline courses is fixed, so the learning time is inflexible and the cost is high. Meanwhile, the simulated dialogs provided by online learning software must proceed according to a preset flow, directly presenting options for the user to answer; this greatly restricts the user's answers and cannot provide an intelligent human-computer interactive learning experience.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, and a system for implementing intelligent voice conversation, to solve the problems of inflexible English learning time, high cost, and heavy restrictions on user answers.
To achieve this purpose, the invention provides the following technical solutions:
according to a first aspect of the present invention, a method of implementing an intelligent voice dialog is presented, the method comprising:
receiving a voice signal recorded by a client;
converting the voice signal into a voice text;
determining semantics corresponding to the voice text;
determining language logic corresponding to the semantics;
determining a dialog text corresponding to the language logic;
synthesizing an audio file corresponding to the dialog text;
and sending the audio file to a client.
According to a second aspect of the present invention, an apparatus for implementing intelligent voice dialog is provided, which includes:
the voice receiving module is used for receiving voice signals recorded by the client;
the text conversion module is used for converting the voice signal into a voice text;
the semantic determining module is used for determining the semantic corresponding to the voice text;
the logic determination module is used for determining language logic corresponding to the semantics;
the text determining module is used for determining the dialog text corresponding to the language logic;
the audio synthesis module is used for synthesizing an audio file corresponding to the dialog text;
and the audio sending module is used for sending the audio file to the client.
According to a third aspect of the present invention, there is provided a system for implementing intelligent voice conversations, the system comprising a client and a server, wherein:
the client is used for receiving the scene instruction and sending the scene instruction to the server;
the server is used for starting the function of intelligent voice conversation based on the scene instruction, initiating first-round conversation to the client based on the scene corresponding to the scene instruction, converting the voice signal into a voice text when receiving the voice signal recorded by the client, determining the semantic corresponding to the voice text, determining the language logic corresponding to the semantic, determining the conversation text corresponding to the language logic, synthesizing the audio file corresponding to the conversation text, and sending the audio file to the client;
the client is also used for receiving the audio file and playing the audio file.
According to the technical solutions above, the server receives the voice signal recorded by the client, converts it into a voice text, and determines the corresponding semantics; the server then determines the language logic from the semantics, determines the corresponding dialog text from the language logic, synthesizes the audio file corresponding to the dialog text, and sends the audio file to the client, so that the client can initiate the next round of dialog after playing the audio file.
Drawings
FIG. 1A is a flowchart of an embodiment of a method for implementing intelligent voice conversations;
FIG. 1B is a schematic diagram of an internal structure of a server to which the method of FIG. 1A is applied;
FIG. 2 is a flowchart of an embodiment of a method for implementing intelligent voice conversations;
FIG. 3 is a flow chart of another embodiment of a method for implementing intelligent voice conversations provided by the present invention;
FIG. 4 is a flow chart of another embodiment of a method for implementing intelligent voice conversations;
FIG. 5 is a flow chart of yet another embodiment of a method for implementing intelligent voice conversations provided by the present invention;
FIG. 6 is a flow chart of yet another embodiment of a method for implementing intelligent voice conversations provided in the present invention;
FIG. 7 is a flowchart of another embodiment of a method for implementing intelligent voice conversations;
FIG. 8 is a hardware block diagram of a server provided by the present invention;
FIG. 9 is a block diagram of an embodiment of an apparatus for implementing intelligent voice conversations;
fig. 10 is a block diagram of another embodiment of an apparatus for implementing intelligent voice conversation provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
Fig. 1A is a flowchart of an embodiment of a method for implementing an intelligent voice conversation according to the present invention. The method for realizing intelligent voice conversation can be applied to a server, as shown in fig. 1A, and includes the following steps:
step 101: and receiving the voice signal recorded by the client.
Step 102: the speech signal is converted into a speech text.
Step 103: and determining the corresponding semantics of the voice text.
Step 104: and determining language logic corresponding to the semantics.
Step 105: and determining the dialog text corresponding to the language logic.
Step 106: and synthesizing an audio file corresponding to the dialog text.
Step 107: and sending the audio file to the client.
In step 101, in an embodiment, as will be understood by those skilled in the art, the client presents at least one scene task on its screen, where a scene task is a real-life scenario that ultimately achieves a certain purpose, for example: ordering a steak at a restaurant, obtaining a boarding pass at an airport, shopping at a duty-free shop, or checking in at a hotel. The user selects a scene task by tapping the screen; the client receives the scene instruction generated by the tap and sends it to the server, and the server starts the intelligent voice dialog function based on the scene instruction and initiates the first dialog round based on the scene corresponding to the instruction. Taking the scene task "order a steak at a restaurant" as an example, the server initiates the first dialog round based on this scene, and the client plays an audio file with the content "What steak do you want?". The presentation form on the client screen may combine the audio file with text prompts, pictures, animated images, short videos, and the like. Those skilled in the art will appreciate that the difficulty of the intelligent dialog may be adjusted by setting different combinations. For example, when an audio file is presented together with only a picture, the user's listening ability is tested more and the dialog is harder; when the audio file is presented together with a text prompt, the user can understand the spoken content more easily by reading the prompt, and the dialog is easier.
In each dialog round, the client starts recording the voice signal upon receiving a recording instruction from the user; when the client receives the instruction to finish recording, it sends the recorded voice signal to the server, and the server receives the voice signal recorded by the client. In response to the question "What steak do you want?", for example, the user records a voice signal with the content "I want Sirloin please" through the client.
In step 102, in one embodiment, the server converts the voice signal into a voice text. Continuing the example of step 101, the server converts the voice signal "I want Sirloin please" into the voice text "I want Sirloin please". For the specific way the server converts a voice signal into a voice text, reference may be made to the related art, which is not repeated here.
In step 103, in one embodiment, the server determines the semantics corresponding to the voice text. As those skilled in the art can understand, when the user's English level is limited, or the recorded audio is affected by environmental interference and other factors, the voice text converted by the server from the voice signal may contain missing words, grammar errors, broken sentences, and the like, so the server needs to extract from the voice text the effective core content that reflects the intent of the dialog. Continuing the example of step 102, the server determines that the verb "want" in the voice text "I want Sirloin please" expresses an affirmative meaning and, combined with the noun "Sirloin" in the same text, that the user wants a Sirloin steak; the server may therefore determine that the semantics are "wants to eat Sirloin steak". For the specific way the server determines the semantics corresponding to the voice text, reference may be made to the description of steps 201-202 shown in fig. 2 below, which is not repeated here.
In step 104, in one embodiment, the server determines the language logic corresponding to the semantics. For example, when the previous sentence in the dialog is "I want Sirloin please", the next sentence corresponding to the language logic may be "state whether Sirloin is in stock", "ask how well the steak should be cooked", or "ask whether side dishes or wine should be added". Continuing the example of step 103, after the server determines that the semantics are "wants to eat Sirloin steak", it may determine that the corresponding language logic is "ask how well the steak should be cooked". For the specific way the server determines the language logic corresponding to the semantics, reference may be made to the description of steps 301-302 shown in fig. 3 below, which is not repeated here.
In step 105, in one embodiment, the server determines the dialog text corresponding to the language logic. Continuing the example of step 104, once the language logic "ask how well the steak should be cooked" is determined, the server determines that the corresponding dialog text is "How should we prepare your steak, medium well, medium rare or well done?". For the specific way the server determines the dialog text corresponding to the language logic, reference may be made to the description of steps 401-402 shown in fig. 4 below, which is not repeated here.
In step 106, in one embodiment, the server synthesizes an audio file corresponding to the dialog text. Continuing the example of step 105, the server synthesizes the audio file corresponding to the dialog text "How should we prepare your steak, medium well, medium rare or well done?". For the specific way the server synthesizes the audio file corresponding to the dialog text, reference may be made to the related art, which is not repeated here.
In the embodiment of the invention, the server receives the voice signal recorded by the client, converts the voice signal into the voice text and determines the semantics corresponding to the voice text, the server determines the language logic according to the semantics, determines the corresponding dialog text through the language logic, finally synthesizes the audio file corresponding to the dialog text, and sends the audio file to the client, so that the client initiates the next round of dialog after playing the audio file.
Fig. 1B is a schematic diagram of the internal structure of a server to which the method of fig. 1A is applied. The server 11 in fig. 1B includes a voice module 111, an understanding module 112, a logic module 113, a text module 114, a content module 115, and an audio module 116. The voice module 111 is configured to receive the voice signal recorded by the client and convert it into a voice text; the understanding module 112 is configured to determine the semantics of the voice text; the logic module 113 is configured to determine the language logic corresponding to the semantics; the text module 114 is configured to determine the dialog text corresponding to the language logic; the content module 115 is configured to provide corresponding words, phrases, and sentences for the understanding module 112 and the text module 114, and to provide preset logic configurations for the logic module 113; the audio module 116 is configured to synthesize the dialog text determined by the text module 114 into an audio file. Specifically, in conjunction with steps 101-107 of fig. 1A above, the voice module 111 receives the voice signal recorded by the client, whose content is, for example, "I want Sirloin please", and converts it into the voice text "I want Sirloin please". The understanding module 112 determines that the semantics are "wants to eat Sirloin steak" by combining the verb "want" and the noun "Sirloin" provided by the content module 115. The logic module 113 then determines the language logic corresponding to the semantics "wants to eat Sirloin steak".
For example, the logic module 113 obtains three preset logic configurations from the content module 115: "state whether Sirloin is in stock", "ask how well the steak should be cooked", and "ask whether side dishes or wine should be added", and determines that the language logic corresponding to "wants to eat Sirloin steak" is "ask how well the steak should be cooked"; the text module 114 then determines that the corresponding dialog text is "How should we prepare your steak, medium well, medium rare or well done?". The audio module 116 synthesizes the audio file corresponding to that dialog text. Those skilled in the art can understand that the voice module 111, understanding module 112, logic module 113, text module 114, content module 115, and audio module 116 are only exemplary illustrations; the server may further include modules such as a judging module and a scoring module (not shown in fig. 1B). The judging module may be configured to determine whether a scene task is completed; taking the scene task "order a steak at a restaurant" as an example, when the server determines that the steak has been successfully ordered, that scene task is completed. The scoring module is configured to score the recorded voice signal; for the specific way the server scores, reference may be made to the description of step 608 shown in fig. 6 below, which is not repeated here.
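The module division described for fig. 1B can be sketched as a minimal Python pipeline. This is purely illustrative: all function names, the stubbed recognition result, and the tiny lookup tables are invented for demonstration; real recognition and synthesis engines are deferred to the related art, as the description notes.

```python
# Hypothetical sketch of the server-side pipeline from Fig. 1B.
# The recognition step is stubbed; lookup tables stand in for the
# content module's words, phrases, and preset logic configurations.

def speech_to_text(voice_signal: bytes) -> str:
    """Voice module (111): stand-in for a real ASR engine."""
    return "I want Sirloin please"  # stubbed recognition result

def determine_semantics(speech_text: str) -> str:
    """Understanding module (112): extract the core intent of the dialog."""
    if "want" in speech_text and "Sirloin" in speech_text:
        return "wants to eat Sirloin steak"
    return "unknown"

def determine_language_logic(semantics: str) -> str:
    """Logic module (113): pick a follow-up from preset configurations."""
    preset_logic = {
        "wants to eat Sirloin steak": "ask how well the steak should be cooked",
    }
    return preset_logic.get(semantics, "ask the user to repeat")

def determine_dialog_text(language_logic: str) -> str:
    """Text module (114): map the chosen logic to a reply sentence."""
    replies = {
        "ask how well the steak should be cooked":
            "How should we prepare your steak, medium well, medium rare or well done?",
    }
    return replies[language_logic]

def handle_turn(voice_signal: bytes) -> str:
    """One dialog round: steps 102-105; synthesis and sending follow."""
    text = speech_to_text(voice_signal)
    semantics = determine_semantics(text)
    logic = determine_language_logic(semantics)
    return determine_dialog_text(logic)
```

In a real deployment the returned dialog text would then be passed to the audio module for synthesis and sent back to the client.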
Fig. 2 is a flowchart of an embodiment of a method for implementing an intelligent voice conversation provided by the present invention, and with reference to fig. 1A, an exemplary description is made on how a server determines semantics corresponding to a voice text on the basis of steps 101 to 107, as shown in fig. 2, including the following steps:
step 201: and selecting at least one keyword in the voice text based on a first preset selection rule.
Step 202: semantics are determined based on at least one keyword.
In step 201, the first preset selection rule may be to select the verbs, nouns, pronouns, adverbs, and so on in the voice text as keywords; specifically, different selection rules may be set for different voice texts. When the question begins with "what" or "where", the nouns in the voice text are preferentially selected as keywords; when it begins with "who", the personal pronouns and nouns in the voice text are preferentially selected as keywords; when the question begins with "how" or "do", the adverbs in the voice text are preferentially selected as keywords. For example, when the question is "Do you want to eat Sirloin?" and the voice text is "Yes", the keyword is "Yes"; when the question is "Who is your best friend?" and the voice text is "Lily is my best friend", the noun "Lily", which denotes a specific person, can be selected from the voice text.
In step 202, the server determines the semantics based on the at least one keyword. Continuing the example of step 201, when the question is "Do you want to eat Sirloin?" and the keyword determined by the server is "Yes", the server can determine that the semantics are "wants to eat Sirloin steak".
In the embodiment of the invention, the server selects at least one keyword in the voice text based on the first preset selection rule and determines the semantics based on the at least one keyword; by setting different first preset selection rules, the server can be made more intelligent in semantic understanding, with higher fault tolerance.
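The first preset selection rule of steps 201-202 can be sketched as follows. The word lists and the tokenization are invented for illustration; a real implementation would use a proper part-of-speech tagger rather than fixed sets.

```python
# Hypothetical illustration of the "first preset selection rule":
# the question word decides which word classes in the user's answer
# count as keywords. The tiny word lists are for demonstration only.

NOUNS = {"Sirloin", "Lily", "steak"}
ADVERBS = {"well", "really"}
AFFIRMATIONS = {"Yes", "No"}

def select_keywords(question: str, speech_text: str) -> list:
    words = speech_text.replace("?", "").replace(".", "").split()
    q = question.lower()
    if q.startswith(("what", "where")):
        # prefer nouns for "what/where" questions
        return [w for w in words if w in NOUNS]
    if q.startswith("who"):
        # prefer personal names/nouns for "who" questions
        return [w for w in words if w in NOUNS]
    if q.startswith(("how", "do")):
        # prefer adverbs and yes/no answers for "how/do" questions
        return [w for w in words if w in ADVERBS or w in AFFIRMATIONS]
    return words
```

For instance, `select_keywords("Do you want to eat Sirloin?", "Yes")` picks out `"Yes"`, matching the example in the description.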
Fig. 3 is a flowchart of another embodiment of a method for implementing intelligent voice conversation provided by the present invention, and in conjunction with fig. 1B, the embodiment of the present invention exemplarily illustrates how a server determines language logic corresponding to semantics, as shown in fig. 3, including the following steps:
step 301: and determining at least one preset logic configuration corresponding to the semantics.
Step 302: based on a second preset selection rule, language logic is determined from at least one preset logic configuration.
In step 301, in conjunction with fig. 1B, the content module 115 is configured to store preset logic configurations and provide them to the logic module 113. Taking the semantics "wants to eat Sirloin steak" as an example, the logic module 113 determines, from the content module 115, three preset logic configurations corresponding to these semantics: "state whether Sirloin is in stock", "ask how well the steak should be cooked", and "ask whether side dishes or wine should be added".
In step 302, the second preset selection rule is, for example: selecting a preset logic configuration that has not appeared earlier in the conversation; selecting a preset logic configuration by polling; selecting the preset logic configuration used the fewest times; and so on. For example, among the three preset logic configurations obtained from the content module 115, namely "state whether Sirloin is in stock", "ask how well the steak should be cooked", and "ask whether side dishes or wine should be added", the logic module 113 selects "ask how well the steak should be cooked" by polling as the language logic corresponding to the semantics "wants to eat Sirloin steak".
In the embodiment of the invention, the server determines at least one preset logic configuration corresponding to the semantics and, based on the second preset selection rule, determines the language logic from the at least one preset logic configuration. By setting a reasonable second preset selection rule and a larger number of preset logic configurations, the language logic finally determined by the server can be more diversified.
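Two of the second preset selection rules named above, polling and least-used selection, can be sketched like this. The configuration strings and the module-level state are illustrative assumptions; a real server would track this state per conversation session.

```python
# Hypothetical sketch of the "second preset selection rule":
# round-robin polling, or picking the least-used configuration.

from collections import Counter

PRESET_LOGIC = [
    "state whether Sirloin is in stock",
    "ask how well the steak should be cooked",
    "ask whether side dishes or wine should be added",
]

usage = Counter()   # how often each configuration has been used
_poll_index = 0     # round-robin cursor

def select_by_polling() -> str:
    """Round-robin over the preset configurations."""
    global _poll_index
    choice = PRESET_LOGIC[_poll_index % len(PRESET_LOGIC)]
    _poll_index += 1
    return choice

def select_least_used() -> str:
    """Pick the configuration used the fewest times so far."""
    choice = min(PRESET_LOGIC, key=lambda c: usage[c])
    usage[choice] += 1
    return choice
```

Both rules keep consecutive replies from repeating the same follow-up, which is what makes the resulting language logic "more diversified".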
Fig. 4 is a flowchart of another embodiment of a method for implementing an intelligent voice conversation, according to the present invention, in which, in conjunction with fig. 1A, on the basis of steps 101 to 107, an exemplary description is made on how a server determines a conversation text corresponding to language logic, as shown in fig. 4, the method includes the following steps:
step 401: a preset response rule is determined based on the speech signal.
Step 402: and determining the dialog text corresponding to the language logic based on a preset answering rule.
In step 401, the preset response rule is the rule the server uses when determining the dialog text based on the language logic. The way the server determines the preset response rule based on the voice signal may include: determining the preset response rule based on the voice text of the voice signal; or determining the preset response rule based on the score of the voice signal. Determining the rule based on the voice text allows the server, in combination with the context, to give a dialog text appropriate to the specific scene; determining the rule based on the score allows the server to give dialog texts of different difficulty for voice signals with different scores. Specifically, the server determines a score of the user's language ability based on the voice signal, and different scores correspond to different preset response rules, for example: a score of 0-30 corresponds to an easy preset response rule (more suggestive words are given); a score of 30-60 corresponds to a moderate preset response rule (a normal reply); a score of 60-100 corresponds to a difficult preset response rule (fewer suggestive words are given).
In step 402, continuing step 105, the server has determined that the language logic is "ask how well the steak should be cooked". In conjunction with step 401, if the score of the user's language ability is 25, the dialog text determined under the corresponding preset response rule is "How should we prepare your steak, medium well, medium rare or well done?", where "medium well, medium rare or well done" are the suggestive words given; if the score of the user's language ability is 85, the dialog text determined under the corresponding preset response rule is "How should we prepare your steak?", with no suggestive words given.
In the embodiment of the invention, the server determines the preset response rule based on the voice signal and determines the dialog text corresponding to the language logic based on that rule; by setting reasonable preset response rules, the difficulty of the dialog text can be changed flexibly.
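The score bands of steps 401-402 can be sketched as a small function mapping the ability score to a reply. The score bands follow the example in the description; the middle-band wording is an invented placeholder, since the description only calls it "a normal reply".

```python
# Hypothetical sketch of score-based preset response rules:
# lower ability scores get more suggestive words in the reply.

def dialog_text_for_score(score: int) -> str:
    base = "How should we prepare your steak"
    if score < 30:
        # easy rule: give more suggestive words
        return base + ", medium well, medium rare or well done?"
    if score < 60:
        # moderate rule: invented example of a reply with one hint
        return base + ", medium rare for example?"
    # difficult rule: no suggestive words
    return base + "?"
```

With a score of 25 the reply carries the full list of hints; with a score of 85 it carries none, matching the worked example.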
Fig. 5 is a flowchart of another embodiment of a method for implementing an intelligent voice conversation, according to the present invention, and in conjunction with fig. 1A, on the basis of steps 101-107, an embodiment of the present invention exemplarily illustrates how a server ends a conversation, as shown in fig. 5, including the following steps:
step 501: and judging whether the dialog text is consistent with a preset target text.
Step 502: and if the dialog text is consistent with the preset target text, ending the dialog.
In steps 501-502, the preset target text is a dialog text preset on the server that indicates the scene task is completed. Taking the scene task of fig. 1A, "order a steak at a restaurant", as an example, if the dialog text "Enjoy your meal" is consistent with the preset target text "Enjoy your meal", the server closes the intelligent voice dialog function and ends the dialog.
In the embodiment of the invention, the server judges whether the dialog text is consistent with the preset target text, and if the dialog text is consistent with the preset target text, the server finishes the dialog, thereby achieving the purpose of completing the scene task.
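The termination check of steps 501-502 reduces to a text comparison; a minimal sketch, assuming exact matching against the task's preset target text (other matching policies are possible but not described).

```python
# Hypothetical sketch of steps 501-502: the dialog ends once the
# generated dialog text matches the preset target text for the task.

PRESET_TARGET = "Enjoy your meal"

def dialog_finished(dialog_text: str) -> bool:
    """End the conversation when the target text is reached."""
    return dialog_text.strip() == PRESET_TARGET
```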
Fig. 6 is a flowchart of yet another embodiment of a method for implementing intelligent voice dialog provided by the present invention. In conjunction with fig. 1A, on the basis of steps 101-107, this embodiment exemplarily illustrates how the server determines a score for at least one dimension of the voice signal. As shown in fig. 6, the method includes the following steps:
step 601: and receiving the voice signal recorded by the client.
Step 602: the speech signal is converted into a speech text.
Step 603: and determining the corresponding semantics of the voice text.
Step 604: and determining language logic corresponding to the semantics.
Step 605: and determining the dialog text corresponding to the language logic.
Step 606: and synthesizing an audio file corresponding to the dialog text.
Step 607: and sending the audio file to the client.
Step 608: a score for at least one dimension of the speech signal is determined based on a preset scoring criterion.
In steps 601 to 607, the relevant description may refer to the relevant description of steps 101 to 107 in fig. 1A, which is not described herein again, it should be noted that step 608 may be performed before or after any step after step 601 is performed, and the timing sequence of step 608 is not limited herein.
In step 608, the preset scoring criteria is preset, and the preset scoring criteria can score the speech signal from a plurality of dimensions, including: pronunciation, fluency, expression, independent completion, etc. Specifically, taking fluency as an example, the preset scoring standard can judge the time length of the language signal recorded by the user; taking pronunciation as an example, the preset scoring criteria can be evaluated by the number of words or phrases in the speech text converted from the language signal by the server. And the server scores each dimension of the voice signals to obtain the score of each dimension. The server can also score based on the overall expression of the user conversation by setting different weights of each dimension, and can also generate each dimension capability distribution graph of the conditions of the user such as hearing, pronunciation, fluency, expression, independent completion and the like. Meanwhile, analysis and improvement suggestions based on capability distribution are provided, and dimension parts with lower expression scores in dimensions such as pronunciation and expression can be selected for commenting.
In the embodiment of the invention, the server determines a score for at least one dimension of the voice signal based on the preset scoring criterion. The score visually and intuitively shows the level of the user's dialog ability, and the server's analysis and commentary on the lower-scoring dimensions help the user practice their weaker dimensions in a targeted manner.
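The multi-dimension scoring described above can be sketched as follows. The dimension names, weights, caps, and raw-score formulas are illustrative assumptions, not values from the embodiment:

```python
# Hypothetical sketch of the multi-dimension scoring described above.
# Dimension names, weights, caps, and raw-score formulas are
# illustrative assumptions, not values from the embodiment.

def score_speech(duration_s, word_count):
    """Score a voice signal per dimension, then combine with weights."""
    # Fluency: judged from the duration of the recorded signal.
    fluency = min(duration_s / 10.0, 1.0) * 100
    # Pronunciation: judged from the number of recognized words in the
    # voice text the server converted from the signal.
    pronunciation = min(word_count / 20.0, 1.0) * 100
    per_dimension = {"fluency": fluency, "pronunciation": pronunciation}
    # Overall score: each dimension contributes by its preset weight.
    weights = {"fluency": 0.4, "pronunciation": 0.6}
    overall = sum(score * weights[dim] for dim, score in per_dimension.items())
    return per_dimension, overall
```

The per-dimension scores would feed the capability distribution graph, while the weighted sum stands in for the overall evaluation of the user's dialog.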
Fig. 7 is a flowchart of another embodiment of a method for implementing an intelligent voice conversation according to an embodiment of the present invention. In conjunction with Fig. 1A, and on the basis of steps 101 to 107, it illustrates how a help instruction is processed after the server receives it. As shown in Fig. 7, the method includes the following steps:
step 701: when a help-seeking instruction is received, at least one reference dialog text is determined based on the current dialog text.
Step 702: at least one reference dialog text is sent to the client.
In steps 701-702, when the server receives a help instruction sent by the client, the server determines the current dialog text, which is the question the server has sent to the client and is currently waiting for the user to answer. Based on the current dialog text, the server determines at least one corresponding reference dialog text. For example, the client receives a help instruction generated when the user taps a "request help" control on the screen and sends the instruction to the server; the server determines that the current dialog text is "What steak do you want? We have Rib Eye, Sirloin and T-Bone.", and then determines at least one preset reference dialog text: "I would have the Rib Eye please.", "I'd like to try the Sirloin." and "I am ordering the T-Bone.". The server sends these three reference dialog texts to the client, and the client displays them on the screen for the user's reference.
In the embodiment of the invention, when the server receives a help instruction, it determines at least one reference dialog text based on the current dialog text and sends it to the client. This gives the user reference examples, serves as a prompt, and helps the user memorize and imitate the expressions.
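The help-instruction handling of steps 701-702 amounts to a lookup from the pending question to its preset reference answers. A minimal sketch follows, in which the table contents come from the example above, while the function name and dictionary layout are assumptions:

```python
# Minimal sketch of steps 701-702: map the current dialog text (the
# question awaiting an answer) to its preset reference dialog texts.
# Table contents follow the example in the description; the function
# name and dict layout are assumptions.

REFERENCE_TEXTS = {
    "What steak do you want? We have Rib Eye, Sirloin and T-Bone.": [
        "I would have the Rib Eye please.",
        "I'd like to try the Sirloin.",
        "I am ordering the T-Bone.",
    ],
}

def handle_help_instruction(current_dialog_text):
    """Return the preset reference dialog texts for the pending question."""
    return REFERENCE_TEXTS.get(current_dialog_text, [])
```

The client would then display whatever list is returned on the screen for the user to consult.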
Corresponding to the above method for implementing an intelligent voice conversation, the invention also provides the hardware structure diagram of the server shown in Fig. 8. Referring to Fig. 8, at the hardware level the server includes a processor, an internal bus, a network interface, internal memory, and non-volatile storage, and may also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, forming, at the logical level, the apparatus for implementing an intelligent voice conversation. Of course, besides a software implementation, the present invention does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units and may also be hardware or logic devices.
Fig. 9 is a block diagram of an embodiment of an apparatus for implementing an intelligent voice conversation provided by the present invention. As shown in Fig. 9, the apparatus for implementing an intelligent voice conversation may include: a voice receiving module 91, a text conversion module 92, a semantic determining module 93, a logic determination module 94, a text determining module 95, an audio synthesis module 96 and an audio sending module 97, wherein:
the voice receiving module 91 is used for receiving a voice signal recorded by the client;
a text conversion module 92, configured to convert the voice signal into a voice text;
a semantic determining module 93, configured to determine a semantic corresponding to the voice text;
a logic determination module 94, configured to determine language logic corresponding to semantics;
a text determining module 95, configured to determine a dialog text corresponding to the language logic;
an audio synthesis module 96, configured to synthesize an audio file corresponding to the dialog text;
and the audio sending module 97 is configured to send an audio file to the client.
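As a rough illustration of how the seven modules of Fig. 9 cooperate, the following sketch wires stub versions of them into one server-side handler. Every method body here is a placeholder standing in for the module it is labeled with, not the embodiment's actual logic:

```python
# Illustrative wiring of the seven modules in Fig. 9 into one handler.
# All method bodies are stubs; real semantic analysis, language logic,
# and audio synthesis would replace them.

class VoiceDialogDevice:
    def speech_to_text(self, signal):          # text conversion module 92
        return signal.decode("utf-8")

    def determine_semantics(self, text):       # semantic determining module 93
        return {"intent": text.lower()}

    def determine_logic(self, semantics):      # logic determination module 94
        return f"logic:{semantics['intent']}"

    def determine_dialog_text(self, logic):    # text determining module 95
        return f"reply to {logic}"

    def synthesize_audio(self, dialog_text):   # audio synthesis module 96
        return dialog_text.encode("utf-8")

    def handle(self, signal):                  # modules 91 and 97 form the I/O ends
        text = self.speech_to_text(signal)
        semantics = self.determine_semantics(text)
        logic = self.determine_logic(semantics)
        dialog_text = self.determine_dialog_text(logic)
        return self.synthesize_audio(dialog_text)
```

In the apparatus, the voice receiving module 91 would supply `signal` and the audio sending module 97 would return the synthesized audio to the client.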
Fig. 10 is a block diagram of another embodiment of an apparatus for implementing intelligent voice conversation provided by the present invention, and as shown in fig. 10, on the basis of the above embodiment shown in fig. 9, the semantic determining module 93 includes:
the keyword selection sub-module 931 is configured to select at least one keyword in the voice text based on a first preset selection rule;
a first determination submodule 932 for determining semantics based on at least one keyword.
In one embodiment, the logic determination module 94 includes:
a second determining submodule 941, configured to determine at least one preset logic configuration corresponding to the semantics;
a third determining sub-module 942 is configured to determine the language logic from the at least one preset logic configuration based on the second preset selecting rule.
In one embodiment, the text determination module 95 includes:
a fourth determining sub-module 951, configured to determine a preset response rule based on the voice signal;
the fifth determining sub-module 952 is configured to determine a dialog text corresponding to the language logic based on a preset response rule.
In one embodiment, the apparatus for implementing intelligent voice dialog further comprises:
a text judgment module 98, configured to judge whether the dialog text is consistent with a preset target text;
and a dialog ending module 99, configured to end the dialog if the dialog text is consistent with the preset target text.
In one embodiment, the apparatus for implementing intelligent voice dialog further comprises:
the scoring module 100 is configured to determine a score of at least one dimension of the speech signal based on a preset scoring criterion.
In one embodiment, the apparatus for implementing intelligent voice dialog further comprises:
a reference text determination module 101, configured to determine, when a help seeking instruction is received, at least one reference dialog text based on a current dialog text;
and a text sending module 102, configured to send at least one reference dialog text to the client.
The implementation of the functions and actions of each unit in the above apparatus is described in the implementation of the corresponding steps in the above method, and is not repeated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
It can be seen from the above embodiments that, in the embodiments of the present invention, the server receives a voice signal recorded by the client, converts the voice signal into a voice text, and determines the semantics corresponding to that text. The server then determines the language logic according to the semantics, determines the corresponding dialog text through the language logic, and finally synthesizes the audio file corresponding to the dialog text and sends it to the client, so that the client can initiate the next round of dialog after playing the audio file.
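The round-by-round flow summarized above, together with the target-text termination described in the method, can be sketched as a driver loop. The script and target text below are assumptions for demonstration only:

```python
# Illustrative driver loop: the server answers each user turn until
# the dialog text it produces matches a preset target text, which
# ends the conversation. SCRIPT and TARGET_TEXT are demo assumptions.

SCRIPT = ["Welcome!", "What would you like?", "Goodbye!"]
TARGET_TEXT = "Goodbye!"

def run_dialog(user_turns):
    """Drive rounds of dialog; stop when the preset target text is reached."""
    rounds = []
    for i, _turn in enumerate(user_turns):
        dialog_text = SCRIPT[min(i, len(SCRIPT) - 1)]
        rounds.append(dialog_text)
        if dialog_text == TARGET_TEXT:  # preset target reached: end dialog
            break
    return rounds
```

A real server would derive each `dialog_text` through the semantics and language logic rather than a fixed script, but the termination check against the preset target text is the same.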
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method for implementing intelligent voice conversations, the method comprising:
receiving a voice signal recorded by a client;
converting the voice signal into a voice text;
determining semantics corresponding to the voice text;
determining language logic corresponding to the semantics;
determining a dialog text corresponding to the language logic;
synthesizing an audio file corresponding to the dialog text;
sending the audio file to a client;
wherein the determining of the dialog text corresponding to the language logic comprises:
determining a preset response rule based on the voice text of the voice signal, and selecting, based on the preset response rule and in combination with context, a dialog text suitable for the specific scene from the language logic;
or, determining a preset response rule based on the score of the voice signal, and selecting, based on the preset response rule, dialog texts of different degrees of difficulty for voice signals with different scores;
wherein the preset response rule is the principle by which the dialog text is determined based on the language logic.
2. The method of claim 1, wherein the determining the corresponding semantics of the speech text comprises:
selecting at least one keyword in the voice text based on a first preset selection rule;
determining semantics based on the at least one keyword.
3. The method of claim 1, wherein the determining the semantic correspondence language logic comprises:
determining at least one preset logic configuration corresponding to the semantics;
determining language logic from the at least one preset logic configuration based on a second preset selection rule.
4. The method of claim 1, further comprising:
judging whether the dialog text is consistent with a preset target text or not;
and if the dialog text is consistent with the preset target text, ending the dialog.
5. The method of claim 1, further comprising:
determining a score for at least one dimension of the speech signal based on a preset scoring criterion.
6. The method according to any one of claims 1-5, further comprising:
when a help seeking instruction is received, determining at least one reference dialog text based on the current dialog text;
and sending the at least one reference dialog text to the client.
7. An apparatus for enabling intelligent voice conversations, the apparatus comprising:
the voice receiving module is used for receiving voice signals recorded by the client;
the text conversion module is used for converting the voice signal into a voice text;
the semantic determining module is used for determining the semantic corresponding to the voice text;
the logic determination module is used for determining language logic corresponding to the semantics;
the text determining module is used for determining the dialog text corresponding to the language logic;
the audio synthesis module is used for synthesizing an audio file corresponding to the dialog text;
the audio sending module is used for sending the audio file to a client;
wherein the determining of the dialog text corresponding to the language logic comprises:
determining a preset response rule based on the voice text of the voice signal, and selecting, based on the preset response rule and in combination with context, a dialog text suitable for the specific scene from the language logic;
or, determining a preset response rule based on the score of the voice signal, and selecting, based on the preset response rule, dialog texts of different degrees of difficulty for voice signals with different scores;
wherein the preset response rule is the principle by which the dialog text is determined based on the language logic.
8. The apparatus of claim 7, wherein the semantic determination module comprises:
the keyword selection submodule is used for selecting at least one keyword in the voice text based on a first preset selection rule;
a first determining submodule for determining semantics based on the at least one keyword.
9. A system for implementing intelligent voice conversations, the system comprising: a client and a server; wherein:
the client is used for receiving the scene instruction and sending the scene instruction to the server;
the server is used for starting the function of intelligent voice conversation based on the scene instruction, initiating first-round conversation to the client based on the scene corresponding to the scene instruction, converting the voice signal into a voice text when receiving the voice signal recorded by the client, determining the semantic corresponding to the voice text, determining the language logic corresponding to the semantic, determining the conversation text corresponding to the language logic, synthesizing the audio file corresponding to the conversation text, and sending the audio file to the client;
the client is also used for receiving the audio file and playing the audio file;
wherein the determining of the dialog text corresponding to the language logic comprises:
determining a preset response rule based on the voice text of the voice signal, and selecting, based on the preset response rule and in combination with context, a dialog text suitable for the specific scene from the language logic;
or, determining a preset response rule based on the score of the voice signal, and selecting, based on the preset response rule, dialog texts of different degrees of difficulty for voice signals with different scores;
wherein the preset response rule is the principle by which the dialog text is determined based on the language logic.
CN201810105481.0A 2018-02-02 2018-02-02 Method, device and system for realizing intelligent voice conversation Active CN110136719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810105481.0A CN110136719B (en) 2018-02-02 2018-02-02 Method, device and system for realizing intelligent voice conversation


Publications (2)

Publication Number Publication Date
CN110136719A CN110136719A (en) 2019-08-16
CN110136719B true CN110136719B (en) 2022-01-28

Family

ID=67567135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810105481.0A Active CN110136719B (en) 2018-02-02 2018-02-02 Method, device and system for realizing intelligent voice conversation

Country Status (1)

Country Link
CN (1) CN110136719B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051404A1 (en) * 2019-09-20 2021-03-25 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for auxiliary reply

Citations (7)

Publication number Priority date Publication date Assignee Title
CN1637740A (en) * 2003-11-20 2005-07-13 阿鲁策株式会社 Conversation control apparatus, and conversation control method
CN101098366A (en) * 2006-06-30 2008-01-02 英华达(南京)科技有限公司 System and method for on-line interactive learning through network telephone
CN102667889A (en) * 2009-12-16 2012-09-12 浦项工科大学校产学协力团 Apparatus and method for foreign language study
CN103198831A (en) * 2013-04-10 2013-07-10 威盛电子股份有限公司 Voice control method and mobile terminal device
CN105975511A (en) * 2016-04-27 2016-09-28 乐视控股(北京)有限公司 Intelligent dialogue method and apparatus
CN106558252A (en) * 2015-09-28 2017-04-05 百度在线网络技术(北京)有限公司 By computer implemented spoken language exercise method and device
US9798799B2 (en) * 2012-11-15 2017-10-24 Sri International Vehicle personal assistant that interprets spoken natural language input based upon vehicle context

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
TW520488B (en) * 2002-03-12 2003-02-11 Inventec Corp Computer-assisted foreign language audiolingual teaching system for contextual read-after assessment and method thereof
CN101496077A (en) * 2005-05-09 2009-07-29 阿尔蒂斯阿万特公司 Comprephension instruction system and method
US8568144B2 (en) * 2005-05-09 2013-10-29 Altis Avante Corp. Comprehension instruction system and method
TWI420433B (en) * 2009-02-27 2013-12-21 Ind Tech Res Inst Speech interactive system and method
CN105575384A (en) * 2016-01-13 2016-05-11 广东小天才科技有限公司 Method, apparatus and equipment for automatically adjusting play resource according to the level of user


Also Published As

Publication number Publication date
CN110136719A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
US11756537B2 (en) Automated assistants that accommodate multiple age groups and/or vocabulary levels
US11347801B2 (en) Multi-modal interaction between users, automated assistants, and other computing services
JP4987203B2 (en) Distributed real-time speech recognition system
US10803850B2 (en) Voice generation with predetermined emotion type
US10860289B2 (en) Flexible voice-based information retrieval system for virtual assistant
WO2017200072A1 (en) Dialog method, dialog system, dialog device, and program
KR20210008521A (en) Dynamic and/or context-specific hot words to invoke automated assistants
EP3895161A1 (en) Utilizing pre-event and post-event input streams to engage an automated assistant
WO2017200076A1 (en) Dialog method, dialog system, dialog device, and program
JP6927318B2 (en) Information processing equipment, information processing methods, and programs
CN111459449A (en) Reading assisting method and device, storage medium and electronic equipment
CN117253478A (en) Voice interaction method and related device
CN110136719B (en) Method, device and system for realizing intelligent voice conversation
US11817093B2 (en) Method and system for processing user spoken utterance
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
CN116403583A (en) Voice data processing method and device, nonvolatile storage medium and vehicle
US20210183261A1 (en) Interactive virtual learning system and methods of using same
KR20190070682A (en) System and method for constructing and providing lecture contents
WO2021167732A1 (en) Implementing automatic chatting during video displaying
CN115167733A (en) Method and device for displaying live broadcast resources, electronic equipment and storage medium
CN116564143A (en) Spoken language learning method and device based on large language model
CN116416969A (en) Multi-language real-time translation method, system and medium based on big data
CN113781854A (en) Group discussion method and system for automatic remote teaching
Singh et al. Narrative-driven multimedia tagging and retrieval: investigating design and practice for speech-based mobile applications
CN117316002A (en) Method, apparatus and recording medium for providing session contents in role playing form

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant