CN110633357A - Voice interaction method, device, equipment and medium - Google Patents

Voice interaction method, device, equipment and medium Download PDF

Info

Publication number
CN110633357A
CN110633357A CN201910903910.3A CN201910903910A CN110633357A
Authority
CN
China
Prior art keywords
voice
user
reply
question
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910903910.3A
Other languages
Chinese (zh)
Inventor
赵涛涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910903910.3A priority Critical patent/CN110633357A/en
Publication of CN110633357A publication Critical patent/CN110633357A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics

Abstract

The application discloses a voice interaction method, apparatus, device, and medium, and relates to the field of intelligent voice technology. The specific implementation scheme is as follows: when a first user performs voice interaction with a smart speaker, key data are extracted from a first text corresponding to a first question voice of the smart speaker, and a first reply voice of the first user corresponding to the first question voice is recorded; a dialogue model for the first user is established based on the key data and the first reply voice; a second question voice input by a second user is received, and a second reply voice corresponding to the second question voice is determined based on the dialogue model, the second reply voice being a historical reply voice of the first user; the second reply voice is then broadcast. In this way the first reply voice of the first user is effectively recorded, and the second reply voice corresponding to the second user's question is determined based on the established dialogue model, achieving the effect of simulating a conversation with the first user and improving the realism of the simulated conversation.

Description

Voice interaction method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, in particular to intelligent voice technology, and more particularly to a voice interaction method, apparatus, device, and medium.
Background
With the development of smart speaker technology, more and more smart speaker products are on the market, and users can interact with them by voice. For example, the user speaks and the smart speaker plays a response voice according to the content it receives; or the smart speaker speaks first, the user answers according to what was heard, and the smart speaker receives the user's response voice.
However, current smart speakers can only hold a real-time conversation with the user who is present, and cannot provide richer conversation functions to meet the needs of different users.
Disclosure of Invention
The voice interaction method, apparatus, device, and medium of the present application are used to achieve the effect that a current user can hold a simulated conversation, through the smart speaker, with another user who has previously performed voice interaction with that smart speaker.
The embodiment of the application discloses a voice interaction method, which comprises the following steps:
when a first user performs voice interaction with a smart speaker, extracting key data from a first text corresponding to a first question voice of the smart speaker, and recording a first reply voice of the first user corresponding to the first question voice;
establishing a dialogue model for a first user based on the key data and the first reply voice;
receiving a second question voice input by a second user, and determining a second reply voice corresponding to the second question voice based on the dialogue model, wherein the second reply voice is the historical reply voice of the first user;
and broadcasting the second reply voice.
The above embodiment has the following advantages or beneficial effects: the first reply voice of the first user is effectively recorded, and the second reply voice corresponding to the second user's second question voice is determined from the first user's historical reply voices based on the established dialogue model. This solves the problem of a single voice interaction function, achieves the effect of a simulated conversation with another user of the same smart speaker even when that historical user is not present, and enriches the voice interaction forms of the smart speaker to meet different user needs.
Further, the key data comprises first intention data in the first text;
correspondingly, the establishing of the dialogue model for the first user based on the key data and the first reply voice comprises:
establishing a correspondence between the first intention data and the first reply voice, and establishing a dialogue model for the first user based on the correspondence.
Accordingly, the above embodiment has the following advantages or beneficial effects: a dialogue model for the first user is established from the correspondence between the first intention data in the first text corresponding to the first question voice and the first reply voice, so that the first reply voice of the first user is effectively recorded and a second reply voice can subsequently be determined from the dialogue model, realizing the simulated conversation process.
Further, the determining a second reply voice corresponding to the second question voice based on the dialogue model includes:
and determining second intention data in a second text corresponding to the second question voice, determining a first reply voice corresponding to the second intention data according to the dialogue model, and taking the first reply voice as a second reply voice corresponding to the second question voice.
Accordingly, the above-described embodiments have the following advantages or advantageous effects: according to the second intention data in the second text corresponding to the second question voice, the first reply voice corresponding to the second intention data is determined based on the dialogue model and serves as the second reply voice corresponding to the second question voice, so that the second reply voice is closer to the reply mode of the first user, and the reality of the simulated dialogue is improved.
Further, the key data comprise first intention data and first slot data in the first text;
correspondingly, the establishing of the dialogue model for the first user based on the key data and the first reply voice comprises:
determining second slot data in a third text corresponding to the first reply voice;
training a preset question-answer model according to the first text, the third text, the first intention data, the first slot data and the second slot data;
and taking the question-answer model obtained by training as a dialogue model for the first user.
Accordingly, the above embodiment has the following advantages or beneficial effects: the dialogue model is obtained by determining the first text, the third text, the first intention data, the first slot data and the second slot data and training the preset question-answer model, so that the first and third texts are analyzed more accurately, and a dialogue model determined from this more detailed information can determine the answer corresponding to the second question voice more accurately.
Further, the determining a second reply voice corresponding to the second question voice based on the dialogue model includes:
inputting second intention data and third slot data in a second text corresponding to the second question voice into the dialogue model, and acquiring a result text output by the dialogue model;
and determining a second reply voice corresponding to the second question voice according to the result text.
Accordingly, the above-described embodiments have the following advantages or advantageous effects: and determining a result text based on the dialogue model according to second intention data and third slot data in a second text corresponding to the second question voice, and further determining a second reply voice.
Further, the determining, according to the result text, a second reply voice corresponding to a second question voice includes:
obtaining a voice corresponding to third intention data in the result text by querying a database, and taking the queried voice as a second reply voice corresponding to the second question voice;
the database stores at least one correspondence between intention data and voice, and the correspondence between the at least one intention data and the voice is determined according to the correspondence between the first intention data and the first reply voice.
Accordingly, the above embodiment has the following advantages or beneficial effects: because the database stores at least one correspondence between intention data and voice, determined according to the correspondence between the first intention data and the first reply voice, the second reply voice obtained by querying the database is closer to the first user's first reply voice, making the conversation more authentic and faithful.
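The database of intention-to-voice correspondences described above can be sketched minimally as follows. This is an illustrative sketch only: the schema, function names, and audio paths are hypothetical, since the patent does not specify a storage format.

```python
import sqlite3


def create_reply_db(path=":memory:"):
    # Hypothetical schema: one row per correspondence between intention
    # data and the path of a recorded first-reply-voice audio file.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS intent_voice ("
        "  intent TEXT PRIMARY KEY,"
        "  audio_path TEXT NOT NULL)"
    )
    return conn


def record_correspondence(conn, intent, audio_path):
    # Store a first-intention-data -> first-reply-voice correspondence.
    conn.execute(
        "INSERT OR REPLACE INTO intent_voice (intent, audio_path) VALUES (?, ?)",
        (intent, audio_path),
    )
    conn.commit()


def lookup_reply_voice(conn, intent):
    # Query the voice corresponding to the (third) intention data in the
    # result text; the match is used as the second reply voice, or None.
    row = conn.execute(
        "SELECT audio_path FROM intent_voice WHERE intent = ?", (intent,)
    ).fetchone()
    return row[0] if row else None
```

In use, the speaker would populate the table during the first user's interactions and query it when answering the second user.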
Further, the determining, according to the result text, a second reply voice corresponding to a second question voice includes:
and acquiring pre-stored voice feature data of the first user, generating voice according to the voice feature data and the result text, and taking the generated voice as second reply voice corresponding to the second question voice.
Accordingly, the above-described embodiments have the following advantages or advantageous effects: the result text is generated into voice according to the voice feature data of the first user and used as second reply voice corresponding to the second question voice, so that the reply voice of the first user is truly restored, and the reality of the simulated conversation is improved.
The embodiment of the application also discloses a voice interaction device, which comprises:
the first user information acquisition module is used for extracting key data from a first text corresponding to a first question voice of the smart speaker and recording a first reply voice of the first user corresponding to the first question voice when the first user performs voice interaction with the smart speaker;
a dialogue model establishing module for establishing a dialogue model for a first user based on the key data and the first reply voice;
the second reply voice determining module is used for receiving a second question voice input by a second user, and determining a second reply voice corresponding to the second question voice based on the dialogue model, wherein the second reply voice is the historical reply voice of the first user;
and the broadcasting module is used for broadcasting the second reply voice.
Further, the key data comprises first intention data in the first text;
correspondingly, the dialogue model building module comprises:
a correspondence establishing unit, configured to establish a correspondence between the first intention data and the first reply voice, and establish a dialogue model for the first user based on the correspondence.
Further, the second reply voice determination module includes:
and the second intention data determining unit is used for determining second intention data in a second text corresponding to the second question voice, determining a first reply voice corresponding to the second intention data according to the dialogue model, and taking the first reply voice as a second reply voice corresponding to the second question voice.
Further, the key data comprise first intention data and first slot data in the first text;
correspondingly, the dialogue model building module comprises:
the second slot data determining unit is used for determining second slot data in a third text corresponding to the first reply voice;
the model training unit is used for training a preset question-answer model according to the first text, the third text, the first intention data, the first slot data and the second slot data;
and the question-answer model determining unit is used for taking the trained question-answer model as a dialogue model for the first user.
Further, the second reply voice determination module includes:
the result text output unit is used for inputting second intention data and third slot data in a second text corresponding to a second question voice into the dialogue model and acquiring a result text output by the dialogue model;
and the result text processing unit is used for determining a second reply voice corresponding to the second question voice according to the result text.
Further, the result text processing unit includes:
the query subunit is configured to obtain, through querying a database, a speech corresponding to the third intention data in the result text, and use the queried speech as a second reply speech corresponding to the second question speech;
the database stores at least one correspondence between intention data and voice, and the correspondence between the at least one intention data and the voice is determined according to the correspondence between the first intention data and the first reply voice.
Further, the result text processing unit includes:
and the voice generating subunit is used for acquiring pre-stored voice characteristic data of the first user, generating voice according to the voice characteristic data and the result text, and taking the generated voice as second reply voice corresponding to the second question voice.
The embodiment of the application also discloses an electronic device, which comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in any one of the embodiments of the present application.
Also disclosed in embodiments herein is a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of the embodiments herein.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a voice interaction method provided according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating another method of voice interaction provided in accordance with an embodiment of the present application;
FIG. 3 is a flow chart illustrating a further method for voice interaction according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a voice interaction apparatus provided in an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device for implementing a voice interaction method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flowchart of a voice interaction method according to an embodiment of the present application. This embodiment is applicable to the case where the current user holds a conversation with another user through a smart speaker. Typically, it applies to a simulated conversation with a historical user who has previously performed voice interaction with the smart speaker but is not currently present. The voice interaction method disclosed in this embodiment may be executed by a voice interaction apparatus, which may be implemented in software and/or hardware. Referring to fig. 1, the voice interaction method provided in this embodiment includes:
s110, when the first user performs voice interaction with the intelligent sound box, extracting key data in a first text corresponding to a first question voice of the intelligent sound box, and recording a first reply voice of the first user corresponding to the first question voice.
The first user may be any user who performs voice interaction with the smart speaker. The first question voice may be a question uttered by the smart speaker during that interaction, for example, "What did you eat for breakfast?".
At present, when a smart speaker interacts with a user by voice, it can only hold a real-time conversation at the current moment, and the interaction exists only between the smart speaker and that user. A simulated conversation between different users cannot be realized, so the voice interaction mode is limited and cannot meet the needs of different users. For example, when a first user has previously interacted with the smart speaker by voice but is currently away or has left home, there is no way to hold a simulated conversation with the first user based on the first user's past use of and interaction with the smart speaker, so as to achieve the effect of reminiscing about the first user. Therefore, in the embodiment of the application, when the first user performs voice interaction with the smart speaker, the key data in the first text corresponding to the first question voice of the smart speaker are extracted, so that the information in the first question voice is captured in time, and the first reply voice of the first user corresponding to the first question voice is recorded. The first user's replies are thus recorded in time, and the reply content of a later simulated conversation can be determined from the recorded first reply voice, realizing a simulated conversation with the first user.
Optionally, when users interact with the smart speaker by voice, different users can be distinguished according to timbre, volume, pitch, and the like, so that the key data of the first question voices and the first reply voices of different users are stored separately per user. This improves the accuracy of data storage and allows a later simulated conversation to be held specifically against the historical reply voices of a designated user.
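The per-user recording described above can be sketched as follows. All names are hypothetical, and the sketch assumes an external voiceprint module (based on timbre, volume, pitch, etc.) has already assigned each utterance a speaker id.

```python
from collections import defaultdict


class InteractionRecorder:
    """Stores each user's (key data, reply voice) pairs separately."""

    def __init__(self):
        # speaker_id -> list of (key_data, reply_audio_path) pairs
        self._records = defaultdict(list)

    def record(self, speaker_id, key_data, reply_audio_path):
        # Store the key data of the question text together with the
        # recorded reply voice of this particular user.
        self._records[speaker_id].append((key_data, reply_audio_path))

    def history(self, speaker_id):
        # Return the historical pairs for one user; these later serve
        # as training samples for that user's dialogue model.
        return list(self._records[speaker_id])
```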
And S120, establishing a dialogue model for the first user based on the key data and the first reply voice.
For example, for the key data of the first text corresponding to each first question voice, there is a corresponding first reply voice, so each pair of key data and its first reply voice can be used as a training sample to train a neural network model, resulting in a dialogue model for the first user.
Because the key data in the first text corresponding to the first question voice correspond to the first reply voice for that question, a dialogue model for the first user is established from the key data and the corresponding first reply voice, so that the correspondence between the key data in the first text and the first reply voice is effectively stored by the model.
S130, receiving a second question voice input by a second user, and determining a second reply voice corresponding to the second question voice based on the dialogue model, wherein the second reply voice is the historical reply voice of the first user.
For example, when a second user wants a simulated conversation with the first user, the second user may utter a second question voice to the smart speaker. In order to simulate a conversation with the first user and improve its realism, the smart speaker does not simply generate a reply from a semantic analysis of the second question voice; instead, it determines the second reply voice based on the dialogue model. For example, the key data of the second text corresponding to the second question voice are input into the dialogue model, and the second reply voice output by the dialogue model is obtained. Because the second reply voice is a historical reply voice of the first user and the dialogue model is established from the key data and the first reply voices, the determined second reply voice accurately restores the first user's manner of replying, improving the realism of the simulated conversation.
And S140, broadcasting the second reply voice.
For example, the smart speaker broadcasts the second reply voice to the second user. Since the second reply voice is a historical reply voice of the first user, the first user's reply scene is accurately restored, and a realistic simulated conversation between the second user and the first user is achieved.
According to the technical scheme of this embodiment, the first reply voice of the first user is effectively recorded, and the second reply voice corresponding to the second user's second question voice is determined from the first user's historical reply voices based on the established dialogue model. This overcomes the problem of a single voice interaction function, achieves the effect of a simulated conversation with another user of the same smart speaker, and enriches the voice interaction functions of the smart speaker to meet the needs of different users.
Fig. 2 is a schematic flow chart of another voice interaction method provided in an embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 2, the voice interaction method provided in this embodiment includes:
s210, when a first user performs voice interaction with the intelligent sound box, extracting key data in a first text corresponding to a first question voice of the intelligent sound box, and recording a first reply voice of the first user corresponding to the first question voice; the key data includes first intention data in the first text.
The first intention data express the intention of the first text. For example, if the first text is "Where are you from?", the intention of the first text may be determined to be a native-place query.
S220, establishing a correspondence between the first intention data and the first reply voice, and establishing a dialogue model for the first user based on the correspondence.
Specifically, since the first question voice corresponds to the first user's first reply voice, the key data in the first text corresponding to the first question voice also correspond to that first reply voice. The dialogue model for the first user is therefore trained on the correspondence between the key data in the first text and the first user's first reply voice. For example, the key data in the first text are used as the input of a neural network model, the corresponding first reply voice of the first user is used as the expected output, and the neural network model is trained to obtain the dialogue model for the first user. Because the dialogue model is established from the correspondence between the key data and the first reply voice, that correspondence is effectively recorded, making it convenient to restore the first user's reply scene from the dialogue model.
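The correspondence-based model of S220 can be sketched minimally as below. This stands in for the trained neural model the text describes: here the "dialogue model" is just a mapping from first intention data to the recorded first reply voice, with all names hypothetical.

```python
def build_dialogue_model(correspondences):
    """Build a dialogue model for the first user.

    correspondences: iterable of (first_intent, first_reply_audio) pairs,
    one per recorded question/reply exchange.
    """
    model = {}
    for intent, reply_audio in correspondences:
        # A later reply to the same intention overwrites the earlier one.
        model[intent] = reply_audio
    return model
```

The returned mapping is then queried at reply time with the second user's intention data.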
And S230, receiving a second question voice input by a second user.
S240, second intention data in a second text corresponding to the second question voice are determined, a first reply voice corresponding to the second intention data is determined according to the dialogue model, and the first reply voice is used as a second reply voice corresponding to the second question voice.
The second intention data express the intention of the second text. For example, the smart speaker receives the second question voice input by the second user, analyzes the second text corresponding to the second question voice, obtains the second intention data in the second text, inputs the second intention data into the dialogue model, determines the first reply voice corresponding to the second intention data, and takes that first reply voice as the second reply voice corresponding to the second question voice. For example, the second intention data may be matched against the first intention data, and the first reply voice corresponding to the matching first intention data used as the second reply voice.
By determining the second reply voice from the second intention data and the dialogue model, the second reply voice accurately matches the second intention in the second text and therefore matches the second question voice well. Since the second reply voice is in fact a first reply voice of the first user, the scene of the conversation is faithfully restored, the reply is closer to how the first user would answer, and the realism of the simulated conversation is improved.
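The intention-matching step of S240 can be sketched as follows, assuming the dialogue model is a mapping from first intention data to reply audio. The fuzzy fallback via `difflib` is an illustrative assumption; the patent only says the second intention data "may be matched" against the first.

```python
import difflib


def determine_second_reply(dialogue_model, second_intent):
    """Return the first reply voice whose first intention matches
    the second user's intention data, or None if nothing matches."""
    # Exact match first.
    if second_intent in dialogue_model:
        return dialogue_model[second_intent]
    # Otherwise, fall back to a fuzzy match over the stored intentions.
    close = difflib.get_close_matches(
        second_intent, list(dialogue_model), n=1, cutoff=0.6
    )
    return dialogue_model[close[0]] if close else None
```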
And S250, broadcasting the second reply voice.
According to the technical scheme, the dialogue model is established based on the corresponding relation between the first intention data and the first reply voice, so that the reply content of the first user is effectively recorded, the first reply voice corresponding to the second intention data is determined to serve as the second reply voice corresponding to the second question voice through the second intention data in the second text corresponding to the second question voice and the dialogue model, the scene simulation of dialogue with the first user is achieved, and the played second reply voice is actually the first reply voice of the first user, so that the reality of the simulated dialogue is improved.
Fig. 3 is a flowchart illustrating another voice interaction method according to an embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 3, the voice interaction method provided in this embodiment includes:
s310, when a first user performs voice interaction with the intelligent sound box, extracting key data in a first text corresponding to a first question voice of the intelligent sound box, and recording a first reply voice of the first user corresponding to the first question voice; the key data includes first intention data and first slot data in the first text.
The first intention data express the intention of the first text; for example, if the first text is "Where are you from?", its intention may be determined to be a native-place query. The first slot data express the types of the keywords in the first text; for example, for the first text "Where are you from?", the first slot data may be determined as {you} {place of origin}. The intention of the first text is accurately expressed by the first intention data, and the keyword types in the first text are accurately expressed by the first slot data, so the first text can be analyzed more accurately.
S320, determining second slot data in a third text corresponding to the first reply voice.
For example, for the first reply voice, the second slot data in the corresponding third text are determined, so that the first reply voice is split and analyzed in a more fine-grained way. For example, if the third text corresponding to the first reply voice is "I am from Tangshan, Hebei", the second slot data of the third text may be determined as {I} {Hebei} {Tangshan}, so that the third text is refined into its key information.
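Slot extraction like the above can be sketched with a toy pattern matcher. The patent does not specify how slot data are obtained; this regex over an English rendering of the reply "I am from Tangshan, Hebei" is purely illustrative, and all names are hypothetical.

```python
import re

# Toy pattern for replies shaped like "I am from <city>, <province>".
_ORIGIN_PATTERN = re.compile(r"\bI am from (?P<city>\w+), (?P<province>\w+)")


def extract_slots(text):
    """Return slot data for an origin-style reply, e.g.
    {I} {Hebei} {Tangshan}, or an empty dict if the pattern misses."""
    m = _ORIGIN_PATTERN.search(text)
    if not m:
        return {}
    return {"subject": "I", "city": m.group("city"), "province": m.group("province")}
```

A production system would replace this with a trained slot-filling model.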
S330, training a preset question-answer model according to the first text, the third text, the first intention data, the first slot data and the second slot data, and taking the trained question-answer model as the dialogue model for the first user.
The preset question-answer model may be a question-answer model constructed in advance based on a neural network, which outputs a reply sentence for an input question sentence. For example, the first text, the first intention data and the first slot data may be used as the input of the preset question-answer model, and the third text and the second slot data as its expected output, and the model is trained accordingly; the trained question-answer model then serves as the dialogue model for the first user. Because the dialogue model is trained on the first text, the third text, the first intention data, the first slot data and the second slot data, it effectively records the key information of the first user's first question voice and first reply voice, which facilitates determining the reply content of the simulated dialogue later.
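To make the data flow concrete, the training and inference interface of the question-answer model can be sketched as follows. The patent describes a neural network; this sketch deliberately substitutes a lookup table (an assumption for brevity) and keeps only the input/output contract of S330 and S350:

```python
class QAModel:
    """Toy stand-in for the patent's neural question-answer model.

    The patent trains a neural network on (first text, intention data,
    slot data) -> (third text, reply slot data); this sketch replaces
    learning with a lookup table to show only the data flow.
    """

    def __init__(self):
        self._table = {}

    def train(self, question_text, intent, q_slots, reply_text, r_slots):
        # Key on the intent plus question slots, since S350 feeds
        # intention data and slot data into the model at inference time.
        self._table[(intent, tuple(q_slots))] = {
            "result_text": reply_text,
            "slots": r_slots,
        }

    def predict(self, intent, slots):
        # Returns the "result text" described in S350, or None if unseen.
        return self._table.get((intent, tuple(slots)))


model = QAModel()
model.train("Where are you from?", "hometown_query", ["you", "hometown"],
            "I am from Tangshan, Hebei", ["I", "Hebei", "Tangshan"])
result = model.predict("hometown_query", ["you", "hometown"])
```

A real implementation would replace the table with a trained sequence model, but the calling convention — intent and slots in, result text out — stays the same.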
And S340, receiving a second question voice input by a second user.
And S350, inputting second intention data and third slot data in a second text corresponding to the second question voice into the dialogue model, and acquiring a result text output by the dialogue model.
Illustratively, the smart speaker receives the second question voice input by the second user and inputs the second intention data and third slot data of the corresponding second text into the dialogue model, thereby obtaining the result text. The result text contains the content of the second reply voice corresponding to the second question voice. Because the result text is determined by the dialogue model, which encodes the key information of the first user's first question voice and first reply voice, the result text is content related to the first user's reply voice, which improves the realism of the simulated dialogue.
And S360, determining a second reply voice corresponding to the second question voice according to the result text.
Optionally, the determining, according to the result text, a second reply voice corresponding to the second question voice includes: obtaining a voice corresponding to third intention data in the result text by querying a database, and taking the queried voice as the second reply voice corresponding to the second question voice; the database stores at least one correspondence between intention data and voice, and the at least one correspondence is determined according to the correspondence between the first intention data and the first reply voice.
Illustratively, the correspondence between the first intention data in the first text and the first user's first reply voice is stored in the database in advance. For the obtained result text, its third intention data is determined; the database is then queried for first intention data matching the third intention data, and the first reply voice corresponding to that matching first intention data is taken as the voice corresponding to the third intention data, i.e., as the second reply voice corresponding to the second question voice. Because the correspondences pre-stored in the database are determined according to the correspondence between the first intention data and the first reply voice, the second reply voice is closer to the first user's reply, the simulated dialogue more closely resembles a dialogue with the first user, and the realism of the simulated dialogue is improved.
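The optional database lookup can be sketched as a simple intent-to-recording mapping. All names and the file path are hypothetical; the patent only requires that the stored correspondences derive from the first intention data and the first reply voice:

```python
# Sketch of the intent -> recorded-reply lookup described above.
# Recorded audio is represented by file paths; all names are illustrative.

reply_voice_db = {}  # intention data -> recorded first-reply audio path

def store_reply(intent: str, audio_path: str) -> None:
    """Save the correspondence between first intention data and first reply voice."""
    reply_voice_db[intent] = audio_path

def find_second_reply(third_intent: str):
    """Query the database for the voice matching the result text's third intention data."""
    return reply_voice_db.get(third_intent)

# Recorded while the first user interacted with the speaker:
store_reply("hometown_query", "/recordings/user1/hometown_reply.wav")

# Later, the second user's question resolves to the same intent:
second_reply = find_second_reply("hometown_query")
```

If no matching intent is stored, the lookup returns `None`, and the system would fall back to ordinary question answering rather than replaying the first user's voice.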
Optionally, the determining, according to the result text, a second reply voice corresponding to the second question voice includes: and acquiring pre-stored voice feature data of the first user, generating voice according to the voice feature data and the result text, and taking the generated voice as second reply voice corresponding to the second question voice.
Illustratively, when the first user performs voice interaction with the smart speaker, the recorded first reply voice of the first user is analyzed, voice feature data such as timbre, volume and intonation is extracted from it, and the voice feature data corresponding to the first user is saved. After the result text is obtained, it is converted into voice according to the first user's voice feature data and taken as the second reply voice corresponding to the second question voice. The second reply voice thus restores the first user's reply voice with higher similarity to the first user's own voice, improving the realism of the simulated dialogue with the first user.
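The optional voice-feature synthesis path can be sketched as follows. The feature fields (timbre, volume, pitch) and the `synthesize` stub are assumptions for illustration; a real system would pass the stored features and the result text to a TTS engine:

```python
from dataclasses import dataclass

@dataclass
class VoiceFeatures:
    """Per-user features extracted from recorded replies (illustrative fields)."""
    timbre: str
    volume_db: float
    pitch_hz: float

def synthesize(result_text: str, features: VoiceFeatures) -> dict:
    """Compose a synthesis request; a real system would invoke a TTS engine here."""
    return {
        "text": result_text,
        "timbre": features.timbre,
        "volume_db": features.volume_db,
        "pitch_hz": features.pitch_hz,
    }

# Features saved while the first user interacted with the speaker:
user1 = VoiceFeatures(timbre="warm", volume_db=60.0, pitch_hz=180.0)

# Convert the result text into the second reply voice in the first user's style:
second_reply_voice = synthesize("I am from Tangshan, Hebei", user1)
```

The point of the sketch is the separation of concerns: the dialogue model produces only text, and the stored per-user features control how that text is voiced.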
And S370, broadcasting the second reply voice.
According to the technical scheme of this embodiment, the dialogue model is trained from the first text, the third text, the first intention data, the first slot data and the second slot data, so that the key information of the first user's first question voice and first reply voice is effectively recorded by the dialogue model, and the reply content of the simulated dialogue can be conveniently determined later. Because the result text obtained from the dialogue model is content related to the first user's reply voice, the realism of the simulated dialogue is improved.
Fig. 4 is a schematic structural diagram of a voice interaction apparatus provided according to an embodiment of the present application. Referring to fig. 4, an embodiment of the present application discloses a voice interaction apparatus 400, where the apparatus 400 includes: the system comprises a first user information acquisition module 401, a dialogue model establishment module 402, a second reply voice determination module 403 and a broadcast module 404.
The first user information obtaining module 401 is configured to, when a first user performs voice interaction with the smart sound box, extract key data in a first text corresponding to a first question voice of the smart sound box, and record a first reply voice of the first user corresponding to the first question voice;
a dialogue model building module 402 for building a dialogue model for a first user based on the key data and the first reply voice;
a second reply voice determining module 403, configured to receive a second question voice input by a second user, and determine, based on the dialog model, a second reply voice corresponding to the second question voice, where the second reply voice is a historical reply voice of the first user;
and the broadcasting module 404 is used for broadcasting the second reply voice.
According to the technical scheme of the embodiment of the application, the first reply voice of the first user is effectively recorded, and the second reply voice corresponding to the second user's second question voice is determined from the first user's historical reply voices based on the established dialogue model. This solves the problem that the current user cannot converse with a user who once used the smart speaker, achieving the effect of simulating a conversation with a historical user of the same smart speaker who is not currently using it, enriching the voice interaction function of the smart speaker and meeting different needs of users.
Further, the key data comprises first intention data in the first text;
accordingly, the dialogue model building module 402 includes:
a corresponding relation establishing unit, configured to establish a correspondence between the first intention data and the first reply voice, and establish a dialogue model for the first user based on the correspondence.
Further, the second reply voice determination module 403 includes:
and the second intention data determining unit is used for determining second intention data in a second text corresponding to the second question voice, determining a first reply voice corresponding to the second intention data according to the dialogue model, and taking the first reply voice as a second reply voice corresponding to the second question voice.
Further, the key data comprises first intention data and first slot data in the first text;
accordingly, the dialogue model building module 402 includes:
the second slot data determining unit is used for determining second slot data in a third text corresponding to the first reply voice;
the model training unit is used for training a preset question-answer model according to the first text, the third text, the first intention data, the first slot data and the second slot data;
and the question-answer model determining unit is used for taking the trained question-answer model as a dialogue model for the first user.
Further, the second reply voice determination module 403 includes:
the result text output unit is used for inputting second intention data and third slot data in a second text corresponding to the second question voice into the dialogue model and acquiring a result text output by the dialogue model;
and the result text processing unit is used for determining a second reply voice corresponding to the second question voice according to the result text.
Further, the result text processing unit includes:
the query subunit is configured to obtain, through querying a database, a speech corresponding to the third intention data in the result text, and use the queried speech as a second reply speech corresponding to the second question speech;
the database stores at least one correspondence between intention data and voice, and the at least one correspondence is determined according to the correspondence between the first intention data and the first reply voice.
Further, the result text processing unit includes:
and the voice generating subunit is used for acquiring pre-stored voice characteristic data of the first user, generating voice according to the voice characteristic data and the result text, and taking the generated voice as second reply voice corresponding to the second question voice.
The voice interaction device provided by the embodiment of the application can execute the voice interaction method provided by any embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 5, fig. 5 is a block diagram of an electronic device for implementing a voice interaction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.
The memory 502 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the voice interaction method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the voice interaction method provided herein.
The memory 502, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiment of the present application (for example, the first user information obtaining module 401, the dialogue model building module 402, the second reply voice determining module 403, and the broadcasting module 404 shown in fig. 4). The processor 501 executes various functional applications of the server and performs data processing, i.e., implements the voice interaction method in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 502.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created from use of the electronic device for voice interaction, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 502 optionally includes memory located remotely from the processor 501, which may be connected to the voice interaction electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the voice interaction method may further comprise: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means; fig. 5 illustrates the connection by a bus as an example.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for voice interaction, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or another input device. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited thereto as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of voice interaction, the method comprising:
when a first user performs voice interaction with the intelligent sound box, extracting key data in a first text corresponding to a first question voice of the intelligent sound box, and recording a first reply voice of the first user corresponding to the first question voice;
establishing a dialogue model for a first user based on the key data and the first reply voice;
receiving a second question voice input by a second user, and determining a second reply voice corresponding to the second question voice based on the dialogue model, wherein the second reply voice is the historical reply voice of the first user;
and broadcasting the second reply voice.
2. The method of claim 1, wherein the key data comprises first intent data in a first text;
correspondingly, the establishing of the dialogue model for the first user based on the key data and the first reply voice comprises:
establishing a corresponding relation between the first intention data and the first reply voice, and establishing a dialogue model for the first user based on the corresponding relation.
3. The method of claim 2, wherein the determining a second reply voice corresponding to the second question voice based on the dialogue model comprises:
and determining second intention data in a second text corresponding to the second question voice, determining a first reply voice corresponding to the second intention data according to the dialogue model, and taking the first reply voice as a second reply voice corresponding to the second question voice.
4. The method of claim 1, wherein the key data comprises first intent data and first slot data in a first text;
correspondingly, the establishing of the dialogue model for the first user based on the key data and the first reply voice comprises:
determining second slot data in a third text corresponding to the first reply voice;
training a preset question-answer model according to the first text, the third text, the first intention data, the first slot data and the second slot data;
and taking the question-answer model obtained by training as a dialogue model aiming at the first user.
5. The method of claim 4, wherein the determining a second reply voice corresponding to the second question voice based on the dialogue model comprises:
inputting second intention data and third slot data in a second text corresponding to the second question voice into the dialogue model, and acquiring a result text output by the dialogue model;
and determining a second reply voice corresponding to the second question voice according to the result text.
6. The method of claim 5, wherein determining a second reply voice corresponding to a second question voice from the result text comprises:
obtaining a voice corresponding to third intention data in the result text by querying a database, and taking the queried voice as a second reply voice corresponding to the second question voice;
the database stores at least one correspondence between intention data and voice, and the at least one correspondence is determined according to the correspondence between the first intention data and the first reply voice.
7. The method of claim 5, wherein determining a second reply voice corresponding to a second question voice from the result text comprises:
and acquiring pre-stored voice feature data of the first user, generating voice according to the voice feature data and the result text, and taking the generated voice as second reply voice corresponding to the second question voice.
8. A voice interaction apparatus, comprising:
the first user information acquisition module is used for extracting key data in a first text corresponding to a first question voice of the intelligent sound box and recording a first reply voice of the first user corresponding to the first question voice when the first user performs voice interaction with the intelligent sound box;
a dialogue model establishing module for establishing a dialogue model for a first user based on the key data and the first reply voice;
the second reply voice determining module is used for receiving a second question voice input by a second user, and determining a second reply voice corresponding to the second question voice based on the dialogue model, wherein the second reply voice is the historical reply voice of the first user;
and the broadcasting module is used for broadcasting the second reply voice.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN201910903910.3A 2019-09-24 2019-09-24 Voice interaction method, device, equipment and medium Pending CN110633357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910903910.3A CN110633357A (en) 2019-09-24 2019-09-24 Voice interaction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910903910.3A CN110633357A (en) 2019-09-24 2019-09-24 Voice interaction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN110633357A true CN110633357A (en) 2019-12-31

Family

ID=68972847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910903910.3A Pending CN110633357A (en) 2019-09-24 2019-09-24 Voice interaction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110633357A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180082184A1 (en) * 2016-09-19 2018-03-22 TCL Research America Inc. Context-aware chatbot system and method
CN108320738A (en) * 2017-12-18 2018-07-24 上海科大讯飞信息科技有限公司 Voice data processing method and device, storage medium, electronic equipment
CN108897848A (en) * 2018-06-28 2018-11-27 北京百度网讯科技有限公司 Robot interactive approach, device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李德毅 et al.: "人工智能导论 (Introduction to Artificial Intelligence)", 31 August 2018 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021190225A1 (en) * 2020-03-27 2021-09-30 华为技术有限公司 Voice interaction method and electronic device
CN111966803A (en) * 2020-08-03 2020-11-20 深圳市欢太科技有限公司 Dialogue simulation method, dialogue simulation device, storage medium and electronic equipment
CN111966803B (en) * 2020-08-03 2024-04-12 深圳市欢太科技有限公司 Dialogue simulation method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN111324727B (en) User intention recognition method, device, equipment and readable storage medium
JP2019102063A (en) Method and apparatus for controlling page
CN112259072A (en) Voice conversion method and device and electronic equipment
CN111177355B (en) Man-machine conversation interaction method and device based on search data and electronic equipment
CN111105800B (en) Voice interaction processing method, device, equipment and medium
CN104866275B (en) Method and device for acquiring image information
CN112100352A (en) Method, device, client and storage medium for interacting with virtual object
US11749255B2 (en) Voice question and answer method and device, computer readable storage medium and electronic device
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN111159380B (en) Interaction method and device, computer equipment and storage medium
CN112908318A (en) Awakening method and device of intelligent sound box, intelligent sound box and storage medium
CN111709362A (en) Method, device, equipment and storage medium for determining key learning content
CN112382287A (en) Voice interaction method and device, electronic equipment and storage medium
CN112000781A (en) Information processing method and device in user conversation, electronic equipment and storage medium
CN111259125A (en) Voice broadcasting method and device, intelligent sound box, electronic equipment and storage medium
CN111177462B (en) Video distribution timeliness determination method and device
CN110633357A (en) Voice interaction method, device, equipment and medium
CN112382279B (en) Voice recognition method and device, electronic equipment and storage medium
CN112650844A (en) Tracking method and device of conversation state, electronic equipment and storage medium
CN111918073A (en) Management method and device of live broadcast room
CN112652311B (en) Chinese and English mixed speech recognition method and device, electronic equipment and storage medium
CN113160782B (en) Audio processing method and device, electronic equipment and readable storage medium
CN111681052B (en) Voice interaction method, server and electronic equipment
CN111581347B (en) Sentence similarity matching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination