CN112528004A - Voice interaction method, voice interaction device, electronic equipment, medium and computer program product - Google Patents

Voice interaction method, voice interaction device, electronic equipment, medium and computer program product Download PDF

Info

Publication number: CN112528004A
Application number: CN202011551823.5A
Authority: CN (China)
Prior art keywords: output, user, voice, answer, speech
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Other languages: Chinese (zh)
Inventor: 冯博豪
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, priority to CN202011551823.5A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3343 - Query execution using phonetics
    • G06F16/338 - Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for voice interaction, which relate to the technical field of artificial intelligence, and in particular, to natural language processing and computer vision. A method of voice interaction may comprise: acquiring a first voice input; and controlling the terminal to output the first voice output based at least in part on the first voice input and environmental information, wherein the environmental information is dynamically maintained by image analysis of images captured at the terminal.

Description

Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular, to natural language processing and computer vision, and more particularly, to a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for voice interaction.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
Voice interaction may occur in a variety of products, such as intelligent voice assistants, smart speakers, and intelligent shopping guides. Through voice interaction, users can browse the internet by voice, request songs, check the weather, follow current events, and so on.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for voice interaction.
According to an aspect of the present disclosure, there is provided a method of voice interaction, including: acquiring a first voice input; and controlling the terminal to output the first voice output based at least in part on the first voice input and the environmental information. Wherein the environmental information is dynamically maintained by image analysis of images captured at the terminal.
According to another aspect of the present disclosure, there is provided an apparatus for voice interaction, including: the image acquisition module is used for acquiring images; an image analysis module for performing image analysis on the acquired image to dynamically maintain environmental information; and a voice interaction module to, in response to receiving a first voice input, output a first voice output based at least in part on the environmental information.
According to another aspect of the present disclosure, there is provided an electronic device comprising a camera, a speaker, a processor, and a memory storing a program, the program comprising instructions that, when executed by the processor, cause the electronic device to perform a voice interaction method according to an embodiment of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium storing a program, the program comprising instructions that, when executed by a processor of an electronic device, instruct the electronic device to perform a voice interaction method according to an embodiment of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a voice interaction method according to embodiments of the present disclosure.
According to one or more embodiments of the present disclosure, the capability of voice interaction with a user can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a voice interaction method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of the attribute mapping step according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a voice interaction method according to another embodiment of the present disclosure;
FIG. 5 shows a block diagram of a voice interaction device, according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a voice interaction device, according to another embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the voice interaction method to be performed.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 for voice interaction. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various Mobile operating systems, such as Microsoft Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system, intended to overcome the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The data store 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The data store 130 may be of different types. In certain embodiments, the data store used by the server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
A voice interaction method 200 according to an embodiment of the present disclosure is described below with reference to fig. 2.
At step S201, a first speech input is acquired. The first speech input may be a speech input from a user, such as a sentence spoken by the user, or the like.
At step S202, the terminal is controlled to output a first voice output based at least in part on the first voice input and environmental information, wherein the environmental information is dynamically maintained by image analysis of images captured at the terminal.
Through the method 200 described above, visual capabilities can be combined with voice interaction. Specifically, the environmental information can be dynamically maintained using captured visual information and image analysis performed on it, so that the information obtained by the image analysis can be converted into voice output with the help of the visual capability.
The first speech output may be a response to the first speech input. For example, the first speech input may be a question and the first speech output may be a response to it. Alternatively, the first speech output may be any type of sentence, such as a statement, a phrase or word, an active question, a prompt, or a sentence intended to attract the user's attention and guide a newly initiated conversation. As part of the voice interaction, the first voice output is generated based on the environmental information. For example, the first speech output may be an active question based on the environmental information, such as "Do you need to turn the desk lamp on/off?"; or, if the environmental information indicates that there is an apple on the desktop (and optionally the color, size, freshness, etc. of the apple), it may be "There is a red apple on the desktop." As another example, the terminal may be controlled to output a guiding statement based on the environmental information combined with other information such as user habits or user status; for instance, if the environmental information indicates that a window shade is open and it is clear outside, the first voice output may be "The weather is very nice today, would you like to chat?" It is to be understood that the present disclosure is not so limited, and the first speech output may be any speech output obtained based on the dynamically maintained environmental information.
By collecting images at the terminal and performing various common image analyses on them, environmental information about the environment in which the terminal is located can be dynamically maintained. Here, "dynamically maintained" means that both image acquisition and image analysis can be dynamic. For example, images may be acquired periodically (e.g., every few hours, daily, or every few days), upon detecting environmental changes, moving objects, or changes in lighting, upon a change in position or orientation, or based on other criteria (e.g., upon detecting sound). Alternatively, the environmental information base may be updated periodically, or based on other criteria such as those described above, by performing image analysis on images in an image repository. Furthermore, as will be described in detail below, updating of the image analysis model and of the information base generated by it may also be triggered by self-learning mechanisms, user error correction, and the like.
The terminal may comprise one or more cameras, which may, for example, take pictures from different angles and thus cover the whole indoor space. The cameras can be used to photograph indoor items including furniture, sofas, animals, and the like. Scenes or objects in the images captured at the terminal can be analyzed and identified, and information can be extracted from them. The terminal may include a module capable of voice interaction. The method 200 may, for example, generate the speech output directly at the terminal using computing resources at the terminal and output it through the terminal's voice interaction module. Alternatively, the analysis of the first voice input and the computation of the first voice output may be performed at a server or in the cloud, with the server or cloud controlling the voice output of the terminal. "Controlling the terminal to output" covers both scenarios. According to the method 200, more intelligent voice interaction can be realized by making use of the environmental information around the user.
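As a rough illustration of this flow, the following Python sketch shows steps S201-S202 under the assumption that the environment information is stored as simple (entity, attribute) pairs; the keyword-matching lookup is a stand-in for the question-answering model described later, not the disclosed implementation.

```python
# A minimal, self-contained sketch of steps S201-S202. The environment
# information format and the keyword-matching lookup are illustrative
# assumptions, not part of the disclosure.

ENVIRONMENT_INFO = {
    ("apple", "color"): "red",
    ("apple", "position"): "on the table",
}

def first_voice_output(first_voice_input: str) -> str:
    """S201/S202: take the recognized user utterance and build a reply
    from the dynamically maintained environment information."""
    for (entity, attribute), value in ENVIRONMENT_INFO.items():
        if entity in first_voice_input and attribute in first_voice_input:
            return f"The {attribute} of the {entity} is {value}."
    return "Sorry, I could not find that in the environment information."

print(first_voice_output("what color is the apple"))  # -> "... is red."
```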
According to some embodiments, the image analysis comprises at least one of the following to obtain semantic information: target detection, instance segmentation, object ranging, character recognition, and image classification. This provides a way of maintaining environmental information using visual capabilities: information such as object types, distances, and text in the surrounding environment can be obtained.
All objects in the indoor space may be detected using a target detection algorithm. After detection, the detection results are converted into semantic information, for example, "two apples on a table".
An instance segmentation algorithm may be used as a supplement to target detection. Instance segmentation can segment objects in an image more finely than target detection. For example, through instance segmentation, examples of the information that can be obtained are "two apples on a table, a sticker on an apple, a wall beside the table, a wooden floor", and the like.
For example, the same object can be photographed by a plurality of cameras, and the distance between indoor objects and the cameras can be calculated using the parallax between the cameras and the principle of triangulation. For example, after object ranging, examples of the information that can be obtained are "the apple is 1 meter from the ground and 1 meter from the table", and the like.
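As an illustration of the triangulation principle mentioned here, the following sketch computes the depth of a point from its disparity between two cameras; the focal length, baseline, and disparity values are illustrative assumptions.

```python
# A small sketch of object ranging by stereo parallax, assuming two
# horizontally displaced cameras with known focal length and baseline.
# Depth follows the standard triangulation relation Z = f * B / d.

def stereo_depth_m(focal_length_px: float,
                   baseline_m: float,
                   disparity_px: float) -> float:
    """Distance of an object from the cameras, from its pixel disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# Example: 800 px focal length, 10 cm baseline, 80 px disparity -> 1.0 m,
# which could then be stored as, e.g., "apple ||| distance ||| 1 meter".
print(stereo_depth_m(800.0, 0.10, 80.0))
```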
The detected objects may be subjected to character recognition (OCR). The detected character information may then be saved as an attribute of the detected object. For example, after an object detection model yields "a magazine is on the table", character recognition can determine that it is, for example, the "Time" or "Reader" periodical.
The objects may be classified using an image classification model. The image classification may be fine-grained, e.g., refined to the name of the object and the category to which it belongs. Examples of the information that can be obtained are "the object is a Lenovo-brand computer, belonging to electronic products" or "the object is an LV bag, belonging to luxury goods". The classification can also be computed and analyzed in combination with the content recognized by the OCR character recognition model.
Results obtained by one or more of the models above (target detection, instance segmentation, object ranging, character recognition, image classification, and the like) may be aggregated. After aggregation, integrated data may be formed, for example in the form of triples such as:
"apple ||| color ||| red"
"apple ||| position ||| on the table"
The results of the analysis may be stored, for example, in an information base. For example, the information obtained from the analysis may be saved in the form of a knowledge graph. The information base may be updated and maintained dynamically or periodically.
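The aggregation into triples and the information base update might be sketched as follows; the per-model input formats are assumptions made for illustration, since the disclosure does not fix them.

```python
# A minimal sketch of the information-integration step: results from the
# individual analysis models are merged into (entity, attribute, value)
# triples and kept in a simple in-memory information base.

from collections import defaultdict

def aggregate_to_triples(detections, ocr_results, distances):
    """Merge per-model outputs into triples such as
    ('apple', 'color', 'red') or ('apple', 'position', 'on the table')."""
    triples = []
    for obj in detections:                      # from detection/segmentation
        for attribute, value in obj.get("attributes", {}).items():
            triples.append((obj["name"], attribute, value))
    for name, text in ocr_results.items():      # from character recognition
        triples.append((name, "text", text))
    for name, meters in distances.items():      # from object ranging
        triples.append((name, "distance", f"{meters} meters"))
    return triples

def update_information_base(info_base, triples):
    """Dynamically maintained information base: entity -> {attribute: value}."""
    for entity, attribute, value in triples:
        info_base[entity][attribute] = value
    return info_base

info_base = defaultdict(dict)
triples = aggregate_to_triples(
    detections=[{"name": "apple", "attributes": {"color": "red",
                                                 "position": "on the table"}}],
    ocr_results={"magazine": "Time"},
    distances={"apple": 1.0},
)
update_information_base(info_base, triples)
print(dict(info_base))
```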
Human-computer interaction, such as receiving a conversation and initiating a conversation, may be performed using the information obtained in the steps described above. Based on the user's question, the image analysis results can be matched against the objects in the space, and the resulting information is finally returned to the user as speech. According to some embodiments, the first speech input comprises a query for an attribute of an object in the surrounding environment, and the first speech output comprises the attribute value of the object, the attribute value coming from the environmental information. The attributes of objects in the surrounding environment can thus be output as speech through image acquisition and image analysis. As an example, the query may be "What color is the apple?" Such a conversation can be used for chatting with the user and, more advantageously, for scenarios in which children or visually impaired people learn about their surroundings, and the like.
During the interaction, a question-answering model (also referred to as a QA model) may be invoked. The question-answering model may be configured to match objects in the space according to the user's question and finally obtain the requested information. The QA question-answering process can be split mainly into a named entity recognition step and an attribute mapping step, where the goal of the entity recognition step is to find the name of the entity queried in the question, and the goal of the attribute mapping step is to find the relevant attribute queried in the question.
The named entity recognition step can be performed, for example, using a BERT + BiLSTM + CRF approach: the BERT model converts the sentence into word vectors, the BiLSTM performs sequence labeling, and the CRF selects the most reasonable sequence from the candidate label sequences produced by the BiLSTM.
Named entity recognition models according to the present disclosure can be trained in various ways. For example, a model may be pre-trained with encyclopedia data or any other data. Such models may also be continuously trained and updated through user history, as will be described below. The trained model is able to recognize the subject of a sentence. When an entity abbreviation is used in the user's question, the model can also complete it with learned knowledge, for example knowledge from encyclopedias. The model can determine the subject of the sentence; for example, in "What color is the apple?", the subject is determined to be the apple.
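A sketch of such a BERT + BiLSTM + CRF tagger is shown below, using PyTorch, Hugging Face transformers, and the third-party pytorch-crf package; the tag set size, hidden size, and checkpoint name are assumptions for illustration, since the disclosure only names the three components.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int = 5, lstm_hidden: int = 256,
                 bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)    # sentence -> word vectors
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)  # sequence labeling
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)           # picks the best tag sequence

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        hidden, _ = self.lstm(hidden)
        emissions = self.emissions(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # training: negative log-likelihood of the gold tag sequence
            return -self.crf(emissions, tags, mask=mask)
        # inference: most reasonable tag sequence (e.g. B-ENT/I-ENT/O tags)
        return self.crf.decode(emissions, mask=mask)
```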
After the named entity recognition step, the subject of the question has been determined. Since many triples may be associated with the current subject, an attribute mapping step is required. The attribute mapping process may employ a pre-trained BERT model. Through the attribute mapping step, the attribute that best matches the question can be found.
Fig. 3 shows the process of attribute mapping according to the present example. In this example, the result "the color of the apple is red" may be returned through named entity recognition and attribute mapping.
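One way to realize the attribute mapping step is to score each candidate attribute of the recognized entity against the question with a BERT sentence-pair classifier, as in the sketch below; the checkpoint name and candidate attributes are illustrative assumptions, and the classifier would need to be fine-tuned on question/attribute pairs before the scores become meaningful.

```python
import torch
from typing import List
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese",
                                                       num_labels=2)
model.eval()

def map_attribute(question: str, candidate_attributes: List[str]) -> str:
    """Return the attribute whose pairing with the question scores highest."""
    scores = []
    for attribute in candidate_attributes:
        inputs = tokenizer(question, attribute, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.append(logits[0, 1].item())   # score of the "match" class
    return candidate_attributes[scores.index(max(scores))]

# For "what color is the apple?", the triples stored for "apple" might offer
# the candidates below; "color" should win once the model is fine-tuned.
print(map_attribute("what color is the apple?", ["color", "position", "distance"]))
```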
According to some embodiments, the voice interaction method may further include: after outputting the first speech output, outputting a second speech output in response to determining that the user exhibits interest above a threshold. Thus, based on the user's interest, further speech is output to the user, increasing the intelligence of the voice interaction system. By adopting emotion recognition, the conversation content can be adjusted according to the recognized user emotion. The threshold here may be a preset threshold, a threshold learned from interaction with the user, or the like. According to some embodiments, the second speech output includes knowledge about a keyword, or includes a query asking whether the user wishes to learn about the keyword. Outputting knowledge of associated content, or asking whether the user wants to know related knowledge, increases the active dialogue capability. For example, the user may be further guided after the question the user asked has been answered.
Emotion recognition can be applied throughout the steps of the dialog. For example, after the user asks "What brand is the computer?" and the voice interaction method outputs "It is a Lenovo-brand computer", the method may further output "Would you like to know about the founder of Lenovo?" and the like. At this point, if it is detected that the user shows no interest, for example the user shows a hesitant expression, the method may move on to other topics.
The emotion recognition may include at least one of expression recognition and speech emotion recognition.
Expression recognition may include analyzing the user's expression from captured images or videos (e.g., images captured at different times) to determine changes in the person's mood and psychological state, for example whether the user is happy, angry, surprised, neutral, disgusted, sad, or the like. Expression recognition can be implemented with various algorithms; in particular, the image classification model VGG19 can be adopted, with which expression recognition can achieve good results. The expression recognition model may be pre-trained through various training methods. Through expression recognition, the algorithm can understand human feelings, improving the human-computer interaction experience.
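A sketch of such a VGG19-based expression classifier, using torchvision, is shown below; the six-emotion label set follows the categories mentioned above, while the preprocessing and checkpoint are common-practice assumptions. The network would need fine-tuning on expression data before its outputs are meaningful.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

EMOTIONS = ["happy", "angry", "surprised", "neutral", "disgusted", "sad"]

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, len(EMOTIONS))   # replace the final layer
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def recognize_expression(face_image) -> str:
    """face_image: a PIL.Image of the user's face cropped from a captured frame."""
    batch = preprocess(face_image).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)
    return EMOTIONS[int(logits.argmax(dim=1))]   # needs fine-tuning on expression data
```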
Speech emotion recognition may employ a dual recurrent neural network (RNN) to encode information in the audio and text sequences and combine information from these sources to predict an emotion classification. By considering both the text and the speech, the algorithm can make fuller use of the information contained in the data. The speech emotion recognition model may include one or more of an encoder for speech, an encoder for the speech-recognized text, and an attention-based multi-modal fusion network sub-module. The speech encoder may be configured to obtain low-dimensional frame-based features of the speech and use a BiLSTM to produce a high-dimensional frame-based representation of the audio. The encoder sub-module for the speech-recognized text may be configured to obtain word vectors for the words and use a BiLSTM to produce a high-dimensional word-based representation of the ASR-recognized text. The attention-based multi-modal fusion network sub-module may be configured to dynamically obtain, based on the attention mechanism, the weight of each word's text feature and the feature weight of each speech frame, obtain aligned features for each word by weighted summation, perform feature fusion using a BiLSTM, and perform emotion classification using a max pooling layer and a fully connected layer. It will be appreciated that the present disclosure is not so limited, and other algorithms and models for recognizing speech emotion are applicable.
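The dual-RNN fusion model described above might be sketched as follows in PyTorch; all dimensions, the vocabulary size, and the number of emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechEmotionNet(nn.Module):
    def __init__(self, audio_dim=40, vocab=30000, emb=300, hidden=128, classes=6):
        super().__init__()
        self.audio_enc = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab, emb)
        self.text_enc = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.fusion = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, classes)

    def forward(self, frames, token_ids):
        # frames: (B, T_audio, audio_dim); token_ids: (B, T_text)
        a, _ = self.audio_enc(frames)                    # (B, T_audio, 2H)
        t, _ = self.text_enc(self.embed(token_ids))      # (B, T_text, 2H)
        # attention: each word attends over audio frames to get aligned features
        weights = torch.softmax(t @ a.transpose(1, 2), dim=-1)   # (B, T_text, T_audio)
        aligned_audio = weights @ a                      # (B, T_text, 2H)
        fused, _ = self.fusion(torch.cat([t, aligned_audio], dim=-1))
        pooled = fused.max(dim=1).values                 # max pooling over words
        return self.classify(pooled)                     # emotion logits
```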
According to some embodiments, the voice interaction method may further include, before receiving the first voice input, outputting a third voice output in response to detecting an image of the user, the third voice output guiding the start of a conversation with the user. Actively initiating a dialog based on detecting the user increases the intelligence of the voice interaction. For example, a conversation may be initiated when the user is detected (e.g., when the user comes home from work). The dialog content can be further adjusted according to the user's emotion. During this process, the user profile and points of interest may be analyzed through communication with the user, and the model and user preference settings may be adjusted accordingly. For example, through continuous self-learning, topics of interest to the user can be recommended first in the next conversation.
Optionally, a model self-learning function may also be included to perform self-learning. According to some embodiments, the voice interaction method may further include: after outputting the first speech output, determining correctness of an answer included in the first speech output; and in response to the answer being incorrect, updating at least one of the environmental information or model parameters of the image analysis. Thus, the voice interaction method may have an active self-learning capability, and may be able to determine whether the output answer is correct, and thereby update the information base and/or the image analysis algorithm.
Whether the returned result is correct can be judged based at least on the user's expression and reply.
According to some embodiments, determining the correctness of the answer includes determining the correctness of the answer based on an expression of the user. Determining answer correctness from the expression combines visual capability with model self-learning, so that the collected information is more complete and the updated model is more accurate. For example, the answer may be determined to be correct based on a satisfied expression of the user, or incorrect based on a dissatisfied or confused expression of the user.
According to some embodiments, determining the correctness of the answer comprises: outputting a fourth voice output, the fourth voice output comprising a query as to whether the answer is correct; and receiving a second voice input indicating whether the answer is correct. Confirming with the user by voice whether the answer is correct increases active interaction with the user and can improve the accuracy of the model. For example, after receiving a user question, invoking the question-answering module to search for an answer, and feeding the answer back to the user by voice, a question may be added: "Is the answer correct?" Thereafter, the user's voice response may be received and analyzed, for example to determine that the previously output answer was correct upon receiving "the answer is correct" or "yes" from the user.
If a correct answer cannot be obtained during the answer search, voice interaction with the user can be carried out. Furthermore, emotion recognition may be used to analyze the user's emotion in real time throughout the interaction to adjust the content, speech rate, pitch, tone, etc. of the conversation.
According to some embodiments, the voice interaction method may further include: responsive to the answer being incorrect, outputting a fifth speech output, the fifth speech output comprising a query for the correct answer; receiving a third voice input indicating a correct answer to the first voice input; and updating at least one of the environmental information or the model parameters of the image analysis using the obtained correct answer. After the answer is determined to be incorrect, the user can be actively asked for the correct answer, which increases the intelligence of the voice system, and more accurate models/information can be obtained based on the correct answer. For example, upon determining that the returned result is incorrect, the voice question "What is the correct answer?" may be output; upon receiving the correct answer through the user's voice feedback (e.g., "the apple is green"), the result fed back by the user may be recorded and the model retrained to improve the accuracy of its recognition and classification. At the same time, the semantic information stored in the information base can be updated based on the correct answer fed back by the user.
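A minimal sketch of this correction loop is given below; the ask/listen helpers stand in for the terminal's speech output and recognized speech input and are assumptions, as is the exact set of confirmation phrases.

```python
def confirm_and_correct(entity, attribute, info_base, training_queue,
                        ask, listen):
    """ask(text) speaks a prompt; listen() returns the recognized user reply.
    info_base is a dict of dicts: entity -> {attribute: value}."""
    ask("Is the answer correct?")                          # fourth voice output
    if listen().strip().lower() in {"yes", "the answer is correct", "correct"}:
        return True                                        # nothing to update
    ask("What is the correct answer?")                     # fifth voice output
    corrected = listen()                                   # e.g. "the apple is green"
    info_base[entity][attribute] = corrected               # update environment info
    training_queue.append((entity, attribute, corrected))  # sample for retraining
    return False
```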
The following self-learning process may be employed: the image associated with the question asked in the conversation is saved, and the user's spoken answer is used as its label. Thereafter, a large number of similar training samples may be generated. Generating similar training samples may include generating a large number of picture samples, including pictures of the subject taken from multiple angles and picture samples produced by a similar-picture generation algorithm, and so on.
After this generation step, enough picture samples and their labels are available. These data can be used to train the image classification, target detection, and text recognition models.
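Retraining on the accumulated samples might look like the following sketch, where the saved image is the input and the label index is derived from the user's spoken correction; the dataset layout and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class CorrectedSamples(Dataset):
    """Pairs of (image tensor, label index) built from saved conversation data."""
    def __init__(self, images, labels):
        self.images, self.labels = images, labels
    def __len__(self):
        return len(self.images)
    def __getitem__(self, i):
        return self.images[i], self.labels[i]

def retrain(model: nn.Module, dataset: Dataset, epochs: int = 3, lr: float = 1e-4):
    """Fine-tune a classifier (e.g. the VGG19 expression/image model above)
    on the corrected samples collected during conversations."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```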
In daily use, the present disclosure trains in real time. As communication with the user increases, the number of questions the method can answer grows, the answers become more and more accurate, and the analysis of the user's emotion becomes more and more accurate.
Through the embodiments of the present disclosure, visual capability can be provided to an intelligent voice assistant, so that the assistant can "observe the environment" and, in response to the user's questions, output answers that follow natural language rules and everyday logic. The present disclosure may incorporate artificial intelligence techniques in a variety of ways, such as object recognition, object detection, expression recognition, speech emotion recognition, and natural language processing. Depending on the embodiment, the method may focus on some of the objects in the indoor environment or on the indoor environment as a whole, and may further recognize the user's emotion from the user's expression, voice content, and the like. In addition, as communication with the user continues, self-learning can be carried out continuously, so that the recognized objects and emotions become more and more accurate.
According to the embodiments of the present disclosure, both active dialogs and passive dialogs can be realized. A passive dialog may be triggered upon detection of a voice command. An active dialog may, for example, be initiated toward the user upon detecting the user's presence, in particular through image collection and analysis. The active dialog may be a recommendation based on the information base and the user's history information. For example, it may be learned from the history information that the user frequently asks questions such as "What time is it?" and "What's the news today?" after coming home; thus, upon detecting that the user is home, the voice output "Would you like to know about today's news?" may be initiated.
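A small sketch of choosing such an active-dialog prompt from history is given below; the history data and phrasing are illustrative assumptions.

```python
from collections import Counter
from typing import List

def active_dialog_prompt(history: List[str]) -> str:
    """history: questions the user has asked in this situation before."""
    most_common, _ = Counter(history).most_common(1)[0]
    return f"Would you like to know: {most_common}"

home_history = ["what's the news today?", "what time is it?",
                "what's the news today?"]
print(active_dialog_prompt(home_history))
# -> "Would you like to know: what's the news today?"
```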
A flow diagram of a method 400 according to another embodiment of the present disclosure is described below with reference to fig. 4.
In step S401, a surrounding image is captured by an image capturing device such as a camera. The image may include things in the environment. Step S401 may further include performing preliminary processing on the captured image.
At step S402, the acquired image is analyzed and a knowledge base is maintained accordingly.
At step S403, a voice interaction is performed with the user. In the voice interaction process, the question of the user can be answered according to the content of the knowledge base and the environmental information acquired by the image analysis module.
At step S404, the emotion of the user is recognized and analyzed during the voice interaction. The dialog content can be continuously adjusted based on the emotion recognized in real time from the user's expression, intonation, and so on.
At step S405, it is determined whether the given answer is correct. If it is determined, based on negative feedback or the like, that the answer given during the interaction was incorrect, the method proceeds to S406. At step S406, the self-learning module initiates training, corrects the wrong answer, and updates the model.
If the answer is determined to be correct based on positive feedback or other means during the interaction, the corresponding question-answer information may be stored in step S407. Subsequently, based on the stored information, analysis of the user profile and the user's points of interest may be performed by an information recommendation system or the like.
With the method according to the embodiments of the present disclosure, the voice interaction apparatus or intelligent voice assistant can have expression recognition capability and adjust the conversation content according to the user's expression; it can have visual capability, detecting and identifying objects captured by the camera and completing question answering with the user; it can perform self-learning, continuously improving recognition accuracy; and it can provide a recommendation function, learning the user's points of interest through continued communication with the user so as to communicate and recommend more accurately and intelligently.
A voice interaction apparatus 500 according to an embodiment of the present disclosure is described below with reference to fig. 5. The apparatus 500 may include an image acquisition module 501, an image analysis module 502, and a voice interaction module 503. The image acquisition module 501 may be configured to acquire images. The image analysis module 502 may be configured to perform image analysis on the acquired images to dynamically maintain environmental information. The voice interaction module 503 may be configured to output a first voice output based at least in part on the first voice input and the environmental information in response to receiving the first voice input.
A functional block diagram of a voice interaction apparatus 600 according to another embodiment of the present disclosure is described with reference to fig. 6. The apparatus 600 may include an image acquisition module 601, an image analysis module 602, a voice interaction module 603, an information repository 604, a question-and-answer model 605, and the like.
Modules 601-603 may be similar to modules 501-503 described above, so descriptions of the same features are omitted. The image acquisition module 601 may include, or may be communicatively connected to, one or more cameras. The image acquisition module 601 may also contain basic image processing functions, such as image enhancement for images taken in low-light conditions. The image analysis module 602 may analyze and identify scenes or objects in the images acquired by the image acquisition module 601 and extract information from them. The voice interaction module 603 may call upon the environmental information generated by the image analysis module 602 for human-computer interaction, such as receiving a conversation and initiating a conversation. According to some embodiments, the first voice input received by the voice interaction module 603 comprises a query for an attribute of an object in the surrounding environment, and the first voice output produced or controlled by the voice interaction module 603 comprises the attribute value of the object, the attribute value coming from the environmental information.
According to some embodiments, the image analysis comprises at least one of the following to obtain semantic information: target detection, instance segmentation, object ranging, character recognition, and image classification. Optionally, the image analysis module 602 may be composed of a plurality of sub-functional modules, which may include one or more of the following: a target detection sub-module 621, an instance segmentation sub-module 622, an object ranging sub-module 623, a text recognition sub-module 624, an image classification sub-module 625, and the like. The image analysis module 602 may also include an information integration sub-module 626. The target detection sub-module 621 may be configured to perform the target detection step. The instance segmentation sub-module 622 may be configured to perform instance segmentation on the image using an instance segmentation algorithm. The object ranging sub-module 623 may call on a plurality of cameras to perform object ranging on objects in the space. The text recognition sub-module 624 may perform character recognition (OCR) on the detected objects. The image classification sub-module 625 may perform image classification on the objects using an image classification model, and may also compute and analyze in combination with the content recognized by the OCR character recognition model. The information integration sub-module 626 may serve as an aggregation module that aggregates the results obtained by one or more of the models for target detection, instance segmentation, object ranging, text recognition, image classification, and the like. After aggregation, the integrated data may be formed, for example, in the form of triples as described above. It is to be understood that the "sub-modules" here are for descriptive convenience and are merely functional examples, and that more or fewer functional partitions may be present. For example, depending on the needs of the actual scenario, only one or a few of these sub-modules may be used to implement the required image analysis functionality. Those skilled in the art will appreciate that additional image analysis algorithms may also be used to implement image analysis, and the present disclosure is not limited thereto.
Optionally, the apparatus 600 may include an information base 604, and the voice interaction module may read the real-time maintained and updated environment information from the information base 604.
Optionally, the apparatus 600 may include a question-answering module 605. The question-answering module 605 may include, for example, the question-answering model described above, and may include, for example, a named entity identification sub-module and an entity attribute mapping sub-module. The voice interaction module 603 may invoke the question-answering module 605 to complete the voice interaction process. It is to be understood that the question and answer module 605 may also be part of the voice interaction module 603 and need not be separate.
Optionally, the apparatus 600 may include an emotion recognition module 606. The emotion recognition module 606 may be functionally composed of at least one of an expression recognition submodule and a speech emotion recognition submodule. The expression recognition submodule may be configured to perform expression recognition functions such as, but not limited to, those described above, and the speech emotion recognition submodule may be configured to perform speech emotion recognition functions such as, but not limited to, those described above. According to some embodiments, the apparatus 600 may further comprise a unit for outputting a second speech output in response to determining, after outputting the first speech output, that the user exhibits interest above a threshold. According to some embodiments, determining that the user exhibits interest above a threshold includes at least one of detecting an interested expression of the user or detecting an interested tone of the user. According to some embodiments, the second speech output includes keywords associated with the first speech input or the first speech output. According to some embodiments, the second speech output includes knowledge about the keyword, or includes a query asking whether the user wishes to learn about the keyword.
Optionally, the apparatus 600 may include a self-learning module 607 configured to perform self-learning. The self-learning module 607 may determine whether the feedback result is correct, for example, according to the expression and the answer of the user, and may perform feedback of the model and correction of the information base. According to some embodiments, the apparatus 600 further includes means for determining correctness of an answer included in the first voice output after the first voice output is output, and means for updating at least one of the environment information or the model parameters of the image analysis in response to the answer being incorrect. According to some embodiments, the unit for determining the correctness of the answer is configured to determine the correctness of the answer based on the expression of the user. According to some embodiments, the means for determining the correctness of the answer comprises: a unit that outputs a fourth voice output including an inquiry as to whether the answer is correct; and a unit receiving a second voice input indicating whether the answer is correct. According to some embodiments, the apparatus 600 further comprises: a unit that outputs a fifth speech output in response to the answer being incorrect, the fifth speech output including a query for a correct answer; a unit that receives a third voice input indicating a correct answer to the first voice input; and means for updating at least one of the environment information or model parameters of the image analysis using the obtained correct answer.
According to some embodiments, the apparatus 600 further comprises means for outputting a third speech output in response to detecting the image of the user before receiving the first speech input, the third speech output for guiding a start of a conversation with the user.
It is to be understood that the division into modules here is for descriptive convenience, and the modules are merely functional examples; the functional modules may be combined with or included in one another, and one or more of them may be omitted. Each of these modules may be implemented as a plurality of distributed modules, or two or more of them may be combined. Those skilled in the art will appreciate that no particular physical module needs to actually be present in order to implement the methods of the embodiments of the present disclosure.
According to an embodiment of the present disclosure, an electronic device is also provided. The electronic device may include a camera, a speaker, a processor, and a memory storing a program comprising instructions that, when executed by the processor, cause the electronic device to perform a voice interaction method according to an embodiment of the present disclosure. The camera may capture image or video information. The speaker may output an acoustic signal, such as a spoken statement. According to some embodiments, an electronic device according to the present disclosure may be a smart speaker. The smart speaker thus has visual capability and can interact with a user based on that capability.
There is also provided, in accordance with an embodiment of the present disclosure, a computer-readable storage medium storing a program, the program comprising instructions that, when executed by a processor of an electronic device, instruct the electronic device to perform a voice interaction method in accordance with an embodiment of the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, a computer program product comprising computer instructions which, when executed by a processor, implement a voice interaction method in accordance with an embodiment of the present disclosure.
Referring to fig. 7, a block diagram of the structure of an electronic device 700, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth (TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
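As a hedged example of such a split, the sketch below keeps the environment information on a back-end server that the terminal (client) queries; it uses only the Python standard library, and the endpoint path /ask, the port, and the payload fields are illustrative assumptions rather than anything specified in the disclosure.

```python
# Minimal client-server sketch: the server holds the environment information and
# answers attribute queries; the terminal would POST recognized speech here and
# synthesize the returned text as the first voice output.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In a real system this store would be maintained dynamically by image analysis.
ENVIRONMENT_INFO = {"mug": {"color": "blue"}}


class AskHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/ask":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        query = json.loads(self.rfile.read(length))  # e.g. {"object": "mug", "attribute": "color"}
        value = ENVIRONMENT_INFO.get(query["object"], {}).get(query["attribute"])
        body = json.dumps({"answer": value or "unknown"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), AskHandler).serve_forever()
```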
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the methods, systems, and apparatus described above are merely exemplary embodiments or examples and that the scope of the present disclosure is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. It is to be understood that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (20)

1. A method of voice interaction, comprising:
acquiring a first voice input; and
controlling a terminal to output a first voice output based at least in part on the first voice input and environmental information,
wherein the environmental information is dynamically maintained by image analysis of images captured at the terminal.
2. The method of claim 1, wherein the first speech input comprises a query for attributes of an object in a surrounding environment, and wherein the first speech output comprises attribute values of the object, the attribute values being from the environmental information.
3. The method of claim 1 or 2, further comprising:
after outputting the first voice output, in response to determining that the user exhibits an interest above a threshold, controlling the terminal to output a second voice output.
4. The method of claim 3, wherein determining that a user exhibits an interest above a threshold comprises at least one of detecting an expression of interest of the user or detecting an intonation of interest of the user.
5. The method of claim 3, wherein the second speech output includes a keyword associated with the first speech input or the first speech output.
6. The method of claim 5, wherein the second speech output comprises knowledge related to the keyword or a query whether knowledge related to the keyword is desired.
7. The method of claim 1 or 2, further comprising:
controlling the terminal to output a third voice output for guiding a start of a conversation with the user in response to detecting the image of the user before receiving the first voice input.
8. The method of claim 2, further comprising:
after outputting the first speech output, determining correctness of the answer included in the first speech output;
in response to the answer being incorrect, updating at least one of the environmental information or model parameters of the image analysis.
9. The method of claim 8, wherein determining the correctness of the answer comprises determining the correctness of the answer based on an expression of a user.
10. The method of claim 8, wherein determining the correctness of the answer comprises:
outputting a fourth voice output comprising a query as to whether the answer is correct; and
receiving a second voice input indicating whether the answer is correct.
11. The method of claim 8, further comprising:
in response to the answer being incorrect, controlling the terminal to output a fifth voice output, the fifth voice output comprising a query for a correct answer;
receiving a third voice input indicating a correct answer to the first voice input; and
updating at least one of the environmental information or the model parameters of the image analysis using the obtained correct answer.
12. The method according to claim 1 or 2, wherein the image analysis comprises at least one of the following to obtain semantic information: object detection, instance segmentation, object ranging, character recognition, and image classification.
13. An apparatus for voice interaction, comprising:
the image acquisition module is used for acquiring images;
an image analysis module for performing image analysis on the acquired image to dynamically maintain environmental information; and
a voice interaction module to output a first voice output based at least in part on the environmental information in response to receiving a first voice input.
14. The apparatus of claim 13, wherein the first voice input comprises a query for attributes of an object in a surrounding environment, and wherein the first voice output comprises attribute values of the object, the attribute values being from the environmental information.
15. The apparatus of claim 13 or 14, further comprising:
means for outputting a second speech output in response to determining that the user exhibits an interest above a threshold after outputting the first speech output.
16. The apparatus of claim 15, wherein the means for outputting the second speech output is configured to determine that the user exhibits an interest above a threshold in response to at least one of detecting an expression of interest of the user or detecting an intonation of interest of the user.
17. An electronic device comprising a camera, a speaker, a processor, and a memory storing a program, the program comprising instructions that, when executed by the processor, cause the electronic device to perform the method of any of claims 1-12.
18. The electronic device of claim 17, wherein the electronic device is a sound box.
19. A computer readable storage medium storing a program, the program comprising instructions that, when executed by a processor of an electronic device, instruct the electronic device to perform the method of any of claims 1-12.
20. A computer program product comprising computer instructions which, when executed by a processor, implement the method according to any one of claims 1-12.
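The sketch below, which is not part of the claims, illustrates under assumed names the correction loop of claims 8 to 11: the answer in the first voice output is checked, and an incorrect answer triggers a query for the correct one and an update of the environmental information (a full implementation might also adjust the image-analysis model parameters). The functions ask_user and correction_loop are hypothetical stand-ins.

```python
# Compact sketch of the correction loop; console input/output stands in for
# speech recognition and synthesis.
from typing import Dict


def ask_user(prompt: str) -> str:
    """Stand-in for outputting a voice query and transcribing the user's reply."""
    return input(prompt + " ")


def correction_loop(obj: str, attribute: str,
                    environment_info: Dict[str, Dict[str, str]]) -> None:
    answer = environment_info.get(obj, {}).get(attribute, "unknown")
    print(f"The {obj}'s {attribute} is {answer}.")  # first voice output

    # Fourth voice output (claim 10): ask whether the answer is correct.
    if ask_user("Was that correct? (yes/no)").strip().lower().startswith("y"):
        return

    # Fifth voice output and third voice input (claim 11): obtain the correct answer.
    corrected = ask_user(f"What is the {obj}'s {attribute} then?").strip()

    # Claims 8 and 11: update the environment information with the obtained answer.
    environment_info.setdefault(obj, {})[attribute] = corrected


if __name__ == "__main__":
    env = {"mug": {"color": "blue"}}
    correction_loop("mug", "color", env)
    print(env)
```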
CN202011551823.5A 2020-12-24 2020-12-24 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product Pending CN112528004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011551823.5A CN112528004A (en) 2020-12-24 2020-12-24 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product

Publications (1)

Publication Number Publication Date
CN112528004A true CN112528004A (en) 2021-03-19

Family

ID=74976291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011551823.5A Pending CN112528004A (en) 2020-12-24 2020-12-24 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product

Country Status (1)

Country Link
CN (1) CN112528004A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180052842A1 (en) * 2016-08-16 2018-02-22 Ebay Inc. Intelligent online personal assistant with natural language understanding
CN108297098A (en) * 2018-01-23 2018-07-20 上海大学 The robot control system and method for artificial intelligence driving
CN109189980A (en) * 2018-09-26 2019-01-11 三星电子(中国)研发中心 The method and electronic equipment of interactive voice are carried out with user
US20200168113A1 (en) * 2018-11-28 2020-05-28 International Business Machines Corporation Voice interactive portable computing device for learning about places of interest
CN110413841A (en) * 2019-06-13 2019-11-05 深圳追一科技有限公司 Polymorphic exchange method, device, system, electronic equipment and storage medium
CN111327772A (en) * 2020-02-25 2020-06-23 广州腾讯科技有限公司 Method, device, equipment and storage medium for automatic voice response processing
CN111696535A (en) * 2020-05-22 2020-09-22 百度在线网络技术(北京)有限公司 Information verification method, device, equipment and computer storage medium based on voice interaction

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948788A (en) * 2021-04-13 2021-06-11 网易(杭州)网络有限公司 Voice verification method, device, computing equipment and medium
CN114360535A (en) * 2021-12-24 2022-04-15 北京百度网讯科技有限公司 Voice conversation generation method and device, electronic equipment and storage medium
CN114360535B (en) * 2021-12-24 2023-01-31 北京百度网讯科技有限公司 Voice conversation generation method and device, electronic equipment and storage medium
CN114969458A (en) * 2022-06-28 2022-08-30 昆明理工大学 Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN114969458B (en) * 2022-06-28 2024-04-26 昆明理工大学 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination