CN112183105A - Man-machine interaction method and device - Google Patents

Man-machine interaction method and device Download PDF

Info

Publication number
CN112183105A
CN112183105A (application CN202010886462.3A)
Authority
CN
China
Prior art keywords
historical
command
decision
target
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010886462.3A
Other languages
Chinese (zh)
Inventor
王仁宇
杨宇庭
钱莉
黄雪妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010886462.3A priority Critical patent/CN112183105A/en
Publication of CN112183105A publication Critical patent/CN112183105A/en
Priority to PCT/CN2021/114853 priority patent/WO2022042664A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Abstract

The application discloses a human-computer interaction method and device, and relates to the field of artificial intelligence. The method comprises: receiving a target command sent by a user; generating a target decision for the target command by using a historical command, a historical decision of the historical command, and the target command, where the historical command is a command of a historical human-computer interaction task and the target command is a command of the current human-computer interaction task; and outputting the target decision.

Description

Man-machine interaction method and device
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and in particular, to a human-computer interaction method and apparatus.
Background
With the development of artificial intelligence (AI), electronic devices can communicate with users by using human-computer interaction (HCI) technologies, so that the electronic devices understand the intentions of users and complete the work users intend. At present, human-computer interaction is widely applied in many fields, for example, smart home and automatic driving. However, electronic devices do not yet interact with users very "naturally" or "smartly": the intention an electronic device obtains by natural language understanding of a received voice command is often not close to the real intention of the user, and the decision-making mechanism of the electronic device is rigid and does not give the user the optimal decision. As a result, the user experience of human-computer interaction is poor.
Disclosure of Invention
The application provides a human-computer interaction method and a human-computer interaction device. When natural language understanding is performed on a target command sent by a user in the current human-computer interaction task, the semantics of historical commands in historical human-computer interaction tasks are referred to, which assists the natural language understanding of the target command and makes its result closer to the real intention of the user; historical decisions are referred to when the system decision is made, so the target decision can be optimized according to the historical decisions, effectively improving the user experience of human-computer interaction.
To achieve the above purpose, the following technical solutions are adopted:
in a first aspect, the present application provides a human-computer interaction method, where the method is applicable to an electronic device, or the method is applicable to a human-computer interaction apparatus that can support the electronic device to implement the method, for example, the human-computer interaction apparatus includes a chip system, and the method includes: and after receiving a target command sent by a user, the electronic equipment generates a target decision of the target command by using the historical command, the historical decision of the historical command and the target command and outputs the target decision. The historical command is a command of a historical human-computer interaction task, and the target command is a command of a current human-computer interaction task. The historical commands may be commands for one or more historical users to perform multiple rounds of human-computer interaction tasks with the electronic device. For example, the historical commands may be commands for multiple historical users to perform multiple rounds of human-machine interaction tasks with the electronic device. Historical users may also include users who issued targeted commands. Therefore, when natural language understanding is carried out on a target command sent by a user in the current human-computer interaction task, the semantics of the historical command in the historical human-computer interaction task are referred to, the natural language understanding of the target command is assisted, and the natural language understanding result is more appropriate to the real intention of the user; the historical decision is referred to when the system decision is executed, the target decision can be optimized according to the historical decision, and the user experience degree of man-machine interaction is effectively improved.
In one possible implementation, generating a target decision for a target command using a historical command, historical decisions for the historical command, and the target command includes: the electronic equipment carries out weighted coding on the historical command based on the command semantic related weight to obtain historical command coding information, wherein the command semantic related weight represents the semantic related degree of the target command and the historical command; and generating a target decision according to the target command, the historical command coding information and the historical decision of the historical command.
Before weighted coding is carried out on the historical command based on the command semantic related weight, semantic coding is carried out on the target command by the electronic equipment to obtain a semantic vector of the target command; and performing similarity calculation according to the semantic vector of the target command and the semantic vector of the historical command to obtain command semantic correlation weight.
In another possible implementation manner, performing weighted coding on the historical command based on the command semantic related weight to obtain historical command coding information, including: and carrying out weighted coding on the historical command based on the command semantic related weight and the user weight to obtain historical command coding information, wherein the user weight represents the correlation degree between the user and the historical user who sends the historical command. The electronic equipment can acquire the association degree between the user and the historical user according to the voiceprint of the user, and obtain the user weight.
In another possible implementation manner, performing weighted coding on the historical command based on the command semantic correlation weight and the user weight to obtain historical command coding information, including: and carrying out weighted coding on the historical command based on the command semantic correlation weight, the user weight and the user relationship correlation weight to obtain historical command coding information, wherein the user relationship correlation weight is the preset relationship strength value of a plurality of users.
In another possible implementation, generating a target decision according to the target command, the historical command encoding information, and the historical decision of the historical command includes: performing natural language understanding on word vectors and historical command coding information of the target command by using an intention understanding model to obtain the intention and the slot position of the target command; and generating a target decision according to the intention and the slot position of the target command and the historical decision coding vector of the historical command.
Specifically, generating the target decision according to the intention and slot position of the target command and the historical decision coding vector of the historical command includes: coding the intention and the slot position of the target command to obtain a decision coding vector; carrying out weighted coding on the historical decision coding vector based on the historical decision coding vector weight to obtain historical decision coding information, wherein the historical decision coding vector weight represents the correlation degree of the decision coding vector and the historical decision coding vector; and analyzing the decision coding vector and the historical decision coding information by using a decision model to generate a target decision. The historical decision of the historical command is referred to when the system decision is executed, the decision content can be optimized according to the historical decision of the historical command, the decision information of the electronic equipment is enriched, and the user experience of man-machine interaction is effectively improved.
Before the historical decision coding vector is subjected to weighted coding based on the historical decision coding vector weight to obtain the historical decision coding information, the electronic equipment performs similarity calculation on the decision coding vector and the historical decision coding vector to obtain the historical decision coding vector weight.
In another possible implementation manner, performing weighted coding on the historical decision coding vector based on the historical decision coding vector weight to obtain the historical decision coding information includes: carrying out weighted coding on the historical decision coding vector based on the historical decision coding vector weight and the user weight to obtain historical decision coding information; or carrying out weighted coding on the historical decision coding vector based on the historical decision coding vector weight, the user weight and the user relation relevancy weight to obtain historical decision coding information.
In another possible implementation manner, the electronic device encodes the intention and the slot position of the target command to obtain a decision coding vector, including: and the electronic equipment encodes the intention and the slot position of the target command and the occupation state of the electronic equipment to obtain a decision coding vector. And the semantic vector of the historical command is enhanced by using the occupation state of the electronic equipment, so that the accuracy of system decision is further improved.
In a second aspect, the present application provides a human-computer interaction device applied to an electronic device; the electronic device comprises a voice transceiver, which is used for receiving a target command sent by a user and feeding back decision speech to the user. The human-computer interaction device comprises an acquisition unit, a processing unit, and a feedback unit. The acquisition unit is used for receiving a target command sent by a user; the processing unit is used for generating a target decision for the target command by using the historical command, the historical decision of the historical command, and the target command, wherein the historical command is a command of a historical human-computer interaction task and the target command is a command of the current human-computer interaction task; and the feedback unit is used for outputting the target decision. Therefore, when natural language understanding is performed on the target command sent by the user in the current human-computer interaction task, the semantics of historical commands in historical human-computer interaction tasks are referred to, which assists the natural language understanding of the target command and makes its result closer to the real intention of the user; historical decisions are referred to when the system decision is made, so the target decision can be optimized according to the historical decisions, effectively improving the user experience of human-computer interaction. The units may perform the corresponding functions in the method example of the first aspect; for details, refer to the detailed description of the method example, which is not repeated here.
In a third aspect, the present application provides an electronic device comprising: the system comprises at least one processor, a memory and a voice transceiver, wherein the voice transceiver is used for receiving a target command sent by a user and feeding back decision-making voice to the user, the memory is used for storing computer programs and instructions, and the processor is used for calling the computer programs and instructions and assisting the voice transceiver in executing the human-computer interaction method according to the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium comprising computer software instructions; when the computer software instructions run in the electronic device, the electronic device is caused to perform the method of the first aspect or any possible implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any possible implementation of the first aspect.
In a sixth aspect, the present application provides a chip system applied to an electronic device; the chip system comprises an interface circuit and a processor; the interface circuit and the processor are interconnected through a line; the interface circuit is configured to receive signals from a memory of the electronic device and send the signals to the processor, the signals comprising computer instructions stored in the memory; when the processor executes the computer instructions, the chip system performs the method of the first aspect or any possible implementation of the first aspect.
It should be appreciated that the description of technical features, solutions, benefits, or similar language in this application does not imply that all of the features and advantages may be realized in any single embodiment. Rather, it is to be understood that the description of a feature or advantage is intended to include the specific features, aspects or advantages in at least one embodiment. Therefore, the descriptions of technical features, technical solutions or advantages in the present specification do not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions and advantages described in the present embodiments may also be combined in any suitable manner. One skilled in the relevant art will recognize that an embodiment may be practiced without one or more of the specific features, aspects, or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments.
Drawings
Fig. 1 is a schematic composition diagram of an electronic device according to an embodiment of the present application;
fig. 2 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a speech recognition process provided in an embodiment of the present application;
FIG. 4 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 5 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 6 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an intent understanding model provided in accordance with an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an exemplary embodiment of a human-computer interaction device;
fig. 9 is a schematic composition diagram of a human-computer interaction device according to an embodiment of the present application.
Detailed Description
The terms "first," "second," and "third," etc. in the description and claims of this application and the above-described drawings are used for distinguishing between different objects and not for limiting a particular order.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
"A plurality" means two or more, and other quantifiers are interpreted similarly. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. Furthermore, elements that appear in the singular forms "a", "an", and "the" do not mean "one or only one" unless the context clearly indicates otherwise, but rather "one or more than one"; for example, "a device" means one or more such devices. Similarly, "at least one" (at least one of) means one or more of the listed items.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The electronic device in this embodiment is a device including a display screen and a camera. The embodiment of the present application does not particularly limit the specific form of the electronic device. For example, the electronic device may be a television, a tablet, a projector, a mobile phone, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, or an internet of things (IoT) device such as a smart speaker or a smart television.
Please refer to fig. 1, which is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 1, the electronic apparatus includes: processor 110, external memory interface 120, internal memory 121, Universal Serial Bus (USB) interface 130, power management module 140, antenna, wireless communication module 160, audio module 170, speaker 170A, speaker interface 170B, microphone 170C, sensor module 180, buttons 190, indicator 191, display screen 192, and camera 193, among others. The sensor module 180 may include a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, and the like.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic device. In other embodiments, an electronic device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, and/or a USB interface, etc.
In this embodiment, the processor 110 is configured to perform natural language understanding on the target command in combination with the historical command when performing natural language understanding on the target command, and obtain the intention and the slot of the target command. The historical command is a command of a historical human-computer interaction task, and the target command is a command of a current human-computer interaction task. Alternatively, the historical command may be a command of a plurality of historical human-machine interaction tasks. The plurality of historical human-computer interaction tasks may be tasks for which a plurality of historical users have performed multiple rounds of conversations with the electronic device. In turn, the processor 110 determines a target decision in conjunction with the decisions of the historical commands, as well as the intent and slot of the target command.
The intent and the slot together form a "user action". Since the machine cannot directly understand natural language, the user action serves to map the natural language into a structured semantic representation that the machine can understand. The slot has the capability of carrying state across multiple dialogue rounds. A slot corresponds to a slot position; for example, a taxi-taking scene includes a departure-location slot and a destination slot.
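As an illustrative sketch only (the data structure and field names below are hypothetical and not taken from the patent), such a user action can be represented as an intent plus a set of slots:
```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class UserAction:
    """Structured semantic representation of one user command (hypothetical layout)."""
    intent: str                          # the intent recognized from the command
    slots: Dict[str, str] = field(default_factory=dict)

# Hypothetical example for the taxi-taking scene mentioned above:
action = UserAction(
    intent="request_taxi",
    slots={"departure_location": "home", "destination": "airport"},
)
print(action)
```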
The power management module 140 is used for connecting a power source. The power management module 140 may also be connected to the processor 110, the internal memory 121, the display 192, the camera 193, the wireless communication module 160, and the like. The power management module 140 receives input of power to supply power to the processor 110, the internal memory 121, the display 192, the camera 193, the wireless communication module 160, and the like. In some embodiments, the power management module 140 may also be disposed in the processor 110.
The wireless communication function of the electronic device may be implemented by the antenna and the wireless communication module 160, etc. The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), Global Navigation Satellite Systems (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like.
The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via an antenna, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. Wireless communication module 160 may also receive signals to be transmitted from processor 110, frequency modulate them, amplify them, and convert them into electromagnetic waves via an antenna for radiation. In some embodiments, the antenna of the electronic device is coupled with the wireless communication module 160 so that the electronic device can communicate with the network and other devices through wireless communication techniques.
The electronic device implements display functions via the GPU, the display screen 192, and the application processor, etc. The GPU is a microprocessor for image processing, coupled to a display screen 192 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 192 is used to display images, video, and the like. The display screen 192 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
The electronic device may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 192, the application processor, and the like. The ISP is used to process the data fed back by the camera 193. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device may include 1 or N cameras 193, N being a positive integer greater than 1.
Alternatively, the electronic device may not include a camera, i.e., the camera 193 is not disposed in the electronic device (e.g., a television). The electronic device may be externally connected to the camera 193 through an interface (e.g., the USB interface 130). The external camera 193 may be fixed to the electronic device by an external fixing member (e.g., a camera holder with a clip). For example, the external camera 193 may be fixed to an edge, such as an upper edge, of the display screen 192 of the electronic device by an external fixing member.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used for performing fourier transform and the like on the frequency point energy. Video codecs are used to compress or decompress digital video. The electronic device may support one or more video codecs. In this way, the electronic device can play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area can store data (such as audio data and the like) created in the using process of the electronic device, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The electronic device may implement audio functions via the audio module 170, the speaker 170A, the microphone 170C, the speaker interface 170B, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110. The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. In the present application, speaker 170A is used to output decision-making speech. The microphone 170C, also referred to as a "microphone," is used to convert sound signals into electrical signals. In the present application, the microphone 170C is used to receive the voice of a target command or the voice of a history command uttered by a user.
The speaker interface 170B is used to connect to a wired speaker. The speaker interface 170B may be the USB interface 130, or may be a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The electronic device may receive a key input, and generate a key signal input related to user settings and function control of the electronic device.
The indicator 191 may be an indicator light, and may be used to indicate that the electronic device is in a power-on state, a standby state, a power-off state, or the like. For example, the indicator light is turned off, which may indicate that the electronic device is in a power-off state; the indicator light is green or blue, and can indicate that the electronic equipment is in a starting state; the indicator light is red and can indicate that the electronic equipment is in a standby state.
It is to be understood that the illustrated structure of the embodiments of the present application does not constitute a specific limitation to electronic devices. It may have more or fewer components than shown in fig. 1, may combine two or more components, or may have a different configuration of components. For example, the electronic device may further include a sound box or the like. The various components shown in fig. 1 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing or application specific integrated circuits.
Next, a man-machine interaction method provided in an embodiment of the present application is described in detail with reference to fig. 2.
S201, the electronic equipment receives a target command sent by a user.
The target command is a natural language text recognizable by the user. In some embodiments, a user may enter a target command into an electronic device through an input device (e.g., a virtual keyboard or a physical keyboard). In other embodiments, the user may speak speech to the electronic device. The electronic equipment performs voice recognition on the voice and converts the voice into a target command. Speech refers to the speech sound of a user who is in speech communication with an electronic device.
Alternatively, if the user is in a noisy environment, the electronic device may receive mixed speech that includes both the user's speech and the noise of the external environment. The electronic device may use the user's voiceprint features to separate the speech from the mixed speech. For example, fig. 3 is a schematic diagram of speech separation and recognition provided in an embodiment of the present application. The mixed speech is analyzed by short-time Fourier transform (STFT) to obtain a mixed speech spectrum; the mixed speech spectrum and the user voiceprint features registered in the system in advance are input into a pre-trained speech separation model, which separates the speech spectrum of the user from the mixed speech spectrum; automatic speech recognition is then performed on the separated speech spectrum of the target speech to obtain the target command. The speech separation model is trained on multi-user speech data collected in advance, and may be a multi-layer long short-term memory (LSTM) model.
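A minimal sketch of this separation pipeline is given below, assuming PyTorch; the model class, layer sizes, voiceprint dimension, and sampling parameters are illustrative assumptions, and the separated spectrum would still have to be passed to an ASR system to obtain the target command.
```python
import torch
import torch.nn as nn

class SpeechSeparator(nn.Module):
    """Illustrative multi-layer LSTM speech-separation model (shapes are assumptions)."""
    def __init__(self, n_freq=257, voiceprint_dim=256, hidden=600, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + voiceprint_dim, hidden,
                            num_layers=layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, mixed_spec, voiceprint):
        # mixed_spec: (batch, frames, n_freq); voiceprint: (batch, voiceprint_dim)
        vp = voiceprint.unsqueeze(1).expand(-1, mixed_spec.size(1), -1)
        out, _ = self.lstm(torch.cat([mixed_spec, vp], dim=-1))
        return torch.sigmoid(self.mask(out)) * mixed_spec  # masked spectrum of the target speaker

# STFT of the mixed waveform (placeholder audio, 16 kHz mono assumed):
waveform = torch.randn(1, 16000)
spec = torch.stft(waveform, n_fft=512, hop_length=160, return_complex=True).abs()
spec = spec.transpose(1, 2)                  # (batch, frames, n_freq)

separator = SpeechSeparator()
voiceprint = torch.randn(1, 256)             # pre-registered user voiceprint embedding
target_spec = separator(spec, voiceprint)    # then fed to automatic speech recognition
```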
S202, the electronic equipment generates a target decision of the target command by using the historical command, the historical decision of the historical command and the target command.
Fig. 4 is a schematic flowchart of another human-computer interaction method provided in this embodiment, where the method flow described in fig. 4 is an explanation of a specific operation process included in S202 in fig. 2, as shown in the figure. S2021, the electronic device carries out weighted coding on the historical command based on the command semantic related weight to obtain historical command coding information. S2022, the electronic equipment generates a target decision according to the target command, the historical command coding information and the historical decision of the historical command.
The command semantic correlation weight represents the degree of semantic relevance of the target command to the historical commands, where "related" means that the meanings are associated with each other. The historical commands related to the target command may be historical commands that have some relationship to the intent of the target command. For example, the target command is "temperature is somewhat cold" and the historical command is "it is so hot, turn the air conditioner on to 20 degrees"; both commands are related to adjusting the temperature of the air conditioner. However, the target command does not explicitly indicate adjusting the air-conditioner temperature, while the historical command adjusts the air conditioner to a specific temperature. Therefore, in this embodiment, when natural language understanding is performed on the target command sent by the user in the current human-computer interaction task, the semantics of the historical commands in historical human-computer interaction tasks are referred to, which assists the natural language understanding of the target command and makes its result closer to the real intention of the user.
In a possible implementation manner, the electronic device performs semantic coding on the target command to obtain a semantic vector of the target command, and performs similarity calculation according to the semantic vector of the target command and the semantic vector of the historical command to obtain command semantic correlation weight.
Specifically, the electronic device first performs Chinese word segmentation on the target command to obtain the word vectors of the target command. Chinese word segmentation refers to segmenting a continuous character sequence into individual words.
The electronic device inputs the word vectors of the target command into the semantic coding model for coding to obtain the semantic vector of the target command. The semantic coding model may be a recurrent neural network (RNN); the most frequently used RNN model is the bidirectional long short-term memory (BiLSTM) network. The BiLSTM may be implemented using a network comprising 3 hidden layers of 600 nodes each. For example, the target command is "temperature is somewhat cold"; Chinese word segmentation is performed on the target command to obtain a "temperature" word vector, a "somewhat" word vector, and a "cold" word vector. These word vectors are input into the semantic coding model, and the semantic vector of the target command "temperature is somewhat cold" is obtained through inference of the semantic coding model.
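A minimal sketch of such a BiLSTM semantic encoder is shown below, assuming PyTorch; the 3 hidden layers of 600 nodes follow the description above, while the vocabulary size, embedding dimension, and the mean-pooling used to obtain a single sentence vector are assumptions.
```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """BiLSTM semantic coding model: word vectors in, one semantic vector per command out."""
    def __init__(self, vocab_size=30000, embed_dim=300, hidden=600, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of the segmented words,
        # e.g. the "temperature", "somewhat" and "cold" word vectors above.
        out, _ = self.bilstm(self.embed(token_ids))
        return out.mean(dim=1)               # (batch, 2 * hidden) semantic vector u

encoder = SemanticEncoder()
u = encoder(torch.tensor([[11, 42, 7]]))     # hypothetical word indices
```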
It should be noted that the electronic device stores the semantic vector of the target command, so that the semantic vector of the target command is used as the semantic vector of the historical command to assist the electronic device in performing natural language understanding on the subsequent command.
The semantic vector of the historical commands may be a matrix of M columns, each column representing the semantic vector of commands of one historical human-machine interaction task. And multiplying each column in the matrix by the semantic vector of the target command to obtain command semantic related weight. The command semantic correlation weight satisfies formula (1).
p_m = u^T h_m        (1)
where u^T = [x_1, …, x_j] represents the semantic vector of the target command, h_m represents the semantic vector of the m-th historical command, and p_m represents the command semantic correlation weight.
Optionally, the electronic device may further perform weighted coding on the historical command based on the command semantic correlation weight and the user weight to obtain the historical command coding information. The user weight represents the degree of association between the user and the historical user who issued the historical command. In some embodiments, before the user interacts with the electronic device, the electronic device may prompt the user to provide a voiceprint, which the electronic device stores. The electronic device compares the voiceprint of the user with the voiceprints of historical users to obtain the degree of association between the user and each historical user, that is, the user weight. A historical user is a user who has previously performed human-computer interaction with the electronic device. For example, the electronic device obtains the similarity between the user and a historical user according to the voiceprint of the user and uses it as the user weight; the similarity may be the likelihood that the user is that historical user. Understandably, a large user weight indicates that the user is likely to be that historical user, so a high weight is set; a small user weight indicates that the user is unlikely to be that historical user, so a low weight is set.
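One simple way to realize this degree of association, sketched below under the assumption that voiceprints are stored as fixed-length embedding vectors, is cosine similarity between the current speaker's voiceprint and each stored historical voiceprint:
```python
import numpy as np

def user_weights(current_voiceprint, historical_voiceprints):
    """Cosine similarity between the current voiceprint and each historical user's voiceprint."""
    cur = current_voiceprint / np.linalg.norm(current_voiceprint)
    hist = historical_voiceprints / np.linalg.norm(historical_voiceprints, axis=1, keepdims=True)
    return hist @ cur                        # one user weight S_m per historical user

# Hypothetical 256-dimensional voiceprint embeddings for 5 historical users:
S = user_weights(np.random.rand(256), np.random.rand(5, 256))
```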
Optionally, the electronic device may further perform weighted encoding on the historical command based on the command semantic correlation weight, the user weight, and the user relationship correlation weight, so as to obtain historical command encoding information. The user relationship relevancy weight is a preset relationship strength value of a plurality of users. For example, the electronic device is a smart home, the user using the smart home is usually a fixed family member, and any one of the family members may set a strength of relationship value with other members. The strength of relationship value may include high, medium, low, no correlation, and the like.
Specifically, the electronic device performs weighted coding on the historical command based on the weighted information to obtain a semantic vector of the historical command after weighted coding, combines the semantic vector of the historical command after weighted coding and the semantic vector of the target command, and performs coding through a full-connection network to obtain the historical command coding information. The weighting information includes at least one of a command semantic relevance weight, a user weight, and a user relationship relevance weight. The semantic vector of the weighted encoded historical command satisfies formula (2).
h′ = Σ_m p_m · h_m · S_m        (2)
where h′ represents the semantic vector of the weighted-coded historical commands, p_m represents the command semantic correlation weight, h_m represents the semantic vector of the historical command, and S_m represents the user weight, the user relationship correlation weight, or the combination of the user weight and the user relationship correlation weight.
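A compact NumPy sketch of formulas (1) and (2) is given below, assuming the historical semantic vectors are stored as an M-column matrix as described above; the dimensions are illustrative only. The resulting h′ is then combined with the semantic vector of the target command and encoded by the fully-connected network to obtain the historical command coding information.
```python
import numpy as np

def weighted_history_encoding(u, H, S):
    """
    u: semantic vector of the target command, shape (d,)
    H: semantic vectors of M historical commands, shape (d, M), one column per command
    S: per-command weights S_m (user weight and/or user relationship weight), shape (M,)
    """
    p = H.T @ u                  # formula (1): p_m = u^T h_m
    h_prime = H @ (p * S)        # formula (2): h' = sum_m p_m h_m S_m
    return p, h_prime

# Hypothetical sizes: d = 1200 (BiLSTM output), M = 4 stored historical commands.
p, h_prime = weighted_history_encoding(np.random.rand(1200), np.random.rand(1200, 4), np.ones(4))
```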
Further, fig. 5 is a schematic flowchart of another human-computer interaction method provided in this embodiment, where the method flow described in fig. 5 is an explanation of a specific operation process included in S2022 in fig. 4, as shown in the figure. S20221, the electronic device utilizes the intention understanding model to conduct natural language understanding on the word vector and the historical command coding information of the target command, and the intention and the slot position of the target command are obtained. S20222, the electronic device generates a target decision according to the intention and the slot position of the target command and the historical decision coding vector of the historical command.
Specifically, as shown in fig. 6 (a), the electronic device performs Chinese word segmentation on the target command to obtain the word vectors of the target command (execute S601). For example, the target command is "temperature is somewhat cold", and the word vectors of the target command include a "temperature" word vector, a "somewhat" word vector, and a "cold" word vector. The electronic device encodes the word vectors of the target command by using the semantic coding model to obtain the semantic vector of the target command (execute S602). The electronic device performs similarity calculation between the semantic vector of the target command and the semantic vectors of the historical commands to obtain the command semantic correlation weights (execute S603). Suppose the historical command is "it is so hot, turn the air conditioner on to 20 degrees"; the command semantic correlation weight then includes the degree of semantic correlation between the target command "temperature is somewhat cold" and that historical command. It is worth noting that the command semantic correlation weights include the degree of semantic correlation between the target command and the commands of a plurality of historical human-computer interaction tasks. The semantic vectors of the historical commands are weighted and coded based on the first weighting information to obtain the historical command coding information (execute S604). The first weighting information includes the command semantic correlation weights; optionally, the first weighting information further includes the user weight and the user relationship correlation weight. The electronic device performs natural language understanding on the word vectors of the target command and the historical command coding information by using the intent understanding model to obtain the intention and the slot position of the target command (execute S605). For example, the target command is "temperature is somewhat cold"; the intent of the target command may be expressed as adjusting the temperature, and the slot positions of the target command may be "temperature", "somewhat", and "cold". The intent understanding model may be an RNN; in the specific case below, it is implemented using a BERT model based on the Transformer structure.
The intent understanding model may be trained using Bidirectional Encoder Representations from Transformers (BERT). BERT is a bidirectional Transformer-based model proposed by Google, obtained by pre-training on a large amount of unsupervised text corpora. The pre-training process involves two techniques: one randomly masks part of the characters in a training sentence and predicts the masked characters, and the other trains the model to understand the relationship between sentences by predicting the next sentence given the current text. After pre-training of the BERT model is completed, the intent understanding model contains a pre-trained deep structure for semantic analysis; the BERT model is then adjusted by secondary training (fine-tuning) on the intent understanding task of interest. It should be noted that, in order to introduce the historical semantic correlation information obtained above into the model, the first entry at the beginning of the BERT network uses the weighted semantic coding vector; as shown in fig. 7, the word vectors of the target command and the historical command coding information are input into the intent understanding model to obtain the intention and the slot position of the target command.
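The sketch below illustrates one way this could be realized with the Hugging Face transformers library (an assumption; the patent does not name a library): the historical command coding information is projected into BERT's embedding space and replaces the first input entry, and two linear heads predict the intent and the per-token slot labels. The model name, projection layer, and head sizes are hypothetical.
```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class IntentSlotModel(nn.Module):
    """BERT-based intent understanding model with a weighted semantic code as first entry."""
    def __init__(self, history_dim=1200, n_intents=20, n_slot_labels=30):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.bert.config.hidden_size
        self.history_proj = nn.Linear(history_dim, hidden)   # maps h' into BERT's embedding space
        self.intent_head = nn.Linear(hidden, n_intents)
        self.slot_head = nn.Linear(hidden, n_slot_labels)

    def forward(self, input_ids, attention_mask, history_code):
        embeds = self.bert.get_input_embeddings()(input_ids)      # (batch, seq, hidden)
        hist = self.history_proj(history_code).unsqueeze(1)       # (batch, 1, hidden)
        embeds = torch.cat([hist, embeds[:, 1:, :]], dim=1)       # first entry = weighted semantic code
        out = self.bert(inputs_embeds=embeds, attention_mask=attention_mask)
        intent_logits = self.intent_head(out.last_hidden_state[:, 0])
        slot_logits = self.slot_head(out.last_hidden_state)       # one slot label per token
        return intent_logits, slot_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
enc = tokenizer("温度有点冷", return_tensors="pt")                 # "temperature is somewhat cold"
model = IntentSlotModel()
intent_logits, slot_logits = model(enc["input_ids"], enc["attention_mask"], torch.rand(1, 1200))
```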
The decision model is a classification model whose inputs are the intent, the dialogue state, and system database information and whose output is a specific decision. As shown in fig. 6 (b), the electronic device encodes the intent of the target command, the dialogue state, and the information obtained from the system database to obtain the decision coding vector (execute S606). The coding network may be implemented using a multi-layer convolutional neural network (CNN). The electronic device performs similarity calculation between the decision coding vector and the historical decision coding vectors to obtain the historical decision coding vector weights (execute S607). A historical decision is a system action determined by the electronic device based on a historical command. For example, the historical command is "how is the weather today", and the system action is to output the weather (overcast) and the temperature. For another example, the historical command is "it is so hot, turn the air conditioner on to 20 degrees", and the system action is to adjust the air-conditioner temperature to 20 degrees. The historical decision coding vector weight includes the degree of correlation between the decision coding vector of the target command "temperature is somewhat cold" and the historical decision coding vector of the historical decision "turn the air conditioner on to 20 degrees". It is worth noting that the historical decision coding vector weights include the degree of correlation between the decision coding vector of the target command and the historical decision coding vectors of the decisions of a plurality of historical human-computer interaction tasks. The electronic device performs weighted coding on the historical decision coding vectors based on the second weighting information to obtain the historical decision coding information (execute S608). The second weighting information includes the historical decision coding vector weights; optionally, the second weighting information further includes the user weight and the user relationship correlation weight. The historical decision coding vector weight represents the degree of correlation between the decision coding vector and a historical decision coding vector. The electronic device analyzes the decision coding vector and the historical decision coding information by using the decision model to generate the target decision (execute S609). For example, the target command is "temperature is somewhat cold", and the target decision may be "turn the air conditioner on to 29 degrees". The decision model may be implemented using a shallow classifier, such as a support vector machine, or using a deep neural network (DNN), such as a multi-layer fully-connected forward network (FNN).
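The sketch below shows one possible shape of this decision path, assuming PyTorch: a small convolutional encoder produces the decision coding vector from the intent, dialogue-state, and database features, and a fully-connected forward network classifies the concatenation of that vector and the historical decision coding information (computed as in formulas (3) and (4) below) into a system action. All layer sizes and the feature layout are assumptions.
```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """Decision coding vector + historical decision coding information -> system action."""
    def __init__(self, feat_dim=64, code_dim=128, n_actions=50):
        super().__init__()
        # Encoder for intent, dialogue state, and system-database features (multi-layer CNN).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(16 * feat_dim, code_dim),
        )
        # Multi-layer fully-connected forward network (FNN) used as the decision classifier.
        self.classifier = nn.Sequential(
            nn.Linear(code_dim * 2, 256), nn.ReLU(), nn.Linear(256, n_actions),
        )

    def forward(self, features, history_decision_info):
        # features: (batch, feat_dim); history_decision_info: (batch, code_dim)
        w = self.encoder(features.unsqueeze(1))         # decision coding vector w
        logits = self.classifier(torch.cat([w, history_decision_info], dim=-1))
        return w, logits

model = DecisionModel()
w, logits = model(torch.rand(1, 64), torch.rand(1, 128))
target_decision = logits.argmax(dim=-1)                 # index of the selected system action
```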
It should be noted that the electronic device stores the decision encoding vector, and assists the electronic device in making a decision on a target command issued by a subsequent user.
The historical decision-making encoded vector may be a matrix of M columns, each column representing a historical decision-making encoded vector of decisions for a historical human-machine interaction task. And multiplying each column in the matrix by the decision coding vector to obtain the historical decision coding vector weight. The historical decision encoding vector weights satisfy equation (3).
q_m = w^T k_m        (3)
where w^T = [y_1, …, y_j] represents the decision coding vector, k_m represents the historical decision coding vector, and q_m represents the historical decision coding vector weight.
And the electronic equipment performs weighted coding on the historical decision coding vector based on the second weighted information to obtain a weighted coded historical decision coding vector, merges the weighted coded historical decision coding vector and the weighted coded decision coding vector, and performs coding through a full-connection network to obtain historical decision coding information. The weighted coded historical decision coding vector satisfies formula (4).
k′ = Σ_m q_m · k_m · S_m        (4)
where k′ represents the weighted-coded historical decision coding vector, k_m represents the historical decision coding vector, q_m represents the historical decision coding vector weight, and S_m represents the user weight, the user relationship correlation weight, or the combination of the user weight and the user relationship correlation weight.
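Formulas (3) and (4) mirror the command-side weighting of formulas (1) and (2); a compact NumPy sketch under the same one-column-per-entry matrix assumption:
```python
import numpy as np

def weighted_history_decisions(w, K, S):
    """
    w: decision coding vector of the target command, shape (d,)
    K: historical decision coding vectors, shape (d, M), one column per historical decision
    S: per-decision weights S_m (user weight and/or user relationship weight), shape (M,)
    """
    q = K.T @ w                  # formula (3): q_m = w^T k_m
    k_prime = K @ (q * S)        # formula (4): k' = sum_m q_m k_m S_m
    return q, k_prime

q, k_prime = weighted_history_decisions(np.random.rand(128), np.random.rand(128, 4), np.ones(4))
# k_prime is then merged with w and encoded by a fully-connected network
# to obtain the historical decision coding information used by the decision model.
```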
Optionally, the electronic device analyzes the decision coding vector, the historical decision coding information, and the user portrait of the user by using the decision model to determine the target decision. A user portrait, also called a user persona, is a virtual representation of a real user; it is an effective tool for characterizing users and their needs and for guiding design, and is widely used in many fields.
And S203, the electronic equipment outputs a target decision.
In order for the electronic device to communicate with the user, the electronic device may map the target decision into a natural language expression using natural language generation (NLG) technology, that is, generate a target decision text from the target decision. Natural language generation refers to converting a machine-readable decision into natural language text. The electronic device can display the target decision text on the display screen, so that the user can conveniently read the system dialogue sentence output by the electronic device. Optionally, the electronic device may further convert the target decision text into target decision speech and play it to the user in voice form.
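As a purely illustrative sketch of the natural language generation step (the template texts and decision fields are hypothetical), a simple template lookup can map a structured target decision to a target decision text:
```python
def decision_to_text(decision):
    """Map a structured target decision to a natural-language decision text (illustrative templates)."""
    templates = {
        "set_ac_temperature": "OK, the air conditioner has been set to {value} degrees.",
        "report_weather": "The weather today is {value}.",
    }
    return templates[decision["action"]].format(**decision)

print(decision_to_text({"action": "set_ac_temperature", "value": 29}))
```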
Therefore, when natural language understanding is carried out on a target command sent by a user in the current human-computer interaction task, the semantics of the historical command in the historical human-computer interaction task are referred to, the natural language understanding of the target command is assisted, and the natural language understanding result is more appropriate to the real intention of the user; the historical decision is referred to when the system decision is executed, the target decision can be optimized according to the historical decision, and the user experience degree of man-machine interaction is effectively improved.
It is to be understood that, in order to implement the functions of the above-described embodiments, the electronic device includes a corresponding hardware structure and/or software module for performing each function. Those of skill in the art will readily appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software driven hardware depends on the particular application scenario and design constraints imposed on the solution.
Fig. 8 is a schematic structural diagram of a possible human-computer interaction device provided in an embodiment of the present application. The man-machine interaction devices can be used for realizing the functions of the electronic equipment in the method embodiment, so that the beneficial effects of the method embodiment can be realized. In the embodiment of the present application, the human-computer interaction device may be an electronic device as shown in fig. 1, and may also be a module (e.g., a chip) applied to the electronic device.
As shown in fig. 8, the human-computer interaction device 800 includes an acquisition unit 810, a processing unit 820, and a feedback unit 830. The human-computer interaction device 800 is used for implementing the functions of the electronic device in the method embodiments shown in fig. 2, fig. 4, fig. 5 or fig. 6.
When the human-computer interaction device 800 is used to implement the functions of the electronic device in the method embodiment shown in fig. 2: the obtaining unit 810 is configured to perform S201, the processing unit 820 is configured to perform S202, and the feedback unit 830 is configured to perform S203.
When the human-computer interaction device 800 is used to implement the functions of the electronic device in the method embodiment shown in fig. 4: the obtaining unit 810 is configured to perform S201, the processing unit 820 is configured to perform S2021 and S2022, and the feedback unit 830 is configured to perform S203.
When the human-computer interaction device 800 is used to implement the functions of the electronic device in the method embodiment shown in fig. 5: the obtaining unit 810 is configured to perform S201, the processing unit 820 is configured to perform S2021, S20221, and S20222, and the feedback unit 830 is configured to perform S203.
When the human-computer interaction device 800 is used to implement the functions of the electronic device in the method embodiment shown in fig. 6: the processing unit 820 is configured to perform S601 to S609.
For more detailed descriptions of the obtaining unit 810, the processing unit 820, and the feedback unit 830, refer directly to the related descriptions in the method embodiments shown in fig. 2, fig. 4, fig. 5, or fig. 6; they are not repeated here. The functions of the obtaining unit 810, the processing unit 820, and the feedback unit 830 may be implemented by the processor 110 in fig. 1.
Alternatively, as shown in fig. 9, the human-computer interaction device 900 may include a voice recognition unit 910, a language understanding unit 920, a dialog management unit 930, a language generation unit 940, and a voice synthesis unit 950. The voice recognition unit 910 is configured to implement the function of the obtaining unit 810; for example, the voice recognition unit 910 recognizes the voice uttered by the user to obtain the target command. The language understanding unit 920 and the dialog management unit 930 are configured to implement the functions of the processing unit 820 to obtain the target decision. For example, the language understanding unit 920 performs natural language understanding on the word vector of the target command and the historical command encoding information by using the intention understanding model to obtain the intention and slot of the target command, and the dialog management unit 930 generates the target decision according to the intention and slot of the target command and the historical decision encoding vector of the historical command. The language generation unit 940 and the voice synthesis unit 950 are configured to implement the function of the feedback unit 830; for example, the language generation unit 940 converts the target decision into natural language text, and the voice synthesis unit 950 synthesizes that text into speech to feed the decision back to the user.
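The division of labor among these units can be sketched as a simple pipeline. The example below uses stubbed components (keyword-based understanding, a print-based synthesizer, and text standing in for audio), so it only mirrors the structure of fig. 9 rather than the actual models of the embodiments; all class and method names are assumptions for illustration.

```python
# Stubbed pipeline sketch mirroring fig. 9 (illustrative only):
# speech recognition -> language understanding -> dialog management ->
# language generation -> speech synthesis.
class HumanComputerInteraction:
    def recognize(self, audio: str) -> str:                      # voice recognition unit 910 (stub: audio is text)
        return audio

    def understand(self, command: str, history: list) -> dict:   # language understanding unit 920
        intent = "play_music" if "play" in command else "unknown"
        slots = {"song": command.split("play", 1)[1].strip()} if intent == "play_music" else {}
        return {"intent": intent, "slots": slots, "history": history}

    def decide(self, nlu: dict) -> dict:                          # dialog management unit 930
        if nlu["intent"] == "play_music":
            return {"action": "play_music", "slots": nlu["slots"]}
        return {"action": "ask_clarification", "slots": {}}

    def generate(self, decision: dict) -> str:                    # language generation unit 940
        if decision["action"] == "play_music":
            return f"Playing {decision['slots']['song']}."
        return "Sorry, what would you like me to do?"

    def synthesize(self, text: str) -> None:                      # voice synthesis unit 950 (stub: print)
        print(f"[TTS] {text}")

    def run(self, audio: str, history: list) -> None:
        command = self.recognize(audio)
        decision = self.decide(self.understand(command, history))
        self.synthesize(self.generate(decision))

HumanComputerInteraction().run("play Clair de Lune", history=["turn on the lights"])
```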
It can be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integrated into the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a network device or an electronic device. Alternatively, the processor and the storage medium may reside as discrete components in a network device or an electronic device.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium such as a floppy disk, a hard disk, or a magnetic tape; an optical medium such as a digital video disc (DVD); or a semiconductor medium such as a solid-state drive (SSD).
In the embodiments of the present application, unless otherwise specified or in case of a logical conflict, the terms and descriptions in different embodiments are consistent and may be cited by one another, and the technical features in different embodiments may be combined to form new embodiments according to their inherent logical relationships.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. In the text of the present application, the character "/" generally indicates an "or" relationship between the associated objects; in formulas of the present application, the character "/" indicates a "division" relationship between the associated objects.
It can be understood that the various numerical references in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments. The sequence numbers of the above processes do not imply an execution order; the execution order of the processes should be determined by their functions and inherent logic.

Claims (13)

1. A human-computer interaction method, comprising:
receiving a target command sent by a user;
generating a target decision of the target command by using a historical command, a historical decision of the historical command and the target command, wherein the historical command is a command of a historical human-computer interaction task, and the target command is a command of a current human-computer interaction task;
and outputting the target decision.
2. The method of claim 1, wherein generating the target decision for the target command using a historical command, a historical decision for the historical command, and the target command comprises:
carrying out weighted coding on the historical command based on a command semantic related weight to obtain historical command coding information, wherein the command semantic related weight represents the degree of semantic relevance between the target command and the historical command;
and generating the target decision according to the target command, the historical command coding information and the historical decision of the historical command.
3. The method of claim 2, wherein before the weighted coding is performed on the historical command based on the command semantic related weight, the method further comprises:
semantic coding is carried out on the target command to obtain a semantic vector of the target command;
and performing similarity calculation according to the semantic vector of the target command and the semantic vector of the historical command to obtain the command semantic related weight.
4. The method of claim 2 or 3, wherein performing weighted coding on the historical command based on the command semantic related weight to obtain historical command coding information comprises:
and carrying out weighted coding on the historical command based on the command semantic related weight and a user weight to obtain the historical command coding information, wherein the user weight represents the degree of association between the user and a historical user who sent the historical command.
5. The method of claim 4, wherein before the weighted coding is performed on the historical command based on the command semantic related weight and the user weight, the method further comprises:
and acquiring the association degree of the user and the historical user according to the voiceprint of the user to obtain the user weight.
6. The method according to claim 4 or 5, wherein performing weighted coding on the historical command based on the command semantic related weight and the user weight to obtain the historical command coding information comprises:
and carrying out weighted coding on the historical command based on the command semantic related weight, the user weight, and a user relationship correlation weight to obtain the historical command coding information, wherein the user relationship correlation weight is a preset relationship strength value among a plurality of users.
7. The method of any one of claims 2 to 6, wherein generating the target decision according to the target command, the historical command coding information, and the historical decision of the historical command comprises:
performing natural language understanding on the word vector of the target command and the historical command coding information by using an intention understanding model to obtain the intention and the slot position of the target command;
and generating the target decision according to the intention and the slot position of the target command and the historical decision coding vector of the historical command.
8. The method of claim 7, wherein generating the target decision according to the intention and the slot position of the target command and the historical decision coding vector of the historical command comprises:
coding the intention and the slot position of the target command to obtain a decision coding vector;
carrying out weighted coding on the historical decision coding vector based on a historical decision coding vector weight to obtain historical decision coding information, wherein the historical decision coding vector weight represents the degree of correlation between the decision coding vector and the historical decision coding vector;
and analyzing the decision coding vector and the historical decision coding information by using a decision model to generate the target decision.
9. The method of claim 8, wherein before the weighted coding is performed on the historical decision coding vector based on the historical decision coding vector weight to obtain the historical decision coding information, the method further comprises:
and performing similarity calculation on the decision coding vector and the historical decision coding vector to obtain the weight of the historical decision coding vector.
10. The method according to claim 8 or 9, wherein the weighted coding of the historical decision coding vector based on the historical decision coding vector weight to obtain the historical decision coding information comprises:
carrying out weighted coding on the historical decision coding vector based on the historical decision coding vector weight and the user weight to obtain historical decision coding information;
or, carrying out weighted coding on the historical decision coding vector based on the historical decision coding vector weight, the user weight, and the user relationship correlation weight to obtain the historical decision coding information.
11. A human-computer interaction device, comprising:
the acquisition unit is used for receiving a target command sent by a user;
the processing unit is used for generating a target decision of the target command by utilizing a historical command, a historical decision of the historical command and the target command, wherein the historical command is a command of a historical human-computer interaction task, and the target command is a command of a current human-computer interaction task;
and the feedback unit is used for outputting the target decision.
12. An electronic device, comprising: at least one processor, a memory, and a voice transceiver, wherein the voice transceiver is configured to receive the voice of a target command or to feed back the voice of a target decision, the memory is configured to store a computer program and instructions, and the processor is configured to invoke the computer program and instructions to perform, in cooperation with the voice transceiver, the human-computer interaction method according to any one of claims 1 to 10.
13. A computer-readable storage medium, wherein a computer program or instructions are stored therein which, when executed by a human-computer interaction device, implement the method according to any one of claims 1 to 10.
CN202010886462.3A 2020-08-28 2020-08-28 Man-machine interaction method and device Pending CN112183105A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010886462.3A CN112183105A (en) 2020-08-28 2020-08-28 Man-machine interaction method and device
PCT/CN2021/114853 WO2022042664A1 (en) 2020-08-28 2021-08-26 Human-computer interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010886462.3A CN112183105A (en) 2020-08-28 2020-08-28 Man-machine interaction method and device

Publications (1)

Publication Number Publication Date
CN112183105A (en) 2021-01-05

Family

ID=73924596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010886462.3A Pending CN112183105A (en) 2020-08-28 2020-08-28 Man-machine interaction method and device

Country Status (2)

Country Link
CN (1) CN112183105A (en)
WO (1) WO2022042664A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975654B (en) * 2023-08-22 2024-01-05 腾讯科技(深圳)有限公司 Object interaction method and device, electronic equipment and storage medium
CN117649107B (en) * 2024-01-29 2024-05-14 上海朋熙半导体有限公司 Automatic decision node creation method, device, system and readable medium
CN118070811B (en) * 2024-04-16 2024-07-02 江苏微皓智能科技有限公司 Information interaction method, device, equipment and medium based on natural language understanding

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810371B2 (en) * 2017-04-06 2020-10-20 AIBrain Corporation Adaptive, interactive, and cognitive reasoner of an autonomous robotic system
CN110413752B (en) * 2019-07-22 2021-11-16 中国科学院自动化研究所 Multi-turn spoken language understanding method, system and device based on conversation logic
CN110704588B (en) * 2019-09-04 2023-05-30 平安科技(深圳)有限公司 Multi-round dialogue semantic analysis method and system based on long-short-term memory network
CN110781998A (en) * 2019-09-12 2020-02-11 腾讯科技(深圳)有限公司 Recommendation processing method and device based on artificial intelligence
CN112183105A (en) * 2020-08-28 2021-01-05 华为技术有限公司 Man-machine interaction method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032463B1 (en) * 2015-12-29 2018-07-24 Amazon Technologies, Inc. Speech processing with learned representation of user interaction history
CN110825857A (en) * 2019-09-24 2020-02-21 平安科技(深圳)有限公司 Multi-turn question and answer identification method and device, computer equipment and storage medium
CN110704703A (en) * 2019-09-27 2020-01-17 北京百度网讯科技有限公司 Man-machine conversation method and device
CN111444722A (en) * 2020-03-06 2020-07-24 中国平安人寿保险股份有限公司 Intent classification method, device, equipment and storage medium based on voting decision
CN111428483A (en) * 2020-03-31 2020-07-17 华为技术有限公司 Voice interaction method and device and terminal equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022042664A1 (en) * 2020-08-28 2022-03-03 华为技术有限公司 Human-computer interaction method and device
CN113345174A (en) * 2021-05-31 2021-09-03 中国工商银行股份有限公司 Interactive simulation method and device for teller cash recycling machine and terminal platform
CN117557674A (en) * 2024-01-11 2024-02-13 宁波特斯联信息科技有限公司 Picture processing method, device, equipment and storage medium based on man-machine interaction
CN117557674B (en) * 2024-01-11 2024-04-26 宁波特斯联信息科技有限公司 Picture processing method, device, equipment and storage medium based on man-machine interaction

Also Published As

Publication number Publication date
WO2022042664A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
CN112183105A (en) Man-machine interaction method and device
US11430438B2 (en) Electronic device providing response corresponding to user conversation style and emotion and method of operating same
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
US20180052831A1 (en) Language translation device and language translation method
US11302331B2 (en) Method and device for speech recognition
WO2021008538A1 (en) Voice interaction method and related device
US11676571B2 (en) Synthesized speech generation
CN107221330A (en) Punctuate adding method and device, the device added for punctuate
EP4336490A1 (en) Voice processing method and related device
WO2017195775A1 (en) Sign language conversation assistance system
US20210097990A1 (en) Speech processing method and apparatus therefor
CN112885328A (en) Text data processing method and device
US11532310B2 (en) System and method for recognizing user's speech
CN113948060A (en) Network training method, data processing method and related equipment
CN113362813A (en) Voice recognition method and device and electronic equipment
US11373656B2 (en) Speech processing method and apparatus therefor
CN110874402B (en) Reply generation method, device and computer readable medium based on personalized information
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN114678032A (en) Training method, voice conversion method and device and electronic equipment
CN114120979A (en) Optimization method, training method, device and medium of voice recognition model
CN116682432B (en) Speech recognition method, electronic device and readable medium
CN116127966A (en) Text processing method, language model training method and electronic equipment
KR20210079061A (en) Information processing method and apparatus therefor
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination