WO2022042664A1 - Human-computer interaction method and device - Google Patents


Info

Publication number
WO2022042664A1
Authority: WIPO (PCT)
Prior art keywords: historical, command, decision, target, weight
Application number: PCT/CN2021/114853
Other languages: French (fr), Chinese (zh)
Inventors: 王仁宇 (Wang Renyu), 杨宇庭 (Yang Yuting), 钱莉 (Qian Li), 黄雪妍 (Huang Xueyan)
Original Assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022042664A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems

Definitions

  • the present application relates to the field of artificial intelligence (Artificial Intelligence, AI), and in particular, to a human-computer interaction method and device.
  • electronic devices can use Human-Computer Interaction Techniques (HCI) to communicate with users, so that electronic devices can understand the user's intention and complete the work of the user's intention.
  • human-computer interaction has a wide range of applications in many fields, such as smart home, automatic driving and so on.
  • the interaction of electronic devices with users is not very “natural” and “smart”.
  • the intention obtained by the electronic device through natural language understanding of the received voice command is not very close to the real intention of the user.
  • the decision-making of electronic devices is mechanically rigid and cannot give users optimal decisions.
  • as a result, the user experience of human-computer interaction is poor.
  • the present application provides a human-computer interaction method and device.
  • When performing natural language understanding on a target command issued by a user in the current human-computer interaction task, the semantics of historical commands in historical human-computer interaction tasks are referenced to assist in understanding the target command, making the result of natural language understanding closer to the user's real intention.
  • Historical decisions are referenced when executing system decisions, and the target decision can be optimized according to the historical decisions, effectively improving the user experience of human-computer interaction.
  • the present application provides a human-computer interaction method. The method can be applied to an electronic device, or to a human-computer interaction device that can support an electronic device in implementing the method (for example, the human-computer interaction device includes a chip system).
  • the method includes: after the electronic device receives the target command issued by the user, generating the target decision of the target command by using the historical command, the historical decision of the historical command, and the target command, and outputting the target decision.
  • the historical command is the command of the historical human-computer interaction task
  • the target command is the command of the current human-computer interaction task.
  • the historical command may be one or more historical commands for the user to perform multiple rounds of human-computer interaction tasks with the electronic device.
  • the historical commands may be commands for multiple historical users to perform multiple rounds of human-computer interaction tasks with the electronic device.
  • the historical users may also include the user who issued the target command.
  • the semantics of the historical command in the historical human-computer interaction task are referenced, which assists the natural language understanding of the target command and makes the result of natural language understanding closer to the user's real intention.
  • the historical decision-making is referred to when executing the system decision, and the target decision can be optimized according to the historical decision-making, which effectively improves the user experience of human-computer interaction.
  • generating the target decision of the target command by using the historical command, the historical decision of the historical command, and the target command includes: the electronic device performs weighted coding on the historical command based on the command semantic correlation weight to obtain the coding information of the historical command, where the command semantic correlation weight indicates the degree of semantic correlation between the target command and the historical command; and the target decision is generated according to the target command, the coding information of the historical command, and the historical decision of the historical command.
  • before performing weighted coding on historical commands based on the command semantic correlation weight, the electronic device performs semantic coding on the target command to obtain the semantic vector of the target command; the command semantic correlation weight is then obtained by calculating the similarity between the semantic vector of the target command and the semantic vectors of the historical commands.
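The similarity calculation above can be sketched as follows. The patent only says "similarity calculation"; cosine similarity followed by softmax normalization is one illustrative choice, and all names here are hypothetical.

```python
import numpy as np

def command_semantic_weights(target_vec, history_vecs):
    """Command semantic correlation weights: similarity between the target
    command's semantic vector and each historical command's semantic vector.
    Cosine similarity + softmax are illustrative, not mandated by the patent."""
    target = target_vec / np.linalg.norm(target_vec)
    hist = history_vecs / np.linalg.norm(history_vecs, axis=1, keepdims=True)
    sims = hist @ target                     # one similarity per historical command
    exp = np.exp(sims - sims.max())          # numerically stable softmax
    return exp / exp.sum()

# A historical command pointing the same way as the target gets more weight.
w = command_semantic_weights(
    np.array([1.0, 0.0]),
    np.array([[0.9, 0.1],    # semantically close historical command
              [0.0, 1.0]]),  # unrelated historical command
)
```

The resulting weights sum to one, so they can be used directly for the weighted coding of historical commands described above.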
  • weighted encoding is performed on historical commands based on the command semantic correlation weights to obtain historical command encoding information, including: weighted encoding of the historical commands based on the command semantic correlation weights and the user weights, where the user weight represents the degree of association between the current user and the historical user who issued the historical command.
  • the electronic device may obtain the user weight according to the user's voiceprint to obtain the degree of association between the user and the historical user.
  • weighted coding is performed on historical commands based on the command semantic correlation weights and the user weights to obtain historical command coding information, including: weighted coding of the historical commands based on the command semantic correlation weights, the user weights, and the user relationship correlation weights, where the user relationship correlation weight is a preset relationship strength value among multiple users.
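One plausible way to combine the three weights is multiplicatively, renormalizing and then taking a weighted sum of the historical command vectors; the patent does not fix the exact formula, so this is a sketch under that assumption.

```python
import numpy as np

def weighted_history_encoding(history_vecs, semantic_w, user_w, relation_w):
    """Weighted coding of historical commands: combine the command semantic
    correlation weight, the user weight (voiceprint-based speaker association),
    and the preset user relationship strength, then renormalize."""
    w = np.asarray(semantic_w) * np.asarray(user_w) * np.asarray(relation_w)
    w = w / w.sum()
    return w @ np.asarray(history_vecs, dtype=float)  # weighted sum = coding info

enc = weighted_history_encoding(
    history_vecs=[[1.0, 0.0], [0.0, 1.0]],
    semantic_w=[0.8, 0.2],   # semantic correlation with the target command
    user_w=[1.0, 0.5],       # association with the current speaker
    relation_w=[1.0, 1.0],   # preset relationship strength between users
)
```

A historical command that is both semantically close and issued by a strongly associated user dominates the encoding.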
  • generating the target decision according to the target command, the historical command coding information, and the historical decision of the historical command includes: using an intent understanding model to perform natural language understanding on the word vector of the target command and the historical command coding information to obtain the intent and slot of the target command; and generating the target decision according to the intent and slot of the target command and the historical decision coding vector of the historical command.
  • generating the target decision according to the intent and slot of the target command and the historical decision coding vector of the historical command includes: encoding the intent and slot of the target command to obtain a decision coding vector; performing weighted encoding on the historical decision coding vectors based on the historical decision coding vector weights to obtain the historical decision coding information, where the historical decision coding vector weight represents the degree of correlation between the decision coding vector and the historical decision coding vector; and analyzing the decision coding vector and the historical decision coding information with a decision model to generate the target decision.
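The decision-level weighting works like an attention step over past decisions. The sketch below uses dot-product similarity with softmax as the correlation measure (an assumption; the patent leaves the similarity unspecified) and concatenation as one simple way to hand both vectors to the decision model.

```python
import numpy as np

def historical_decision_context(decision_vec, hist_decision_vecs):
    """Historical decision coding information: each historical decision coding
    vector is weighted by its correlation with the current decision coding
    vector (dot product + softmax here, as one illustrative similarity)."""
    sims = hist_decision_vecs @ decision_vec
    w = np.exp(sims - sims.max())
    w = w / w.sum()
    return w @ hist_decision_vecs

# The decision model would then consume both the decision coding vector and
# this context, e.g. their concatenation.
dec = np.array([1.0, 0.0])
ctx = historical_decision_context(dec, np.array([[1.0, 0.0], [0.0, 1.0]]))
model_input = np.concatenate([dec, ctx])
```

Historical decisions similar to the current one contribute more to the context, which is what lets the decision model "optimize" the target decision against past behavior.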
  • the historical decision of the historical command is referenced, and the decision content can be optimized according to it, which enriches the information available for the electronic device's decision-making and effectively improves the user experience of human-computer interaction.
  • the electronic device performs similarity calculation on the decision coding vector and the historical decision coding vector to obtain the historical decision coding vector weight.
  • weighted encoding is performed on the historical decision encoding vectors based on the historical decision encoding vector weights to obtain historical decision encoding information, including: weighted encoding of the historical decision encoding vectors based on the historical decision encoding vector weights and the user weights to obtain the historical decision encoding information; or weighted encoding of the historical decision encoding vectors based on the historical decision encoding vector weights, the user weights, and the user relationship correlation weights to obtain the historical decision encoding information.
  • the electronic device encodes the intent and slot of the target command to obtain a decision encoding vector, including: the electronic device encodes the intent and slot of the target command together with the occupancy state of the electronic device to obtain the decision coding vector.
  • the semantic vector of historical commands is enhanced by the occupancy status of electronic devices, thereby further improving the accuracy of system decision-making.
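One simple way to fold the device's occupancy status into the decision coding vector is concatenation with a one-hot occupancy encoding; the patent does not specify the encoding scheme, so the names and layout here are hypothetical.

```python
import numpy as np

def decision_coding_vector(intent_slot_vec, occupancy_state):
    """Decision coding vector augmented with device state: concatenate the
    encoded intent/slot vector with a one-hot occupancy encoding
    (illustrative; the patent leaves the scheme open)."""
    return np.concatenate([np.asarray(intent_slot_vec, dtype=float),
                           np.asarray(occupancy_state, dtype=float)])

vec = decision_coding_vector([0.2, 0.7], [1.0, 0.0])  # device currently busy
```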
  • the present application provides a human-computer interaction device applied to an electronic device; the electronic device includes a voice transceiver, which is used to receive the target command issued by the user and to feed the decision back to the user by voice.
  • the human-computer interaction device includes: an acquisition unit, a processing unit and a feedback unit.
  • the acquisition unit is used to receive the target command issued by the user;
  • the processing unit is used to generate the target decision of the target command by using the historical command, the historical decision of the historical command and the target command, the historical command is the command of the historical human-computer interaction task, and the target command is Command for the current human-computer interaction task; feedback unit for outputting target decisions.
  • the semantics of the historical command in the historical human-computer interaction task are referenced, which assists the natural language understanding of the target command and makes the result closer to the user's real intention; the historical decision is referenced when executing the system decision, and the target decision can be optimized according to the historical decision, which effectively improves the user experience of human-computer interaction.
  • These units may perform the corresponding functions in the method examples of the first aspect. For details, refer to the detailed descriptions in the method examples, which will not be repeated here.
  • the present application provides an electronic device comprising at least one processor, a memory, and a voice transceiver, wherein the voice transceiver is used to receive a target command issued by a user and feed the decision back to the user by voice, the memory is used for storing computer programs and instructions, and the processor is used for invoking the computer programs and instructions and cooperating with the voice transceiver to execute the human-computer interaction method according to the first aspect or any possible implementation of the first aspect.
  • the present application provides a computer-readable storage medium comprising computer software instructions; when the computer software instructions are executed in an electronic device, the electronic device is caused to perform the human-computer interaction method according to the first aspect or any possible implementation of the first aspect.
  • the present application provides a computer program product that, when run on a computer, causes the computer to execute the human-computer interaction method according to the first aspect or any possible implementation of the first aspect.
  • the present application provides a chip system applied to an electronic device; the chip system includes an interface circuit and a processor, interconnected by a line; the interface circuit is used for receiving signals from a memory of the electronic device and sending them to the processor, where the signals include computer instructions stored in the memory; when the processor executes the computer instructions, the chip system executes the human-computer interaction method according to the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of the composition of an electronic device provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a human-computer interaction method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a speech recognition process provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a human-computer interaction method provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a human-computer interaction method provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a human-computer interaction method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an intent understanding model provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the composition of a human-computer interaction device provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the composition of a human-computer interaction device according to an embodiment of the present application.
  • words such as “exemplary” or “for example” are used to represent examples, illustrations, or explanations. Any embodiment or design described in the embodiments of the present application as “exemplary” or “for example” should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “for example” is intended to present the related concepts in a specific manner.
  • Multiple refers to two or more than two, and other quantifiers are similar.
  • “And/or” describes the association relationship between related objects, indicating that three relationships may exist; for example, A and/or B can mean: A alone, both A and B, or B alone.
  • the occurrence of an element in the singular forms “a”, “an” and “the” does not mean “one and only one” unless the context clearly dictates otherwise, but rather “one or more than one”.
  • for example, “a device” means one or more such devices.
  • “at least one of ...” means one or any combination of the subsequent associated objects; for example, “at least one of A, B, and C” includes A, B, C, AB, AC, BC, or ABC.
  • the electronic device in this embodiment is a device including a display screen and a camera.
  • the specific form of the electronic device is not particularly limited in the embodiments of the present application.
  • electronic devices may be televisions, tablets, projectors, cell phones, desktop computers, laptop computers, handheld computers, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), augmented reality (AR) devices, virtual reality (VR) devices, smart speakers, smart TVs, and other Internet of Things (IoT) devices.
  • FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device includes: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a power management module 140, an antenna, a wireless communication module 160, an audio module 170, a speaker 170A, a speaker interface 170B, a microphone 170C, a sensor module 180, buttons 190, an indicator 191, a display screen 192, a camera 193, and so on.
  • the aforementioned sensor module 180 may include sensors such as a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, and an ambient light sensor.
  • the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device.
  • the electronic device may include more or fewer components than shown, or some components may be combined, or some components may be split, or a different arrangement of components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices, or may be integrated in one or more processors.
  • a controller can be the nerve center and command center of an electronic device.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs the instructions or data again, they can be called directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thereby improves system efficiency.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, and/or a USB interface, etc.
  • the processor 110 when the processor 110 is configured to perform natural language understanding on the target command, the natural language understanding of the target command is performed in combination with the historical command to obtain the intent and slot of the target command.
  • the historical command is the command of the historical human-computer interaction task
  • the target command is the command of the current human-computer interaction task.
  • the historical commands may be commands of multiple historical human-computer interaction tasks.
  • the multiple historical human-computer interaction tasks may be tasks in which multiple historical users have conducted multiple rounds of conversations with the electronic device.
  • the processor 110 determines the target decision in combination with the decision of the historical command and the intent and slot of the target command.
  • the slots may include one or more slot positions; for example, in a taxi-hailing scene, the slots include a departure slot and a destination slot.
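The intent/slot result for such a command might look like the structure below; the intent name, slot names, and values are purely illustrative.

```python
# Hypothetical intent/slot output for a taxi-hailing command such as
# "Book a taxi from the airport to the Grand Hotel".
parsed = {
    "intent": "book_taxi",
    "slots": {
        "departure": "airport",      # departure slot
        "destination": "Grand Hotel" # destination slot
    },
}
```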
  • the power management module 140 is used to connect power.
  • the power management module 140 may also be connected with the processor 110 , the internal memory 121 , the display screen 192 , the camera 193 , the wireless communication module 160 and the like.
  • the power management module 140 receives power input, and supplies power to the processor 110 , the internal memory 121 , the display screen 192 , the camera 193 , the wireless communication module 160 , and the like.
  • the power management module 140 may also be provided in the processor 110 .
  • the wireless communication function of the electronic device can be implemented by the antenna and the wireless communication module 160 and the like.
  • the wireless communication module 160 can provide solutions for wireless communication applied on the electronic device, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna.
  • the antenna of the electronic device is coupled to the wireless communication module 160 so that the electronic device can communicate with the network and other devices through wireless communication techniques.
  • the electronic device realizes the display function through the GPU, the display screen 192, and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 192 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 192 is used to display images, videos, and the like.
  • the display screen 192 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the electronic device can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 192 and the application processor.
  • the ISP is used to process the data fed back by the camera 193 .
  • the ISP may be provided in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • an object is projected through the lens to generate an optical image on the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
  • the electronic device may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • the electronic device may not include a camera, that is, the above-mentioned camera 193 is not provided in the electronic device (eg, a television).
  • the electronic device can connect to the camera 193 through an interface (eg, the USB interface 130 ).
  • the external camera 193 can be fixed on the electronic device by an external fixing member (such as a camera bracket with a clip).
  • the external camera 193 can be fixed at the edge of the display screen 192 of the electronic device, such as the upper edge, by means of an external fixing member.
  • a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals. For example, when the electronic device selects the frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy, etc.
  • Video codecs are used to compress or decompress digital video.
  • An electronic device may support one or more video codecs. In this way, the electronic device can play or record videos in various encoding formats, such as: moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device by executing the instructions stored in the internal memory 121 .
  • the internal memory 121 may include a storage program area and a storage data area.
  • the program storage area can store the operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area can store data (such as audio data, etc.) created during the use of the electronic device.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the electronic device may implement audio functions through an audio module 170, a speaker 170A, a microphone 170C, a speaker interface 170B, and an application processor. For example, music playback, recording, etc.
  • the audio module 170 is used for converting digital audio information into an analog audio signal output, and also for converting an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A, also referred to as a “loudspeaker”, is used to convert audio electrical signals into sound signals. In this application, the speaker 170A is used to output the voice of the decision.
  • the microphone 170C, also called a “mic”, is used to receive the voice of the target command or the voice of a historical command issued by the user.
  • the speaker interface 170B is used to connect a wired speaker.
  • the speaker interface 170B can be a USB interface 130, or a 3.5mm open mobile terminal platform (OMTP) standard interface, a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the keys 190 include a power key, volume keys, and the like; the keys 190 may be mechanical keys or touch keys.
  • the electronic device may receive key input and generate key signal input related to user settings and function control of the electronic device.
  • the indicator 191 may be an indicator light, which may be used to indicate that the electronic device is in a power-on state, a standby state, or a power-off state, or the like. For example, if the indicator light is off, it can indicate that the electronic device is in a shutdown state; if the indicator light is green or blue, it can indicate that the electronic device is in a power-on state; if the indicator light is red, it can indicate that the electronic device is in a standby state.
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device. It may have more or fewer components than shown in FIG. 1 , may combine two or more components, or may have a different configuration of components.
  • the electronic device may also include components such as speakers.
  • the various components shown in Figure 1 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing or application specific integrated circuits.
  • the electronic device receives a target command sent by a user.
  • the target command is user-recognizable natural language text.
  • the user can input target commands to the electronic device through an input device (eg, a virtual keyboard or a physical keyboard).
  • the user can speak to the electronic device.
  • the electronic device performs voice recognition on the voice and converts the voice into a target command.
  • Voice refers to the voice of a user who communicates with an electronic device.
  • the electronic device may receive a mixed voice, and the mixed voice includes the voice and the noise of the external environment.
  • the electronic device can utilize the user's voiceprint characteristics to separate speech from the mixed speech.
  • FIG. 3 is a schematic diagram of speech separation and recognition provided by an embodiment of the present application.
  • the mixed speech is analyzed by short-time Fourier transform (STFT) to obtain the mixed speech spectrum; the mixed speech spectrum and the user voiceprint features pre-registered in the system are input into a pre-trained speech separation model, which separates the user's speech spectrum from the mixed speech spectrum; automatic speech recognition is then performed on the separated speech spectrum of the target speech to obtain the target command.
  • the speech separation model is trained from pre-collected multi-user speech data.
  • the speech separation model can be a multi-layer long short-term memory model (LSTM).
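The separation pipeline above can be sketched as follows. This is a minimal numpy illustration only: the `stft` framing parameters, the 64-dimensional voiceprint embedding, and the stand-in `dummy_model` are illustrative assumptions, not the patent's trained multi-layer LSTM.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Frame the signal, apply a Hann window, and take the FFT (a minimal STFT)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)          # (n_frames, n_bins), complex

def separate(mixed_spec, voiceprint, mask_model):
    """Condition the separation model on the registered voiceprint to
    estimate a time-frequency mask, then apply it to the mixed spectrum."""
    mag = np.abs(mixed_spec)                      # magnitude spectrogram
    # broadcast the voiceprint embedding to every frame and concatenate
    cond = np.concatenate([mag, np.tile(voiceprint, (mag.shape[0], 1))], axis=1)
    mask = mask_model(cond)                       # values in [0, 1], shape of mag
    return mask * mixed_spec                      # estimated user speech spectrum

# toy stand-in for the trained separation model and inputs
rng = np.random.default_rng(0)
mixed = rng.standard_normal(4096)                 # mixed voice + noise waveform
spec = stft(mixed)
voiceprint = rng.standard_normal(64)              # pre-registered speaker embedding
dummy_model = lambda x: np.full((x.shape[0], spec.shape[1]), 0.5)
user_spec = separate(spec, voiceprint, dummy_model)
```

The separated spectrum `user_spec` would then be passed to an automatic speech recognition stage to obtain the target command text.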
  • the electronic device generates the target decision of the target command by using the historical command, the historical decision of the historical command, and the target command.
  • FIG. 4 is a schematic flowchart of another human-computer interaction method provided in this embodiment, wherein the method flow shown in FIG. 4 is an elaboration of the specific operation process included in S202 in FIG. 2.
  • the electronic device performs weighted coding on the historical command based on the command semantic correlation weight to obtain historical command coding information.
  • the electronic device generates a target decision according to the target command, the coding information of the historical command, and the historical decision of the historical command.
  • the command semantic correlation weight represents the semantic correlation degree between the target command and the historical command.
  • a historical command related to a target command may be a historical command that has some connection with the intent of the target command. For example, the target command is "the temperature is a little cold", and the historical command is "it's so hot, turn on the air conditioner to 20 degrees". Both the target command and the history command are related to adjusting the temperature of the air conditioner. However, the target command does not clearly indicate that the temperature of the air conditioner is adjusted, and the historical command indicates that the air conditioner is adjusted to a specific temperature.
  • the electronic device semantically encodes the target command to obtain the semantic vector of the target command, and calculates the similarity of the semantic vector of the target command and the semantic vector of the historical command to obtain the command semantic correlation weight.
  • the electronic device first performs Chinese word segmentation on the target command to obtain a word vector of the target command.
  • Chinese word segmentation refers to dividing a continuous sequence of words into individual words.
  • the electronic device inputs the word vector of the target command into the semantic encoding model for encoding, and obtains the semantic vector of the target command.
  • the semantic encoding model can be a recurrent neural network (RNN), and the most commonly used RNN model is a bidirectional long short-term memory (BiLSTM).
  • BiLSTM can be implemented using a network with 3 hidden layers of 600 nodes each. For example, the target command is "the temperature is a little cold", and Chinese word segmentation of the target command yields the word vector of "temperature", the word vector of "a bit", and the word vector of "cold".
  • the electronic device stores the semantic vector of the target command, so that the semantic vector of the target command can be used as the semantic vector of the historical command to assist the electronic device to perform natural language understanding of subsequent commands.
  • the semantic vector of historical commands may be a matrix of M columns, each column representing a semantic vector of commands of a historical human-computer interaction task. Multiply each column in the matrix with the semantic vector of the target command to get the command semantic correlation weight.
  • the command semantic correlation weight satisfies Equation (1):
  • p_m = u^T h_m, for m = 1, …, M    (1)
  • u^T represents the transposed semantic vector of the target command
  • h_m represents the semantic vector of the m-th historical command
  • p_m represents the command semantic correlation weight
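The similarity between the target command's semantic vector and each historical command's semantic vector can be sketched as a dot product per history column; the softmax normalization (so the weights sum to one) and the toy vectors here are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def command_semantic_weights(u, H):
    """u: semantic vector of the target command, shape (d,).
    H: semantic vectors of M historical commands, one per column, shape (d, M).
    Returns one relevance weight per historical command (dot-product
    similarity u^T h_m, followed here by a softmax)."""
    scores = u @ H                        # u^T h_m for each column m
    scores = scores - scores.max()        # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights

u = np.array([0.2, 0.9, 0.1])             # e.g. "the temperature is a little cold"
H = np.array([[0.1, 0.8],
              [0.9, 0.1],
              [0.2, 0.3]])                # two historical commands
p = command_semantic_weights(u, H)         # higher weight for the more related one
```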
  • the electronic device may further perform weighted encoding on the historical command based on the command semantic correlation weight and the user weight to obtain historical command encoding information.
  • the user weight represents the degree of association between the user and the historical user who issued the historical command.
  • the electronic device may prompt the user to provide the user's voiceprint, and the electronic device stores the user's voiceprint. The electronic device compares the user's voiceprint with the historical user's voiceprint to obtain the degree of association between the user and the historical user, that is, the user weight.
  • Historical users refer to users who have performed human-computer interaction with electronic devices.
  • the electronic device obtains the similarity between the user and the historical user according to the user's voiceprint, and obtains the user weight.
  • the degree of similarity may be the likelihood that the user is a given historical user. It is understandable that if the user weight is large, the user is more likely to be the historical user, and a higher weight is set; if the user weight is small, the user is less likely to be the historical user, and a lower weight is set.
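A hedged sketch of the user weight: cosine similarity between the current user's voiceprint embedding and each registered historical voiceprint. The embedding dimension and the toy vectors are illustrative assumptions; the patent does not fix a particular similarity measure.

```python
import numpy as np

def user_weight(voiceprint, historical_voiceprints):
    """voiceprint: current user's embedding, shape (d,).
    historical_voiceprints: one row per historical user, shape (M, d).
    Returns cosine similarity per historical user; a higher value means the
    speaker is more likely to be that historical user."""
    v = voiceprint / np.linalg.norm(voiceprint)
    H = historical_voiceprints / np.linalg.norm(
        historical_voiceprints, axis=1, keepdims=True)
    return H @ v                          # one weight per historical user, in [-1, 1]

v1 = np.array([1.0, 0.0, 0.0])            # current speaker's embedding
H = np.array([[1.0, 0.0, 0.0],            # same speaker registered earlier
              [0.0, 1.0, 0.0]])           # a different family member
weights = user_weight(v1, H)
```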
  • the electronic device may further perform weighted coding on the historical command based on the command semantic correlation weight, the user weight and the user relationship correlation degree weight to obtain historical command coding information.
  • the user relationship relevance weight is a preset relationship strength value of multiple users.
  • the electronic device is a smart home, the users who use the smart home are usually fixed family members, and any member of the family member can set the value of the strength of the relationship with other members. Relationship strength values can include high, medium, low, and no relationship, among others.
  • the electronic device performs weighted encoding on the historical command based on the weighting information to obtain the semantic vector of the weighted encoded historical command, merges the semantic vector of the weighted encoded historical command with the semantic vector of the target command, and encodes the result through a fully connected network to obtain the historical command encoding information.
  • the weighting information includes at least one of a command semantic relevance weight, a user weight, and a user relationship relevance weight.
  • the semantic vector of the weighted encoded historical command satisfies Equation (2):
  • h' = Σ_{m=1}^{M} p_m S_m h_m    (2)
  • h' represents the semantic vector of the weighted encoded historical command
  • p_m represents the command semantic correlation weight
  • h_m represents the semantic vector of the m-th historical command
  • S_m represents, depending on the weighting information used, the user weight, the user relationship relevance weight, or the product of the user weight and the user relationship relevance weight
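The weighted encoding of the history can be sketched as a weighted sum over the historical command vectors; `s` stands in for the optional S_m term (the toy values are assumptions for illustration).

```python
import numpy as np

def weighted_history_encoding(H, p, s=None):
    """h' = sum_m p_m * S_m * h_m (Equation (2) in spirit).
    H: (d, M) historical command semantic vectors, one per column.
    p: (M,) command semantic correlation weights.
    s: optional (M,) user / user-relationship weights (the S_m term)."""
    w = p if s is None else p * s
    return H @ w                          # (d,) weighted-encoded history vector

H = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # two historical command vectors
p = np.array([0.7, 0.3])                 # semantic relevance weights
s = np.array([1.0, 0.5])                 # e.g. user weight per historical command
h_prime = weighted_history_encoding(H, p, s)   # → [0.7, 0.15]
```

This vector would then be merged with the target command's semantic vector and passed through a fully connected network to obtain the historical command encoding information.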
  • FIG. 5 is a schematic flowchart of another human-computer interaction method provided in this embodiment, wherein the method flow described in FIG. 5 is an elaboration of the specific operation process included in S2022 in FIG. 4.
  • the electronic device uses the intent understanding model to perform natural language understanding on the word vector of the target command and historical command coding information, and obtains the intent and slot of the target command.
  • the electronic device generates a target decision according to the intent and slot position of the target command and the historical decision coding vector of the historical command.
  • the electronic device performs Chinese word segmentation on the target command to obtain a word vector of the target command (execute S601 ).
  • the target command is "the temperature is a little cold"
  • the word vector of the target command includes the word vector of "temperature", the word vector of "a bit", and the word vector of "cold".
  • the electronic device uses the semantic encoding model to encode the word vector of the target command to obtain the semantic vector of the target command (go to S602).
  • the electronic device calculates the similarity according to the semantic vector of the target command and the semantic vector of the historical command to obtain the command semantic correlation weight
  • the command semantic correlation weight includes the degree of semantic correlation between the target command "the temperature is a little cold" and the historical command "it's so hot, turn on the air conditioner to 20 degrees". It is worth noting that the command semantic correlation weight includes the degrees of semantic correlation between the target command and the commands of multiple historical human-computer interaction tasks. Weighted encoding is performed on the semantic vector of the historical command based on the first weighting information to obtain historical command encoding information (go to S604).
  • the first weighting information includes command semantic-related weights.
  • the first weighting information further includes user weights and user relationship relevance weights.
  • the electronic device uses the intent understanding model to perform natural language understanding on the word vector of the target command and historical command coding information, and obtains the intent and slot of the target command (go to S605). For example, if the target command is "temperature is a little cold," the intent of the target command may indicate that it is about adjusting the temperature.
  • the slots of the target command can be "temperature", "a bit", and "cold".
  • the intent understanding model can be an RNN; in the following specific case, it is implemented using a BERT model based on the Transformer structure.
  • Intent understanding models can be trained with bidirectional encoder representations from transformers (BERT).
  • BERT is a bidirectional Transformer-based model proposed by Google. It can be pre-trained using a large amount of unsupervised text corpus. The pre-training process includes two techniques: one is to randomly mask some characters in the training sentences and predict the masked characters; the other is to train the model to understand the relationship between sentences, predicting the next sentence given the current text.
  • the intent understanding model includes a deep structure pre-trained for semantic analysis; the BERT model is then fine-tuned on the target-related intent understanding task.
  • the first input of the BERT network uses the weighted semantic encoding vector; as shown in FIG. 7, the word vector of the target command and the historical command encoding information are input into the intent understanding model to obtain the intent and slot of the target command.
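A toy sketch of the joint intent-and-slot prediction described above: a pooled sentence vector feeds an intent classifier and each token vector feeds a slot classifier. The linear heads, random weights, and mean pooling are stand-ins for a fine-tuned BERT encoder, not the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens, n_intents, n_slots = 8, 3, 4, 5      # toy sizes, e.g. 3 tokens:
                                                   # "temperature" / "a bit" / "cold"
token_vecs = rng.standard_normal((n_tokens, d))   # encoder output, one vector per token
pooled = token_vecs.mean(axis=0)                  # stand-in for a [CLS]-style pooled vector

W_intent = rng.standard_normal((n_intents, d))    # intent classification head
W_slot = rng.standard_normal((n_slots, d))        # slot tagging head

intent = int(np.argmax(W_intent @ pooled))        # one intent for the whole command
slots = np.argmax(token_vecs @ W_slot.T, axis=1)  # one slot label per token
```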
  • the decision model is a classification model whose inputs are the intent, the dialogue state, and system database information, and whose output is a specific decision.
  • the electronic device encodes the intent of the target command, the dialog state, and the information obtained in the system database to obtain a decision encoding vector (execute S606 ).
  • the encoding network can be implemented using a multilayer convolutional neural network (CNN).
  • the similarity calculation is performed between the electronic device decision coding vector and the historical decision coding vector to obtain the weight of the historical decision coding vector (go to S607).
  • historical decisions are system actions determined by the electronic device based on historical commands. For example, the historical command is "how is the weather today", and the system action is to output "overcast, temperature".
  • the historical command is "It's so hot, turn on the air conditioner to 20 degrees".
  • the system action is to adjust the temperature of the air conditioner to 20 degrees.
  • the weight of the historical decision coding vector includes the degree of correlation between the decision coding vector of the target command "the temperature is a little cold" and the historical decision coding vector of the historical decision "turn on the air conditioner to 20 degrees". It is worth noting that the weight of the historical decision coding vector includes the degrees of correlation between the decision coding vector of the target command and the historical decision coding vectors of the decisions of multiple historical human-computer interaction tasks.
  • the electronic device performs weighted encoding on the historical decision encoding vector based on the second weighted information to obtain historical decision encoding information (go to S608).
  • the second weighting information includes historical decision coding vector weights.
  • the second weighting information further includes user weights and user relationship relevance weights.
  • the weight of the historical decision coding vector represents the degree of correlation between the decision coding vector and the historical decision coding vector.
  • the electronic device uses the decision model to analyze the decision coding vector and the historical decision coding information to generate a target decision (go to S609).
  • the target command is "the temperature is a little cold"
  • the target decision can be "turn on the air conditioner to 29 degrees”.
  • Decision models can be implemented using shallow classifiers, such as support vector machines, or deep neural networks (DNNs), such as multi-layer fully connected feedforward networks (FNNs).
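A minimal feed-forward sketch of such a decision classifier: the decision coding vector and the historical decision encoding information are concatenated and mapped to a score per candidate system action. The one-hidden-layer FNN, random weights, and toy dimensions are illustrative assumptions.

```python
import numpy as np

def decision_model(decision_vec, history_info, W1, W2):
    """A one-hidden-layer feed-forward classifier: concatenate the target
    decision coding vector with the historical decision encoding information
    and return the index of the highest-scoring system action."""
    x = np.concatenate([decision_vec, history_info])
    h = np.maximum(W1 @ x, 0.0)            # ReLU hidden layer
    return int(np.argmax(W2 @ h))          # index of the chosen action

rng = np.random.default_rng(2)
d, hdim, n_actions = 6, 16, 3
W1 = rng.standard_normal((hdim, 2 * d))
W2 = rng.standard_normal((n_actions, hdim))
action = decision_model(rng.standard_normal(d), rng.standard_normal(d), W1, W2)
```

In practice the chosen action index would be mapped to a concrete system action such as "turn on the air conditioner to 29 degrees".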
  • the electronic device stores the decision coding vector, which assists the electronic device to make decisions on the target command issued by the subsequent user.
  • the historical decision coding vector may be a matrix of M columns, and each column represents the historical decision coding vector of a decision of a historical human-computer interaction task. Multiplying each column of the matrix by the decision coding vector yields the weight of the historical decision coding vector. The weight of the historical decision coding vector satisfies Equation (3):
  • q_m = w^T k_m, for m = 1, …, M    (3)
  • w^T represents the transposed decision coding vector of the target command
  • k_m represents the historical decision coding vector of the m-th historical decision
  • q_m represents the weight of the historical decision coding vector
  • the electronic device performs weighted encoding on the historical decision coding vector based on the second weighting information to obtain the weighted encoded historical decision coding vector, combines the weighted encoded historical decision coding vector with the decision coding vector, and encodes the result through a fully connected network to obtain the historical decision encoding information.
  • the weighted encoded historical decision encoding vector satisfies Equation (4):
  • k' = Σ_{m=1}^{M} q_m S_m k_m    (4)
  • k' represents the weighted encoded historical decision encoding vector
  • q_m represents the weight of the historical decision coding vector
  • k_m represents the historical decision coding vector of the m-th historical decision
  • S_m represents, depending on the weighting information used, the user weight, the user relationship relevance weight, or the product of the user weight and the user relationship relevance weight
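The decision-side weighting can be sketched as a dot-product weighting followed by a weighted sum over the historical decision vectors (the toy vectors are assumptions for illustration).

```python
import numpy as np

def weighted_decision_encoding(w, K, s=None):
    """q_m = w^T k_m, then k' = sum_m q_m * S_m * k_m (Equations (3)-(4) in spirit).
    w: (d,) decision coding vector of the target command.
    K: (d, M) historical decision coding vectors, one per column.
    s: optional (M,) user / user-relationship weights (the S_m term)."""
    q = w @ K                             # (M,) weights of the historical decisions
    if s is not None:
        q = q * s
    return K @ q                          # (d,) weighted-encoded history decisions

w = np.array([1.0, 0.0])                  # target decision coding vector
K = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # two historical decision vectors
k_prime = weighted_decision_encoding(w, K)     # → [1.0, 0.0]
```

The result would then be combined with the decision coding vector and encoded through a fully connected network to obtain the historical decision encoding information.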
  • the electronic device uses the decision model to analyze the decision coding vector, the historical decision coding information and the user portrait of the user to determine the target decision.
  • User portraits are also called user roles.
  • User portraits are virtual representatives of real users. As an effective tool for delineating users, linking user demands and design directions, user portraits have been widely used in various fields.
  • the electronic device outputs a target decision.
  • the electronic device can use the natural language generation (Natural Language Generation, NLG) technology to map the target decision into a natural language expression, that is, to generate the target decision text according to the target decision.
  • Natural language generation refers to converting machine-readable decisions into natural language text.
  • the electronic device can display the target decision text through the display screen, so that the user can obtain the system dialogue statement output by the electronic device.
  • the electronic device may also convert the target decision text into target decision voice, and play it to the user in the form of voice.
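The natural language generation step can be sketched with simple templates mapping a structured target decision to a sentence. The action names and templates here are hypothetical illustrations; a real system may use a trained generator instead.

```python
# Hypothetical template-based NLG: map a structured target decision to text.
def generate_text(decision):
    templates = {
        "set_ac_temperature": "Okay, turning on the air conditioner to {value} degrees.",
        "report_weather": "Today's weather: {value}.",
    }
    return templates[decision["action"]].format(value=decision["value"])

text = generate_text({"action": "set_ac_temperature", "value": 29})
# → "Okay, turning on the air conditioner to 29 degrees."
```

The generated text can then be shown on the display screen or converted to speech and played to the user.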
  • the semantics of the historical commands in historical human-computer interaction tasks are referred to, which assists the natural language understanding of the target command and makes the natural language understanding result closer to the real intention of the user; historical decisions are referred to when executing system decisions, and the target decision can be optimized according to the historical decisions, which effectively improves the user experience of human-computer interaction.
  • the electronic device includes corresponding hardware structures and/or software modules for performing each function.
  • the units and method steps of each example described in conjunction with the embodiments disclosed in the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software-driven hardware depends on the specific application scenarios and design constraints of the technical solution.
  • FIG. 8 is a schematic structural diagram of a possible human-computer interaction apparatus provided by an embodiment of the present application.
  • These human-computer interaction apparatuses can be used to implement the functions of the electronic device in the above method embodiments, and thus can also achieve the beneficial effects of the above method embodiments.
  • the human-computer interaction apparatus may be an electronic device as shown in FIG. 1 , or may be a module (eg, a chip) applied to the electronic device.
  • the human-computer interaction apparatus 800 includes an acquisition unit 810 , a processing unit 820 and a feedback unit 830 .
  • the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 2 , FIG. 4 , FIG. 5 or FIG. 6 .
  • when the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 2: the obtaining unit 810 is used to perform S201; the processing unit 820 is used to perform S202; and the feedback unit 830 is used to perform S203.
  • when the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 4: the obtaining unit 810 is used to perform S201; the processing unit 820 is used to perform S2021 and S2022; and the feedback unit 830 is used to perform S203.
  • when the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 5: the obtaining unit 810 is used to execute S201; the processing unit 820 is used to execute S2021, S20221 and S20222; and the feedback unit 830 is used to execute S203.
  • the processing unit 820 is used to execute S601 to S609.
  • processing unit 820 and feedback unit 830 can be obtained directly by referring to the relevant descriptions in the method embodiments shown in FIG. 2 , FIG. 4 , FIG. 5 or FIG. 6 , and details are not repeated here.
  • the functions of the acquiring unit 810 , the processing unit 820 and the feedback unit 830 may be implemented by the processor 110 in FIG. 1 described above.
  • the human-computer interaction device 900 may include a speech recognition unit 910 , a language understanding unit 920 , a dialogue management unit 930 , a language generation unit 940 and a speech synthesis unit 950 .
  • the speech recognition unit 910 is used to realize the function of the acquisition unit 810 .
  • the voice recognition unit 910 is used to recognize the voice issued by the user to obtain the target command.
  • the language understanding unit 920 and the dialogue management unit 930 are used to implement the functions of the processing unit 820 to obtain target decisions.
  • the language understanding unit 920 is configured to use the intent understanding model to perform natural language understanding on the word vector of the target command and historical command coding information, and obtain the intent and slot of the target command.
  • the dialogue management unit 930 is configured to generate the target decision according to the intent and slot of the target command and the historical decision encoding vector of the historical command.
  • the language generation unit 940 and the speech synthesis unit 950 are used to realize the function of the feedback unit 830 .
  • the language generation unit 940 is used to convert the target decision into natural language.
  • the speech synthesis unit 950 is used to feed back the decision language to the user.
  • the processor in the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field Programmable Gate Array (Field Programmable Gate Array, FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the method steps in the embodiments of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
  • software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC.
  • the ASIC may reside in a network device or an electronic device.
  • the processor and storage medium may also exist as discrete components in a network device or an electronic device.
  • the above-mentioned embodiments it may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • software it can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer programs or instructions.
  • the processes or functions described in the embodiments of the present application are executed in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus.
  • the computer program or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program or instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
  • the computer-readable storage medium may be any available media that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media.
  • the usable medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; an optical medium, such as a digital video disc (DVD); or a semiconductor medium, such as a solid state drive (SSD).
  • “at least one” means one or more, and “plurality” means two or more.
  • “And/or”, which describes the relationship of the associated objects, indicates that there can be three kinds of relationships, for example, A and/or B, it can indicate that A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
  • the character "/" generally indicates that the associated objects are in an "or" relationship; in the formulas of this application, the character "/" indicates that the associated objects are in a "division" relationship.

Abstract

The present application relates to the field of artificial intelligence, and provides a human-computer interaction method and device. The method comprises: receiving a target command sent by a user; generating a target decision of the target command by using a historical command, a historical decision of the historical command and the target command, the historical command being a command of a historical human-computer interaction task, and the target command being a command of a current human-computer interaction task; and outputting a target decision.

Description

Human-computer interaction method and device
This application claims priority to the Chinese patent application No. 202010886462.3, entitled "Human-computer interaction method and device", filed with the State Intellectual Property Office on August 28, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence (AI), and in particular, to a human-computer interaction method and device.
Background
With the development of artificial intelligence (AI), electronic devices can use human-computer interaction (HCI) techniques to communicate with users, so that the electronic devices understand the users' intentions and complete the work the users intend. At present, human-computer interaction has a wide range of applications in many fields, such as smart homes and automatic driving. However, the interaction of electronic devices with users is not very "natural" or "smart". The intention obtained by an electronic device through natural language understanding of a received voice command is often not very close to the real intention of the user. In addition, the decision-making of electronic devices is mechanical and rigid and cannot give users optimal decisions. The user experience of human-computer interaction is therefore low.
Summary of the Invention
The present application provides a human-computer interaction method and device. When natural language understanding is performed on a target command issued by a user in the current human-computer interaction task, the semantics of historical commands in historical human-computer interaction tasks are referred to, which assists the natural language understanding of the target command and makes the result of natural language understanding closer to the user's real intention; historical decisions are referred to when executing system decisions, and the target decision can be optimized according to the historical decisions, effectively improving the user experience of human-computer interaction.
To achieve the above objective, the present application adopts the following technical solutions:
In a first aspect, the present application provides a human-computer interaction method. The method can be applied to an electronic device, or to a human-computer interaction apparatus that can support an electronic device in implementing the method; for example, the human-computer interaction apparatus includes a chip system. The method includes: after the electronic device receives a target command issued by a user, generating a target decision of the target command by using a historical command, a historical decision of the historical command, and the target command, and outputting the target decision. The historical command is a command of a historical human-computer interaction task, and the target command is a command of the current human-computer interaction task. The historical command may be a command of multiple rounds of human-computer interaction tasks performed by one or more historical users with the electronic device. For example, the historical commands may be commands of multiple rounds of human-computer interaction tasks performed by multiple historical users with the electronic device. The historical users may also include the user who issued the target command.
In this way, when natural language understanding is performed on the target command issued by the user in the current human-computer interaction task, the semantics of the historical commands in historical human-computer interaction tasks are referred to, which assists the natural language understanding of the target command and makes the natural language understanding result closer to the real intention of the user; historical decisions are referred to when executing system decisions, and the target decision can be optimized according to the historical decisions, which effectively improves the user experience of human-computer interaction.
In a possible implementation, generating the target decision for the target command by using the historical commands, their historical decisions, and the target command includes: the electronic device performs weighted encoding on the historical commands based on command semantic correlation weights to obtain historical command encoding information, where a command semantic correlation weight indicates the degree of semantic correlation between the target command and a historical command; and the target decision is generated according to the target command, the historical command encoding information, and the historical decisions of the historical commands.
Before the weighted encoding of the historical commands based on the command semantic correlation weights, the electronic device semantically encodes the target command to obtain a semantic vector of the target command, and computes the similarity between the semantic vector of the target command and the semantic vectors of the historical commands to obtain the command semantic correlation weights.
In another possible implementation, performing weighted encoding on the historical commands based on the command semantic correlation weights to obtain the historical command encoding information includes: performing weighted encoding on the historical commands based on the command semantic correlation weights and user weights, where a user weight represents the degree of association between the user and the historical user who issued the historical command. The electronic device may determine the degree of association between the user and the historical users from the user's voiceprint to obtain the user weights.
In another possible implementation, performing weighted encoding on the historical commands based on the command semantic correlation weights and the user weights to obtain the historical command encoding information includes: performing weighted encoding on the historical commands based on the command semantic correlation weights, the user weights, and user relationship correlation weights, where a user relationship correlation weight is a preset relationship strength value between users.
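The weighted encoding described above can be sketched as an attention-style aggregation: similarity between the target command's semantic vector and each historical command's semantic vector yields the command semantic correlation weights, which may then be modulated by the user weights and the preset user-relationship strengths. The following is a minimal illustration; the cosine-similarity score, the softmax-style normalization, and the multiplicative combination of the three weight factors are assumptions of this sketch rather than details fixed by the application:

```python
import numpy as np

def encode_history(target_vec, history_vecs, user_weights=None, relation_weights=None):
    """Weighted encoding of historical command semantic vectors.

    target_vec:       semantic vector of the target command, shape (d,)
    history_vecs:     semantic vectors of historical commands, shape (n, d)
    user_weights:     per-command association between the current user and the
                      historical user who issued it, shape (n,) (optional)
    relation_weights: preset relationship-strength values, shape (n,) (optional)
    """
    # Command semantic correlation weights: cosine similarity between the
    # target command and each historical command (one assumed choice of score).
    sims = history_vecs @ target_vec / (
        np.linalg.norm(history_vecs, axis=1) * np.linalg.norm(target_vec) + 1e-9)
    weights = np.exp(sims - sims.max())
    # Optionally modulate by user weight and user-relationship weight.
    if user_weights is not None:
        weights = weights * user_weights
    if relation_weights is not None:
        weights = weights * relation_weights
    weights = weights / weights.sum()          # normalize to sum to 1
    # Historical command encoding information: weighted sum of the vectors.
    return weights @ history_vecs

target = np.array([1.0, 0.0, 0.0])
history = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
enc = encode_history(target, history, user_weights=np.array([1.0, 0.5]))
```

In this sketch a historical command that is both semantically closer to the target command and issued by a more strongly associated user contributes more to the historical command encoding information.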
In another possible implementation, generating the target decision according to the target command, the historical command encoding information, and the historical decisions of the historical commands includes: using an intent understanding model to perform natural language understanding on the word vectors of the target command together with the historical command encoding information to obtain the intent and slots of the target command; and generating the target decision according to the intent and slots of the target command and the historical decision encoding vectors of the historical commands.
Specifically, generating the target decision according to the intent and slots of the target command and the historical decision encoding vectors of the historical commands includes: encoding the intent and slots of the target command to obtain a decision encoding vector; performing weighted encoding on the historical decision encoding vectors based on historical decision encoding vector weights to obtain historical decision encoding information, where a historical decision encoding vector weight represents the degree of correlation between the decision encoding vector and a historical decision encoding vector; and analyzing the decision encoding vector and the historical decision encoding information with a decision model to generate the target decision. Because the historical decisions of the historical commands are consulted when making the system decision, the decision content can be optimized accordingly, which enriches the information available to the electronic device for decision-making and effectively improves the user experience of human-computer interaction.
Before the weighted encoding of the historical decision encoding vectors based on the historical decision encoding vector weights, the electronic device computes the similarity between the decision encoding vector and the historical decision encoding vectors to obtain the historical decision encoding vector weights.
In another possible implementation, performing weighted encoding on the historical decision encoding vectors based on the historical decision encoding vector weights to obtain the historical decision encoding information includes: performing the weighted encoding based on the historical decision encoding vector weights and the user weights; or performing the weighted encoding based on the historical decision encoding vector weights, the user weights, and the user relationship correlation weights.
In another possible implementation, encoding the intent and slots of the target command to obtain the decision encoding vector includes: encoding the intent and slots of the target command together with the occupancy state of the electronic device to obtain the decision encoding vector. The occupancy state of the electronic device is used to enhance the semantic vectors of the historical commands, further improving the accuracy of the system decision.
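The decision stage described in these implementations can be sketched in the same attention style: the intent, slots, and (optionally) the device occupancy state are encoded into a decision encoding vector, similarity against the stored historical decision encoding vectors yields the weights, and a decision model then consumes the current vector together with the weighted history. The concatenate-and-project encoder and the dot-product similarity below are illustrative assumptions, not details fixed by the application:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed projection standing in for a learned decision encoder.
W = rng.normal(size=(6, 4))

def decision_vector(intent_vec, slot_vec, device_state):
    """Encode intent + slots (+ device occupancy state) into one vector."""
    return np.concatenate([intent_vec, slot_vec, device_state]) @ W

def history_decision_info(dec_vec, history_dec_vecs):
    """Weight historical decision encoding vectors by similarity to dec_vec."""
    scores = history_dec_vecs @ dec_vec        # dot-product similarity
    w = np.exp(scores - scores.max())          # softmax-style normalization
    w /= w.sum()
    return w @ history_dec_vecs

intent = np.array([1.0, 0.0])
slots = np.array([0.0, 1.0])
state = np.array([1.0, 0.0])                   # e.g. a "device busy" flag
dec = decision_vector(intent, slots, state)
hist = rng.normal(size=(3, 4))                 # stored historical decision vectors
info = history_decision_info(dec, hist)
# A decision model would then take np.concatenate([dec, info]) as its input.
```

Including the occupancy state in the encoded vector lets two identical commands produce different decisions depending on whether the device is already busy, which is the effect the implementation above describes.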
In a second aspect, the present application provides a human-computer interaction apparatus applied to an electronic device. The electronic device includes a voice transceiver, which is used to receive the target command issued by the user and to feed the decision back to the user by voice. The apparatus includes an acquisition unit, a processing unit, and a feedback unit. The acquisition unit is configured to receive the target command issued by the user; the processing unit is configured to generate the target decision for the target command by using the historical commands, their historical decisions, and the target command, where a historical command is a command of a historical human-computer interaction task and the target command is the command of the current human-computer interaction task; and the feedback unit is configured to output the target decision. In this way, when natural language understanding is performed on the target command, the semantics of the historical commands are taken into account, which assists the understanding and brings the result closer to the user's real intention; the historical decisions are consulted when making the system decision, so the target decision can be optimized according to them, effectively improving the user experience of human-computer interaction. These units can perform the corresponding functions in the method examples of the first aspect; for details, refer to the detailed descriptions in the method examples, which are not repeated here.
In a third aspect, the present application provides an electronic device including at least one processor, a memory, and a voice transceiver. The voice transceiver is used to receive the target command issued by the user and feed the decision back to the user by voice; the memory is used to store computer programs and instructions; and the processor is used to invoke the computer programs and instructions and, in cooperation with the voice transceiver, execute the human-computer interaction method according to the first aspect or any of its possible implementations.
In a fourth aspect, the present application provides a computer-readable storage medium comprising computer software instructions; when the computer software instructions run in an electronic device, they cause the electronic device to perform the method according to the first aspect or its possible implementations.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method according to the first aspect or its possible implementations.
In a sixth aspect, the present application provides a chip system applied to an electronic device. The chip system includes an interface circuit and a processor interconnected by a line; the interface circuit is used to receive a signal from a memory of the electronic device and send it to the processor, the signal including computer instructions stored in the memory; when the processor executes the computer instructions, the chip system performs the method according to the first aspect or its possible implementations.
It should be understood that the description of technical features, technical solutions, beneficial effects, or similar language in this application does not imply that all of these features and advantages can be realized in any single embodiment. Rather, a description of a feature or beneficial effect means that at least one embodiment includes that specific technical feature, technical solution, or beneficial effect. Therefore, descriptions of technical features, technical solutions, or beneficial effects in this specification do not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions, and beneficial effects described in the embodiments may be combined in any suitable manner. Those skilled in the art will understand that an embodiment may be implemented without one or more of the specific technical features, technical solutions, or beneficial effects of a particular embodiment. In other embodiments, additional technical features and beneficial effects may be identified in specific embodiments that do not embody all of the embodiments.
Description of Drawings
FIG. 1 is a schematic diagram of the composition of an electronic device according to an embodiment of the present application;
FIG. 2 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a speech recognition process according to an embodiment of the present application;
FIG. 4 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 5 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 6 is a flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an intent understanding model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the composition of a human-computer interaction apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the composition of a human-computer interaction apparatus according to an embodiment of the present application.
Detailed Description
The terms "first", "second", and "third" in the specification and claims of the present application and in the above drawings are used to distinguish different objects, not to define a specific order.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to serve as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be construed as preferred over or more advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the related concept in a concrete manner.
"A plurality of" means two or more, and other quantifiers are similar. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, for an element appearing in the singular form "a", "an", or "the", unless the context clearly dictates otherwise, this does not mean "one or only one" but "one or more than one"; for example, "a device" means one or more such devices. Furthermore, "at least one of ..." means one or any combination of the subsequently associated objects; for example, "at least one of A, B, and C" includes A, B, C, AB, AC, BC, or ABC.
The implementation of the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The electronic device in this embodiment is a device including a display screen and a camera. The specific form of the electronic device is not particularly limited in the embodiments of the present application. For example, the electronic device may be a television, a tablet computer, a projector, a mobile phone, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an augmented reality (AR) or virtual reality (VR) device, or an Internet of things (IoT) device such as a smart speaker or a smart TV.
Please refer to FIG. 1, which is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 1, the electronic device includes: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a power management module 140, an antenna, a wireless communication module 160, an audio module 170, a speaker 170A, a speaker interface 170B, a microphone 170C, a sensor module 180, buttons 190, an indicator 191, a display screen 192, a camera 193, and so on. The sensor module 180 may include sensors such as a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, and an ambient light sensor.
It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device. In other embodiments, the electronic device may include more or fewer components than shown, or some components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. Different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the electronic device. The controller can generate an operation control signal according to an instruction operation code and a timing signal, and complete the control of fetching and executing instructions.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instruction or data again, it can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, and/or a USB interface, among others.
In this embodiment, when performing natural language understanding on the target command, the processor 110 combines the historical commands with the target command to obtain the intent and slots of the target command. A historical command is a command of a historical human-computer interaction task, and the target command is the command of the current human-computer interaction task. Optionally, the historical commands may be commands of multiple historical human-computer interaction tasks, which may be tasks in which multiple historical users conducted multiple rounds of dialogue with the electronic device. The processor 110 then determines the target decision by combining the decisions of the historical commands with the intent and slots of the target command.
The intent and the slots together constitute a "user action". A machine cannot directly understand natural language, so the role of the user action is to map natural language into a structured semantic representation that the machine can understand. Slots have the ability to memorize state across multiple rounds. Slots include slot positions; for example, in a taxi-hailing scenario, the slots include a departure-location slot and a destination slot.
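As an illustration of such a structured semantic representation, a taxi-hailing command and a follow-up turn might be mapped as below; the utterances, the intent name, and the slot names are hypothetical examples chosen for this sketch, not values used by the application:

```python
# Round 1: "Get me a taxi from the airport" -> intent + partially filled slots.
turn1 = {
    "intent": "book_taxi",
    "slots": {"departure": "airport", "destination": None},
}

# Round 2: "To the central station" -- slots memorize state across rounds,
# so the new turn only fills the slot that is still missing.
def merge_turn(state, new_slots):
    merged = dict(state["slots"])
    merged.update({k: v for k, v in new_slots.items() if v is not None})
    return {"intent": state["intent"], "slots": merged}

turn2 = merge_turn(turn1, {"destination": "central station"})
# turn2 == {"intent": "book_taxi",
#           "slots": {"departure": "airport", "destination": "central station"}}
```

The multi-round memory of the slots is what allows the second, elliptical utterance to be interpreted without repeating the departure location.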
The power management module 140 is used to connect to a power supply. The power management module 140 may also be connected to the processor 110, the internal memory 121, the display screen 192, the camera 193, the wireless communication module 160, and so on. The power management module 140 receives input from the power supply and supplies power to the processor 110, the internal memory 121, the display screen 192, the camera 193, the wireless communication module 160, and the like. In some embodiments, the power management module 140 may also be provided in the processor 110.
The wireless communication function of the electronic device may be implemented through the antenna, the wireless communication module 160, and the like. The wireless communication module 160 can provide solutions for wireless communication applied to the electronic device, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 can also receive a signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna. In some embodiments, the antenna of the electronic device is coupled to the wireless communication module 160, so that the electronic device can communicate with networks and other devices through wireless communication technologies.
The electronic device implements the display function through the GPU, the display screen 192, the application processor, and so on. The GPU is a microprocessor for image processing and connects the display screen 192 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 192 is used to display images, videos, and the like. The display screen 192 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), and so on.
The electronic device can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 192, the application processor, and so on. The ISP is used to process data fed back by the camera 193. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. An object is projected through the lens to generate an optical image on the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then transmits the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the electronic device may include 1 or N cameras 193, where N is a positive integer greater than 1.
Alternatively, the electronic device may not include a camera; that is, the camera 193 is not provided in the electronic device (for example, a television). The electronic device may connect to an external camera 193 through an interface (such as the USB interface 130). The external camera 193 may be fixed to the electronic device by an external fixing member (such as a camera bracket with a clip). For example, the external camera 193 may be fixed at an edge of the display screen 192 of the electronic device, such as the upper edge, by means of the external fixing member.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency-point energy. The video codec is used to compress or decompress digital video. The electronic device may support one or more video codecs. In this way, the electronic device can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes the various functional applications and data processing of the electronic device by running the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system and an application required by at least one function (such as a sound playback function or an image playback function). The data storage area may store data (such as audio data) created during use of the electronic device. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
The electronic device may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the microphone 170C, the speaker interface 170B, the application processor, and so on.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110. The speaker 170A, also referred to as a "loudspeaker", is used to convert an audio electrical signal into a sound signal. In this application, the speaker 170A is used to output the speech of a decision. The microphone 170C, also called a "mic" or "transducer", is used to convert a sound signal into an electrical signal. In this application, the microphone 170C is used to receive the speech of a target command or a historical command uttered by the user.
The speaker interface 170B is used to connect a wired speaker. The speaker interface 170B may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The keys 190 include a power key, volume keys, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device may receive key input and generate key signal input related to user settings and function control of the electronic device.
The indicator 191 may be an indicator light, which may be used to indicate that the electronic device is in a power-on state, a standby state, or a power-off state. For example, when the indicator light is off, the electronic device is in a power-off state; when the indicator light is green or blue, the electronic device is in a power-on state; and when the indicator light is red, the electronic device is in a standby state.
It can be understood that the structure illustrated in the embodiments of this application does not constitute a specific limitation on the electronic device. The electronic device may have more or fewer components than shown in FIG. 1, may combine two or more components, or may have a different configuration of components. For example, the electronic device may further include components such as speakers. The various components shown in FIG. 1 may be implemented in hardware, software, or a combination of hardware and software that includes one or more signal-processing or application-specific integrated circuits.
Next, with reference to FIG. 2, the human-computer interaction method provided by an embodiment of this application is described in detail.
S201. The electronic device receives a target command issued by a user.
The target command is natural-language text recognizable to the user. In some embodiments, the user may input the target command to the electronic device through an input device (for example, a virtual keyboard or a physical keyboard). In other embodiments, the user may utter speech to the electronic device; the electronic device performs speech recognition on the speech and converts it into the target command. Here, speech refers to the utterances of the user who is communicating by voice with the electronic device.
Optionally, if the user is in a noisy environment, the electronic device may receive mixed speech, which includes the user's speech and noise from the external environment. The electronic device can use the user's voiceprint features to separate the speech from the mixed speech. Illustratively, FIG. 3 is a schematic diagram of speech separation and recognition provided by an embodiment of this application. The mixed speech is analyzed by a short-time Fourier transform (STFT) to obtain a mixed speech spectrum; the mixed speech spectrum and the user voiceprint features pre-registered in the system are input into a pre-trained speech separation model, which separates the user's speech spectrum from the mixed speech spectrum. Automatic speech recognition is then performed on the separated speech spectrum to obtain the target command. The speech separation model is trained on pre-collected multi-user speech data and may be a multi-layer long short-term memory (LSTM) network.
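As an illustration only, the STFT-then-mask pipeline described above can be sketched as follows. The naive per-frame DFT and the threshold-based `toy_mask` are made-up stand-ins for a real windowed FFT and the trained LSTM separation model; `voiceprint_score` is a hypothetical scalar summarizing the voiceprint match.

```python
import math

def stft_mag(signal, frame_len=8, hop=4):
    """Magnitude spectrogram via a naive per-frame DFT (illustrating the
    STFT analysis step; real systems use windowed FFTs)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2 + 1):
            re = sum(x * math.cos(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            im = -sum(x * math.sin(2 * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames

def toy_mask(frame, voiceprint_score):
    """Stand-in for the trained LSTM separation model: keep time-frequency
    bins whose energy exceeds a voiceprint-scaled threshold."""
    peak = max(frame)
    thr = voiceprint_score * peak if peak > 0 else 0.0
    return [1.0 if b >= thr else 0.0 for b in frame]

def separate(mixed_spec, voiceprint_score, mask_model=toy_mask):
    """Apply a speaker-conditioned 0/1 mask to every bin of the mixed spectrum."""
    return [[m * b for m, b in zip(mask_model(frame, voiceprint_score), frame)]
            for frame in mixed_spec]
```

The separated spectrogram would then be handed to the automatic speech recognition stage; in a real system the mask is a continuous 0..1 value predicted per bin by the LSTM conditioned on the registered voiceprint embedding.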
S202. The electronic device generates a target decision for the target command by using historical commands, the historical decisions of those commands, and the target command.
FIG. 4 is a schematic flowchart of another human-computer interaction method provided by this embodiment, where the method flow shown in FIG. 4 elaborates the specific operations included in S202 in FIG. 2, as shown in the figure. S2021. The electronic device performs weighted encoding on the historical commands based on command semantic correlation weights to obtain historical command encoding information. S2022. The electronic device generates the target decision according to the target command, the historical command encoding information, and the historical decisions of the historical commands.
The command semantic correlation weight represents the degree of semantic correlation between the target command and a historical command, where "correlated" means related to each other. A historical command correlated with the target command may be one whose intent has some connection with the intent of the target command. For example, the target command is "the temperature is a bit cold" and the historical command is "it's so hot, set the air conditioner to 20 degrees". Both commands concern adjusting the air-conditioner temperature, but the target command does not state explicitly that the air-conditioner temperature should be adjusted, whereas the historical command sets the air conditioner to a specific temperature. Therefore, in this embodiment, when natural language understanding is performed on the target command issued by the user in the current human-computer interaction task, the semantics of historical commands from past human-computer interaction tasks are consulted to assist the natural language understanding of the target command, so that the result of natural language understanding matches the user's true intent more closely.
In one possible implementation, the electronic device semantically encodes the target command to obtain a semantic vector of the target command, and computes the similarity between the semantic vector of the target command and the semantic vectors of the historical commands to obtain the command semantic correlation weights.
Specifically, the electronic device first performs Chinese word segmentation on the target command to obtain the word vectors of the target command. Chinese word segmentation refers to splitting a continuous character sequence into individual words.
The electronic device inputs the word vectors of the target command into a semantic encoding model for encoding to obtain the semantic vector of the target command. The semantic encoding model may be a recurrent neural network (RNN); the most commonly used RNN model is the bidirectional long short-term memory (BiLSTM) network. The BiLSTM can be implemented as a network with 3 hidden layers of 600 nodes each. For example, for the target command "the temperature is a bit cold", Chinese word segmentation yields the word vectors for "temperature", "a bit", and "cold". These word vectors are input into the semantic encoding model, and through the model's inference the semantic vector of the target command "the temperature is a bit cold" is obtained.
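A minimal sketch of the recurrent encoding step, for illustration only: a single plain recurrent cell stands in for the 3-layer, 600-node BiLSTM described above, and its weights are seeded random placeholders rather than trained parameters.

```python
import math
import random

def rnn_encode(word_vectors, hidden_dim=4, seed=0):
    """Run a minimal recurrent cell over the command's word vectors; the
    final hidden state serves as the command's semantic vector.
    (Stand-in for the BiLSTM; weights are random placeholders.)"""
    rng = random.Random(seed)
    in_dim = len(word_vectors[0])
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
            for _ in range(hidden_dim)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
             for _ in range(hidden_dim)]
    h = [0.0] * hidden_dim
    for x in word_vectors:
        h = [math.tanh(sum(w_in[i][j] * x[j] for j in range(in_dim)) +
                       sum(w_rec[i][j] * h[j] for j in range(hidden_dim)))
             for i in range(hidden_dim)]
    return h
```

With the word vectors for "temperature", "a bit", and "cold" as input, the returned hidden state plays the role of the command's semantic vector u in the formulas below; a bidirectional model would run a second pass right-to-left and concatenate the two final states.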
It should be noted that the electronic device stores the semantic vector of the target command, so that it can later serve as the semantic vector of a historical command and assist the electronic device in natural language understanding of subsequent commands.
The semantic vectors of the historical commands may form a matrix with M columns, where each column is the semantic vector of the command of one historical human-computer interaction task. Multiplying the semantic vector of the target command with each column of the matrix yields the command semantic correlation weights, which satisfy formula (1).
p_m = u^T · h_m    (1)

where u^T = [x_1, …, x_j] is the semantic vector of the target command, h_m is the semantic vector of a historical command, and p_m is the command semantic correlation weight.
Optionally, the electronic device may also perform weighted encoding on the historical commands based on both the command semantic correlation weights and user weights to obtain the historical command encoding information. The user weight represents the degree of association between the current user and the historical user who issued a historical command. In some embodiments, before the user interacts with the electronic device, the electronic device may prompt the user to provide a voiceprint, which the electronic device stores. The electronic device compares the user's voiceprint with the voiceprints of historical users to obtain the degree of association between the user and each historical user, that is, the user weight. A historical user is a user who has previously interacted with the electronic device. For example, the electronic device obtains the degree of similarity between the user and a historical user from the user's voiceprint, yielding the user weight; the degree of similarity may be the likelihood that the user and the historical user are the same person. Understandably, a larger user weight means the user is more likely to be that historical user, so a higher weight is set; a smaller user weight means the user is less likely to be that historical user, so a lower weight is set.
Optionally, the electronic device may also perform weighted encoding on the historical commands based on the command semantic correlation weights, the user weights, and user relationship correlation weights to obtain the historical command encoding information. The user relationship correlation weights are preset relationship strength values among multiple users. For example, if the electronic device is a smart home device, its users are usually fixed family members, and any family member can set the relationship strength values with the other members. Relationship strength values may include high, medium, low, and unrelated.
Specifically, the electronic device performs weighted encoding on the historical commands based on the weighting information to obtain the weighted semantic vector of the historical commands, merges this weighted vector with the semantic vector of the target command, and encodes the result through a fully connected network to obtain the historical command encoding information. The weighting information includes at least one of the command semantic correlation weights, the user weights, and the user relationship correlation weights. The weighted semantic vector of the historical commands satisfies formula (2).
h′ = ∑_m p_m · h_m · S_m    (2)
where h′ is the weighted semantic vector of the historical commands, p_m is the command semantic correlation weight, h_m is the semantic vector of a historical command, and S_m is the user weight. Alternatively, S_m is the user relationship correlation weight, or S_m combines the user weight and the user relationship correlation weight.
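Formulas (1) and (2) can be transcribed directly, for illustration only. Here the history matrix is stored row-wise (one historical semantic vector h_m per row rather than per column), and all vector values are made-up toy numbers.

```python
def semantic_weights(u, history):
    """Formula (1): p_m = u^T h_m for each historical semantic vector h_m."""
    return [sum(ui * hi for ui, hi in zip(u, h)) for h in history]

def weighted_history(history, p, s):
    """Formula (2): h' = sum_m p_m * h_m * S_m, where s[m] is the user
    and/or relationship weight S_m for historical task m."""
    dim = len(history[0])
    return [sum(p[m] * history[m][i] * s[m] for m in range(len(history)))
            for i in range(dim)]
```

The resulting h′ would then be merged with the target command's semantic vector and passed through the fully connected encoding network described above.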
Further, FIG. 5 is a schematic flowchart of another human-computer interaction method provided by this embodiment, where the method flow shown in FIG. 5 elaborates the specific operations included in S2022 in FIG. 4, as shown in the figure. S20221. The electronic device uses an intent understanding model to perform natural language understanding on the word vectors of the target command and the historical command encoding information to obtain the intent and slots of the target command. S20222. The electronic device generates the target decision according to the intent and slots of the target command and the historical decision encoding vectors of the historical commands.
Specifically, as shown in (a) of FIG. 6, the electronic device performs Chinese word segmentation on the target command to obtain its word vectors (S601). For example, for the target command "the temperature is a bit cold", the word vectors include those for "temperature", "a bit", and "cold". The electronic device encodes the word vectors of the target command with the semantic encoding model to obtain the semantic vector of the target command (S602). The electronic device computes the similarity between the semantic vector of the target command and the semantic vectors of the historical commands to obtain the command semantic correlation weights (S603). Suppose the historical command is "it's so hot, set the air conditioner to 20 degrees". The command semantic correlation weight then includes the degree of semantic correlation between the target command "the temperature is a bit cold" and this historical command. Note that the command semantic correlation weights cover the degree of semantic correlation between the target command and the commands of multiple historical human-computer interaction tasks. The semantic vectors of the historical commands are weighted and encoded based on first weighting information to obtain the historical command encoding information (S604). The first weighting information includes the command semantic correlation weights; optionally, it also includes the user weights and the user relationship correlation weights. The electronic device uses the intent understanding model to perform natural language understanding on the word vectors of the target command and the historical command encoding information to obtain the intent and slots of the target command (S605). For example, for the target command "the temperature is a bit cold", the intent may indicate adjusting the temperature, and the slots may be "temperature", "a bit", and "cold". The intent understanding model may be an RNN; the specific case below is implemented with a BERT model based on the TRANSFORMER architecture.
The intent understanding model can be trained in the manner of bidirectional encoder representations from transformers (BERT). BERT is a bidirectional transformer-based model proposed by Google. It can first be pre-trained on a large amount of unsupervised text corpus; the pre-training involves two techniques: one randomly masks some characters in the training sentences and predicts the masked characters, and the other trains the model to understand inter-sentence relations by predicting the next sentence given the current text. After BERT pre-training is complete, the intent understanding model contains a pre-trained deep structure for semantic analysis, and the BERT model is then fine-tuned on the target-related intent understanding task. Note that, to introduce the previously obtained historical semantic information into the model, the first input at the start of the sentence fed to the BERT network is the weighted semantic encoding vector; as shown in FIG. 7, the word vectors of the target command and the historical command encoding information are input into the intent understanding model to obtain the intent and slots of the target command.
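A minimal sketch of placing the weighted history encoding at the first input position, for illustration only: `build_input` shows the sequence layout, while `intent_scores` is a toy mean-pool-plus-linear head standing in for the actual BERT transformer; the intent matrix values are made up.

```python
def build_input(history_code, word_vectors):
    """Sequence fed to the intent model: the weighted history encoding
    occupies the first position (where BERT's sentence-start token would
    normally sit), followed by the command's word vectors."""
    return [list(history_code)] + [list(w) for w in word_vectors]

def intent_scores(sequence, intent_matrix):
    """Toy intent head standing in for the transformer: mean-pool the
    sequence, then score each intent class with a linear layer."""
    dim = len(sequence[0])
    pooled = [sum(tok[i] for tok in sequence) / len(sequence)
              for i in range(dim)]
    return [sum(row[i] * pooled[i] for i in range(dim))
            for row in intent_matrix]
```

Because the history encoding sits in the sequence like any other token, the (real) self-attention layers can relate every command word to it, which is how the historical semantics influence the predicted intent and slots.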
The decision model is a classification model whose inputs are the intent, the dialogue state, and system database information, and whose output is a specific decision. As shown in (b) of FIG. 6, the electronic device encodes the intent of the target command, the dialogue state, and the information obtained from the system database to obtain a decision encoding vector (S606); the encoding network can be implemented with a multi-layer convolutional neural network (CNN). The electronic device computes the similarity between the decision encoding vector and the historical decision encoding vectors to obtain the historical decision encoding vector weights (S607). A historical decision is a system action that the electronic device determined from a historical command. For example, for the historical command "what's the weather like today", the system action is to output "overcast, temperature"; for the historical command "it's so hot, set the air conditioner to 20 degrees", the system action is to set the air-conditioner temperature to 20 degrees. The historical decision encoding vector weight includes the degree of correlation between the decision encoding vector of the target command "the temperature is a bit cold" and the historical decision encoding vector of the historical decision "set the air conditioner to 20 degrees". Note that the historical decision encoding vector weights cover the degree of correlation between the decision encoding vector of the target command and the historical decision encoding vectors of the decisions of multiple historical human-computer interaction tasks. The electronic device performs weighted encoding on the historical decision encoding vectors based on second weighting information to obtain the historical decision encoding information (S608). The second weighting information includes the historical decision encoding vector weights; optionally, it also includes the user weights and the user relationship correlation weights. The historical decision encoding vector weight represents the degree of correlation between the decision encoding vector and a historical decision encoding vector. The electronic device uses the decision model to analyze the decision encoding vector and the historical decision encoding information to generate the target decision (S609). For example, for the target command "the temperature is a bit cold", the target decision may be "set the air conditioner to 29 degrees". The decision model can be implemented with a shallow classifier such as a support vector machine, or with a deep neural network (DNN) such as a multi-layer fully connected feedforward neural network (FNN).
It should be noted that the electronic device stores the decision encoding vector to assist the electronic device in making decisions on target commands subsequently issued by the user.
The historical decision encoding vectors may form a matrix with M columns, where each column is the historical decision encoding vector of the decision of one historical human-computer interaction task. Multiplying the decision encoding vector with each column of the matrix yields the historical decision encoding vector weights, which satisfy formula (3).
q_m = w^T · k_m    (3)

where w^T = [y_1, …, y_j] is the decision encoding vector, k_m is a historical decision encoding vector, and q_m is the historical decision encoding vector weight.
The electronic device performs weighted encoding on the historical decision encoding vectors based on the second weighting information to obtain the weighted historical decision encoding vector, merges the weighted vector with the decision encoding vector, and encodes the result through a fully connected network to obtain the historical decision encoding information. The weighted historical decision encoding vector satisfies formula (4).
k′ = ∑_m q_m · k_m · S_m    (4)
where k′ is the weighted historical decision encoding vector, k_m is a historical decision encoding vector, q_m is the historical decision encoding vector weight, and S_m is the user weight. Alternatively, S_m is the user relationship correlation weight, or S_m combines the user weight and the user relationship correlation weight.
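Formulas (3) and (4) mirror formulas (1) and (2) on the decision side and can be transcribed the same way, for illustration only; the decision vectors are stored row-wise and the numbers are toy values.

```python
def decision_weights(w, decisions):
    """Formula (3): q_m = w^T k_m for each historical decision vector k_m."""
    return [sum(wi * ki for wi, ki in zip(w, k)) for k in decisions]

def weighted_decisions(decisions, q, s):
    """Formula (4): k' = sum_m q_m * k_m * S_m, where s[m] is the user
    and/or relationship weight S_m for historical task m."""
    dim = len(decisions[0])
    return [sum(q[m] * decisions[m][i] * s[m] for m in range(len(decisions)))
            for i in range(dim)]
```

The resulting k′ would then be merged with the current decision encoding vector and passed through the fully connected network to produce the historical decision encoding information consumed by the decision classifier.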
Optionally, the electronic device uses the decision model to analyze the decision encoding vector, the historical decision encoding information, and the user's user profile to determine the target decision. A user profile, also called a persona, is a virtual representation of a real user; as an effective tool for characterizing users and linking user needs to design directions, user profiles have been widely applied in many fields.
S203. The electronic device outputs the target decision.
To communicate with the user, the electronic device can use natural language generation (NLG) technology to map the target decision into a natural-language expression, that is, to generate target decision text from the target decision. Natural language generation refers to converting a machine-readable decision into natural-language text. The electronic device can display the target decision text on the display screen so that the user obtains the system dialogue utterance output by the electronic device. Optionally, the electronic device can also convert the target decision text into target decision speech and play it to the user.
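A minimal template-based sketch of the decision-to-text mapping, for illustration only; the decision structure, action names, and templates below are hypothetical and not part of the described method, which does not prescribe a particular NLG technique.

```python
def generate_text(decision, templates):
    """Map a structured decision to a natural-language sentence by filling
    a per-action template (a minimal template-based NLG stand-in)."""
    return templates[decision["action"]].format(**decision.get("slots", {}))

# Hypothetical action-to-template table.
TEMPLATES = {
    "set_ac_temperature": "OK, setting the air conditioner to {degrees} degrees.",
    "report_weather": "Today it is {condition}, {degrees} degrees.",
}
```

The generated sentence can then be shown on the display screen or handed to the speech synthesis stage for voice playback.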
In this way, when natural language understanding is performed on the target command issued by the user in the current human-computer interaction task, the semantics of historical commands from past human-computer interaction tasks are consulted, assisting the natural language understanding of the target command so that its result matches the user's true intent more closely; and because historical decisions are consulted when executing the system decision, the target decision can be optimized according to those historical decisions, effectively improving the user experience of human-computer interaction.
It can be understood that, to implement the functions in the foregoing embodiments, the electronic device includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the units and method steps of the examples described in the embodiments disclosed in this application, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application scenario and design constraints of the technical solution.
FIG. 8 is a schematic structural diagram of a possible human-computer interaction apparatus provided by an embodiment of this application. These human-computer interaction apparatuses can be used to implement the functions of the electronic device in the foregoing method embodiments and can therefore also achieve the beneficial effects of those embodiments. In the embodiments of this application, the human-computer interaction apparatus may be the electronic device shown in FIG. 1, or a module (for example, a chip) applied to the electronic device.
As shown in FIG. 8, the human-computer interaction apparatus 800 includes an acquisition unit 810, a processing unit 820, and a feedback unit 830. The human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiments shown in FIG. 2, FIG. 4, FIG. 5, or FIG. 6.
When the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 2: the acquisition unit 810 performs S201, the processing unit 820 performs S202, and the feedback unit 830 performs S203.
When the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 4: the acquisition unit 810 performs S201, the processing unit 820 performs S2021 and S2022, and the feedback unit 830 performs S203.
When the human-computer interaction apparatus 800 is used to implement the functions of the electronic device in the method embodiment shown in FIG. 5: the acquisition unit 810 performs S201; the processing unit 820 performs S2021, S20221, and S20222; and the feedback unit 830 performs S203.
When the apparatus 800 implements the functions of the electronic device in the method embodiment shown in FIG. 6: the processing unit 820 performs S601 to S609.
More detailed descriptions of the acquisition unit 810, the processing unit 820, and the feedback unit 830 can be obtained directly from the relevant descriptions in the method embodiments shown in FIG. 2, FIG. 4, FIG. 5, or FIG. 6, and are not repeated here. The functions of the acquisition unit 810, the processing unit 820, and the feedback unit 830 may be implemented by the processor 110 in FIG. 1 described above.
Optionally, as shown in FIG. 9, a human-computer interaction apparatus 900 may include a speech recognition unit 910, a language understanding unit 920, a dialogue management unit 930, a language generation unit 940, and a speech synthesis unit 950. The speech recognition unit 910 implements the function of the acquisition unit 810; for example, it recognizes the speech uttered by the user to obtain the target command. The language understanding unit 920 and the dialogue management unit 930 implement the functions of the processing unit 820 to obtain the target decision. For example, the language understanding unit 920 uses an intent understanding model to perform natural language understanding on the word vector of the target command and the historical command encoding information, obtaining the intent and slots of the target command; the dialogue management unit 930 then generates the target decision according to the intent and slots of the target command and the historical decision encoding vectors of the historical commands. The language generation unit 940 and the speech synthesis unit 950 implement the function of the feedback unit 830: the language generation unit 940 converts the target decision into natural language, and the speech synthesis unit 950 feeds the resulting speech back to the user.
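The five-unit pipeline of apparatus 900 (speech recognition, language understanding, dialogue management, language generation, speech synthesis) can be sketched as a chain of function calls. All function bodies below are placeholders for illustration only, not the models described in the application; the example command and reply are invented.

```python
# Hypothetical sketch of the pipeline for apparatus 900:
# ASR (910) -> NLU (920) -> dialogue management (930) -> NLG (940) -> TTS (950).

def speech_recognition(audio: bytes) -> str:
    """Unit 910: transcribe the user's speech into a target command."""
    return "turn on the living room lights"  # placeholder transcript

def language_understanding(command: str, history_encoding=None) -> dict:
    """Unit 920: derive intent and slots from the command plus any
    historical command encoding information."""
    return {"intent": "device_control", "slots": {"device": "lights"}}

def dialogue_management(intent_slots: dict, history_decisions=None) -> dict:
    """Unit 930: produce the target decision from intent/slots and
    historical decision encoding vectors."""
    return {"action": "switch_on", "target": intent_slots["slots"]["device"]}

def language_generation(decision: dict) -> str:
    """Unit 940: render the target decision as natural language."""
    return f"OK, switching on the {decision['target']}."

def speech_synthesis(text: str) -> bytes:
    """Unit 950: synthesize the reply (placeholder returns encoded text)."""
    return text.encode()

def handle_turn(audio: bytes, history_encoding=None, history_decisions=None) -> bytes:
    """One interaction turn through units 910-950."""
    command = speech_recognition(audio)
    intent_slots = language_understanding(command, history_encoding)
    decision = dialogue_management(intent_slots, history_decisions)
    return speech_synthesis(language_generation(decision))
```

In a real implementation each placeholder would be a trained model; the point is only the data flow from units 910 through 950.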
It can be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of the present application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a network device or an electronic device. Of course, the processor and the storage medium may also exist as discrete components in a network device or an electronic device.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are executed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid-state drive (SSD)).
In the various embodiments of the present application, unless otherwise specified or logically conflicting, the terms and/or descriptions in different embodiments are consistent and may be referenced in one another, and the technical features of different embodiments may be combined according to their inherent logical relationships to form new embodiments.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. In the text of this application, the character "/" generally indicates an "or" relationship between the associated objects; in the formulas of this application, the character "/" indicates a "division" relationship.
It can be understood that the various numbers and designations in the embodiments of the present application are merely distinctions made for convenience of description and are not intended to limit the scope of the embodiments. The sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic.

Claims (13)

  1. A human-computer interaction method, comprising:
    receiving a target command issued by a user;
    generating a target decision for the target command by using a historical command, a historical decision of the historical command, and the target command, wherein the historical command is a command of a historical human-computer interaction task and the target command is a command of a current human-computer interaction task; and
    outputting the target decision.
  2. The method according to claim 1, wherein generating the target decision for the target command by using the historical command, the historical decision of the historical command, and the target command comprises:
    performing weighted encoding on the historical command based on a command semantic relevance weight to obtain historical command encoding information, wherein the command semantic relevance weight represents a degree of semantic relevance between the target command and the historical command; and
    generating the target decision according to the target command, the historical command encoding information, and the historical decision of the historical command.
  3. The method according to claim 2, wherein before the weighted encoding of the historical command based on the command semantic relevance weight, the method further comprises:
    performing semantic encoding on the target command to obtain a semantic vector of the target command; and
    performing a similarity calculation according to the semantic vector of the target command and a semantic vector of the historical command, to obtain the command semantic relevance weight.
  4. The method according to claim 2 or 3, wherein the weighted encoding of the historical command based on the command semantic relevance weight to obtain the historical command encoding information comprises:
    performing weighted encoding on the historical command based on the command semantic relevance weight and a user weight to obtain the historical command encoding information, wherein the user weight represents a degree of association between the user and a historical user who issued the historical command.
  5. The method according to claim 4, wherein before the weighted encoding of the historical command based on the command semantic relevance weight and the user weight, the method further comprises:
    obtaining the degree of association between the user and the historical user according to a voiceprint of the user, to obtain the user weight.
  6. The method according to claim 4 or 5, wherein the weighted encoding of the historical command based on the command semantic relevance weight and the user weight to obtain the historical command encoding information comprises:
    performing weighted encoding on the historical command based on the command semantic relevance weight, the user weight, and a user relationship relevance weight to obtain the historical command encoding information, wherein the user relationship relevance weight is a preset relationship strength value among a plurality of users.
  7. The method according to any one of claims 2-6, wherein generating the target decision according to the target command, the historical command encoding information, and the historical decision of the historical command comprises:
    performing natural language understanding on a word vector of the target command and the historical command encoding information by using an intent understanding model, to obtain an intent and slots of the target command; and
    generating the target decision according to the intent and slots of the target command and a historical decision encoding vector of the historical command.
  8. The method according to claim 7, wherein generating the target decision according to the intent and slots of the target command and the historical decision encoding vector of the historical command comprises:
    encoding the intent and slots of the target command to obtain a decision encoding vector;
    performing weighted encoding on the historical decision encoding vector based on a historical decision encoding vector weight to obtain historical decision encoding information, wherein the historical decision encoding vector weight represents a degree of correlation between the decision encoding vector and the historical decision encoding vector; and
    analyzing the decision encoding vector and the historical decision encoding information by using a decision model, to generate the target decision.
  9. The method according to claim 8, wherein before the weighted encoding of the historical decision encoding vector based on the historical decision encoding vector weight to obtain the historical decision encoding information, the method further comprises:
    performing a similarity calculation on the decision encoding vector and the historical decision encoding vector, to obtain the historical decision encoding vector weight.
  10. The method according to claim 8 or 9, wherein the weighted encoding of the historical decision encoding vector based on the historical decision encoding vector weight to obtain the historical decision encoding information comprises:
    performing weighted encoding on the historical decision encoding vector based on the historical decision encoding vector weight and a user weight to obtain the historical decision encoding information;
    or, performing weighted encoding on the historical decision encoding vector based on the historical decision encoding vector weight, the user weight, and a user relationship relevance weight to obtain the historical decision encoding information.
  11. A human-computer interaction apparatus, comprising:
    an acquisition unit, configured to receive a target command issued by a user;
    a processing unit, configured to generate a target decision for the target command by using a historical command, a historical decision of the historical command, and the target command, wherein the historical command is a command of a historical human-computer interaction task and the target command is a command of a current human-computer interaction task; and
    a feedback unit, configured to output the target decision.
  12. An electronic device, comprising: at least one processor, a memory, and a voice transceiver, wherein the voice transceiver is configured to receive speech of a target command or feed back speech of a target decision, the memory is configured to store a computer program and instructions, and the processor is configured to invoke the computer program and instructions to execute, in cooperation with the voice transceiver, the human-computer interaction method according to any one of claims 1-10.
  13. A computer-readable storage medium, wherein the storage medium stores a computer program or instructions which, when executed by a human-computer interaction apparatus, implement the method according to any one of claims 1-10.
PCT/CN2021/114853 2020-08-28 2021-08-26 Human-computer interaction method and device WO2022042664A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010886462.3 2020-08-28
CN202010886462.3A CN112183105A (en) 2020-08-28 2020-08-28 Man-machine interaction method and device

Publications (1)

Publication Number Publication Date
WO2022042664A1 true WO2022042664A1 (en) 2022-03-03

Family

ID=73924596

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/114853 WO2022042664A1 (en) 2020-08-28 2021-08-26 Human-computer interaction method and device

Country Status (2)

Country Link
CN (1) CN112183105A (en)
WO (1) WO2022042664A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183105A (en) * 2020-08-28 2021-01-05 华为技术有限公司 Man-machine interaction method and device
CN113345174B (en) * 2021-05-31 2023-04-18 中国工商银行股份有限公司 Interactive simulation method and device for teller cash recycling machine and terminal platform
CN117557674A (en) * 2024-01-11 2024-02-13 宁波特斯联信息科技有限公司 Picture processing method, device, equipment and storage medium based on man-machine interaction

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180300310A1 (en) * 2017-04-06 2018-10-18 AIBrain Corporation Adaptive, interactive, and cognitive reasoner of an autonomous robotic system
CN110413752A (en) * 2019-07-22 2019-11-05 中国科学院自动化研究所 More wheel speech understanding methods, system, device based on dialog logic
CN110704588A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN110781998A (en) * 2019-09-12 2020-02-11 腾讯科技(深圳)有限公司 Recommendation processing method and device based on artificial intelligence
CN110825857A (en) * 2019-09-24 2020-02-21 平安科技(深圳)有限公司 Multi-turn question and answer identification method and device, computer equipment and storage medium
CN112183105A (en) * 2020-08-28 2021-01-05 华为技术有限公司 Man-machine interaction method and device


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975654A (en) * 2023-08-22 2023-10-31 腾讯科技(深圳)有限公司 Object interaction method, device, electronic equipment, storage medium and program product
CN116975654B (en) * 2023-08-22 2024-01-05 腾讯科技(深圳)有限公司 Object interaction method and device, electronic equipment and storage medium
CN117649107A (en) * 2024-01-29 2024-03-05 上海朋熙半导体有限公司 Automatic decision node creation method, device, system and readable medium

Also Published As

Publication number Publication date
CN112183105A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
WO2022042664A1 (en) Human-computer interaction method and device
Zhang et al. Hello edge: Keyword spotting on microcontrollers
US20240038218A1 (en) Speech model personalization via ambient context harvesting
US10956771B2 (en) Image recognition method, terminal, and storage medium
WO2020232860A1 (en) Speech synthesis method and apparatus, and computer readable storage medium
US20180144749A1 (en) Speech recognition apparatus and method
WO2019052293A1 (en) Machine translation method and apparatus, computer device and storage medium
US20180052831A1 (en) Language translation device and language translation method
US20220172737A1 (en) Speech signal processing method and speech separation method
US20240105159A1 (en) Speech processing method and related device
US20200312306A1 (en) System and Method for End-to-End Speech Recognition with Triggered Attention
US20200234713A1 (en) Method and device for speech recognition
JP7324838B2 (en) Encoding method and its device, apparatus and computer program
JP7224447B2 (en) Encoding method, apparatus, equipment and program
US11314951B2 (en) Electronic device for performing translation by sharing context of utterance and operation method therefor
CN112885328A (en) Text data processing method and device
US11532310B2 (en) System and method for recognizing user's speech
WO2023207541A1 (en) Speech processing method and related device
WO2023231676A9 (en) Instruction recognition method and device, training method, and computer readable storage medium
KR20210028041A (en) Electronic device and Method for controlling the electronic device thereof
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN110874402A (en) Reply generation method, device and computer readable medium based on personalized information
WO2024046473A1 (en) Data processing method and apparatus
US20230386448A1 (en) Method of training speech recognition model, electronic device and storage medium
US20230154172A1 (en) Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21860498

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21860498

Country of ref document: EP

Kind code of ref document: A1