CN114627864A

CN114627864A - Display device and voice interaction method

Info

Publication number: CN114627864A
Application number: CN202011433067.6A
Authority: CN
Inventors: 岳文浩; 杨善松
Original assignee: Hisense Visual Technology Co Ltd
Current assignee: Hisense Visual Technology Co Ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2022-06-14

Abstract

Embodiments of the present application provide a display device and a voice interaction method. When first voice data input by a user is received, candidate user intentions corresponding to the first voice data are determined. When the first voice data corresponds to a plurality of candidate user intentions, an inquiry statement is generated according to the candidate user intentions and fed back to the user, prompting the user to select one user intention from the candidates. Second voice data input by the user is then received, and a target user intention corresponding to the first voice data is determined among the candidate user intentions according to the second voice data. Finally, the associated information associated with the target user intention is output. The method and the device can effectively improve the accuracy with which user intentions are understood during voice interaction.

Description

Display device and voice interaction method

Technical Field

Embodiments of the present application relate to the technical field of voice interaction, and in particular to a display device and a voice interaction method.

Background

With the development of voice technology, more and more intelligent voice interaction devices have appeared, and voice interaction has become an important means of human-computer interaction. In recent years in particular, with the popularization of voice assistants, services on devices ranging from mobile terminals to smart household appliances can be obtained through voice interaction.

In the existing voice interactive system, the user intention is generally understood according to the sentence input by the user, and then the relevant service is provided for the user according to the user intention.

However, existing voice interaction systems have difficulty understanding intentions and making decisions accurately when a decision spans multiple services or the user intention is ambiguous. For example, when the search results for a sentence (query) input by a user include both a song and a video, the existing voice interaction system scores the two search results. If the song scores higher than the video, the song is preferentially fed back to the user; if the video scores higher than the song, the video is preferentially fed back. The final decision may therefore fail to match the user's real intention.

Disclosure of Invention

The embodiment of the application provides a display device and a voice interaction method, which can improve the accuracy of understanding of user intentions in the voice interaction process.

In some embodiments, the present application provides a display device, including:

a voice acquisition device for acquiring voice data;

an audio processor for processing the acquired voice data;

a display screen for displaying an image;

a controller configured to:

receiving first voice data input by a user, and determining candidate user intentions corresponding to the first voice data;

when the first voice data corresponds to a plurality of candidate user intentions, generating an inquiry statement according to the candidate user intentions, and sending the inquiry statement to the display screen for displaying, wherein the inquiry statement is used for prompting a user to select one user intention from the candidate user intentions;

receiving second voice data input by the user, and determining a target user intention corresponding to the first voice data in the candidate user intentions according to the second voice data;

and outputting the associated information associated with the target user intention.

In one possible design, when the first voice data corresponds to a single candidate user intention, the controller outputs the associated information associated with that candidate user intention.
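The controller flow described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the `Intent` class, the `build_inquiry` wording, the `ask_user` callback, the substring matching of the reply, and the fallback to the top-scoring candidate are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Intent:
    name: str     # illustrative label, e.g. "song" or "video"
    score: float  # confidence from a hypothetical intention recognizer


def build_inquiry(candidates: List[Intent]) -> str:
    """Build an inquiry statement that lists the candidate intentions."""
    options = " or ".join(c.name for c in candidates)
    return f"Did you mean {options}?"


def resolve_intent(candidates: List[Intent],
                   ask_user: Callable[[str], str]) -> Intent:
    """Return the target intention, asking the user only when ambiguous."""
    if not candidates:
        raise ValueError("no candidate intention recognized")
    if len(candidates) == 1:
        # Single candidate: no inquiry statement is needed.
        return candidates[0]
    # Several candidates: feed back an inquiry and read the second utterance.
    reply = ask_user(build_inquiry(candidates)).lower()
    for candidate in candidates:
        if candidate.name.lower() in reply:
            return candidate
    # Assumption: fall back to the top-scoring candidate on an unclear reply.
    return max(candidates, key=lambda c: c.score)
```

In a real device, `ask_user` would display or speak the inquiry statement and return the recognized text of the second voice data; here it is simply a callable so the sketch can be exercised without audio hardware.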

In one possible embodiment, the controller is configured to:

acquiring historical user intentions determined in a voice interaction process;

and determining candidate user intentions corresponding to the first voice data by utilizing the intention recognition model according to the first voice data and the historical user intentions.

In one possible embodiment, the controller is configured to:

determining whether the first speech data and the historical user intent belong to the same dialog sequence based on a dialog state tracking model;

when the first voice data and the historical user intention belong to the same dialogue sequence, determining an initial user intention corresponding to the first voice data by using the intention recognition model, and updating the initial user intention corresponding to the first voice data according to the historical user intention to obtain a candidate user intention corresponding to the first voice data;

when the first voice data and the historical user intention do not belong to the same dialog sequence, determining candidate user intentions corresponding to the first voice data by using the intention recognition model.
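The two branches above can be illustrated with a small sketch. It is an assumption-laden stand-in: intentions are represented as plain slot dictionaries, and `same_dialog_sequence` replaces the dialog state tracking model with a hypothetical rule (an utterance that names no domain of its own is treated as continuing the previous dialog).

```python
from typing import Dict, Optional

Intent = Dict[str, str]  # hypothetical slot dictionary, e.g. {"domain": "weather"}


def same_dialog_sequence(history: Optional[Intent], current: Intent) -> bool:
    """Stand-in for the dialog state tracking model (assumed rule)."""
    return history is not None and "domain" not in current


def candidate_intent(history: Optional[Intent], current: Intent) -> Intent:
    """Update the initial intention with history when the dialog continues."""
    if history is not None and same_dialog_sequence(history, current):
        merged = dict(history)   # start from the historical user intention
        merged.update(current)   # slots from the new utterance take priority
        return merged
    # Different dialog sequence: use the initial intention unchanged.
    return dict(current)
```

For example, after a weather query for today, a follow-up utterance that only supplies a new date inherits the domain and city from the historical intention, while an utterance that names a new domain is treated as a fresh request.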

In one possible embodiment, the controller is configured to:

determining scores of all output modules in a dialogue strategy learning model according to each candidate user intention corresponding to the first voice data; the dialogue strategy learning model comprises at least one of the following output modules: a rewriting module, a coreference resolution module, a vertical-domain intention analysis module, a task-oriented multi-turn response module, a question-answering module, a news search module, a chat module, a recommendation module, and a candidate intention analysis module;

and when the output module with the highest score in the dialogue strategy learning model is the candidate intention analysis module, generating the inquiry statement according to the score of each candidate user intention.
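The module selection described above might look like the following sketch. The module name strings, the score dictionaries, and the inquiry wording are hypothetical; the application does not specify how the dialogue strategy learning model computes its scores, so they are passed in as plain numbers.

```python
from typing import Dict, Optional

# Hypothetical identifier for the candidate intention analysis module.
CANDIDATE_INTENT_ANALYSIS = "candidate_intent_analysis"


def select_module(module_scores: Dict[str, float]) -> str:
    """Pick the output module with the highest score."""
    return max(module_scores, key=lambda name: module_scores[name])


def maybe_inquire(module_scores: Dict[str, float],
                  intent_scores: Dict[str, float]) -> Optional[str]:
    """Return an inquiry statement only when the candidate intention
    analysis module wins; otherwise return None."""
    if select_module(module_scores) != CANDIDATE_INTENT_ANALYSIS:
        return None
    ranked = sorted(intent_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_two = [name for name, _ in ranked[:2]]
    return "Did you mean " + " or ".join(top_two) + "?"
```

When another module (for example the chat module) scores highest, no inquiry is produced and that module would handle the response instead.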

In one possible embodiment, the controller is configured to:

when the score of a first candidate user intention corresponding to the first voice data is smaller than a first preset threshold, the score of a second candidate user intention corresponding to the first voice data is larger than a second preset threshold, and the difference between the score of the first candidate user intention and the score of the second candidate user intention is smaller than a preset interval threshold, generating the query sentence based on the first candidate user intention and the second candidate user intention; the first candidate user intention is a candidate user intention with the highest score corresponding to the first voice data, the second candidate user intention is a candidate user intention with the second highest score corresponding to the first voice data, and the first preset threshold is larger than the second preset threshold.
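As a worked illustration, the condition can be written as a small predicate. The sketch follows the wording of the paragraph above, in which the highest score falls below the first preset threshold; the function name and the numeric values in the usage note are assumptions.

```python
from typing import Sequence


def should_inquire(scores: Sequence[float],
                   first_threshold: float,
                   second_threshold: float,
                   interval_threshold: float) -> bool:
    """Decide whether to generate an inquiry from the two best candidates."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    if len(scores) < 2:
        return False  # a single candidate needs no inquiry
    top, second = sorted(scores, reverse=True)[:2]
    return (top < first_threshold             # best candidate not decisive
            and second > second_threshold     # runner-up still plausible
            and (top - second) < interval_threshold)  # the two are close
```

With illustrative thresholds of 0.8, 0.3, and 0.1, scores of 0.55 and 0.50 would trigger an inquiry, whereas scores of 0.90 and 0.05 would not.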

In some embodiments, an embodiment of the present application further provides a voice interaction method, where the method includes:

when the first voice data corresponds to a plurality of candidate user intentions, generating a query statement according to the candidate user intentions, and outputting the query statement to the user, wherein the query statement is used for prompting the user to select one user intention from the candidate user intentions;

outputting association information associated with the target user intent.

In one possible design, the determining the candidate user intention corresponding to the first speech data based on the intention recognition model includes:

acquiring historical user intentions determined in a voice interaction process;

In one possible design, the determining, by using the intent recognition model, a candidate user intent corresponding to the first speech data according to the first speech data and the historical user intent includes:

In one possible design, the generating a query statement according to the plurality of candidate user intents includes:

In one possible design, the generating the query statement according to the score of each candidate user intention includes:

when the score of a first candidate user intention corresponding to the first voice data is larger than a first preset threshold, the score of a second candidate user intention corresponding to the first voice data is larger than a second preset threshold, and the difference between the score of the first candidate user intention and the score of the second candidate user intention is smaller than a preset interval threshold, generating the query sentence based on the first candidate user intention and the second candidate user intention; the first candidate user intention is a candidate user intention with the highest score corresponding to the first voice data, the second candidate user intention is a candidate user intention with the second highest score corresponding to the first voice data, and the first preset threshold is larger than the second preset threshold.

According to the display device and the voice interaction method provided by the embodiments of the present application, when first voice data input by a user is received, candidate user intentions corresponding to the first voice data are determined. When the first voice data corresponds to a plurality of candidate user intentions, an inquiry statement is generated according to the candidate user intentions and output to the user, prompting the user to select one user intention from the candidates. Second voice data input by the user is then received, and a target user intention corresponding to the first voice data is determined among the candidate user intentions according to the second voice data. Response information associated with the target user intention is then output. In the present application, when a sentence input by the user corresponds to two or more candidate user intentions, the voice interaction system actively feeds back an inquiry statement to the user in an anthropomorphic interaction mode and then determines the user's real intention from the user's reply, so that the accuracy of understanding user intentions during voice interaction can be effectively improved.

Drawings

Fig. 1 is a schematic diagram illustrating an operation scenario between a display device and a control apparatus according to an embodiment;

FIG. 2 is a block diagram exemplarily showing a hardware configuration of a display device 200 according to an exemplary embodiment;

FIG. 3 is a block diagram schematically showing a configuration of a control device 1001 according to an exemplary embodiment;

FIG. 4 is a schematic diagram of a software system of a display device provided in the present application;

FIG. 5 is a schematic diagram of an application program that can be provided by the display device provided in the present application;

FIG. 6 is a schematic diagram of an application of a display device in a voice interaction scenario;

FIG. 7 is a schematic flow chart illustrating an application of a display device in a voice interaction scenario;

FIG. 8 is a schematic diagram of an application scenario exemplarily illustrated in an embodiment of the present application;

FIG. 9 is another flow chart illustrating the application of a display device to a voice interaction scenario;

FIG. 10 is a schematic diagram of a supplier of recognition models issuing recognition models;

FIG. 11 is a flowchart illustrating a process of the server 400 obtaining a recognition model;

FIG. 12 is a schematic flow chart illustrating the process of updating the recognition model by the server;

FIG. 13 is a first flowchart illustrating a voice interaction method provided in an embodiment of the present application;

FIG. 14 is a flowchart illustrating a voice interaction method according to an embodiment of the present application;

FIGS. 15a to 15d are schematic diagrams of voice interaction of a display device according to an embodiment of the invention;

FIGS. 16a to 16d are schematic diagrams illustrating another voice interaction of the display device according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also separately constitute a complete embodiment.

It should be noted that the brief descriptions of the terms in the present application are only for convenience of understanding of the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.

The terms "first," "second," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.

Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.

The term "module," as used herein, refers to any known or later-developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.

The term "remote control" as used in this application refers to a component of an electronic device (such as the display device disclosed in this application) that can typically control the electronic device wirelessly over a relatively short distance. The remote control typically connects to the electronic device using infrared and/or Radio Frequency (RF) signals and/or Bluetooth, and may also include functional modules such as WiFi, wireless USB, and motion sensors. For example, a hand-held touch remote controller replaces most of the physical built-in hard keys of a common remote control device with a user interface on a touch screen.

Fig. 1 is a schematic diagram illustrating an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through a mobile terminal 1002 and a control device 1001.

In some embodiments, the control device 1001 may be a remote controller. Communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication, and other short-distance communication methods, and the display device 200 is controlled in a wireless or wired manner. The user may input user commands through keys on the remote controller, voice input, control panel input, and the like to control the display device 200. For example, the user can input corresponding control commands through the volume up/down keys, channel control keys, up/down/left/right movement keys, voice input key, menu key, power on/off key, and the like on the remote controller to control the display device 200.

In some embodiments, mobile terminals, tablets, computers, laptops, and other smart devices may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device. The application, through configuration, may provide the user with various controls in an intuitive User Interface (UI) on a screen associated with the smart device.

In some embodiments, a software application may be installed on both the mobile terminal 1002 and the display device 200 so that the two communicate through a network communication protocol, enabling one-to-one control operation and data communication. For example, a control instruction protocol can be established between the mobile terminal 1002 and the display device 200, the remote control keyboard can be synchronized to the mobile terminal 1002, and the display device 200 can be controlled through the user interface on the mobile terminal 1002. The audio and video content displayed on the mobile terminal 1002 can also be transmitted to the display device 200 to realize synchronous display.

As also shown in fig. 1, the display device 200 also performs data communication with the server 400 through various communication means. The display device 200 may be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 400 may provide various content and interactions to the display device 200. Illustratively, by sending and receiving information and through Electronic Program Guide (EPG) interactions, the display device 200 receives software program updates or accesses a remotely stored digital media library. The server 400 may be one cluster or a plurality of clusters, and may include one or more types of servers. Other web service content, such as video-on-demand and advertising services, is provided through the server 400.

The display device 200 may be a liquid crystal display, an OLED display, or a projection display device. The particular display device type, size, resolution, and the like are not limited, and those skilled in the art will appreciate that the performance and configuration of the display device 200 may be modified as desired.

In addition to the broadcast receiving television function, the display device 200 may additionally provide a smart network television function with computer support, including but not limited to network TV, smart TV, Internet Protocol TV (IPTV), and the like.

A hardware configuration block diagram of a display device 200 according to an exemplary embodiment is exemplarily shown in fig. 2.

In some embodiments, at least one of the controller 250, the tuner demodulator 210, the communicator 220, the detector 230, the input/output interface 255, the display 275, the audio output interface 285, the memory 260, the power supply 290, the user interface 265, and the external device interface 240 is included in the display apparatus 200.

In some embodiments, a display screen 275 receives image signals originating from the first processor output and displays video content and images and components of the menu manipulation interface.

In some embodiments, the display screen 275 includes a display component for presenting a picture and a driving component for driving the display of an image.

In some embodiments, the video content is displayed from broadcast television content, or alternatively, from various broadcast signals that may be received via wired or wireless communication protocols. Alternatively, various image contents received from the network communication protocol and sent from the network server side can be displayed.

In some embodiments, the display screen 275 is used to present a user-manipulated UI interface generated in the display device 200 and used to control the display device 200.

In some embodiments, a drive assembly for driving the display is also included, depending on the type of display screen 275.

In some embodiments, the display screen 275 is a projection display screen and may also include a projection device and a projection screen.

In some embodiments, the communicator 220 is a component for communicating with external devices or external servers according to various communication protocol types. For example, the communicator may include at least one of a WiFi chip, a Bluetooth communication protocol chip, a wired Ethernet communication protocol chip, another network communication protocol chip or near field communication protocol chip, and an infrared receiver.

In some embodiments, the display device 200 may establish the transmission and reception of control signals and data signals with the external control device 1001 or a content providing device through the communicator 220.

In some embodiments, user interface 265 may be configured to receive infrared control signals from a control device (e.g., an infrared remote control, etc.).

In some embodiments, the detector 230 is a component used by the display device 200 to collect signals from the external environment or to interact with the outside.

In some embodiments, the detector 230 includes a light receiver, that is, a sensor for collecting the intensity of ambient light, so that display parameters can be adapted to the collected ambient light.

In some embodiments, the detector 230 may further include an image collector, such as a camera, etc., which may be configured to collect external environment scenes, collect attributes of the user or gestures interacted with the user, adaptively change display parameters, and recognize user gestures, so as to implement a function of interaction with the user.

In some embodiments, the detector 230 may also include a temperature sensor or the like for sensing the ambient temperature.

In some embodiments, the display device 200 may adaptively adjust the display color temperature of an image. For example, the display device 200 may be adjusted to display a cooler tone when the ambient temperature is high, or a warmer tone when the ambient temperature is low.
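The adaptation rule above can be expressed as a simple mapping from ambient temperature to display tone. This is only a sketch: the function name, the "neutral" middle band, and the 18 °C and 28 °C cut-offs are assumptions, since the text gives no numeric values.

```python
def display_tone(ambient_celsius: float,
                 warm_below: float = 18.0,
                 cool_above: float = 28.0) -> str:
    """Map the sensed ambient temperature to a display tone.

    The cut-off values are illustrative assumptions, not values from
    the application.
    """
    if ambient_celsius >= cool_above:
        return "cool"     # hot environment: shift toward a cooler tone
    if ambient_celsius <= warm_below:
        return "warm"     # cold environment: shift toward a warmer tone
    return "neutral"      # assumed default between the two cut-offs
```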

In some embodiments, the detector 230 may also include a sound collector or the like, such as a microphone, which may be used to receive the user's voice. Illustratively, the sound collector may receive a voice signal containing a control instruction by which the user controls the display device 200, or may collect ambient sound for recognizing the type of ambient scene, so that the display device 200 can adapt to the ambient noise.

In some embodiments, as shown in fig. 2, the input/output interface 255 is configured to allow data transfer between the controller 250 and other external devices or other controllers 250, for example to receive video signal data, audio signal data, or command instruction data from an external device.

In some embodiments, the external device interface 240 may include, but is not limited to, any one or more of the following: a High-Definition Multimedia Interface (HDMI), an analog or data high-definition component input interface, a composite video input interface, a USB input interface, an RGB port, and the like. The plurality of interfaces may form a composite input/output interface.

In some embodiments, as shown in fig. 2, the tuner demodulator 210 is configured to receive broadcast television signals in a wired or wireless manner, perform modulation and demodulation processing such as amplification, mixing, and resonance, and demodulate audio and video signals from a plurality of wireless or wired broadcast television signals. The audio and video signals may include the television audio and video signals carried on the television channel frequency selected by the user, as well as EPG data signals.

In some embodiments, the frequency demodulated by the tuner demodulator 210 is controlled by the controller 250. The controller 250 can send a control signal according to the user's selection, so that the tuner demodulator 210 responds to the television signal frequency selected by the user and modulates and demodulates the television signal carried on that frequency.

In some embodiments, the broadcast television signal may be classified into a terrestrial broadcast signal, a cable broadcast signal, a satellite broadcast signal, an internet broadcast signal, or the like according to a television signal broadcasting system. Or may be classified into a digital modulation signal, an analog modulation signal, and the like according to a modulation type. Or the signals are classified into digital signals, analog signals, and the like according to the type of the signals.

In some embodiments, the controller 250 and the tuner demodulator 210 may be located in separate devices; that is, the tuner demodulator 210 may also be located in a device external to the main device where the controller 250 is located, such as an external set-top box. In this case, the set-top box outputs the television audio and video signals obtained by modulating and demodulating the received broadcast television signals to the main device, and the main device receives the audio and video signals through the first input/output interface.

In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in memory. The controller 250 may control the overall operation of the display apparatus 200. For example: in response to receiving a user command for selecting a UI object displayed on the display 275, the controller 250 may perform an operation related to the object selected by the user command.

As shown in fig. 2, the controller 250 includes at least one of a Random Access Memory 251 (RAM), a Read-Only Memory 252 (ROM), a video processor 270, an audio processor 280, other processors 253 (e.g., a Graphics Processing Unit (GPU)), a Central Processing Unit 254 (CPU), a communication interface, and a communication bus 256 that connects the respective components.

In some embodiments, RAM 251 is used to store temporary data for the operating system or other programs that are running.

In some embodiments, ROM 252 is used to store instructions for various system boots.

In some embodiments, the ROM 252 is used to store a Basic Input Output System (BIOS). The system is used for completing power-on self-test of the system, initialization of each functional module in the system, a driver of basic input/output of the system and booting an operating system.

In some embodiments, when the power-on signal is received, the display device 200 starts to power up, the CPU executes the system boot instruction in the ROM 252, and copies the temporary data of the operating system stored in the memory to the RAM 251 so as to start or run the operating system. After the start of the operating system is completed, the CPU copies the temporary data of the various application programs in the memory to the RAM 251, and then, the various application programs are started or run.

In some embodiments, the CPU processor 254 is used to execute operating system and application program instructions stored in memory, and to execute various application programs, data, and content according to interactive instructions received from outside, so as to finally display and play various audio and video content.

In some example embodiments, the CPU processor 254 may comprise a plurality of processors, including a main processor and one or more sub-processors. The main processor performs some operations of the display device 200 in a pre-power-up mode and/or displays the screen in a normal mode. The one or more sub-processors perform operations in a standby mode or the like.

In some embodiments, the graphics processor 253 is used to generate various graphics objects, such as icons, operation menus, and graphics displayed for user input instructions. The graphics processor includes an arithmetic unit, which performs operations on the interactive instructions input by the user and displays objects according to their display attributes, and a renderer, which renders the objects obtained by the arithmetic unit for display on the display screen.

In some embodiments, the video processor 270 is configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image synthesis according to the standard codec protocol of the input signal, so as to obtain a signal that can be displayed or played directly on the display device 200.

In some embodiments, video processor 270 includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like.

The demultiplexing module is used for demultiplexing the input audio and video data stream. For example, if an MPEG-2 stream is input, the demultiplexing module demultiplexes it into a video signal and an audio signal.

The video decoding module is used for processing the demultiplexed video signal, including decoding, scaling, and the like.

The image synthesis module superimposes and mixes the GUI signal, which the graphics generator produces in response to user input or generates by itself, with the scaled video image to generate an image signal for display.

The frame rate conversion module is configured to convert the frame rate of the input video, for example from 60 Hz to 120 Hz or 240 Hz, and is commonly implemented by frame interpolation.

The display formatting module is used for converting the video output signal received after frame rate conversion into a signal that conforms to the display format, for example by outputting an RGB data signal.
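The frame interpolation mentioned for the frame rate conversion module can be illustrated with a deliberately simple sketch. It uses linear blending of neighbouring frames, which is an assumption chosen for brevity; practical converters typically rely on motion-compensated interpolation. Frames are modelled as flat lists of pixel intensities, and the function name is hypothetical.

```python
from typing import List

Frame = List[float]  # one frame as a flat list of pixel intensities


def double_frame_rate(frames: List[Frame]) -> List[Frame]:
    """Insert one linearly blended frame between each consecutive pair,
    roughly doubling the frame rate (e.g. 60 Hz toward 120 Hz)."""
    if not frames:
        return []
    output: List[Frame] = []
    for current, following in zip(frames, frames[1:]):
        output.append(current)
        # Blended frame: per-pixel average of the two neighbours.
        output.append([(a + b) / 2 for a, b in zip(current, following)])
    output.append(frames[-1])
    return output
```

For N input frames this yields 2N - 1 output frames; a converter that needs an exact doubling could, for instance, repeat the final frame.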

In some embodiments, the graphics processor 253 and the video processor may be integrated or configured separately. When integrated, they jointly process the graphics signals output to the display screen; when separate, they perform different functions, for example in a GPU + FRC (Frame Rate Conversion) architecture.

In some embodiments, the audio processor 280 is configured to receive an external audio signal, decompress and decode the received audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processes to obtain an audio signal that can be played in a speaker.

In some embodiments, video processor 270 may comprise one or more chips. The audio processor may also comprise one or more chips.

In some embodiments, the video processor 270 and the audio processor 280 may be separate chips or may be integrated together with the controller in one or more chips.

The power supply 290 supplies power to the display apparatus 200 from the power input from the external power source under the control of the controller 250. The power supply 290 may include a built-in power supply circuit installed inside the display apparatus 200, or may be a power supply interface installed outside the display apparatus 200 to provide an external power supply in the display apparatus 200.

The user interface 265 receives an input signal from the user and transmits it to the controller 250. The user input signal may be a remote controller signal received through an infrared receiver, or any of various user control signals received through the network communication module.

In some embodiments, the user inputs a user command through the control device or the mobile terminal, the user input interface passes the input to the controller 250, and the display apparatus 200 responds to the user input through the controller 250.

In some embodiments, a user may enter user commands on a Graphical User Interface (GUI) displayed on the display 275, and the user input interface receives the user input commands through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.

In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.

The memory 260 stores various software modules for driving the display device 200, for example the software modules stored in the first memory, which include at least one of a basic module, a detection module, a communication module, a display control module, a browser module, and various service modules.

The basic module is a bottom-layer software module for signal communication between the hardware components of the display device 200 and for sending processing and control signals to upper-layer modules. The detection module is a management module for collecting information from various sensors or user input interfaces and for performing digital-to-analog conversion and analysis management.

For example, the voice recognition module comprises a voice analysis module and a voice instruction database module. The display control module controls the display to present image content and can be used to play multimedia image content, UI interfaces and other information. The communication module carries out control and data communication with external devices. The browser module performs data communication with browsing servers. The service module provides various services and includes various application programs. In addition, the memory 260 may store received external data and user data, images of various items in various user interfaces, visual effect maps of focus objects, and the like.

Fig. 3 exemplarily shows a block diagram of a configuration of the control device 1001 according to an exemplary embodiment. As shown in fig. 3, the control device 1001 includes a controller 110, a communication interface 130, a user input/output interface, a memory, and a power supply source.

The control device 1001 is configured to control the display device 200 and can receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200. Such as: the user operates the channel up/down key on the control device 1001, and the display device 200 responds to the channel up/down operation.

In some embodiments, the control device 1001 may be a smart device. Such as: the control apparatus 1001 may install various applications that control the display apparatus 200 according to user demands.

In some embodiments, as shown in fig. 1, a mobile terminal 1002 or other intelligent electronic device may function similarly to the control device 1001 after an application that manipulates the display device 200 is installed. For example, by installing an application, the user may use the function keys or virtual buttons of a graphical user interface provided on the mobile terminal 1002 or other intelligent electronic device to implement the functions of the physical keys of the control device 1001.

The controller 110 includes a processor 112 and RAM 113 and ROM 114, a communication interface 130, and a communication bus. The controller is used to control the operation of the control device 1001, as well as the communications between the internal components and the external and internal data processing functions.

The communication interface 130 enables communication of control signals and data signals with the display apparatus 200 under the control of the controller 110. Such as: the received user input signal is transmitted to the display apparatus 200. The communication interface 130 may include at least one of a WiFi chip 131, a bluetooth module 132, an NFC module 133, and other near field communication modules.

A user input/output interface 140, wherein the input interface includes at least one of a microphone 141, a touch pad 142, a sensor 143, keys 144, and other input interfaces. Such as: the user can realize a user instruction input function through actions such as voice, touch, gesture, pressing, and the like, and the input interface converts the received analog signal into a digital signal and converts the digital signal into a corresponding instruction signal, and sends the instruction signal to the display device 200.

The output interface includes an interface that transmits the received user instruction to the display apparatus 200. In some embodiments, the interface may be an infrared interface or a radio frequency interface. For example, when the infrared signal interface is used, the user input instruction needs to be converted into an infrared control signal according to an infrared control protocol and sent to the display device 200 through the infrared sending module. As another example, when the radio frequency signal interface is used, the user input instruction needs to be converted into a digital signal, modulated according to the radio frequency control signal modulation protocol, and then transmitted to the display device 200 through the radio frequency transmitting terminal.

In some embodiments, the control device 1001 includes at least one of the communication interface 130 and the input-output interface 140.

A memory 190 for storing various operation programs, data and applications for driving and controlling the control device 1001 under the control of the controller. The memory 190 may store various control signal commands input by a user.

The power supply 180 provides operating power to the components of the control device 1001 under the control of the controller. It may include a battery and associated control circuitry.

In some embodiments, the system may include a kernel, a command parser (shell), a file system, and application programs. The kernel, shell, and file system together make up the basic operating system structure that allows users to manage files, run programs, and use the system. After power-on, the kernel starts, activates kernel space, abstracts the hardware, initializes hardware parameters, and runs and maintains virtual memory, the scheduler, signals and inter-process communication (IPC). After the kernel has started, the shell and user application programs are loaded. An application program is compiled into machine code after being started, forming a process.

Fig. 4 is a schematic diagram of a software system of a display device provided in the present Application, and referring to fig. 4, in some embodiments, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) and system library layer (referred to as a "system runtime library layer"), and a kernel layer.

In some embodiments, at least one application program runs in the application program layer, and the application programs can be Window (Window) programs carried by an operating system, system setting programs, clock programs, camera applications and the like; or may be an application developed by a third party developer such as a hi program, a karaoke program, a magic mirror program, or the like. In specific implementation, the application packages in the application layer are not limited to the above examples, and may actually include other application packages, which is not limited in this embodiment of the present application.

The framework layer provides an Application Programming Interface (API) and a programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions and acts as a processing center that decides what actions the applications in the application layer take. Through the API, an application program can access system resources and obtain system services during execution.

As shown in fig. 4, in the embodiment of the present application, the application framework layer includes a manager (Managers), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used for interacting with all activities running in the system; the Location Manager (Location Manager) is used for providing the system service or application with the access of the system Location service; a Package Manager (Package Manager) for retrieving various information related to an application Package currently installed on the device; a Notification Manager (Notification Manager) for controlling display and clearing of Notification messages; a Window Manager (Window Manager) is used to manage icons, windows, toolbars, wallpapers, and desktop components on a user interface.

In some embodiments, the activity manager is to: managing the life cycle of each application program and the general navigation backspacing function, such as controlling the exit of the application program (including switching the user interface currently displayed in the display window to the system desktop), opening, backing (including switching the user interface currently displayed in the display window to the previous user interface of the user interface currently displayed), and the like.

In some embodiments, the window manager is configured to manage all window processes, such as obtaining a display size, determining whether a status bar is available, locking a screen, intercepting a screen, controlling a display change (e.g., zooming out, dithering, distorting, etc.) and the like.

In some embodiments, the system runtime library layer provides support for the layer above it, namely the framework layer. When the framework layer is used, the Android operating system runs the C/C++ libraries included in the system runtime library layer to implement the functions required by the framework layer.

In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer includes at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, WiFi driver, USB driver, HDMI driver, sensor drivers (such as fingerprint sensor, temperature sensor, touch sensor, pressure sensor, etc.), and so on.

In some embodiments, the kernel layer further comprises a power driver module for power management.

In some embodiments, software programs and/or modules corresponding to the software architecture of fig. 4 are stored in the first memory or the second memory shown in fig. 2 or 3.

In some embodiments, taking the magic mirror application (photographing application) as an example, when the remote control receiving device receives a remote control input operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the input operation into an original input event (including information such as a value of the input operation, a timestamp of the input operation, etc.). The raw input events are stored at the kernel layer. The application program framework layer obtains an original input event from the kernel layer, identifies a control corresponding to the input event according to the current position of the focus and uses the input operation as a confirmation operation, the control corresponding to the confirmation operation is a control of a magic mirror application icon, the magic mirror application calls an interface of the application framework layer to start the magic mirror application, and then the kernel layer is called to start a camera driver, so that a static image or a video is captured through the camera.

In some embodiments, for a display device with a touch function, taking a split screen operation as an example, the display device receives an input operation (such as a split screen operation) that a user acts on a display screen, and the kernel layer may generate a corresponding input event according to the input operation and report the event to the application framework layer. The window mode (such as multi-window mode) corresponding to the input operation, the position and size of the window and the like are set by an activity manager of the application framework layer. And the window management of the application program framework layer draws a window according to the setting of the activity manager, then sends the drawn window data to the display driver of the kernel layer, and the display driver displays the corresponding application interface in different display areas of the display screen.

In some embodiments, fig. 5 is a schematic diagram of applications that can be provided by the display device provided in the present application, and as shown in fig. 5, the application layer includes at least one application program that can display a corresponding icon control in the display, such as: the system comprises a live television application icon control, a video-on-demand application icon control, a media center application icon control, an application center icon control, a game application icon control and the like.

In some embodiments, the live television application may provide live television via different signal sources. For example, a live television application may provide television signals using input from cable television, radio broadcasts, satellite services, or other types of live television services. And, the live television application may display video of the live television signal on display device 200.

In some embodiments, a video-on-demand application may provide video from different storage sources. Unlike a live television application, video on demand provides video from a storage source. For example, the video on demand may come from a cloud storage server or from local hard disk storage containing stored video programs.

In some embodiments, the media center application may provide various applications for multimedia content playback. For example, the media center differs from live television and video on demand in that the user may access various image or audio services through the media center application.

In some embodiments, an application center may provide storage for various applications. An application may be a game, a utility, or some other application associated with a computer system or other device that can run on the smart television. The application center may obtain these applications from different sources and store them in local storage, after which they can run on the display device 200.

More specifically, in some embodiments, any one of the display devices 200 described above may have a voice interaction function, so as to improve the intelligence degree of the display device 200 and improve the user experience of the display device 200.

In some embodiments, fig. 6 is a schematic diagram of an application of the display device in a voice interaction scenario. The user 1 may speak, by voice, an instruction that the user wants the display device 200 to execute; the display device 200 collects voice data in real time, recognizes the instruction of the user 1 included in the voice data, and directly executes the instruction once it is recognized. Throughout this process, the user 1 does not physically operate the display device 200 or any other device, but simply speaks the instruction.

In some embodiments, when the display device 200 shown in fig. 2 is applied in the scenario shown in fig. 6, the display device 200 may collect voice data in real time through its sound collector 231, and then the sound collector 231 transmits the collected voice data to the controller 250, and finally the controller 250 recognizes instructions included in the voice data.

In some embodiments, fig. 7 is a flowchart illustrating an application of the display device in a voice interaction scenario, which may be executed by the device in the scenario illustrated in fig. 6, specifically, in S11, the sound collector 231 in the display device 200 collects voice data in the surrounding environment of the display device 200 in real time, and sends the collected voice data to the controller 250 for recognition.

In some embodiments, in S12 shown in fig. 7, the controller 250 recognizes the instruction included in the voice data after receiving the voice data. For example, if the voice data includes an instruction of "increase brightness" given by the user 1, the controller 250, after recognizing the instruction, may execute it and control the display 275 to increase the brightness. It is to be understood that in this case the controller 250 runs recognition on every piece of received voice data, and some of that voice data may contain no instruction.

In other embodiments, because the instruction recognition model is large and computationally inefficient, the user 1 may further be required to add a keyword, such as "ABCD", before speaking the instruction, so that the user says "ABCD, increase brightness". In S12 shown in fig. 7, after receiving the voice data, the controller 250 then first identifies whether the keyword "ABCD" is present in each piece of voice data, and only after identifying the keyword uses the instruction recognition model to identify the specific instruction corresponding to "increase brightness" in the voice data.

In some embodiments, controller 250, upon receiving the voice data, may also denoise the voice data, including removing echo and ambient noise, process the voice data as clean voice data, and recognize the processed voice data.

In some embodiments, fig. 8 is a schematic diagram of another application of the display device in a voice interaction scenario, in which the display device 200 may be connected to the server 400 through the internet. After the display device 200 collects voice data, it may send the voice data to the server 400 through the internet; the server 400 recognizes the instruction included in the voice data and sends the recognized instruction back to the display device 200, so that the display device 200 can directly execute the received instruction. Compared with the scenario shown in fig. 6, this scenario reduces the computing power required of the display device 200 and allows a larger recognition model to be deployed on the server 400, further improving the accuracy of instruction recognition in the voice data.

In some embodiments, when the display device 200 shown in fig. 2 is applied in the scenario shown in fig. 8, the display device 200 may collect voice data in real time through its sound collector 231. The sound collector 231 then transmits the collected voice data to the controller 250, and the controller 250 transmits the voice data to the server 400 through the communicator 220. After the server 400 recognizes the instruction included in the voice data, the display device 200 receives the instruction sent by the server 400 through the communicator 220, and finally the controller 250 executes the received instruction.

In some embodiments, fig. 9 is another flowchart illustrating the application of the display device in the voice interaction scenario, which may be executed by the device in the scenario illustrated in fig. 8, in S21, the sound collector 231 in the display device 200 collects voice data in the surrounding environment of the display device 200 in real time and sends the collected voice data to the controller 250, the controller 250 sends the voice data to the server 400 through the communicator 220 in S22, the server identifies an instruction included in the voice data in S23, then the server 400 sends the identified instruction back to the display device 200 in S24, correspondingly, the display device 200 sends the instruction to the controller 250 after receiving the instruction through the communicator 220, and finally the controller 250 may directly execute the received instruction.

In some embodiments, in S23 shown in fig. 9, the server 400 identifies the instruction included in the voice data after receiving the voice data. For example, the voice data includes an instruction of "increase brightness" given by the user 1. Because the instruction recognition model is large, and because the server 400 would otherwise run recognition on every piece of received voice data, including voice data that contains no instruction, the user 1 may be required in a specific implementation to add a keyword, for example "ABCD", before speaking the instruction, so that the user says "ABCD, increase brightness". This reduces invalid recognition by the server 400 and the amount of data exchanged between the display device 200 and the server 400. In S22, the controller 250 of the display device 200 first uses a keyword recognition model, which is small and computationally cheap, to identify whether the keyword "ABCD" exists in the voice data. If the keyword is not recognized in the voice data currently being processed, the controller 250 does not send the voice data to the server 400. If the keyword is recognized, the controller 250 sends all of the voice data, or the part of the voice data after the keyword, to the server 400, and the server 400 recognizes the received voice data. Since the voice data received by the controller 250 at this point includes the keyword, the voice data sent to the server 400 for recognition is more likely to include an instruction of the user, so that invalid recognition calculation by the server 400 and invalid communication between the display device 200 and the server 400 can both be reduced.
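The device-side gating described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: it operates on an already transcribed text string rather than on audio, a plain substring search stands in for the keyword recognition model, and the function name `payload_for_server` and its `send_all` parameter are hypothetical.

```python
from typing import Optional

KEYWORD = "ABCD"  # example keyword taken from the text


def payload_for_server(transcript: str, send_all: bool = False) -> Optional[str]:
    """Return what the device would upload, or None when nothing should be sent."""
    index = transcript.find(KEYWORD)
    if index < 0:
        # Keyword absent: no upload, so the server performs no recognition.
        return None
    if send_all:
        # Option 1: forward the complete voice data.
        return transcript
    # Option 2: forward only the part after the keyword.
    return transcript[index + len(KEYWORD):].lstrip(" ,")
```

For instance, `payload_for_server("ABCD, increase brightness")` yields only the instruction portion, while an utterance without the keyword yields `None` and is never transmitted.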

In some embodiments, in order for the display device 200 to recognize instructions in the voice data in the scenario shown in fig. 6, or to recognize keywords in the voice data in the scenario shown in fig. 6 or fig. 8, the provider of the voice interaction function of the display device 200 also needs to build a machine learning model that can be used for recognizing instructions or keywords, such as a deep learning model like TextCNN or Transformer, and store these models in the display device 200 for use during recognition.

In some embodiments, fig. 10 is a schematic diagram of a provider of the recognition model issuing the recognition model. After obtaining the recognition model (which may be an instruction recognition model or a keyword recognition model), the server 400 set up by the provider may send the recognition model to each display device 200. The process shown in fig. 10 may take place when the display devices 200 are produced, with the server 400 sending the recognition model to each display device 200; alternatively, the server 400 may send the recognition model to a display device 200 through the internet after the display device 200 has been put into use.

In some embodiments, the server 400 may obtain the recognition model by collecting voice data and training a machine learning model. For example, fig. 11 is a schematic flow chart of the server 400 obtaining the recognition model. In S31, each display device (for example, N display devices, display device 1 to display device N) collects voice data 1 to N, and in S32 sends the collected voice data 1 to N to the server 400. Subsequently, in S33, staff of the provider may manually label each piece of voice data with the instruction or keyword it includes, and feed the voice data together with its labeling information to the machine learning model as training data, which the server learns from. When the learned recognition model is used subsequently and a piece of voice data to be recognized is input, the recognition model compares it with the learned voice data and outputs the probability of each piece of labeling information; the labeling information with the maximum probability may then be used as the recognition result of the voice data to be recognized. In S34, the server 400 may send the calculated recognition model to each display device.
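The final selection step, in which the labeling information with the maximum probability becomes the recognition result, might look like the sketch below. The raw per-label scores and the use of a softmax to turn them into probabilities are assumptions for illustration; the application only states that the model outputs a probability for each piece of labeling information.

```python
import math
from typing import Dict, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-label scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def pick_label(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the label with the highest probability, together with that probability."""
    probabilities = softmax(scores)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]
```

A deployment could additionally treat a low maximum probability as "no instruction", although the application does not describe such a threshold.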

In some embodiments, instead of calculating the recognition model from voice data actually collected by the display devices 1 to N as in the embodiment shown in fig. 11, staff may directly input different voice data and the labeling information of each piece of voice data into the server 400, and the server 400 sends the calculated recognition model to each display device.

In some embodiments, the collection and sending of voice data to the server by the display devices 1 to N shown in fig. 11, and the sending of the calculated recognition model by the server to the display devices 1 to N, are two separate processes. That is, the server receives the voice data collected by N display devices in S32, and sends the trained recognition model to another N display devices in S34. The N display devices in the two processes may be the same, different, or partially the same.

In some embodiments, since the number of samples used to obtain the recognition model is limited, the recognition model deployed in the display device 200 cannot recognize with one hundred percent accuracy. The provider may therefore use the server 400 to collect, at any time, the voice data gathered during actual use of each display device 200, and update the existing recognition model according to the collected voice data, so as to further improve its recognition accuracy.

For example, fig. 12 is a schematic flow chart of the server updating the recognition model. It can be understood that, before the embodiment shown in fig. 12 is executed, the recognition model has been deployed in each display device in the manner shown in fig. 10. Then, in S31 shown in fig. 12, each display device (for example, N display devices, display device 1 to display device N) collects voice data 1 to N, and in S32 sends the collected voice data 1 to N to the server 400. Subsequently, in S33, staff of the provider may manually label each piece of voice data with the instruction or keyword it includes and feed the voice data together with its labeling information to the machine learning model, and the server updates the previously calculated recognition model according to the newly received voice data. In S34, the server 400 may send the updated recognition model to each display device 200, so that each display device 200 can update itself with it. For any one of the N display devices, since the updated model has learned from voice data collected by that display device 200, the accuracy with which the display device 200 subsequently recognizes collected voice data can be effectively improved.

In some embodiments, each display device shown in fig. 12 may send voice data to the server as soon as it is received; or send the voice data collected during a fixed time period to the server after the time period ends; or send the collected voice data to the server in one batch after a certain amount of voice data has been collected; or send the received voice data to the server according to an instruction from a user of the display device or from staff operating the server.
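One of these strategies, sending in one batch after a certain amount of voice data has been collected, can be sketched as below. The class name `VoiceUploadBuffer`, the `upload` callback and the byte-string samples are hypothetical names chosen for this example; the explicit `flush` method shows how the same buffer could also serve the time-period or on-instruction strategies.

```python
from typing import Callable, List


class VoiceUploadBuffer:
    """Collects voice samples and uploads them in one batch once enough accumulate."""

    def __init__(self, batch_size: int, upload: Callable[[List[bytes]], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.upload = upload  # assumption: a callable that sends a batch to the server
        self.pending: List[bytes] = []

    def add(self, sample: bytes) -> None:
        """Store one sample and upload automatically when the batch is full."""
        self.pending.append(sample)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Upload whatever is pending, e.g. when a timer fires or on a user instruction."""
        if self.pending:
            self.upload(self.pending)
            self.pending = []
```

With `batch_size=1` the buffer degenerates to the first strategy, in which each piece of voice data is sent as soon as it is received.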

In some embodiments, the N display devices shown in fig. 12 may simultaneously send the voice data to the server at the same appointed time, and the server updates the recognition model according to the received N voice data; or, the N display devices may also send the voice data to the server, and the server may start to update the recognition model according to the received voice data after the number of the received voice data is greater than N.

When a voice interaction system is built, four layers of processing capability are needed in addition to a voice recognition module and a text-to-speech module. The first layer is a basic feature processing layer, which mainly covers word segmentation, semantic labeling, emotion recognition and the like. The second layer is an intention understanding layer, which converts the user question into a machine-understandable structured representation of the user intention; at this stage several user intentions may still be possible. The third layer is a dialogue management layer, which further clarifies the user intention based on context information and updates the dialogue state, so that a decision is made according to the dialogue strategy; this process calls the data services (such as obtaining media information) and reply schemes provided by the fourth layer, the business service layer. For example, for a search on the same title, if the music search results are clearly better than the movie search results, the voice interaction system feeds back the music search results to the user after obtaining the feedback information of the business service layer.

In some embodiments, after receiving voice input by the user, the display device generally performs voice recognition on it, determines the user intention through a semantic understanding engine, and then provides a relevant service according to the user intention. Current semantic understanding engines use limited-domain multi-round modeling and reference resolution modeling to carry out multi-round interaction and ellipsis completion in specific scenarios, and use a central control decision module based on a machine learning model to weigh the results of the different semantic processing modules and make a ranking decision that locates the user intention. However, in more general domains, factors such as the conversation history and the scene the user is in are not considered, so accurate intention understanding and locating decisions cannot be made when a decision spans several services or when the intention of the user's utterance is ambiguous.

In order to solve the above technical problem, an embodiment of the present application provides a display device. When a sentence input by the user corresponds to two or more candidate user intentions, the display device actively feeds back an inquiry sentence to the user in an anthropomorphic interaction manner and then determines the user's real intention according to the response sentence input by the user, which effectively improves the accuracy of understanding the user intention during voice interaction. The following examples are given for illustrative purposes.

Based on the display device 200 described in the above embodiment, in a possible implementation, the voice collecting device of the display device 200 may be a microphone array corresponding to the display device 200, or the voice collecting device may also be a microphone on a control device corresponding to the display device 200.

In this embodiment, after the voice data of the user is collected by the voice collecting device, the collected voice data is sent to the audio processor in the display device, and the audio processor preprocesses the voice data and then sends the preprocessed voice data to the controller in the display device.

The controller determines a candidate user intention corresponding to first voice data based on an intention recognition model after receiving the first voice data input by a user.

In some embodiments, the intention recognition model may assign a sentence or query to a corresponding intention category by classification. Different user intentions may correspond to different domain dictionaries, such as dictionaries of book names, song names, trade names, etc. The determination may be made based on the degree of matching or coincidence between the query and each dictionary: the query is assigned to the domain whose dictionary it coincides with to a high degree.

For example, when the query inputted by the user is "sunset song", the intention of the query is a music intention, and when the query inputted by the user is "news simulcast", the intention of the query is a news search intention.
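The dictionary-based determination described above can be illustrated with a minimal sketch. The dictionary contents, the token-overlap measure used as the "degree of coincidence", and the 0.6 threshold are all assumptions made for illustration; the embodiment does not specify them.

```python
from typing import Dict, List, Set

# Hypothetical domain dictionaries; entries are illustrative only.
DOMAIN_DICTIONARIES: Dict[str, Set[str]] = {
    "music": {"sunset song", "spring flower bloom"},
    "tv_drama": {"spring flower bloom"},
    "news": {"news simulcast"},
    "movie": {"biochemical crisis"},
    "game": {"biochemical crisis"},
}


def coincidence(query: str, entry: str) -> float:
    """Token-level Jaccard overlap, used here as an assumed coincidence measure."""
    q, e = set(query.lower().split()), set(entry.lower().split())
    if not q or not e:
        return 0.0
    return len(q & e) / len(q | e)


def candidate_intents(query: str, threshold: float = 0.6) -> List[str]:
    """Return every domain whose dictionary coincides with the query closely enough."""
    scored: Dict[str, float] = {}
    for domain, entries in DOMAIN_DICTIONARIES.items():
        best = max((coincidence(query, entry) for entry in entries), default=0.0)
        if best >= threshold:
            scored[domain] = best
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scored, key=lambda d: (-scored[d], d))
```

With these assumed dictionaries, a query such as "sunset song" yields the single candidate "music", while "biochemical crisis" yields two candidates, which is the situation in which an inquiry sentence would be generated.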

In some embodiments, the query input by the user may correspond to a plurality of user intentions. For example, when the user queries "biochemical crisis", there are both related games and related movies. In order to locate the user intention more accurately, in a possible embodiment of the present application, when the query input by the user corresponds to a plurality of user intentions, an inquiry sentence is generated according to the plurality of candidate user intentions that have been determined, for example, "Do you want to watch a movie or play a game?". The inquiry sentence is then sent to the display screen for display and/or to the loudspeaker for playback, and the voice data of the user continues to be collected.

Illustratively, when the subsequently collected voice data of the user is "watch a movie", a movie related to "biochemical crisis" is searched for and pushed to the display screen for display or playback; when the subsequently collected voice data of the user is "play a game", a game related to "biochemical crisis" is searched for and pushed to the display screen for display, or the game is started directly.

The display device provided by the embodiment of the application actively interacts with the user by voice in a humanized manner when facing a cross-service decision or an ambiguous user intention, and then determines the user's real intention from the conversation. This reduces the number of search clicks, shortens the search time, and greatly improves the accuracy of understanding the user intention during voice interaction.

Based on the description in the foregoing embodiments, in some embodiments of the present application, after receiving the first voice data, the controller of the display device may further obtain a historical user intention determined in the voice interaction process, and then determine the candidate user intention corresponding to the first voice data by using the intention recognition model according to the first voice data and the historical user intention. For example, for the same query, if the user has been searching for news-type media assets during a certain period of time, the query should be understood as a news search service rather than an encyclopedia service.

The intention recognition model can recognize the user's general reply expressions made in response to a central control action, such as "first", "movie", "I do not want to see the" and the like. The overall processing steps of the model are as follows:

First, query normalization: the candidate user intention set is traversed, and the intention names appearing in the query are replaced. The intention names here support a certain degree of generalization; for example, the generalized expressions of music search include "listen to song", "song", etc. This part of the data adds some more concise and free expressions on top of the intention generalizations used in language generation.

Second, intention analysis: a predefined set of general intention classification rules in a database is traversed. Each rule comprises the rule content, a rule matching mode, an intention decision result and a priority, and the rules are matched in order of priority. If a rule matches, its intention decision result is added to the final analysis result and the matched part of the query is replaced with a placeholder; the subsequent rules are then traversed until all rules have been traversed.

Third, result checking: whether the intention recognition result can be output is judged in combination with the completeness of the analysis. If it cannot be output, the result object is set to null.
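The three processing steps above can be sketched as follows. This is only an illustration under stated assumptions: the alias table, the rule patterns, the convention that a smaller priority value is matched first, and the completeness check (at most a few unparsed words may remain) are all hypothetical and are not taken from the embodiment.

```python
import re
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    pattern: str   # rule content, written here as a regular expression
    result: str    # intention decision result added when the rule matches
    priority: int  # smaller values are tried first (an assumption)


# Hypothetical generalized expressions for each intention name.
ALIASES = {
    "music": ["listen to music", "listen to a song", "music", "song"],
    "movie": ["watch a movie", "movie", "film"],
}

# Hypothetical general reply rules, independent of the candidate set.
GENERAL_RULES = [
    Rule(r"\bi do not want to see (this|that|it)\b", "reject", 0),
    Rule(r"\b(the )?first( one)?\b", "select:0", 1),
    Rule(r"\b(the )?second( one)?\b", "select:1", 1),
]


def normalize(query: str, candidates: List[str]) -> str:
    """Step 1: replace intention names and their aliases with upper-case tokens."""
    text = query.lower()
    for name in candidates:
        aliases = sorted(ALIASES.get(name, [name]), key=len, reverse=True)
        for alias in aliases:
            text = re.sub(rf"\b{re.escape(alias)}\b", f"@{name.upper()}", text)
    return text


def parse_reply(query: str, candidates: List[str],
                max_leftover_words: int = 3) -> Optional[List[str]]:
    text = normalize(query, candidates)
    rules = GENERAL_RULES + [
        Rule(rf"@{re.escape(name.upper())}\b", f"choose:{name}", 2)
        for name in candidates
    ]
    results: List[str] = []
    # Step 2: match rules in priority order, masking each matched part.
    for rule in sorted(rules, key=lambda r: r.priority):
        match = re.search(rule.pattern, text)
        if match:
            results.append(rule.result)
            text = text[:match.start()] + "#" + text[match.end():]
    # Step 3: completeness check; too much unparsed text means no output.
    leftover = re.findall(r"\w+", text)
    if not results or len(leftover) > max_leftover_words:
        return None
    return results
```

In this sketch, a short reply such as "movie" or "the first one" is resolved to a decision result, while a long utterance that merely mentions an intention name fails the completeness check and yields a null result.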

For a better understanding of the present application, the present embodiment assumes that the query entered by the user is "spring flower bloom", where there are both related dramas and related songs. When the historical user intention determined in the acquired voice interaction process is music, the user can be considered to be listening to music at present, and therefore the candidate user intention corresponding to the query is determined to be music; when the historical user intention determined in the acquired voice interaction process is 'drama', the user can be considered to be watching the drama at present, and therefore the candidate user intention corresponding to the query is determined to be 'drama'.

In some possible implementations, the historical user intent may be a user intent determined during a previous round of voice interaction.

The display device provided by the embodiment of the application locates the user intention of the currently received voice data by combining the historical user intention determined in the voice interaction process, and is helpful for quickly locating the real intention of the user.

In some embodiments of the present application, after receiving the first voice data and acquiring a historical user intention in a voice interaction process, the controller of the display device determines whether the first voice data and the historical user intention belong to the same dialog sequence based on a dialog state tracking model.

The dialog state tracking model comprises a session segmentation module and a dialog state updating module. The session segmentation module is used for judging whether a new session should be opened or not in response to the current user, and can perform segmentation in a mode based on time intervals and text semantics. The time interval-based segmentation is based on the assumption that the intention of a user does not change in a short time, and different segmentation standards are formulated according to different previous rounds of business through analysis of user logs. The segmentation based on text semantics focuses on processing the situations of speech recognition error, insufficient semantic analysis capability, strong correlation among services and the like, and the segmentation is carried out by calculating text similarity and a correlation matrix among services.

In some embodiments, when a duration of an interval between a time of receiving the current first voice data and a time of determining the historical user intention is greater than a preset duration threshold, it may be determined that the currently received first voice data and the historical user intention data do not belong to the same dialog sequence, and otherwise, it belongs to the same dialog sequence.

For example, suppose the user chooses to play a song by voice interaction and, after a certain interval, wakes up the voice assistant again and performs voice input. When the interval is short, the probability that the user wants to continue listening to songs is high, and the user probably wants to select another song to play; the voice currently input by the user and the user intention determined in the last voice interaction can therefore be considered to belong to the same dialog sequence. Conversely, when the interval is long, the probability that the user wants to continue listening to songs is low, and the user may want to switch to another form of entertainment, for example watching television; the voice currently input by the user and the user intention determined in the last voice interaction can therefore be considered not to belong to the same dialog sequence.

For example, suppose the user wakes up the voice assistant and says "play sunset song"; the display device recognizes the user's intention as music and plays the song "sunset song". After a period of time, the user wakes up the voice assistant again and says "spring flower bloom". When the interval duration is less than the preset duration threshold, "play sunset song" and "spring flower bloom" can be considered to belong to the same dialog sequence, and the user's intention is music; when the interval duration is greater than or equal to the preset duration threshold, "play sunset song" and "spring flower bloom" can be considered not to belong to the same dialog sequence, and the user's intention may be music or TV drama.

In some embodiments, when the similarity between the text semantics obtained by parsing the current first voice data and the text semantics of the historical user intention is smaller than a preset similarity threshold, it may be determined that the currently received first voice data and the historical user intention do not belong to the same dialog sequence; otherwise, they belong to the same dialog sequence.

For example, suppose the user chooses to play a song by voice interaction and, after a period of time, wakes up the voice assistant again and performs voice input. If the similarity between the text semantics obtained by parsing the newly input voice data and the text semantics of the user intention determined in the previous voice interaction is large, the user's current intention can be considered unchanged, and the voice currently input by the user and the user intention determined in the previous voice interaction belong to the same dialog sequence. Conversely, if that similarity is small, the user's current intention can be considered to have changed, and the voice currently input by the user and the user intention determined in the previous voice interaction do not belong to the same dialog sequence.

For example, suppose the user wakes up the voice assistant and says "play sunset song"; the display device recognizes the user's intention as music and plays the song "sunset song". After a period of time, the user wakes up the voice assistant again and says "play game". Because the similarity between the currently input "play game" and the user intention "music" determined in the last voice interaction is small, "play sunset song" and "play game" are considered not to belong to the same dialog sequence.

The dialog state updating module is used for creating or updating the dialog state of the current user according to the segmentation result of the session segmentation module.

In this embodiment, when it is determined that the first speech data and the historical user intention belong to the same dialog sequence, the candidate user intention of the first speech data may be determined by combining the first speech data and the historical user intention; when the first voice data and the historical user intention are determined not to belong to the same dialogue sequence, a new dialogue sequence is established, and the candidate user intention of the first voice data is determined based on the first voice data.
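As a minimal sketch of the time-interval-based segmentation and its effect on candidate determination, the following assumes a hypothetical `HistoricalIntent` record and a 120-second threshold; neither is specified by the embodiment. The boundary case follows the wording "greater than the preset duration threshold", so an interval exactly equal to the threshold is treated as the same dialog sequence here. Segmentation based on text semantics is not covered by this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoricalIntent:
    intent: str            # user intention determined in the previous round
    determined_at: float   # time of determination, in seconds


def same_dialog_sequence(history: Optional[HistoricalIntent],
                         received_at: float,
                         max_gap_s: float) -> bool:
    """Time-interval-based segmentation: a long gap opens a new dialog sequence."""
    if history is None:
        return False
    return received_at - history.determined_at <= max_gap_s


def resolve_candidates(candidates: List[str],
                       history: Optional[HistoricalIntent],
                       received_at: float,
                       max_gap_s: float = 120.0) -> List[str]:
    """Narrow the candidates with the historical intention when the sequence continues."""
    if (same_dialog_sequence(history, received_at, max_gap_s)
            and history is not None and history.intent in candidates):
        return [history.intent]
    # New dialog sequence: rely on the first voice data alone.
    return list(candidates)
```

For the "spring flower bloom" example, a previous "music" intention determined 30 seconds earlier narrows the candidates to music, whereas a gap of ten minutes leaves both music and TV drama as candidates.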

Based on the description in the foregoing embodiments, in some embodiments of the present application, scores may be given to each output module in the dialogue strategy learning model based on each candidate user intention corresponding to the first voice data, and a discretized dialogue action may be output according to the score of each output module to perform a dialogue interaction with the user.

The dialogue strategy learning model comprises a rewriting module, a reference resolution module, a vertical field intention analysis module, a task multi-round response module, a question and answer module, a news search module, a chat module, a recommendation module and a candidate intention analysis module.

In one possible implementation, the following strategy may be employed to select the output module:

Step 1, sorting the output modules according to their scores, and executing step 4 if the score of the first-ranked output module is greater than or equal to a given threshold 1; otherwise, step 2 is executed.

Step 2, determining whether the score of the first-ranked output module is smaller than a given threshold 2; if so, outputting a default action; otherwise, step 3 is executed. The given threshold 2 is smaller than the given threshold 1.

Step 3, if the score of the second-ranked output module is greater than or equal to the given threshold 2 and the difference between the scores of the first-ranked and second-ranked output modules is less than a given threshold 3, outputting a select action, wherein the select action is used for providing a selection option, and the selection option comprises the first-ranked output module and the second-ranked output module; otherwise, outputting a confirm action, which includes the first-ranked output module.

Step 4, determining whether the first-ranked output module is a module other than the candidate intention analysis module; if so, directly outputting an inline action, wherein the inline action comprises the first-ranked output module; otherwise, step 5 is executed.

Step 5, sorting the candidate user intentions according to the scores of the candidate user intentions, and if the score of the candidate user intention with the highest score in the candidate user intentions is greater than or equal to a given threshold 4 or the candidate user intention is unique and the score is greater than or equal to a given threshold 5, directly outputting an inline action which comprises the candidate user intention with the highest score and represents that the candidate user intention with the highest score is the target user intention; otherwise, step 5.1 is performed.

Step 5.1, determining whether the score of the candidate user intention with the highest score is smaller than a given threshold 6, if so, directly outputting a default action; if not, go to step 5.2.

Step 5.2, if the score of the candidate user intention with the highest score is greater than or equal to the given threshold 6 and the difference between the score of the candidate user intention with the highest score and the score of the candidate user intention with the second highest score is less than a given interval threshold, outputting a select action; otherwise, outputting a confirm action. The select action may provide a selection option that includes the candidate user intention with the highest score and the candidate user intention with the second highest score; the confirm action includes the candidate user intention with the highest score.
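The selection strategy of steps 1 to 5.2 can be sketched as a single decision function. The numeric threshold values, the module name string and the tuple-based return format are assumptions chosen for illustration; only the ordering of the comparisons follows the steps above. The action name "inline" is used as it appears in step 5.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

CANDIDATE_MODULE = "candidate_intent_analysis"  # hypothetical module name


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical; the embodiment only requires t2 < t1.
    t1: float = 0.8
    t2: float = 0.4
    t3: float = 0.1
    t4: float = 0.8
    t5: float = 0.6
    t6: float = 0.4
    interval: float = 0.1


def rank(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def select_action(module_scores: Dict[str, float],
                  intent_scores: Dict[str, float],
                  th: Thresholds = Thresholds()) -> Tuple[str, List[str]]:
    modules = rank(module_scores)                       # Step 1
    if not modules:
        return ("default", [])
    top, top_score = modules[0]
    if top_score < th.t1:
        if top_score < th.t2:                           # Step 2
            return ("default", [])
        if (len(modules) > 1 and modules[1][1] >= th.t2
                and top_score - modules[1][1] < th.t3):  # Step 3
            return ("select", [top, modules[1][0]])
        return ("confirm", [top])
    if top != CANDIDATE_MODULE:                         # Step 4
        return ("inline", [top])
    intents = rank(intent_scores)                       # Step 5
    if not intents:
        return ("default", [])
    best, best_score = intents[0]
    if best_score >= th.t4 or (len(intents) == 1 and best_score >= th.t5):
        return ("inline", [best])
    if best_score < th.t6:                              # Step 5.1
        return ("default", [])
    if len(intents) > 1 and best_score - intents[1][1] < th.interval:  # Step 5.2
        return ("select", [best, intents[1][0]])
    return ("confirm", [best])
```

Under these assumed thresholds, two candidate intentions with close scores such as 0.55 and 0.50 produce a select action, which corresponds to asking the user to choose between them.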

According to the display device provided by the embodiment of the application, the general intention understanding model, the dialog state tracking model and the dialogue strategy learning model are added, and the corresponding system reply is generated in combination with the system dialogue action. Through rich dialogue action types, the voice interaction system is supported in actively interacting with the user to locate the user's real intention, which can effectively improve the user experience.

Based on the description in the foregoing embodiments, a voice interaction method is further provided in an embodiment of the present application. Referring to fig. 13, which is a first flowchart of the voice interaction method provided in this embodiment, in a possible implementation the voice interaction method includes:

S1301, receiving first voice data input by a user, and determining candidate user intentions corresponding to the first voice data.

S1302, when the first voice data corresponds to a plurality of candidate user intentions, generating an inquiry statement according to the plurality of candidate user intentions, and feeding back the inquiry statement to the user.

The inquiry statement is used for prompting the user to select one user intention from the plurality of candidate user intentions.

S1303, receiving second voice data input by the user, and determining, among the plurality of candidate user intentions, a target user intention corresponding to the first voice data according to the second voice data.

S1304, outputting the associated information associated with the target user intention.
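Steps S1301 to S1304 can be sketched end to end as follows. The phrase table and the three callables (`recognize`, `ask_user`, `search`) are hypothetical stand-ins for the intention recognition, the voice feedback channel and the search service; they are not interfaces defined by the embodiment.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical wording used to name each candidate intention in an inquiry.
PHRASES: Dict[str, str] = {
    "movie": "watch a movie",
    "music": "listen to music",
    "game": "play a game",
    "shopping": "do shopping",
}


def build_inquiry(candidates: List[str]) -> str:
    options = [PHRASES.get(c, c) for c in candidates]
    return "Do you want to " + " or ".join(options) + "?"


def pick_target(candidates: List[str], reply: str) -> Optional[str]:
    """Match the second utterance against the offered candidates."""
    reply = reply.lower()
    for candidate in candidates:
        if PHRASES.get(candidate, candidate) in reply or candidate in reply:
            return candidate
    return None


def voice_interaction(first_text: str,
                      recognize: Callable[[str], List[str]],
                      ask_user: Callable[[str], str],
                      search: Callable[[str, str], str]) -> Optional[str]:
    candidates = recognize(first_text)                  # S1301
    if not candidates:
        return None
    if len(candidates) == 1:
        target: Optional[str] = candidates[0]
    else:
        reply = ask_user(build_inquiry(candidates))     # S1302, then S1303
        target = pick_target(candidates, reply)
    if target is None:
        return None
    return search(first_text, target)                   # S1304
```

When only one candidate intention is recognized, the sketch skips the inquiry and outputs the associated information directly.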

According to the voice interaction method, when the candidate user intentions corresponding to the user input sentences are two or more than two, the voice interaction system can actively feed back the query sentences to the user through an anthropomorphic interaction mode, then the real intentions of the user are determined according to the response sentences input by the user, and the accuracy of understanding the user intentions in the voice interaction process can be effectively improved.

For better understanding of the embodiment of the present application, refer to fig. 14, which is a schematic flowchart of a voice interaction method provided in the embodiment of the present application.

In fig. 14, assume that the voice data input by the user includes "toy train", for which there are both related songs and related movies. After receiving the voice data, the display device generates the inquiry sentence "Do you want to watch a movie or listen to music?". If the voice data input by the user again includes "listen to music", the display device searches for songs related to "toy train" and displays "searching for songs for you" on its display interface.

For better understanding of the embodiment of the present application, reference is made to fig. 15a to 15d, and fig. 15a to 15d are schematic diagrams of voice interaction of a display device in the embodiment of the present invention.

When a user needs to perform voice interaction with the display device 200, a voice wake-up instruction may be sent to the display device 200 in a preset wake-up manner. For example, in one possible embodiment, the user may speak a previously set wake-up keyword, such as "hi, a little", etc., through the voice collecting device of the display device 200; at this time, the voice collecting device of the display device 200 may send the collected voice information to the controller, and the controller identifies the received voice information and, if the identification result includes the above wake-up keyword, controls the display device to enter a voice interaction state.

In some embodiments, after the display device enters the voice interaction state, a reminder message may be displayed on the display screen to remind the user that the display device 200 has entered the voice interaction state; as shown in fig. 15a, the message may be "What help do you need?".

When the display device enters the voice interaction state, the voice collecting device starts to collect the voice information input by the user and sends the collected voice information to the controller. For example, in one possible implementation, the user may say "toy train" into a microphone of the display device 200; the microphone of the display device 200 then transmits the collected voice information to the controller. After receiving the voice information, the controller performs voice recognition on it and determines the candidate user intentions corresponding to the voice information.

When the candidate user intentions corresponding to the voice information include both a movie and music, an inquiry sentence is generated by the language generation module and displayed on the display screen; as shown in fig. 15b, "Do you want to watch a movie or listen to music?" may be displayed.

While the display device displays the inquiry sentence, the voice collecting device continues to collect the voice information input by the user. When the voice information input by the user is "listen to music", songs related to "toy train" are searched for through the server; when the voice information input by the user is "watch a movie", movies related to "toy train" are searched for through the server. As shown in fig. 15c, after it is determined that the voice information input by the user is "listen to music", the display interface may display: "Searching for songs related to "toy train" for you..."

After the display device finishes the search task, the search result can be displayed on the display screen. As shown in fig. 15d, the display screen displays: "The following songs were found for you: "toy train MP3"".

Referring to fig. 16a to 16d, fig. 16a to 16d are schematic diagrams illustrating another voice interaction of the display device according to the embodiment of the present invention.

When the display equipment enters a voice interaction state, the voice acquisition device starts to acquire voice information input by a user and sends the acquired voice information to the controller. For example, in one possible implementation, a user may speak "transformers" through a microphone of display device 200; at this time, the microphone of the display device 200 may transmit the collected voice information to the controller. And after receiving the voice information, the controller performs voice recognition on the voice information and determines a candidate user intention corresponding to the voice information. Meanwhile, the voice recognition result is displayed on the display interface, and as shown in fig. 16a, "transformers" may be displayed on the display screen.

When the candidate user intentions corresponding to the voice information include both a movie and a commodity, an inquiry sentence is generated by the language generation module and displayed on the display screen; as shown in fig. 16b, the message "Do you want to watch a movie or do shopping?" may be displayed.

While the display device displays the inquiry sentence, the voice collecting device continues to collect the voice information input by the user. When the voice information input by the user is "shopping", commodities related to "transformers" are searched for through the server; when the voice information input by the user is "watch a movie", movies related to "transformers" are searched for through the server. As shown in fig. 16c, after it is determined that the voice information input by the user is "shopping", the display interface may display: "Searching for commodities related to "transformers" for you..."

After the display device finishes the search task, the search result can be displayed on the display screen. As shown in fig. 16d, the item links of the searched respective items are displayed on the display screen.

It is understood that the voice interaction method described in the above embodiments may be performed by a server. For example, when the display device detects an input operation of a user, voice data input by the user is acquired, then the voice data input by the user is sent to the server, after voice recognition is performed on the voice data input by the user by the server, a target user intention corresponding to the voice data is determined, and response information associated with the target user intention is fed back to the display device.

In some embodiments, the server may perform data interaction with the display device through a network, or the server may be integrated in the display device and perform data interaction with the display device through a communication bus in the display device.

In addition, the voice interaction method described in the above embodiments may also be executed by the display device. For example, when the display device detects an input operation of a user, it acquires the voice data input by the user, performs voice recognition on that voice data, determines the target user intention corresponding to the voice data, and sends the response information associated with the target user intention to the display screen for display.

It is understood that the voice interaction method described in the foregoing embodiments may be applied not only to the foregoing display device, but also to other electronic devices with a voice interaction function, such as a smart speaker, a smart home device, a wearable device, a child toy, a learning machine, and the like, and the embodiments of the present invention are not limited thereto.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A display device, characterized in that the display device comprises:

the voice acquisition device is used for acquiring voice data;

the audio processor is used for processing the collected voice data;

a display screen for displaying an image;

a controller configured to:

outputting association information associated with the target user intent.

2. The display device according to claim 1, wherein the controller is configured to:

when the first voice data corresponds to a single candidate user intention, outputting associated information associated with the candidate user intention.

3. The display device according to claim 1, wherein the controller is configured to:

acquiring historical user intentions determined in a voice interaction process;

4. The display device according to claim 3, wherein the controller is configured to:

5. The display device according to claim 1, wherein the controller is configured to:

6. The display device according to claim 5, wherein the controller is configured to:

7. A method of voice interaction, the method comprising:

when the first voice data corresponds to a plurality of candidate user intentions, generating a query statement according to the candidate user intentions, and feeding back the query statement to the user, wherein the query statement is used for prompting the user to select one user intention from the candidate user intentions;

outputting association information associated with the target user intent.

8. The method of claim 7, further comprising:

9. The method of claim 7, wherein the determining the candidate user intent corresponding to the first speech data based on the intent recognition model comprises:

acquiring historical user intentions determined in a voice interaction process;

10. The method of voice interaction according to claim 9, wherein determining, using the intent recognition model, the candidate user intent corresponding to the first speech data based on the first speech data and the historical user intent comprises:

11. The method of claim 7, wherein generating a query statement according to the plurality of candidate user intents comprises:

and when the output module with the highest score in the dialogue strategy learning model is the candidate intention analysis module, generating the query sentence according to the score of each candidate user intention.

12. The method of claim 11, wherein generating the query statement based on the score of each of the candidate user intentions comprises: