CN112002321B - Display device, server and voice interaction method


Info

Publication number
CN112002321B
CN112002321B (application number CN202010803789.XA)
Authority
CN
China
Prior art keywords
decision result
user
action type
service modules
alternative service
Prior art date
Legal status
Active
Application number
CN202010803789.XA
Other languages
Chinese (zh)
Other versions
CN112002321A (en)
Inventor
朱飞
Current Assignee
Hisense Electronic Technology (Wuhan) Co., Ltd.
Original Assignee
Hisense Electronic Technology (Wuhan) Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Hisense Electronic Technology (Wuhan) Co., Ltd.
Priority claimed from application CN202010803789.XA
Publication of CN112002321A
Application granted
Publication of CN112002321B
Legal status: Active


Classifications

    • G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F16/433: Information retrieval; query formulation using audio data
    • G06F40/253: Handling natural language data; grammatical analysis; style critique
    • G06N20/00: Computing arrangements based on specific computational models; machine learning
    • G10L15/26: Speech recognition; speech-to-text systems
    • G10L2015/223: Execution procedure of a spoken command

Abstract

An embodiment of the application provides a display device, a server and a voice interaction method. The display device includes a display, an audio acquisition device, and a controller configured to: in response to a first voice control signal input by a user, start a round of decision-making and send the first voice control signal to a server; receive and output a first decision result from the server, the first decision result carrying an action type; when the action type of the first decision result is selection, receive a second voice control signal input by the user and send it to the server; and receive and output a second decision result from the server, ending the round of decision-making. According to the application, when the user's intention is difficult to judge accurately, the device interacts with the user so that the user inputs a second voice control signal; the user's intention is then analyzed jointly from the second and first voice control signals, which improves the accuracy of the decision result and helps improve the user experience.

Description

Display device, server and voice interaction method
Technical Field
The application relates to the technical field of man-machine interaction, in particular to a display device, a server and a voice interaction method.
Background
Today, more and more intelligent devices, such as smart televisions and smart speakers, can accomplish intelligent interaction with users through voice assistant applications. After a user issues a voice control signal to an intelligent device equipped with a voice assistant application, the device can analyze the user's intention through a decision engine and output a decision result corresponding to the voice control signal.
In the related art, a decision engine may analyze a user's intention through a rule algorithm, a decision-making approach that determines the output order of services by setting priorities for certain services. For example, on the television side, video services such as TV series and films are the main services, so a rule algorithm can preferentially output video decision results by manually setting thresholds; on the speaker side, music is the main service, so the rule algorithm can preferentially output audio decision results such as music.
However, as integrated services grow more diverse and intelligent voice assistant applications improve, the rule algorithm finds it harder and harder to meet the requirements. For example, as the number of service modules grows, threshold setting in the rule algorithm becomes increasingly complex and difficult to complete accurately by hand; the decision accuracy of the decision engine therefore drops, the decision result of the intelligent device fails to reflect the user's actual intention, and the user experience suffers.
Disclosure of Invention
In order to solve the above technical problems, the application provides a display device, a server and a voice interaction method.
In a first aspect, the present application provides a display device comprising:
a display;
an audio collection device configured to collect user input audio;
a controller coupled to the display and the audio acquisition device, the controller configured to:
responding to a first voice control signal input by a user, starting a round of decision, and sending the first voice control signal to a server;
receiving and outputting a first decision result from the server, wherein the first decision result is provided with an action type;
when the action type of the first decision result is selection, receiving a second voice control signal input by the user, and sending the second voice control signal to the server;
and receiving and outputting a second decision result from the server, and ending the round of decision.
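A minimal sketch of this round-of-decision flow may make the controller logic concrete. It is illustrative only: the application specifies no API, so every name below (capture_voice, render, server.decide) is hypothetical, and server stands for any object exposing the decision service (a matching server-side sketch follows the second aspect below).

```python
from typing import Any, Dict

def capture_voice() -> str:
    """Hypothetical stand-in for microphone capture plus speech-to-text."""
    return input("user> ")

def render(result: Dict[str, Any]) -> None:
    """Hypothetical stand-in for displaying or playing a decision result."""
    print(result)

def run_decision_round(first_signal: str, server) -> None:
    """One round of decision-making, following the controller steps above."""
    first_result = server.decide(first_signal)       # send first voice signal
    render(first_result)                             # output first decision result
    if first_result["action_type"] == "select":      # intent still ambiguous
        second_signal = capture_voice()              # user picks an alternative
        # the server analyzes the second signal together with the first one
        second_result = server.decide(second_signal, context=first_signal)
        render(second_result)                        # output second decision result
    # the round of decision ends in either case
```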
In a second aspect, the present application provides a server configured to:
analyzing a first voice control signal from the display device to obtain a first feature vector;
calculating a first ranking score for each service module according to the first feature vector, comparing the first ranking scores to obtain a first decision result, and sending the first decision result to the display device, wherein the first decision result includes an action type;
when the action type of the first decision result is selection, analyzing a second voice control signal together with the first voice control signal from the display device to obtain a second feature vector;
and calculating a second ranking score for each service module according to the second feature vector, comparing the second ranking scores to obtain a second decision result, and sending the second decision result to the display device.
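The server-side flow can be sketched in the same spirit. The pipeline below is a skeleton under assumed names; the feature extraction, scoring, and comparison steps are stubbed here and elaborated in the detailed description.

```python
from typing import Dict, List, Optional, Tuple

SERVICE_MODULES = ["encyclopedia", "news", "movie"]   # illustrative modules

def extract_features(text: str) -> List[float]:
    """Stub feature extraction; a real system parses the request per module."""
    return [float(len(text))]

def rank(module: str, features: List[float]) -> float:
    """Stub ranking score in [0, 1] for one service module."""
    return 1.0 / len(SERVICE_MODULES)

def compare(scores: Dict[str, float]) -> Tuple[str, List[str]]:
    """Stub comparison; the preset rule appears in the detailed description."""
    best = max(scores, key=scores.get)
    return "inform", [best]

def decide(signal: str, context: Optional[str] = None) -> dict:
    """Produce a decision result from one (or, on selection, two) signals."""
    text = signal if context is None else context + " " + signal  # joint analysis
    features = extract_features(text)          # first / second feature vector
    scores = {m: rank(m, features) for m in SERVICE_MODULES}
    action, candidates = compare(scores)       # select / inform / default
    return {"action_type": action, "modules": candidates}
```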
In a third aspect, an embodiment of the present application provides a voice interaction method, for a display device, where the method includes:
responding to a first voice control signal input by a user, starting a round of decision, and sending the first voice control signal to a server;
receiving and outputting a first decision result from the server, wherein the first decision result is provided with an action type;
when the action type of the first decision result is selection, receiving a second voice control signal input by the user, and sending the second voice control signal to the server;
and receiving and outputting a second decision result from the server, and ending the round of decision.
In a fourth aspect, an embodiment of the present application provides a voice interaction method, which is used for a server, and the method includes:
analyzing a first voice control signal from the display device to obtain a first feature vector;
calculating a first ranking score for each service module according to the first feature vector, comparing the first ranking scores to obtain a first decision result, and sending the first decision result to the display device, wherein the first decision result includes an action type;
when the action type is selection, analyzing a second voice control signal together with the first voice control signal from the display device to obtain a second feature vector;
and calculating a second ranking score for each service module according to the second feature vector, comparing the second ranking scores to obtain a second decision result, and sending the second decision result to the display device.
The display device, the server and the voice interaction method provided by the application have the following beneficial effects:
After receiving the first voice control signal input by the user, if the user's intention is difficult to judge accurately, the embodiment of the application interacts with the user so that the user inputs a second voice control signal; the user's intention is then analyzed jointly from the second voice control signal and the first voice control signal, which improves the accuracy of the decision result and helps improve the user experience.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings needed in the embodiments are briefly described below. It will be apparent to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
A schematic diagram of an operational scenario between a display device and a control apparatus according to some embodiments is schematically shown in fig. 1;
a hardware configuration block diagram of a display device 200 according to some embodiments is exemplarily shown in fig. 2;
a hardware configuration block diagram of the control apparatus 100 according to some embodiments is exemplarily shown in fig. 3;
a schematic diagram of the software configuration in a display device 200 according to some embodiments is exemplarily shown in fig. 4;
an icon control interface display schematic of an application in a display device 200 according to some embodiments is illustrated in fig. 5;
a schematic of a decision engine according to some embodiments is shown schematically in fig. 6;
a flow diagram of an offline policy learning method according to some embodiments is schematically shown in fig. 7;
a schematic diagram of policy learning according to some embodiments is schematically shown in fig. 8;
a user request processing diagram according to some embodiments is schematically illustrated in fig. 9;
a flow diagram of a voice interaction method according to some embodiments is schematically shown in fig. 10;
a flow diagram of a voice interaction method according to further embodiments is schematically shown in fig. 11;
a flow diagram of a first feature vector generation method according to some embodiments is illustrated in fig. 12.
Detailed Description
For the purposes of making the objects, embodiments and advantages of the present application more apparent, exemplary embodiments of the present application will be described more fully hereinafter with reference to the accompanying drawings in which exemplary embodiments of the application are shown. It should be understood that the exemplary embodiments described are merely some, but not all, of the embodiments of the application.
Based on the exemplary embodiments described herein, all other embodiments that may be obtained by one of ordinary skill in the art without inventive effort are within the scope of the appended claims. Furthermore, while the present disclosure has been described in terms of one or more exemplary embodiments, it should be understood that each aspect of the disclosure can be practiced separately from the others.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first", "second", "third" and the like in the description, the claims, and the above-described figures are used to distinguish between similar objects and not necessarily to describe a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments of the application can, for example, be practiced in sequences other than those illustrated or described herein.
Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to those elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" as used in this disclosure refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the function associated with that element.
The term "remote control" as used herein refers to a component of an electronic device (such as a display device as disclosed herein) that can be controlled wirelessly, typically over a relatively short distance. Typically, the electronic device is connected to the electronic device using infrared and/or Radio Frequency (RF) signals and/or bluetooth, and may also include functional modules such as WiFi, wireless USB, bluetooth, motion sensors, etc. For example: the hand-held touch remote controller replaces most of the physical built-in hard keys in a general remote control device with a touch screen user interface.
The term "gesture" as used herein refers to a user behavior by which a user expresses an intended idea, action, purpose, and/or result through a change in hand shape or movement of a hand, etc.
A schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment is exemplarily shown in fig. 1. As shown in fig. 1, a user may operate the display apparatus 200 through the mobile terminal 300 and the control device 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, etc., and the display device 200 is controlled by a wireless or other wired mode. The user may control the display device 200 by inputting user instructions through keys on a remote control, voice input, control panel input, etc. Such as: the user can input corresponding control instructions through volume up-down keys, channel control keys, up/down/left/right movement keys, voice input keys, menu keys, on-off keys, etc. on the remote controller to realize the functions of the control display device 200.
In some embodiments, mobile terminals, tablet computers, notebook computers, and other smart devices may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device. The application program, by configuration, can provide various controls to the user in an intuitive User Interface (UI) on a screen associated with the smart device.
In some embodiments, the mobile terminal 300 and the display device 200 may each install a software application, so as to implement connection and communication through a network communication protocol and achieve one-to-one control operation and data communication. For example, a control command protocol may be established between the mobile terminal 300 and the display device 200, a remote-control keyboard may be synchronized to the mobile terminal 300, and the function of controlling the display device 200 may be realized by controlling the user interface on the mobile terminal 300. The audio/video content displayed on the mobile terminal 300 can also be transmitted to the display device 200 to realize a synchronous display function.
As also shown in fig. 1, the display device 200 is also in data communication with the server 400 via a variety of communication means. The display device 200 may be allowed to communicate via a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. For example, the display device 200 receives software program updates or accesses a remotely stored digital media library by sending and receiving information and through Electronic Program Guide (EPG) interactions. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers. Other web service content such as video on demand and advertising services are provided through the server 400.
The display device 200 may be a liquid crystal display, an OLED display, a projection display device. The particular display device type, size, resolution, etc. are not limited, and those skilled in the art will appreciate that the display device 200 may be modified in performance and configuration as desired.
The display apparatus 200 may additionally provide a smart network television function of a computer support function, including, but not limited to, a network television, a smart television, an Internet Protocol Television (IPTV), etc., in addition to the broadcast receiving television function.
A hardware configuration block diagram of the display device 200 according to an exemplary embodiment is illustrated in fig. 2.
In some embodiments, at least one of the controller 250, the modem 210, the communicator 220, the detector 230, the input/output interface 255, the display 275, the audio output interface 285, the memory 260, the power supply 290, the user interface 265, and the external device interface 240 is included in the display apparatus 200.
In some embodiments, the display 275 is configured to receive image signals from the first processor output, and to display video content and images and components of the menu manipulation interface.
In some embodiments, display 275 includes a display screen assembly for presenting pictures, and a drive assembly for driving the display of images.
In some embodiments, the displayed video content may come from broadcast television content, or from various broadcast signals received via wired or wireless communication protocols. Alternatively, various image content received from a network server via a network communication protocol may be displayed.
In some embodiments, the display 275 is used to present a user-manipulated UI interface generated in the display device 200 and used to control the display device 200.
In some embodiments, depending on the type of display 275, a drive assembly for driving the display is also included.
In some embodiments, display 275 is a projection display and may further include a projection device and a projection screen.
In some embodiments, communicator 220 is a component for communicating with external devices or external servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi chip, a bluetooth communication protocol chip, a wired ethernet communication protocol chip, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver.
In some embodiments, the display apparatus 200 may establish control signal and data signal transmission and reception between the communicator 220 and the external control device 100 or the content providing apparatus.
In some embodiments, the user interface 265 may be used to receive infrared control signals from the control device 100 (e.g., an infrared remote control, etc.).
In some embodiments, the detector 230 is a component used by the display device 200 to collect signals from the external environment or to interact with it.
In some embodiments, the detector 230 includes an optical receiver, a sensor for capturing the intensity of ambient light, so that display parameters can be changed adaptively according to the captured ambient light, etc.
In some embodiments, the detector 230 may further include an image collector, such as a camera, a video camera, etc., which may be used to collect external environmental scenes, collect attributes of a user or interact with a user, adaptively change display parameters, and recognize a user gesture to realize an interaction function with the user.
In some embodiments, the detector 230 may also include a temperature sensor or the like, such as by sensing ambient temperature.
In some embodiments, the display device 200 may adaptively adjust the display color temperature of the image. For example, when the ambient temperature is high, the display device 200 may be adjusted to display images with a cooler color temperature; when the temperature is low, the display device 200 may be adjusted to display images with a warmer color tone.
In some embodiments, the detector 230 may also include a sound collector, such as a microphone, which may be used to receive the user's voice, for example a voice signal containing a control instruction for controlling the display apparatus 200, or to collect environmental sounds used to recognize the type of environmental scene, so that the display apparatus 200 can adapt to environmental noise.
In some embodiments, as shown in fig. 2, the input/output interface 255 is configured to enable data transfer between the controller 250 and external other devices or other controllers 250. Such as receiving video signal data and audio signal data of an external device, command instruction data, or the like.
In some embodiments, external device interface 240 may include, but is not limited to, the following: any one or more interfaces of a high definition multimedia interface HDMI interface, an analog or data high definition component input interface, a composite video input interface, a USB input interface, an RGB port, and the like can be used. The plurality of interfaces may form a composite input/output interface.
In some embodiments, as shown in fig. 2, the modem 210 is configured to receive the broadcast television signal by a wired or wireless receiving manner, and may perform modulation and demodulation processes such as amplification, mixing, and resonance, and demodulate the audio/video signal from a plurality of wireless or wired broadcast television signals, where the audio/video signal may include a television audio/video signal carried in a television channel frequency selected by a user, and an EPG data signal.
In some embodiments, the frequency point demodulated by the modem 210 is controlled by the controller 250, and the controller 250 may send a control signal according to the user selection, so that the modem responds to the television signal frequency selected by the user and modulates and demodulates the television signal carried by the frequency.
In some embodiments, the broadcast television signal may be classified into a terrestrial broadcast signal, a cable broadcast signal, a satellite broadcast signal, an internet broadcast signal, or the like according to a broadcasting system of the television signal. Or may be differentiated into digital modulation signals, analog modulation signals, etc., depending on the type of modulation. Or it may be classified into digital signals, analog signals, etc. according to the kind of signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like. In this way, the set-top box outputs the television audio and video signals modulated and demodulated by the received broadcast television signals to the main body equipment, and the main body equipment receives the audio and video signals through the first input/output interface.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored on the memory. The controller 250 may control the overall operation of the display apparatus 200. For example: in response to receiving a user command to select to display a UI object on the display 275, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink or an icon. Operations related to the selected object, such as: displaying an operation of connecting to a hyperlink page, a document, an image, or the like, or executing an operation of a program corresponding to the icon. The user command for selecting the UI object may be an input command through various input means (e.g., mouse, keyboard, touch pad, etc.) connected to the display device 200 or a voice command corresponding to a voice uttered by the user.
As shown in fig. 2, the controller 250 includes at least one of a random access memory 251 (Random Access Memory, RAM), a read-only memory 252 (ROM), a video processor 270, an audio processor 280, other processors 253 (e.g., a graphics processing unit (Graphics Processing Unit, GPU)), a central processing unit 254 (Central Processing Unit, CPU), a communication interface (Communication Interface), and a communication bus 256 (Bus) that connects the respective components.
In some embodiments, RAM 251 is used to store temporary data for the operating system or other on-the-fly programs.
In some embodiments, ROM 252 is used to store instructions for various system boots.
In some embodiments, ROM 252 is used to store a basic input output system (Basic Input Output System, BIOS), which comprises a driver program and a boot operating system and is used to complete the power-on self-test of the system, the initialization of each functional module in the system, and the basic input/output of the system.
In some embodiments, upon receipt of the power-on signal, the display device 200 starts up, and the CPU runs the system boot instructions in ROM 252, copying the temporary data of the operating system stored in memory into RAM 251 in order to start or run the operating system. After the operating system is started, the CPU copies the temporary data of the various applications in memory into RAM 251 to facilitate starting or running those applications.
In some embodiments, CPU processor 254 is used to execute operating system and application program instructions stored in memory. And executing various application programs, data and contents according to various interactive instructions received from the outside, so as to finally display and play various audio and video contents.
In some exemplary embodiments, the CPU processor 254 may comprise a plurality of processors. The plurality of processors may include one main processor and one or more sub-processors: the main processor performs some operations of the display apparatus 200 in the pre-power-up mode and/or displays pictures in normal mode, while the one or more sub-processors handle operations in standby mode and the like.
In some embodiments, the graphics processor 253 is configured to generate various graphical objects, such as icons, operation menus, and graphics displayed for user input instructions. It comprises an arithmetic unit, which performs operations on the various interaction instructions input by the user and displays various objects according to their display attributes, and a renderer, which renders the objects produced by the arithmetic unit for display on the display.
In some embodiments, the video processor 270 is configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image composition according to the standard codec protocol of the input signal, so as to obtain a signal that can be displayed or played directly on the display device 200.
In some embodiments, video processor 270 includes a demultiplexing module, a video decoding module, an image compositing module, a frame conversion module, a display formatting module, and the like.
The demultiplexing module is used for demultiplexing the input audio/video data stream, such as the input MPEG-2, and demultiplexes the input audio/video data stream into video signals, audio signals and the like.
And the video decoding module is used for processing the demultiplexed video signals, including decoding, scaling and the like.
And an image synthesis module, such as an image synthesizer, which superimposes and mixes the GUI signal (input by the user or generated by the graphics generator) with the scaled video image, so as to generate an image signal for display.
The frame rate conversion module is used to convert the frame rate of the input video, for example converting a 60 Hz frame rate into a 120 Hz or 240 Hz frame rate; the common approach is frame interpolation.
The display format module is used to convert the received video signal after frame rate conversion into a video output signal, changing the signal to conform to the display format, such as outputting RGB data signals.
In some embodiments, the graphics processor 253 may be integrated with the video processor, or may be configured separately. In the integrated configuration, graphics signals output to the display can be processed together; in the separate configuration, the two can perform different functions, such as a GPU + FRC (Frame Rate Conversion) architecture.
In some embodiments, the audio processor 280 is configured to receive an external audio signal, decompress and decode the audio signal according to a standard codec protocol of an input signal, and perform noise reduction, digital-to-analog conversion, and amplification processing, so as to obtain a sound signal that can be played in a speaker.
In some embodiments, video processor 270 may include one or more chips. The audio processor may also comprise one or more chips.
In some embodiments, video processor 270 and audio processor 280 may be separate chips or may be integrated together with the controller in one or more chips.
In some embodiments, the audio output, under the control of the controller 250, receives the sound signals output by the audio processor 280, for example through the speaker 286; besides the speaker carried by the display device 200 itself, an external sound output terminal can output to the sound-emitting device of an external device, such as an external sound interface or an earphone interface. The communication interface may also include a near-field communication module, such as a Bluetooth module, for outputting sound through a Bluetooth speaker.
The power supply 290 supplies power input from an external power source to the display device 200 under the control of the controller 250. The power supply 290 may include a built-in power circuit installed inside the display device 200, or may be a power interface in the display device 200 that provides an external power source.
The user interface 265 is used to receive an input signal from a user and then transmit the received user input signal to the controller 250. The user input signal may be a remote control signal received through an infrared receiver, and various user control signals may be received through a network communication module.
In some embodiments, a user inputs a user command through the control apparatus 100 or the mobile terminal 300, the user input interface is then responsive to the user input through the controller 250, and the display device 200 is then responsive to the user input.
In some embodiments, a user may input a user command through a Graphical User Interface (GUI) displayed on the display 275, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
In some embodiments, a "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a form acceptable to the user. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
The memory 260 includes memory storing various software modules for driving the display device 200. Such as: various software modules stored in the first memory, including: at least one of a base module, a detection module, a communication module, a display control module, a browser module, various service modules, and the like.
The base module is a bottom software module for signal communication between the various hardware in the display device 200 and for sending processing and control signals to the upper modules. The detection module is used for collecting various information from various sensors or user input interfaces and carrying out digital-to-analog conversion and analysis management.
For example, the voice recognition module includes a voice analysis module and a voice instruction database module. The display control module is used for controlling the display to display the image content, and can be used for playing the multimedia image content, the UI interface and other information. And the communication module is used for carrying out control and data communication with external equipment. And the browser module is used for executing data communication between the browsing servers. And the service module is used for providing various services and various application programs. Meanwhile, the memory 260 also stores received external data and user data, images of various items in various user interfaces, visual effect maps of focus objects, and the like.
Fig. 3 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 3, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface, a memory, and a power supply.
The control apparatus 100 is configured to control the display device 200: it receives the user's input operation instructions and converts them into instructions that the display device 200 can recognize and respond to, acting as an intermediary between the user and the display device 200. For example, when the user operates the channel up/down keys on the control apparatus 100, the display device 200 responds with the channel up/down operation.
In some embodiments, the control apparatus 100 may be a smart device. Such as: the control apparatus 100 may install various applications for controlling the display device 200 according to user's needs.
In some embodiments, as shown in fig. 1, a mobile terminal 300 or other intelligent electronic device may function similarly to the control apparatus 100 after installing an application for manipulating the display device 200. Such as: the user may implement the functions of the physical keys of the control apparatus 100 by installing various function keys or virtual buttons of a graphical user interface available on the mobile terminal 300 or other intelligent electronic device.
The controller 110 includes a processor 112, RAM 113, ROM 114, a communication interface 130, and a communication bus. The controller is used to control the operation of the control device 100, the communication and collaboration among the internal components, and the external and internal data processing functions.
The communication interface 130 enables communication of control signals and data signals with the display device 200 under the control of the controller 110. Such as: the received user input signal is transmitted to the display device 200. The communication interface 130 may include at least one of a WiFi chip 131, a bluetooth module 132, an NFC module 133, and other near field communication modules.
A user input/output interface 140, wherein the input interface includes at least one of a microphone 141, a touchpad 142, a sensor 143, keys 144, and other input interfaces. Such as: the user can implement a user instruction input function through actions such as voice, touch, gesture, press, and the like, and the input interface converts a received analog signal into a digital signal and converts the digital signal into a corresponding instruction signal, and sends the corresponding instruction signal to the display device 200.
The output interface includes an interface that transmits the received user instruction to the display device 200. In some embodiments, an infrared interface may be used, as well as a radio frequency interface. Such as: when the infrared signal interface is used, the user input instruction needs to be converted into an infrared control signal according to an infrared control protocol, and the infrared control signal is sent to the display device 200 through the infrared sending module. And the following steps: when the radio frequency signal interface is used, the user input instruction is converted into a digital signal, and then the digital signal is modulated according to a radio frequency control signal modulation protocol and then transmitted to the display device 200 through the radio frequency transmission terminal.
In some embodiments, the control device 100 includes at least one of a communication interface 130 and an input-output interface 140. The control device 100 is provided with a communication interface 130, such as: the WiFi, bluetooth, NFC, etc. modules may send the user input instruction to the display device 200 through a WiFi protocol, or a bluetooth protocol, or an NFC protocol code.
A memory 190 for storing various operation programs, data and applications for driving and controlling the control apparatus 100 under the control of the controller. The memory 190 may store various control signal instructions input by a user.
And a power supply 180 for providing operation power support for each element of the control device 100 under the control of the controller. May be a battery and associated control circuitry.
In some embodiments, the system may include a kernel (Kernel), a command parser (shell), a file system, and applications. The kernel, shell, and file system together form the basic operating system architecture that allows users to manage files, run programs, and use the system. After power-up, the kernel is started; it activates kernel space, abstracts hardware, initializes hardware parameters, and operates and maintains virtual memory, the scheduler, signals, and inter-process communication (IPC). After the kernel is started, the shell and the user applications are loaded. An application is compiled into machine code after being started, forming a process.
Referring to fig. 4, in some embodiments, the system is divided into four layers, from top to bottom: an application layer (referred to as the "application layer"), an application framework layer (Application Framework layer, referred to as the "framework layer"), an Android Runtime and system library layer (referred to as the "system runtime layer"), and a kernel layer.
In some embodiments, at least one application program is running in the application program layer, and these application programs may be a Window (Window) program of an operating system, a system setting program, a clock program, a camera application, and the like; and may be an application program developed by a third party developer, such as a hi-see program, a K-song program, a magic mirror program, etc. In particular implementations, the application packages in the application layer are not limited to the above examples, and may actually include other application packages, which the embodiments of the present application do not limit.
The framework layer provides an application programming interface (application programming interface, API) and a programming framework for the applications of the application layer. The application framework layer includes a number of predefined functions and acts as a processing center that decides the actions of the applications in the application layer. Through the API interface, an application can access system resources and obtain system services during execution.
As shown in fig. 4, the application framework layer in the embodiment of the present application includes a manager (Manager), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager), used to interact with all activities running in the system; a Location Manager (Location Manager), used to provide system services or applications with access to the system location services; a Package Manager (Package Manager), used to retrieve various information about the application packages currently installed on the device; a Notification Manager (Notification Manager), used to control the display and clearing of notification messages; and a Window Manager (Window Manager), used to manage the icons, windows, toolbars, wallpaper, and desktop components on the user interface.
In some embodiments, the activity manager is to: the lifecycle of each application program is managed, as well as the usual navigation rollback functions, such as controlling the exit of the application program (including switching the currently displayed user interface in the display window to the system desktop), opening, backing (including switching the currently displayed user interface in the display window to the previous user interface of the currently displayed user interface), etc.
In some embodiments, the window manager is configured to manage all window procedures, such as obtaining a display screen size, determining whether there is a status bar, locking the screen, intercepting the screen, controlling display window changes (e.g., scaling the display window down, dithering, distorting, etc.), and so on.
In some embodiments, the system runtime layer provides support for the upper layer, the framework layer, and when the framework layer is in use, the android operating system runs the C/C++ libraries contained in the system runtime layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer contains at least one of the following drivers: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, touch sensor, pressure sensor, etc.), and the like.
In some embodiments, the kernel layer further includes a power driver module for power management.
In some embodiments, the software programs and/or modules corresponding to the software architecture in fig. 4 are stored in the first memory or the second memory shown in fig. 2 or fig. 3.
In some embodiments, for a display device with a touch function, taking a split screen operation as an example, the display device receives an input operation (such as a split screen operation) acted on a display screen by a user, and the kernel layer may generate a corresponding input event according to the input operation and report the event to the application framework layer. The window mode (e.g., multi-window mode) and window position and size corresponding to the input operation are set by the activity manager of the application framework layer. And window management of the application framework layer draws a window according to the setting of the activity manager, then the drawn window data is sent to a display driver of the kernel layer, and the display driver displays application interfaces corresponding to the window data in different display areas of the display screen.
In some embodiments, as shown in fig. 5, the application layer contains at least one icon control that the application can display in the display, such as: a live television application icon control, a video on demand application icon control, a media center application icon control, an application center icon control, a game application icon control, and the like.
In some embodiments, the live television application may provide live television via different signal sources. For example, a live television application may provide television signals using inputs from cable television, radio broadcast, satellite services, or other types of live television services. And, the live television application may display video of the live television signal on the display device 200.
In some embodiments, the video on demand application may provide video from different storage sources. Unlike live television applications, video-on-demand provides video displays from some storage sources. For example, video-on-demand may come from the server side of cloud storage, from a local hard disk storage containing stored video programs.
In some embodiments, the media center application may provide various multimedia content playing applications. For example, a media center may be a different service than live television or video on demand, and a user may access various images or audio through a media center application.
In some embodiments, an application center may be provided to store various applications. The application may be a game, an application, or some other application associated with a computer system or other device but which may be run in a smart television. The application center may obtain these applications from different sources, store them in local storage, and then be run on the display device 200.
The hardware or software architecture in some embodiments may be based on the description in the foregoing embodiments, and in some embodiments may be based on other similar hardware or software architectures, so long as the technical solution of the present application may be implemented.
In some embodiments, the application center may be provided with a voice assistant application to implement intelligent voice services, such as searching for media assets, adjusting volume, and the like. When the user clicks the voice assistant application icon control, the voice assistant application may be awakened, and after the voice assistant application wakes up, the user may input a user request, which may be a voice control signal, to the display device. In addition to clicking on the voice assistant application icon, the user may wake up the voice assistant application by issuing a preset voice command to the display device, which may be some preset wake-up word.
The display device can output a decision result according to the voice control signal, thereby realizing interaction between the user and the display device. Depending on the specific content requested by the user, the decision result may be image-and-text data, displayed on the display of the display device; audio data, played through the loudspeaker of the display device; or a control signal for the display device, executed by the display device's controller.
In some embodiments, the voice assistant application may also be provided on the audio device, where the decision result output by the audio device may be audio data or a control signal. The decision process will be described below taking the example of a voice assistant being provided on the display device.
In some embodiments, the user request may be a media asset search request. The display device may send the media asset search request to the server; the server retrieves the media asset library according to the request, obtains recommended media assets, and feeds them back to the display device, which displays them for the user to view. Because the number of media assets in the library is huge and the titles, synopses, and other information of some assets are quite similar, the assets the server recommends according to the user request may not be the ones the user actually wants to watch, i.e., the server's decision result is inaccurate.
In order to improve the accuracy of decision results and the user's interactive experience with the display device, the application discloses a voice interaction scheme: the intelligent speech recognition and natural language understanding of the intelligent voice service serve as the reinforcement learning environment, the decision engine serves as the agent of the reinforcement learning model, and the decision engine is used to output the decision results for user requests.
In some embodiments, the decision engine may be disposed on a server, and the display device may send the user request to the server, and after the decision engine makes a decision, obtain a decision result, and then send the decision result to the display device for display.
In some embodiments, elements of the decision engine may include state, action, and reward.
The state of the decision engine may be a representation of the features of the user and of the user request, such as a feature vector. The feature vector may include a user request feature component, a media asset history feature component, and a user history feature component. The server parses the user request to obtain each feature component, and the feature components are spliced together to obtain the feature vector.
The user request feature component may be a vector of the parsing results of the user request under each service module of the server. The server's multiple service modules can provide different types of media assets to users, and for the same user request, different service modules may output different parsing results. For example, for the user request "the mayor of Wuhan": in the "encyclopedia" service module, the parsing result may be (p1, confidence1), where p1 is the probability that the user request belongs to the "encyclopedia" module and confidence1 is the confidence of the "encyclopedia" module's parsing result; in the "news" service module, the parsing result may be (p2, confidence2), where p2 is the probability that the user request belongs to the "news" module and confidence2 is the confidence of the "news" module's parsing result.
The media asset history feature component may include the historical media assets that each business module retrieved for the user request, and the historical click-through rates of those historical media assets.
The user history feature component may include the user's record of historical requests over time windows such as the 10 minutes, 1 hour, and 24 hours before the current request, as well as the business module to which the media assets the user engaged with belong.
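One way the three components might be spliced into a single state vector is sketched below with a small illustrative module set. The patent names the components but not their exact layout, so this encoding (and every name in it) is an assumption.

```python
import numpy as np

MODULES = ["encyclopedia", "news", "movie"]   # illustrative service modules

def build_state_vector(parse_results, media_history, user_history):
    """Concatenate the three feature components into one state vector."""
    # 1) user request features: (probability, confidence) per service module,
    #    e.g. parse_results = {"encyclopedia": (0.5, 0.9), "news": (0.4, 0.8)}
    req = np.array([v for m in MODULES
                      for v in parse_results.get(m, (0.0, 0.0))])
    # 2) media asset history features, e.g. the historical click-through rate
    #    of the assets each module retrieved
    media = np.array([media_history.get(m, 0.0) for m in MODULES])
    # 3) user history features, e.g. request counts over the last 10 min /
    #    1 h / 24 h plus an encoding of the module the user engaged with
    user = np.asarray(user_history, dtype=float)
    return np.concatenate([req, media, user])   # the spliced feature vector
```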
The actions of the decision engine may include internal actions and external actions. The internal action may be set to calculate the ranking score α_i of the user request under each business module; the ranking-score vector is a = [α_1, α_2, ..., α_n], where n is the total number of service modules, each α_i can take any value from 0 to 1, and the α_i sum to 1, their values being determined by the decision engine's analysis of the feature vector. The internal action is a continuous action that runs until the ranking score of the user request has been obtained for every business module.
The external action may be an output action of the decision engine. The decision engine analyzes the ranking-score vector a according to a preset rule to obtain the action type of the output action, and outputs the corresponding external action according to that action type.
In some embodiments, the preset rule may be: if there are multiple ranking scores greater than a preset threshold and the differences among those scores fall within a preset range, the action type is select (alternatives are provided and the user is asked for further confirmation), and the service modules corresponding to those ranking scores are the alternative service modules; if at least one ranking score is greater than the preset threshold and the difference between the two largest ranking scores exceeds the preset range, the action type is inform (a business result is output directly); and if no ranking score is greater than the preset threshold, the action type is default (no result is output).
For example, the preset threshold may be set to 0.35 and the preset range to 0.1. For the user request "the mayor of Wuhan", the ranking score of the encyclopedia module is 0.5, that of the news module is 0.4, and that of the movie module is 0.1; the scores of the encyclopedia and news modules both exceed 0.35 and the gap between them falls within 0.1, so the action type of the external action is select, and the encyclopedia and news modules are set as the module options. For the user request "I want to watch Let the Bullets Fly", the encyclopedia module scores 0.1, the news module 0.1, and the movie module 0.8, so the action type of the external action is inform. For the user request "astronaut", the encyclopedia, news, and movie modules each score 0.33, below the threshold, so the action type of the external action is default.
As another example, with the preset threshold set to 0.5 and the preset range set to 0.1: for the user request "Mayor of Wuhan", the ranking score of the encyclopedia module is 0.8, that of the news module is also 0.8, and that of the movie module is 0.1; the scores of the encyclopedia and news modules are both greater than 0.5 and their difference is less than 0.1, so the action type of the external action is select and the encyclopedia and news modules are set as the options. For the user request "I want to watch Let the Bullets Fly", the scores are 0.4 (encyclopedia), 0.3 (news) and 0.9 (movie), so the action type is inform. For the user request "I want to watch Let the Bullets Fly 2", the scores are 0.1 (encyclopedia), 0.2 (news) and 0.4 (movie), so the action type is default.
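The preset rule itself is mechanical and can be sketched directly from the two examples above. The following is illustrative only: the function name, the score dictionary, and the default parameter values are assumptions, not the patent's implementation:

```python
def action_type(scores, threshold=0.5, rng=0.1):
    """Classify the external action from per-module ranking scores,
    following the preset rule described above."""
    above = {m: s for m, s in scores.items() if s > threshold}
    if not above:
        return "default", []                     # no module is confident enough
    top = sorted(above.items(), key=lambda kv: -kv[1])
    if len(top) >= 2 and top[0][1] - top[1][1] < rng:
        # several close candidates: ask the user to choose among them
        candidates = [m for m, s in top if top[0][1] - s < rng]
        return "select", candidates
    return "inform", [top[0][0]]                 # one clear winner

print(action_type({"encyclopedia": 0.8, "news": 0.8, "movie": 0.1}))
# ('select', ['encyclopedia', 'news'])
print(action_type({"encyclopedia": 0.4, "news": 0.3, "movie": 0.9}))
# ('inform', ['movie'])
print(action_type({"encyclopedia": 0.1, "news": 0.2, "movie": 0.4}))
# ('default', [])
```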
The reward of the decision engine is an important part of the reinforcement learning model. In some embodiments, the rewards of the decision engine include action rewards and user feedback rewards.
The action reward may be the reward for an internal action. By setting action rewards, the agent of the decision engine is expected to learn a better ranking-score calculation method, so that an accurate inform decision result can be output directly, reducing the number of human-machine interactions needed to reach an accurate decision result and improving the user experience.
The user feedback reward is used to judge, from the user's feedback actions, whether the user is satisfied with the decision result, and thereby to analyze whether the decision result is accurate. At present, only a small number of user requests obtain a user feedback action corresponding to the decision result on the smart speaker or television side; for example, after the display device returns a movie search result as the decision result, and the user clicks a movie resource in that result via the remote control to play it, the decision result can be considered accurate. Most interactions have no user feedback action from which to judge whether the decision result was accurate; for example, for the user request "fast forward 5 minutes", after the display device adjusts the media playback progress the user may take no further action, i.e., there is no user feedback action indicating whether the adjustment satisfied the request.
In some embodiments, whether the user is satisfied with the decision result of the previous user request may be determined by analyzing the text similarity and the interval time between two consecutive user requests in the user log. For example, if the first user request is "I want to see the action movies of Liu Dehua" and, one minute later, the second user request is "I said I want to see the action movies of Liu Dehua", it may be determined from the very high text similarity of the two requests and the short interval that the user was not satisfied with the decision result of the first request, and a negative user feedback reward may be set for that decision result.
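A minimal sketch of this log-based heuristic, assuming an illustrative similarity measure (difflib) and threshold/time-window values that the patent leaves open:

```python
import difflib

def implicit_feedback_reward(prev_req, next_req, interval_seconds,
                             sim_threshold=0.8, time_window=120):
    """Infer a user feedback reward for the previous decision result from
    two consecutive requests in the user log: a near-identical repeat
    shortly afterwards suggests the previous result was unsatisfactory.
    Threshold and window values here are illustrative assumptions."""
    similarity = difflib.SequenceMatcher(None, prev_req, next_req).ratio()
    if similarity >= sim_threshold and interval_seconds <= time_window:
        return -1.0   # negative reward: the user had to repeat the request
    return 0.0        # no evidence either way

print(implicit_feedback_reward(
    "i want to see the action movies of Liu Dehua",
    "i said i want to see the action movies of Liu Dehua",
    60))
# -1.0
```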
To improve the accuracy of the decision engine's decisions, the decision engine may perform policy learning based on the DDPG (Deep Deterministic Policy Gradient) algorithm. Referring to FIG. 6, the decision engine may adopt an actor-critic network architecture and incorporate an experience pool to break the correlation between data and improve policy stability. The actor calculates a_t = α according to the state and performs the action a_t, i.e., outputs a decision result for the user request; the critic supervises whether the actor's action is correct and reasonable, playing the role of a referee. The actor and the critic each have a target network and a policy network of identical structure, but the parameters are updated asynchronously: the target network is updated from the policy network's parameters in a soft-update manner. The actor's policy network θ^μ is updated by the actor's optimizer according to the gradient, yielding the updated policy network θ^μ'; the critic's network is the Q network θ^Q, and the critic's optimizer works on the residual of the Q network. This residual is also passed to the actor's optimizer for gradient adjustment.
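A minimal PyTorch sketch of the soft-update relationship between the policy and target networks described above; τ, the layer sizes, and the network names are illustrative assumptions rather than the patent's configuration:

```python
import torch
import torch.nn as nn

def soft_update(target_net, policy_net, tau=0.005):
    """theta_target <- tau * theta_policy + (1 - tau) * theta_target,
    the asynchronous soft update used for both actor and critic."""
    for t_param, p_param in zip(target_net.parameters(),
                                policy_net.parameters()):
        t_param.data.copy_(tau * p_param.data + (1.0 - tau) * t_param.data)

state_dim, n_modules = 32, 3  # hypothetical feature-vector and module counts
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_modules), nn.Softmax(dim=-1))
actor_target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, n_modules), nn.Softmax(dim=-1))
actor_target.load_state_dict(actor.state_dict())  # start from identical weights

soft_update(actor_target, actor)  # called after each policy-network update
```

The Softmax output mirrors the ranking-score property above: one score per service module, each in [0, 1], summing to 1.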
In some embodiments, the environment of fig. 6 may include external factors to the decision process, such as user requests entered by a user.
In some embodiments, policy learning may include offline policy learning. FIG. 7 is a flow diagram of an offline policy learning method according to some embodiments; as shown in FIG. 7, the method may include steps S101-S104.
Step S101: and randomly selecting a user request from the training sample, and analyzing the user request to obtain a state corresponding to the user request.
The training sample may include a plurality of different user requests, and after one user request is randomly selected, the user request is parsed to obtain a feature vector of the user request, and the feature vector is used as a state corresponding to the user request.
Step S102: and making action according to the state, and calculating the comprehensive rewards of the state.
The state is input into the decision engine, which makes internal and external actions according to the state, obtains the action reward from the internal action, and obtains the user feedback reward from the user feedback action. In offline policy learning, the user feedback action can be a manually labeled reward: a human judges whether the external action is accurate and, if so, marks the service module in the decision result with a positive user feedback reward, otherwise with a negative one, so that the user feedback reward is given in a supervised-learning manner. Comprehensive reward = action reward + user feedback reward.
Further, after the comprehensive reward is calculated, the next user request is obtained and parsed to obtain its state.
Step S103: the [ state, action, rewind, next_state ] is stored into the experience pool.
state is the state obtained in step S101, action is the external action in step S102, rewind is the integrated bonus, and next_state is the state requested by the next user.
Step S104: and according to the overflow of the experience pool, taking out the selected user request from the experience pool to train the network of the actor and the critic until the termination condition is met.
In some embodiments, the experience pool is preset with a buffer size. During training, when the amount of data in the experience pool exceeds the buffer size, storing a new quadruple removes the earliest stored quadruple, ensuring that the total number of entries never exceeds the buffer size. The termination condition of training may be set such that the total number of training iterations exceeds a preset maximum. In some embodiments, policy learning may also include online policy learning, which can follow the offline policy learning process except that the user feedback reward is obtained by analyzing the text similarity and interval time between two consecutive user requests in the user log.
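A minimal sketch of such a fixed-capacity experience pool; the buffer and batch sizes are illustrative, and deque eviction stands in for the removal of the earliest quadruple:

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity FIFO pool of (state, action, reward, next_state)
    quadruples; storing beyond capacity evicts the oldest entry."""
    def __init__(self, buffer_size=10000):
        self.buffer = deque(maxlen=buffer_size)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size=64):
        # random sampling breaks the correlation between consecutive requests
        return random.sample(list(self.buffer), batch_size)
```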
FIG. 8 is a schematic diagram of policy learning according to some embodiments. As shown in FIG. 8, after the experience pool is full, i.e., the experience pool overflows, the agent can be trained using the rewards. The actions output by the decision engine include direct output (inform), asking back (select) and default (default); the user may accept or reject a direct-output or default decision, or provide supplementary information for a select decision, after which feature extraction yields the next state, i.e., a result corrected according to the supplementary information and the previous state. In offline policy learning, the reward includes the action reward for the internal action and the manually judged positive user feedback reward (accept) or negative user feedback reward (deny); accept and deny can each be preset with a score, for example in a range of 1 to 5 points, which can of course be adjusted. In online policy learning, when the action is select, the user feedback reward is obtained through the additional user request, and the feature vector is enriched according to the additional user request to obtain the next state.
FIG. 9 is a schematic diagram of user request processing according to some embodiments. As shown in FIG. 9, after a user request is input, the server may parse it through a plurality of service modules, such as service module A, service module B, ..., service module N.
The user request is parsed to obtain a state, and the state is input into the reinforcement-learning decision engine to obtain a decision action, i.e., the action described herein; the decision action and the state are stored correspondingly as decision data. The output of the service modules can then be optimized according to the user's feedback actions and the reward network's evaluation of the decision action, where the reward network can be obtained through offline training and the user actions can be collected online.
Based on the above technical solution, the present application provides a voice interaction method for a display device, referring to fig. 10, the voice interaction method may include the following steps:
step S201: and responding to the received first voice control signal input by the user, starting the round of decision and sending the first voice control signal to the server.
In some embodiments, the display device initiates a round of decision after receiving the first voice control signal, during which the voice assistant application remains in the awake state.
In some embodiments, the voice assistant application is initially in a non-awake state, and the user may wake it into the awake state, for example by clicking the voice assistant application icon on the display device or issuing a preset voice command; of course, other wake-up methods are possible and are not enumerated here.
When the voice assistant application is in the awake state, the user may input a first voice control signal to the display device to search for media assets. The first voice control signal may include a user utterance, such as "Mayor of Wuhan".
In some embodiments, the display device may send the first voice control signal to the server, and the server makes a decision according to the first voice control signal to obtain a first decision result, and returns the first decision result to the display device. The first decision result may be provided with an action type, such as select, inform or default.
Step S202: and receiving and outputting a first decision result from the server, wherein the first decision result is provided with an action type.
Depending on the content of the first decision result, it may be output by displaying it on the display, playing it through the speaker, or displaying it on the display while synchronously playing it through the speaker.
Step S203: and receiving a second voice control signal input by a user according to the action type of the first decision result as selection, and sending the second voice control signal to a server.
When the action type of the first decision result is select, the server cannot accurately judge the user intention and needs the user to input further information; in this case the first decision result may be a selection question, for example, "Encyclopedia information or related news?".
The user may input a second voice control signal to the display device according to the first decision result. For example, the user may input "encyclopedia" as the second voice control signal.
The display device may forward the second voice control signal to the server; the server combines the second voice control signal with the first voice control signal to obtain a second decision result and returns it to the display device. In some embodiments, the action type of the second decision result is inform.
Step S204: and receiving and outputting a second decision result from the server, and ending the round of decision.
In some embodiments, depending on its content, the second decision result may likewise be output by displaying it on the display, playing it through the speaker, or both synchronously.
After outputting the second decision result, the display device may cause the voice assistant application to enter the non-awake state.
In some embodiments, the present round of decision is ended when the action type of the first decision result is inform or default.
In some embodiments, if the action type of the first decision result is inform or default, the present round of decision may be ended, and the voice assistant application enters the non-awake state.
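Summarizing steps S201-S204 on the display-device side, the round of decision reduces to a small dispatch on the action type. The helpers below (send_to_server, output, listen) are hypothetical stand-ins for device I/O, not APIs from the patent:

```python
def run_decision_round(first_signal, send_to_server, output, listen):
    """One round of decision on the display device, dispatching on the
    action type of the first decision result (helper functions are
    hypothetical stand-ins for device I/O)."""
    result = send_to_server(first_signal)          # first decision result
    output(result)                                 # display and/or speak it
    if result["action_type"] == "select":
        second_signal = listen()                   # user picks a candidate
        second_result = send_to_server(second_signal)
        output(second_result)
    # for "inform" and "default" the round ends after the first output;
    # in all cases the voice assistant then returns to the non-awake state
```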
Corresponding to the above voice interaction method, the embodiment of the present application further provides a voice interaction method, which is used for a server, and referring to fig. 11, the voice interaction method may include the following steps:
step S301: and analyzing the first voice control signal from the display equipment to obtain a first feature vector.
In some embodiments, the server performs voice-to-text conversion on the first voice control signal to obtain a first user request, and then parses the first user request.
In some embodiments, the server may be provided with a plurality of service modules, such as an encyclopedia module, a news module, a video module, and others, and each service module may parse the first user request to obtain an analysis result. In general, the analysis result of one particular service module will conform to the user intention. To identify that service module, the server may perform multidimensional analysis on the first user request to obtain a plurality of feature components; the analysis process, referring to FIG. 12, includes steps S401 to S404.
Step S401: and respectively analyzing the first voice control signal in each service module to obtain the user request characteristic component.
In some embodiments, the server may select a plurality of service modules to parse the first user request, obtaining data such as the probability that the first user request belongs to a given service module and the confidence of the result matched in that module, and generate the user request feature component from these probabilities and confidences.
For example, assuming the first user request is "Mayor of Wuhan", parsing it under the "encyclopedia" service module yields (p1, confidence1), i.e., the probability that the first user request belongs to the encyclopedia module is p1 and the confidence of the media resource matched in the encyclopedia module is confidence1; parsing it under the "news" service module yields (p2, confidence2), i.e., the probability that it belongs to the news module is p2 and the confidence of the media resource matched in the news module is confidence2. The user request feature component is then [p1, confidence1, p2, confidence2].
Step S402: and analyzing the historical analysis results of the first voice control signal in each service module to obtain the media seniority history characteristic component.
In some embodiments, the server has previously processed the first user request and may obtain the media asset history feature component from the historical analysis results of the first user request at each service module. A historical analysis result may include the historical media assets found by each service module for the first user request and the historical click-through rates of those media assets, and the media asset history feature component may include each historical media asset and its corresponding historical click-through rate.
Step S403: and analyzing the historical media asset behaviors of the user in each service module to obtain the historical characteristic components of the user.
In some embodiments, the server stores the user's historical media asset behaviors on the display device and may analyze the user's preferences and behavioral habits from them. The historical media asset behaviors may include the historical request records over time windows such as 10 minutes, 1 hour, and 24 hours before the user issues the first voice control signal, as well as the service modules to which the media assets the user has shown interest in belong; the user history feature component may include these historical request records and service modules.
Step S404: and generating a first feature vector according to the user request feature component, the media asset history feature component and the user history feature component.
In some embodiments, steps S401 to S403 yield feature components of the first user request in multiple dimensions, and these feature components are spliced together in turn to obtain the first feature vector, i.e., the state corresponding to the first user request.
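A minimal sketch of the splicing in step S404, with the feature components simplified to flat lists of numbers; the example values are illustrative only:

```python
def build_state(user_request_features, asset_history_features,
                user_history_features):
    """Concatenate the per-dimension feature components (steps S401-S403)
    into the first feature vector, i.e. the state for the decision engine."""
    return user_request_features + asset_history_features + user_history_features

# e.g. [p1, confidence1, p2, confidence2] from the service modules,
# followed by hypothetical click-through rates and recent-request counts
state = build_state([0.6, 0.9, 0.3, 0.7], [0.12, 0.05], [3, 7, 15])
print(state)  # [0.6, 0.9, 0.3, 0.7, 0.12, 0.05, 3, 7, 15]
```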
Step S302: and calculating a first sorting score of each service module according to the first feature vector, comparing a plurality of the first sorting scores to obtain a first decision result, and sending the first decision result to a display device, wherein the first decision result comprises an action type.
In some embodiments, the server may be provided with a pre-trained decision engine; the first feature vector is input into the decision engine as the state, so that the decision engine acts according to the state.
The actions of the decision engine may include internal actions and external actions. The internal action may be set to calculate the first ranking score αi of the first user request belonging to each service module, obtaining a set of first ranking scores α = [α1, α2, ..., αn], where n is the total number of service modules; each αi can take any value from 0 to 1, determined by the decision engine's analysis of the first feature vector. The internal action is a continuous action that runs until first ranking scores of the first user request have been obtained for all service modules.
Further, an action reward of the first decision result may be obtained from the largest first ranking score, and the size of the action reward may be equal to that largest first ranking score.
The external action may be an output action of the decision engine. The decision engine can analyze α according to a preset rule to obtain the action type of the output action, and output the corresponding external action according to that action type.
In some embodiments, the preset rule may be: if there are multiple first ranking scores greater than a preset threshold and the differences among them are smaller than a preset range, the action type is select and the service modules corresponding to those scores are the alternative service modules; if at least one alternative service module exists and the difference between the first ranking scores of the two largest alternative service modules is greater than the preset range, the action type is inform; if only one alternative service module exists, the action type is inform; if no first ranking score is greater than the preset threshold, the action type is default.
For example, with the preset threshold set to 0.5 and the preset range set to 0.1: for the first user request "Mayor of Wuhan", the first ranking score of the encyclopedia module is 0.8, that of the news module is also 0.8, and that of the movie module is 0.1; the scores of the encyclopedia and news modules are both greater than 0.5 and their difference is less than 0.1, so the action type of the external action is select and the encyclopedia and news modules are set as the options. For the first user request "I want to watch Let the Bullets Fly", the first ranking scores are 0.4 (encyclopedia), 0.3 (news) and 0.9 (movie), so the action type of the external action is inform. For the first user request "I want to watch Let the Bullets Fly 2", the first ranking scores are 0.1 (encyclopedia), 0.2 (news) and 0.4 (movie), so the action type of the external action is default.
In some embodiments, when the action type is inform, the first decision result may be the analysis result of the alternative service module; when the action type is select, the first decision result may be a feedback statement containing the two alternative service modules, for example "Encyclopedia information or related news?", or a feedback statement containing the alternative service modules and the first user request, for example "Do you want encyclopedia information about the Mayor of Wuhan, or related news?"; when the action type is default, the first decision result may include a preset feedback statement, such as "No relevant information was found".
Step S303: and analyzing the second voice control signal and the first voice control signal from the display equipment according to the action type of the first decision result as selection to obtain a second feature vector.
In some embodiments, the second voice control signal may be a signal specifying one of the alternative service modules, for example a signal specifying the "encyclopedia" service module.
The server parses the second voice control signal to obtain a second user request, and supplements the first user request with it to obtain a revised user request. For example, given the second user request "encyclopedia information", the texts of the second and first user requests are spliced according to Chinese grammar, yielding the revised user request "encyclopedia information about the Mayor of Wuhan".
The server then parses the revised user request to obtain the second feature vector. The parsing process for the revised user request may refer to that of the first user request.
Step S304: and calculating a second sorting score of each service module according to the second feature vector, comparing a plurality of second sorting scores to obtain a second decision result, and sending the second decision result to a display device.
The decision engine of the server calculates the second ranking score of each service module according to the second feature vector; the calculation process of the second ranking score may refer to that of the first ranking score.
A second decision result is obtained from the second ranking scores. Because the revised user request carries more information than the first user request, the probability that the action type of the second decision result is inform is greatly increased. For example, the second decision result may be the analysis result of the revised user request under the "encyclopedia" service module.
The second decision result is sent to the display device, so that the user can obtain it through the display device.
Further, an action reward of the second decision result is obtained from the maximum of the second ranking scores, a user feedback reward is obtained from the similarity of the first and second voice control signals, and a comprehensive reward is generated from the user feedback reward and the action reward. The decision module of the server may optimize the calculation method of the first ranking score according to the comprehensive reward, i.e., optimize the decision engine through online policy learning, thereby adjusting the first ranking score. The server's optimization of the decision engine may be performed at irregular intervals, with the timing determined by the business requirements of the voice assistant application. In addition to online policy learning, the server may perform offline policy learning and optimize the decision engine by integrating the results of both.
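A minimal sketch of the reward composition described above, following "comprehensive reward = action reward + user feedback reward" from the offline learning step; the function name and example scores are illustrative:

```python
def comprehensive_reward(ranking_scores, feedback_reward):
    """Comprehensive reward = action reward + user feedback reward,
    where the action reward equals the largest ranking score."""
    action_reward = max(ranking_scores)
    return action_reward + feedback_reward

# e.g. second-round scores combined with a negative implicit-feedback signal
print(comprehensive_reward([0.1, 0.2, 0.9], -1.0))
# ~ -0.1 (a 0.9 action reward offset by the negative user feedback)
```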
As can be seen from the above embodiments, after receiving the first voice control signal input by the user, if it is difficult to accurately determine the user's intention, the embodiments of the present application interact with the user so that the user inputs a second voice control signal, and the user's intention is then comprehensively analyzed from the second and first voice control signals, improving the accuracy of the decision result. Furthermore, policy learning with action rewards and user feedback rewards reduces the number of interactions with the user while guaranteeing the accuracy of decision results, further improving the user experience.
Since the foregoing embodiments reference one another in their descriptions, the identical and similar parts among the embodiments in this specification may be referred to mutually and are not described in detail again here.
It should be noted that in this specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, the statement "comprising a ..." does not exclude the presence of other identical elements in the circuit structure, article, or apparatus that comprises the element.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure of the application herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims. The above embodiments of the present application do not limit the scope of the present application.

Claims (8)

1. A display device, characterized by comprising:
a display;
an audio collection device configured to collect user input audio;
a controller coupled to the display and the audio acquisition device, the controller configured to:
responding to a first voice control signal input by a user, starting a round of decision, and sending the first voice control signal to a server;
receiving and outputting a first decision result from the server, wherein the first decision result is provided with an action type; the server obtains the action type as select according to the existence of a plurality of alternative service modules whose first ranking scores differ by less than a preset range, and generates the first decision result according to the alternative service modules and the action type, the alternative service modules being service modules whose first ranking score is greater than a preset threshold; obtains the action type as inform according to the existence of one alternative service module, and generates the first decision result according to the analysis result of that alternative service module; obtains the action type as default according to the absence of alternative service modules, and generates the first decision result according to a preset feedback statement; and obtains the action type as inform according to the difference between the first ranking scores of the plurality of alternative service modules being greater than the preset range, and generates the first decision result according to the alternative service module with the largest first ranking score;
receiving, according to the action type of the first decision result being select, a second voice control signal input by the user, and sending the second voice control signal to the server;
and receiving and outputting a second decision result from the server, and ending the round of decision.
2. The display device of claim 1, wherein the controller is further configured to:
and ending the round of decision according to the action type of the first decision result being inform or default.
3. The display device of claim 1, wherein receiving and outputting the first decision result from the server comprises:
and receiving and controlling the display to display the first decision result, wherein the first decision result comprises a plurality of alternative service modules for the user to select.
4. A server, wherein the server is configured to:
analyzing a first voice control signal from a display device to obtain a first feature vector;
calculating a first ranking score of each service module according to the first feature vector, comparing the plurality of first ranking scores to obtain a first decision result, and sending the first decision result to the display device, wherein the first decision result comprises an action type; the server obtains the action type as select according to the existence of a plurality of alternative service modules whose first ranking scores differ by less than a preset range, and generates the first decision result according to the alternative service modules and the action type, the alternative service modules being service modules whose first ranking score is greater than a preset threshold; obtains the action type as inform according to the existence of one alternative service module, and generates the first decision result according to the analysis result of that alternative service module; obtains the action type as default according to the absence of alternative service modules, and generates the first decision result according to a preset feedback statement; and obtains the action type as inform according to the difference between the first ranking scores of the plurality of alternative service modules being greater than the preset range, and generates the first decision result according to the alternative service module with the largest first ranking score;
analyzing, according to the action type of the first decision result being select, a second voice control signal and the first voice control signal from the display device to obtain a second feature vector;
and calculating a second ranking score of each service module according to the second feature vector, comparing the plurality of second ranking scores to obtain a second decision result, and sending the second decision result to the display device.
5. A voice interaction method for a display device, comprising:
responding to a first voice control signal input by a user, starting a round of decision, and sending the first voice control signal to a server;
receiving and outputting a first decision result from the server, wherein the first decision result is provided with an action type; the server obtains the action type as select according to the existence of a plurality of alternative service modules whose first ranking scores differ by less than a preset range, and generates the first decision result according to the alternative service modules and the action type, the alternative service modules being service modules whose first ranking score is greater than a preset threshold; obtains the action type as inform according to the existence of one alternative service module, and generates the first decision result according to the analysis result of that alternative service module; obtains the action type as default according to the absence of alternative service modules, and generates the first decision result according to a preset feedback statement; and obtains the action type as inform according to the difference between the first ranking scores of the plurality of alternative service modules being greater than the preset range, and generates the first decision result according to the alternative service module with the largest first ranking score;
receiving, according to the action type of the first decision result being select, a second voice control signal input by the user, and sending the second voice control signal to the server;
and receiving and outputting a second decision result from the server, and ending the round of decision.
6. A voice interaction method for a server, comprising:
analyzing a first voice control signal from a display device to obtain a first feature vector;
calculating a first ranking score of each service module according to the first feature vector, comparing the plurality of first ranking scores to obtain a first decision result, and sending the first decision result to the display device, wherein the first decision result comprises an action type; the server obtains the action type as select according to the existence of a plurality of alternative service modules whose first ranking scores differ by less than a preset range, and generates the first decision result according to the alternative service modules and the action type, the alternative service modules being service modules whose first ranking score is greater than a preset threshold; obtains the action type as inform according to the existence of one alternative service module, and generates the first decision result according to the analysis result of that alternative service module; obtains the action type as default according to the absence of alternative service modules, and generates the first decision result according to a preset feedback statement; and obtains the action type as inform according to the difference between the first ranking scores of the plurality of alternative service modules being greater than the preset range, and generates the first decision result according to the alternative service module with the largest first ranking score;
analyzing, according to the action type being select, a second voice control signal and the first voice control signal from the display device to obtain a second feature vector;
and calculating a second ranking score of each service module according to the second feature vector, comparing the plurality of second ranking scores to obtain a second decision result, and sending the second decision result to the display device.
7. The method of claim 6, wherein parsing the first voice control signal from the display device to obtain a first feature vector comprises:
analyzing the first voice control signal in each service module respectively to obtain a user request feature component;
analyzing the historical analysis results of the first voice control signal in each service module to obtain a media asset history feature component;
analyzing the user's historical media asset behaviors in each service module to obtain a user history feature component;
and generating a first feature vector according to the user request feature component, the media asset history feature component and the user history feature component.
8. The voice interaction method according to claim 6, further comprising:
obtaining an action reward of the first decision result according to the maximum value of the first ranking scores;
obtaining an action reward of the second decision result according to the maximum value of the second ranking scores;
and obtaining a user behavior feedback reward according to the similarity of the first voice control signal and the second voice control signal.