CN113066491A - Display device and voice interaction method

Info

Publication number: CN113066491A
Application number: CN202110291989.6A
Authority: CN (China)
Prior art keywords: text, display, target, interface, display effect
Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 张大钊, 王冰
Current and original assignee: Hisense Visual Technology Co Ltd

Application filed by Hisense Visual Technology Co Ltd
Priority to CN202110291989.6A
Publication of CN113066491A
Priority to PCT/CN2021/134357 (WO2022193735A1)
Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Abstract

The embodiments of the present application provide a display device and a voice interaction method. The display device comprises a display for presenting a user interface, and a controller connected with the display, the controller being configured to: receive a voice instruction input by a user; in response to the voice instruction, acquire response data corresponding to the voice instruction; when the response data comprise audio data and display data, generate a response interface according to the display data, and match the text corresponding to the audio data against the graphic-text objects on the response interface to obtain a matched reference text and target graphic-text; control the display to present the response interface, and control an audio output device connected with the display to play the audio corresponding to the audio data; and, while the reference text is being played, update the display effect of the target graphic-text on the response interface so that it differs from the display effect before the reference text was played. This solves the technical problem of a poor voice interaction experience.

Description

Display device and voice interaction method
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a display device and a voice interaction method.
Background
As televisions become smarter, a modern television can support voice control in addition to conventional remote controller control. The user can input a piece of speech to the television; the television recognizes text from the speech, queries the semantics of the text over the network, and responds according to a preset mapping between semantics and the services of the display device. For example, if the speech input by the user to the television is a query sentence, the display device responds by displaying the answer corresponding to the query sentence and reading the answer aloud. However, in the related art, the process of reading the answer aloud and the process of displaying the answer are independent of each other, so the user must split attention between listening to the spoken answer and viewing the displayed answer, and the voice interaction experience is poor.
Disclosure of Invention
To solve the technical problem of a poor voice interaction experience, the present application provides a display device and a voice interaction method.
In a first aspect, the present application provides a display device comprising:
a display for presenting a user interface;
a controller connected with the display, the controller configured to:
receive a voice instruction input by a user;
in response to the voice instruction, acquire response data corresponding to the voice instruction;
when the response data comprise audio data and display data, generate a response interface according to the display data, and match the text corresponding to the audio data against the graphic-text objects on the response interface to obtain a matched reference text and target graphic-text, wherein the reference text belongs to the text corresponding to the audio data, and the target graphic-text belongs to the graphic-text objects;
control the display to display the response interface, and control an audio output device connected with the display to play the audio corresponding to the audio data;
and, while the reference text is being played, update the display effect of the target graphic-text on the response interface so that the display effect of the target graphic-text differs from its display effect before the reference text was played.
In some embodiments, the controller is further configured to:
after the reference text has been played, restore the display effect of the target graphic-text so that it is the same as its display effect before the reference text was played.
In some embodiments, the controller is further configured to:
after the reference text has been played, update the display effect of the target graphic-text so that it differs both from its display effect before the reference text was played and from its display effect while it was being played.
In some embodiments, matching the text corresponding to the audio data against the graphic-text objects on the response interface to obtain a matched reference text and target graphic-text comprises:
splitting the text corresponding to the audio data into a plurality of character groups; and
matching each character group against the text on the response interface; if the matching succeeds, determining the character group as a reference text and determining the matched text on the response interface as a target text, wherein the graphic-text objects on the response interface comprise the text on the response interface, and the target graphic-text comprises the target text.
In some embodiments, matching the text corresponding to the audio data against the graphic-text objects on the response interface to obtain a matched reference text and target graphic-text comprises:
splitting the text corresponding to the audio data into a plurality of character groups; and
matching each character group against the graphics on the response interface; if the matching succeeds, determining the character group as a reference text and determining the matched graphic on the response interface as a target graphic, wherein the graphic-text objects on the response interface comprise the graphics on the response interface, and the target graphic-text comprises the target graphic.
In a second aspect, the present application provides a voice interaction method, including:
receiving a voice instruction input by a user;
in response to the voice instruction, acquiring response data corresponding to the voice instruction;
when the response data comprise audio data and display data, generating a response interface according to the display data, and matching the text corresponding to the audio data against the graphic-text objects on the response interface to obtain a matched reference text and target graphic-text, wherein the reference text belongs to the text corresponding to the audio data, and the target graphic-text belongs to the graphic-text objects;
controlling a display to display the response interface, and controlling an audio output device to play the audio corresponding to the audio data; and
while the reference text is being played, updating the display effect of the target graphic-text on the response interface so that the display effect of the target graphic-text differs from its display effect before the reference text was played.
The display device and the voice interaction method provided by the present application have the following advantageous effects:
After receiving the response data corresponding to a voice instruction, the display device provided by the present application can parse the response data. When the response data comprise audio data and display data, it generates a response interface according to the display data and detects on the response interface the target graphic-texts corresponding to the audio data, so that when the audio corresponding to a target graphic-text is broadcast, the display effect of that target graphic-text on the response interface is updated and differs from its display effect before broadcasting. The user can thus follow the current broadcast progress from the display effects on the response interface; the broadcast text of the voice broadcast is linked with the changes of the UI interface, and the user experience is improved.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It will be apparent to those skilled in the art that other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to some embodiments;
Fig. 2 is a block diagram of the hardware configuration of the control apparatus 100 according to some embodiments;
Fig. 3 is a block diagram of the hardware configuration of the display device 200 according to some embodiments;
Fig. 4 is a schematic diagram of the software configuration in the display device 200 according to some embodiments;
Fig. 5 is a schematic diagram of the principles of voice interaction according to some embodiments;
Fig. 6 is a schematic diagram of a voice interaction interface according to some embodiments;
Fig. 7 is a schematic diagram of a voice interaction interface according to some embodiments;
Fig. 8 is a schematic diagram of a voice interaction interface according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the exemplary embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described exemplary embodiments are only a part of the embodiments of the present application, not all of them.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through the smart device 300 or the control device 100.
In some embodiments, the control apparatus 100 may be a remote controller. Communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication, or other short-distance communication methods, and the remote controller controls the display device 200 wirelessly or by wire. The user may input user instructions through keys on the remote controller, voice input, control panel input, and the like to control the display apparatus 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device.
In some embodiments, the display device 200 may also be controlled in ways other than through the control apparatus 100 and the smart device 300. For example, the user's voice command may be received directly through a module configured inside the display device 200 to obtain voice instructions, or through a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 400 may provide various content and interactions to the display device 200. The server 400 may be one cluster or a plurality of clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction from a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200.
Fig. 3 shows a hardware configuration block diagram of the display apparatus 200 according to an exemplary embodiment.
In some embodiments, the display apparatus 200 includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments, the controller comprises a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, receives image signals output from the controller, and displays video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
In some embodiments, the display 260 may be a liquid crystal display, an OLED display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals with the external control apparatus 100 or the server 400 through the communicator 220.
In some embodiments, the user interface may be configured to receive control signals from the control apparatus 100 (e.g., an infrared remote control).
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. The interface may be a composite input/output interface formed by the plurality of interfaces.
In some embodiments, the tuner demodulator 210 receives broadcast television signals by wired or wireless reception, and demodulates audio/video signals and EPG data signals from among a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the tuner demodulator 210 may be located in separate devices; that is, the tuner demodulator 210 may also be located in a device external to the main device containing the controller 250, such as an external set-top box.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command for selecting a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other actionable control. The operations related to the selected object are: displaying an operation connected to a hyperlink page, document, image, or the like, or performing an operation of a program corresponding to the icon.
In some embodiments the controller comprises at least one of a Central Processing Unit (CPU), a video processor, an audio processor, a Graphics Processing Unit (GPU), a Random Access Memory (RAM), a ROM (Read-Only Memory), a first to nth interface for input/output, a communication Bus (Bus), and the like.
The CPU processor is used to execute operating system and application program instructions stored in the memory, and to execute various application programs, data, and content according to interactive instructions received from external input, so as to finally display and play various audio and video content. The CPU processor may include a plurality of processors, for example a main processor and one or more sub-processors.
In some embodiments, a graphics processor is used for generating various graphics objects, such as icons, operation menus, and graphics displayed for user input instructions. The graphics processor includes an arithmetic unit, which performs operations on the various interactive instructions input by the user and displays the various objects according to their display attributes, and a renderer for rendering the objects obtained by the arithmetic unit; the rendered objects are displayed on the display.
In some embodiments, the video processor is configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image synthesis according to the standard codec protocol of the input signal, so as to obtain a signal that can be displayed or played on the display device 200.
In some embodiments, the video processor includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module demultiplexes the input audio and video data stream. The video decoding module processes the demultiplexed video signal, including decoding and scaling. The image synthesis module superimposes and mixes the GUI signal, input by the user or generated by the graphics generator, with the scaled video image to generate an image signal for display. The frame rate conversion module converts the frame rate of the input video. The display formatting module converts the frame-rate-converted video output signal into a signal conforming to the display format, such as an output RGB data signal.
In some embodiments, the audio processor is configured to receive an external audio signal, decompress and decode the received audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processing to obtain an audio signal that can be played in the speaker.
In some embodiments, a user may enter user commands on a Graphical User Interface (GUI) displayed on display 260, and the user input interface receives the user input commands through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
In some embodiments, a system of a display device may include a Kernel (Kernel), a command parser (shell), a file system, and an application program. The kernel, shell, and file system together make up the basic operating system structure that allows users to manage files, run programs, and use the system. After power-on, the kernel is started, kernel space is activated, hardware is abstracted, hardware parameters are initialized, and virtual memory, a scheduler, signals and interprocess communication (IPC) are operated and maintained. And after the kernel is started, loading the Shell and the user application program. The application program is compiled into machine code after being started, and a process is formed.
Referring to fig. 4, in some embodiments, the system is divided into four layers, which are an Application (Applications) layer (abbreviated as "Application layer"), an Application Framework (Application Framework) layer (abbreviated as "Framework layer"), an Android runtime (Android runtime) and system library layer (abbreviated as "system runtime library layer"), and a kernel layer from top to bottom.
In some embodiments, at least one application program runs in the application program layer, and the application programs may be windows (windows) programs carried by an operating system, system setting programs, clock programs or the like; or an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an Application Programming Interface (API) and a programming framework for the applications. The application framework layer includes a number of predefined functions and acts as a processing center that decides the actions of the applications in the application layer. Through the API interface, an application can access system resources and obtain system services during execution.
As shown in fig. 4, in the embodiment of the present application, the application framework layer includes a manager (Managers), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used for interacting with all activities running in the system; the Location Manager (Location Manager) is used for providing the system service or application with the access of the system Location service; a Package Manager (Package Manager) for retrieving various information related to an application Package currently installed on the device; a Notification Manager (Notification Manager) for controlling display and clearing of Notification messages; a Window Manager (Window Manager) is used to manage the icons, windows, toolbars, wallpapers, and desktop components on a user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the various applications as well as general navigation fallback functions, such as controlling the exit, opening, and fallback of applications. The window manager is used to manage all window programs, for example obtaining the size of the display screen, judging whether there is a status bar, locking the screen, capturing the screen, and controlling changes to the display window (for example shrinking the display window, or displaying it shaking or distorted).
In some embodiments, the system runtime library layer provides support for the framework layer above it; when the framework layer is used, the Android operating system runs the C/C++ libraries included in the system runtime library layer to implement the functions required by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer includes at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (such as a fingerprint sensor, temperature sensor, or pressure sensor), power driver, and the like.
The hardware or software architecture in some embodiments may be based on the description in the above embodiments; in other embodiments it may be based on other hardware or software architectures similar to the above, as long as the technical solution of the present application can be implemented.
For clarity of explanation of the embodiments of the present application, a speech recognition network architecture provided by the embodiments of the present application is described below with reference to fig. 5.
Referring to fig. 5, fig. 5 is a schematic diagram of a speech recognition network architecture according to an embodiment of the present application. In fig. 5, the smart device is configured to receive input information and to output a processing result of that information. The speech recognition service device is an electronic device on which a speech recognition service is deployed, the semantic service device is an electronic device on which a semantic service is deployed, and the business service device is an electronic device on which a business service is deployed. These electronic devices may include servers, computers, and the like; the speech recognition service, the semantic service (also called a semantic engine), and the business service are web services that can be deployed on them. The speech recognition service recognizes audio as text, the semantic service performs semantic parsing of the text, and the business service provides a specific service, such as a weather query service (e.g., Moji Weather) or a music query service (e.g., QQ Music). In one embodiment, the architecture shown in fig. 5 may contain multiple entity service devices on which different business services are deployed, and one or more function services may also be aggregated in one or more entity service devices.
In some embodiments, the process of handling information input to the smart device based on the architecture shown in fig. 5 is described below, taking a query statement input by voice as the example input; the process may include the following three stages:
[ Speech recognition ]
The intelligent device can upload the audio of the query sentence to the voice recognition service device after receiving the query sentence input by voice, so that the voice recognition service device can recognize the audio as a text through the voice recognition service and then return the text to the intelligent device. In one embodiment, before uploading the audio of the query statement to the speech recognition service device, the smart device may perform denoising processing on the audio of the query statement, where the denoising processing may include removing echo and environmental noise.
[ semantic understanding ]
The intelligent device uploads the text of the query sentence recognized by the speech recognition service to the semantic service device, and the semantic service device performs semantic parsing on the text through the semantic service to obtain the business field, intention, and the like of the text.
[ semantic response ]
The semantic service device issues a query instruction to the corresponding business service device according to the semantic parsing result of the text of the query statement, so as to obtain the query result given by the business service. The intelligent device obtains the query result from the semantic service device and outputs it. As an embodiment, the semantic service device may also send the semantic parsing result of the query statement to the intelligent device, so that the intelligent device outputs a feedback statement contained in the semantic parsing result.
It should be noted that the architecture shown in fig. 5 is only an example and is not intended to limit the scope of the present application. In the embodiments of the present application, other architectures may also be adopted to implement similar functions; for example, all or part of the three stages may be completed by the intelligent terminal itself, which is not described herein.
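As a concrete illustration of the three stages above, the following Kotlin sketch shows one way a device might chain them. The class names, function types, and the denoised-audio input are assumptions made for illustration; they are not part of the architecture disclosed here.

    // Minimal sketch of the three-stage pipeline of fig. 5 (all names are assumptions).
    data class SemanticResult(val domain: String, val intent: String, val slots: Map<String, String>)

    class VoicePipeline(
        private val speechRecognitionService: (ByteArray) -> String,   // audio -> text
        private val semanticService: (String) -> SemanticResult,       // text -> semantics
        private val businessService: (SemanticResult) -> String        // semantics -> query result
    ) {
        fun handleQuery(denoisedAudio: ByteArray): String {
            val text = speechRecognitionService(denoisedAudio)  // [Speech recognition]
            val semantics = semanticService(text)               // [Semantic understanding]
            return businessService(semantics)                   // [Semantic response]
        }
    }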
In some embodiments, the intelligent device shown in fig. 5 may be a display device, such as a smart television, the functions of the speech recognition service device may be implemented by cooperation of a sound collector and a controller provided on the display device, and the functions of the semantic service device and the business service device may be implemented by the controller of the display device or by a server of the display device.
In some embodiments, a query statement or other interactive statement that a user enters a display device through speech may be referred to as a voice instruction.
In some embodiments, the display device obtains from the semantic service device the query result given by the business service; the display device may parse the query result to generate response data for the voice instruction, and then execute the corresponding action according to the response data. For example, the parsed query result may include a text carrying a broadcast identifier. The display device may then generate the response data according to a preset response rule. An exemplary response rule is: when a text with a broadcast identifier is acquired, generate on the voice interaction interface a dialog box containing the text corresponding to the broadcast data, and broadcast that text by voice. Accordingly, the display device can generate, following the preset response rule, response data that include UI interface data and broadcast data, where the UI interface corresponding to the UI interface data carries a dialog box containing the text of the query result, and the broadcast data include the text to be broadcast and a control instruction for invoking the audio playing device to play that text.
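As a rough sketch of how such response data might be modeled in code, the following data classes show one possible shape; the class and field names are assumptions for illustration only, not the structures used by the disclosed device.

    // One possible shape for the response data described above (names are assumptions).
    data class BroadcastData(
        val broadcastText: String,        // text that needs to be broadcast by voice
        val audioData: ByteArray? = null  // may be absent; the device can synthesize it from broadcastText
    )

    data class ResponseData(
        val uiInterfaceData: String?,      // display data used to generate the response interface
        val controlInstruction: String?,   // e.g., an instruction to increase the speaker volume
        val broadcastData: BroadcastData?  // absent when no voice broadcast is needed
    )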
In some embodiments, the display device acquires the semantic parsing result of the voice instruction from the semantic service device; the display device may parse this result to generate response data, and then execute the corresponding action according to the response data.
In some embodiments, the response data corresponding to the voice instruction include service type data but no broadcast data. Here, broadcast data may include audio data and the text corresponding to the audio data; this text may also be called the broadcast text, that is, the text that needs to be broadcast by voice. The broadcast data may also include only the broadcast text, in which case the display device may generate the corresponding audio data from the broadcast text. The service type data may include UI interface data and/or a control instruction for the display device, and the UI interface data may include the display data used to generate the response interface. This case typically occurs in voice scenes where the user issues a command to the display device. For example, when the voice instruction is a volume adjustment instruction to increase the volume, the response data may include UI interface data for displaying a volume bar and a control instruction for increasing the speaker volume. The display device then adjusts the volume and displays the volume bar according to the response data, without any voice broadcast.
In some embodiments, the response data corresponding to the voice instruction include both broadcast data and service type data, where the service type data may include UI interface data. This generally occurs in man-machine conversation scenes. For example, a user asks "how is the weather today"; in response, the display device needs to feed the query result back to the user in the form of a voice broadcast, so the response data corresponding to the voice instruction include broadcast data. When the voice instruction is an instruction to query the current weather, the response data may include UI interface data for displaying the current weather details and broadcast data comprising weather information such as temperature, wind, and humidity. The display device can then display the UI interface according to the response data and perform the voice broadcast.
However, in the related art, when the response data corresponding to the voice instruction include broadcast data and service type data, the two processes executed by the display device, displaying the UI interface and performing the voice broadcast, are independent of each other and not associated; the user has to associate the broadcast text of the voice broadcast with the UI interface by himself or herself, so the experience is not good enough.
To solve this technical problem, in some embodiments, after the display device obtains the response data of the voice instruction, if the response data include broadcast data, the display device generates a response interface according to the UI interface data, detects on the response interface the target graphic-texts corresponding to the broadcast text, displays the response interface, and performs the voice broadcast; when the audio corresponding to a target graphic-text is broadcast, the display effect of that target graphic-text on the response interface is updated, so that the user can see on the response interface what is currently being broadcast. This automatically associates the voice broadcast text with the UI interface and improves the user experience.
The following describes the above technical solution for associating the voice broadcast text with the response interface in detail by taking the voice interaction process between the user and the display device as an example.
In some embodiments, a voice control button may be disposed on the remote controller of the display device, and after the user presses the voice control button on the remote controller, the controller of the display device may control the display of the display device to display the voice interaction interface and control the sound collector, such as a microphone, to collect sound around the display device. At this time, the user may input a voice instruction to the display device.
In some embodiments, the display device may support a voice wake-up function, and the sound collector of the display device may be in a state of continuously collecting sound. After the user speaks the awakening word, the display device performs voice recognition on the voice instruction input by the user, and after the voice instruction is recognized to be the awakening word, the display of the display device can be controlled to display a voice interaction interface, and at the moment, the user can continue to input the voice instruction to the display device.
In some embodiments, after the user inputs a voice instruction, while the display device is acquiring the response data of the voice instruction or responding according to that response data, the sound collector of the display device can remain in the sound collection state. The user may press the voice control button on the remote controller at any time to re-input a voice instruction or speak the wake-up word; the display device then ends the previous voice interaction process and starts a new one according to the newly input voice instruction, which guarantees the real-time performance of voice interaction.
In some embodiments, when the current interface of the display device is the voice interaction interface, the display device performs voice recognition on the voice instruction input by the user to obtain the corresponding text. The display device itself, or a server of the display device, performs semantic understanding on the text to obtain the user intention, processes the user intention to obtain a semantic parsing result, and generates response data according to the semantic parsing result; this response data may be called initial response data. If the display device performed the voice broadcast and displayed the response interface directly according to the initial response data, the voice broadcast could end up independent of the response interface.
To avoid this situation, in some embodiments the display device may process the initial response data to obtain final response data, and respond according to the final response data, thereby associating the voice broadcast with the response interface. Of course, if the initial response data do not include broadcast data, the response can be performed directly according to the initial response data.
Taking as an example initial response data that include broadcast data, the process by which the display device processes the initial response data to obtain the final response data is described below.
In some embodiments, if the initial response data include broadcast data and UI interface data, a response interface may be generated according to the UI interface data, and the target graphic-texts corresponding to the broadcast text are then detected on the response interface. A target graphic-text is an object, such as a piece of text or a graphic, that is associated with the broadcast text and can be displayed specially, where special display means a display different from the display before broadcasting.
In some embodiments, the target graphic-text may include a target text. The content displayed on the response interface usually includes text identical to the broadcast text; for example, a dialog box on the response interface may display the same text as the broadcast text. A target text whose display effect changes with the broadcast progress can therefore be determined from the text displayed on the response interface.
To obtain the target text, the broadcast text can be split into at least two character groups according to preset splitting rules.
One exemplary splitting rule is: split the broadcast text into single Chinese characters, each Chinese character forming one character group. If a punctuation mark follows a Chinese character, the punctuation mark may be ignored, or written into the character group of that Chinese character, or written into the character group of the next Chinese character. For example, the broadcast text "你好!" ("hello!") can be split into two character groups, one being "你" and the other being "好!".
Another exemplary splitting rule is: split the broadcast text into words, with punctuation handled as above. For example, the broadcast text "你好，我叫小A。" ("Hello, my name is Xiao A.") can be split into four character groups: "你好，", "我", "叫", and "小A。".
Another exemplary splitting rule is: split the broadcast text into short clauses delimited by punctuation marks. For example, the broadcast text "你好，我叫小A。" can be split into two character groups: "你好" and "我叫小A。".
These are only exemplary splitting rules for the broadcast text; in actual implementations, other rules may be used.
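A minimal Kotlin sketch of the first and third splitting rules might look as follows; the punctuation set and function names are assumptions, and the word-level rule is omitted because it would additionally require a word segmenter.

    // Rule 1 sketch: one character per group, writing a following punctuation mark
    // into the character group of the preceding character.
    fun splitByCharacter(text: String): List<String> {
        val groups = mutableListOf<String>()
        for (ch in text) {
            if (ch in "，。！？、,.!?" && groups.isNotEmpty()) groups[groups.size - 1] += ch
            else groups.add(ch.toString())
        }
        return groups
    }

    // Rule 3 sketch: split into short clauses, using punctuation marks as delimiters
    // and keeping each mark attached to the clause before it.
    fun splitByClause(text: String): List<String> =
        text.split(Regex("(?<=[，。！？,.!?])")).filter { it.isNotBlank() }

For the broadcast text "你好，我叫小A。", splitByClause returns "你好，" and "我叫小A。"; whether clause-final punctuation is kept or dropped is a free choice of the rule, as noted above.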
In some embodiments, if the response interface includes text identical to the broadcast text, each character group of the broadcast text may be determined as a reference text, and the text on the response interface identical to that character group is determined as the target text of that reference text; that is, the reference text is text in the broadcast data, and the target text is text on the response interface.
In some embodiments, besides text identical to the broadcast text, the content displayed on the response interface may include other text that matches a character group, and such matching text may also be identified as a target text of that character group. For example, the broadcast text "Today's weather is cloudy." can be split into the character groups "today", "weather", "is", and "cloudy.". The response interface carries a voice interaction dialog box and today's weather details; the dialog box displays the same text as the broadcast text, "Today's weather is cloudy.", and today's weather details include the text "cloudy", which matches the character group. Therefore, "cloudy" in today's weather details can also be determined as a target text corresponding to the character group "cloudy.".
In some embodiments, the rule for text matching may be that the texts are identical, or that their meanings are identical, close, or related. For example, if a character group of the broadcast text is "3 to 8 degrees Celsius" and the weather details on the response interface contain the text "3-8℃", then "3-8℃" can be determined as a target text because its meaning is the same as that of "3 to 8 degrees Celsius".
In some embodiments, the target graphic-text corresponding to a reference text may include a target graphic. The response interface may display a plurality of graphics, some of which carry captions that can be matched against the broadcast text. For example, the broadcast text "Ultraviolet rays are weak; not suitable for fishing; suitable for indoor exercise." can be split into the character groups "ultraviolet", "weak", "not suitable", "fishing", "suitable", "indoor", and "exercise". The graphics displayed on the response interface may include a sun graphic and a fish graphic, with the caption "ultraviolet weak" beside the sun graphic and the caption "fishing fair" beside the fish graphic. According to the text matching rule, the caption "ultraviolet weak" is related to the content of the character group "ultraviolet", and the caption "fishing fair" is related to the content of the character group "fishing"; therefore, the sun graphic can be determined as a target graphic corresponding to the character group "ultraviolet", and the fish graphic as a target graphic corresponding to the character group "fishing".
Thus, after a piece of broadcast text has been split into a plurality of character groups, each character group, taken as a reference text, may correspond on the response interface to one target text or to several target texts. A character group may correspond to one target graphic or to none, and in some embodiments a character group may correspond to several target graphics.
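The matching step itself could be sketched as follows. Matching by substring containment is a deliberate simplification standing in for the "identical, close, or related in meaning" rule described above, and every name in this sketch is an assumption.

    // Sketch: match character groups (candidate reference texts) against the
    // graphic-text objects on the response interface.
    sealed class GraphicText {
        data class ScreenText(val content: String) : GraphicText()  // text displayed on the interface
        data class Graphic(val caption: String) : GraphicText()     // graphic with a caption
    }

    data class Match(val referenceText: String, val targets: List<GraphicText>)

    fun matchGroups(groups: List<String>, objects: List<GraphicText>): List<Match> =
        groups.mapNotNull { group ->
            val targets = objects.filter { obj ->
                when (obj) {
                    is GraphicText.ScreenText -> obj.content.contains(group)
                    is GraphicText.Graphic -> obj.caption.contains(group)
                }
            }
            if (targets.isEmpty()) null else Match(group, targets)
        }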
In some embodiments, the display device may number each character group to facilitate distinguishing different character groups.
After the reference texts are obtained, in order to ensure that the display device updates the display effect of each target graphic-text on time when the audio corresponding to its reference text is played, the display device needs to track the broadcast progress during the voice broadcast.
In some embodiments, during the voice broadcast the display device may collect sound through the microphone to capture the sound emitted by its own speaker, convert that sound into text, and match the converted text against the broadcast text to obtain the current broadcast progress; when the broadcast progress reaches a reference text, the corresponding target graphic-text is highlighted. The lag between the broadcast progress acquired in this way and the actual broadcast progress equals the display device's data processing time, so this method reflects the actual broadcast progress well and guarantees the accuracy of the highlighting.
However, this method of acquiring the broadcast progress requires the display device to process the sound emitted by the speaker in real time, which consumes considerable performance; when the computing capability of the display device is weak, the display device may stutter.
In some embodiments, the display device may instead calculate in advance the time needed to broadcast from the broadcast starting point to each reference text, where the broadcast starting point is the first character of the broadcast text; then, by timing from the start of the broadcast, the broadcast progress can be derived from the display device's voice broadcasting speed. Specifically:
In some embodiments, after the display device obtains the plurality of reference texts, it can calculate the broadcast time that will elapse from the starting point of the voice broadcast to each reference text.
In some embodiments, the display device may support broadcasting with different timbres during voice interaction, and the speech speed may differ slightly between timbres; for example, the supported timbres may include a female voice and a male voice, with the female voice faster and the male voice slower. The timbre of the display device may default to the female voice, or of course to the male voice. The user can set the timbre in advance, and the display device broadcasts with the timbre set by the user; if the user has not set a timbre, the display device broadcasts with the default timbre.
In some embodiments, the display device may determine its voice broadcasting speed according to the current timbre, calculate the time at which a reference text starts to be broadcast from the broadcasting speed and the character distance between the starting point of the text corresponding to the audio data and the starting point of the reference text, and calculate the time at which the reference text finishes being broadcast from the character length of the reference text and the broadcasting speed.
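Under this constant-speed assumption the timing computation reduces to simple arithmetic; the following sketch illustrates it, with all names and the millisecond-per-character unit being assumptions.

    // Sketch: when a reference text starts and finishes being broadcast, assuming a
    // constant per-character broadcasting speed for the current timbre.
    data class BroadcastWindow(val startMs: Long, val endMs: Long)

    fun broadcastWindow(
        charDistanceFromStart: Int, // characters from the broadcast starting point to the reference text
        referenceLength: Int,       // character length of the reference text
        msPerChar: Long             // derived from the voice broadcasting speed of the current timbre
    ): BroadcastWindow {
        val startMs = charDistanceFromStart * msPerChar
        return BroadcastWindow(startMs, startMs + referenceLength * msPerChar)
    }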
In some embodiments, after calculating the time at which a reference text starts to be broadcast and the time at which it finishes, the display device may adjust the UI interface data so that when the broadcast progress reaches the reference text, the display effect of the corresponding target graphic-text is updated according to a preset display rule and differs from the display effect before broadcasting. From the change of display effect on the response interface, the user then knows that the content currently being broadcast corresponds to the target graphic-text whose display effect changed, and can clearly confirm the current broadcast progress.
In some embodiments, the UI interface data may also be adjusted so that after a reference text has been broadcast, the display effect of its target graphic-text remains unchanged until the next reference text starts to be broadcast; at that point the display effect of the previous target graphic-text is restored and the display effect of the target graphic-text corresponding to the next reference text is updated. A target graphic-text then has at least two display effects, corresponding to the two states "not yet broadcast, or the broadcast has moved on to the next target graphic-text" and "being broadcast, or broadcast but the next target graphic-text has not yet been broadcast", so the user can clearly confirm the current broadcast progress.
In some embodiments, the UI interface data may further be adjusted so that after a reference text has been broadcast, the display effect of its target graphic-text is updated again, making it different both from the display effect before the reference text was played and from the display effect while the reference text was being played. A target graphic-text then has at least three display effects, corresponding to the three states "not broadcast", "broadcasting", and "broadcasted", so the user can clearly confirm the current broadcast progress.
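The two- and three-state schemes above can be captured by a small state function; this sketch reuses the hypothetical BroadcastWindow type from the timing example.

    // Sketch: the broadcast states a target graphic-text may pass through.
    enum class BroadcastState { NOT_BROADCAST, BROADCASTING, BROADCASTED }

    fun stateAt(elapsedMs: Long, window: BroadcastWindow): BroadcastState = when {
        elapsedMs < window.startMs -> BroadcastState.NOT_BROADCAST
        elapsedMs < window.endMs   -> BroadcastState.BROADCASTING
        else                       -> BroadcastState.BROADCASTED
    }

In the two-effect scheme, NOT_BROADCAST and BROADCASTED would simply map to the same display effect; in the three-effect scheme, each state maps to its own.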
In some embodiments, if the target graphic-text is a target text, the display effect may be updated by changing the color of the target text so that it differs from the previous color. The color of the target text may differ in each of the above states. The text color corresponding to "broadcasting" may contrast relatively strongly with the background color, while the text colors corresponding to the "not broadcast" and related states may contrast with the background relatively little, so that the user can easily grasp the latest broadcast progress; and the text color of already broadcast content may contrast with the background more strongly than that of content not yet broadcast, making it convenient for the user to distinguish the two.
The color change thus achieves the effect of emphasizing the target text, and different colors can be regarded as corresponding to different emphasis levels: the color difference between the text color of a high emphasis level and the background color can be relatively large, and that of a low emphasis level relatively small.
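A color mapping along these lines might look like the following; the concrete ARGB values are arbitrary assumptions chosen only to exhibit the contrast ordering described above.

    // Sketch: text color per broadcast state, with the currently broadcasting text
    // contrasting most strongly against a dark background.
    fun textColorFor(state: BroadcastState): Long = when (state) {
        BroadcastState.BROADCASTING  -> 0xFFFFE082 // high contrast: currently being broadcast
        BroadcastState.BROADCASTED   -> 0xFFBDBDBD // medium contrast: already broadcast
        BroadcastState.NOT_BROADCAST -> 0xFF757575 // low contrast: not yet broadcast
    }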
In some embodiments, if the target graphics context is the target graphics context, the method for updating the display effect of the broadcast target may be to adjust the position of the target graphics context on the response interface, so that the position of the target graphics context on the response interface is different from the position of the target graphics context before the reference text is played.
Therefore, the position conversion method achieves the effect of emphatically displaying the target graphics, different positions can be regarded as corresponding to different emphasis levels, the emphasis level of the target graphics corresponding to the broadcasted reference text is the highest, and the target graphics corresponding to the broadcasted reference text is the lowest in the broadcasted times and the unrebroadcasted times. A high emphasis level corresponds to a more prominent region, such as near the center region or upper region of the responsive interface, and a low emphasis level corresponds to a less prominent region, such as near the edge region or lower region of the responsive interface.
The color conversion emphasizing method may be applied to the target graphic, and the position change emphasizing method may be applied to the target text.
In some embodiments, if the broadcast target is a target image-text, the method of highlighting the broadcast target may further include: if the focus of the display device is not on the target image-text before the reference text is played, moving the focus of the display device to the target image-text when the reference text is played.
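A minimal sketch of this focus move, with hypothetical element identifiers standing in for real focusable views:

```java
public class FocusMoveDemo {
    private static String focusedId = "home-button"; // hypothetical current focus

    // Called when a reference text starts playing; moves the device focus onto
    // the corresponding target image-text if it is not already there.
    static void onReferenceTextStarted(String targetId) {
        if (!focusedId.equals(targetId)) {
            focusedId = targetId;
            System.out.println("focus moved to " + targetId);
        }
    }

    public static void main(String[] args) {
        onReferenceTextStarted("weather-card");
        onReferenceTextStarted("weather-card"); // already focused: no move
    }
}
```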
After the UI interface data is adjusted according to the emphasis display rules, the UI interface corresponding to the adjusted UI interface data may be a dynamic interface that changes with the broadcast progress, and the final response data may be obtained from the adjusted UI interface data together with the broadcast data.
In some embodiments, after the display device obtains the final response data, it may control the audio output device to start broadcasting the broadcast text, control the display to display the response interface, and update the display effect of the corresponding target image-text on the response interface as each broadcast target is broadcast.
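One way to keep the interface in step with playback is to schedule the display-effect updates at each reference text's estimated start time. A minimal sketch, assuming the start times have already been computed (as in the timing calculation of claim 5 below) and using hardcoded figures:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BroadcastSyncDemo {
    public static void main(String[] args) {
        // Hypothetical schedule: each reference text is highlighted at its
        // estimated start time while the audio plays (milliseconds from start).
        String[] references = { "Laoshan area", "today", "cloudy, 3 to 8 degrees Celsius" };
        long[] startMs = { 0, 800, 1400 };

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        for (int i = 0; i < references.length; i++) {
            final String ref = references[i];
            timer.schedule(
                () -> System.out.println("update display effect for: " + ref),
                startMs[i], TimeUnit.MILLISECONDS);
        }
        // Shut the timer down once the last update has fired.
        timer.schedule(timer::shutdown, 2000, TimeUnit.MILLISECONDS);
    }
}
```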
The audio output device may correspond to the audio output interface in fig. 3, which may include or be connected to a speaker and an external sound output terminal.
To further describe how the display interface changes during voice interaction, taking target texts as an example, fig. 6 to fig. 8 show schematic diagrams of voice interaction interfaces according to some embodiments, where each voice interaction interface is a response interface for a voice instruction.
Referring to fig. 6, the broadcast text may be the text shown in fig. 6, including: "Found the following jokes for you: writing class students must write a short story about the painting, including religion, royalty, … …". The display device may split the broadcast text into multiple character groups: the first character group may be "Found the following jokes for you", and each subsequent character group may contain a single character of the original Chinese text, i.e., the second character group is "write", the third is "do", the fourth is "class", and so on.
Each character group is used as a reference text; when a reference text is broadcast, the color of the target text corresponding to it can be changed. In fig. 6, the color of "writing class students must write a short story about the painting, including religion," is different from the color of the other text, indicating that the broadcast progress has reached "religion".
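A minimal sketch of this character-group split, assuming the rule "lead-in phrase first, then one character per group" (the grouping rule and example strings are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CharacterGroupDemo {
    // Split the broadcast text into character groups: the lead-in phrase becomes
    // the first group, then each remaining character becomes its own group.
    static List<String> split(String broadcastText, String leadIn) {
        List<String> groups = new ArrayList<>();
        groups.add(leadIn);
        String rest = broadcastText.substring(leadIn.length());
        for (int i = 0; i < rest.length(); ) {
            int next = rest.offsetByCodePoints(i, 1); // handles non-ASCII characters
            groups.add(rest.substring(i, next));
            i = next;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(split("Found the following jokes for you: ha ha",
                                 "Found the following jokes for you: "));
    }
}
```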
Referring to fig. 7, the broadcast text includes: "Laoshan area is cloudy today, 3 to 8 degrees Celsius, … …". The display device can split the broadcast text into a plurality of character groups: the first character group may be "Laoshan area", the second "today", the third "cloudy, 3 to 8 degrees Celsius", and so on.
Each character group is used as a reference text; when a reference text is broadcast, the color of the target text corresponding to it can be changed. In fig. 7, the color of "cloudy, 3 to 8 degrees Celsius" is different from the color of the other text, indicating that the broadcast progress is at "cloudy, 3 to 8 degrees Celsius".
Fig. 7 also contains another target text that matches the reference text: "Cloudy 3-8℃" in the bottom left corner of fig. 7 matches the reference text "cloudy, 3 to 8 degrees Celsius", so "Cloudy 3-8℃" can also be set as a target text, and when "cloudy, 3 to 8 degrees Celsius" is broadcast by voice, "Cloudy 3-8℃" can also be displayed with its color changed.
Fig. 8 is a schematic diagram of the interface of fig. 7 after an update. As shown in fig. 8, when the broadcast progress reaches "good air", the target text "air quality 62 (good air)" matching "good air" in the reference text may be displayed with its color changed.
As can be seen from fig. 7 and fig. 8, as different reference texts are broadcast, different target texts can be highlighted in turn to prompt the user with the current broadcast progress.
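Since an on-screen text may phrase the same information differently from the broadcast text (e.g. "Cloudy 3-8℃" versus "cloudy, 3 to 8 degrees Celsius"), matching typically needs some normalization. A minimal sketch, assuming a simple strip-punctuation-and-whitespace rule (the normalization used by the actual device is not specified here, and real matching may also need unit canonicalization):

```java
import java.util.List;

public class MultiTargetMatchDemo {
    // Strip punctuation and whitespace so "cloudy, 3 to 8" can match "cloudy 3 to 8".
    static String normalize(String s) {
        return s.replaceAll("[\\p{Punct}\\s]+", "").toLowerCase();
    }

    public static void main(String[] args) {
        String reference = "cloudy, 3 to 8 degrees";
        List<String> onScreen = List.of(
            "Laoshan area", "Cloudy 3 to 8 degrees", "Air quality 62 (good air)");
        for (String text : onScreen) {
            if (normalize(text).contains(normalize(reference))) {
                System.out.println("also highlight: " + text);
            }
        }
    }
}
```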
As can be seen from the above embodiments, after receiving the response data corresponding to a voice instruction, the display device provided by the present application may analyze the response data. When the response data includes audio data and display data, it may generate a response interface according to the display data and detect, on the response interface, the target image-text corresponding to the audio data, so that when the audio corresponding to the target image-text is broadcast, the display effect of the target image-text is updated on the response interface and differs from its display effect before the broadcast. The user can thus follow the current broadcast progress from the display effects on the response interface, which links the voice-broadcast text to changes in the UI interface and improves the user experience.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, comprising:
a display for presenting a user interface;
a controller connected with the display, the controller configured to:
receiving a voice instruction input by a user;
responding to the voice instruction, and acquiring response data corresponding to the voice instruction;
when the response data comprise audio data and display data, generating a response interface according to the display data, and matching a text corresponding to the audio data with an image-text object on the response interface to obtain a matched reference text and a target image-text, wherein the reference text belongs to the text corresponding to the audio data, and the target image-text belongs to the image-text object;
controlling the display to display the response interface and controlling an audio output device connected with the display to play audio corresponding to the audio data;
and when the reference text is played, updating the display effect of the target image-text on the response interface, so that the display effect of the target image-text is different from the display effect of the target image-text before the reference text is played.
2. The display device of claim 1, wherein the controller is further configured to:
and after the reference text is played, restoring the display effect of the target image-text to ensure that the display effect of the target image-text is the same as the display effect before the reference text is played.
3. The display device of claim 1, wherein the controller is further configured to:
and after the reference text is played, updating the display effect of the target image-text, so that the display effect of the target image-text is different from the display effect before the reference text is played and different from the display effect when the reference text is played.
4. The display device of claim 1, wherein the matching of the text corresponding to the audio data with the image-text object on the response interface to obtain the matched reference text and the target image-text comprises:
splitting a text corresponding to the audio data into a plurality of character groups;
and matching the character group with the text on the response interface, and if the matching is successful, determining the character group as a reference text, and determining the text on the response interface as a target text, wherein the image-text object on the response interface comprises the text on the response interface, and the target image-text comprises the target text.
5. The display device of claim 4, wherein the controller is further configured to:
acquiring the broadcasting speed of the display device and acquiring the character distance between the starting point of the reference text and the starting point of the text corresponding to the audio data;
calculating the time when the reference text starts to be broadcasted according to the broadcasting speed and the character spacing;
and calculating the time when the reference text finishes broadcasting according to the broadcasting speed and the character length of the reference text.
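The timing computation in claim 5 amounts to two divisions. A worked sketch with hypothetical figures (the claim prescribes no particular units, speed, or code):

```java
public class BroadcastTimingDemo {
    public static void main(String[] args) {
        // Hypothetical figures: broadcast speed of 5 characters per second,
        // reference text starting 20 characters into the broadcast text,
        // reference text 8 characters long.
        double speed = 5.0; // characters per second
        int offset = 20;    // character distance from the start of the broadcast text
        int length = 8;     // character length of the reference text

        double start = offset / speed;         // 4.0 s after playback begins
        double end = start + length / speed;   // 5.6 s after playback begins

        System.out.printf("highlight from %.1f s to %.1f s%n", start, end);
    }
}
```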
6. The display device of claim 1, wherein the matching of the text corresponding to the audio data with the image-text object on the response interface to obtain the matched reference text and the target image-text comprises:
splitting a text corresponding to the audio data into a plurality of character groups;
and matching the character group with the graphic on the response interface, and if the matching is successful, determining the character group as a reference text, and determining the graphic on the response interface as a target graphic, wherein the image-text object on the response interface comprises the graphic on the response interface, and the target image-text comprises the target graphic.
7. The display device of claim 1, wherein the updating of the display effect of the target image-text on the response interface to make the display effect of the target image-text different from the display effect of the target image-text before the reference text is played comprises:
and changing the color of the target image-text to make the color of the target image-text different from its color before the reference text is played.
8. The display device of claim 1, wherein the updating of the display effect of the target image-text on the response interface to make the display effect of the target image-text different from the display effect of the target image-text before the reference text is played comprises:
and adjusting the position of the target image-text on the response interface to ensure that the position of the target image-text on the response interface is different from the position of the target image-text before the reference text is played.
9. The display device of claim 1, wherein the updating of the display effect of the target image-text on the response interface to make the display effect of the target image-text different from the display effect of the target image-text before the reference text is played comprises:
and if the focus of the display equipment is not on the target image-text before the reference text is played, moving the focus of the display equipment to the target image-text when the reference text is played.
10. A method of voice interaction, comprising:
receiving a voice instruction input by a user;
responding to the voice instruction, and acquiring response data corresponding to the voice instruction;
when the response data comprise audio data and display data, generating a response interface according to the display data, and matching a text corresponding to the audio data with an image-text object on the response interface to obtain a matched reference text and a target image-text, wherein the reference text belongs to the text corresponding to the audio data, and the target image-text belongs to the image-text object;
controlling the display to display the response interface and controlling an audio output device to play audio corresponding to the audio data;
and when the reference text is played, updating the display effect of the target image-text on the response interface, so that the display effect of the target image-text is different from the display effect of the target image-text before the reference text is played.
CN202110291989.6A 2021-03-18 2021-03-18 Display device and voice interaction method Pending CN113066491A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110291989.6A CN113066491A (en) 2021-03-18 2021-03-18 Display device and voice interaction method
PCT/CN2021/134357 WO2022193735A1 (en) 2021-03-18 2021-11-30 Display device and voice interaction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110291989.6A CN113066491A (en) 2021-03-18 2021-03-18 Display device and voice interaction method

Publications (1)

Publication Number Publication Date
CN113066491A true CN113066491A (en) 2021-07-02

Family

ID=76562028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110291989.6A Pending CN113066491A (en) 2021-03-18 2021-03-18 Display device and voice interaction method

Country Status (1)

Country Link
CN (1) CN113066491A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324191A (en) * 2011-09-28 2012-01-18 Tcl集团股份有限公司 Method and system for synchronously displaying audio book word by word
KR20130080713A (en) * 2012-01-05 2013-07-15 엘지전자 주식회사 Mobile terminal having function of voice recognition and method for providing search results thereof
CN103778797A (en) * 2012-10-25 2014-05-07 北京掌城科技有限公司 Multimode response method of voice transportation information
JP2017111339A (en) * 2015-12-17 2017-06-22 ソースネクスト株式会社 Voice reproduction device, voice reproduction method, and program
CN105635784A (en) * 2015-12-31 2016-06-01 新维畅想数字科技(北京)有限公司 Audio-image synchronous display method and system
CN105653738A (en) * 2016-03-01 2016-06-08 北京百度网讯科技有限公司 Search result broadcasting method and device based on artificial intelligence
CN110110169A (en) * 2018-01-26 2019-08-09 上海智臻智能网络科技股份有限公司 Man-machine interaction method and human-computer interaction device
CN109413479A (en) * 2018-09-28 2019-03-01 四川长虹电器股份有限公司 The method that smart television voice Interaction Interface content of text is completely shown
CN109522811A (en) * 2018-10-23 2019-03-26 中国人民解放军海军航空大学 A kind of reaching materials system and method
CN110675872A (en) * 2019-09-27 2020-01-10 青岛海信电器股份有限公司 Voice interaction method based on multi-system display equipment and multi-system display equipment
CN112188249A (en) * 2020-09-28 2021-01-05 青岛海信移动通信技术股份有限公司 Electronic specification-based playing method and display device
CN112492371A (en) * 2020-11-18 2021-03-12 海信视像科技股份有限公司 Display device
CN112506400A (en) * 2020-12-04 2021-03-16 海信视像科技股份有限公司 Page information voice broadcasting method and display device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022193735A1 (en) * 2021-03-18 2022-09-22 海信视像科技股份有限公司 Display device and voice interaction method
CN117198291A (en) * 2023-11-08 2023-12-08 四川蜀天信息技术有限公司 Method, device and system for controlling terminal interface by voice
CN117198291B (en) * 2023-11-08 2024-01-23 四川蜀天信息技术有限公司 Method, device and system for controlling terminal interface by voice

Similar Documents

Publication Publication Date Title
CN112511882B (en) Display device and voice call-out method
CN112163086B (en) Multi-intention recognition method and display device
CN114302190A (en) Display device and image quality adjusting method
CN112506400A (en) Page information voice broadcasting method and display device
CN113066490B (en) Prompting method of awakening response and display equipment
CN112188249B (en) Electronic specification-based playing method and display device
CN112153440B (en) Display equipment and display system
CN113066491A (en) Display device and voice interaction method
CN112601117A (en) Display device and content presentation method
CN113473241A (en) Display equipment and display control method of image-text style menu
CN112584213A (en) Display device and display method of image recognition result
CN112885354A (en) Display device, server and display control method based on voice
CN113079400A (en) Display device, server and voice interaction method
CN113038048B (en) Far-field voice awakening method and display device
CN112911381B (en) Display device, mode adjustment method, device and medium
CN115103144A (en) Display device and volume bar display method
CN111914565A (en) Electronic equipment and user statement processing method
CN113079401A (en) Display device and echo cancellation method
CN113038217A (en) Display device, server and response language generation method
CN112601116A (en) Display device and content display method
CN112199560A (en) Setting item searching method and display device
CN113207042B (en) Media asset playing method and display equipment
CN113940049B (en) Voice playing method based on content and display equipment
CN113766164B (en) Display equipment and signal source interface display method
WO2022193735A1 (en) Display device and voice interaction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210702