CN117806587A - Display device and multi-round dialogue corpus generation method - Google Patents


Info

Publication number
CN117806587A
CN117806587A (application CN202310595623.7A)
Authority
CN
China
Prior art keywords
entity, round, template, dialogue, display device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310595623.7A
Other languages
Chinese (zh)
Inventor
胡仁林
汪先健
朱守勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202310595623.7A
Publication of CN117806587A
Legal status: Pending


Abstract

Some embodiments of the present application provide a display device and a multi-round dialogue corpus generation method. The display device may obtain a multi-round dialogue template from a multi-round dialogue template database and select a target single-round dialogue template based on the single-round dialogue template information. The display device may obtain the target entity type of a slot to be filled in the target single-round dialogue template, and obtain a target entity according to the target entity type. The display device may perform entity filling on the target single-round dialogue template based on the target entity to obtain a single-round dialogue corpus. The single-round dialogue corpora are then combined into a multi-round dialogue corpus. The display device can automatically generate multi-round dialogue corpora, which can improve the accuracy of speech recognition.

Description

Display device and multi-round dialogue corpus generation method
Technical Field
The present application relates to the technical field of display devices, and in particular to a display device and a multi-round dialogue corpus generation method.
Background
A display device is a terminal device capable of outputting specific display images. With the rapid development of display devices, their functions have become increasingly rich and their performance increasingly powerful. They can realize bidirectional human-computer interaction and integrate various functions such as video, entertainment, and data, thereby meeting users' diversified and personalized needs.
The display device may have a voice interaction function. Using this function, a user can control the display device by voice to perform a series of operations, such as watching videos, listening to music, checking the weather, and controlling devices. To realize the voice interaction function, the display device recognizes a voice instruction input by the user as text and then analyzes the text semantically, so that the user instruction is parsed and the corresponding operation is performed to realize the function the user indicated. To perform semantic analysis on the voice instruction text, a speech recognition model may be constructed in advance from a training corpus, so that semantics are analyzed using the model.
When the user controls the display device by voice, multiple rounds of dialogue may be required to satisfy the user's demand, depending on the application scenario. For example, the user's voice indicates watching a media asset, and the display device needs to ask the user to select a specific asset to play. For the display device to support multi-round dialogue capability, a speech recognition model may be created from a multi-round dialogue corpus. However, purchasing a multi-round dialogue corpus results in higher usage costs, and the purchased corpus may not be suitable for the application scenarios of a display device, resulting in lower speech recognition accuracy and a poorer user experience.
Disclosure of Invention
The present application provides a display device and a multi-round dialogue corpus generation method, which are used to solve the problem in the related art that a purchased multi-round dialogue corpus is not suitable for the application scenarios of a display device, resulting in low speech recognition accuracy.
In a first aspect, some embodiments of the present application provide a display device including a display and a controller. Wherein the controller is configured to:
acquiring a multi-round dialogue template from a multi-round dialogue template database; the multi-round dialogue template comprises a plurality of single-round dialogue template information;
selecting a target single-round dialogue template based on the single-round dialogue template information; the target single-round dialogue template comprises a slot to be filled;
obtaining a target entity type corresponding to the slot to be filled, and obtaining a target entity according to the target entity type;
performing entity filling on the target single-round dialogue template based on the target entity to obtain single-round dialogue corpus;
and combining the single-round dialogue corpus to obtain multi-round dialogue corpus.
In a second aspect, some embodiments of the present application provide a multi-round dialogue corpus generation method, which is applied to a display device, the method comprising:
acquiring a multi-round dialogue template from a multi-round dialogue template database; the multi-round dialogue template comprises a plurality of single-round dialogue template information;
selecting a target single-round dialogue template based on the single-round dialogue template information; the target single-round dialogue template comprises a slot to be filled;
obtaining a target entity type corresponding to the slot to be filled, and obtaining a target entity according to the target entity type;
performing entity filling on the target single-round dialogue template based on the target entity to obtain single-round dialogue corpus;
and combining the single-round dialogue corpus to obtain multi-round dialogue corpus.
According to the above technical scheme, some embodiments of the present application provide a display device and a multi-round dialogue corpus generation method. The display device may obtain a multi-round dialogue template from a multi-round dialogue template database and select a target single-round dialogue template based on the single-round dialogue template information. The display device may obtain the target entity type of a slot to be filled in the target single-round dialogue template, and obtain a target entity according to the target entity type. The display device may perform entity filling on the target single-round dialogue template based on the target entity to obtain a single-round dialogue corpus. The single-round dialogue corpora are then combined into a multi-round dialogue corpus. The display device can automatically generate multi-round dialogue corpora, which reduces usage costs and can improve the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 illustrates a usage scenario of a display device according to some embodiments;
fig. 2 shows a hardware configuration block diagram of the control apparatus 100 according to some embodiments;
fig. 3 illustrates a hardware configuration block diagram of a display device 200 according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device 200 according to some embodiments;
FIG. 5 illustrates a schematic diagram of an application panel in some embodiments;
FIG. 6 illustrates a voice interaction network architecture diagram of a display device in some embodiments;
FIG. 7 illustrates a schematic diagram of a user and a display device conducting multiple rounds of conversations in some embodiments;
FIG. 8 illustrates a schematic diagram of a user and a display device conducting multiple rounds of conversations in some embodiments;
FIG. 9 illustrates an interactive flow diagram for components of a display device in some embodiments;
FIG. 10 illustrates a schematic diagram of a system setup UI interface in some embodiments;
FIG. 11 illustrates a schematic diagram of voice interaction pattern confirmation information in some embodiments.
Detailed Description
For purposes of clarity and implementation of the present application, the following makes clear and complete descriptions of exemplary implementations of the present application with reference to the accompanying drawings in which those implementations are illustrated. Apparently, the exemplary implementations described are only some, not all, of the embodiments of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," second, "" third and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for limiting a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The display device provided in the embodiment of the application may have various implementation forms, for example, may be a television, an intelligent television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table), and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, which communicates with the display device through infrared protocol communication, Bluetooth protocol communication, or other short-range communication modes, and controls the display device 200 wirelessly or by wire. The user may control the display device 200 by inputting user instructions through keys on the remote control, voice input, control panel input, etc.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device may not receive instructions from the smart device or control apparatus described above, but may instead be controlled by the user through touch, gestures, or the like.
In some embodiments, the display device 200 may also be controlled in a manner other than by the control apparatus 100 and the smart device 300. For example, a user's voice command may be received directly through a module configured inside the display device 200 for acquiring voice commands, or through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be communicatively coupled to the server 400 via a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be a cluster or multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control apparatus 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive input operation instructions from the user and convert the operation instructions into instructions that the display device 200 can recognize and respond to, serving as an intermediary for interaction between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first through n-th input/output interfaces.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display; it receives image signals output from the controller and displays video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, another network communication protocol chip or near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control apparatus 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, or the like. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals and EPG data signals from among a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the controller includes at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a graphics processor (Graphics Processing Unit, GPU), a RAM (Random Access Memory), a ROM (Read-Only Memory), first through n-th input/output interfaces, a communication bus (Bus), and the like.
The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form for user interfaces is a graphical user interface (GraphicUserInterface, GUI), which refers to a graphically displayed user interface associated with computer operations. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly includes common applications on the television and an application framework. The common applications are mainly applications developed based on a browser, such as HTML5 apps, and native applications.
An application framework is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, together with the interfaces for using these functions (toolbars, status bars, menus, dialog boxes).
Native applications may support online or offline operation, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises a HAL interface, hardware, and drivers. The HAL interface is a unified interface to which all television chips are docked, with the specific logic implemented by each chip. The drivers mainly include: an audio driver, a display driver, a Bluetooth driver, a camera driver, a Wi-Fi driver, a USB driver, an HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), a power supply driver, etc.
The display device 200 may have various functions such as browsing web pages, playing media assets, entertainment games, projecting screens, etc., thereby providing a wide variety of services to users. The user may control the display device 200 to launch a related application program to thereby launch a corresponding function.
In some embodiments, after the user controls the display device 200 to power on, a user interface may be displayed in the display 260. The user interface may be a specific target image, for example, various media materials obtained from a network signal source, including video, pictures and the like. The user interface may also be some UI interface of the display device 200, such as a system recommendation page or the like.
A "My applications" control may be included in the user interface. The user may trigger entry into the corresponding application panel by clicking on the My application control to enter a display instruction for the application panel page. The application panel includes therein an application program that the display device 200 has installed, i.e., a function supported by the display device 200. It should be noted that, the application installed in the display device 200 may be a system application or a third party application. Fig. 5 shows a schematic diagram of an application panel in some embodiments. As shown in fig. 5, the application panel includes three controls, namely a "player", "game", and "video chat". The user may click on a control to cause the display device 200 to implement the corresponding function.
A user may input instructions to the display apparatus 200 using the control device 100, such as a remote controller, a mobile terminal, or the like, to control the display apparatus 200 to implement various functions. The user may control the movement of the focus in the display 260 using the control device 100 to select a different control to open the control. The user may also input some text to the display device 200 using the control apparatus 100, for example, may input a media asset name or the like when searching for media assets.
In some embodiments, the display device has a voice recognition function in consideration of the user's use experience, so that the user can input control instructions to the display device by means of voice input to implement voice interaction, thereby controlling the display device 200 to implement various operations, such as voice controlling the display device 200 to enter into an application panel page. The display device 200 may be provided with an audio input interface configured to connect to a sound collector for collecting signals, such as user speech. The display device 200 may be externally connected to a sound collector, such as an externally connected microphone. The display device may also incorporate a sound collector.
Fig. 6 illustrates a voice interaction network architecture diagram of a display device in some embodiments. As shown in fig. 6, the display device 200 is used to receive input information such as sound and to output the processing result of that information. The voice recognition module is deployed with an automatic speech recognition (Automatic Speech Recognition, ASR) service for recognizing audio as text; the semantic understanding module is deployed with a natural language understanding (Natural Language Understanding, NLU) service for performing semantic analysis on the text; the business management module is deployed with business instruction management services such as dialog management (Dialog Management, DM) for providing business instructions; the language generation module is deployed with a natural language generation (Natural Language Generation, NLG) service for converting the instruction the display device is to execute into text language; the speech synthesis module is deployed with a text-to-speech (Text To Speech, TTS) service for processing the text language corresponding to the instruction and then sending it to a loudspeaker for broadcasting. Multiple entity service devices with different business services can exist in the voice interaction network architecture, and one or more functional services can be integrated in one or more entity service devices.
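To make the division of labor concrete, the following is a minimal sketch of how the five services might be chained; the class and method names are assumptions for illustration, not interfaces defined by the patent.

```python
# Minimal sketch of chaining the five services above; all class and
# method names are illustrative assumptions, not the patent's interfaces.
class VoicePipeline:
    def __init__(self, asr, nlu, dm, nlg, tts):
        self.asr, self.nlu, self.dm, self.nlg, self.tts = asr, nlu, dm, nlg, tts

    def handle(self, audio: bytes) -> bytes:
        text = self.asr.recognize(audio)             # ASR: audio -> text
        semantics = self.nlu.parse(text)             # NLU: text -> domain/intent/slots
        instruction = self.dm.decide(semantics)      # DM: semantics -> business instruction
        reply_text = self.nlg.generate(instruction)  # NLG: instruction -> reply text
        return self.tts.synthesize(reply_text)       # TTS: reply text -> audio for speaker
```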
In some embodiments, the procedure for processing information input to the display device 200 based on the architecture shown in fig. 6 is described below, taking a query sentence input by voice as an example:
and (3) voice recognition: the display apparatus 200 may perform noise reduction processing and feature extraction on the audio of the query sentence after receiving the query sentence input through the voice, where the noise reduction processing may include steps of removing echo and environmental noise, and the like.
Semantic understanding: natural language understanding is performed on the recognized candidate text and associated context information. The text is parsed into structured, machine-readable information such as business domain, intent, and word slots to express its semantics, yielding executable intents and their intent confidence scores; the semantic understanding module selects one or more candidate executable intents based on the determined intent confidence scores.
Business management: the semantic understanding module issues a query instruction to the corresponding business management module according to the semantic parsing result of the query sentence's text to obtain the query result given by the service, performs the actions required to finish the user's final request, and feeds back the device execution instruction corresponding to the query result.
Language generation: configured to turn the information or instructions into language text. Dialogues can be divided into chit-chat, task-oriented, knowledge question-answering, and recommendation types. In a chit-chat dialogue, the NLG performs intent recognition, sentiment analysis, etc. according to the context and then generates an open-ended reply; in a task-oriented dialogue, the reply is generated according to the learned strategy, generally including clarifying requirements, guiding the user, inquiring, confirming, ending the dialogue, etc.; in a knowledge question-answering dialogue, the knowledge required by the user (knowledge, entities, fragments, etc.) is generated based on question-type recognition and classification, information retrieval, or text matching; in a recommendation dialogue system, interest matching and ranking of candidate recommended contents are performed according to the user's preferences, and recommended content is then generated for the user.
Speech synthesis: configured to present a speech output to the user. The speech synthesis module synthesizes a speech output based on text provided by the digital assistant; for example, the generated dialogue response is in the form of a text string, and the speech synthesis module converts the text string into audible speech output.
It should be noted that the architecture shown in fig. 6 is only an example, and is not intended to limit the scope of the present application. Other architectures may also be employed to achieve similar functionality in embodiments of the present application, for example: all or part of the above process may be performed by the display device 200, which is not described herein.
In some embodiments, a user may control the display device 200 to implement various functions using voice input. For a user's voice command, the display device 200 may convert the voice data into text data and perform semantic analysis on the text data, thereby confirming the user command. In response to the instruction, the display device 200 may perform the corresponding operation and may also respond to the user, such as replying by voice with the execution result.
To perform semantic analysis on the voice instruction text, a number of training corpora, which may be voice text corpora, may be obtained in advance, and a speech recognition model may be constructed from these training corpora, so that semantics are analyzed using the model.
However, in some application scenarios, when the user controls the display device 200 by voice, the input voice command may not embody the user's complete intention, so the display device 200 cannot perform the corresponding operation based on the user's current voice alone. A command that does not embody the complete intention is one in which the user's intent is incomplete, or in which the intent is clear but requires so much basic information that the user can hardly state it all at once. For example, if the user instruction is "watch a movie," the intent can be determined to be a media asset search, but it cannot be determined which media asset the user wants to search for, so the user instruction cannot currently be responded to. At this time, the display device 200 needs to ask the user about the conditions of the media asset to be searched.
For the above case, the display device 200 may conduct multiple rounds of dialogue with the user to satisfy the user's needs. A multi-round dialogue refers to a continuous conversational scenario needed to address a specific task based on multiple rounds of contextual information. For single-round information provided by the user, if the display device 200 cannot unambiguously perform the related operation, the multi-round context information must be considered on top of the single-round information to satisfy the user's actual needs.
FIG. 7 illustrates a schematic diagram of a user and a display device conducting multiple rounds of dialogue in some embodiments. As shown in fig. 7, the user first issues the voice instruction "I want to watch a movie" to the display device 200. From this voice instruction, the display device 200 can determine that the user intends to search for media assets, but cannot determine the movie to be played. At this time, the display device 200 may ask the user for the name of the media asset to be played, replying "May I ask the name of the movie you want to watch?". After hearing the reply of the display device 200, the user may indicate the movie information, such as replying "XXX movie," to supplement the information required to execute the user instruction. After receiving the newly input voice, the display device 200 may search for the media asset resources and play them. Meanwhile, the display device 200 may reply by voice "Videos about XXX have been recommended for you" to prompt the user. The above is a multi-round dialogue scenario between the user and the display device 200.
In some embodiments, some voice instructions may embody the user's complete intent, but when there are multiple execution choices, the display device 200 is temporarily unable to execute the user instruction and needs the user to confirm the operation to be performed. For example, if the user instruction is "watch movie A," the user's complete intent can be determined, but the display device 200 may find multiple assets related to movie A and thus cannot determine which asset to play. At this time, the display device 200 needs to ask the user which content to play.
FIG. 8 illustrates a schematic diagram of a user and a display device conducting multiple rounds of dialogue in some embodiments. As shown in fig. 8, the user first issues the voice command "recommend XXX movie third season" to the display device 200. From this voice command, the display device 200 can determine that the user intends to search for media assets, and the command also includes complete search information, so the display device 200 may perform the search, such as searching for related resources on a server. However, due to the diversity of film sources on the network, the display device 200 may find multiple related resources for one asset name, including "XXX movie third season," "XXX movie 3," and "XXX movie (three)." At this time, the display device 200 cannot determine which resource to play, and the user needs to confirm again before the instruction can be executed. The display device 200 may reply by voice "Multiple related resources were found; which one would you like to play?". After hearing the reply, the user may choose the content to play, e.g., replying "the first one," to instruct the display device 200 to play the first asset. After receiving the new voice input, the display device 200 may play the first resource and at the same time reply by voice "OK, playing it for you" to prompt the user.
It should be noted that, for the display device 200 to support multi-round dialogue capability, a speech recognition model may be created in advance from multi-round dialogue corpora. The multi-round dialogue corpora can be obtained by purchase or by manual annotation. However, purchasing corpora or annotating them manually results in higher usage costs. Moreover, a purchased multi-round dialogue corpus may not be suitable for the application scenarios of the display device, resulting in lower speech recognition accuracy and a poorer user experience. Manual annotation may yield only a small number of multi-round dialogue corpora, resulting in lower accuracy of the created model when recognizing speech.
To solve the above problems, the display device 200 may generate multi-round dialogue corpora by itself to simulate real multi-round dialogue scenarios between the user and the display device 200, so as to meet the requirements on corpus quantity and application scenario when creating the model, improving speech recognition accuracy and user experience while reducing usage costs.
FIG. 9 shows an interaction flow diagram of the components of the display device 200 in some embodiments, including the following steps (a code sketch of the overall flow follows the list):
S101, acquiring a multi-round dialogue template from a multi-round dialogue template database; the multi-round dialogue template comprises a plurality of single-round dialogue template information;
s102, selecting a target single-round dialogue template based on the single-round dialogue template information; the target single-round dialogue template comprises a slot to be filled;
s103, obtaining a target entity type corresponding to the slot to be filled, and obtaining a target entity according to the target entity type;
s104, performing entity filling on the target single-round dialogue template based on the target entity to obtain single-round dialogue corpus;
s105, combining the single-round dialogue corpus to obtain multi-round dialogue corpus.
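As noted above, the following is a minimal end-to-end sketch of steps S101 through S105. The data shapes (template dictionaries, "{slot_name}" placeholders) and helper choices are assumptions for illustration, not the patent's implementation; the refinements of the individual steps (weighted template sampling, template screening, entity linking) are sketched separately below.

```python
import random

def generate_multi_round_corpus(multi_templates, single_templates, entity_db):
    """One pass through steps S101-S105; all data shapes are assumed."""
    # S101: obtain a multi-round template (weighted sampling is sketched later).
    multi_template = random.choice(multi_templates)
    single_corpora = []
    for turn_info in multi_template["turns"]:
        # S102: select a target single-round template for this turn.
        candidates = [t for t in single_templates
                      if t["dialog_act"] == turn_info["dialog_act"]]
        template = random.choice(candidates)
        text = template["text"]
        # S103 + S104: fetch an entity of each slot's target type and fill it in.
        for slot in template["slots"]:
            entity = random.choice(entity_db[slot["entity_type"]])
            text = text.replace("{" + slot["name"] + "}", entity)
        single_corpora.append(text)
    # S105: combine the single-round corpora into one multi-round corpus.
    return single_corpora
```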
In some embodiments, controller 250 may first obtain a multi-turn dialog template, thereby generating a multi-turn dialog corpus from the multi-turn dialog template. The display device 200 may have a multi-turn dialog template database previously set therein, in which a plurality of multi-turn dialog templates may be stored, and the controller 250 may acquire the multi-turn dialog templates based on the preset multi-turn dialog template database. The multi-round dialog templates may be pre-created by a technician and stored in a multi-round dialog template database for application by the display device 200.
It should be noted that, in the embodiments of the present application, a multi-round dialogue is composed of a plurality of single-round dialogues, where a single-round dialogue refers to one dialogue act by the display device 200 or the user. Each single-round dialogue may represent one interaction by the display device 200 or the user; for example, one reply from the display device 200 to the user, or one voice command from the user to the display device 200, is a single-round dialogue.
The multi-round dialogue template includes a plurality of single-round dialogue template information. Each single-round dialogue template information corresponds to one single-round dialogue and may include information required in that single-round dialogue, such as the user intent, domain information, and entity information, thereby defining the content of the single-round dialogue. Each multi-round dialogue template defines how many single-round dialogues the corresponding multi-round dialogue scenario has, and includes the single-round dialogue template information corresponding to each single-round dialogue. The controller 250 may generate the corresponding single-round dialogue from the single-round dialogue template information to obtain a single-round dialogue corpus, and may then generate the multi-round dialogue corpus from the single-round dialogue corpora of all rounds.
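For illustration, a multi-round template for the media-search scenario discussed below might look as follows; the field names, the act labels, and the scenario are assumptions, not a schema from the patent.

```python
# Hypothetical shape of one multi-round dialogue template; field names
# and values are illustrative assumptions.
multi_round_template = {
    "scene": "media_search_multiple_results",
    "scene_probability": 0.6,   # scene prediction probability (discussed below)
    "turns": [
        {"dialog_act": "search",  "initiator": "user",
         "domain": "video", "entity_constraint": "movie_name"},
        {"dialog_act": "request", "initiator": "system"},
        {"dialog_act": "select",  "initiator": "user"},
        {"dialog_act": "success", "initiator": "system"},
    ],
}
```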
In view of corpus adaptability to the display device 200, the application scenarios for which the multi-round dialogue templates are designed may be scenarios related to the display device 200, so as to simulate real multi-round dialogue scenarios between a user and the display device.
In some embodiments, controller 250 may sample the multi-round dialog templates in a multi-round dialog template database, resulting in a plurality of multi-round dialog templates. Wherein each multi-turn dialog template may be sampled multiple times to increase the number of multi-turn dialog corpora generated.
In sampling the multi-round conversation template, the controller 250 may first obtain probability distribution information of the multi-round conversation template in the multi-round conversation template database.
For each multi-turn dialog template, there will be a scene prediction probability that characterizes the probability that the corresponding multi-turn dialog of the template will appear in the real scene. For an application scenario corresponding to a certain user intention, multiple rounds of dialogs may exist, and the occurrence probability of each round of dialogs may be different. For example, for an application scenario where a user searches for a media asset, it may be that the user instructs to search for a certain movie media asset, and after searching for a corresponding resource, the display device 200 may have three results, including: a plurality of resources are searched, one resource is searched, and no resource is searched. For three results, three multiple rounds of dialog would correspond. When searching for multiple resources, the display device 200 may reply to the user that "search for multiple related resources, please select a video to be watched", so that the user selects a certain resource to watch. When a resource is searched, the display device 200 may reply to the user that "the relevant resource has been searched, please confirm whether to play" so that the user confirms whether to perform the play process. When no resource is searched, the display device 200 may reply to the user that "related resources are not searched, please search again" to prompt the user that the play is impossible.
It should be noted that, in a real scene, the probability of actual occurrence of each multi-turn dialog is different. For example, there are many cases where a plurality of resources are searched, that is, the probability of occurrence is high. For this reason, the scene prediction probabilities of the corresponding multiple rounds of dialog templates in each application scene may be preset, for example, the scene prediction probabilities of three cases of searching for multiple resources, searching for one resource, and not searching for a resource may be 0.6, 0.1, and 0.3, respectively.
Each multi-round dialogue template in the multi-round dialogue template database has a scene prediction probability, and the controller 250 can count all scene prediction probabilities to obtain probability distribution information of all multi-round dialogue templates.
The controller 250 may calculate the multi-round dialog template sampling scale based on the probability distribution information. For example, the scene prediction probabilities of three cases of searching for a plurality of resources, searching for a resource and not searching for a resource are respectively 0.6, 0.1 and 0.3, so the sampling ratio of the three multi-round dialog templates is 6:1:3.
The controller 250 may obtain a multi-round dialog template in a multi-round dialog template database based on the multi-round dialog template sampling scale. It should be noted that the sampling proportion of the multi-round dialogue templates only limits the sampling proportion, but does not limit the sampling quantity, and each multi-round dialogue template can be sampled multiple times so as to increase the quantity of the generated multi-round dialogue corpus.
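A minimal sketch of such proportional sampling, using the 0.6 / 0.1 / 0.3 example above; the template records are assumptions.

```python
import random

templates = [
    {"scene": "multiple_results", "scene_probability": 0.6},
    {"scene": "one_result",       "scene_probability": 0.1},
    {"scene": "no_result",        "scene_probability": 0.3},
]
weights = [t["scene_probability"] for t in templates]

# Only the ratio (6:1:3 here) is constrained; each template may be drawn
# many times, so k can be as large as the desired corpus size.
samples = random.choices(templates, weights=weights, k=1000)
```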
In some embodiments, controller 250 may select a target single-turn dialog template based on the single-turn dialog template information. Some information types of the current single-turn dialog may be defined in the single-turn dialog template information, and the controller 250 may match the corresponding single-turn dialog template according to the information types.
The single-turn dialog template information may include a dialog behavior type. The dialog behavior type may be used to characterize dialog intents of a single-turn dialog, and may also characterize dialog initiators of the single-turn dialog.
In the embodiment of the application, the single-round dialogue acts are divided into a system dialogue act and a user dialogue act. The system dialogue acts refer to dialogue acts that the display device 200 replies to the user according to previous dialogue information, and the user dialogue acts refer to dialogue acts that the user instructs the display device 200 to perform some operation during interaction with the television. The dialog behavior type may distinguish whether the dialog initiator is the display device 200 or the user.
It should be noted that the first dialogue in a multi-round dialogue is initiated by the user: when the user has a certain requirement, the user sends a voice command to the display device 200 to open the dialogue. The user and the display device 200 then take turns, i.e., the sequence of dialogue turns is: user, display device 200, user, display device 200, and so on. In a multi-round dialogue, the initiator of the odd-numbered turns is the user, and the initiator of the even-numbered turns is the display device 200.
Both system dialogue acts and user dialogue acts can characterize dialogue intent. For example, the system dialogue acts used to characterize the system intent may include acts indicating success, failure, request (requesting an optional slot), confirm (query confirmation), etc., which may be the execution result prompted by the display device 200 or a request to the user for the next instruction. User dialogue acts may include search, affirm, select, and deny, used to represent the user intent.
The controller 250 may acquire a dialog behavior type based on the single-turn dialog template information to determine a dialog intention of the current single-turn dialog.
The controller 250 may traverse a preset single-turn dialog template database to obtain a single-turn dialog template corresponding to the dialog behavior type, which is referred to as a first single-turn dialog template in the embodiment of the present application.
The display device 200 is provided with a single-turn conversation template database in advance, in which a plurality of single-turn conversation templates are stored. It should be noted that each single-round dialog template may correspond to a dialog behavior type for characterizing dialog initiators and dialog intents of the template. After determining the dialogue action type in the single-turn dialogue template information, the controller 250 may filter the single-turn dialogue template database according to the dialogue action type to obtain a first single-turn dialogue template corresponding to the dialogue action type.
After acquiring the first single-round dialogue templates, the controller 250 may acquire the domain type based on the single-round dialogue template information. The single-round dialogue template information may also characterize the domain type of the multi-round dialogue. The domain type is the business domain the dialogue relates to, including the music domain, the video domain, the chit-chat domain, etc.; for example, if the user dialogue is "search movie," the domain type is the video domain. The domain type may be stored in the single-round dialogue template information of the first round, i.e., the domain type is defined in the dialogue first initiated by the user.
The controller 250 may screen the target single-round dialogue template from the first single-round dialogue templates according to the domain type. Each single-round dialogue template corresponds to a domain type, so a first single-round dialogue template of that domain type can be directly acquired and used as the target single-round dialogue template.
In some embodiments, the controller 250 may consider some of the entity information in the single-round dialogue template information when screening the first single-round dialogue templates based on the domain type. Considering that the scope of a domain type is large, some contents of the single-round dialogue can be further limited in the single-round dialogue template information; by limiting the entity, entity constraint information can be added to the single-round dialogue template information. For example, the entity constraint information may be a movie name, an actor name, a movie release date, etc. The controller 250 may further filter the single-round dialogue templates by the entity constraint information.
The controller 250 may first screen out a first single-round dialog template associated with a domain type, which is referred to as a second single-round dialog template in this embodiment, that is, the second single-round dialog template is a template of the domain type.
The controller 250 may detect entity constraint information in the single round dialog template information.
If the single-turn dialog template information includes entity constraint information, the controller 250 may obtain an entity type corresponding to the entity constraint information, which is referred to as a first entity type in this embodiment of the present application.
The controller 250 may match the first entity slot in the second single-round dialog template based on the first entity type. The single-round dialogue template can comprise a plurality of slots to be filled, and each slot can be filled with one piece of entity data, so that the single-round dialogue template is supplemented. And each slot may correspond to an entity type for indicating the entity filling the entity type. The controller 250 may traverse all slots of the second single-round dialog template, thereby obtaining slots corresponding to the first entity type, which in the embodiment of the present application is referred to as a first entity slot.
The controller 250 may obtain a target single-round dialog template based on the first entity slot, which is a second single-round dialog template including the first entity slot.
In some embodiments, if the single-round dialog template information does not include entity constraint information, the controller 250 may randomly sample the second single-round dialog template to obtain the target single-round dialog template.
In some embodiments, if the single-turn dialog template information does not include entity constraint information, the controller 250 may also set a certain entity type by itself.
The controller 250 may obtain a domain entity type, where in this embodiment, the domain entity type is an entity type corresponding to a domain type, and may be that one of a plurality of entity types corresponding to a domain type is randomly selected as the domain entity type.
The controller 250 may match a second entity slot in a second single round dialog template based on the domain entity type. The controller 250 may obtain a target single-turn conversation template including a second entity slot therein.
In some embodiments, the controller 250 may decide whether entity information needs to be set by itself based on the dialogue initiator of the single-round dialogue. Considering that some information may be indicated in a user-initiated dialogue, entity information may be defined in user dialogues; a dialogue from the display device 200 may merely reply with an execution result and thus may not define entity information.
Upon detecting that no entity constraint information is included in the single-turn dialog template information, the controller 250 may obtain a dialog initiator based on the dialog behavior type.
If the conversation initiator is the display device 200, the controller 250 may randomly sample the second single-round conversation template to obtain the target single-round conversation template.
If the conversation initiator is a user, the controller 250 may acquire a domain entity type and filter the second single-round conversation template according to the domain entity type, thereby obtaining a target single-round conversation template.
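The screening chain just described (behavior type, then domain type, then entity constraint or its fallbacks) might be sketched as follows, assuming the templates are plain dictionaries; all field names are illustrative.

```python
import random

def select_target_template(first_templates, turn_info,
                           constraint_to_type, domain_entity_types):
    """`first_templates` are assumed already screened by dialog behavior
    type; the remaining filtering follows the text above."""
    # Second screen: first single-round templates of the dialogue's domain type.
    second = [t for t in first_templates
              if t["domain"] == turn_info["domain"]]
    constraint = turn_info.get("entity_constraint")
    if constraint is not None:
        # Entity constraint present: keep templates containing a slot of
        # the first entity type (the type the constraint corresponds to).
        first_type = constraint_to_type[constraint]
        pool = [t for t in second
                if any(s["entity_type"] == first_type for s in t["slots"])]
    elif turn_info["initiator"] == "system":
        # Display-device turn without constraint: random sampling.
        pool = second
    else:
        # User turn without constraint: pick a domain entity type at
        # random and keep templates with a matching (second) slot.
        domain_type = random.choice(domain_entity_types[turn_info["domain"]])
        pool = [t for t in second
                if any(s["entity_type"] == domain_type for s in t["slots"])]
    return random.choice(pool)
```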
In some embodiments, after the target single-round dialogue template is obtained, considering that the target single-round dialogue template includes a number of slots to be filled, the controller 250 may perform entity filling on the slots to be filled.
The controller 250 may obtain an entity type corresponding to the slot to be filled, which is referred to as a target entity type in the embodiment of the present application. The controller 250 may acquire the target entity according to the target entity type.
In some embodiments, when acquiring the target entity according to the target entity type, the controller 250 may traverse the preset entity database to acquire the entity data corresponding to the target entity type.
The display device 200 may be preset with an entity database, where a plurality of entities corresponding to a plurality of entity types are stored. The controller 250 may screen the entity types in the entity database, so as to obtain entity data corresponding to the target entity type.
The controller 250 may acquire the target entity based on the entity data.
In some embodiments, when acquiring the target entity based on the entity data, the controller 250 may perform a random sampling process on the entity data to obtain the target entity.
In some embodiments, the controller 250 may increase the number of entity samples by obtaining a plurality of different target entities to increase the number of corpora.
The controller 250 may first sample the entity data to obtain the first entity.
The controller 250 may perform entity link query processing on the first entity based on a preset entity knowledge graph to obtain an entity prediction result. The entity prediction result comprises a plurality of second entities and entity prediction probabilities of the second entities.
An entity knowledge graph is a knowledge base describing various entities or concepts existing in the real world and their relationships, and may be the YAGO multilingual knowledge graph. YAGO contains both entities (e.g., movies, characters, cities, countries, etc.) and the relationships between these entities (who plays in which movie, which city is in which country, etc.). The entities of YAGO contain names and aliases in the respective languages, and YAGO is stored in the standard Resource Description Framework (RDF); its data consists of triples, each composed of a subject, a predicate (also called a "relationship" or "attribute"), and an object. YAGO divides entities into different classes, such as people, cities, etc. There are also inclusion relationships between classes; for example, the city class is a subclass of the settlement class, which is in turn a subclass of the geographic location class. YAGO also defines relationships between entities; for example, there may be a birth relationship between a person entity and a place entity.
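As an illustration of the triple structure, a few YAGO-style facts might be written as follows; the specific facts are made-up examples, not excerpts from YAGO.

```python
# (subject, predicate, object) triples in the YAGO/RDF style described
# above; the facts themselves are invented for illustration.
triples = [
    ("Actor_A", "actedIn",     "Movie_X"),
    ("City_Y",  "isLocatedIn", "Country_Z"),
    ("Actor_A", "wasBornIn",   "City_Y"),
]
```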
Entity linking refers to obtaining, for any entity, a plurality of other entities associated with it in a preset multilingual knowledge graph; each such other entity may be used as an associated entity of the first entity, referred to in this embodiment as a second entity. The second entity may be used to interpret the first entity, or may be an entity that has a relationship with the first entity. For example, for a person A, A may be the first entity; the second entity queried may be "actor A," representing A's occupation, or may be the name of a media asset, representing an asset A has appeared in.
When performing the entity link query, the link results may be recalled by rule-based methods such as minimum edit distance, intersection-over-union, entity length, and entity popularity. The entity link query result includes the second entities and the classification probability of each second entity (which may also be considered the score of the entity tag), referred to in this embodiment as the entity prediction probability. The entity prediction probability may represent the degree of association between the second entity and the first entity; the higher the degree of association, the higher the score.
The controller 250 may sample the second entity based on the entity prediction probability to obtain the target entity. The controller 250 may determine the sampling rate based on the entity prediction probabilities so that a plurality of target entities are collected.
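A sketch of this expansion step, assuming the link results have already been recalled into a mapping from a first entity to scored second entities; the data shape and the sampling are assumptions.

```python
import random

def expand_entity(first_entity, link_results, k=3):
    """Sample target entities among the linked second entities, weighted
    by their entity prediction probabilities (assumed data shape)."""
    prediction = link_results.get(first_entity, [])
    if not prediction:
        return [first_entity]
    seconds = [entity for entity, _ in prediction]
    probs = [prob for _, prob in prediction]
    # A higher association score makes a second entity more likely to be
    # sampled; duplicates are possible and acceptable here.
    return random.choices(seconds, weights=probs, k=k)

# Illustrative link results for a first entity "Actor_A".
link_results = {"Actor_A": [("actor A", 0.9), ("Movie_X", 0.7), ("Movie_Y", 0.4)]}
targets = expand_entity("Actor_A", link_results)
```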
In some embodiments, the controller 250 may perform entity filling on the target single-round dialogue template based on the target entity to obtain a single-round dialogue corpus. The controller 250 then combines the single-round dialogue corpora to obtain a multi-round dialogue corpus, as sketched below.
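A minimal sketch of the filling and combining steps; the "{slot_name}" placeholder syntax and the turn texts are assumptions for illustration.

```python
# Fill the target entity into the slot of the target single-round
# template (S104), then combine the rounds into one sample (S105).
template = {"text": "I want to watch {movie_name}",
            "slots": [{"name": "movie_name", "entity_type": "movie"}]}
target_entities = {"movie_name": "XXX movie third season"}

text = template["text"]
for slot in template["slots"]:
    text = text.replace("{" + slot["name"] + "}", target_entities[slot["name"]])

# One multi-round corpus sample is the ordered sequence of turns.
multi_round_corpus = [
    text,
    "Multiple related resources were found; which one would you like to play?",
    "The first one",
    "OK, playing it for you",
]
```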
In some embodiments, after the multiple rounds of dialogue corpus are obtained, the controller 250 may perform semantic annotation processing on the multiple rounds of dialogue corpus to obtain training data. The semantic annotation processing comprises intent annotation, field annotation and entity annotation.
Controller 250 may construct a speech recognition model based on the training data. The speech recognition model may be a semantic analysis model for analyzing a user's speech dialog to generate a reply dialog.
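One semantically annotated multi-round sample might then look like the following; the annotation schema (field names, label values) is an assumption, not the patent's format.

```python
# Hypothetical annotated training sample with intent, domain, and
# entity annotations for each turn.
training_sample = [
    {"speaker": "user",
     "text": "Recommend XXX movie third season",
     "intent": "search", "domain": "video",
     "entities": [{"type": "movie_name", "value": "XXX movie third season"}]},
    {"speaker": "system",
     "text": "Multiple related resources were found; which one would you like to play?",
     "intent": "request"},
    {"speaker": "user",
     "text": "The first one",
     "intent": "select",
     "entities": [{"type": "index", "value": "first"}]},
    {"speaker": "system",
     "text": "OK, playing it for you",
     "intent": "success"},
]
```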
In some embodiments, the display device 200 may be provided with a voice interaction mode, and the user may control the display device 200 to enter voice instructions into the voice interaction mode.
In some embodiments, the user may send the voice interaction mode instruction to the display device 200 by operating a specified key of the remote control. In the actual application process, the correspondence between the voice interaction mode instruction and the remote controller key is bound in advance. For example, a voice interaction mode key is provided on the remote controller; when the user presses the key, the remote controller sends a voice interaction mode instruction to the controller 250, and the controller 250 then controls the display device 200 to enter the voice interaction mode.
In some embodiments, a voice interaction mode instruction may also be sent to the display device 200 when the user controls the display device 200 with a smart device, such as a mobile phone. In practice, a control may be provided in the mobile phone's operation interface through which the user selects whether to enter the voice interaction mode; the control sends a voice interaction mode instruction to the controller 250, and the controller 250 controls the display device 200 to enter the voice interaction mode.
A voice recognition mode option may also be provided in the UI of the display device; when the user clicks this option, the display device is controlled to enter or exit the voice recognition mode. FIG. 10 illustrates a schematic diagram of a system settings UI in some embodiments. As shown in FIG. 10, the system settings include screen settings, sound settings, voice recognition settings, network settings, and factory reset. The user may click the voice recognition control to make the display device 200 enter or exit the voice recognition mode.
In some embodiments, to prevent the user from falsely triggering the voice interaction mode, when the controller 250 receives the voice interaction mode instruction, it may control the display 260 to display voice interaction mode confirmation information, so that the user can confirm whether the display device 200 should enter the voice interaction mode. FIG. 11 illustrates a schematic diagram of the voice interaction mode confirmation information in some embodiments.
In some embodiments, the user may send a voice instruction to the display device 200, and the controller 250 may control the sound collector to collect the voice instruction input by the user. In response to the voice instruction, the controller 250 may recognize it based on the speech recognition model to generate reply dialogue information.
The controller 250 may transmit the received voice data to a speech recognition service. The speech recognition service is a web service that may be deployed on the display device 200 and may include a speech recognition module and a semantic analysis module. The speech recognition module is used to recognize audio as text. The speech recognition model may be stored in the semantic analysis module, which performs semantic parsing on the text to understand the user's intent and executes the voice instruction to realize the corresponding function.
In some embodiments, the display device 200 may also include a third-party speech recognition interface. Upon receiving a voice instruction input by the user, the controller 250 may transmit the voice data to the third-party speech recognition interface, so that a third-party speech recognition device or the like converts the user's voice instruction into voice text. After the voice text is obtained, the controller 250 may parse it and execute the voice instruction.
In some embodiments, the generation of the multi-round dialogue corpus may be performed by a server. The server may generate the multi-round dialogue corpus according to the above steps and send it to the display device 200. The display device 200 may then construct a speech recognition model from the multi-round dialogue corpus to recognize user voice instructions, thereby supporting various multi-round dialogue scenarios.
The embodiment of the application also provides a multi-round dialogue corpus generation method, applied to the display device, comprising the following steps (a minimal end-to-end sketch follows the steps):
step 101, acquiring a multi-round dialogue template from a multi-round dialogue template database; the multi-round dialogue template comprises a plurality of pieces of single-round dialogue template information;
step 102, selecting a target single-round dialogue template based on the single-round dialogue template information; the target single-round dialogue template comprises a slot to be filled;
step 103, obtaining a target entity type corresponding to the slot to be filled, and obtaining a target entity according to the target entity type;
step 104, performing entity filling on the target single-round dialogue template based on the target entity to obtain single-round dialogue corpus;
step 105, combining the single-round dialogue corpus to obtain multi-round dialogue corpus.
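Under heavily simplified, assumed data structures (the real system draws on template databases, an entity database, and a knowledge graph), steps 101-105 can be sketched as:

```python
import random

# End-to-end sketch of steps 101-105 over hypothetical, in-memory databases.
multi_round_template_db = [
    {"rounds": [{"act": "request", "domain": "video"},
                {"act": "inform", "domain": "video"}]},
]
single_round_template_db = {
    ("request", "video"): "I want to watch something starring {person}.",
    ("inform", "video"): "Here is a movie starring {person}.",
}
entity_db = {"person": ["actor A", "actor B"]}

def generate_dialogue():
    template = random.choice(multi_round_template_db)                     # step 101
    dialogue = []
    for info in template["rounds"]:
        target = single_round_template_db[(info["act"], info["domain"])]  # step 102
        entity = random.choice(entity_db["person"])                       # step 103
        dialogue.append(target.format(person=entity))                     # step 104
    return dialogue                                                       # step 105

print(generate_dialogue())
```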
In some embodiments, acquiring the multi-round dialogue template from the multi-round dialogue template database further comprises:
Probability distribution information of the multi-round dialogue templates in the multi-round dialogue template database is acquired. A multi-round dialogue template sampling proportion is calculated based on the probability distribution information. Based on the sampling proportion, multi-round dialogue templates are acquired from the multi-round dialogue template database.
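For illustration, distribution-driven sampling can normalize observed template frequencies into a sampling proportion; the template names and counts below are assumptions:

```python
import random

# Distribution-driven template sampling sketch: templates observed more often
# receive a proportionally larger share of the generated corpus.
template_counts = {"tpl_search_then_play": 60, "tpl_weather_followup": 40}  # hypothetical
total = sum(template_counts.values())
sampling_proportion = {t: c / total for t, c in template_counts.items()}

sampled = random.choices(list(sampling_proportion),
                         weights=list(sampling_proportion.values()), k=10)
```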
In some embodiments, selecting the target single-round dialogue template based on the single-round dialogue template information further comprises:
A dialogue behavior type is acquired based on the single-round dialogue template information; the dialogue behavior type is used to characterize the dialogue intent. The single-round dialogue template database is traversed to obtain first single-round dialogue templates corresponding to the dialogue behavior type; a plurality of single-round dialogue templates are stored in the single-round dialogue template database. A domain type is acquired based on the single-round dialogue template information. Target single-round dialogue templates are then screened from the first single-round dialogue templates according to the domain type.
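This two-stage filter (first by dialogue behavior type, then by domain type) can be sketched as follows, with hypothetical template records:

```python
# Two-stage template filtering sketch; records are illustrative assumptions.
single_round_template_db = [
    {"act": "request", "domain": "video", "text": "Play {title}."},
    {"act": "request", "domain": "weather", "text": "What is the weather in {city}?"},
    {"act": "inform", "domain": "video", "text": "Now playing {title}."},
]

def select_templates(act, domain):
    first = [t for t in single_round_template_db if t["act"] == act]  # by behavior type
    return [t for t in first if t["domain"] == domain]                # by domain type
```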
In some embodiments, screening the first single-round dialogue template based on the domain type to obtain the target single-round dialogue template further comprises:
A second single-round dialogue template is acquired, where the second single-round dialogue template is a first single-round dialogue template associated with the domain type. Entity constraint information in the single-round dialogue template information is detected. If the single-round dialogue template information contains entity constraint information, a first entity type corresponding to the entity constraint information is acquired. A first entity slot is matched in the second single-round dialogue template based on the first entity type. The target single-round dialogue template is then acquired based on the first entity slot; the target single-round dialogue template comprises the first entity slot.
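A minimal sketch of this constraint-driven slot matching, assuming hypothetical constraint and slot fields:

```python
# Constraint-driven slot matching sketch; data shapes are assumptions.
second_templates = [
    {"slots": ["title"], "text": "Play {title}."},
    {"slots": ["person"], "text": "Show movies starring {person}."},
]
constraint_entity_type = "person"  # first entity type from the constraint info

target_templates = [t for t in second_templates
                    if constraint_entity_type in t["slots"]]
```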
In some embodiments, after detecting entity constraint information in the single-round dialog template information, further includes:
If the single-round dialogue template information does not contain entity constraint information, the second single-round dialogue template is randomly sampled to obtain the target single-round dialogue template.
Alternatively, a domain entity type is acquired, where the domain entity type is the entity type corresponding to the domain type; a second entity slot is matched in the second single-round dialogue template based on the domain entity type; and a target single-round dialogue template comprising the second entity slot is acquired.
In some embodiments, the dialogue behavior type is also used to characterize the dialogue initiator, which may be the display device or the user. If the single-round dialogue template information does not contain entity constraint information, the method further comprises:
The dialogue initiator is acquired based on the dialogue behavior type. If the dialogue initiator is the display device, the step of randomly sampling the second single-round dialogue template is performed. If the dialogue initiator is the user, the step of acquiring the domain entity type is performed.
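A sketch of this branch, under the same assumed data shapes as above:

```python
import random

# Initiator-dependent branch sketch: device-initiated turns are sampled
# randomly; user-initiated turns are matched against the domain entity slot.
def pick_template(second_templates, initiator, domain_entity_type):
    if initiator == "device":
        return random.choice(second_templates)  # random sampling branch
    return next(t for t in second_templates     # domain-entity-slot branch
                if domain_entity_type in t["slots"])
```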
In some embodiments, obtaining the target entity according to the target entity type further comprises:
traversing a preset entity database to obtain entity data corresponding to the target entity type; the entity database includes a plurality of entities. The target entity is obtained based on the entity data.
In some embodiments, obtaining the target entity based on the entity data further comprises:
sampling entity data to obtain a first entity; performing entity link query processing on the first entity based on a preset entity knowledge graph to obtain an entity prediction result, wherein the entity prediction result comprises a plurality of second entities and entity prediction probabilities of the second entities; and sampling the second entity based on the entity prediction probability to obtain a target entity.
Or, performing random sampling processing on the entity data to obtain the target entity.
In some embodiments, after combining the single-round dialog corpus to obtain the multi-round dialog corpus, the method further includes:
Semantic annotation processing is performed on the multi-round dialogue corpus to obtain training data; the semantic annotation processing comprises intent annotation, domain annotation, and entity annotation. A speech recognition model is constructed based on the training data. In response to a voice instruction, the voice instruction is recognized based on the speech recognition model to generate reply dialogue information.
For the same or similar parts among the embodiments in this specification, reference may be made to one another; details are not repeated here.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, in essence or in the parts contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments, or of some parts of the embodiments, of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
a display;
a controller configured to:
acquiring a multi-round dialogue template from a multi-round dialogue template database; the multi-round dialogue template comprises a plurality of pieces of single-round dialogue template information;
selecting a target single-round dialogue template based on the single-round dialogue template information; the target single-round dialogue template comprises a slot to be filled;
obtaining a target entity type corresponding to the slot to be filled, and obtaining a target entity according to the target entity type;
performing entity filling on the target single-round dialogue template based on the target entity to obtain single-round dialogue corpus;
and combining the single-round dialogue corpus to obtain multi-round dialogue corpus.
2. The display device of claim 1, wherein the controller performs acquiring a multi-round dialogue template from a multi-round dialogue template database, further configured to:
acquiring probability distribution information of the multi-round dialogue templates in the multi-round dialogue template database;
calculating the sampling proportion of the multi-round dialogue templates based on the probability distribution information;
and acquiring the multi-round dialogue templates from the multi-round dialogue template database based on the multi-round dialogue template sampling proportion.
3. The display device of claim 1, wherein the controller performs selecting a target single-round dialogue template based on the single-round dialogue template information, further configured to:
acquiring a dialogue behavior type based on the single-round dialogue template information, wherein the dialogue behavior type is used to characterize the dialogue intent;
traversing a single-round dialogue template database to obtain a first single-round dialogue template corresponding to the dialogue behavior type; the single-round dialogue template database stores a plurality of single-round dialogue templates;
acquiring a domain type based on the single-round dialogue template information;
and screening target single-round dialogue templates from the first single-round dialogue templates according to the domain type.
4. The display device of claim 3, wherein the controller performs screening of the first single-round dialogue template based on the domain type to obtain a target single-round dialogue template, further configured to:
acquiring a second single-round dialogue template, wherein the second single-round dialogue template is a first single-round dialogue template associated with the domain type;
detecting entity constraint information in the single-round dialogue template information;
if the single-round dialogue template information contains entity constraint information, acquiring a first entity type corresponding to the entity constraint information;
based on the first entity type, matching a first entity slot in the second single-round dialogue template;
And acquiring a target single-round dialogue template based on the first entity slot position, wherein the target single-round dialogue template comprises the first entity slot position.
5. The display device of claim 4, wherein the controller, after performing the detection of entity constraint information in the single-round dialogue template information, is further configured to:
if the single-round dialogue template information does not contain entity constraint information, randomly sampling the second single-round dialogue template to obtain a target single-round dialogue template;
or acquiring a domain entity type, wherein the domain entity type is an entity type corresponding to the domain type; based on the domain entity type, matching a second entity slot in the second single-round dialogue template; and obtaining a target single-round dialogue template, wherein the target single-round dialogue template comprises the second entity slot.
6. The display device of claim 5, wherein the dialogue behavior type is further used to characterize a dialogue initiator, the dialogue initiator comprising a display device and a user; if no entity constraint information is included in the single-round dialogue template information, the controller is further configured to:
acquiring the dialogue initiator based on the dialogue behavior type;
if the dialogue initiator is a display device, executing the step of randomly sampling the second single-round dialogue template;
and if the dialogue initiator is a user, executing the step of acquiring the domain entity type.
7. The display device of claim 4, wherein the controller performs the obtaining of the target entity according to the target entity type, further configured to:
traversing a preset entity database to obtain entity data corresponding to the target entity type; the entity database comprises a plurality of entities;
and acquiring a target entity based on the entity data.
8. The display device of claim 7, wherein the controller performs the obtaining of the target entity based on the entity data, further configured to:
sampling the entity data to obtain a first entity; performing entity link query processing on the first entity based on a preset entity knowledge graph to obtain an entity prediction result, wherein the entity prediction result comprises a plurality of second entities and entity prediction probabilities of the second entities; sampling the second entity based on the entity prediction probability to obtain a target entity;
Or performing random sampling processing on the entity data to obtain a target entity.
9. The display device according to claim 1, further comprising: an audio input interface configured to connect to a sound collector for collecting user speech to generate a speech instruction;
the controller, after performing the combining of the single-round dialogue corpus to obtain the multi-round dialogue corpus, is further configured to:
performing semantic annotation processing on the multi-round dialogue corpus to obtain training data; the semantic annotation processing comprises intent annotation, domain annotation, and entity annotation;
constructing a voice recognition model based on the training data;
in response to a voice instruction, recognizing the voice instruction based on the voice recognition model to generate reply dialogue information.
10. A multi-round dialogue corpus generation method, applied to a display device, the method comprising:
acquiring a multi-round dialogue template from a multi-round dialogue template database; the multi-round dialogue template comprises a plurality of pieces of single-round dialogue template information;
selecting a target single-round dialogue template based on the single-round dialogue template information; the target single-round dialogue template comprises a slot to be filled;
Obtaining a target entity type corresponding to the slot to be filled, and obtaining a target entity according to the target entity type;
performing entity filling on the target single-round dialogue template based on the target entity to obtain single-round dialogue corpus;
and combining the single-round dialogue corpus to obtain multi-round dialogue corpus.