CN117809649A - Display device and semantic analysis method - Google Patents

Display device and semantic analysis method

Info

Publication number
CN117809649A
Authority
CN
China
Prior art keywords
voice
instruction
text
display device
information
Legal status
Pending
Application number
CN202310124921.8A
Other languages
Chinese (zh)
Inventor
胡胜元
彭超
胡仁林
Current Assignee
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202310124921.8A priority Critical patent/CN117809649A/en
Publication of CN117809649A publication Critical patent/CN117809649A/en


Abstract

Some embodiments of the present application provide a display device and a semantic analysis method. After the sound collector collects a voice instruction, the display device recognizes a first voice text corresponding to the voice instruction. The display device also detects historical voice instructions collected by the sound collector within a preset period and recognizes a second voice text corresponding to the historical voice instructions. Based on the first voice text and the second voice text, the display device obtains continuous instruction information, obtains the entities in the continuous instruction information, and obtains the entity tags corresponding to those entities. The display device performs information fusion processing on the continuous instruction information based on the entity tags to obtain fused instruction information, and performs semantic analysis on the fused instruction information, thereby determining and executing the control instruction corresponding to the voice instruction. Because the display device analyzes the current voice instruction in combination with the user's historical voice instructions, it can obtain the semantics accurately, respond to the voice instruction correctly, and improve the user experience.

Description

Display device and semantic analysis method
Technical Field
The present application relates to the technical field of display devices, and in particular to a display device and a semantic analysis method.
Background
With the development of artificial intelligence technology, voice interaction has gradually entered many areas of daily life. Users can rely on voice interaction to control a display device and to perform a series of operations such as watching video, listening to music, checking the weather, and controlling other equipment.
When a display device implements voice interaction, a speech recognition module generally transcribes the voice instruction input by the user into text, and a semantic analysis module then performs lexical, syntactic, and semantic analysis on that text to determine the user's intention. Finally, the control side drives the smart electronic device to perform the corresponding operation according to that intention.
However, a user interacting with the display device by voice may speak in a continuous dialogue. For example, a user adjusting the volume may first say "adjust volume to 30" and, if not satisfied, follow up with "adjust to 50". If the display device analyzes the instruction "adjust to 50" on its own, it may be unable to determine the user's control intention, fail to obtain the semantics corresponding to the text accurately, and therefore fail to respond to the voice instruction correctly, which seriously degrades the user experience.
Disclosure of Invention
The present application provides a display device and a semantic analysis method, which address the problem in the related art that the semantics corresponding to a text cannot be obtained accurately, so that the user's voice instructions cannot be responded to correctly and the user experience is seriously degraded.
In a first aspect, some embodiments of the present application provide a display device including a display, an audio input interface, and a controller. The audio input interface is configured to be connected to a sound collector, and the sound collector is used to collect the user's voice; the controller is configured to perform the following steps:
in response to the voice instruction collected by the sound collector, identifying a first voice text corresponding to the voice instruction;
detecting a historical voice instruction collected by the sound collector in a preset period, and identifying a second voice text corresponding to the historical voice instruction;
acquiring continuous instruction information based on the first voice text and the second voice text;
acquiring an entity in the continuous instruction information and acquiring an entity tag corresponding to the entity;
performing information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information;
and executing semantic analysis on the fused instruction information.
In a second aspect, some embodiments of the present application provide a semantic analysis method, applied to a display device, including:
in response to a voice instruction collected by a sound collector, identifying a first voice text corresponding to the voice instruction;
detecting a historical voice instruction collected by the sound collector in a preset period, and identifying a second voice text corresponding to the historical voice instruction;
acquiring continuous instruction information based on the first voice text and the second voice text;
acquiring an entity in the continuous instruction information and acquiring an entity tag corresponding to the entity;
performing information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information;
and executing semantic analysis on the fused instruction information.
According to the technical solutions above, some embodiments of the present application provide a display device and a semantic analysis method. After the sound collector collects a voice instruction, the display device recognizes a first voice text corresponding to the voice instruction. The display device detects historical voice instructions collected by the sound collector within a preset period and recognizes a second voice text corresponding to the historical voice instructions. Based on the first voice text and the second voice text, the display device obtains continuous instruction information, obtains the entities in the continuous instruction information, and obtains the entity tags corresponding to those entities. The display device then performs information fusion processing on the continuous instruction information based on the entity tags to obtain fused instruction information and performs semantic analysis on the fused instruction information, thereby determining and executing the control instruction corresponding to the voice instruction. Because the display device analyzes the current voice instruction in combination with the user's historical voice instructions, the semantics can be obtained accurately, the voice instruction can be responded to correctly, and the user experience is improved.
Drawings
In order to illustrate the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly described below. It will be obvious to those skilled in the art that other drawings can be derived from these drawings without inventive effort.
FIG. 1 illustrates a usage scenario of a display device according to some embodiments;
FIG. 2 shows a block diagram of the hardware configuration of the control apparatus 100 according to some embodiments;
FIG. 3 illustrates a block diagram of the hardware configuration of the display device 200 according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device 200 according to some embodiments;
FIG. 5 illustrates a voice interaction network architecture diagram of a display device in some embodiments;
FIG. 6 illustrates the display of history instruction fusion mode confirmation information on the display in some embodiments;
FIG. 7 illustrates an interactive flow diagram for components of a display device in some embodiments;
FIG. 8 illustrates a schematic diagram of a semantic analysis model in some embodiments;
FIG. 9 illustrates a schematic view of a scenario in which a user and a display device interact with each other in some embodiments;
FIG. 10 illustrates a schematic diagram of a display device displaying a search interface in some embodiments;
FIG. 11 is a schematic diagram illustrating media asset detail pages in some embodiments;
FIG. 12 illustrates a schematic diagram of a display device displaying a hint in some embodiments.
Detailed Description
For clarity and ease of implementation, the following provides a clear and complete description of exemplary implementations of the present application with reference to the accompanying drawings in which those implementations are illustrated. It should be apparent that the described exemplary implementations are only some, not all, of the embodiments of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," "second," "third," and the like in the description, in the claims, and in the figures above are used to distinguish between similar objects or entities and are not necessarily meant to limit a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The display device provided in the embodiments of the present application may take various forms, for example a television, a smart television, a laser projection device, a monitor, an electronic whiteboard, an electronic table, and the like. FIG. 1 and FIG. 2 show specific embodiments of the display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller. Communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication, and other short-range communication modes, and the display device 200 is controlled wirelessly or by wire. The user may control the display device 200 by inputting user instructions through remote-control keys, voice input, control panel input, and the like.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device may receive instructions without using the smart device or the control apparatus described above, and may instead be controlled by the user through touch, gestures, or the like.
In some embodiments, the display device 200 may also be controlled in ways other than through the control apparatus 100 and the smart device 300. For example, the user's voice commands may be received directly through a voice acquisition module built into the display device 200, or through a voice control device arranged outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be permitted to make communication connections via a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be a cluster, or may be multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of the configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control apparatus 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive the user's input operation instructions, convert them into instructions that the display device 200 can recognize and respond to, and thus mediate the interaction between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first through n-th input/output interfaces.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display; it receives the image signals output by the controller and displays video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or of interaction with the outside. For example, the detector 230 may include a light receiver, a sensor for capturing the intensity of ambient light; an image collector, such as a camera, which may be used to collect external environment scenes, user attributes, or user interaction gestures; or a sound collector, such as a microphone, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, or the like. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals, EPG data signals, and the like from the received wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the controller includes at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), random access memory (RAM), read-only memory (ROM), first through n-th input/output interfaces, a communication bus, and the like.
The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly comprises the common applications on the television and an application framework (Application Framework). The common applications are mainly browser-based applications, such as HTML5 apps, and native applications (Native Apps).
The application framework (Application Framework) is a complete program model that provides the basic functions required by standard application software, such as file access and data exchange, together with the interfaces for using these functions (toolbars, status bars, menus, dialog boxes).
Native applications (Native Apps) may run online or offline and may support message pushing and local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises the HAL interface, hardware, and drivers. The HAL interface is a unified interface against which all television chips are adapted, with the specific logic implemented by each chip. The drivers mainly include: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor), power supply driver, and the like.
Fig. 5 illustrates a voice interaction network architecture diagram of the display device in some embodiments. As shown in fig. 5, the display device 200 may receive input information such as sound and output a processing result of that information. The speech recognition module is deployed with a speech recognition service (Automatic Speech Recognition, ASR) for recognizing audio as text; the semantic understanding module is deployed with a semantic understanding service (Natural Language Understanding, NLU) for performing semantic analysis on the text; the service management module is deployed with service instruction management services such as dialog management (Dialog Management, DM) for providing service instructions; the language generation module is deployed with a language generation service (Natural Language Generation, NLG) for converting the instructions that the display device is to execute into text; and the speech synthesis module is deployed with a speech synthesis service (Text To Speech, TTS) for processing the text corresponding to the instructions and sending the result to a loudspeaker for broadcasting. The voice interaction network architecture may contain multiple entity service devices providing different services, and one or more functional services may be integrated in one or more entity service devices.
Taking the information input to the display device 200 as a query sentence input by voice as an example:
Speech recognition: after receiving the query sentence input by voice, the display device 200 may perform noise reduction and feature extraction on the audio of the query sentence, where noise reduction may include steps such as removing echo and environmental noise.
Semantic understanding: natural language understanding is performed on the recognized candidate text and its associated context. The text is parsed into structured, machine-readable information such as business domain, intent, and word slots to express its semantics, and a confidence score is determined for each executable intent; the semantic understanding module then selects one or more candidate executable intents based on these confidence scores.
Service management: according to the semantic analysis result of the query sentence, the semantic understanding module issues a query instruction to the corresponding service management module to obtain the query result given by the service, the actions required to complete the user's request are performed, and the device execution instruction corresponding to the query result is fed back.
Language generation: configured to turn the information or instructions into language text. Dialogues can be divided into chit-chat, task-oriented, knowledge question-answering, and recommendation types. In chit-chat dialogue, the NLG performs intention recognition, sentiment analysis, and the like according to the context and then generates an open-ended reply; in task-oriented dialogue, the reply is generated according to a learned policy and generally includes clarification requests, user guidance, queries, confirmations, and dialogue closings; in knowledge question-answering dialogue, the knowledge required by the user (knowledge items, entities, passages, etc.) is generated based on question-type recognition and classification, information retrieval, or text matching; and in a recommendation dialogue system, interest matching and ranking of candidate recommended content are performed according to the user's preferences, and the recommended content is then generated for the user.
Speech synthesis: configured to present speech output to the user. The speech synthesis module synthesizes speech output based on the text provided by the digital assistant; for example, a generated dialogue response in the form of a text string is converted into audible speech output.
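Taken together, the five services above form a single pass from the user's audio to the spoken reply. The following sketch is purely illustrative: every function in it is a stub assumed for this description, not the interface of any actual module of the display device 200 or of the server 400.

```python
# Illustrative sketch of the ASR -> NLU -> DM -> NLG -> TTS chain described above.
# All functions are placeholder stubs assumed for this example.

def asr_recognize(audio: bytes) -> str:
    return "set the volume to 30"                      # placeholder transcription

def nlu_parse(text: str) -> dict:
    return {"domain": "system", "intent": "volume.set", "slots": {"value": "30"}}

def dm_dispatch(semantics: dict) -> str:
    return f"execute {semantics['intent']} with {semantics['slots']}"  # device execution instruction

def nlg_generate(instruction: str) -> str:
    return "OK, the volume has been set to 30."        # reply text for the user

def tts_synthesize(reply_text: str) -> bytes:
    return reply_text.encode("utf-8")                  # stands in for synthesized audio

def handle_voice_query(audio: bytes) -> bytes:
    text = asr_recognize(audio)             # speech recognition
    semantics = nlu_parse(text)             # semantic understanding
    instruction = dm_dispatch(semantics)    # service management
    reply_text = nlg_generate(instruction)  # language generation
    return tts_synthesize(reply_text)       # speech synthesis, sent to the loudspeaker

print(handle_voice_query(b"..."))
```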
It should be noted that the architecture shown in fig. 5 is only an example, and is not intended to limit the scope of the present application. Other architectures may also be employed to achieve similar functionality in embodiments of the present application, for example: all or part of the above process may be performed by the display device 200, which is not described herein.
In some embodiments, after the user controls the display device to power on, the controller 250 may control the display 260 to display a user interface. The user interface may be a specific target image, for example, various media materials obtained from a network signal source, including video, pictures and the like. The user interface may also be some UI interface of the display device, such as a system recommendation page or the like. The user can control the display device to realize various services, such as playing media assets, entertainment games, video chat, and the like. The user may also control the display device using a voice recognition function.
In some embodiments, the display device may be provided with an audio input interface configured to connect to a sound collector for collecting signals, such as user speech. The display device may also incorporate a sound collector. The voice recognition function may be implemented by the cooperation of the sound collector and the controller 250, and the semantic function may be implemented by the controller 250.
The user may control the display device 200 using a control apparatus such as a remote controller. For a smart TV, for example, the user may use the remote controller to make the TV play media or adjust the volume.
The user may also control the display device 200 by sending it voice instructions. The display device can recognize the user's voice instruction and convert it into a control instruction, then execute the corresponding operation to realize the function the user requires.
In some embodiments, the controller 250 may control the sound collector to collect voice instructions input by the user. After the voice collector collects the voice command, the controller 250 may parse the voice command to obtain a voice text.
The controller 250 may send the received voice data to a speech recognition service, which converts it into text information to obtain the voice text. The speech recognition service is a service that may be deployed on the display device 200 and may include a speech recognition module and a semantic analysis module: the speech recognition module recognizes the audio as text, and the semantic analysis module performs semantic analysis on the text. For example, the speech recognition module may parse a voice instruction input by the user to recognize the voice text, and the semantic analysis module then performs lexical, syntactic, and semantic analysis on the voice text to understand the user's intention, so that the voice instruction can be executed and the corresponding function realized.
It should be noted that, when using the voice recognition function, the user may carry on a continuous dialogue with the display device, that is, the user may send voice instructions one after another on the assumption that the display device already knows the earlier instructions, so a later voice instruction may be missing part of the key information. For example, when the user wants to adjust the volume of the display device, the user may first say "adjust volume to 30" and, if not satisfied, continue with "adjust to 50". From the speech sent continuously by the user, the user clearly wants the display device to adjust the volume to 50. However, if the display device performs semantic analysis only on each voice instruction in isolation, "adjust to 50" obviously lacks key information: the semantics cannot be obtained accurately, the display device cannot determine the user's real intention, and the voice instruction cannot be responded to correctly, which seriously degrades the user experience.
In some embodiments, when the display device is unable to respond to the user's voice instruction, the display 260 may be controlled to display a prompt, for example "This question is too difficult, Xiao X is still learning", to tell the user that the display device currently cannot respond to the voice instruction. The user may then input a voice instruction again to supplement the content so that the display device can respond.
However, when inputting the instruction again, the user may think the device simply did not hear clearly and repeat the original sentence, so the display device still cannot recognize the voice instruction. And even if the user does supplement the content, the extra operation is cumbersome and the user experience is poor.
Therefore, the display device provided in the embodiments of the present application has a history instruction fusion function. When the user sends a voice instruction, the display device can detect the historical voice instructions previously input by the user, fuse them with the current voice instruction, and then perform semantic analysis, so that information such as the user's intention can be analyzed accurately, the voice instruction can be responded to correctly, and the user experience is improved.
The display device may be provided with a history instruction fusion mode. When the display device enters the history instruction fusion mode, it enables the history instruction fusion function described above.
In some embodiments, the user may send the history instruction fusion mode instruction to the display device by operating a designated key on the remote controller; in practical applications, the correspondence between the history instruction fusion mode instruction and the remote controller key is bound in advance. For example, a history instruction fusion mode key is provided on the remote controller: when the user presses this key, the remote controller sends a history instruction fusion mode instruction to the controller 250, and the controller 250 controls the display device to enter the history instruction fusion mode. When the user presses the key again, the controller 250 may control the display device to exit the history instruction fusion mode.
In some embodiments, the user may send a history command fusion mode command to the display device by way of voice input using a sound collector of the display device, such as a microphone, to control the display device to enter a history command fusion mode. The display device can be provided with an intelligent voice system, and the intelligent voice system can recognize the voice of the user so as to extract instruction content input by the user.
In some embodiments, the history instruction fusion mode instruction may also be sent to the display device when the user controls the display device with a smart device such as a mobile phone. In practical applications, a control can be provided in the mobile phone through which the user chooses whether to enter the history instruction fusion mode; the corresponding history instruction fusion mode instruction is then sent to the controller 250, and the controller 250 controls the display device to enter the history instruction fusion mode.
A history instruction fusion mode option can also be provided in the UI of the display device; by clicking this option, the user can control the display device to enter or exit the history instruction fusion mode.
In some embodiments, to prevent the user from triggering the history instruction fusion mode by mistake, when the controller 250 receives the history instruction fusion mode instruction, the display 260 may be controlled to display history instruction fusion mode confirmation information, so that the user confirms a second time whether the display device should enter the history instruction fusion mode. FIG. 6 illustrates the display of history instruction fusion mode confirmation information on the display 260 in some embodiments.
FIG. 7 illustrates an interaction flow diagram of the components of the display device in some embodiments. As shown in fig. 7, the method comprises the following steps:
S101, in response to the voice instruction collected by the sound collector, the controller 250 recognizes a first voice text corresponding to the voice instruction.
S102, the controller 250 detects a historical voice instruction collected by the sound collector within a preset period, and recognizes a second voice text corresponding to the historical voice instruction.
S103, the controller 250 acquires continuous instruction information based on the first voice text and the second voice text.
S104, the controller 250 acquires the entity in the continuous instruction information and acquires the entity label corresponding to the entity.
S105, the controller 250 performs information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information.
S106, the controller 250 performs semantic analysis on the fused instruction information.
In some embodiments, the user may turn on the voice recognition function of the display device and the controller may control the sound collector to collect the signal. After the user inputs the voice command, the voice collector can collect the voice command and send the voice command to the display device.
In response to the voice command collected by the sound collector, the controller can identify the voice command to obtain a text corresponding to the voice command, which is called a first voice text in the embodiment of the application. The controller may recognize the first voice text corresponding to the voice instruction using a voice recognition service in the display device.
In some embodiments, the display device may also include a third party speech recognition interface. After receiving the voice command input by the user, the controller may send the voice data to the third party voice recognition interface, and convert the voice command of the user into the first voice text by using the third party voice recognition device and the like.
The controller may also send the voice instruction to the server. The server may generate the first voice text from the voice instruction and feed the first voice text back to the display device.
In some embodiments, when receiving a voice instruction sent by the sound collector, the controller may determine the current time, i.e. the time when the voice instruction was collected, which is referred to as the collection time in the embodiments of the present application.
The controller can acquire the historical voice command input by the user before the acquisition time, so that the current voice command and the historical voice command are subjected to fusion analysis, and the current command semantics of the user are accurately analyzed.
The controller can obtain the voice collection state of the sound collector within a preset period before the collection time; the preset period may be, for example, one minute. By detecting whether the sound collector collected a voice instruction from the user within the preset period, the controller determines whether the user is sending voice instructions continuously.
If the voice collector collects the historical voice command, the controller can identify the historical voice command to obtain text information corresponding to the historical voice command, and the text information is called a second voice text in the embodiment of the application.
If the sound collector does not collect the historical voice instruction, the controller can determine the preset text as the second voice text, and the preset text can be blank text. It should be noted that, under the condition that no historical voice command exists, the preset text and the first voice text are fused, so that the data format of the subsequent input can be ensured to be uniform.
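As an illustration of the two cases above, the following sketch selects the second voice text from a hypothetical history list; the sixty-second preset period, the field names, and the helper itself are assumptions for this example only.

```python
PRESET_PERIOD_S = 60.0   # the "preset period" before the collection time, e.g. one minute
PRESET_TEXT = ""         # blank preset text used when no historical instruction exists

def second_voice_text(history: list, collection_time: float) -> str:
    """Return the recognized text of the latest historical voice instruction collected
    within the preset period, or the blank preset text if there is none."""
    recent = [item for item in history
              if 0.0 <= collection_time - item["time"] <= PRESET_PERIOD_S]
    if not recent:
        return PRESET_TEXT                # keeps the later input format uniform
    return max(recent, key=lambda item: item["time"])["text"]

history = [{"time": 100.0, "text": "open YouTube"}]
print(second_voice_text(history, collection_time=130.0))  # 'open YouTube'
print(second_voice_text(history, collection_time=500.0))  # '' (outside the preset period)
```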
In some embodiments, the controller may obtain the continuous instruction information based on the first voice text and the second voice text. That is, the controller can perform fusion processing on the first voice text and the second voice text, so as to comprehensively analyze the current voice instruction of the user.
The controller may first obtain the historical intent and instruction execution status of the historical voice instruction. The historical intent refers to user intent corresponding to a historical voice instruction, and the historical intent may be user intent obtained after analysis of the historical voice instruction based on a historical instruction fusion function of the display device in the embodiment of the application. The instruction execution state refers to an execution state of a history voice instruction by the display device, such as execution success or execution failure.
The controller may fuse the above. The controller can splice the first voice text, the historical intent, the instruction execution state and the second voice text based on a preset template to obtain a spliced text.
The controller may tokenize the above contents and fuse them according to a preset template. The preset template may be: [CLS] A [B] [C] [SEP] D [SEP].
Here, [CLS] (for global feature aggregation) is a sentence token in the template that represents the sentence information of the entire template. [SEP] marks the end of one voice text and is used as a separator: the first and second [SEP] mark the end of the second voice text and the first voice text, respectively. A represents the second voice text, B represents the historical intent, C represents the instruction execution state, and D represents the first voice text.
It should be noted that [CLS], [SEP], the historical intent, and the instruction execution state are each regarded as a single token, whereas in the first voice text and the second voice text each word is taken as a token.
The fusion of the contents can be realized by adding the token corresponding to the first voice text, the historical intent, the instruction execution state and the second voice text into a preset template, so that a spliced text is obtained.
For example, the second voice text corresponding to the historical voice instruction input by the user is "open YouTube". After querying, the controller finds that the historical intent of the previous round is app.open, and that the display device successfully opened the YouTube application, i.e. the instruction execution state is expressed as apisuccess.
The first voice text corresponding to the voice instruction currently input by the user is "Succession".
The controller can add the token corresponding to the information into a preset template, and the obtained spliced text is:
[CLS]open YouTube[app.open][apisuccess][SEP]Succession[SEP]
This spliced text includes 8 tokens, namely: "[CLS]", "open", "YouTube", "[app.open]", "[apisuccess]", "[SEP]", "Succession", "[SEP]". Each token may be regarded as a word of the spliced text.
In some embodiments, when there is no historical voice instruction in the preset period, that is, the second voice text is blank text, the historical intent and the instruction execution state are also blank. In this case, to keep the data format of the spliced text consistent, the tokens corresponding to the historical intent, the instruction execution state, and the second voice text may all be set to [PAD] during fusion. [PAD] is a padding symbol and is not particularly limited.
For example, the first voice text of the user's current voice instruction is "search for spider man" and there is no historical voice instruction, that is, the user is not inputting voice instructions continuously. The resulting spliced text may be:
[CLS][PAD][PAD][PAD][SEP]search for spider man[SEP]
where the historical intent, the instruction execution state, and the second voice text are all represented by [PAD].
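The construction of the spliced text can be sketched as follows. The helper below is illustrative only: it assumes whitespace tokenization of the voice texts and treats [CLS], [SEP], [PAD], the historical intent, and the instruction execution state as single tokens, reproducing the two examples above.

```python
def build_spliced_tokens(first_text: str,
                         second_text: str = "",
                         history_intent: str = "",
                         exec_state: str = "") -> list[str]:
    """Assemble the spliced text [CLS] A [B] [C] [SEP] D [SEP].

    A   = second voice text (historical instruction), one token per word
    [B] = historical intent, a single token ([PAD] if absent)
    [C] = instruction execution state, a single token ([PAD] if absent)
    D   = first voice text (current instruction), one token per word
    """
    a_tokens = second_text.split() if second_text else ["[PAD]"]
    b_token = f"[{history_intent}]" if history_intent else "[PAD]"
    c_token = f"[{exec_state}]" if exec_state else "[PAD]"
    d_tokens = first_text.split()
    return ["[CLS]", *a_tokens, b_token, c_token, "[SEP]", *d_tokens, "[SEP]"]

# Continuous-dialogue example from the description (8 tokens):
# ['[CLS]', 'open', 'YouTube', '[app.open]', '[apisuccess]', '[SEP]', 'Succession', '[SEP]']
print(build_spliced_tokens("Succession", "open YouTube", "app.open", "apisuccess"))

# No historical instruction:
# ['[CLS]', '[PAD]', '[PAD]', '[PAD]', '[SEP]', 'search', 'for', 'spider', 'man', '[SEP]']
print(build_spliced_tokens("search for spider man"))
```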
After the spliced text is obtained, the controller may encode each word segment in the spliced text based on a preset format to obtain a word vector corresponding to each word segment, which is referred to as a first word vector in this embodiment of the present application.
Each token in the spliced text is treated as a word segment, and the preset format may be a 768-dimensional vector, for subsequent model processing.
The controller may process each word segment with an encoding method such as one-hot encoding to obtain the first word vector. The first word vectors of all word segments together form the continuous instruction information corresponding to the current voice instruction.
In some embodiments, after fusing the historical user instructions, the controller may further analyze the entity content in the instructions and fuse the corresponding meanings, so as to better analyze the semantics of the user instruction. For example, the user's voice instruction may ask to search for media assets, where the entity may be a title such as "Succession" or "Fade" or other asset information. Without the corresponding knowledge, it is difficult to translate this part of the request into instructions reasonably and respond correctly, so the content represented by the entity should be identified.
The controller may first obtain the entity in the continuous instruction information, and obtain the entity tag corresponding to the entity.
Note that, in addition to the voice texts, the continuous instruction information includes the historical intent and the instruction execution state; however, entities appearing in the historical intent and the instruction execution state cannot represent the content indicated by the user, so these two items can be ignored and only the entities in the first voice text and the second voice text are obtained.
To obtain the entities, the controller may first perform word segmentation on the voice text and determine the entities based on part of speech. In these embodiments, the entities are the noun words contained in the voice text, such as movie names and person names, all of which are existing nouns. The controller may perform part-of-speech tagging and named entity recognition on the voice text, for example using Stanza (the Stanford natural language processing toolkit) or another lexical analysis tool. After deriving the part of speech of each word, the controller can keep the words that are named entities or have noun parts of speech and use them as the entities in the continuous instruction information.
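For illustration, the following sketch uses the Stanza toolkit mentioned above to keep named entities and noun-class words as candidate entities; the function and its output are assumptions of this description, and any other lexical analysis tool could be substituted.

```python
import stanza

# Download once beforehand: stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,ner")

def extract_entities(voice_text: str) -> list[str]:
    """Keep named entities and noun-class words as candidate entities."""
    doc = nlp(voice_text)
    named = [ent.text for ent in doc.ents]
    nouns = [word.text
             for sent in doc.sentences
             for word in sent.words
             if word.upos in ("NOUN", "PROPN")]
    # Preserve order, drop duplicates.
    seen, entities = set(), []
    for item in named + nouns:
        if item.lower() not in seen:
            seen.add(item.lower())
            entities.append(item)
    return entities

print(extract_entities("search for spider man"))  # entity list depends on the loaded English models
```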
In some embodiments, the controller may obtain the entity tag corresponding to the entity based on a preset knowledge graph.
Knowledge maps refer to a knowledge base describing various entities or concepts and their relationships that exist in the real world. In the embodiment of the application, a multilingual knowledge graph including relationships among different entities of multiple languages may be generated in advance, and the multilingual knowledge graph may be a YAGO multilingual knowledge graph.
YAGO contains both entities (e.g., movies, characters, cities, countries) and relationships between these entities (who acts in which movie, which city is in which country, etc.). The entities in YAGO carry names and aliases in the respective languages, and YAGO is stored in the standard Resource Description Framework (RDF), whose data consists of triples; each triple consists of a subject, a predicate (also called a "relationship" or "attribute"), and an object. YAGO divides the entities into different classes, such as people and cities, and there are inclusion relationships between the classes; for example, the city class is a subclass of the settlement class, which in turn is a subclass of the geographic location class. YAGO also defines relationships between entities; for example, a birth relationship may exist between a person entity and a place entity.
The controller may perform query processing on the entity according to a preset knowledge graph to obtain an entity tag result, where the entity tag result includes a plurality of entity tags and a classification probability of the entity tags.
In the knowledge graph, several other entities associated with a given entity can be obtained, and each of them can serve as an entity tag (a KB knowledge tag) of that entity. For example, for a person A, A may be the target entity, and the queried entity tag may be "actor", representing A's occupation, or the name of a media asset, representing an asset A has appeared in.
Each entity tag can correspond to a classification probability, and the entity tag with the highest classification probability can be determined as the entity tag corresponding to the entity.
For each entity, the BIO labeling scheme used in slot labeling can be adopted, giving a start-position label B and a continuation label I; at the same time, one word may belong to several entities and can be given the corresponding label for each of them. For example, the request sentence "Beyond Essays" is a movie name as a whole, while "Beyond" is also a band name, so its entity labels can be assigned as follows:
{“Beyond”:“B_movie,B_musician”;“Essays”:“I_movie”}
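A sketch of how the B/I labels above could be assigned by looking up every span of the request in the knowledge graph. The TAG_LOOKUP table is a toy stand-in for the YAGO query (which in practice also returns classification probabilities), and its contents are assumed only to reproduce the "Beyond Essays" example.

```python
# Toy stand-in for the knowledge-graph (YAGO) query: entity text -> class tags.
TAG_LOOKUP = {
    "beyond essays": ["movie"],
    "beyond": ["musician"],
}

def bio_tags(words: list[str]) -> dict[str, set[str]]:
    """Assign B_/I_ labels to each word for every knowledge-graph entity it starts or continues.
    Keyed by word text for brevity; a real implementation would key by token position."""
    labels: dict[str, set[str]] = {w: set() for w in words}
    n = len(words)
    for start in range(n):
        for end in range(start + 1, n + 1):          # try every contiguous span of words
            span = " ".join(words[start:end]).lower()
            for tag in TAG_LOOKUP.get(span, []):
                labels[words[start]].add(f"B_{tag}")          # span start
                for w in words[start + 1:end]:
                    labels[w].add(f"I_{tag}")                 # span continuation
    return labels

# {'Beyond': {'B_movie', 'B_musician'}, 'Essays': {'I_movie'}}
print(bio_tags(["Beyond", "Essays"]))
```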
in some embodiments, after the entity tag is obtained, the controller may perform information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information.
The controller may first encode the entity tag, for example with one-hot encoding, to obtain a first entity tag vector.
A fixed number of entity types, for example 53, may be set in the knowledge graph, each corresponding to both a B tag and an I tag, so there are 106 entity tags in total. The length of the one-hot code may therefore be set to 106, where 1 indicates that an entity tag is present and 0 that it is absent, which yields the one-hot code corresponding to the entity tag; the first entity tag vector is thus also of length 106.
The controller may convert the first entity tag vector into a second entity tag vector in a predetermined format. Considering the dimension 768 of the first word vector, the first entity tag vector may be converted into a vector with the dimension 768, resulting in a second entity tag vector.
The controller may obtain a second word vector for each word segment in the concatenated text based on the first word vector and the second entity tag vector. The controller may add the first word vector and the second entity tag vector to obtain a second word vector.
In some embodiments, the controller may perform feature analysis with a multilingual pre-trained model, and it may further process the second word vector to match the input expected by the model.
The controller may obtain a position vector and a text ID vector for each word segment. The position vector corresponds to the position of the word segment in the spliced text, and the text ID vector corresponds to the ID of the text to which the word segment belongs. Take the spliced text "[CLS]set the volume 30[volume.set][sys.success][SEP]50[SEP]" as an example: it contains 10 tokens, i.e. 10 word segments, namely "[CLS]", "set", "the", "volume", "30", "[volume.set]", "[sys.success]", "[SEP]", "50", "[SEP]". The position IDs of the word segments may be set to 0-9 in turn.
[SEP] is the end symbol of a voice text, so by means of [SEP] it can be determined which text each word segment belongs to. Here, "set", "the", "volume", "30", "[volume.set]", "[sys.success]", and the first "[SEP]" belong to the second voice text, while "50" and the final "[SEP]" belong to the first voice text. Note that [CLS] represents the sentence information of the whole spliced text and its purpose is the analysis of the current voice instruction, so [CLS] is attributed to the first voice text corresponding to the current voice instruction. The controller may preset the text IDs of the first voice text and the second voice text.
The controller may encode the position ID and the text ID of each word segment to obtain a position vector and a text ID vector of each word segment.
The controller may obtain a fusion vector for each word segment based on the second word vector, the position vector, and the text ID vector, i.e. by adding the three vectors together. The fusion vectors of all word segments together form the fused instruction information.
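The vector fusion described above can be sketched numerically as follows, assuming 106 possible B/I entity tags and 768-dimensional word vectors as in this description; the random projection matrix and the one-hot position and text ID encodings are placeholders used only to keep the example self-contained.

```python
import numpy as np

DIM, NUM_TAGS = 768, 106                            # word-vector dimension; 53 entity types x {B, I}
rng = np.random.default_rng(0)
tag_projection = rng.normal(size=(NUM_TAGS, DIM))   # stands in for a learned mapping to 768 dims

def fusion_vector(first_word_vec: np.ndarray,       # 768-dim encoding of one word segment
                  tag_ids: list[int],                # indices of its entity tags (may be empty)
                  position_id: int,
                  text_id: int,
                  max_len: int = 64) -> np.ndarray:
    # First entity-tag vector: multi-hot of length 106 (1 = tag present, 0 = absent).
    first_tag_vec = np.zeros(NUM_TAGS)
    first_tag_vec[tag_ids] = 1.0
    # Second entity-tag vector: converted into the 768-dimensional preset format.
    second_tag_vec = first_tag_vec @ tag_projection
    # Second word vector = first word vector + second entity-tag vector.
    second_word_vec = first_word_vec + second_tag_vec
    # Position and text-ID vectors (one-hot here purely for illustration).
    position_vec = np.eye(max_len, DIM)[position_id]
    text_vec = np.eye(2, DIM)[text_id]
    # Fusion vector = second word vector + position vector + text-ID vector.
    return second_word_vec + position_vec + text_vec

word_vec = rng.normal(size=DIM)
print(fusion_vector(word_vec, tag_ids=[3], position_id=5, text_id=1).shape)  # (768,)
```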
In some embodiments, the controller may perform semantic analysis on the fused instruction information.
The controller can process the fused instruction information with a preset multilingual pre-trained model to analyze the features of each word segment and obtain the feature vector of each word segment.
The multilingual pre-trained model may be a Transformer-based model such as LaBSE (Language-agnostic BERT Sentence Embedding). Pre-training of LaBSE covers both word-vector alignment and sentence-vector alignment. Multilingual BERT (a pre-trained language representation model) is used to train word-vector alignment on corpora in multiple languages; to map the encodings of different languages into the same space and align the word vectors of multiple languages, mixed pre-training with MMLM (Multilingual Masked Language Model) and TLM (Translation Language Model) is adopted so that the encodings of different languages are represented in the same semantic space. Sentence alignment is further trained on multilingual parallel corpora with a contrastive learning method, so that sentence vectors of different languages are aligned in a unified semantic space. During training, LaBSE uses dual encoders to encode the source language and the target language respectively; the two encoders share parameters and are initialized with the BERT model pre-trained by the MMLM and TLM methods.
In some embodiments, the controller may process the feature vector based on a preset semantic understanding model to obtain semantic information.
The semantic understanding model may be a prediction layer comprising a domain classifier, an intent classifier, and a slot prediction classifier. The domain classifier and the intent classifier each analyze the feature vector corresponding to the [CLS] token, yielding the domain type and the user intent corresponding to the current voice instruction. The slot prediction classifier analyzes the feature vectors of the remaining word segments, yielding the slot information.
Three label types can be set for each slot: B, I, and O. The B label marks the beginning of the entity that fills the slot, the I label marks the subsequent part of that entity, and the O label marks tokens that are not part of any slot entity, which are generally verbs and words of other parts of speech.
Therefore, based on the feature vectors, the controller can obtain the semantic information corresponding to the current voice instruction, which comprises the user intent, the domain type, and the slot information and represents the semantics of the user's voice instruction.
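A minimal sketch of such a prediction layer is shown below in PyTorch form: the domain and intent classifiers read the feature vector of the [CLS] token and the slot prediction classifier labels every remaining token. The layer sizes and class counts are placeholders, not values taken from this application.

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Domain/intent heads on the [CLS] feature; slot head on every remaining token."""
    def __init__(self, dim=768, n_domains=10, n_intents=40, n_slot_labels=107):
        super().__init__()                                # placeholder class counts
        self.domain_head = nn.Linear(dim, n_domains)
        self.intent_head = nn.Linear(dim, n_intents)
        self.slot_head = nn.Linear(dim, n_slot_labels)    # B/I labels per slot type plus O

    def forward(self, features):          # features: (batch, seq_len, dim) from the feature layer
        cls_vec = features[:, 0]          # the [CLS] position summarises the whole spliced text
        return {
            "domain": self.domain_head(cls_vec).argmax(-1),
            "intent": self.intent_head(cls_vec).argmax(-1),
            "slots": self.slot_head(features[:, 1:]).argmax(-1),  # one label per remaining token
        }

layer = PredictionLayer()
out = layer(torch.randn(1, 10, 768))      # e.g. the 10-token spliced text above
print(out["domain"].shape, out["intent"].shape, out["slots"].shape)
```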
In some embodiments, the controller may generate control instructions based on the user intent, the domain type, and the slot information and control the display device to execute the control instructions in response to the voice instructions of the user.
The controller can also package user intention, field type and slot position information and send the package information to the server, and the server can generate a control instruction according to the package information and send the control instruction to the display device. After receiving the control instruction sent by the server, the controller can control the display device to execute the control instruction, so that the requirement of a user is met.
In some embodiments, the user's current voice instruction may be a repeated description of the historical voice instruction rather than a continuation of it; in that case the voice instruction and the historical voice instruction are quite similar, and an accurate control instruction may not be generated. Therefore, the controller can detect whether the current voice instruction and the historical voice instruction are repeated instructions.
The controller can acquire the instruction similarity of the voice instruction and the historical voice instruction and judge whether the instruction similarity meets the preset similarity condition.
If the command similarity does not meet the preset similarity condition, the current voice command and the historical voice command are not repeated commands, and the controller can generate control commands based on user intention, field type and slot position information.
If the instruction similarity meets the preset similarity condition, the current voice instruction and the historical voice instruction are repeated instructions, and the controller may go on to obtain the instruction type of the voice instruction.
If the instruction type is a preset display device control instruction, the user intends to adjust a basic function of the display device. In the embodiments of the present application, a preset display device control instruction is an instruction for controlling a basic function of the display device; the basic functions include display device parameter adjustments, such as adjusting volume and brightness, as well as confirmation instructions, return instructions, and the like. Note that the basic functions of the display device do not include functions that must be executed by a specific application, such as media asset searching. In this case, the controller may generate the control instruction based on the user intent, the domain type, and the slot information.
If the instruction type is not a preset display device control instruction, the controller may input the first voice text into the media asset search application so that it searches for the corresponding media assets in response to the user's voice instruction. Alternatively, the controller may control the display device not to respond to the voice instruction.
In some embodiments, the instruction similarity includes text similarity and semantic similarity.
The preset similarity condition may be set as: the text similarity is greater than a first threshold and the semantic similarity is greater than a second threshold.
The controller may obtain the text similarity and the semantic similarity, respectively, and determine whether a similarity condition is satisfied.
To obtain the text similarity, the controller may acquire the text repetition rate of the first voice text and the second voice text and determine the text repetition rate as the text similarity. The text repetition rate can be obtained using equation (1), where pre_query and query denote the first voice text and the second voice text, respectively, and |set(pre_query) ∩ set(query)| denotes the number of words shared by the first voice text and the second voice text.
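Equation (1) itself is not reproduced above, so the following Python sketch assumes one plausible form: the number of shared words normalized by the number of distinct words in the current query. The function name and the normalization choice are assumptions.

def text_repetition_rate(pre_query: str, query: str) -> float:
    """Word-level overlap between the two voice texts.

    Equation (1) is not reproduced in the text; this sketch assumes the
    overlap |set(pre_query) & set(query)| is normalized by the number of
    distinct words in query.
    """
    pre_words, cur_words = set(pre_query.split()), set(query.split())
    if not cur_words:
        return 0.0
    return len(pre_words & cur_words) / len(cur_words)

print(text_repetition_rate("search for XXX movies",
                           "search for XXX movies third season"))  # 4/6 ≈ 0.67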
When the semantic similarity is obtained, the controller may first determine the user intent corresponding to the voice command and the historical voice command.
If the user intents corresponding to the voice instruction and the historical voice instruction are the same, the controller can acquire the slot information similarity of the first voice text and the second voice text and determine the slot information similarity as the semantic similarity. The slot information similarity similarity_slot is computed over the n pieces of slot information, where slot_i and preslot_i respectively denote the i-th piece of slot information corresponding to the first voice text and the second voice text.
If the user intents corresponding to the voice instruction and the historical voice instruction are different, a preset value is determined as the semantic similarity; the preset value can be 0.
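The two cases above can be combined into one semantic-similarity function, sketched below. Since the slot-similarity formula is not reproduced here, the sketch assumes it is the fraction of slot positions i for which slot_i equals preslot_i; the function and parameter names are illustrative.

def semantic_similarity(cur_intent, hist_intent, cur_slots, hist_slots, preset=0.0):
    """Sketch of the semantic-similarity rule described above.

    Assumption: the slot information similarity is the fraction of slot
    positions i for which slot_i equals preslot_i; different intents
    yield the preset value.
    """
    if cur_intent != hist_intent:
        return preset
    n = max(len(cur_slots), len(hist_slots))
    if n == 0:
        return 1.0  # assumption: same intent and no slots counts as fully similar
    matches = sum(1 for a, b in zip(cur_slots, hist_slots) if a == b)
    return matches / n

print(semantic_similarity("search_media", "search_media",
                          ["XXX movies", "third season"],
                          ["XXX movies", "third season"]))  # 1.0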
In some embodiments, the controller may also generate a semantic analysis model in advance. The semantic analysis model comprises an encoding layer, a feature analysis layer and a prediction layer. FIG. 8 illustrates a schematic diagram of the semantic analysis model in some embodiments. As shown in FIG. 8, the encoding layer is configured to fuse the first voice text and the second voice text to obtain the continuous instruction information, and then to fuse in the encoding of the corresponding entity tags, thereby obtaining the fused instruction information, namely E1-E10 in the figure, each representing the fusion vector corresponding to one word segment.
The feature analysis layer may be a Transformer model, which is used to analyze the fusion vector of each word segment to obtain a feature vector.
The prediction layer comprises a domain classifier, an intention classifier and a slot prediction classifier and is used for acquiring semantic information.
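As an illustration, the three layers can be sketched in PyTorch as follows. The layer sizes, the number of Transformer layers, the use of the first position as a sentence-level summary and the classifier dimensions are all assumptions; the embodiment only specifies the split into an encoding layer, a feature analysis layer and a prediction layer.

import torch
import torch.nn as nn

class SemanticAnalysisModel(nn.Module):
    """Sketch of the encoding / feature analysis / prediction layer split."""

    def __init__(self, vocab_size, num_tags, d_model=256,
                 num_domains=8, num_intents=32, num_slot_labels=64):
        super().__init__()
        # Encoding layer: fuse token ids and entity-tag ids into one vector per word segment.
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.tag_embedding = nn.Embedding(num_tags, d_model)
        # Feature analysis layer: a small Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.feature_layer = nn.TransformerEncoder(layer, num_layers=2)
        # Prediction layer: domain, intent and per-token slot classifiers.
        self.domain_head = nn.Linear(d_model, num_domains)
        self.intent_head = nn.Linear(d_model, num_intents)
        self.slot_head = nn.Linear(d_model, num_slot_labels)

    def forward(self, token_ids, tag_ids):
        fused = self.token_embedding(token_ids) + self.tag_embedding(tag_ids)  # E1..En
        features = self.feature_layer(fused)
        sentence_vec = features[:, 0]          # assumption: first position summarizes the input
        return (self.domain_head(sentence_vec),
                self.intent_head(sentence_vec),
                self.slot_head(features))      # one slot label per word segment

model = SemanticAnalysisModel(vocab_size=30000, num_tags=16)
domains, intents, slots = model(torch.randint(0, 30000, (1, 10)),
                                torch.randint(0, 16, (1, 10)))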
In some embodiments, the controller may control the display device to execute control instructions in response to user voice instructions to meet user needs.
FIG. 9 illustrates a schematic view of a scenario in which a user interacts with the display device in some embodiments. As shown in FIG. 9, the user inputs the voice instruction "search for XXX movies third season", the display device sends the voice instruction to the controller, and the controller feeds back to the display device a control instruction for searching for the related media assets. The display device may execute the control instruction and prompt the user by voice that "videos about XXX have been recommended for you".
In some embodiments, the display device may present a search interface for media that the user wants to search. FIG. 10 illustrates a schematic diagram of a display device displaying a search interface in some embodiments.
When the user selects a certain target media asset, the display device may display a media asset detail page of the target media asset. FIG. 11 is a schematic diagram of the media asset detail page in some embodiments. As shown in FIG. 11, the detail page may include a video preview window for displaying a preview of the target media asset; a media asset introduction, including the media asset type, cast and crew information and the like; a play list for displaying the episodes of the media asset; a play control, i.e. "full screen play" in FIG. 11; and a related recommendation area for displaying other media assets. The user can touch the play control to control the display device to display the target media asset in full screen.
If the display device does not find any related media assets, it may display preset prompt information for prompting the user that no related media assets were found. FIG. 12 illustrates a schematic diagram of the display device displaying the prompt information in some embodiments.
The embodiment of the application also provides a semantic analysis method applied to the display equipment, which comprises the following steps:
Step 1301, responding to a voice instruction collected by the sound collector, and identifying a first voice text corresponding to the voice instruction.
Step 1302, detecting a historical voice instruction collected by the sound collector in a preset period, and identifying a second voice text corresponding to the historical voice instruction.
Step 1303, obtaining continuous instruction information based on the first voice text and the second voice text.
Step 1304, obtaining an entity in the continuous instruction information, and obtaining an entity tag corresponding to the entity.
Step 1305, performing information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information.
Step 1306, performing semantic analysis on the fused instruction information.
In some embodiments, further comprising:
determining the acquisition time at which the voice instruction is collected by the sound collector;
acquiring a voice acquisition state of the sound collector in a preset period before the acquisition time;
if the sound collector collects the historical voice instruction, identifying a second voice text corresponding to the historical voice instruction;
and if the sound collector does not collect the historical voice instruction, determining the preset text as a second voice text.
In some embodiments, obtaining the continuous instruction information based on the first voice text and the second voice text further comprises the following steps (a sketch is given after this list):
acquiring a historical intent and an instruction execution state of a historical voice instruction;
splicing the first voice text, the historical intent, the instruction execution state and the second voice text based on a preset template to obtain a spliced text;
encoding the word segmentation of the spliced text based on a preset format to obtain a first word vector of the word segmentation;
forming the first word vectors of the plurality of word segments into the continuous instruction information.
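A minimal sketch of the splicing and encoding steps follows; the template markers [CUR], [INTENT], [STATE] and [HIST] are hypothetical, since the preset template is not disclosed.

def build_continuous_instruction(first_text, history_intent, history_state, second_text):
    """Splice the texts with a hypothetical template; the preset template
    of the embodiment is not disclosed, so the markers below are assumptions."""
    return (f"[CUR] {first_text} [INTENT] {history_intent} "
            f"[STATE] {history_state} [HIST] {second_text}")

spliced = build_continuous_instruction(
    "the third season", "search_media", "executed", "search for XXX movies")
tokens = spliced.split()  # a real system would use a proper word segmenter
# Each word segment would then be encoded in the preset format to obtain its
# first word vector, and the vectors together form the continuous instruction
# information.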
In some embodiments, performing the information fusion processing on the continuous instruction information based on the entity tag further includes the following steps (a sketch is given after this list):
performing one-hot encoding on the entity tag to obtain a first entity tag vector;
converting the first entity tag vector into a second entity tag vector in a preset format;
acquiring a second word vector of the word segmentation based on the first word vector and the second entity tag vector;
obtaining a position vector and a text ID vector of a word segmentation;
acquiring fusion vectors of the word segmentation based on the second word vector, the position vector and the text ID vector;
and forming the fused instruction information from the fusion vectors of the plurality of word segments.
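These fusion steps can be sketched as follows; the vector dimension, the randomly initialised lookup tables and the meaning of the text ID values are assumptions made only for illustration.

import numpy as np

d_model, num_tags, max_len, num_texts = 256, 16, 64, 2
rng = np.random.default_rng(0)

# Illustrative lookup tables (randomly initialised here for the sketch).
tag_projection = rng.normal(size=(num_tags, d_model))   # maps a one-hot tag to d_model
position_table = rng.normal(size=(max_len, d_model))    # position vectors
text_id_table  = rng.normal(size=(num_texts, d_model))  # assumption: 0 = current text, 1 = historical text

def fusion_vector(first_word_vec, tag_id, position, text_id):
    """second word vector = first word vector + projected one-hot entity tag;
    fusion vector = second word vector + position vector + text ID vector."""
    one_hot = np.zeros(num_tags)
    one_hot[tag_id] = 1.0
    second_word_vec = first_word_vec + one_hot @ tag_projection
    return second_word_vec + position_table[position] + text_id_table[text_id]

fused = fusion_vector(rng.normal(size=d_model), tag_id=3, position=0, text_id=0)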
In some embodiments, performing semantic analysis on the fused instruction information further comprises:
Generating feature vectors of the word segmentation according to the fusion instruction information by using a preset multilingual training model;
generating semantic information according to the feature vector by using a preset semantic understanding model; the semantic information includes user intent, domain type, and slot information.
In some embodiments, after semantic analysis is performed on the fused instruction information, the method further comprises the following steps (a sketch follows this list):
generating a control instruction based on the user intent, the domain type and the slot information;
executing the control instruction.
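A minimal sketch of this step is shown below; the intent names, domains and instruction fields are hypothetical, since the concrete instruction format of the display device is not specified.

# Hypothetical mapping from semantic information to a control instruction; the
# concrete instruction format of the display device is not specified here.
def generate_control_instruction(intent, domain, slots):
    if domain == "device_control" and intent == "adjust_volume":
        return {"action": "set_volume", "value": slots.get("level")}
    if intent == "search_media":
        return {"action": "search", "query": slots.get("title")}
    return {"action": "noop"}

instruction = generate_control_instruction(
    "search_media", "video", {"title": "XXX movies third season"})
# The controller would then execute `instruction` on the display device.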
In some embodiments, the method further comprises:
obtaining the command similarity of the voice command and the historical voice command;
if the instruction similarity does not meet the preset similarity condition, executing the step of generating a control instruction based on the user intent, the domain type and the slot information;
if the instruction similarity meets the preset similarity condition, acquiring the instruction type of the voice instruction; if the instruction type is a preset display device control instruction, executing the step of generating a control instruction based on the user intent, the domain type and the slot information; if the instruction type is not a preset display device control instruction, inputting the first voice text into the media asset search application so that the media asset search application executes or does not execute the voice instruction. This branching is sketched below.
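The sketch combines the branches above into one function; the threshold values and the returned labels are placeholders, not values taken from the embodiment.

def handle_instruction(text_sim, sem_sim, instruction_type, first_voice_text,
                       first_threshold=0.8, second_threshold=0.8):
    """Sketch of the repeated-instruction branch; thresholds and the returned
    labels are placeholders."""
    repeated = text_sim > first_threshold and sem_sim > second_threshold
    if not repeated or instruction_type == "display_device_control":
        return "generate_control_instruction"
    # Otherwise hand the first voice text to the media asset search application,
    # which may execute the voice instruction or ignore it.
    return ("forward_to_media_search", first_voice_text)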
In some embodiments, the instruction similarity includes text similarity and semantic similarity. The preset similarity conditions are as follows: the text similarity is greater than a first threshold and the semantic similarity is greater than a second threshold.
In some embodiments, obtaining the instruction similarity of the voice instruction and the historical voice instruction further comprises:
acquiring the text repetition rate of the first voice text and the second voice text, and determining the text repetition rate as the text similarity;
if the intention of the user corresponding to the voice command is the same as that of the user corresponding to the historical voice command, acquiring the similarity of the slot information, and determining the similarity of the slot information as semantic similarity;
and if the user intentions corresponding to the voice command and the historical voice command are different, determining the preset value as the semantic similarity.
In some embodiments, obtaining the entity tag corresponding to the entity further includes the following steps (a sketch follows this list):
inquiring entity tag information of an entity based on a preset knowledge graph, wherein the entity tag information comprises a plurality of entity tags and classification probability of the entity tags;
and determining the entity label with the highest classification probability as the entity label corresponding to the entity.
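A minimal sketch of this selection step follows; the dictionary-based knowledge-graph interface is an assumption made for illustration.

def select_entity_tag(entity, knowledge_graph):
    """Pick the entity tag with the highest classification probability.

    Assumption: the knowledge graph is exposed as a mapping from an entity
    string to (tag, probability) pairs; the real interface is not described.
    """
    tag_info = knowledge_graph.get(entity, [])
    if not tag_info:
        return None
    return max(tag_info, key=lambda pair: pair[1])[0]

kg = {"XXX movies": [("tv_series", 0.72), ("film", 0.21), ("person", 0.07)]}
print(select_entity_tag("XXX movies", kg))  # tv_series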
For identical or similar parts among the embodiments in this specification, reference may be made to one another, and details are not repeated here.
It will be apparent to those skilled in the art that the techniques in the embodiments of the present invention may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some or all of the technical features with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
a display;
an audio input interface configured to connect to a sound collector for collecting user speech;
a controller configured to:
responding to a voice instruction collected by the sound collector, and identifying a first voice text corresponding to the voice instruction;
detecting a historical voice instruction collected by the sound collector in a preset period, and identifying a second voice text corresponding to the historical voice instruction;
acquiring continuous instruction information based on the first voice text and the second voice text;
acquiring an entity in the continuous instruction information and acquiring an entity tag corresponding to the entity;
Performing information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information;
and executing semantic analysis on the fused instruction information.
2. The display device of claim 1, wherein the controller is further configured to:
determining the acquisition time at which the voice instruction is collected by the sound collector;
acquiring the voice acquisition state of the sound collector in a preset period before the acquisition time;
if the sound collector collects the historical voice instruction, identifying a second voice text corresponding to the historical voice instruction;
and if the sound collector does not collect the historical voice instruction, determining the preset text as a second voice text.
3. The display device of claim 1, wherein the controller executing the obtaining of the continuous instruction information based on the first voice text and the second voice text is further configured to:
acquiring a historical intent and an instruction execution state of the historical voice instruction;
splicing the first voice text, the historical intent, the instruction execution state and the second voice text based on a preset template to obtain a spliced text;
Encoding the word segmentation of the spliced text based on a preset format to obtain a first word vector of the word segmentation;
and forming a plurality of first word vectors of the word segmentation into continuous instruction information.
4. The display device according to claim 3, wherein the controller performs information fusion processing of the continuous instruction information based on the entity tag, and is further configured to:
performing one-hot encoding on the entity tag to obtain a first entity tag vector;
converting the first entity tag vector into a second entity tag vector with a preset format;
acquiring a second word vector of the word segmentation based on the first word vector and the second entity tag vector;
acquiring a position vector and a text ID vector of the segmentation;
acquiring a fusion vector of the segmentation based on the second word vector, the position vector and the text ID vector;
and forming fusion instruction information by using a plurality of fusion vectors of the segmentation.
5. The display device of claim 4, wherein the controller performing semantic analysis on the fused instruction information is further configured to:
generating feature vectors of the segmentation according to the fusion instruction information by using a preset multilingual training model;
Generating semantic information according to the feature vector by using a preset semantic understanding model; the semantic information includes user intent, domain type, and slot information.
6. The display device of claim 5, wherein the controller, after performing semantic analysis on the fused instruction information, is further configured to:
generating a control instruction based on the user intent, the domain type and the slot information;
and executing the control instruction.
7. The display device of claim 6, wherein the controller is further configured to:
obtaining the command similarity of the voice command and the historical voice command;
if the instruction similarity does not meet a preset similarity condition, executing the step of generating a control instruction based on the user intent, the domain type and the slot information;
if the instruction similarity meets a preset similarity condition, acquiring the instruction type of the voice instruction; if the instruction type is a preset display device control instruction, executing the step of generating a control instruction based on the user intent, the domain type and the slot information; and if the instruction type is not a preset display device control instruction, inputting the first voice text into a media asset search application so that the media asset search application executes the voice instruction or does not execute the voice instruction.
8. The display device of claim 7, wherein the instruction similarity includes text similarity and semantic similarity; the preset similarity condition is as follows: the text similarity is greater than a first threshold and the semantic similarity is greater than a second threshold;
the controller executing instructions that obtain the similarity of the voice instructions and the historical voice instructions is further configured to:
acquiring the text repetition rate of the first voice text and the second voice text, and determining the text repetition rate as the text similarity;
if the intention of the user corresponding to the voice command is the same as that of the user corresponding to the historical voice command, acquiring the slot information similarity, and determining the slot information similarity as semantic similarity;
and if the user intention corresponding to the voice command and the historical voice command is different, determining a preset value as the semantic similarity.
9. The display device of claim 1, wherein the controller executing the obtaining the entity tag corresponding to the entity is further configured to:
inquiring entity tag information of the entity based on a preset knowledge graph, wherein the entity tag information comprises a plurality of entity tags and classification probability of the entity tags;
And determining the entity label with the highest classification probability as the entity label corresponding to the entity.
10. A semantic analysis method applied to a display device, the method comprising:
responding to a voice instruction acquired by a sound collector, and identifying a first voice text corresponding to the voice instruction;
detecting a historical voice instruction collected by the sound collector in a preset period, and identifying a second voice text corresponding to the historical voice instruction;
acquiring continuous instruction information based on the first voice text and the second voice text;
acquiring an entity in the continuous instruction information and acquiring an entity tag corresponding to the entity;
performing information fusion processing on the continuous instruction information based on the entity tag to obtain fused instruction information;
and executing semantic analysis on the fused instruction information.
CN202310124921.8A 2023-02-07 2023-02-07 Display device and semantic analysis method Pending CN117809649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310124921.8A CN117809649A (en) 2023-02-07 2023-02-07 Display device and semantic analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310124921.8A CN117809649A (en) 2023-02-07 2023-02-07 Display device and semantic analysis method

Publications (1)

Publication Number Publication Date
CN117809649A true CN117809649A (en) 2024-04-02

Family

ID=90428643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310124921.8A Pending CN117809649A (en) 2023-02-07 2023-02-07 Display device and semantic analysis method

Country Status (1)

Country Link
CN (1) CN117809649A (en)

Similar Documents

Publication Publication Date Title
CN112000820A (en) Media asset recommendation method and display device
CN112182196A (en) Service equipment applied to multi-turn conversation and multi-turn conversation method
CN114118064A (en) Display device, text error correction method and server
CN114706944A (en) Server and multi-language text semantic understanding method
CN111539215A (en) Method, equipment and system for disambiguating natural language content title
CN115602167A (en) Display device and voice recognition method
CN115273848A (en) Display device and control method thereof
CN112053688B (en) Voice interaction method, interaction equipment and server
CN117809649A (en) Display device and semantic analysis method
CN113938755A (en) Server, terminal device and resource recommendation method
CN114781365A (en) End-to-end model training method, semantic understanding method, device, equipment and medium
CN111344664B (en) Electronic apparatus and control method thereof
CN114627864A (en) Display device and voice interaction method
CN115146652A (en) Display device and semantic understanding method
CN113035194B (en) Voice control method, display device and server
CN113076427B (en) Media resource searching method, display equipment and server
US20230267934A1 (en) Display apparatus and operating method thereof
CN117806587A (en) Display device and multi-round dialog prediction generation method
CN115150673B (en) Display equipment and media asset display method
WO2022193735A1 (en) Display device and voice interaction method
CN116151272A (en) Terminal equipment and semantic intention recognition method
CN117809633A (en) Display device and intention recognition method
CN115862615A (en) Display device, voice search method and storage medium
CN117809641A (en) Terminal equipment and voice interaction method based on query text rewriting
CN117807178A (en) Display device and adaptation method of semantic engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination