CN117806454A - Display device and semantic understanding method - Google Patents

Display device and semantic understanding method

Info

Publication number
CN117806454A
CN117806454A (application number CN202310246562.3A)
Authority
CN
China
Prior art keywords
semantic understanding
training data
multilingual
model
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310246562.3A
Other languages
Chinese (zh)
Inventor
胡仁林
朱飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202310246562.3A priority Critical patent/CN117806454A/en
Publication of CN117806454A publication Critical patent/CN117806454A/en
Pending legal-status Critical Current

Abstract

The application provides a display device and a semantic understanding method, wherein the display device comprises: a display; and a controller in communication with the display, the controller configured to: receive a voice instruction input by a user; perform voice recognition on the voice instruction to obtain a user question; extract features of the user question to obtain a feature vector of the user question; acquire a semantic understanding result of the feature vector through a multilingual semantic understanding model; generate response data of the user question according to the semantic understanding result; and respond according to the response data. According to the method and the device, multilingual semantic understanding is achieved, and the voice interaction experience is improved.

Description

Display device and semantic understanding method
Technical Field
The application relates to the technical field of natural language understanding, in particular to a display device and a semantic understanding method.
Background
In recent years, along with the rapid development of deep learning research and the continuous promotion of global strategies of various companies, multilingual human-computer interaction becomes a research hotspot. In the related art, a method for multi-language man-machine interaction is to translate a new language instruction input by a user into a primitive instruction which can be identified by a semantic understanding model by using machine translation so as to facilitate semantic understanding, translate a semantic understanding result output by the semantic understanding model into a new language semantic understanding result, and respond to the user according to the translated semantic understanding result. However, machine translation has low accuracy in translating small languages, especially spoken languages, and the errors are accumulated and propagated continuously, which often results in poor final human-computer interaction effect.
Disclosure of Invention
In order to solve the technical problems, the application provides display equipment and a semantic understanding method.
In a first aspect, the present application provides a display device comprising:
a display;
a controller, in communication with the display, configured to:
receiving a voice instruction input by a user;
performing voice recognition on the voice command to obtain a user question;
extracting features of the user question to obtain a feature vector of the user question;
acquiring a semantic understanding result of the feature vector through a multilingual semantic understanding model;
generating response data of the user question according to the semantic understanding result;
responding according to the response data.
In some embodiments, the controller is further configured to:
collecting first training data of a source language and second training data of a target language;
and alternately performing model training on the multilingual semantic understanding model with the first training data and the second training data.
In some embodiments, alternately performing model training on the multilingual semantic understanding model with the first training data and the second training data comprises:
extracting features of the first training data through a feature extractor to generate a first feature vector;
respectively performing intention understanding and slot filling on the first feature vector through the multilingual semantic understanding model to obtain a first semantic understanding result;
fitting the first training data according to the first semantic understanding result, and updating the multilingual semantic understanding model and the feature extractor;
performing feature extraction on the second training data through the feature extractor to generate a second feature vector;
and performing domain judgment on the second feature vector through a domain classifier, and updating the domain classifier and the feature extractor according to a judgment result.
In some embodiments, the loss function of the multilingual semantic understanding model comprises a cross entropy loss function, the loss function of the domain classifier comprises a squared difference loss function, and the loss function of the feature extractor comprises the difference between the loss function of the multilingual semantic understanding model and the loss function of the domain classifier.
In some embodiments, the first training data includes an intent category label and a slot label, and the second training data does not include the intent category label and the slot label.
In some embodiments, the multilingual semantic understanding model includes an intention classifier and a slot label classifier, and respectively performing intention understanding and slot filling on the first feature vector through the multilingual semantic understanding model to obtain a first semantic understanding result comprises:
carrying out intention understanding on the first feature vector through the intention classifier to obtain an intention category label;
and performing slot filling on the first feature vector through the slot label classifier to obtain slot labels, wherein the semantic understanding result of the first feature vector comprises the intention category label and the slot labels.
In a second aspect, the present application provides a semantic understanding method, the method comprising:
collecting labeled first training data in a source language and unlabeled second training data in a target language;
alternately performing model training on the multilingual semantic understanding model with the training data of the source language and the training data of the target language;
and inputting a user question in the target language into the multilingual semantic understanding model to obtain a semantic understanding result of the user question.
In some embodiments, alternately performing model training on the multilingual semantic understanding model with the first training data and the second training data comprises:
extracting features of the first training data through a feature extractor to generate a first feature vector;
respectively performing intention understanding and slot filling on the first feature vector through the multilingual semantic understanding model to obtain a first semantic understanding result;
fitting the first training data according to the first semantic understanding result, and updating the multilingual semantic understanding model and the feature extractor;
performing feature extraction on the second training data through the feature extractor to generate a second feature vector;
and performing domain judgment on the second feature vector through a domain classifier, and updating the domain classifier and the feature extractor according to a judgment result.
In some embodiments, the loss function of the multilingual semantic understanding model comprises a cross entropy loss function, the loss function of the domain classifier comprises a squared difference loss function, and the loss function of the feature extractor comprises the difference between the loss function of the multilingual semantic understanding model and the loss function of the domain classifier.
In some embodiments, the labels of the first training data include an intent category label and a slot label, and the second training data does not include the intent category label and the slot label.
The display device and the semantic understanding method provided by the present application have the following beneficial effects:
According to the display device provided by the embodiment of the application, the feature vector corresponding to the voice instruction is extracted and input into the multilingual semantic understanding model for semantic understanding to obtain a semantic understanding result; the multilingual semantic understanding model supports semantic understanding of feature vectors of multiple languages, the accuracy of semantic understanding is high, and the man-machine interaction experience is improved. According to the semantic understanding method, the semantic understanding model is trained through transfer learning into a multilingual semantic understanding model with multilingual semantic understanding capability; no labeled corpus of the new language is required, which reduces dependence on training data of the new language, solves the cold start problem of the new language, and improves user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that other drawings may be obtained from these drawings by those of ordinary skill in the art.
A system architecture diagram of a speech recognition device according to some embodiments is schematically shown in fig. 1;
a hardware configuration block diagram of a smart device 200 according to some embodiments is illustrated in fig. 2;
a hardware configuration block diagram of a smart device 200 according to some embodiments is illustrated in fig. 3;
a logical architecture schematic of a smart television 200-1 according to some embodiments is illustrated in fig. 4;
a schematic structural diagram of a multilingual semantic understanding model according to some embodiments is exemplarily shown in fig. 5;
a flow diagram of a semantic understanding method according to some embodiments is illustrated in fig. 6;
a flow diagram of a training method for a multilingual semantic understanding model according to some embodiments is schematically shown in fig. 7;
a schematic diagram of a voice interaction flow according to some embodiments is shown schematically in fig. 8;
a schematic diagram of a voice interaction interface according to some embodiments is shown schematically in fig. 9;
a schematic diagram of a voice interaction interface according to some embodiments is shown schematically in fig. 10;
a schematic diagram of a voice interaction interface according to some embodiments is shown schematically in fig. 11;
a schematic diagram of a voice interaction interface according to some embodiments is schematically shown in fig. 12.
Detailed Description
For purposes of clarity and implementation of the present application, exemplary implementations of the present application are described below clearly and completely with reference to the accompanying drawings in which they are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," second, "" third and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for limiting a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
Fig. 1 shows an exemplary system architecture to which the speech recognition method and speech recognition apparatus of the present application may be applied. As shown in fig. 1, 10 is a server and 200 is a terminal device, which exemplarily includes a smart TV 200a, a mobile device 200b, and a smart speaker 200c.
The server 10 and the terminal device 200 in the present application perform data communication through various communication modes. The terminal device 200 may be allowed to make a communication connection through a local area network (LAN), a wireless local area network (WLAN), and other networks. The server 10 may provide various contents and interactions to the terminal device 200. For example, the terminal device 200 and the server 10 can transmit and receive information, and receive software program updates.
The server 10 may be a server providing various services, such as a background server providing support for audio data collected by the terminal device 200. The background server may perform analysis and other processing on the received data such as audio, and feed back the processing result (e.g., endpoint information) to the terminal device. The server 10 may be a server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The terminal device 200 may be hardware or software. When the terminal device 200 is hardware, it may be any of various electronic devices having a sound collection function, including but not limited to a smart speaker, a smart phone, a television, a tablet computer, an e-book reader, a smart watch, a player, a computer, an AI device, a robot, a smart vehicle, and the like. When the terminal device 200 is software, it can be installed in the electronic devices listed above, and may be implemented as a plurality of software components or software modules (e.g., for providing a sound collection service) or as a single software component or software module. This is not particularly limited herein.
In some embodiments, the semantic understanding method provided by the embodiments of the present application may be performed by the server 10.
Fig. 2 shows a block diagram of a hardware configuration of a smart device 200 in accordance with an exemplary embodiment. The smart device 200 as shown in fig. 2 includes at least one of a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller includes a central processing unit, an audio processor, a RAM, a ROM, and first to nth interfaces for input/output.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The smart device 200 may establish transmission and reception of control signals and data signals through the communicator 220 and the server 10.
A user interface operable to receive external control signals.
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The sound collector may be a microphone, which may be used to receive the user's sound and convert the sound signal into an electrical signal. The smart device 200 may be provided with at least one microphone. In other embodiments, the smart device 200 may be provided with two microphones, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the smart device 200 may also be provided with three, four, or more microphones to enable collection of sound signals, noise reduction, identification of sound sources, directional recording, and the like.
In addition, the microphone may be built in the smart device 200, or the microphone may be connected to the smart device 200 by a wired or wireless method. Of course, the location of the microphone on the smart device 200 is not limited in this embodiment of the present application. Alternatively, the smart device 200 may not include a microphone, i.e., the microphone is not provided in the smart device 200. The smart device 200 may be coupled to a microphone (also referred to as a microphone) via an interface such as the USB interface 130. The external microphone may be secured to the smart device 200 by external fasteners such as a camera mount with a clip.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the smart device 200.
Illustratively, the controller includes at least one of a central processing unit (CPU), an audio processor, a RAM (Random Access Memory), a ROM (Read-Only Memory), first to nth interfaces for input/output, a communication bus (Bus), and the like.
In some examples, the operating system of the smart device is an Android system, and as shown in fig. 3, the smart tv 200-1 may be logically divided into an application layer (Applications) 21, a kernel layer 22 and a hardware layer 23.
Wherein, as shown in fig. 3, the hardware layers may include the controller 250, the communicator 220, the detector 230, etc. shown in fig. 2. The application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 includes a voice recognition application that can provide a voice interactive interface and services for enabling connection of the smart television 200-1 with the server 10.
The kernel layer 22 acts as software middleware between the hardware layer and the application layer 21 for managing and controlling hardware and software resources.
In some examples, the kernel layer 22 includes a detector driver for sending voice data collected by the detector 230 to a voice recognition application. Illustratively, the voice recognition application in the smart device 200 is started, and in the case where the smart device 200 establishes a communication connection with the server 10, the detector driver is configured to send the voice data input by the user and collected by the detector 230 to the voice recognition application. The speech recognition application then sends the query information containing the speech data to the intent recognition module 202 in the server. The intent recognition module 202 is used to input voice data sent by the smart device 200 into the intent recognition model.
In order to clearly illustrate the embodiments of the present application, a voice recognition network architecture provided in the embodiments of the present application is described below with reference to fig. 4.
Referring to fig. 4, fig. 4 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present application. In fig. 4, the smart device is configured to receive input information and output a processing result of the information. The voice recognition module is deployed with a voice recognition service for recognizing audio as text; the semantic understanding module is deployed with a semantic understanding service for performing semantic analysis on the text; the business management module is deployed with a business instruction management service for providing business instructions; the language generation module is deployed with a language generation (NLG) service for converting instructions that instruct the smart device to execute into language text; the voice synthesis module is deployed with a voice synthesis (TTS) service for processing the language text corresponding to the instruction and then sending it to a loudspeaker for broadcasting. In one embodiment, there may be multiple entity service devices deployed with different business services in the architecture shown in fig. 4, and one or more entity service devices may also aggregate one or more functional services.
In some embodiments, the following describes an example of a process of processing information input to a smart device based on the architecture shown in fig. 4, where the information input to the smart device is a query sentence input through voice, for example:
[ Speech recognition ]
The intelligent device may perform noise reduction processing and feature extraction on the audio of the query sentence after receiving the query sentence input through voice, where the noise reduction processing may include steps of removing echo and environmental noise.
Semantic understanding
Natural language understanding is performed on the recognized candidate text and the associated context information using acoustic and language models, and the text is parsed into structured, machine-readable information such as business domain, intention, and word slots to express its semantics. The semantic understanding module selects one or more candidate actionable intents based on the determined intent confidence scores.
[ business management ]
The semantic understanding module issues a query instruction to the corresponding business management module according to the semantic analysis result of the text of the query statement, obtains the query result given by the service, performs the actions required to complete the user's final request, and feeds back the device execution instruction corresponding to the query result.
[ language Generation ]
Natural Language Generation (NLG) is configured to generate information or instructions into language text. Dialogues can be divided into chit-chat, task-oriented, knowledge question-answering, and recommendation types. In a chit-chat dialogue, the NLG performs intention recognition, sentiment analysis, and the like according to the context, and then generates an open-ended reply; in a task-oriented dialogue, the dialogue reply is generated according to the learned policy, and typical replies include requests for clarification, user guidance, inquiries, confirmations, dialogue-ending phrases, and the like; in a knowledge question-answering dialogue, the knowledge required by the user (knowledge, entities, fragments, etc.) is generated according to question type recognition and classification, information retrieval, or text matching; and in a recommendation dialogue system, interest matching and ranking of candidate recommended content are performed according to the user's preferences, and then the recommended content for the user is generated.
[ Speech Synthesis ]
Speech synthesis is configured to present a speech output to the user. The speech synthesis processing module synthesizes a speech output based on the text provided by the digital assistant. For example, the generated dialogue response is in the form of a text string, and the speech synthesis module converts the text string into an audible speech output.
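By way of example, a minimal Python orchestration sketch of the above flow is given below. The stage functions and their return values are simple illustrative stubs assumed for this sketch, not interfaces defined by the present application; in a real deployment each stage would call the corresponding voice recognition, semantic understanding, business management, language generation, and speech synthesis service.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Semantics:
    domain: str                                        # business field
    intent: str                                        # intention category
    slots: Dict[str, str] = field(default_factory=dict)  # word slots

def recognize(audio: bytes) -> str:                    # [Speech recognition] stub
    return "turn up the volume"

def parse(text: str) -> Semantics:                     # [Semantic understanding] stub
    return Semantics(domain="tv_control", intent="volume_up")

def execute(semantics: Semantics) -> str:              # [Business management] stub
    return "volume set to 60 percent"

def generate(result: str) -> str:                      # [Language generation] stub
    return f"OK, {result}."

def synthesize(text: str) -> bytes:                    # [Speech synthesis] stub
    return text.encode("utf-8")                        # placeholder for TTS audio

def handle_voice_query(audio: bytes) -> bytes:
    """End-to-end flow matching the five stages described above."""
    return synthesize(generate(execute(parse(recognize(audio)))))
```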
It should be noted that the architecture shown in fig. 4 is only an example, and is not intended to limit the scope of the present application. Other architectures may also be employed to achieve similar functionality in embodiments of the present application, for example: all or part of the above processes may be completed by the intelligent terminal, and will not be described herein.
In some embodiments, in order to meet the voice interaction requirements of users who use different languages, semantic understanding models can be trained separately with labeled corpora of different languages to obtain semantic understanding models for multiple languages; after a voice instruction of a user is received, the semantic understanding model of the corresponding language is selected according to the language of the voice instruction to perform semantic understanding on the user's instruction, and a semantic understanding result is output. However, collecting a large number of labeled corpora in different languages is costly, labeling quality cannot be guaranteed, and the resulting pre-trained language model occupies more resources; if less labeled corpus is used to train the semantic understanding model, the accuracy of the semantic understanding results output by the trained model is low; if the user's voice instruction is first machine translated and then semantically understood, the translation accuracy of some spoken expressions is very low, errors accumulate and propagate, and the final voice interaction experience is poor.
In order to solve the problem of multilingual semantic understanding, the embodiment of the application provides a semantic understanding method based on transfer learning. Through transfer learning, an existing semantic understanding model for one language can be transferred to a newly expanded language, so that accurate semantic understanding of the new language can be realized. This avoids the problems of difficult corpus labeling, machine translation error propagation, and bulky pre-trained language models that are difficult to deploy, and better solves the cold start problem when a new language goes online.
Referring to fig. 5, a schematic structural diagram of a multilingual semantic understanding model provided in an embodiment of the present application is shown. As shown in fig. 5, the multilingual semantic understanding model provided in the embodiment of the present application includes a Feature Extractor (generator), a Label Predictor, and a Domain Classifier (discriminator), where the Label Predictor and the Domain Classifier are respectively connected to the Feature Extractor.
In some embodiments, during the training phase of the multilingual semantic understanding model, the input data of the feature extractor may include a src monolingual corpus (source language), such as training data in Chinese, and a tgt monolingual corpus (target language), such as training data in English, with the training data in the source language alternating with the training data in the target language. The training data of the source language is the first training data, which comprises a labeled corpus in the source language; the labels of the first training data may include two types: the first type is intention category labels used for representing intention categories, such as a media resource search category and a television control category, and the second type is slot labels comprising slots and their corresponding position labels. The training data of the target language is the second training data, which comprises an unlabeled corpus in the target language and does not include the above intention category labels and slot labels.
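By way of example, the difference between the two kinds of training data may be illustrated with the following Python snippet; the field names, the intention category label, and the BIO-style slot labels are assumptions made only for illustration.

```python
# Source-language sample (first training data): carries an intention category
# label and per-token slot labels (a BIO-style tagging scheme is assumed here).
first_training_sample = {
    "text": "播放 泰坦尼克号",          # "play Titanic" (Chinese source corpus)
    "intent": "media_search",           # intention category label
    "slots": ["O", "B-media_name"],     # slot label for each token
}

# Target-language sample (second training data): raw text only, no labels.
second_training_sample = {
    "text": "play some jazz music",
}
```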
In some embodiments, during the application phase of the multilingual semantic understanding model, the input data of the feature extractor may be input data of a target language, which is data that does not include the above-described intent category label and slot label.
The feature extractor is used for extracting features of the input data to generate feature vectors, and may include a Cross-Lingual Word Embedding module and a Shared Encoder module. The feature vector that the feature extractor generates from training data of the source language is the first feature vector, which needs to support the label predictor well in the intention recognition and slot extraction tasks; the feature vector that the feature extractor generates from training data of the target language is the second feature vector, which needs to deceive the domain classifier well, so that the domain classifier cannot distinguish whether its input feature vector comes from training data of the source language or training data of the target language.
By way of example, the feature extractor may employ a pre-trained language model based on the Transformer architecture, such as the BERT model.
In some embodiments, the label predictor is a multi-task learning module for performing intention understanding and slot filling multi-task learning on the feature vectors output by the feature extractor, and outputs intention category labels and slot labels. The label predictor may include Intent Classification (an intention classifier) for intention understanding and Slot Filling (a slot label classifier) for slot filling.
In some embodiments, the domain classifier is equivalent to a discriminator in GAN (Generative Adversarial Network) and is used to distinguish whether the feature vector is derived from training data in the source language or training data in the target language, i.e., to determine whether the feature vector belongs to a source domain or a target domain, and if the feature vector is determined to be derived from the source domain, output class 0, and if the feature vector is determined to be derived from the target domain, output class 1.
In some embodiments, the network parameters of the label predictor are θ_p and its loss is L. The loss function of the label predictor may take different forms for different downstream tasks. For example, if the set intention categories include only two categories, such as a media resource search category and a television control category, the label prediction task performed by the label predictor is a binary classification task and the loss function may be a squared difference loss function; if the set intention categories include multiple categories, such as a media resource search category, a television control category, a vehicle identification category, a calorie calculation category, an animal identification category, a plant identification category, a face recognition category, and the like, the label prediction task is a multi-class classification task and the loss function may be a cross entropy loss function. Illustratively, the label prediction task performed by the label predictor is a multi-class classification task, and the loss function of the label predictor is the cross entropy loss function. The network parameters θ_p are optimized to make the loss L of the label predictor smaller; the smaller the loss L, the more accurate the intention category label and slot label predictions output by the label predictor. The loss L may be computed in the standard cross entropy form, for example L = -Σ_i y_i·log(ŷ_i), where y_i is the true label of the i-th class and ŷ_i is the predicted probability of the i-th class.
in some embodiments, the network parameter of the domain classifier is θ d Loss of L d . The task of the domain classifier is to correctly distinguish between source and target data, and therefore, the network parameters θ of the domain classifier d So that loss L d The smaller the field classifier, the higher the accuracy of distinguishing training data in the source language from training data in the target language. Because the domain classification task performed by the domain classifier is a classification task, the loss function of the domain classifier can use the square difference lossFunction, loss L d The calculation formula of (2) is as follows:
in some embodiments, the network parameter of the feature extractor is θ f It is necessary to combine the above losses L and L d The purpose of generating extremely similar feature vectors aiming at training data of a target language and training data of the source language is achieved so as to achieve the purpose of deception field classifier, therefore, the loss function of the feature extractor can be the difference function of the loss function of the label predictor and the loss function of the field classifier, and the network parameter theta of the feature extractor f The calculation formula of (2) is as follows:
based on the multilingual speech semantic understanding model, the embodiment of the present application provides a semantic understanding method, referring to fig. 6, which may include the following steps:
step S101: first training data in a source language and second training data in a target language are collected.
In some embodiments, a plurality of source language dialogue data with labels may be collected as first training data for a source language, and a plurality of target language dialogue data without labels may be collected as second training data for a target language, wherein the labels may include intent tags and slot tags.
Step S102: and carrying out model training on the multilingual speech sense understanding model through the first training data and the second training data alternately.
In some embodiments, training data in the source language and training data in the target language may be alternately input into the multilingual semantic understanding model, and model training is repeated to continuously optimize the network parameters θ_f of the feature extractor, the network parameters θ_p of the label predictor, and the network parameters θ_d of the domain classifier. Model training may be stopped when the set number of training rounds is reached, or when the loss L and the loss L_d are both smaller than their corresponding set thresholds.
Step S103: and inputting the user question of the target language into the multilingual language semantic understanding model to obtain the semantic understanding result of the data to be measured.
In some embodiments, after the multilingual semantic understanding model is trained through the steps, the multilingual semantic understanding model has semantic understanding capabilities of both the source language and the target language, at this time, a user question in the target language is input into the multilingual semantic understanding model, a feature extractor is called to extract features in the user question, feature vectors are generated, the feature vectors are simultaneously input into an intention classifier and a slot label classifier in a label predictor, semantic understanding results containing intention category labels and slot labels are obtained, and semantic understanding of the user question is realized.
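By way of example, a minimal inference sketch continuing the model sketch above is given below; the tokenize helper (mapping a question to token ids) and the trained module instances are assumed for illustration.

```python
import torch

def understand(question: str):
    """Semantic understanding of a target-language user question."""
    token_ids = tokenize(question)                   # assumed helper -> (1, seq_len)
    with torch.no_grad():
        features = feature_extractor(token_ids)      # feature vector of the question
        intent_logits, slot_logits = label_predictor(features)
    intent_id = intent_logits.argmax(dim=-1).item()            # intention category
    slot_ids = slot_logits.argmax(dim=-1).squeeze(0).tolist()  # slot label per token
    return intent_id, slot_ids

# e.g. understand("play Titanic") could yield the "media_search" intention
# and a "media_name" slot covering "Titanic".
```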
To further explain the semantic understanding method provided by the embodiment of the present application, the training method of the multilingual semantic understanding model is described below.
Referring to fig. 7, a flow diagram of a training method for the multilingual semantic understanding model according to some embodiments is shown. As shown in fig. 7, the method may include the following steps:
step S201: and extracting the characteristics of the first training data through a characteristic extractor to generate a first characteristic vector.
In some embodiments, for the first training data of the source language, a feature vector of the first training data may be first generated by a feature extractor, where the feature vector is the first feature vector.
Step S202: and respectively carrying out intention understanding and slot filling on the first feature vector through the multilingual language meaning understanding model to obtain a first meaning understanding result.
In some embodiments, the intent understanding and slot filling multi-task learning is performed on the first feature vector by a tag predictor to obtain a first semantic understanding result, wherein the first semantic understanding result comprises an intent category tag and a slot tag.
Step S203: and fitting the first training data according to the first semantic understanding result, and updating the multilingual semantic understanding model and the feature extractor.
In some embodiments, the first semantic understanding result is fitted to the first training data to update the network parameters θ_f of the feature extractor and the network parameters θ_p of the label predictor. Here, fitting the training data means comparing the intention category label output by the label predictor with the intention category label in the first training data, comparing the slot label output by the label predictor with the slot label in the first training data, and correcting the network parameters θ_f of the feature extractor and the network parameters θ_p of the label predictor according to the comparison results.
Step S204: and extracting the characteristics of the second training data through the characteristic extractor to generate a second characteristic vector.
In some embodiments, for training data of a target language, a feature vector of the training data may be first generated by a feature extractor, the feature vector being a second feature vector.
Step S205: and performing domain judgment on the second feature vector through a domain classifier, and updating the domain classifier and the feature extractor according to a judgment result.
In some embodiments, the second feature vector is input into the domain classifier to perform discriminative training on the domain classifier, and the network parameters θ_f of the feature extractor and the network parameters θ_d of the domain classifier are updated according to the output result of the domain classifier.
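By way of example, a PyTorch sketch of the alternating training loop in steps S201 to S205, continuing the model sketch above, is given below. The optimizers, the source_loader and target_loader data loaders, and the exact update order are assumptions made for illustration rather than a definitive implementation.

```python
import torch

opt_fp = torch.optim.Adam(
    list(feature_extractor.parameters()) + list(label_predictor.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(domain_classifier.parameters(), lr=1e-4)
opt_f = torch.optim.Adam(feature_extractor.parameters(), lr=1e-4)

for src_batch, tgt_batch in zip(source_loader, target_loader):  # alternating batches
    # S201-S203: fit the labeled source data, update theta_f and theta_p.
    feats_src = feature_extractor(src_batch["tokens"])
    intent_logits, slot_logits = label_predictor(feats_src)
    loss_p = ce_loss(intent_logits, src_batch["intent"]) \
           + ce_loss(slot_logits.transpose(1, 2), src_batch["slots"])
    opt_fp.zero_grad(); loss_p.backward(); opt_fp.step()

    # S204-S205: domain discrimination on target data, update theta_d only.
    feats_tgt = feature_extractor(tgt_batch["tokens"])
    d_pred = domain_classifier(feats_tgt.detach())
    loss_d = mse_loss(d_pred, torch.ones_like(d_pred))   # target domain label = 1
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Adversarial update of the feature extractor (the L - L_d objective):
    # push target features toward being indistinguishable from source features.
    d_adv = domain_classifier(feature_extractor(tgt_batch["tokens"]))
    loss_f = -mse_loss(d_adv, torch.ones_like(d_adv))
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```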
After the multilingual semantic understanding model is trained for a plurality of times based on the method shown in fig. 7, the multilingual semantic understanding model can have the semantic understanding capability of the target language and the source language at the same time, and can accurately carry out semantic understanding on user questions in the source language and the target language.
In some embodiments, after the multilingual semantic understanding model has acquired Chinese and English semantic understanding capability based on the method shown in fig. 7, the multilingual semantic understanding model may be further trained by taking Chinese or English as the source language, a new language such as French as the target language, a labeled Chinese or English corpus as the first training data, and unlabeled French data as the second training data, so that the multilingual semantic understanding model acquires French semantic understanding capability.
After the multilingual semantic understanding model is trained to have semantic understanding capability supporting multiple languages, the flow of voice interaction between a user and the display device can be seen in fig. 8, which includes the following steps:
step S301: and receiving a voice instruction input by a user.
In some embodiments, after the user inputs a wake-up word to the display device to wake up the voice assistant of the voice recognition application, a voice command may be input to the display device, where the voice command may be a user command such as a media search command or a television control command.
In some embodiments, the user may also input voice instructions to the display device through the control means of the display device, for example, after pressing voice keys on a remote control.
Step S302: and carrying out voice recognition on the voice command to obtain a user question.
In some embodiments, after receiving a voice command input by a user, the display device may perform voice recognition on the voice command to obtain a text corresponding to the voice command, where the text may be referred to as a user question.
Step S303: and extracting the characteristics of the user question to obtain the characteristic vector of the user question.
In some embodiments, the display device may input the user question into a feature extractor of the multilingual semantic understanding model, through which the user question is feature extracted, outputting a feature vector. The multilingual speech semantic understanding model may be integrated into a client of the speech recognition application or may be disposed on a server, which is not particularly limited in the embodiment of the present application.
Step S304: and acquiring a semantic understanding result of the feature vector through a multilingual semantic understanding model.
In some embodiments, after obtaining the feature vector of the user question, the feature vector may be input into the intention classifier and the slot label classifier in the label predictor, respectively, to obtain the semantic understanding result corresponding to the feature vector and containing the intention label and the slot label.
Step S305: and generating response data of the user question according to the semantic understanding result.
In some embodiments, after the semantic understanding result is obtained, the user question may be processed through a corresponding service module based on the intention category, so as to generate response data of the user question.
For example, if the intention category is movie search, the media data corresponding to the slot label is queried in the media search library according to the slot label, media search data is generated according to the queried media data, and response data is generated according to the media search data and the default reply data corresponding to the movie search intention category, where the default reply data may be a prompt in the corresponding language, such as: "find the following media for you".
For example, if the intention category is television control and the slot label indicates increasing the volume, a volume adjustment instruction is generated, and the television is controlled to increase the volume according to the volume adjustment instruction.
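By way of example, a minimal sketch of such intention-based response generation is given below; the intention names, slot keys, and the media_library helper are illustrative assumptions.

```python
def generate_response(intent: str, slots: dict) -> dict:
    """Dispatch the semantic understanding result to the matching business logic."""
    if intent == "media_search":
        media = media_library.search(slots.get("media_name", ""))  # assumed service
        return {"action": "show_results", "data": media,
                "reply": "Find the following media for you"}
    if intent == "tv_control" and slots.get("volume") == "up":
        return {"action": "adjust_volume", "delta": +10, "reply": "Volume increased"}
    return {"action": "none", "reply": "Sorry, I did not understand that."}
```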
Step S306: responding according to the response data.
In some embodiments, after obtaining the response data corresponding to the question of the user, the display device may execute a corresponding action according to the response data, so as to complete the voice interaction process.
Referring to fig. 9-12, schematic views of scenarios in which a user performs voice interaction with a display device according to some embodiments are shown. As shown in fig. 9, the voice instruction input by the user to the display device may be an English instruction indicating that the volume should be increased; the display device or the server performs semantic understanding through the multilingual semantic understanding model according to the text corresponding to the English instruction, and generates response data according to the semantic understanding result. Referring to fig. 10, the display device increases the volume according to the response data and feeds back the volume increase result.
Referring to fig. 11, the user may further input a Chinese instruction indicating that the volume should be reduced to the display device; the display device or the server performs semantic understanding through the multilingual semantic understanding model according to the text corresponding to the Chinese instruction, and generates response data according to the semantic understanding result. Referring to fig. 12, the display device reduces the volume according to the response data and feeds back the volume reduction result.
In some embodiments, after the display device converts the voice command into the text, the language code corresponding to the voice command may be stored, and after the response data of the voice command is received, the text in the response data may be converted into the language according to the language code, so that the user may receive the feedback content in the same language as the voice command, and the voice interaction experience is improved. For example, if the voice command of the user is a chinese command, the language of the feedback content of the voice command by the display device is also chinese, and if the voice command of the user is an english command, the language of the feedback content of the voice command by the display device is also english.
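By way of example, a minimal sketch of keeping the feedback text in the same language as the voice instruction is given below; the detect_language and translate helpers are assumed for illustration and are not interfaces defined by the present application.

```python
def localize_reply(command_text: str, response: dict) -> str:
    """Return the reply text in the same language as the user's voice command."""
    lang_code = detect_language(command_text)        # e.g. "zh" or "en", stored per command
    reply = response["reply"]
    if detect_language(reply) != lang_code:
        reply = translate(reply, target=lang_code)   # convert feedback to the user's language
    return reply
```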
According to the above embodiments, through transfer learning, the semantic understanding model is trained into a multilingual semantic understanding model with multilingual semantic understanding capability. No labeled corpus of the new language is required, which reduces dependence on training data of the new language, solves the cold start problem of the new language, and improves user experience.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
a display;
a controller, in communication with the display, configured to:
receiving a voice instruction input by a user;
performing voice recognition on the voice command to obtain a user question;
extracting features of the user question to obtain a feature vector of the user question;
acquiring a semantic understanding result of the feature vector through a multilingual semantic understanding model;
generating response data of the user question according to the semantic understanding result;
responding according to the response data.
2. The display device of claim 1, wherein the controller is further configured to:
collecting first training data of a source language and second training data of a target language;
and alternately performing model training on the multilingual semantic understanding model with the first training data and the second training data.
3. The display device of claim 2, wherein alternately performing model training on the multilingual semantic understanding model with the first training data and the second training data comprises:
extracting features of the first training data through a feature extractor to generate a first feature vector;
respectively performing intention understanding and slot filling on the first feature vector through the multilingual semantic understanding model to obtain a first semantic understanding result;
fitting the first training data according to the first semantic understanding result, and updating the multilingual semantic understanding model and the feature extractor;
performing feature extraction on the second training data through the feature extractor to generate a second feature vector;
and performing domain judgment on the second feature vector through a domain classifier, and updating the domain classifier and the feature extractor according to a judgment result.
4. The display device of claim 3, wherein the loss function of the multilingual semantic understanding model comprises a cross entropy loss function, the loss function of the domain classifier comprises a squared difference loss function, and the loss function of the feature extractor comprises the difference between the loss function of the multilingual semantic understanding model and the loss function of the domain classifier.
5. The display device of claim 2, wherein the first training data includes an intent category label and a slot label and the second training data does not include the intent category label and the slot label.
6. The display device of claim 3, wherein the multilingual semantic understanding model includes an intention classifier and a slot label classifier, and wherein respectively performing intention understanding and slot filling on the first feature vector through the multilingual semantic understanding model to obtain a first semantic understanding result comprises:
carrying out intention understanding on the first feature vector through the intention classifier to obtain an intention category label;
and performing slot filling on the first feature vector through the slot label classifier to obtain slot labels, wherein the semantic understanding result of the first feature vector comprises the intention category label and the slot labels.
7. A semantic understanding method, comprising:
collecting labeled first training data in a source language and unlabeled second training data in a target language;
alternately performing model training on a multilingual semantic understanding model with the training data of the source language and the training data of the target language;
and inputting a user question in the target language into the multilingual semantic understanding model to obtain a semantic understanding result of the user question.
8. The semantic understanding method according to claim 7, wherein alternately performing model training on the multilingual semantic understanding model with the first training data and the second training data comprises:
extracting features of the first training data through a feature extractor to generate a first feature vector;
respectively performing intention understanding and slot filling on the first feature vector through the multilingual semantic understanding model to obtain a first semantic understanding result;
fitting the first training data according to the first semantic understanding result, and updating the multilingual semantic understanding model and the feature extractor;
performing feature extraction on the second training data through the feature extractor to generate a second feature vector;
and performing domain judgment on the second feature vector through a domain classifier, and updating the domain classifier and the feature extractor according to a judgment result.
9. The semantic understanding method according to claim 8, wherein the loss function of the multilingual semantic understanding model comprises a cross entropy loss function, the loss function of the domain classifier comprises a squared difference loss function, and the loss function of the feature extractor comprises the difference between the loss function of the multilingual semantic understanding model and the loss function of the domain classifier.
10. The semantic understanding method according to claim 7, wherein the labels of the first training data include an intent category label and a slot label, and the second training data does not include the intent category label and the slot label.
CN202310246562.3A 2023-03-14 2023-03-14 Display device and semantic understanding method Pending CN117806454A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310246562.3A CN117806454A (en) 2023-03-14 2023-03-14 Display device and semantic understanding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310246562.3A CN117806454A (en) 2023-03-14 2023-03-14 Display device and semantic understanding method

Publications (1)

Publication Number Publication Date
CN117806454A true CN117806454A (en) 2024-04-02

Family

ID=90422369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310246562.3A Pending CN117806454A (en) 2023-03-14 2023-03-14 Display device and semantic understanding method

Country Status (1)

Country Link
CN (1) CN117806454A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination