CN115602167A - Display device and voice recognition method


Info

Publication number
CN115602167A
Authority
CN
China
Prior art keywords: entity, target, translated, sentence, voice
Legal status: Pending
Application number: CN202211203844.7A
Other languages: Chinese (zh)
Inventor
Name withheld at the inventor's request
曹晚霞
Current Assignee: Vidaa Netherlands International Holdings BV
Original Assignee: Vidaa Netherlands International Holdings BV
Application filed by Vidaa Netherlands International Holdings BV
Priority to CN202211203844.7A
Publication of CN115602167A

Classifications

    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/005: Speech recognition; language recognition
    • G10L 15/26: Speech recognition; speech to text systems
    • G10L 2015/223: Execution procedure of a spoken command
    • H04N 21/4122: Peripherals receiving signals from specially adapted client devices; additional display device, e.g. video projector
    • H04N 21/42203: Input-only peripherals; sound input device, e.g. microphone
    • H04N 21/4222: Remote control device emulator integrated into a non-television apparatus, e.g. a PDA, media center or smart toy
    • H04N 21/42222: Additional components integrated in the remote control device, e.g. timer, speaker, sensors, microphone or battery charging device

Abstract

Some embodiments of the present application provide a display device and a voice recognition method. The display device obtains a sentence to be translated based on an entity to be translated and a pre-stored sentence template, and translates the sentence to be translated into a target sentence in a preset language. Based on the target sentence, the display device obtains a target entity corresponding to the entity to be translated, and generates a speech recognition model from the target entity and the sentence template. For collected user voice, the display device recognizes the voice based on the speech recognition model, obtains the corresponding control instruction, and executes it. For user voices in different languages, the display device can thus translate and recognize by itself rather than relying on literal translation, so that the user's voice is recognized accurately and the user experience is improved.

Description

Display device and voice recognition method
Technical Field
The present application relates to the field of display device technologies, and in particular, to a display device and a voice recognition method.
Background
A display device is a terminal device capable of outputting specific display pictures. With the rapid development of display devices, their functions are becoming richer and their performance more powerful. They support bidirectional human-computer interaction and integrate audio/video, entertainment, data and other functions to meet users' diversified and personalized needs. A user can also use the voice recognition function of the display device to control it by voice.
When the user controls the display device by voice, the display device needs to recognize the user's voice to determine the user's control instruction. Generally, corpus information corresponding to the local language, such as sentence templates and word entities, may be stored in the display device. Based on this corpus information, voices in the local language can be recognized. However, the user may need to control the display device in other languages. In that case, the display device may translate the stored corpus information into the language currently used by the user and recognize the user's voice using the translated corpus information.
However, when translating the corpus information, the display device usually performs a literal translation, and the result of a literal translation may deviate from the original meaning of the corpus information. The translation result is then inaccurate, the user's voice cannot be recognized correctly, and the user experience is seriously affected.
Disclosure of Invention
Some embodiments of the present application provide a display device and a voice recognition method. They address the problem in the related art that corpus information translated literally yields inaccurate translation results, so that the user's voice cannot be recognized accurately and the user experience is seriously affected.
In a first aspect, some embodiments of the present application provide a display device including a display, a sound collector, and a controller. Wherein the sound collector is configured to collect voice input by a user; the controller is configured to perform the steps of:
obtaining a sentence to be translated based on an entity to be translated and a sentence template pre-stored in the display device;
translating the sentence to be translated into a target sentence in a preset language;
acquiring a target entity corresponding to the entity to be translated based on the target sentence;
generating a speech recognition model based on the target entity and the pre-stored sentence template;
and in response to user voice collected by the sound collector, recognizing the user voice based on the speech recognition model, obtaining a control instruction corresponding to the user voice, and executing the control instruction.
In a second aspect, some embodiments of the present application provide a speech recognition method, applied to a display device, including:
obtaining a sentence to be translated based on an entity to be translated and a sentence template pre-stored in the display device;
translating the sentence to be translated into a target sentence in a preset language;
acquiring a target entity corresponding to the entity to be translated based on the target sentence;
generating a speech recognition model based on the target entity and the pre-stored sentence template;
and in response to user voice collected by the sound collector, recognizing the user voice based on the speech recognition model, obtaining a control instruction corresponding to the user voice, and executing the control instruction.
It can be seen from the above technical solutions that some embodiments of the present application provide a display device and a voice recognition method. The display device obtains a sentence to be translated based on an entity to be translated and a pre-stored sentence template, and translates the sentence to be translated into a target sentence in a preset language. Based on the target sentence, the display device obtains a target entity corresponding to the entity to be translated, and generates a speech recognition model from the target entity and the sentence template. For collected user voice, the display device recognizes the voice based on the speech recognition model, obtains the corresponding control instruction, and executes it. For user voices in different languages, the display device can translate and recognize by itself rather than relying on literal translation, so that the user's voice is recognized accurately and the user experience is improved.
Drawings
In order to describe the technical solution of the present application more clearly, the drawings required by the embodiments are briefly introduced below; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 illustrates a usage scenario of a display device according to some embodiments;
FIG. 2 illustrates a block diagram of a hardware configuration of a control apparatus according to some embodiments;
FIG. 3 illustrates a block diagram of a hardware configuration of a display device according to some embodiments;
FIG. 4 illustrates a diagram of a software configuration in a display device according to some embodiments;
FIG. 5 shows a schematic diagram of an application panel in some embodiments;
FIG. 6 is a diagram illustrating a voice interaction network architecture of a display device in some embodiments;
FIG. 7 illustrates a schematic diagram of a system settings UI interface in some embodiments;
FIG. 8 is a diagram illustrating the display of speech recognition mode confirmation information in the display in some embodiments;
FIG. 9 illustrates an interaction flow diagram for components of a display device in some embodiments;
FIG. 10 is a schematic diagram that illustrates a language selection interface in some embodiments;
FIG. 11 is a diagram illustrating a scenario in which a user interacts with a display device in some embodiments;
FIG. 12 is a schematic diagram that illustrates a display device displaying a search interface in some embodiments;
FIG. 13 is a schematic diagram that illustrates a hint information in some embodiments;
FIG. 14 illustrates a flow diagram of a speech recognition method in some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the exemplary embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described exemplary embodiments are only a part of the embodiments of the present application, not all of them.
It should be noted that the brief descriptions of the terms in the present application are only for convenience of understanding of the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the foregoing drawings are used for distinguishing between similar or analogous objects or entities and are not necessarily intended to limit the order or sequence in which they are presented unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to all of the elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
The display device provided by the embodiment of the present application may have various implementation forms, and for example, the display device may be a television, a smart television, a laser projection device, a display (monitor), an electronic whiteboard (electronic whiteboard), an electronic desktop (electronic table), and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through the smart device 300 or the control device 100.
In some embodiments, the control device 100 may be a remote controller. Communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication, or other short-distance communication methods, and the display device 200 is controlled wirelessly or by wire. The user may input user instructions through keys on the remote controller, voice input, control panel input, etc., to control the display apparatus 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device.
In some embodiments, the display device may not receive instructions using the smart device or control device described above, but may receive user control through touch or gestures, or the like.
In some embodiments, the display device 200 may also be controlled in a manner other than the control apparatus 100 and the smart device 300, for example, the voice command control of the user may be directly received by a module configured inside the display device 200 to obtain a voice command, or may be received by a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display apparatus 200. The server 400 may be one cluster or a plurality of clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, serving as an interaction intermediary between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller comprises a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, a first interface to an nth interface for input/output.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display. It receives image signals output from the controller and displays video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, and a projection display, and may also be a projection device and a projection screen.
The communicator 220 is a component for communicating with an external device or a server according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals with the external control apparatus 100 or the server 400 through the communicator 220.
A user interface is provided for receiving control signals from the control apparatus 100 (e.g., an infrared remote control).
The detector 230 is used to collect signals of an external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which can be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. Or may be a composite input/output interface formed by the plurality of interfaces.
The tuner demodulator 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals, as well as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the tuner demodulator 210 may be located in separate devices, that is, the tuner demodulator 210 may be located in a device external to the main device containing the controller 250, such as an external set-top box.
The controller 250 controls the operation of the display device and responds to the user's operation through various software control programs stored in the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command for selecting a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the controller includes at least one of a Central Processing Unit (CPU), a video processor, an audio processor, a Graphics Processing Unit (GPU), a Random Access Memory (RAM), a Read-Only Memory (ROM), first to nth interfaces for input/output, a communication bus, and the like.
A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
A "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables the conversion of the internal form of information to a form acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4, the system of the display device is divided into three layers, i.e., an application layer, a middleware layer and a hardware layer from top to bottom.
The application layer mainly includes the common applications on the television and an Application Framework. The common applications are mainly applications developed based on the browser, such as HTML5 apps, and native apps.
The Application Framework is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, and the interfaces to use these functions (toolbars, status lists, menus, dialog boxes).
Native apps may support online or offline operation, message push, and local resource access.
The middleware layer comprises various television protocols, multimedia protocols, system components and other middleware. The middleware can use basic service (function) provided by system software to connect each part of an application system or different applications on a network, and can achieve the purposes of resource sharing and function sharing.
The hardware layer mainly comprises a HAL interface, hardware, and drivers. The HAL interface is a unified interface for adapting all television chips; the specific logic is implemented by each chip. The drivers mainly include: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (such as fingerprint sensor, temperature sensor, pressure sensor, etc.), power driver, and so on.
In some embodiments of the present application, a user interface may be displayed on display 260. The user interface may be a specific target image, such as various media assets acquired from a network signal source, including video, pictures, and other content. The user interface may also be some UI interface of the display device, such as a system recommendation page or the like.
The display device may have various functions such as playing media, entertainment games, video chatting, etc., thereby providing users with various services.
In some embodiments, the controller 250 may control the display 260 to display the user interface when the user powers on the display device. A "My applications" control may be included in the user interface. The user may click the "My applications" control to input a display instruction for the application panel page and trigger entry into the corresponding application panel. It should be noted that the user may also input a selection operation on the control in other ways to trigger entry into the application panel, for example, entering the application panel page using a voice control function, a search function, or the like.
The user can view the applications that the display device has installed, i.e., the functions supported by the display device, through the application panel. The user can select one of the application programs and open the application program to realize the functions of the application. It should be noted that the application installed on the display device may be a system application or a third-party application. The user controls the display device to realize the corresponding function of the application program by opening the application program. FIG. 5 illustrates a schematic diagram of an application panel in some embodiments. As shown in fig. 5, three controls of "player", "cable tv", and "video chat" are included in the application panel. Wherein, the user can control the display device to open the player application by clicking the 'player' control. The user can perform corresponding operations in the player, such as searching for media assets and the like. The user may click on the "cable" control to view a number of asset channels using the display device, including various asset programs offered by the cable provider. The user can click the video chat control, so that the video chat is carried out by utilizing the display device.
A user may input an instruction to the display apparatus using a control device, such as a remote controller, a mobile terminal, or the like, to control the display apparatus to implement various functions. The user may control the focus in the display to move using the control device to select a different control to open the control. The user can also use the control device to input some text to the display device, for example, the name of the medium resource can be input when searching the medium resource.
In some embodiments, the display device has a voice recognition function to allow the user to input control instructions to the display device by using voice input to realize voice interaction, considering the user experience.
FIG. 6 illustrates a voice interaction network architecture diagram of a display device in some embodiments. As shown in fig. 6, the display device 200 is used to receive input information such as sound and to output a processing result of the information. The voice recognition module is deployed with an Automatic Speech Recognition (ASR) service for recognizing audio as text; the semantic understanding module is deployed with a Natural Language Understanding (NLU) service for performing semantic parsing on the text; the business management module is deployed with a business instruction management service, such as Dialog Management (DM), for providing business instructions; the language generation module is deployed with a Natural Language Generation (NLG) service for converting the instruction executed by the display device into text; the voice synthesis module is deployed with a Text To Speech (TTS) service for processing the text corresponding to the instruction and then sending it to a loudspeaker for broadcasting. The voice interaction network architecture may contain a plurality of entity service devices deployed with different business services, and one or more function services may also be aggregated in one or more entity service devices.
In some embodiments, the following describes an example of a process for processing information input to the display device 200 based on the architecture shown in fig. 6, taking the information input to the display device 200 as an example of a query statement input by voice:
and (3) voice recognition: after the display device 200 receives the query sentence input by voice, the display device 200 may perform noise reduction processing and feature extraction on the audio of the query sentence, where the noise reduction processing may include removing echo and ambient noise.
Semantic understanding: natural language understanding of the identified candidate text and associated contextual information. The text is parsed into structured, machine-readable information, business areas, intents, word slots, etc. information to express semantics, etc. resulting in an actionable intent determination intent confidence score, and the semantic understanding module selects one or more candidate actionable intents based on the determined intent confidence score.
Service management: the semantic understanding module issues a query instruction to the corresponding business management module according to the semantic parsing result of the query sentence's text, obtains the query result given by the business service, executes the action finally requested by the user, and feeds back the device execution instruction corresponding to the query result.
Language generation: configured to generate the information or instructions into language text. The method can be divided into a chatting type, a task type, a knowledge question and answer type and a recommendation type. The NLG in the chatting dialogue carries out intention recognition, emotion analysis and the like according to context and then generates an open reply; in the task type conversation, a conversation reply is generated according to a learned strategy, and the general reply comprises clarification requirements, user guidance, inquiry, confirmation, a conversation end language and the like; knowledge (knowledge, entities, fragments and the like) required by a user is generated in a knowledge question-and-answer type dialogue according to question sentence type identification and classification, information retrieval or text matching; and in the recommendation type dialogue system, interest matching and candidate recommended content sorting are carried out according to the preference of the user, and then the recommended content is generated for the user.
Voice synthesis: a speech output configured to be presented to the user. The speech synthesis module synthesizes a speech output based on the text provided by the digital assistant. For example, the generated dialog response is in the form of a text string, and the speech synthesis module converts the text string into an audible speech output.
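As a rough illustration of how these modules chain together, the following Python sketch wires stubbed ASR, NLU, dialog management, NLG and TTS services into one pipeline. All function and data names here are placeholders invented for this example and are not taken from the patent or from any particular product.

    # Minimal sketch of the ASR -> NLU -> DM -> NLG -> TTS voice interaction chain.
    # Every function below is a stub standing in for a real deployed service.

    def asr(audio: bytes) -> str:
        """Voice recognition: recognize audio as text (stub)."""
        return "wake me up at twelve o'clock tomorrow"

    def nlu(text: str) -> dict:
        """Semantic understanding: parse text into domain / intent / word slots (stub)."""
        return {"domain": "alarm", "intent": "set_alarm",
                "slots": {"StartDate": "tomorrow", "StartTime": "twelve o'clock"}}

    def dialog_management(semantics: dict) -> dict:
        """Business management: map the parsed intent to a device instruction (stub)."""
        return {"action": "create_alarm", "params": semantics["slots"]}

    def nlg(instruction: dict) -> str:
        """Language generation: describe the executed instruction as text (stub)."""
        return "Alarm set for twelve o'clock tomorrow."

    def tts(text: str) -> bytes:
        """Voice synthesis: convert the reply text into audio for the loudspeaker (stub)."""
        return text.encode("utf-8")  # placeholder for synthesized audio samples

    def handle_voice_query(audio: bytes) -> bytes:
        text = asr(audio)
        semantics = nlu(text)
        instruction = dialog_management(semantics)
        return tts(nlg(instruction))

    if __name__ == "__main__":
        print(handle_voice_query(b"...audio frames..."))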
It should be noted that the architecture shown in fig. 6 is only an example, and does not limit the scope of the present application. In the embodiment of the present application, other architectures may also be adopted to implement similar functions, for example: all or part of the above process may be performed by the display device 200, which is not described herein.
In some embodiments, the voice recognition function may be implemented by the sound collector and the controller 250 provided on the display device in cooperation, and the semantic function may be implemented by the controller 250 of the display device.
The user may control the display apparatus 200 by using a control device, such as a remote controller, for example, in the case of a smart tv, the user may control the tv playing media or adjust the volume by using the remote controller, so as to control the smart tv.
In some embodiments, the display device 200 has a voice recognition function. When the voice recognition function is enabled, the user may send a voice instruction to the display device 200 by voice input while using it, so that the display device 200 implements the corresponding function. To this end, the display apparatus 200 may be provided with a voice recognition mode.
In some embodiments, the user may send a voice recognition mode instruction to the display device by operating a designated key of the remote controller. In practical applications, the correspondence between the voice recognition mode instruction and the remote controller key is bound in advance. For example, a voice recognition mode key is set on the remote controller; when the user touches the key, the remote controller sends a voice recognition mode instruction to the controller 250, and the controller 250 controls the display device to enter the voice recognition mode. When the user touches the key again, the controller 250 may control the display device to exit the voice recognition mode.
In some embodiments, when the user controls the display device using the smart device, for example using a cell phone, a voice recognition mode instruction may also be sent to the display device. In the process of practical application, a control may be set in the mobile phone, and whether to enter the voice recognition mode may be selected through the control, so as to send a voice recognition mode instruction to the controller 250, and at this time, the controller 250 may control the display device to enter the voice recognition mode.
In some embodiments, when the user uses the mobile phone to control the display device, a continuous click command may be issued to the mobile phone. The continuous click command refers to: in a preset period, the number of times that a user clicks the same area of the mobile phone touch screen exceeds a preset threshold value. For example: when the user continuously clicks a certain area of the mobile phone touch screen for 3 times within 1s, the user is regarded as a continuous clicking instruction. After receiving the continuous click command, the mobile phone may send a voice recognition mode command to the display device, so that the controller 250 controls the display device to enter the voice recognition mode.
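Purely as an illustrative sketch of the continuous-click rule just described (clicks on the same screen area counted within a preset period), the snippet below shows one way the mobile side might detect it. The 1-second window, the 3-click threshold and the pixel tolerance for "same area" are assumptions chosen to match the example above, not values prescribed by the patent.

    import time

    class ContinuousClickDetector:
        """Flags a continuous-click instruction when the same touch-screen area is
        clicked at least `threshold` times within `period_s` seconds."""

        def __init__(self, period_s: float = 1.0, threshold: int = 3, region_px: int = 50):
            self.period_s = period_s
            self.threshold = threshold
            self.region_px = region_px   # clicks closer than this count as the same area
            self.clicks = []             # list of (timestamp, x, y)

        def on_click(self, x: int, y: int, now: float = None) -> bool:
            now = time.monotonic() if now is None else now
            # keep only recent clicks that fall in the same area as this one
            self.clicks = [(t, cx, cy) for (t, cx, cy) in self.clicks
                           if now - t <= self.period_s
                           and abs(cx - x) <= self.region_px
                           and abs(cy - y) <= self.region_px]
            self.clicks.append((now, x, y))
            # True means: send a voice recognition mode instruction to the display device
            return len(self.clicks) >= self.threshold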
In some embodiments, when the user uses the mobile phone to control the display device, the following may be set: when detecting that a touch pressure value of a certain area of the mobile phone touch screen by a user exceeds a preset pressure threshold value, the mobile phone can send a voice recognition mode instruction to the display device.
A voice recognition mode option may also be provided in the UI interface of the display device, and when the user clicks the option, the display device may be controlled to enter or exit the voice recognition mode. FIG. 7 illustrates a schematic diagram of a system settings UI interface in some embodiments. As shown in fig. 7, the system settings include screen settings, sound settings, voice recognition settings, network settings, and factory reset settings. The user may click on the voice recognition control to control the display device 200 to enter or exit the voice recognition mode.
In some embodiments, to prevent the user from triggering the voice recognition mode by mistake, when the controller 250 receives a voice recognition mode instruction, the display 260 may be controlled to display voice recognition mode confirmation information, so that the user performs a secondary confirmation whether to control the display device to enter the voice recognition mode. FIG. 8 illustrates a schematic diagram of displaying speech recognition mode confirmation information in a display in some embodiments.
In some embodiments, after the display device 200 enters the speech recognition mode, the user may send instructions directly to the display device 200 by way of a voice input. After receiving the voice input by the user, the display device 200 may recognize the voice of the user, determine a control instruction of the user, and perform a corresponding operation to implement a function required by the user.
In some embodiments, the controller 250 may control a sound collector, which may be a microphone, to collect a voice signal input by the user. After the voice collector collects the user voice, the controller 250 may analyze the user voice to obtain a voice text corresponding to the user voice. The controller may perform semantic analysis on the speech text to determine control instructions for the user.
In some embodiments, the display device 200 may also include a third party speech recognition interface. Upon receiving the voice input by the user, the controller 250 may transmit the voice data to a third party voice recognition interface, convert the voice of the user into a voice text using a third party voice recognition device, and the like. After obtaining the voice text, the controller 250 may parse the voice text and execute the voice command.
In some embodiments, controller 250 may also send voice instructions to the server. The server can generate a voice text according to the voice instruction and feed the voice text back to the display device.
When the display device recognizes the user's voice, and in particular performs semantic analysis on the voice text corresponding to the user's voice, it may use corpus information stored in the display device. High-quality sentence templates and word entities can be stored in the display device, reflecting the voices that users input frequently. This corpus information may be used as training data to construct a speech recognition model, and the speech recognition model is then used to recognize the user's voice.
It should be noted that, since a display device is usually used within a fixed region, the corpus information stored in it may only cover the local language or the language for which the device was produced. However, users may use different languages and may therefore need to control the display device in other languages. The language of the user's voice may differ from the language of the stored corpus information, in which case the user's voice cannot be recognized from the stored corpus information alone. In this regard, the display device may translate the stored corpus information into the language currently used by the user, and then recognize the user's voice using the translated corpus information.
The display device could use existing translation software to translate the corpus information. However, such translation software may translate literally, and the result of a literal translation can deviate from the original meaning of the corpus information, making the translation inaccurate. For example, for the sentence "wake me up at twelve", the word entity "twelve o'clock" clearly carries a time meaning. If only the entity is translated in isolation, a literal rendering such as "twelve points" may be obtained, which no longer carries the time meaning. With such a translation result, the user's voice cannot be recognized accurately, and the user experience is seriously affected.
Therefore, the display device provided by the embodiments of the present application can accurately translate corpus information such as the various entities, so that the user's voice can be recognized accurately.
In some embodiments, before a user uses the speech recognition function of the display device, the display device may pre-translate the stored corpus information, such as the various word entities, into different languages. Alternatively, the user may specify one or more languages, and the display device translates the corpus information into the corresponding languages.
The display device can generate a speech recognition model from the translated corpus information and recognize the user's voice using the speech recognition model, so as to respond to the user's control instruction.
FIG. 9 illustrates an interaction flow diagram for the components of the display device in some embodiments. As shown in fig. 9, the method comprises the following steps:
S101: the controller 250 obtains a sentence to be translated based on an entity to be translated and a sentence template pre-stored in the display device.
S102: the controller 250 translates the sentence to be translated into a target sentence in a preset language.
S103: the controller 250 obtains a target entity corresponding to the entity to be translated based on the target sentence.
S104: the controller 250 generates a speech recognition model based on the target entity and the pre-stored sentence template.
S105: in response to user voice collected by the sound collector, the controller 250 recognizes the user voice based on the speech recognition model, obtains a control instruction corresponding to the user voice, and executes the control instruction.
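Read together, S101-S105 form the pipeline skeleton sketched below. The function bodies are intentionally left as stubs: the slot filling of S101 and the entity extraction of S103 are worked out in code later in this description, while the translation service of S102 and the model building of S104 are only named, since the patent does not tie them to any particular library. Every identifier is an illustrative placeholder.

    def build_sentences_to_translate(entity_to_translate, target_templates, seed_entities):
        """S101: fill the entity and the first seed entities into each target sentence template."""
        ...

    def translate_sentence(sentence, preset_language="en"):
        """S102: translate the whole sentence into the preset language (external service)."""
        ...

    def extract_target_entity(target_sentence, matching_templates, seed_entities):
        """S103: recover the entity's preset-language form from the translated sentence."""
        ...

    def build_speech_recognition_model(target_entities, sentence_templates):
        """S104: generate a speech recognition model from the target entities and templates."""
        ...

    def recognize_and_execute(model, user_voice):
        """S105: recognize the collected user voice and execute the matching control instruction."""
        ...

    def prepare_preset_language(entity_to_translate, target_templates, seed_entities):
        sentences = build_sentences_to_translate(entity_to_translate, target_templates, seed_entities)
        target_sentences = [translate_sentence(s) for s in sentences]
        target_entities = [extract_target_entity(t, target_templates, seed_entities)
                           for t in target_sentences]
        return build_speech_recognition_model(target_entities, target_templates)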
In some embodiments, the display device may store corpus information in advance, where the corpus information includes sentence templates, word entities, and similar information.
A manual collection method can be adopted to first obtain some high-frequency, high-quality sentence templates and entities. For example, the texts corresponding to the voices most frequently input by different users when voice-controlling the display device may be counted, and sentence templates and entities determined from those texts. Taking the case where the language of the region where the display device is used is Chinese as an example, when users voice-control the display device they may search for media assets, play media assets, set an alarm clock, and so on. In the embodiments of the present application, the process of translating the corpus information is described taking the alarm clock function as an example.
The user can set the alarm clock function by voice-controlling the display device. Based on how users typically phrase this, some sentence templates can first be summarized; the language of these templates is Chinese. A sentence template may contain a fixed entity and slots to be filled. The fixed entity is the instruction text the user speaks when controlling the display device to implement the corresponding function; for example, for the alarm clock function the fixed entity may be "wake me" or "wake me up". A slot to be filled is a position in the sentence template to be filled with a function parameter, and different slots may be filled with different types of entities, such as a time-type entity or a date-type entity.
In the embodiment of the present application, a fixed entity in a sentence template corresponding to a chinese language is referred to as a first fixed entity. The sentence template is specifically described as follows:
Sentence template A: '@{StartDate} @{StartTime} wake me up'. Here, "wake me up" is the first fixed entity, "@{StartDate}" is a slot to be filled with a date-type entity, and "@{StartTime}" is a slot to be filled with a time-type entity.
Sentence template B: 'Set an alarm clock for @{StartDate} @{StartTime}'. Here, "set an alarm clock for ..." is the first fixed entity, and "@{StartDate}" and "@{StartTime}" are slots to be filled with date-type and time-type entities, respectively.
Sentence template C: '@{StartDate} @{StartTime} please wake me up with @{Song}'. Here, "please wake me up with ..." is the first fixed entity, "@{StartDate}" and "@{StartTime}" are slots to be filled with date-type and time-type entities, respectively, and "@{Song}" is a slot to be filled with a song-type entity.
Sentence template D: '@{StartDate} @{StartTime} set an alarm clock with @{Singer}'s @{Song}'. Here, "set an alarm clock with ..." is the first fixed entity, "@{StartDate}" and "@{StartTime}" are slots to be filled with date-type and time-type entities, respectively, "@{Singer}" is a slot to be filled with a singer-type entity, and "@{Song}" is a slot to be filled with a song-type entity.
When the user inputs a voice corresponding to one of these four sentence templates, the display device can determine the sentence template corresponding to the user's voice and the entities in the slots to be filled. For example, when the user's voice is "wake me up at four o'clock tomorrow", it may be determined that the voice corresponds to sentence template A, that the entity in "@{StartDate}" is "tomorrow", and that the entity in "@{StartTime}" is "four o'clock", so the display device determines that the user wants the alarm clock function at four o'clock tomorrow.
In the embodiment of the present application, the sentence template is also referred to as a seed template, and may be a high quality template obtained manually, which can reflect a voice template frequently used by a user.
In some embodiments, for the slots to be filled in the sentence templates, some entities corresponding to each slot may also be determined in advance. High-frequency, high-quality entities may be determined by manual collection. It should be noted that each slot to be filled matches an entity type; for example, "@{StartDate}" and "@{StartTime}" match the date type and the time type, respectively. For each slot to be filled, some entities, referred to as seed entities in this embodiment, may be obtained in advance. For example, the seed entity corresponding to "@{StartDate}" may be "tomorrow", that corresponding to "@{StartTime}" may be "morning", that corresponding to "@{Singer}" may be "the Eagles", and that corresponding to "@{Song}" may be "Hotel".
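To make the relationship between seed templates, slots and seed entities concrete, one possible in-memory layout is sketched below. The field names, and the use of English glosses in place of the original Chinese strings, are assumptions for illustration only, not the patent's actual database schema.

    # Illustrative corpus layout; strings in angle brackets stand in for the Chinese
    # originals, which are not reproduced here.

    SEED_TEMPLATES = [
        {"id": "A", "language": "zh",
         "text": "@{StartDate} @{StartTime} wake me up",          # gloss of the Chinese template
         "slots": {"StartDate": "date", "StartTime": "time"}},
        {"id": "A", "language": "en",
         "text": "wake me up at @{StartTime} @{StartDate}",
         "slots": {"StartDate": "date", "StartTime": "time"}},
    ]

    # Seed entity pairs: (first seed entity, initial language) and
    # (second seed entity, preset language) for each entity type.
    SEED_ENTITY_PAIRS = {
        "date":   [("<Chinese word for 'tomorrow'>", "tomorrow")],
        "time":   [("<Chinese word for 'morning'>", "in the morning")],
        "singer": [("<Chinese name of a singer>", "<singer name in English>")],
        "song":   [("<Chinese title of a song>", "<song title in English>")],
    }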
In some embodiments, the display device needs to be able to recognize user speech in other languages, given that the user has a need to control the display device using other languages. Therefore, for the seed templates and seed entities described above, translation into other languages may be performed in advance. In the embodiment of the present application, the example of translating the chinese language into the english language is described.
For the seed templates, after translation into English, the seed templates corresponding to English can be obtained, including:
Sentence template A: 'wake me up at @{StartTime} @{StartDate}'.
Sentence template B: 'set an alert for @{StartTime} @{StartDate}'.
Sentence template C: 'wake me up with @{Song} at @{StartTime} @{StartDate}'.
Sentence template D: 'set an alarm clock for @{StartTime} @{StartDate} with @{Singer}'s @{Song}'.
When translating the seed templates, manual translation may be used. After translation, the slots to be filled do not change, but the fixed entity is translated into English. Translation may be performed for both the seed templates and the seed entities.
In some embodiments, a large amount of entity information may be required as training data in order to build the speech recognition model. Translating such a large number of entities entirely by hand would involve a heavy workload. Therefore, a number of entities can be manually screened out and translated as seed entities, while the remaining entities, used as training data, can be translated into other languages by the display device automatically. To avoid the inaccurate results produced when existing translation software translates an entity literally, the display device in the embodiments of the present application translates the entities to be translated by itself. In the embodiments of the present application, the process of translating a Chinese entity into an English entity is described as an example.
The controller 250 of the display device may first obtain the sentence to be translated based on the entity to be translated and the sentence template stored in the display device in advance.
Specifically, a database may be preset in the display device to store specific data. After the seed template and the seed entity, and the translated seed template and the translated seed entity are obtained, the corpus information can be stored in a preset database for subsequent application.
The controller 250 may obtain the target entity type of the entity to be translated. For example, for the entity "twelve o'clock", since it is entity information provided by the user while controlling the display device, its entity type is a time type, i.e., the target entity type is time.
After determining the target entity type, the controller 250 may obtain a seed template corresponding to the target entity type.
In the database, a plurality of sentence templates, i.e., seed templates, are stored in advance. The controller 250 may filter the sentence templates in the database based on the target entity type to obtain the sentence templates corresponding to the target entity type, which are referred to as target sentence templates in this embodiment of the present application. It should be noted that, in the screening process, all seed templates in the initial language of the entity to be translated, such as Chinese, are determined first. These Chinese seed templates are then screened to obtain the target sentence templates corresponding to the target entity type.
In some embodiments, when filtering the sentence templates, the controller 250 may filter the sentence templates in the database based on a preset filtering condition.
Because the seed template corresponding to the target entity type is to be obtained, the seed template needs to include the slot to be filled corresponding to the target entity type.
The sentence templates corresponding to the initial language each comprise a first fixed entity and a plurality of slots to be filled. There is a matching relationship between a slot to be filled and an entity type, i.e., the slot to be filled corresponding to the target entity type matches the target entity type. For example, the time type matches the "@{StartTime}" slot and the date type matches the "@{StartDate}" slot.
Therefore, the preset screening condition may be set as follows: if at least one slot to be filled in a sentence template matches the target entity type, that sentence template can be determined to be a target sentence template. That is, the controller 250 may obtain, from the sentence templates corresponding to the initial language in the database, all sentence templates that include a slot to be filled matching the target entity type, and use the obtained sentence templates as the target sentence templates.
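A hedged sketch of this screening condition: keep only the initial-language templates that contain at least one slot whose entity type matches the target entity type. The dictionary-based template records follow the illustrative layout above rather than the patent's actual data model, and template "E" is an invented counter-example.

    def select_target_templates(templates, initial_language, target_entity_type):
        """Return the sentence templates in the initial language that contain at
        least one slot matching the target entity type."""
        return [t for t in templates
                if t["language"] == initial_language
                and target_entity_type in t["slots"].values()]

    templates = [
        {"id": "A", "language": "zh",
         "text": "@{StartDate} @{StartTime} wake me up",
         "slots": {"StartDate": "date", "StartTime": "time"}},
        {"id": "E", "language": "zh",                    # invented example without a time slot
         "text": "play @{Song}",
         "slots": {"Song": "song"}},
    ]

    # For the time-type entity "twelve o'clock", only template A survives.
    print(select_target_templates(templates, "zh", "time"))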
In some embodiments, after obtaining the target sentence template, the controller 250 may obtain the sentence to be translated based on the entity to be translated and the target sentence template. Each target sentence template corresponds to a sentence to be translated.
The controller 250 may fill the slot to be filled in the target sentence template to obtain a complete sentence. Therefore, the controller 250 may first analyze the slot to be filled in each target sentence template to determine the slot corresponding to the entity to be translated and other slots.
The controller 250 may determine the slot corresponding to the entity to be translated in the target sentence template, which is referred to as the target slot to be filled in this embodiment of the present application. The controller 250 may also determine the slots other than the target slot to be filled, which are referred to as the remaining slots to be filled in this embodiment of the present application. The controller 250 may fill the target slot to be filled and the remaining slots to be filled to obtain a complete sentence.
When the slot to be filled is filled, the entity to be translated needs to be filled into the corresponding target slot to be filled, and other related entities are respectively filled into the remaining slots to be filled, so that the controller 250 needs to obtain the seed entities corresponding to the remaining slots to be filled.
In some embodiments, the controller 250 may obtain the seed entity pairs corresponding to the remaining slots to be filled from the database. A seed entity pair refers to the different representations of the same seed entity in two languages. The seed entity pair may include a first seed entity in the initial language corresponding to the entity to be translated and a second seed entity in the preset language. The initial language is the language of the entity to be translated, and the preset language is the target language into which the entity is to be translated. Taking the translation of a Chinese entity into English as an example, the initial language is Chinese and the preset language is English.
The controller 250 may determine the entity types corresponding to all the remaining slots to be filled in the target sentence template, acquire a seed entity of each corresponding entity type from the database, and acquire the two representations of that seed entity in the initial language and the preset language, i.e., the first seed entity and the second seed entity, to obtain the seed entity pair corresponding to each remaining slot to be filled.
The controller 250 may fill the to-be-translated entity into the target to-be-filled slot, and simultaneously fill the first seed entity into the remaining to-be-filled slots to obtain a complete sentence, which is referred to as a to-be-translated sentence in this embodiment of the present application.
For each target sentence template, a corresponding sentence to be translated is obtained. Taking sentence template A as an example, the sentence to be translated obtained after filling the slots is the Chinese sentence meaning "wake me up at twelve o'clock tomorrow".
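The slot-filling step can be sketched as a simple placeholder substitution: the entity to be translated goes into its target slot, and the first seed entities fill the remaining slots. The helper below, and the English glosses that stand in for the Chinese strings, are illustrative assumptions.

    import re

    def fill_template(template, entity_to_translate, target_slot, first_seed_by_type):
        """Build one sentence to be translated from one target sentence template.

        template           : dict with "text" (@{Slot} placeholders) and "slots" (slot -> entity type)
        entity_to_translate: e.g. the time entity "twelve o'clock" (initial language, shown as a gloss)
        target_slot        : the slot the entity belongs to, e.g. "StartTime"
        first_seed_by_type : entity type -> a first seed entity in the initial language
        """
        def substitute(match):
            slot = match.group(1)
            if slot == target_slot:
                return entity_to_translate
            return first_seed_by_type[template["slots"][slot]]   # remaining slots get seed entities
        return re.sub(r"@\{(\w+)\}", substitute, template["text"])

    template_a = {"text": "@{StartDate} @{StartTime} wake me up",
                  "slots": {"StartDate": "date", "StartTime": "time"}}
    print(fill_template(template_a, "twelve o'clock", "StartTime", {"date": "tomorrow"}))
    # -> "tomorrow twelve o'clock wake me up"  (gloss of the Chinese sentence to be translated)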
In some embodiments, after obtaining the sentences to be translated, the controller 250 may translate all the sentences to be translated into the target sentences of the preset language.
Taking the sentence to be translated "Wake me tomorrow up twelve tomorrow" as an example, the target sentence obtained after translation may be "Wake me up at tweeve o' clock tomorrow".
For sentence template B, the corresponding sentence to be translated is "set an alarm clock for twelve o'clock tomorrow". During translation, the same entity, such as the entity to be translated "twelve o'clock", may yield different results in different sentences. For the sentence to be translated "set an alarm clock for twelve o'clock tomorrow", the target sentence obtained after translation may be "set an alarm for 12 o'clock tomorrow". That is, after being translated within different sentences, the entity to be translated "twelve o'clock" may appear as "twelve o'clock", as "12 o'clock", or even as the purely numeric representation "12".
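The translation step itself can be sketched as a thin wrapper around whatever machine-translation backend the display device calls; the function names and signatures below are placeholder assumptions, not APIs from the embodiment.

```python
def translate_sentence(text: str, source_lang: str, target_lang: str) -> str:
    """Placeholder for the backend-specific machine-translation call."""
    raise NotImplementedError("vendor-specific machine translation")

def translate_all(sentences_to_translate, source_lang="zh", target_lang="en"):
    # One target sentence is produced per sentence to be translated.
    return [translate_sentence(s, source_lang, target_lang) for s in sentences_to_translate]
```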
The controller 250 may obtain a target entity corresponding to the entity to be translated according to the target statement, where the target entity is an expression of the entity to be translated in a preset language.
In some embodiments, the controller 250 may perform word segmentation processing on the target sentences to obtain word segmentation results, where each target sentence has its own word segmentation result.
Taking the target sentence "Wake me up at twelve o'clock tomorrow" as an example, the corresponding word segmentation result is: wake, me, up, at, twelve, o'clock, tomorrow.
The controller 250 may obtain a candidate entity corresponding to each target sentence based on the word segmentation result. Each target sentence comprises a candidate entity, and the candidate entity is the representation of the entity to be translated in the target sentence.
In some embodiments, when acquiring the candidate entities, the controller 250 may filter the word segmentation result to determine the candidate entities corresponding to the entities to be translated.
The controller 250 may obtain, from the database, the representation form in the preset language of the target sentence template corresponding to the entity to be translated, which is referred to as a matching sentence template in this embodiment of the present application. That is, the target sentence template and the matching sentence template are the two representation forms of the same template in the initial language and the preset language, respectively.
For sentence template A, the initial-language form is "@{StartDate} @{StartTime} wake me up" (rendered here in English), and the English form is "wake me up at @{StartTime} @{StartDate}". These two templates are a corresponding pair of target sentence template and matching sentence template.
The matching sentence template likewise contains a fixed entity and a plurality of slots to be filled; the fixed entity in the matching sentence template is in the preset language, while the fixed entity in the target sentence template is in the initial language. In this embodiment of the present application, the fixed entity in the matching sentence template is referred to as a second fixed entity, and the second fixed entity is the entity corresponding to the first fixed entity in the preset language.
The controller 250 may filter the word segmentation result to obtain the remaining segments other than the second fixed entity and the second seed entity, and determine the remaining segments as the candidate entity corresponding to each target sentence. For example, for the word segmentation result "wake, me, up, at, twelve, o'clock, tomorrow", "wake me up at" is the second fixed entity and "tomorrow" is the second seed entity, so "twelve o'clock" is the candidate entity.
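A minimal sketch of this screening step follows, under the stated assumptions: tokenisation is naive whitespace splitting, and the fixed-entity and seed-entity token sets are supplied by the caller. The function name is illustrative only.

```python
def extract_candidate(target_sentence: str, fixed_tokens: set, seed_tokens: set) -> str:
    """Keep only the segments that belong neither to the second fixed entity
    nor to a second seed entity; the remainder is the candidate entity."""
    tokens = target_sentence.split()
    remaining = [t for t in tokens
                 if t.lower() not in fixed_tokens and t.lower() not in seed_tokens]
    return " ".join(remaining)

extract_candidate(
    "Wake me up at twelve o'clock tomorrow",
    fixed_tokens={"wake", "me", "up", "at"},
    seed_tokens={"tomorrow"},
)
# -> "twelve o'clock"
```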
In some embodiments, the controller 250 may extract the candidate entity corresponding to the entity to be translated based on a reward and punishment mechanism.
The controller 250 may assign an initial score, which may be 1, to every entity in the word segmentation result.
The initial scores of all entities in the word segmentation result are therefore: wake: 1, me: 1, up: 1, at: 1, twelve: 1, o'clock: 1, tomorrow: 1.
The controller 250 may penalize the fixed entity, i.e., the second fixed entity in the matching sentence template, subtracting 1 point from each of its segments. The scores of all entities are then: wake: 0, me: 0, up: 0, at: 0, twelve: 1, o'clock: 1, tomorrow: 1.
The controller 250 may then penalize the seed entity, i.e., the second seed entity "tomorrow", subtracting 1 point. Note that the lowest score may be set to 0, so an entity whose score is already 0 remains at 0 after a penalty. At this point, the scores of all entities are: wake: 0, me: 0, up: 0, at: 0, twelve: 1, o'clock: 1, tomorrow: 0.
The controller 250 may take all entities whose scores have not changed (i.e., whose score is still 1) as the candidate entity. Thus, the candidate entity is "twelve o'clock".
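The reward-and-punishment variant can be sketched as follows; the score floor of 0 and the token sets mirror the example above, while the function name and signature are assumptions for illustration.

```python
def score_and_extract(tokens, fixed_tokens, seed_tokens):
    """Assign every segment an initial score of 1, penalise segments of the
    second fixed entity and the second seed entity (clamped at 0), and keep
    the segments whose score is still 1 as the candidate entity."""
    scores = [1] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() in fixed_tokens:
            scores[i] = max(0, scores[i] - 1)   # penalty for the second fixed entity
        if tok.lower() in seed_tokens:
            scores[i] = max(0, scores[i] - 1)   # penalty for the second seed entity
    return " ".join(tok for tok, s in zip(tokens, scores) if s == 1)

score_and_extract(
    ["Wake", "me", "up", "at", "twelve", "o'clock", "tomorrow"],
    fixed_tokens={"wake", "me", "up", "at"},
    seed_tokens={"tomorrow"},
)
# -> "twelve o'clock"
```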
In some embodiments, after the word segmentation results corresponding to all the target sentences are screened, a candidate entity corresponding to each target sentence is obtained. Because the same entity may be translated differently in different sentences, the candidate entities corresponding to different target sentences may be the same or different. The controller 250 may collect the candidate entities corresponding to all target sentences to form a candidate entity set. In some embodiments the candidate entity set may be ["12", "12 o'clock", "twelve o'clock"], and the set may contain multiple identical candidate entities.
The controller 250 may screen all candidate entities to obtain the candidate entity closest to the entity to be translated as the final translation result.
The controller 250 may translate all candidate entities into corresponding entities in the initial language, referred to as initial language candidate entities in this embodiment of the application. That is, each candidate entity is translated back to obtain its initial language candidate entity.
The controller 250 may acquire the edit distance between each initial language candidate entity and the entity to be translated. The edit distance, also called the Levenshtein distance, is a quantitative measure of the difference between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. Edit distance is widely used in natural language processing; for example, a spell checker can determine which correct words are the most likely intended ones based on the edit distance between a misspelled word and candidate corrections.
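For reference, a standard dynamic-programming implementation of the Levenshtein distance is sketched below; this is the textbook algorithm, not code taken from the embodiment.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

edit_distance("kitten", "sitting")  # -> 3
```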
Meanwhile, the controller 250 may acquire the number of occurrences of each candidate entity among all candidate entities.
The controller 250 may calculate a translation score of each candidate entity based on the edit distance, the number of occurrences, and a preset weight coefficient, and determine a candidate entity with the highest translation score as a target entity corresponding to the entity to be translated.
The edit distance and the number of occurrences may each be assigned a weight coefficient in advance; for example, the weight coefficient of the edit distance may be 0.3 and the weight coefficient of the number of occurrences may be 0.7.
Controller 250 may calculate the translation score for each candidate entity according to equation (1).
F = a*S + b*G    (1)

where F represents the translation score; S represents the edit distance between the initial language candidate entity corresponding to the candidate entity and the entity to be translated; G represents the number of occurrences of the candidate entity in the candidate entity set; a represents the weight coefficient of the edit distance; b represents the weight coefficient of the number of occurrences; and a + b = 1.
After the translation score of each candidate entity is calculated, the candidate entity with the highest score can be determined as the target entity, namely, the representation of the entity to be translated in the preset language.
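Under these definitions, the selection of the target entity can be sketched as below. The helper names are assumptions, `edit_distance` refers to the Levenshtein sketch above, and equation (1) is applied exactly as written.

```python
from collections import Counter

def pick_target_entity(candidates, back_translations, entity_to_translate, a=0.3, b=0.7):
    """candidates: candidate entities, one per target sentence (duplicates allowed);
    back_translations: candidate entity -> its initial language candidate entity."""
    counts = Counter(candidates)                  # G: occurrences in the candidate entity set
    best, best_score = None, float("-inf")
    for cand in counts:
        s = edit_distance(back_translations[cand], entity_to_translate)  # S in eq. (1)
        f = a * s + b * counts[cand]                                     # F = a*S + b*G
        if f > best_score:
            best, best_score = cand, f
    return best
```

Note that a smaller edit distance normally indicates a closer match, so a practical implementation might normalise or invert the S term; the sketch keeps the formula in the form given by equation (1).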
In some embodiments, after the target entity corresponding to the entity to be translated is obtained, the entity to be translated, the target entity and the sentence template may be used as training data, and a speech recognition model is generated according to the training data to recognize the speech of the user.
In some embodiments, one speech recognition model may be generated for each language.
Controller 250 may first determine the initial language of the entity to be translated.
The controller 250 may obtain from the database all sentence templates corresponding to the initial language, referred to as first sentence templates in this embodiment of the application, and all sentence templates corresponding to the preset language, referred to as second sentence templates.
The controller 250 may generate a first speech recognition model for recognizing a user speech corresponding to an initial language based on the entity to be translated and the first sentence template. The controller 250 may generate a second speech recognition model for recognizing the user speech corresponding to the preset language based on the target entity and the second sentence template.
In some embodiments, controller 250 may synthesize the training data for all languages to generate a total speech recognition model for recognizing the user's speech for all languages.
The controller 250 may generate a third speech recognition model for recognizing the user speech corresponding to the preset language and the initial language based on the target entity, the entity to be translated, and all sentence templates.
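As an illustration of how these per-language training sets could be assembled, the sketch below fills first and second sentence templates with the entity to be translated and the target entity respectively. The template strings, slot values, and the simple concatenation used for the third model are assumptions for illustration, not details from the embodiment.

```python
def render(template: str, slot_values: dict) -> str:
    """Substitute every @{Slot} placeholder with its value."""
    out = template
    for slot, value in slot_values.items():
        out = out.replace("@{%s}" % slot, value)
    return out

first_templates  = ["@{StartDate} @{StartTime} wake me up"]      # initial-language templates
second_templates = ["wake me up at @{StartTime} @{StartDate}"]   # preset-language templates

# First model data: entity to be translated + first sentence templates.
first_data  = [render(t, {"StartTime": "twelve o'clock", "StartDate": "tomorrow"})
               for t in first_templates]
# Second model data: target entity + second sentence templates.
second_data = [render(t, {"StartTime": "12 o'clock", "StartDate": "tomorrow"})
               for t in second_templates]
# Third model data: both languages combined.
third_data  = first_data + second_data
```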
In some embodiments, the controller 250 may determine the preset language according to the user's needs. For example, the user may control the display device to call up a language selection interface and set the language that the display device should recognize; the controller 250 may then determine the language set by the user as the preset language. The controller 250 then translates the entity to be translated into a target entity in the preset language, thereby generating a speech recognition model. FIG. 10 illustrates a schematic view of a language selection interface in some embodiments. As shown in FIG. 10, the language selection interface may contain a plurality of language controls, each of which represents a language that the display device supports for recognition, such as English, French, German, Spanish, and Chinese. When the user selects a language, the controller 250 may translate the entity to be translated into the target entity corresponding to that language, thereby obtaining training data and constructing a speech recognition model. After the user inputs speech in that language, the controller 250 can recognize the user speech.
In some embodiments, controller 250 may recognize the user's speech according to a speech recognition model.
The controller 250 may control the sound collector to collect the voice input by the user, and after the user inputs the voice to the display device, the controller 250 may recognize the voice of the user based on the voice recognition model, obtain a control instruction corresponding to the voice of the user, and execute the control instruction, so as to implement a corresponding function.
The controller 250 may first invoke the third-party speech recognition interface to convert the user speech into a speech text, and then determine the meaning of the speech text by using the speech recognition model to generate and execute the control command.
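The recognition flow can be sketched as two placeholder calls, one to a third-party speech-to-text interface and one against the generated model; both function names and signatures below are assumptions, not real APIs.

```python
def speech_to_text(audio: bytes) -> str:
    """Placeholder for the third-party speech recognition interface."""
    raise NotImplementedError("vendor-specific speech-to-text call")

def text_to_instruction(text: str, model) -> dict:
    """Placeholder: match the speech text against the speech recognition model
    (its sentence templates and entities) and produce a control instruction,
    e.g. {"action": "set_alarm", "time": "12:00", "date": "tomorrow"}."""
    raise NotImplementedError("model-specific intent matching")
```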
In some embodiments, controller 250 may also prompt the user after executing the control instructions. FIG. 11 illustrates a schematic diagram of a scenario in which a user interacts with the display device by voice in some embodiments. As shown in FIG. 11, the user inputs the voice instruction "wake me up at twelve o'clock tomorrow", and the controller 250 may recognize the user voice and execute it: the controller 250 may set an alarm clock for twelve o'clock tomorrow and prompt the user by voice that "an alarm clock for twelve o'clock tomorrow has been set for you".
In some embodiments, the user may use voice to control the display device to search for media assets. For example, the user inputs the voice instruction "search for XXX movie season three". After searching for the relevant media assets, the controller 250 may present a search interface while audibly prompting the user that "videos about XXX have been recommended for you". FIG. 12 illustrates a schematic diagram of the display device displaying the search interface in some embodiments. When the user selects a target asset, the controller 250 may control the display 260 to display the asset detail page of the target asset, so that the user can play the asset for viewing.
If the controller 250 does not find any related media assets, preset prompt information can be displayed to inform the user that no related assets were found, and the user can also be prompted by voice. FIG. 13 is a schematic diagram of the prompt information in some embodiments.
An embodiment of the present application further provides a voice recognition method. As shown in FIG. 14, the method includes the following steps (a minimal end-to-end sketch follows step 1405):
Step 1401: obtaining the sentence to be translated based on the entity to be translated and a sentence template pre-stored in the display device.
Step 1402: translating the sentence to be translated into a target sentence of a preset language.
Step 1403: acquiring a target entity corresponding to the entity to be translated based on the target sentence.
Step 1404: generating a speech recognition model based on the target entity and the pre-stored sentence template.
Step 1405: in response to the user voice collected by the sound collector, recognizing the user voice based on the speech recognition model, obtaining a control instruction corresponding to the user voice, and executing the control instruction.
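The end-to-end sketch referenced above ties steps 1401-1405 together. Every helper is a hypothetical placeholder named only for readability; none of the names come from the embodiment.

```python
def get_target_templates(entity):            raise NotImplementedError  # step 1401: screen templates
def fill_slots(template, entity):            raise NotImplementedError  # step 1401: fill slots
def translate(sentence, target_lang):        raise NotImplementedError  # step 1402: machine translation
def extract_target_entity(targets, entity):  raise NotImplementedError  # step 1403: candidate screening
def build_speech_model(entity, templates):   raise NotImplementedError  # step 1404: model generation
def speech_to_text(audio):                   raise NotImplementedError  # step 1405: speech capture + STT

def run_pipeline(entity_to_translate, preset_language, audio):
    templates = get_target_templates(entity_to_translate)                   # step 1401
    sentences = [fill_slots(t, entity_to_translate) for t in templates]     # step 1401
    targets = [translate(s, preset_language) for s in sentences]            # step 1402
    target_entity = extract_target_entity(targets, entity_to_translate)     # step 1403
    model = build_speech_model(target_entity, templates)                    # step 1404
    text = speech_to_text(audio)                                            # step 1405
    return model, text  # the controller then matches `text` against `model` and executes the instruction
```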
In some embodiments, obtaining the sentence to be translated based on the entity to be translated and a sentence template pre-stored in the display device further includes:
acquiring a target entity type of an entity to be translated;
in a preset database, screening a sentence template based on a target entity type to obtain a target sentence template corresponding to the target entity type; a plurality of statement templates are stored in a database in advance;
and acquiring the sentence to be translated based on the entity to be translated and the target sentence template.
In some embodiments, the statement template includes a first fixed entity and a plurality of slots to be filled, and there is a matching relationship between the slots to be filled and the entity type. Screening the statement template based on the target entity type, further comprising:
screening the statement templates in the database based on a preset screening condition; the preset screening condition is as follows: if at least one slot to be filled in the statement template matches the target entity type, determining the statement template as the target statement template.
In some embodiments, obtaining the sentence to be translated based on the entity to be translated and the target sentence template further includes:
determining a target slot to be filled corresponding to the entity to be translated in the target sentence template and the remaining slots to be filled other than the target slot to be filled; acquiring, from the database, a seed entity pair corresponding to the remaining slots to be filled, wherein the seed entity pair includes a first seed entity of the initial language corresponding to the entity to be translated and a second seed entity of the preset language; and filling the entity to be translated into the target slot to be filled, and filling the first seed entity into the remaining slots to be filled, to obtain the sentence to be translated.
In some embodiments, obtaining a target entity corresponding to an entity to be translated based on a target statement further includes:
performing word segmentation processing on the target sentence to obtain a word segmentation result; each target sentence corresponds to a word segmentation result; acquiring a candidate entity corresponding to each target sentence based on the word segmentation result; and screening the candidate entities to obtain target entities corresponding to the entities to be translated.
In some embodiments, obtaining the candidate entity corresponding to each target sentence based on the word segmentation result further includes:
in a database, acquiring a matching statement template corresponding to a target statement template, wherein the matching statement template comprises a second fixed entity and a plurality of slots to be filled, and the second fixed entity is an entity corresponding to the first fixed entity under a preset language; and in the word segmentation result, obtaining the remaining word segments except the second fixed entity and the second seed entity, and determining the remaining word segments as candidate entities corresponding to each target sentence.
In some embodiments, screening the candidate entities to obtain a target entity corresponding to the entity to be translated further includes:
translating the candidate entities into corresponding initial language candidate entities in the initial language; acquiring the editing distance between an initial language candidate entity and an entity to be translated; acquiring the occurrence frequency of each candidate entity in all the candidate entities; and calculating the translation score of the candidate entity based on the editing distance, the occurrence frequency and a preset weight coefficient, and determining the candidate entity with the highest translation score as a target entity corresponding to the entity to be translated.
In some embodiments, generating the speech recognition model based on the target entity and the pre-stored sentence template further comprises:
determining an initial language of an entity to be translated; acquiring a first statement template corresponding to an initial language and a second statement template corresponding to a preset language; generating a first voice recognition model based on the entity to be translated and the first sentence template, and generating a second voice recognition model based on the target entity and the second sentence template; the first voice recognition model is used for recognizing user voice corresponding to the initial language, and the second voice recognition model is used for recognizing user voice corresponding to the preset language.
In some embodiments, generating the speech recognition model based on the target entity and a pre-stored sentence template further comprises:
generating a third speech recognition model based on the target entity, the entity to be translated, and the pre-stored sentence templates, where the third speech recognition model is used for recognizing the user speech corresponding to both the preset language and the initial language.
The same and similar parts in the embodiments in this specification may be referred to one another, and are not described herein again.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be substantially or partially embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method of the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, comprising:
a display;
a sound collector configured to collect a voice input by a user;
a controller configured to:
obtaining a sentence to be translated based on an entity to be translated and a sentence template pre-stored in the display device;
translating the statement to be translated into a target statement of a preset language;
acquiring a target entity corresponding to the entity to be translated based on the target statement;
generating a speech recognition model based on the target entity and the pre-stored sentence template;
and in response to the user voice collected by the sound collector, recognizing the user voice based on the speech recognition model, obtaining a control instruction corresponding to the user voice, and executing the control instruction.
2. The display device of claim 1, wherein the controller performs the obtaining of the sentence to be translated based on the entity to be translated and a sentence template pre-stored in the display device, and is further configured to:
acquiring a target entity type of the entity to be translated;
screening the statement template based on the target entity type in a preset database to obtain a target statement template corresponding to the target entity type; a plurality of statement templates are stored in the database in advance;
and acquiring the sentence to be translated based on the entity to be translated and the target sentence template.
3. The display device according to claim 2, wherein the sentence template comprises a first fixed entity and a plurality of slots to be filled, and a matching relationship exists between the slots to be filled and the entity types;
the controller performs filtering of the statement template based on the target entity type, and is further configured to:
screening the statement templates in the database based on a preset screening condition; the preset screening condition is as follows: if at least one slot to be filled in the statement template matches the target entity type, determining the statement template as a target statement template.
4. The display device according to claim 3, wherein the controller executes obtaining a sentence to be translated based on the entity to be translated and the target sentence template, and is further configured to:
determining a target slot to be filled corresponding to the entity to be translated in the target statement template and the remaining slots to be filled other than the target slot to be filled;
acquiring, from the database, a seed entity pair corresponding to the remaining slots to be filled, wherein the seed entity pair comprises a first seed entity of an initial language corresponding to the entity to be translated and a second seed entity of the preset language;
and filling the entity to be translated into the target slot to be filled, and filling the first seed entity into the remaining slots to be filled, to obtain the sentence to be translated.
5. The display device according to claim 4, wherein the controller executes obtaining a target entity corresponding to the entity to be translated based on the target sentence, and is further configured to:
performing word segmentation processing on the target sentence to obtain a word segmentation result; each target sentence corresponds to a word segmentation result;
acquiring a candidate entity corresponding to each target sentence based on the word segmentation result;
and screening the candidate entity to obtain a target entity corresponding to the entity to be translated.
6. The display device according to claim 5, wherein the controller performs obtaining a candidate entity corresponding to each of the target sentences based on the word segmentation result, and is further configured to:
acquiring a matching statement template corresponding to the target statement template in the database, wherein the matching statement template comprises a second fixed entity and the plurality of slots to be filled, and the second fixed entity is an entity corresponding to the first fixed entity in the preset language;
and obtaining the remaining participles except the second fixed entity and the second seed entity in the participle result, and determining the remaining participles as candidate entities corresponding to each target sentence.
7. The display device according to claim 5, wherein the controller performs filtering on the candidate entities to obtain a target entity corresponding to the entity to be translated, and is further configured to:
translating the candidate entities into corresponding initial language candidate entities in the initial language;
acquiring the editing distance between the initial language candidate entity and the entity to be translated; acquiring the occurrence frequency of each candidate entity in all the candidate entities;
and calculating the translation score of the candidate entity based on the editing distance, the occurrence frequency and a preset weight coefficient, and determining the candidate entity with the highest translation score as a target entity corresponding to the entity to be translated.
8. The display device of claim 1, wherein the controller performs generating a speech recognition model based on the target entity and the pre-stored sentence template, and is further configured to:
determining an initial language of the entity to be translated;
acquiring a first statement template corresponding to the initial language and a second statement template corresponding to the preset language;
generating a first speech recognition model based on the entity to be translated and the first sentence template, and generating a second speech recognition model based on the target entity and the second sentence template; the first voice recognition model is used for recognizing the user voice corresponding to the initial language, and the second voice recognition model is used for recognizing the user voice corresponding to the preset language.
9. The display device of claim 1, wherein the controller performs generating a speech recognition model based on the target entity and the pre-stored sentence template, and is further configured to:
determining an initial language of the entity to be translated;
and generating a third voice recognition model based on the target entity, the entity to be translated and the pre-stored statement template, wherein the third voice recognition model is used for recognizing the user voice corresponding to the preset language and the initial language.
10. A voice recognition method is applied to a display device, and is characterized by comprising the following steps:
obtaining a sentence to be translated based on an entity to be translated and a sentence template pre-stored in the display device;
translating the statement to be translated into a target statement of a preset language;
acquiring a target entity corresponding to the entity to be translated based on the target statement;
generating a speech recognition model based on the target entity and the pre-stored sentence template;
and in response to the user voice collected by the sound collector, recognizing the user voice based on the speech recognition model, obtaining a control instruction corresponding to the user voice, and executing the control instruction.