CN117809658A - Server, terminal and voice recognition method

Server, terminal and voice recognition method

Info

Publication number: CN117809658A
Application number: CN202311079252.3A
Authority: CN (China)
Prior art keywords: language model, user, data, server, terminal
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张晓明, 张宝军
Current Assignee: Hisense Electronic Technology Wuhan Co ltd
Original Assignee: Hisense Electronic Technology Wuhan Co ltd
Application filed by Hisense Electronic Technology Wuhan Co ltd
Priority to CN202311079252.3A, published as CN117809658A

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

Some embodiments of the present application provide a server, a terminal, and a voice recognition method. The method includes: after receiving user data uploaded by a terminal, constructing a first language model corresponding to a user identifier based on the user data, where the user data includes the user identifier and user history data; after receiving voice data and a user identifier uploaded by the terminal, determining a recognition result corresponding to the voice data based on a second language model and the first language model corresponding to the user identifier, where the second language model is a general language model; and sending the recognition result to the terminal. In the embodiments of the present application, a personalized language model is built for the user from the user history data uploaded by the terminal, and after voice data input by the user is received, the recognition result corresponding to the voice data is determined jointly by the personalized language model and the general language model, thereby improving the accuracy of voice recognition.

Description

Server, terminal and voice recognition method
Technical Field
The application relates to the technical field of display equipment, in particular to a server, a terminal and a voice recognition method.
Background
Speech recognition technology, also known as automatic speech recognition (Automatic Speech Recognition, ASR), aims to convert the content contained in a speech signal into computer-readable input, such as a text sequence. The mainstream methods of speech recognition in the industry include hybrid-model-based methods and end-to-end-model-based methods. The hybrid-model-based approach implements the acoustic model and the language model as two separate modules: the acoustic model is mainly responsible for converting speech signals into text, while the language model is responsible for modifying and perfecting the generated text. The end-to-end approach maps the input speech signal to the output text with a single neural network.
A hybrid-model speech recognition system is generally composed of an acoustic model and a language model together. The language model used in the smart TV scene is usually trained on the texts of various TV-related instructions sent by users to the television, such as "change the channel" or "play Journey to the West". This language model is the same for different television users. If a voice command input by the user has the same or a very close pronunciation to a more common voice command, the user's command will be replaced by the more common one, and the user's intention cannot be correctly recognized. For example, the comedy title 嘻游记 is pronounced exactly like the classic 西游记 (Journey to the West); when the user says "I want to watch 嘻游记", it conflicts with "I want to watch 西游记" and is difficult to recognize correctly, which reduces the accuracy of speech recognition.
Disclosure of Invention
Some embodiments of the present application provide a server, a terminal, and a voice recognition method, which construct a personalized language model for a user by receiving user history data uploaded by the terminal, and after receiving voice data input by the user, determine a recognition result corresponding to the voice data based on the personalized language model and a general language model together, thereby improving accuracy of voice recognition.
In a first aspect, some embodiments of the present application provide a server configured to:
after receiving user data uploaded by a terminal, constructing a first language model corresponding to a user identifier based on the user data, wherein the user data comprises the user identifier and user history data;
after receiving voice data and user identification uploaded by a terminal, determining a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the second language model is a general language model;
and sending the recognition result to the terminal.
In some embodiments, the server is configured to:
and after receiving the user interface data uploaded by the terminal, constructing a third language model based on the user interface data.
In some embodiments, the server performs determining the recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, and is further configured to:
and determining a recognition result corresponding to the voice data based on the second language model, the third language model and the first language model corresponding to the user identifier.
In some embodiments, the server is configured to:
after detecting that the entity database has update data, updating the second language model based on the update data.
In some embodiments, the update data includes entity data and a data type identifier corresponding to the entity data, and the server performs updating the second language model based on the update data, and is further configured to:
and transmitting the entity data into an entity list corresponding to the data type identifier so as to update the second language model, wherein the entity data in the entity list is used for filling the slot corresponding to the data type identifier.
In some embodiments, the server is configured to:
after receiving a scene identifier uploaded by a terminal, creating a fourth language model corresponding to the scene identifier based on data corresponding to the scene identifier;
And determining a recognition result corresponding to the voice data based on the second language model and the fourth language model corresponding to the scene identifier.
In some embodiments, the server performs determining the recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, and is further configured to:
determining a composite score of at least one first candidate text based on the first language model and the acoustic model, the composite score of the first candidate text being a sum of the language and acoustic scores of the first candidate text;
determining a composite score of at least one second candidate text based on the second language model and the acoustic model, the composite score of the second candidate text being a sum of the language and acoustic scores of the second candidate text;
and determining the candidate text with the highest comprehensive score in the second candidate text and the first candidate text as a recognition result.
In a second aspect, some embodiments of the present application provide a terminal, including:
a controller configured to:
receiving voice data input by a user;
the voice data and the user identification are sent to a server, so that the server determines a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the first language model is created based on user data uploaded by a terminal, the user data comprises the user identification and user history data, and the second language model is a general language model;
And receiving the recognition result issued by the server.
In a third aspect, some embodiments of the present application provide a voice recognition method, which is applied to a server, including:
after receiving user data uploaded by a terminal, constructing a first language model corresponding to a user identifier based on the user data, wherein the user data comprises the user identifier and user history data;
after receiving voice data and user identification uploaded by a terminal, determining a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the second language model is a general language model;
and sending the recognition result to the terminal.
In a fourth aspect, some embodiments of the present application provide a voice recognition method, which is applied to a terminal, including:
receiving voice data input by a user;
the voice data and the user identification are sent to a server, so that the server determines a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the first language model is created based on user data uploaded by a terminal, the user data comprises the user identification and user history data, and the second language model is a general language model;
And receiving the recognition result issued by the server.
Some embodiments of the application provide a server, a terminal and a voice recognition method. After receiving user data uploaded by a terminal, the server constructs a first language model corresponding to a user identifier based on the user data. The user data includes the user identifier and user history data. After receiving voice data and a user identifier uploaded by the terminal, the server determines a recognition result corresponding to the voice data based on a second language model and the first language model corresponding to the user identifier, where the second language model is a general language model, and sends the recognition result to the terminal. In the embodiments of the present application, a personalized language model is built for the user from the user history data uploaded by the terminal, and after voice data input by the user is received, the recognition result corresponding to the voice data is determined jointly by the personalized language model and the general language model, thereby improving the accuracy of voice recognition.
Drawings
FIG. 1 illustrates an operational scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of a control device according to some embodiments;
FIG. 3 illustrates a hardware configuration block diagram of a display device according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device according to some embodiments;
FIG. 5 illustrates a flow chart of a method for a server to perform speech recognition, provided in accordance with some embodiments;
FIG. 6 illustrates a schematic diagram of an acoustic model structure provided in accordance with some embodiments;
FIG. 7 illustrates a flow chart of an arpa file format provided in accordance with some embodiments;
FIG. 8 illustrates a schematic diagram of a WFST format provided in accordance with some embodiments;
FIG. 9 illustrates a schematic diagram of a Grammar FST format provided in accordance with some embodiments;
FIG. 10 illustrates a flowchart of a display device performing a voice recognition method, provided in accordance with some embodiments;
fig. 11 illustrates a timing diagram for a speech recognition method provided in accordance with some embodiments.
Detailed Description
For purposes of clarity and ease of implementation of the present application, exemplary implementations of the present application are described below with reference to the accompanying drawings in which they are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first and second and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for limiting a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The terminal provided by the embodiments of the application may take various forms, for example a display device or an intelligent device. Intelligent devices include mobile terminals, tablet computers, notebook computers, intelligent vehicle-mounted devices, intelligent sound boxes (smart speakers), and the like.
The display device provided in the embodiment of the application may have various implementation forms, for example, may be a television, an intelligent television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table), and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, and the display device 200 is controlled by a wireless or wired mode. The user may control the display device 200 by inputting user instructions through keys on a remote control, voice input, control panel input, etc.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device may receive instructions not using the smart device or control device described above, but rather receive control of the user by touch or gesture, or the like.
In some embodiments, the display device 200 may also be controlled in a manner other than through the control apparatus 100 and the smart device 300. For example, the user's voice command may be received directly through a module for acquiring voice commands configured inside the display device 200, or through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be permitted to make communication connections via a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be a cluster, or may be multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and function as an interaction between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first through nth interfaces for input/output.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display. It receives image signals output by the controller and displays video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The display 260 further includes a touch screen, and the touch screen is used for receiving an action input control instruction such as sliding or clicking of a finger of a user on the touch screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, etc. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals through a wired or wireless reception manner, and demodulates audio and video signals, such as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments the controller includes at least one of a central processing unit (Central Processing Unit, CPU), video processor, audio processor, graphics processor (Graphics Processing Unit, GPU), RAM (Random Access Memory, RAM), ROM (Read-Only Memory, ROM), first to nth interfaces for input/output, a communication Bus (Bus), etc.
The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly comprises the common applications on the television and an application framework (Application Framework). The common applications are mainly browser-based applications, such as HTML5 APPs, and native applications (Native APPs);
the application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as: file access, data exchange, and the interface for the use of these functions (toolbar, status column, menu, dialog box).
Native applications (Native APPs) may support online or offline, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises a HAL interface, hardware and a driver, wherein the HAL interface is a unified interface for all the television chips to be docked, and specific logic is realized by each chip. The driving mainly comprises: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
Speech recognition technology, also known as automatic speech recognition, aims to convert the content contained in a speech signal into computer-readable input, such as a text sequence. The mainstream methods in the industry include hybrid-model-based methods and end-to-end-model-based methods. The hybrid-model-based approach implements the acoustic model and the language model as two separate modules: the acoustic model is mainly responsible for converting speech signals into text, while the language model is responsible for modifying and perfecting the generated text. The end-to-end approach maps the input speech signal to the output text with a single neural network.
A hybrid-model speech recognition system is generally composed of an acoustic model and a language model together. The language model used in the smart TV scene is usually trained on the texts of various TV-related instructions sent by users to the television, such as "change the channel" or "play Journey to the West". This language model is the same for different television users. If a voice command input by the user has the same or a very close pronunciation to a more common voice command, the user's command will be replaced by the more common one, so that the user's intention cannot be correctly recognized and the accuracy of speech recognition is reduced.
For example, the pronunciation xi1 you2 ji4 corresponds to two pieces of film content: 西游记 (Journey to the West) and the homophonous comedy 嘻游记. If the user says wo3 yao4 kan4 xi1 you2 ji4 ("I want to watch xi you ji") to the television, the recognition result is "I want to watch 西游记", because in a large amount of text 西游记 occurs far more frequently than 嘻游记. A user who actually wants to watch 嘻游记 therefore finds it difficult to do so by voice control.
In some embodiments, voice commands with the same or close pronunciations are left to be resolved by a subsequent natural language processing (Natural Language Processing, NLP) module or corrected by a post-processing error-correction module, which makes the pipeline more complex to operate.
In order to solve the technical problems, the embodiment of the application provides a server, and some functions of the server are further improved. As shown in fig. 5, the server performs the steps of:
step S501: and receiving user data uploaded by the terminal, wherein the user data comprises a user identifier and user history data.
Wherein the user identification is used for representing identity information of the user. The user identifier may be a user account number or a device identifier of the terminal. The user history data is used to characterize the user's history behavior.
In some embodiments, the user history data includes user favorite movie listings, play movie listings, and the like.
In some embodiments, the user history data includes the movie theatrical related entity data within a user's favorite movie listing, a play movie listing, and the like. For example, the movie list of the user collection includes movie a, and the user history data includes movie a's movie name, actor name, character name, and the like.
In some embodiments, the user history data includes data of a user favorite music list, a played music list, and the like.
In some embodiments, the user history data includes the music related entity data within a list of user favorite music lists, played music lists, and the like. For example, song B is included in the user play music list, and the user history data includes song name, artist name, album name, lyrics, and the like of song B.
In some embodiments, the user history data includes data of a user-installed application or a used application, and the like.
In some embodiments, the user history data includes data such as searched address locations or frequently visited address locations. In addition to cities, the address information includes business places such as restaurants, banks, and gas stations.
Step S502: constructing a first language model corresponding to the user identifier based on the user data;
after receiving the user data uploaded by the terminal, the server constructs a first language model based on the user history data in the user data, and stores the first language model in correspondence with the user identifier.
In some embodiments, the user history data includes at least one first entity data, and the first language model may be directly constructed based on the first entity data and stored corresponding to the user identifier.
In some embodiments, the user history data includes at least one first entity data, after receiving the user data, obtaining second entity data associated with the first entity data, constructing a first language model based on the first entity data and the second entity data, and storing the first language model corresponding to the user identifier.
Illustratively, the user history data includes the title of movie/TV series A and the song name of music B. Entity data such as actor names and character names associated with series A is obtained, and entity data such as singer name, album name, and lyrics associated with song B is obtained. A first language model is then constructed based on the title, actor names, and character names of series A together with the song name, singer name, album name, and lyrics associated with song B.
In some embodiments, the terminal may send only the data content in the play list or the favorites list, such as data names, to the server, or may also send the entity data related to that content. Whether the server needs to expand the first entity data is determined by an expansion identifier. The user data also comprises the expansion identifier, and after receiving it the server checks its value: if the expansion identifier is 0, the first language model is constructed directly based on the user data; if the expansion identifier is 1, second entity data associated with the first entity data is acquired, and the first language model is constructed based on the first entity data and the second entity data, as sketched below.
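As an illustration only, the following Python sketch shows the branching described above; the helper functions, field names such as expand_flag, and the in-memory storage are assumptions made for the example, not part of the patent.

```python
# Sketch of the server-side handling of uploaded user data (hypothetical helpers).

def fetch_associated_entities(entity_names):
    """Placeholder: look up actors, albums, lyrics, etc. for the given entities."""
    catalog = {"Movie A": ["Actor A1", "Character A2"],
               "Song B": ["Singer B1", "Album B2"]}
    related = []
    for name in entity_names:
        related.extend(catalog.get(name, []))
    return related

def build_language_model(texts):
    """Placeholder for n-gram training; here we just keep the vocabulary."""
    return {"vocab": set(texts)}

# first_language_models maps user identifier -> personalized (first) language model.
first_language_models = {}

def handle_user_data(user_data):
    user_id = user_data["user_id"]
    first_entities = user_data["history_entities"]        # first entity data
    if user_data.get("expand_flag", 0) == 1:               # expansion identifier
        second_entities = fetch_associated_entities(first_entities)
        training_texts = first_entities + second_entities
    else:
        training_texts = list(first_entities)
    # Store the first language model keyed by the user identifier.
    first_language_models[user_id] = build_language_model(training_texts)

handle_user_data({"user_id": "user-001",
                  "history_entities": ["Movie A", "Song B"],
                  "expand_flag": 1})
print(sorted(first_language_models["user-001"]["vocab"]))
```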
In some embodiments, the user identifier is a device identifier of the terminal, and the first language model is stored corresponding to the terminal device identifier, so that other terminals cannot use the first language model corresponding to the terminal.
In some embodiments, the user identifier is a user account, and the first language model corresponding to the user account is stored, so that the first language model corresponding to the user account can be used even if other terminals log in the user account.
Step S503: receiving voice data and user identification uploaded by a terminal;
the voice data is voice data input to the terminal by the user. After receiving voice data input by a user, the terminal sends the voice data to the server.
Step S504: and determining a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identifier, wherein the second language model is a universal language model.
In some embodiments, only the voice data uploaded by the terminal is received and the user identifier uploaded by the terminal is not received, and then the recognition result corresponding to the voice data is determined directly based on the second language model.
In some embodiments, after receiving the voice data and the user identifier uploaded by the terminal, judging whether a first language model corresponding to the user identifier exists;
if the first language model corresponding to the user identifier exists, determining a recognition result corresponding to the voice data based on the second language model and the first language model corresponding to the user identifier;
and if the first language model corresponding to the user identification does not exist, determining a recognition result corresponding to the voice data based on the second language model.
In the embodiments of the present application, determining the recognition result corresponding to the voice data is done by a hybrid-model speech recognition engine. The hybrid-model speech recognition engine includes an acoustic model and a language model.
The input of the acoustic model is the original speech frame sequence sliced by a fixed duration, typically 10-30 ms, and the output is the probability that each frame of speech corresponds to an acoustic modeling unit, i.e. the acoustic score. There are several options for the modeling units commonly used for Mandarin Chinese recognition, including initials and finals (collectively referred to as phones), syllables, Chinese characters (char), and the like. The acoustic model is typically implemented using a deep neural network (Deep Neural Networks, DNN); a simple acoustic model structure is shown in fig. 6.
The language model outputs language scores for different text sequences. Combining the acoustic model and the language model, the input speech can be converted into a series of possible candidate texts, and each text sequence is given a corresponding probability value, which combines the probability from the acoustic model and the probability from the language model. The language model records the probability of one word, two words, three words, etc. being joined together, and a typical language model is stored as an arpa file; the contents of this file are shown, for example, in fig. 7. The file records the probabilities that particular words are linked together. These probabilities come from statistics over a large amount of text, from which the arpa file is generated; what counts as a "large amount of text" generally depends on the problem domain the speech recognition is meant to solve. For the smart TV scene, the common practice is to collect a large number of utterances (texts) with which users voice-control the television, for example "I want to watch CCTV-1", "change the channel", "turn the volume up a bit", "play the tenth episode of Journey to the West", and so on. These texts are segmented with a word segmentation tool, word frequencies are counted, and the probabilities of different words being connected together are computed.
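As a toy illustration of how such joint-occurrence probabilities can be counted (the utterances and the maximum-likelihood estimate below are illustrative; a production system would use a full n-gram toolkit with smoothing and back-off and store the result in an arpa file):

```python
# Toy illustration of counting bigram probabilities from TV-control utterances.
from collections import Counter

utterances = [
    ["I", "want", "to", "watch", "CCTV-1"],
    ["change", "the", "channel"],
    ["turn", "the", "volume", "up"],
    ["play", "episode", "ten", "of", "Journey", "to", "the", "West"],
]

unigrams = Counter(w for u in utterances for w in u)
bigrams = Counter((u[i], u[i + 1]) for u in utterances for i in range(len(u) - 1))

def bigram_prob(prev, word):
    # P(word | prev) by maximum likelihood; real arpa files also store back-off weights.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("the", "channel"))  # how often "channel" follows "the"
```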
In practical engineering implementations, the language model stored in the arpa file is usually converted into a WFST (Weighted Finite-State Transducers, weighted finite-state machine) graph. During speech decoding (i.e. recognition), the WFST graph receives an input phone sequence (or syllable sequence, etc.), the path with the best score is found in the decoding graph, and the text sequence corresponding to that path is the final recognition result.
Illustratively, suppose there is a piece of speech that sounds like ni hao3, with an acoustic score of 0.7; it also sounds somewhat like nin3 hao3, with an acoustic score of 0.2, and like li hao3, with an acoustic score of 0.05. These acoustic scores are calculated by the acoustic model. The final recognition result cannot simply be the hypothesis with the best acoustic score, because accent, noise, and other factors mean that the correct result does not necessarily have the best acoustic score. Searching the WFST graph shown in fig. 8, the texts corresponding to ni hao3 include 你好 ("hello"), 李浩 (the name Li Hao), and so on, while the text corresponding to nin3 hao3 is 您好 (the polite "hello"). The language scores of 你好, 李浩, and 您好 are 1.2, 1.8, and 1.5 respectively. The final result is 你好, because its composite score of acoustic score plus language score is the best.
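Repeating that decision as a minimal Python sketch, using the illustrative scores quoted above rather than real model outputs:

```python
# The recognition result is the candidate whose acoustic score plus language
# score (composite score) is highest; scores are the illustrative values above.
candidates = [
    {"text": "你好", "acoustic": 0.7,  "language": 1.2},
    {"text": "李浩", "acoustic": 0.05, "language": 1.8},
    {"text": "您好", "acoustic": 0.2,  "language": 1.5},
]
best = max(candidates, key=lambda c: c["acoustic"] + c["language"])
print(best["text"], best["acoustic"] + best["language"])   # 你好 1.9
```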
The embodiments of the present application involve two language models, a first language model and a second language model. When speech recognition is performed, the WFST graph generated from the first language model and the WFST graph generated from the second language model are connected in parallel, and the phone sequence (or syllable sequence) given by the acoustic model is searched on this combined graph. If a path in the first language model is hit, the recognition result of the first language model is output; if a path in the second language model is hit, the recognition result of the second language model is output.
In some embodiments, the composite scores of the candidate texts of the two language models are calculated respectively, and the candidate text with the highest composite score of each of the two language models is screened out. And comparing the two screened candidate texts, and screening out the one with higher comprehensive score as a recognition result.
In some embodiments, the comprehensive scores of the candidate texts of the two language models are calculated respectively, and one candidate text with a higher comprehensive score is screened out as a recognition result.
In some embodiments, the step of determining the recognition result corresponding to the voice data based on the second language model and the first language model corresponding to the user identifier includes:
Determining a composite score of at least one first candidate text based on the first language model and the acoustic model, the composite score of the first candidate text being a sum of the language and acoustic scores of the first candidate text;
it should be noted that, compared to the second language model, the language score of the text with the same or similar pronunciation is higher in the first language model.
Determining a composite score of at least one second candidate text based on the second language model and the acoustic model, the composite score of the second candidate text being a sum of the language and acoustic scores of the second candidate text;
and determining the candidate text with the highest comprehensive score in the second candidate text and the first candidate text as a recognition result.
Illustratively, the user says wo3 yao4 kan4 xi1 you2 ji4. Based on the first language model and the acoustic model, the language score and acoustic score of "I want to watch 嘻游记" are determined to be 3.9 and 0.8 respectively, a composite score of 4.7. Based on the second language model and the acoustic model, the language score and acoustic score of "I want to watch 西游记" are determined to be 2.5 and 0.8 respectively, a composite score of 3.3. The recognition result is therefore "I want to watch 嘻游记".
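A minimal sketch of this parallel scoring, with stand-in score tables seeded from the illustrative values above (the Chinese phrases 我要看嘻游记 / 我要看西游记 correspond to the example; the data structures are not the patent's actual WFST decoding):

```python
# Score the same acoustic hypotheses under the personalized (first) and general
# (second) language models, then keep the candidate with the best composite score.
def score_with_model(lm_scores, acoustic_hyps):
    # lm_scores: text -> language score under this model
    return [(text, ac + lm_scores.get(text, float("-inf")))
            for text, ac in acoustic_hyps]

acoustic_hyps = [("我要看嘻游记", 0.8), ("我要看西游记", 0.8)]
first_model = {"我要看嘻游记": 3.9}    # boosted because it appears in the user's history
second_model = {"我要看西游记": 2.5}   # general language model

candidates = (score_with_model(first_model, acoustic_hyps)
              + score_with_model(second_model, acoustic_hyps))
best_text, best_score = max(candidates, key=lambda c: c[1])
print(best_text, best_score)   # 我要看嘻游记 4.7
```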
The same pronunciation from the user may need to be recognized as different results in different scenes. For example, when the user is browsing photos on the terminal and says xia4 yi1 zhang1, it needs to be recognized as 下一张 ("next picture"); if the user is reading a book or a novel, xia4 yi1 zhang1 needs to be recognized as 下一章 ("next chapter").
In some embodiments, the voice command with the same or close pronunciation is left for a subsequent module such as natural language processing, or corrected by a post-processing error correction module, so that the operation is more complex and the accuracy is not high.
To solve the above technical problem, in some embodiments, a server receives voice data and user interface data uploaded by a terminal;
the user interface data refers to the text contained in the user's current user interface.
Constructing a third language model based on the user interface data;
and determining a recognition result corresponding to the voice data based on the second language model and the third language model.
In some embodiments, when receiving the voice data and the user identifier uploaded by the terminal, the method also receives the user interface data uploaded by the terminal, and updates the first language model corresponding to the user identifier based on the user interface data;
and determining a recognition result corresponding to the voice data based on the second language model and the first language model corresponding to the user identifier.
After the user switches the user interface, the server receives the new user interface data uploaded by the terminal and continues to update the first language model based on it. In the embodiments of the present application, the user's historical user interface data can thus serve as a data source for constructing the first language model, improving the accuracy of speech recognition.
In some embodiments, when receiving the voice data and the user identifier uploaded by the terminal, the method also receives the user interface data uploaded by the terminal, and builds a third language model based on the user interface data;
and determining a recognition result corresponding to the voice data based on the second language model, the third language model and the first language model corresponding to the user identifier.
After the user switches the user interface, the server receives the new user interface data uploaded by the terminal, clears the current third language model, and reconstructs the third language model based on the new user interface data. In the embodiments of the present application, different third language models can be constructed for different user interface data, strengthening the association between speech recognition and the user interface and thereby improving the accuracy of speech recognition.
In some embodiments, the step of determining the recognition result corresponding to the voice data based on the second language model, the third language model, and the first language model corresponding to the user identifier includes:
determining a composite score of at least one first candidate text based on the first language model and the acoustic model, the composite score of the first candidate text being a sum of the language and acoustic scores of the first candidate text;
It should be noted that, compared to the second language model, the language score of the text with the same or similar pronunciation is higher in the first language model.
Determining a composite score of at least one second candidate text based on the second language model and the acoustic model, the composite score of the second candidate text being a sum of the language and acoustic scores of the second candidate text;
determining a composite score of at least one third candidate text based on the third language model and the acoustic model, the composite score of the third candidate text being a sum of the language and acoustic scores of the third candidate text;
it should be noted that, compared to the first language model, the language score of the text with the same or similar pronunciation is higher in the third language model.
And determining the candidate text with the highest comprehensive score in the first candidate text, the second candidate text and the third candidate text as a recognition result.
Illustratively, the user's current user interface includes a "next picture" (下一张) button and the like, and a third language model is created based on this user interface data. The user says xia4 yi1 zhang1. Based on the first language model and the acoustic model, the language score and acoustic score of 下一章 ("next chapter") are determined to be 1.0 and 0.7, a composite score of 1.7. Based on the second language model and the acoustic model, the language score and acoustic score of 下一章 are determined to be 1.1 and 0.7, a composite score of 1.8. Based on the third language model and the acoustic model, the language score and acoustic score of 下一张 ("next picture") are determined to be 3.0 and 0.7, a composite score of 3.7. The recognition result is therefore 下一张.
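A sketch of the same idea, assuming the third language model is approximated by a simple table built from on-screen texts with an illustrative score boost (a real system would build a proper language model and decode on a WFST graph):

```python
# The third language model is derived from the text of the current user
# interface (button labels, menu items), so on-screen phrases such as "下一张"
# get a high language score. The boost value is purely illustrative.
def build_ui_language_model(ui_texts, boost=3.0):
    return {text: boost for text in ui_texts}

ui_texts = ["下一张", "上一张", "删除", "分享"]
third_model = build_ui_language_model(ui_texts)

first_model = {"下一章": 1.0}     # from the user's reading history
second_model = {"下一章": 1.1}    # general model favours the more common phrase
acoustic = {"下一张": 0.7, "下一章": 0.7}

candidates = []
for model in (first_model, second_model, third_model):
    for text, lm_score in model.items():
        if text in acoustic:
            candidates.append((text, acoustic[text] + lm_score))
print(max(candidates, key=lambda c: c[1]))   # ('下一张', 3.7)
```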
In some embodiments, the server receives the voice data uploaded by the terminal and a scene identifier, where the scene identifier is used to characterize an application program currently running by the terminal, a channel currently connected, or a specific state of the terminal, such as a video playing state.
After receiving the voice data and the scene identifier uploaded by the terminal, the server judges whether a fourth language model corresponding to the scene identifier exists or not;
if the fourth language model corresponding to the scene identifier does not exist, acquiring data corresponding to the scene identifier, and creating the fourth language model corresponding to the scene identifier based on the data corresponding to the scene identifier;
and determining a recognition result corresponding to the voice data based on the second language model and the fourth language model corresponding to the scene identifier.
If a fourth language model corresponding to the scene identification exists, the recognition result corresponding to the voice data can be determined directly based on the second language model and the fourth language model corresponding to the scene identification.
In some embodiments, the step of determining the recognition result corresponding to the voice data based on the second language model and the fourth language model corresponding to the scene identifier includes:
Determining a composite score of at least one second candidate text based on the second language model and the acoustic model, the composite score of the second candidate text being a sum of the language and acoustic scores of the second candidate text;
determining a composite score of at least one fourth candidate text based on the fourth language model and the acoustic model, the composite score of the fourth candidate text being a sum of the language and acoustic scores of the fourth candidate text;
it should be noted that, compared to the second language model, the language score of the text with the same or similar pronunciation is higher in the fourth language model.
And determining the candidate text with the highest comprehensive score in the second candidate text and the fourth candidate text as a recognition result.
In the embodiments of the present application, when the terminal identifies the user's current scene, the scene identifier corresponding to the current scene can be sent to the server together with the voice data. The server then searches with the language model corresponding to the scene identifier together with the general language model and takes the result with the higher composite score, thereby improving the accuracy of speech recognition.
In some embodiments, after receiving the voice data, the user identifier and the scene identifier uploaded by the terminal, the server acquires data corresponding to the scene identifier, and updates the first language model based on the data corresponding to the scene identifier; and determining a recognition result corresponding to the voice data based on the second language model and the first language model corresponding to the user identifier.
In some embodiments, after receiving the voice data, the user identifier and the scene identifier uploaded by the terminal, determining a recognition result corresponding to the voice data based on the second language model, the first language model corresponding to the user identifier and the fourth language model corresponding to the scene identifier. The specific mode is as follows: calculating the comprehensive scores of the three language model candidate texts, screening the candidate texts with the highest comprehensive scores of the three language models, comparing the three screened candidate texts, and screening one with the higher comprehensive scores as a recognition result. Or respectively calculating the comprehensive scores of the three language model candidate texts, and screening out one candidate text with higher comprehensive score as a recognition result.
In some embodiments, after receiving the voice data, the user identifier, the scene identifier and the user interface data uploaded by the terminal, constructing a third language model based on the user interface data; and determining a recognition result corresponding to the voice data based on the second language model, the third language model, the first language model corresponding to the user identifier and the fourth language model corresponding to the scene identifier. The specific mode is as follows: calculating the comprehensive scores of the candidate texts of the four language models, screening the candidate texts with the highest comprehensive scores of the four language models, comparing the screened four candidate texts, and screening one with the higher comprehensive scores as a recognition result. Or respectively calculating the comprehensive scores of the candidate texts of the four language models, and screening out one candidate text with higher comprehensive score as a recognition result.
In some embodiments, after receiving the voice data, the user identifier, the scene identifier and the user interface data uploaded by the terminal, updating a first language model corresponding to the user identifier based on the user interface data; and determining a recognition result corresponding to the voice data based on the second language model, the first language model corresponding to the user identifier and the fourth language model corresponding to the scene identifier.
In some embodiments, after receiving voice data, a user identifier, a scene identifier and user interface data uploaded by a terminal, constructing a third language model based on the user interface data, acquiring scene data corresponding to the scene identifier, and updating a first language model corresponding to the user identifier based on the scene data; and determining a recognition result corresponding to the voice data based on the second language model, the third language model and the first language model corresponding to the user identifier.
In some embodiments, after receiving voice data, user identification, scene identification and user interface data uploaded by a terminal, acquiring scene data corresponding to the scene identification, and updating a first language model corresponding to the user identification based on the user interface data and the scene data; and determining a recognition result corresponding to the voice data based on the second language model and the first language model corresponding to the user identifier.
In some embodiments, the user is provided with different speech recognition reference options that determine what data the terminal sends to the server. The voice data is mandatory. If the user chooses to use their own history data as a reference, the user identifier needs to be sent along with the voice data. If the user chooses to reference the current user interface, the user interface data needs to be sent along with the voice data. If the user chooses to reference the current scene, the scene identifier needs to be sent along with the voice data. Multiple options may be selected at the same time. The server performs different processing according to the data it receives, as described above and sketched below.
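A sketch of how such a request could be assembled on the terminal side; the field names and option keys are illustrative assumptions, not the patent's protocol:

```python
# Voice data is mandatory; the other fields are attached only when the user has
# enabled the corresponding reference option.
def build_recognition_request(voice_data, options, user_id=None,
                              ui_texts=None, scene_id=None):
    request = {"voice_data": voice_data}            # mandatory
    if options.get("use_history") and user_id:
        request["user_id"] = user_id                # server picks first language model
    if options.get("use_ui") and ui_texts:
        request["ui_data"] = ui_texts               # server builds/updates third model
    if options.get("use_scene") and scene_id:
        request["scene_id"] = scene_id              # server picks fourth model
    return request

req = build_recognition_request(b"<pcm bytes>",
                                {"use_history": True, "use_scene": True},
                                user_id="user-001", scene_id="video_playing")
print(sorted(req.keys()))   # ['scene_id', 'user_id', 'voice_data']
```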
Movie and music data are updated every week or even every day, and new movie and music titles keep appearing. Many of these titles are not common phrases before the content is released. For example, before the TV series 甄嬛传 was broadcast, a recognizer given the pronunciation zhen1 huan2 zhuan4 could not correctly map it to the characters 甄嬛传.
In some embodiments, the main difficulty with updating the language model in real time is that a single update is costly: the language model must be retrained and thoroughly tested after training, and the whole process generally takes days or even weeks, so timely updates are hard to achieve.
In order to solve the above problems, in the embodiments of the present application, a server detects whether update data exists in an entity database;
in some embodiments, the step of detecting whether the entity database has update data comprises:
acquiring update data from the entity database interface at a fixed time or at preset intervals (as sketched below);
if the update data is empty, determining that the media asset entity data has not been updated;
and if the update data is not empty, determining that the media asset entity data has been updated.
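A sketch of this periodic check, assuming a hypothetical fetch_entity_updates interface and schedule:

```python
# Poll the entity-database interface at preset intervals; an empty result means
# no update, a non-empty result triggers a language-model update.
import time

def fetch_entity_updates():
    """Placeholder for the entity-database interface; returns a list of updates."""
    return []   # empty list means no update this round

def poll_entity_database(interval_seconds=3600, rounds=3):
    for i in range(rounds):
        updates = fetch_entity_updates()
        if updates:                      # update data is not empty
            print("entity data updated:", updates)
        else:                            # update data is empty
            print("no entity data update")
        if i < rounds - 1:
            time.sleep(interval_seconds)

poll_entity_database(interval_seconds=1, rounds=2)
```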
In some embodiments, the step of detecting whether the entity database has updated data comprises:
after determining that the entity database is updated, the data updating module of the server sends an updating message to the voice recognition engine;
after receiving the update message, the speech recognition engine acquires the update data of the entity database interface.
The data in the entity database is used to characterize names associated with specific content. The specific data includes media asset entity data and location entity data. The media asset entity data includes video entity data, audio entity data, and the like. The video entity data includes movie and series titles, actor names, character names, and the like. The audio entity data includes song names, album names, singer names, lyrics, and the like. The location entity data includes restaurant names, restaurant addresses, featured dishes, and the like.
Entity data updates of different content are configured for different application scenarios. A smart TV only needs to acquire update data when the media asset entity data is updated. A smart speaker only needs to acquire update data when the audio entity data is updated. A map application on a mobile terminal only needs to acquire update data when the location entity data is updated.
After detecting that the entity database has update data, updating the second language model based on the update data.
The second language model in the embodiments of the present application adopts a Grammar FST (Finite-State Transducer) language model. The main idea of the Grammar FST is that, compared with a traditional language model FST graph, it reserves slots in the graph that can be used for expansion. When there is updated entity data, the content of a slot can be populated from an entity list. A conventional language model FST effectively contains the probabilities of whole sentences such as "I want to watch Liu Dehua's movies" or "I want to watch Zhang Xueyou's movies", whereas the Grammar FST contains the probability of "I want to watch xxx's movies", where xxx can be filled and expanded with the entity data in the entity list.
In some embodiments, the update data includes entity data and a data type identification, and the step of updating the second language model based on the update data includes:
and transmitting the entity data into an entity list corresponding to the data type identifier so as to update the second language model, wherein the entity data of the entity list is used for filling the slot corresponding to the data type identifier.
Illustratively, grammar FST contains "put xxx songs" as shown in FIG. 9. The singer slot positions correspond to the singer's xxx songs, after the updated singer's 'small A' and 'small B' are transmitted to the entity list corresponding to the singer, the singer slot positions can be replaced by 'small A' and 'small B', and the 'singing of the small A' and the 'singing of the small B' can be identified in the second language model.
Step S505: and sending the recognition result to the terminal.
The user data also comprises a terminal equipment identifier, and the server returns the identification result to the sending terminal through the terminal equipment identifier.
Embodiments of the application further provide a terminal. Taking a display device as an example, these embodiments improve some functions of the display device. As shown in FIG. 10, the controller of the display device performs the following steps:
Step S1001: detecting whether user history data is updated;
in some embodiments, voice assistant software installed in the display device may detect whether the user history data is updated. The user history data includes collected media asset data, played media asset data, and the like.
In some embodiments, the step of detecting whether the user history data is updated comprises:
acquiring user history data at a fixed time, at intervals of a preset duration, or when a preset condition is triggered;
for example, the fixed time may be acquiring the user history data at the same time every day, the preset duration may be acquiring the user history data once every 4 hours, and the preset condition may be acquiring the user history data each time the device is started.
Judging whether the user history data acquired this time is the same as the user history data acquired last time;
if the user history data acquired this time differs from the user history data acquired last time, determining that the user history data has been updated, and determining the updated data by comparison (a minimal comparison sketch follows this list);
if the user history data acquired this time is the same as the user history data acquired last time, determining that the user history data has not been updated.
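A minimal sketch of the comparison just described, assuming the user history is available as a list of media identifiers; the function and variable names are illustrative.

```python
def check_history_updated(current, previous):
    """Compare the user history fetched this time with the one fetched last time.
    Returns (updated, changed_items); the history counts as updated only when
    the two snapshots differ."""
    if current == previous:
        return False, []
    changed = [item for item in current if item not in previous]
    return True, changed


# Example: a newly played title appears only in the current snapshot.
previous = ["movie_1", "song_2"]
current = ["movie_1", "song_2", "movie_3"]
updated, diff = check_history_updated(current, previous)
print(updated, diff)  # True ['movie_3']
```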
In some embodiments, the step of detecting whether the user history data is updated comprises:
After detecting that the user has played or collected media asset data, judging whether the played or collected media asset data is the same as media asset data already in the user history data;
if the played or collected media asset data differs from the media asset data in the user history data, storing the played or collected media asset data together with an update identifier into the user history data, where the update identifier is set to 1;
if the played or collected media asset data is the same as an item already in the user history data, the operation of storing the played or collected media asset data and the update identifier into the user history data is not performed.
Acquiring the update identifiers of the user history data at a fixed time, at intervals of a preset duration, or when a preset condition is triggered;
if the update identifiers are all 0, determining that the user history data has not been updated;
if at least one update identifier is 1, determining that the user history data has been updated; the data whose update identifier is 1 is the update data (a flag-based sketch follows this list).
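The flag-based variant above can be sketched as follows. Storing records as dictionaries and resetting the flags once they have been read are assumptions made for illustration; the disclosure only describes writing and reading the identifiers.

```python
user_history = []  # each record: {"media": <name>, "update_flag": 0 or 1}


def record_playback(media_name):
    """Store a played/collected media item with update flag 1, but only if it
    is not already present in the user history."""
    if any(rec["media"] == media_name for rec in user_history):
        return  # already known, nothing to store
    user_history.append({"media": media_name, "update_flag": 1})


def collect_updates():
    """Called periodically: gather records flagged 1 (the update data) and, as an
    assumed housekeeping step, reset the flags. An empty result means the user
    history has not been updated."""
    updates = [rec["media"] for rec in user_history if rec["update_flag"] == 1]
    for rec in user_history:
        rec["update_flag"] = 0
    return updates
```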
After detecting the update of the user history data, step S1002 is executed: uploading user data to a server so that the server builds a first language model corresponding to the user identification based on the user data, wherein the user data comprises the user identification and user history data.
In some embodiments, after detecting that the user history data has been updated, the user history data (all of the data) and the user identifier are uploaded to the server so that the server creates the first language model based on the user history data.
In some embodiments, after detecting that the user history data has been updated, the user history data (only the update data) and the user identifier are uploaded to the server. It should be noted that the user history data uploaded each time is saved, and after new user history data is uploaded, the first language model is created based on the saved user history data together with the newly uploaded user history data. A minimal client-side upload sketch follows.
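The sketch below assumes a JSON payload and a hypothetical /language-model/user-data endpoint; the actual interface between terminal and server is not specified in the disclosure.

```python
import json
import urllib.request


def upload_user_data(server_url, user_id, history_data):
    """Upload the user identifier and (full or incremental) user history so the
    server can build or refresh the first (personalized) language model.
    The endpoint path and field names are assumptions for illustration."""
    payload = json.dumps({"user_id": user_id, "history": history_data}).encode("utf-8")
    request = urllib.request.Request(
        server_url + "/language-model/user-data",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```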
Step S1003: receiving voice data input by a user;
in some embodiments, the step of receiving voice data input by a user comprises:
voice data input by a user by pressing a voice key of the control device is received.
In some embodiments, the step of receiving voice data input by a user comprises:
after the user wakes up the voice assistant through the far-field wake-up word, voice data input by the user is received.
Step S1004: the voice data and the user identification are sent to a server, so that the server determines a recognition result corresponding to the voice data based on a first language model and a second language model corresponding to the user identification, wherein the second language model is a universal language model;
Step S1005: receiving the recognition result issued by the server.
After the recognition result is obtained, the operation corresponding to the recognition result is executed. A minimal sketch of this request/response step follows.
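Correspondingly, the voice query itself might be sent as in the sketch below; the endpoint, the header used to carry the user identifier, and the JSON field holding the recognized text are all assumptions, not part of the disclosure.

```python
import json
import urllib.request


def recognize(server_url, user_id, voice_bytes):
    """Send recorded voice data plus the user identifier to the server and
    return the recognition result it issues (field names are assumed)."""
    request = urllib.request.Request(
        server_url + "/asr/recognize",  # hypothetical endpoint
        data=voice_bytes,
        headers={
            "Content-Type": "application/octet-stream",
            "X-User-Id": user_id,       # assumed way of passing the user identifier
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result.get("text", "")
```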
In some embodiments, after receiving voice data input by the user, user interface data is acquired, where the user interface data is data of the current user interface;
the voice data and the user interface data are sent to the server so that the server builds a third language model based on the user interface data and determines the recognition result corresponding to the voice data based on the second language model, the third language model, and the first language model corresponding to the user identifier. Alternatively, the server updates the first language model based on the user interface data and determines the recognition result corresponding to the voice data based on the second language model and the first language model corresponding to the user identifier.
In some embodiments, for a web interface, the user interface data may be obtained through a JS script.
In some embodiments, for the user interface of application software, text data corresponding to the current user interface controls may be obtained directly.
In some embodiments, after receiving voice data input by the user, the current scene is determined;
the voice data and the scene identifier corresponding to the current scene are sent to the server so that the server determines the recognition result corresponding to the voice data based on the second language model and a fourth language model corresponding to the scene identifier.
In some embodiments, as shown in FIG. 11, when the server detects that the entity database has update data, it updates the general language model based on the update data. When the terminal detects that the user history data has been updated, it sends the user history data to the server, and the server builds a personalized language model for the user based on the user history data. The terminal receives voice data input by the user and sends it to the server; the server determines the recognition result of the voice data based on the general language model and the personalized language model and sends the recognition result to the terminal. After receiving the recognition result, the terminal executes the operation corresponding to the recognition result.
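The way the two models are combined is spelled out in claim 7 below: each candidate text receives a language score plus an acoustic score, and the candidate with the highest combined score wins. A minimal sketch of that selection rule, with made-up candidates and scores, is:

```python
def pick_recognition_result(first_lm_candidates, second_lm_candidates):
    """Each candidate is a (text, language_score, acoustic_score) triple,
    scored against the personalized (first) or general (second) language
    model plus a shared acoustic model; the candidate with the highest
    combined score becomes the recognition result."""
    all_candidates = list(first_lm_candidates) + list(second_lm_candidates)
    best_text, _, _ = max(all_candidates, key=lambda c: c[1] + c[2])
    return best_text


# Made-up log-probability scores: the personalized model ranks a
# user-specific phrasing above the generic hypothesis.
first_lm = [("play the song Xiao A sang", -3.8, -10.1)]
second_lm = [("play the song shall I sing", -6.2, -10.4)]
print(pick_recognition_result(first_lm, second_lm))  # play the song Xiao A sang
```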
Some embodiments of the present application provide a speech recognition method applicable to a server configured to: after receiving user data uploaded by a terminal, construct a first language model corresponding to a user identifier based on the user data, where the user data comprises the user identifier and user history data; after receiving voice data and a user identifier uploaded by the terminal, determine a recognition result corresponding to the voice data based on a second language model and the first language model corresponding to the user identifier, where the second language model is a general language model; and send the recognition result to the terminal. By building a personalized language model for the user from the user history data uploaded by the terminal and, after receiving voice data input by the user, determining the recognition result jointly based on the personalized language model and the general language model, the accuracy of speech recognition is improved; the overall recognition rate can be improved by about 10%.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A server, configured to:
after receiving user data uploaded by a terminal, constructing a first language model corresponding to a user identifier based on the user data, wherein the user data comprises the user identifier and user history data;
After receiving voice data and user identification uploaded by a terminal, determining a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the second language model is a general language model;
and sending the recognition result to the terminal.
2. The server of claim 1, wherein the server is configured to:
and after receiving the user interface data uploaded by the terminal, constructing a third language model based on the user interface data.
3. The server of claim 2, wherein the server performs determining the recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, and is further configured to:
and determining a recognition result corresponding to the voice data based on the second language model, the third language model and the first language model corresponding to the user identifier.
4. The server of claim 1, wherein the server is configured to:
after detecting that the entity database has update data, updating the second language model based on the update data.
5. The server of claim 4, wherein the update data includes entity data and a data type identifier corresponding to the entity data, the server performing updating the second language model based on the update data, further configured to:
and transmitting the entity data into an entity list corresponding to the data type identifier so as to update the second language model, wherein the entity data in the entity list is used for filling the slot corresponding to the data type identifier.
6. The server of claim 1, wherein the server is configured to:
after receiving a scene identifier uploaded by a terminal, creating a fourth language model corresponding to the scene identifier based on data corresponding to the scene identifier;
and determining a recognition result corresponding to the voice data based on the second language model and the fourth language model corresponding to the scene identifier.
7. The server of claim 1, wherein the server performs determining the recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, and is further configured to:
Determining a composite score of at least one first candidate text based on the first language model and the acoustic model, the composite score of the first candidate text being a sum of the language and acoustic scores of the first candidate text;
determining a composite score of at least one second candidate text based on the second language model and the acoustic model, the composite score of the second candidate text being a sum of the language and acoustic scores of the second candidate text;
and determining the candidate text with the highest comprehensive score in the second candidate text and the first candidate text as a recognition result.
8. A terminal, comprising:
a controller configured to:
receiving voice data input by a user;
the voice data and the user identification are sent to a server, so that the server determines a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the first language model is created based on user data uploaded by a terminal, the user data comprises the user identification and user history data, and the second language model is a general language model;
and receiving the recognition result issued by the server.
9. A voice recognition method applied to a server, comprising:
After receiving user data uploaded by a terminal, constructing a first language model corresponding to a user identifier based on the user data, wherein the user data comprises the user identifier and user history data;
after receiving voice data and user identification uploaded by a terminal, determining a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the second language model is a general language model;
and sending the recognition result to the terminal.
10. A voice recognition method applied to a terminal, comprising:
receiving voice data input by a user;
the voice data and the user identification are sent to a server, so that the server determines a recognition result corresponding to the voice data based on a second language model and a first language model corresponding to the user identification, wherein the first language model is created based on user data uploaded by a terminal, the user data comprises the user identification and user history data, and the second language model is a general language model;
and receiving the recognition result issued by the server.
CN202311079252.3A 2023-08-24 2023-08-24 Server, terminal and voice recognition method Pending CN117809658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311079252.3A CN117809658A (en) 2023-08-24 2023-08-24 Server, terminal and voice recognition method

Publications (1)

Publication Number Publication Date
CN117809658A true CN117809658A (en) 2024-04-02

Family

ID=90426216

Country Status (1)

Country Link
CN (1) CN117809658A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination