CN117812323A - Display device, voice recognition method, voice recognition device and storage medium



Publication number
CN117812323A
Authority
CN
China
Prior art keywords
pinyin
keyword
voice recognition
recognition result
display device
Legal status
Pending
Application number
CN202311130206.1A
Other languages
Chinese (zh)
Inventor
任晓楠
崔保磊
冯瑞平
张大钊
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202311130206.1A
Publication of CN117812323A


Classifications

    • H04N 21/42203 — Input-only peripherals for client devices: sound input device, e.g. microphone
    • G10L 15/01 — Assessment or evaluation of speech recognition systems
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H04N 21/4334 — Content storage: recording operations
    • H04N 21/47214 — End-user interface for content reservation or setting reminders
    • G10L 2015/223 — Execution procedure of a spoken command


Abstract

The embodiments of the present application relate to the technical field of display and disclose a display device, a voice recognition method, a voice recognition apparatus and a storage medium. The display device comprises a sound collector configured to receive a voice instruction input by a user, and a controller configured to: obtain a screenshot in response to the voice instruction; perform image recognition and pinyin conversion on the screenshot to obtain first mapping data; perform voice recognition on the voice instruction to obtain a first voice recognition result, where the first voice recognition result includes a second keyword; and, in the case that a target pinyin exists in the first mapping data, replace the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result, where the target pinyin is the pinyin, among at least one first pinyin, that matches the second pinyin corresponding to the second keyword. By applying this technical solution, the accuracy of voice recognition can be improved.

Description

Display device, voice recognition method, voice recognition device and storage medium
Technical Field
The embodiment of the application relates to the technical field of display, in particular to a display device, a voice recognition method, a voice recognition device and a storage medium.
Background
With the development of voice recognition technology, voice interaction is used in more and more scenarios. For example, while using a smart television, a user can input a voice instruction through the voice assistant of the smart television, and the smart television recognizes and parses the voice instruction to determine the corresponding search result and display it to the user.
However, during voice recognition, homophones are easily confused because the relationship between Chinese characters and pinyin is not one-to-one, which affects the accuracy of voice recognition. The accuracy of voice recognition therefore needs to be improved.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a display device, a voice recognition method, a voice recognition apparatus, and a storage medium, so as to solve the problem of low voice recognition accuracy in the prior art.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
according to a first aspect of embodiments of the present application, there is provided a display device, including: the sound collector is configured to receive a voice instruction input by a user; a controller coupled to the sound collector configured to: responding to the voice command, and acquiring a screenshot; performing image recognition and pinyin conversion on the screenshot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword; performing voice recognition on the voice command to obtain a first voice recognition result, wherein the first voice recognition result comprises a second keyword; and under the condition that the target pinyin exists in the first mapping data, replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result, wherein the target pinyin is the pinyin matched with the second pinyin corresponding to the second keyword in at least one first pinyin.
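The following Python sketch illustrates the replacement logic described above (an illustration only, not the claimed implementation; the pypinyin library, the helper structure and the example titles are assumptions of this sketch):

    from pypinyin import lazy_pinyin

    def correct_recognition(first_result, second_keywords, first_keywords):
        # First mapping data: each first keyword recognized in the
        # screenshot is mapped to its first pinyin (here a tone-less
        # syllable list).
        first_mapping = {kw: lazy_pinyin(kw) for kw in first_keywords}
        for second_kw in second_keywords:
            second_pinyin = lazy_pinyin(second_kw)
            for first_kw, first_pinyin in first_mapping.items():
                if first_pinyin == second_pinyin:  # target pinyin exists
                    # Replace the (possibly homophone-confused) second
                    # keyword with the target keyword shown on screen.
                    first_result = first_result.replace(second_kw, first_kw)
        return first_result  # second speech recognition result

    # Hypothetical example: the recognizer heard a homophone of the
    # on-screen title 封神 ("feng shen").
    print(correct_recognition("我想看风神", ["我", "想看", "风神"], ["封神", "影"]))
    # -> 我想看封神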
In an alternative mode, the first keyword comprises a plurality of first words, the first pinyin comprises a plurality of third pinyin arranged according to a first sequence, the plurality of third pinyin corresponds to the plurality of first words one by one, and the first sequence is the sequence of position indication of the plurality of first words in the first keyword; the second keyword comprises a plurality of second words, the second pinyin comprises a plurality of fourth pinyin arranged according to a second sequence, the plurality of fourth pinyin corresponds to the plurality of second words one by one, and the second sequence is the sequence of position indication of the plurality of second words in the second keyword; a controller, further configured to: and determining that the target pinyin exists in the first mapping data and comprises a plurality of third pinyins corresponding to the fourth pinyins when the third pinyins corresponding to the fourth pinyins exist in the first mapping data and the sequence of the third pinyins is consistent with the sequence of the fourth pinyins.
In an alternative way, the controller is specifically configured to: extracting features of the screen shots to obtain image features corresponding to the screen shots; text detection is carried out on image features corresponding to the screen shots based on a text detection algorithm, and at least one text area in the screen shots is determined; based on a text recognition algorithm, performing text recognition on at least one text region in the screenshot to obtain at least one first keyword; and performing pinyin conversion on the at least one first keyword to obtain at least one first pinyin corresponding to the at least one first keyword.
In an alternative way, the controller is specifically configured to: extracting the characteristics of the voice instruction to obtain audio characteristics; and processing the audio features through the target acoustic model and the target language model to obtain a first voice recognition result.
In an alternative, the controller is further configured to: word segmentation processing is carried out on the first voice recognition result to obtain a second keyword; and performing pinyin conversion on the second keyword to obtain a second pinyin corresponding to the second keyword.
In an alternative form, the display device further comprises a display, the controller further configured to: determining multimedia resources corresponding to the second voice recognition result according to the second voice recognition result; and controlling the display to display the multimedia resources corresponding to the second voice recognition result.
In an alternative form, the display device further comprises a display, the controller further configured to: under the condition that the target pinyin does not exist in the first mapping data, determining a multimedia resource corresponding to the first voice recognition result according to the first voice recognition result; and controlling the display to display the multimedia resource corresponding to the first voice recognition result.
According to a second aspect of embodiments of the present application, there is provided a voice recognition method, applied to the display device of the first aspect of embodiments of the present application, including: responding to the voice command, and acquiring a screenshot; performing image recognition and pinyin conversion on the screenshot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword; performing voice recognition on the voice command to obtain a first voice recognition result, wherein the first voice recognition result comprises a second keyword; and under the condition that the target pinyin exists in the first mapping data, replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result, wherein the target pinyin is the pinyin matched with the second pinyin corresponding to the second keyword in at least one first pinyin.
In an alternative mode, the first keyword comprises a plurality of first words, the first pinyin comprises a plurality of third pinyin arranged according to a first sequence, the plurality of third pinyin corresponds to the plurality of first words one by one, and the first sequence is the sequence of position indication of the plurality of first words in the first keyword; the second keyword comprises a plurality of second words, the second pinyin comprises a plurality of fourth pinyin arranged according to a second sequence, the plurality of fourth pinyin corresponds to the plurality of second words one by one, and the second sequence is the sequence of position indication of the plurality of second words in the second keyword; under the condition that target pinyin exists in the first mapping data, before replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin, the method further comprises the following steps: and determining that the target pinyin exists in the first mapping data and comprises a plurality of third pinyins corresponding to the fourth pinyins when the third pinyins corresponding to the fourth pinyins exist in the first mapping data and the sequence of the third pinyins is consistent with the sequence of the fourth pinyins.
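A minimal sketch of this order-consistent matching condition (contiguity of the matched third pinyins is an assumption of this sketch; the condition above only requires their order to be consistent):

    def contains_in_order(first_pinyin, second_pinyin):
        # True if the fourth pinyins of the second keyword occur in the
        # first pinyin as a run with the same relative order.
        m = len(second_pinyin)
        return any(first_pinyin[i:i + m] == second_pinyin
                   for i in range(len(first_pinyin) - m + 1))

    # ["a","fan","da","shui","zhi","dao"] contains ["shui","zhi","dao"]
    # in a consistent order, so a target pinyin would be found.
    assert contains_in_order(["a", "fan", "da", "shui", "zhi", "dao"],
                             ["shui", "zhi", "dao"])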
In an alternative manner, performing image recognition and pinyin conversion on the screenshot to obtain first mapping data, including: extracting features of the screen shots to obtain image features corresponding to the screen shots; text detection is carried out on image features corresponding to the screen shots based on a text detection algorithm, and at least one text area in the screen shots is determined; based on a text recognition algorithm, performing text recognition on at least one text region in the screenshot to obtain at least one first keyword; and performing pinyin conversion on the at least one first keyword to obtain at least one first pinyin corresponding to the at least one first keyword.
In an alternative manner, performing speech recognition on the speech instruction to obtain a first speech recognition result, including: extracting the characteristics of the voice command to obtain audio characteristics; and processing the audio features through a target acoustic model and a target language model to obtain the first voice recognition result.
In an alternative manner, after performing voice recognition on the voice command to obtain a first voice recognition result, the method further includes: word segmentation processing is carried out on the first voice recognition result to obtain a second keyword; and performing pinyin conversion on the second keyword to obtain a second pinyin corresponding to the second keyword.
In an optional manner, after replacing the second keyword in the first speech recognition result with the target keyword corresponding to the target pinyin to obtain the second speech recognition result, the method further includes: determining multimedia resources corresponding to the second voice recognition result according to the second voice recognition result; and controlling the display to display the multimedia resources corresponding to the second voice recognition result.
In an alternative manner, after performing voice recognition on the voice command to obtain a first voice recognition result, the method further includes: under the condition that the target pinyin does not exist in the first mapping data, determining a multimedia resource corresponding to the first voice recognition result according to the first voice recognition result; and controlling the display to display the multimedia resource corresponding to the first voice recognition result.
According to a third aspect of embodiments of the present application, there is provided a speech recognition apparatus, the apparatus comprising: the acquisition module is used for responding to the voice instruction and acquiring a screenshot; the processing module is used for carrying out image recognition and pinyin conversion on the screen shot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword; the voice recognition module is used for carrying out voice recognition on the voice instruction to obtain a first voice recognition result, wherein the first voice recognition result comprises a second keyword; and the voice recognition correction module is used for replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin under the condition that the target pinyin exists in the first mapping data to obtain a second voice recognition result, wherein the target pinyin is the pinyin matched with the second pinyin corresponding to the second keyword in at least one first pinyin.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein at least one executable instruction that, when executed on a display device, causes the display device to perform the operations of the above-described speech recognition method.
In the embodiments of the present application, since the user interface displayed by the display device may include multimedia resources recommended according to the user's usage habits as well as newly trending multimedia resources, a screenshot may be obtained when a voice instruction input by the user is received. The display device may then perform image recognition and pinyin conversion on the screenshot to obtain first mapping data reflecting the correspondence between at least one first keyword and the first pinyin corresponding to each first keyword. Meanwhile, the display device may determine a first voice recognition result from the voice instruction and correct the second keyword included in the first voice recognition result by using the first mapping data obtained from the screenshot, so as to obtain a more accurate second voice recognition result. That is, the voice recognition method provided by the embodiments of the present application can alleviate the problems that homophones are easily confused and that new trending content has a high misrecognition rate, so the accuracy of voice recognition of the display device can be improved and the user experience is better.
The foregoing description is only an overview of the technical solutions of the embodiments of the present application, and may be implemented according to the content of the specification, so that the technical means of the embodiments of the present application can be more clearly understood, and the following detailed description of the present application will be presented in order to make the foregoing and other objects, features and advantages of the embodiments of the present application more understandable.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 shows an interaction schematic diagram of a display device and a control device provided in an embodiment of the present application;
fig. 2 shows a block diagram of a configuration of a control device in an embodiment of the present application;
fig. 3 shows a hardware configuration block diagram of a display device provided in an embodiment of the present application;
fig. 4 shows a software configuration schematic diagram of a display device according to an embodiment of the present application;
fig. 5 shows a first schematic flowchart of a voice recognition method according to an embodiment of the present application;
fig. 6 shows a schematic diagram of a screenshot provided by an embodiment of the present application;
fig. 7 shows a second schematic flowchart of a voice recognition method according to an embodiment of the present application;
fig. 8 shows a third schematic flowchart of a voice recognition method according to an embodiment of the present application;
fig. 9 shows a fourth schematic flowchart of a voice recognition method according to an embodiment of the present application;
fig. 10 shows a fifth schematic flowchart of a voice recognition method according to an embodiment of the present application;
fig. 11 shows a schematic diagram of a display interface of a display device according to an embodiment of the present application;
fig. 12 shows a sixth schematic flowchart of a voice recognition method according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein.
With the development of voice recognition technology, voice interaction is used in more and more scenarios. For example, while using a smart television, a user can input a voice instruction through the voice assistant of the smart television, and the smart television recognizes and parses the voice instruction to determine the corresponding search result and display it to the user.
However, the existing voice recognition technology has the following problems. On the one hand, since the relationship between Chinese characters and pinyin is not one-to-one, homophones are easily confused during voice recognition, which affects its accuracy. For example, two different written names may both be pronounced "zhang shan"; during voice recognition they are easily confused with each other, resulting in recognition errors. On the other hand, for new trending content, it is difficult to obtain a correct recognition result during voice recognition.
In view of one or more of the above problems, embodiments of the present application provide a voice recognition method, which may include: the display device obtains a screenshot in response to a voice instruction input by the user. After obtaining the screenshot, the display device can recognize the text in the screenshot and perform pinyin conversion on it to obtain the corresponding pinyin. Meanwhile, the display device can perform voice recognition on the voice instruction to obtain a first voice recognition result, and perform word segmentation and pinyin conversion on that result to obtain recognition words and the pinyin corresponding to them. The display device can then compare the pinyin corresponding to the recognition words with the pinyin corresponding to the text in the screenshot, so as to correct the first voice recognition result using the information reflected in the screenshot and obtain a more accurate second voice recognition result. That is, the voice recognition method provided by the embodiments of the present application can alleviate the problems that homophones are easily confused and that new trending content has a high misrecognition rate, so the accuracy of voice recognition of the display device can be improved and the user experience is better.
The voice recognition method may be applied to a display device. Fig. 1 shows an interaction schematic diagram of a display device and a control device provided in an embodiment of the present application. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the remote controller may communicate with the display device 200 through an infrared protocol, a Bluetooth protocol, or another wireless or wired manner. The user may input user instructions through keys on the remote controller, voice input, a control panel, and the like to control the display device 200. For example, the user can control the display device 200 to switch the displayed page by means of the navigation keys (up, down, left, right) on the remote controller; play or pause a media asset via the play/pause key; adjust the zoom level at which the display device 200 displays a page via the volume up/down keys; or wake up the voice assistant through the voice input key and input a voice instruction to control the display device 200 to perform the corresponding operation.
In some embodiments, a user may also control the display device 200 using a smart device 300 (e.g., a mobile terminal, tablet, computer, notebook, etc. other smart devices). For example, a user may control the display device 200 through an application installed on the smart device 300 that, by configuration, may provide the user with various controls in an intuitive user interface on a screen associated with the smart device 300.
In some embodiments, the smart device 300 may establish connection and communication with a software application installed on the display device 200 through a network communication protocol, so as to achieve one-to-one control operation and data communication. For example, a control instruction protocol may be established between the smart device 300 and the display device 200, a remote-control keypad may be synchronized to the smart device 300, and the functions of the display device 200 may be controlled by operating the user interface on the smart device 300; the content displayed on the smart device 300 may also be transmitted to the display device 200 to achieve synchronous display.
In some embodiments, the display device 200 may also be controlled in manners other than through the control apparatus 100 and the smart device 300. For example, the user's voice commands may be received directly through a module for acquiring voice commands configured inside the display device 200, or through a voice control device configured outside the display device 200.
As shown in fig. 1, the display device 200 and the server 400 may exchange data in a variety of communication manners; the display device 200 may be communicatively connected through a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. For example, the display device 200 may receive software program updates, interact with an electronic program guide (EPG), or access a remotely stored digital media library by sending and receiving messages. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
The display device 200 may be a liquid crystal display, an Organic Light-Emitting Diode (OLED) display, a projection display device, a smart terminal, such as a mobile phone, a tablet computer, etc. The specific display device type, size, resolution, etc. are not limited.
Fig. 2 shows a block diagram of a configuration of the control device 100 in the exemplary embodiment of the present application, and as shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an operation instruction input by a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and may interact with the display device 200.
Taking a display device as an example of a television, fig. 3 shows a hardware configuration block diagram of a display device 200 according to an embodiment of the present application. As shown in fig. 3, the display device 200 includes: a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a receiver 280 and at least one of a memory, a power supply, a user interface.
The tuner demodulator 210 may receive broadcast television signals in a wired or wireless manner and demodulate audio/video signals and data signals, such as EPG data, from a plurality of wireless or wired broadcast television signals.
In some embodiments, communicator 220 may be a component for communicating with external devices or external servers according to various communication protocol types. For example: the communicator may include at least one of a WiFi chip, a bluetooth communication protocol chip, a wired ethernet communication protocol chip, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver.
In some embodiments, the detector 230 may be used to collect signals of or interact with the external environment, may include an optical receiver and a temperature sensor, etc.
The light receiver is a sensor for acquiring the intensity of ambient light, so that display parameters can be adjusted adaptively according to it. The temperature sensor may be used to sense the ambient temperature, so that the display device 200 can adaptively adjust the display color temperature of the image; for example, when the ambient temperature is high, the display device 200 may shift the displayed image toward a colder color temperature, and when the ambient temperature is low, toward a warmer one.
In some embodiments, the detector 230 may further include an image collector, such as a camera, a video camera, etc., which may be used to collect external environmental scenes, collect attributes of a user or interact with a user, adaptively change display parameters, and recognize a user gesture to realize an interaction function with the user.
In some embodiments, the detector 230 may also include a sound collector, such as a microphone, that may be used to receive the user's voice, for example a voice signal containing a control instruction for controlling the display device 200, or ambient sounds collected to recognize the type of environmental scene, so that the display device 200 can adapt to the ambient noise.
In some embodiments, external device interface 240 may include, but is not limited to, the following: any one or more interfaces such as a high-definition multimedia interface (High Definition Multimedia Interface, HDMI), an analog or data high-definition component input interface, a composite video input interface, a universal serial bus (Universal Serial Bus, USB) input interface, an RGB port, or the like may be used, or the interfaces may form a composite input/output interface.
As shown in fig. 3, the controller 250 may include at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), a random access memory (RAM), a read-only memory (ROM), and first to n-th input/output interfaces. A communication bus connects these components.
In some embodiments, the controller 250 may control the operation of the display device and respond to the user's operations through various software control programs stored in memory. For example, the user may input a user command through a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the command through the GUI; alternatively, the user may input a command by a specific sound or gesture, which the user input interface recognizes through a sensor.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of a user interface is a graphical user interface, which refers to a user interface related to computer operations that is displayed in a graphical manner. The control can comprise at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget (short for Widget) and other visual interface elements.
The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the display 260 may be used to display a display interface of an application. The display interface is any page in the application program.
In some embodiments, the display 260 may be used to receive audio and video signals output by the audio processor and video processor, display video content and images, play audio of the video content, and display components of a menu manipulation interface.
In some embodiments, the display 260 may be used to present a user-operated UI interface generated in the display device 200 and used to control the display device 200.
In some embodiments, the display device 200 may establish control signal and data signal transmission and reception between the communicator 220 and the external control device 100 or the content providing device.
In some embodiments, the display device 200 may receive control operations input by the user on the display interface through the receiver 280. For example, when the receiver 280 is a touch component, the touch component and the display 260 may together form a touch screen. On the touch screen, the user can input different control instructions through touch operations, such as clicking, sliding, long-pressing and double-clicking, and different touch instructions can represent different control functions.
To implement the different touch actions described above, the touch assembly 280 may generate different electrical signals when the user inputs the different touch actions, and transmit the generated electrical signals to the controller 250. The controller 250 may perform feature extraction on the received electrical signal to determine a control function to be performed by the user based on the extracted features.
In some embodiments, the receiver 280 may also be an external control component connected to the display device 200. For example, when the display device 200 is a desktop computer, the receiver 280 may be a mouse, a keyboard or the like connected to the display, and the user can input different control instructions, such as clicking, sliding and switching operations, through the mouse or keyboard.
Accordingly, when the user performs different control operations on the external control unit, the external control unit may generate different control signals in response to the control operations of the user, and transmit the generated control signals to the controller 250. The controller 250 may perform feature extraction on the received control signal to determine a control function to be performed by the user according to the extracted features.
In some embodiments, the receiver 280 may also be an external control component connected to the display device 200; taking a smart television as an example, the receiver 280 may be a remote controller. The user can input different control instructions, such as clicking, switching and voice input operations, through the remote controller.
For example, when the user presses a voice key on the remote controller, the voice recognition function of the smart television may be started, and then the voice collector of the smart television may collect a voice command input by the user, so that the controller recognizes and analyzes the voice command to execute an operation corresponding to the voice command.
Fig. 4 shows a software configuration diagram of the display device 200 in the exemplary embodiment of the present application. As shown in fig. 4, the system is divided, from top to bottom, into an application layer, a kernel layer and a hardware layer.
In some embodiments, at least one application program is running in the application program layer, and these application programs may be a Window (Window) program of an operating system, a system setting program, a clock program, or the like; or may be an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples. Illustratively, the application layer may include a speech recognition application (e.g., a voice assistant). The display device 200 may launch the speech recognition application in response to operation of the speech recognition application to initiate the speech recognition function of the display device.
The kernel layer is used as a software middleware between the hardware layer and the application layer and is used for managing and controlling hardware and software resources.
In some examples, the kernel layer may include a microphone driver and a display driver. The microphone driver may invoke an interface of the microphone to collect voice commands input by a user through the microphone (i.e., the sound collector), or invoke an interface of the microphone to set the microphone. The display driver may call an interface of the display to obtain the multimedia resource or call an interface of the display to set the display.
In some examples, while the display device 200 is in an on state, the display device 200 may obtain a screenshot of the display 260 in response to a voice instruction input by the user. The controller 250 of the display device 200 may recognize the text in the screenshot and perform pinyin conversion on that text to obtain the corresponding pinyin. Meanwhile, the controller 250 may perform voice recognition on the voice instruction to obtain a voice recognition result, and then perform word segmentation and pinyin conversion on the voice recognition result to obtain recognition words and the pinyin corresponding to them. Thereafter, the controller 250 may compare the pinyin corresponding to the recognition words with the pinyin corresponding to the text in the screenshot; if a target pinyin (i.e., a pinyin matching the pinyin corresponding to a recognition word) exists among the pinyins corresponding to the text in the screenshot, the voice recognition result is corrected according to the text corresponding to the target pinyin, and a corrected voice recognition result is obtained. The controller 250 may then control the display 260 to display the multimedia asset corresponding to the corrected voice recognition result. In this way, the voice recognition method provided by the embodiments of the present application can alleviate the problems that homophones are easily confused and that new trending content has a high misrecognition rate, thereby improving the accuracy of voice recognition of the display device and the user experience.
The following describes a voice recognition method according to an embodiment of the present application in detail with reference to fig. 5. The method in this embodiment may be implemented in a display device having the above-described hardware structure or software structure. The display device may include a sound collector, a display, and a controller coupled to the sound collector and the display, respectively. As shown in fig. 5, the controller is configured to perform the following steps 510-540.
Step 510, in response to the voice command, obtaining a screenshot.
Specifically, when the voice recognition function of the display device is enabled, the user may input a voice instruction to control the operation of the display device, for example to find multimedia assets. The voice instruction may be an instruction input by the user that reflects the user's operation intention. When the display device receives the voice instruction, it can capture the currently displayed user interface in response to the instruction to obtain a screenshot, and the information indicated by the screenshot can then be used to correct the voice recognition result of the voice instruction.
The screen shots may be user interfaces that the display device displays upon receiving a voice command entered by a user. That is, when the display device receives a voice command input by a user, the display device captures a currently displayed user interface in response to the voice command, and obtains a screenshot. The user interface displayed by the display device when receiving the voice command input by the user can comprise multimedia resource information recommended according to the watching habit of the user, and can also comprise current new hot multimedia resource information.
Illustratively, the user interface displayed by the display device upon receiving a voice command input by the user may be a user-switched interface (e.g., a "recommended" interface, a "drama" interface, etc.).
For example, after the display device is powered on, the user interface of the display device may include tab titles such as "My", "Cable TV", "Juvenile", "Recommended", "VIP", "Drama", "Movie", "Variety" and so forth. Suppose the user switches the user interface of the display device to the "Recommended" interface, which may include information on new trending multimedia resources (e.g., movies, dramas and variety shows) such as "封神" (Feng Shen), "阿凡达-水之道" (Avatar: The Way of Water), "影" (Shadow), "Acquaintance Year", "Mania" and "种地吧" (Zhong Di Ba). When a voice instruction input by the user is received at this time, the display device captures the currently displayed user interface (i.e., the "Recommended" interface) to obtain a screenshot. That is, as shown in fig. 6, the screenshot 60 may include the above-mentioned trending multimedia resource information.
In some embodiments, prior to obtaining the screenshot in response to the voice instruction, the controller is further configured to: in response to the wake-up instruction, a voice recognition function of the display device is initiated.
Specifically, when the user uses the voice recognition function of the display device while the display device is in the on state, the user may wake up the voice assistant of the display device by inputting a wake-up instruction (e.g., a wake-up word "small focus").
The wake-up instruction can be used for starting a voice recognition function of the display device, so that a user can control the display device to work through the voice instruction. That is, the wake-up instruction may be used to wake up a speech recognition application (e.g., a voice assistant) installed on the display device to cause the speech recognition application of the display device to enter an operational state.
The wake-up instruction may be, for example, a trigger operation of a voice key of an external control component (such as a remote controller) of the display device by a user, a trigger operation of a voice recognition control on a user interface of the display device by a user, or a voice wake-up instruction. For example, taking a wake-up instruction as a voice wake-up instruction, a user may wake up a voice assistant of the display device, i.e., initiate a voice recognition function of the display device, by a wake-up word (e.g., "gatekeeper").
Step 520, performing image recognition and pinyin conversion on the screenshot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword.
The first keyword may be a keyword corresponding to text content in the screenshot, or a keyword corresponding to person information indicated by a face image (such as an actor's) in the screenshot. By recognizing the screenshot, one first keyword or a plurality of first keywords can be obtained. For example, taking the screenshot shown in fig. 6 as an example, the screenshot may include a plurality of first keywords, namely "封神", "阿凡达-水之道", "影", "Acquaintance Year", "Mania", "种地吧" and so forth.
In this embodiment, since the text content in the screenshot may consist of one character or of several characters, and likewise for the person information indicated by a face image (such as an actor's) in the screenshot, the first keyword may be a single character, a word or a phrase; that is, the first keyword may include one first word or a plurality of first words. For example, when the first keyword includes one first word, the first keyword may be "影". For another example, when the first keyword includes a plurality of first words, the first keyword may be "封神", "阿凡达-水之道", "Acquaintance Year", "Mania" or "种地吧".
The first pinyin may be the pinyin text corresponding to the first keyword; that is, a plurality of first keywords correspond to a plurality of first pinyins. When the first keyword includes one first word, the first pinyin may be the pinyin corresponding to that first word. For example, if the first keyword is "影", the corresponding first pinyin is "ying". When the first keyword includes a plurality of first words, the first pinyin may include the third pinyin corresponding to each of the plurality of first words. For example, if the first keyword is "封神", the corresponding first pinyin includes the third pinyins of the two first words, i.e., the pinyin "feng" corresponding to "封" and the pinyin "shen" corresponding to "神". For another example, if the first keyword is "阿凡达-水之道", the corresponding first pinyin includes the third pinyins of the six first words, i.e., the pinyins "a", "fan" and "da" corresponding to "阿凡达", the pinyin "shui" corresponding to "水", the pinyin "zhi" corresponding to "之", and the pinyin "dao" corresponding to "道".
The first mapping data may be data reflecting correspondence between different first keywords and first pinyin corresponding to the first keywords. That is, the first mapping data may include at least one first keyword, a first pinyin corresponding to each first keyword in the at least one first keyword, and a correspondence between each first keyword and the first pinyin corresponding to the first keyword. The first mapping data may be, for example, a first mapping table.
It should be noted that, in a case where the first keyword includes a plurality of first words, a plurality of third pinyin corresponding to the plurality of first words included in the first mapping data may be arranged in a first order, where the first order may be an order in which positions of the plurality of first words in the first keyword are indicated.
For example, taking the screenshot shown in fig. 6 as an example, the first mapping data (i.e., the first mapping table) is shown in table 1 below:
TABLE 1

First keyword | First pinyin (third pinyins in the first order)
封神 (Feng Shen) | feng, shen
阿凡达-水之道 (Avatar: The Way of Water) | a, fan, da, shui, zhi, dao
影 (Shadow) | ying
种地吧 (Zhong Di Ba) | zhong, di, ba

As shown in Table 1, the third pinyins included in each first pinyin are arranged in the first order, i.e., the order indicated by the positions of the corresponding first words in the first keyword. For example, the pinyin "feng" corresponding to "封" and the pinyin "shen" corresponding to "神" are in the order "feng", "shen"; the pinyins "zhong" corresponding to "种", "di" corresponding to "地" and "ba" corresponding to "吧" are in the order "zhong", "di", "ba".
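Expressed as a data structure, Table 1 corresponds to a dictionary from first keywords to ordered pinyin lists. A sketch using the pypinyin library (the hyphen of the second title is dropped here for simplicity):

    from pypinyin import lazy_pinyin

    # First mapping data for the example screenshot: each first keyword
    # maps to its third pinyins arranged in the first order.
    first_keywords = ["封神", "阿凡达水之道", "影", "种地吧"]
    first_mapping = {kw: lazy_pinyin(kw) for kw in first_keywords}

    print(first_mapping["封神"])    # ['feng', 'shen']
    print(first_mapping["种地吧"])  # ['zhong', 'di', 'ba']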
In some embodiments, referring to fig. 7, the step of performing image recognition and pinyin conversion on the screenshot to obtain the first mapping data may specifically include steps 710 to 740.
And 710, extracting features of the screen shots to obtain image features corresponding to the screen shots.
Specifically, feature extraction may be performed on the screen shots using a feature pyramid network (Feature Pyramid Network, FPN) to obtain image features of different scales. Wherein image features of different scales may include global and local information of the screen shots to facilitate recognition of text of different scales.
In some embodiments, to improve accuracy of recognition of the screen shots, the controller is further configured to, prior to feature extraction of the screen shots to obtain image features corresponding to the screen shots: the screen shots are preprocessed, wherein the preprocessing can comprise graying processing, binarizing processing and smoothing processing. After preprocessing the screen shots, feature extraction can be performed on the preprocessed screen shots to obtain image features corresponding to the screen shots.
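A minimal OpenCV sketch of this preprocessing (the file name and parameter choices are assumptions):

    import cv2

    img = cv2.imread("screenshot.png")            # assumed file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # graying
    smooth = cv2.GaussianBlur(gray, (3, 3), 0)    # smoothing
    _, binary = cv2.threshold(                    # binarizing (Otsu)
        smooth, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("preprocessed.png", binary)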
And step 720, performing text detection on the image features corresponding to the screen shots based on a text detection algorithm, and determining at least one text area in the screen shots.
The text detection algorithm may be CTPN (Connectionist Text Proposal Network) text detection network or EAST (Efficient and Accurate Scene Text Detector) scene text detection algorithm.
In some examples, text detection of image features corresponding to a screenshot based on a text detection algorithm, determining at least one text region in the screenshot may include: detecting text regions of the screen shot to obtain text edge information of each text region in the screen shot; a text box for each text region is determined based on the text edge information for each text region.
In some embodiments, to further improve accuracy of recognition of the screen shots, after determining a text box for each text region based on text edge information of each text region, correction processing is performed on each text region according to the text box for each text region, resulting in at least one corrected text region.
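The sketch below illustrates the detection step in simplified form, using connected contours as a stand-in for a trained CTPN/EAST detector; it returns one axis-aligned text box per candidate region:

    import cv2

    def detect_text_boxes(binary_img):
        # Candidate text regions from contour (edge) information; a
        # trained CTPN or EAST model would be used in practice.
        contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = []
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w > 10 and h > 10:  # heuristic: drop tiny noise regions
                boxes.append((x, y, w, h))
        return boxes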
And step 730, performing text recognition on at least one text area in the screenshot based on a text recognition algorithm to obtain at least one first keyword.
The text recognition algorithm may be an optical character recognition (OCR) algorithm or a deep learning algorithm, such as a convolutional recurrent neural network (CRNN).
Step 740, performing pinyin conversion on the at least one first keyword to obtain at least one first pinyin corresponding to the at least one first keyword.
The pinyin conversion may be a process of converting a text corresponding to the first keyword into a corresponding pinyin. Note that the pinyin includes syllables and tones.
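Since the pinyin includes both syllables and tones, a conversion that keeps the tone can be sketched with pypinyin, where the TONE3 style appends the tone digit to each syllable:

    from pypinyin import pinyin, Style

    print(pinyin("封神", style=Style.TONE3))  # [['feng1'], ['shen2']]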
In this embodiment, feature extraction is performed on the screenshot by using the feature pyramid network, and then at least one first keyword can be identified from the screenshot based on the detection and identification of the extracted image features, so that the accuracy is high. Furthermore, by combining the subsequent steps and utilizing the characteristic that homonyms are the same in pronunciation, the second keywords included in the first voice recognition result can be corrected based on the first mapping data obtained through screenshot, so that a more accurate second voice recognition result is obtained.
In step 530, the voice command is subjected to voice recognition, so as to obtain a first voice recognition result, wherein the first voice recognition result comprises the second keyword.
In this step, when a voice command input by a user is received, in response to the voice command, the display device may perform voice recognition on the voice command to obtain a first voice recognition result. The first voice recognition result may include text content corresponding to the voice command, that is, the first voice recognition result may include at least one second keyword included in the text content corresponding to the voice command.
It should be noted that, step 530 may be performed after step 520, step 530 may be performed before step 520, and step 530 and step 520 may be performed simultaneously, and the execution sequence of step 530 and step 520 is not limited in this embodiment of the present application.
In some embodiments, referring to fig. 8, the step of performing speech recognition on the speech command to obtain a first speech recognition result may specifically include steps 810 to 820.
And step 810, extracting the characteristics of the voice instruction to obtain the audio characteristics.
Illustratively, any one of linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), perceptual linear predictive (PLP) parameters and Mel-scale filter bank (FBANK) features may be employed to extract the audio feature information from the voice instruction.
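For instance, MFCC extraction (one of the options listed above) might be sketched with the librosa library; the file name and parameters are assumptions:

    import librosa

    waveform, sr = librosa.load("command.wav", sr=16000)  # assumed file
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
    print(mfcc.shape)  # (13, number_of_frames)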
In some embodiments, to improve the accuracy of speech recognition, prior to feature extraction of the speech instructions to obtain the audio features, the controller is further configured to: the voice instructions are preprocessed, wherein the preprocessing may include noise reduction processing. After preprocessing the voice command, feature extraction can be performed on the preprocessed voice command to obtain audio features.
And step 820, processing the audio features through the target acoustic model and the target language model to obtain a first voice recognition result.
Specifically, the audio feature information may be converted into text by the target acoustic model, and the text obtained by the conversion may be subjected to context processing and error correction by the target language model to obtain a first speech recognition result. The target acoustic model may be a Hidden Markov Model (HMM), a deep learning model (e.g., a recurrent neural network), or the like, among others.
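The two-stage decoding can be pictured structurally as follows; the acoustic and language models are deliberately left as placeholder callables, since the patent allows several model families (e.g., an HMM or a recurrent network), and nothing here names a real API:

from typing import Callable

def decode(audio_features,
           acoustic_model: Callable[[object], str],
           language_model: Callable[[str], str]) -> str:
    """The acoustic model maps audio features to raw text; the language model
    then applies context processing and error correction to that text."""
    raw_text = acoustic_model(audio_features)
    return language_model(raw_text)  # the first voice recognition result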
In some examples, the voice instruction may be recognized using a speech recognition algorithm to obtain the first voice recognition result. The algorithm may be based on linguistics and acoustics, a stochastic model method, an artificial neural network, probabilistic grammar analysis, or the like.
In some embodiments, referring to fig. 9, after performing voice recognition on the voice command to obtain a first voice recognition result, the controller is further configured to: steps 910 to 920.
Step 910, performing word segmentation processing on the first speech recognition result to obtain a second keyword.
Word segmentation processing is performed on the first voice recognition result, dividing it into one or more groups of characters to obtain at least one second keyword. A second keyword is a keyword contained in the text content of the first voice recognition result; the segmentation may yield one second keyword or several. For example, taking the first speech recognition result "I want to see Fengshen", it may be divided into three second keywords: "I", "want to see" and "Fengshen".
In this embodiment, the first speech recognition result (i.e., the voice instruction) may consist of one word or of several words; that is, it may be a single word or a whole sentence. Accordingly, a second keyword obtained from the first speech recognition result may include one second word or a plurality of second words. For example, a second keyword with one second word may be "I"; second keywords with several second words may be "want to see" or "Fengshen".
And step 920, performing pinyin conversion on the second keyword to obtain a second pinyin corresponding to the second keyword.
The second pinyin may be the pinyin text corresponding to the second keyword. For example, when the second keyword includes one second word, the second pinyin may be the pinyin of that second word. For example, if the second keyword is "I", the corresponding second pinyin is "wo".
When the second keyword includes a plurality of second words, the second pinyin may include a fourth pinyin for each of the second words, the fourth pinyins being arranged in a second order (the order indicated by the positions of the second words within the second keyword), with two adjacent fourth pinyins separated by a separator.
For example, the second keyword is "fan" and the second pinyin corresponding to the second keyword includes the fourth pinyin of the two second words, that is, the second pinyin includes the pinyin "feng" corresponding to the "fan" and the pinyin "shen" corresponding to the "fan", that is, the second pinyin is "feng, shen".
It should be noted that the separator between two adjacent pinyins may also be another symbol, such as "·" or "/"; the specific type of separator is not limited in the embodiments of the present application.
In this embodiment, feature extraction is performed on the voice instruction to obtain the audio features, and the audio features are further processed through the target acoustic model and the target language model, so that the first voice recognition result can be obtained with high accuracy. Word segmentation is then performed on the first voice recognition result to obtain the second keywords, and pinyin conversion is performed on each second keyword to obtain its second pinyin. In combination with the subsequent steps, the property that homophones share the same pronunciation allows the second pinyin of each second keyword to be compared with the pinyin in the first mapping data, so that the first voice recognition result is corrected using the information contained in the screenshot and a more accurate second voice recognition result is obtained.
In this embodiment, after performing pinyin conversion on the second keyword to obtain a second pinyin corresponding to the second keyword, the second mapping data is determined according to the second pinyin corresponding to the second keyword.
The second mapping data may be data reflecting correspondence between different second keywords and second pinyin corresponding to the second keywords. That is, the second mapping data may include at least one second keyword, a second pinyin corresponding to each second keyword in the at least one second keyword, and a correspondence between each second keyword and the second pinyin corresponding to the second keyword. The second mapping data may be, for example, a second mapping table.
For example, continuing with the first speech recognition result "I want to see Fengshen", the second mapping data (i.e., the second mapping table) is shown in Table 2 below:
TABLE 2

Second keyword    Second pinyin
I                 wo
Want to see       xiang,kan
Fengshen          feng,shen
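A Table-2-like second mapping can be sketched with jieba for word segmentation and pypinyin for the conversion (both are assumptions; the patent names neither, and jieba's actual segment boundaries may differ from the three keywords shown above). lazy_pinyin omits tones, matching the tone-less pinyin in Table 2:

import jieba
from pypinyin import lazy_pinyin

def build_second_mapping(first_result):
    """Map each second keyword of the first result to its second pinyin."""
    return {kw: ",".join(lazy_pinyin(kw)) for kw in jieba.lcut(first_result)}

print(build_second_mapping("我想看风神"))
# e.g. {'我': 'wo', '想': 'xiang', '看': 'kan', '风神': 'feng,shen'}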
Step 540, under the condition that the target pinyin exists in the first mapping data, replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result, wherein the target pinyin is a pinyin matched with the second pinyin corresponding to the second keyword in at least one first pinyin.
In this step, after the second mapping data is obtained from the first speech recognition result, the second pinyin of each second keyword in the second mapping data may be compared with the pinyin of each first word in each first keyword in the first mapping data (i.e., the first pinyin or the third pinyin) to determine whether a target pinyin exists in the first mapping data. If a target pinyin exists, the second keyword in the first speech recognition result is replaced with the target keyword corresponding to the target pinyin, yielding the corrected second speech recognition result.
The target pinyin is a pinyin matched with a second pinyin corresponding to the second keyword in the first mapping data. For example, when the second keyword includes a second word, the target pinyin is the same pinyin in the first mapping data as the second pinyin corresponding to the second keyword. For example, the second keyword is "cherry", the second pinyin corresponding to the second keyword is "ying", and the target pinyin is "ying" corresponding to the "shadow" in the first mapping data.
For example, when the second keyword includes a plurality of second words, each second word corresponds to a fourth pinyin, and the target pinyin includes, for each second word, the same pinyin as its fourth pinyin; that is, the target pinyin includes a plurality of third pinyins corresponding to the plurality of fourth pinyins, arranged in the same order as the fourth pinyins of the second words. For example, for the second keyword "Fengshen" with second pinyin "feng,shen", the target pinyin is two consecutive pinyins "feng" and "shen" in the first mapping data, with "feng" ordered before "shen".
In some embodiments, in the case that the second keyword includes one second word, determining whether the target pinyin exists in the first mapping data may include: when the first mapping data contains the same pinyin as the second pinyin of the second keyword (namely, the fourth pinyin corresponding to that second word), determining that the target pinyin exists in the first mapping data. The target pinyin is the pinyin identical to the second pinyin of the second keyword.
In some embodiments, the second keyword includes a plurality of second words, the second pinyin includes a plurality of fourth pinyin arranged in a second order, the plurality of fourth pinyin is in one-to-one correspondence with the plurality of second words, and the second order is an order in which positions of the plurality of second words in the second keyword are indicated. Determining whether the target pinyin exists in the first mapping data may include: and determining that the target pinyin exists in the first mapping data and comprises a plurality of third pinyins corresponding to the fourth pinyins when the third pinyins corresponding to the fourth pinyins exist in the first mapping data and the sequence of the third pinyins is consistent with the sequence of the fourth pinyins.
In this embodiment, when the second keyword includes a plurality of second words, it may be determined whether the first mapping data includes pinyin corresponding to each second word, and it is further required to determine whether the order of the pinyin corresponding to each second word included in the first mapping data meets the requirement, so that accuracy of the second speech recognition result is further improved.
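The matching rule just described, namely that the fourth pinyins must all appear in the first mapping data consecutively and in the same order, can be sketched as follows, together with the replacement of step 540 (tone-less pinyin; the mapping entry and all names are illustrative; "风神" and "封神" are homophones sharing the pinyin "feng,shen"):

def find_target_keyword(second_pinyin, first_mapping):
    """Return the first keyword whose pinyin contains the second pinyin as a
    consecutive, order-preserving run of syllables, or None if no target
    pinyin exists in the first mapping data."""
    want = second_pinyin.split(",")
    for first_pinyin, first_keyword in first_mapping.items():
        have = first_pinyin.split(",")
        for start in range(len(have) - len(want) + 1):
            if have[start:start + len(want)] == want:
                return first_keyword
    return None

first_mapping = {"feng,shen": "封神"}  # hypothetical Table-1 entry from the screenshot
first_result = "我想看风神"             # first voice recognition result
target = find_target_keyword("feng,shen", first_mapping)
if target is not None:
    # The homophonous on-screen title replaces the misrecognized word.
    second_result = first_result.replace("风神", target)  # "我想看封神"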
For example, take the first mapping data as the first mapping table (i.e., Table 1) and the first speech recognition result as "I want to see Fengshen", whose second mapping data is the second mapping table (i.e., Table 2).
As shown in Table 2, the first speech recognition result includes three second keywords, "I", "want to see" and "Fengshen", where the pinyin corresponding to "I" is "wo", the pinyin corresponding to "want to see" is "xiang,kan", and the pinyin corresponding to "Fengshen" is "feng,shen".
For each second keyword, it is detected whether a target pinyin matching its second pinyin exists in the first mapping data. That is, the pinyin "wo" corresponding to "I" is first compared with the pinyin of each first word in the first mapping table. Referring to Table 1, the pinyin "wo" does not exist in the first mapping data.
Then the next second keyword, "want to see", is detected. Its second pinyin is "xiang,kan". "xiang" is compared with the pinyin of each first word in the first mapping table; referring to Table 1, the pinyin "xiang" does not exist in the first mapping data. Next, "kan" is compared in the same way; referring to Table 1, the pinyin "kan" does not exist in the first mapping data either.
Then the next second keyword, "Fengshen", is detected. Its second pinyin is "feng,shen". "feng" is compared with the pinyin of each first word in the first mapping table; referring to Table 1, the pinyin "feng" exists in the first mapping data. "shen" is then compared in the same way; referring to Table 1, the pinyin "shen" also exists in the first mapping data. Next, it is determined whether "feng" and "shen" are two consecutive pinyins in the first mapping data with "feng" ordered before "shen". If so, a target pinyin matching the second pinyin of the second keyword "Fengshen" exists in the first mapping data, namely the target pinyin comprising "feng" and "shen". At this point, the second keyword "Fengshen" in the first speech recognition result "I want to see Fengshen" is replaced with the target keyword corresponding to the target pinyin, i.e., the homophonous on-screen keyword "Fengshen" (a different written word with the same pinyin). After all second keywords in the first voice recognition result have been traversed, the error correction of the first voice recognition result is complete and the second voice recognition result "I want to see Fengshen" is obtained.
In some embodiments, referring to fig. 10, after replacing the second keyword in the first speech recognition result with the target keyword corresponding to the target pinyin to obtain the second speech recognition result, the controller is further configured to: step 1010 to step 1020.
Step 1010, determining the multimedia resource corresponding to the second voice recognition result according to the second voice recognition result.
In step 1020, the display is controlled to display the multimedia resource corresponding to the second speech recognition result.
For example, referring to fig. 11, taking the second speech recognition result "I want to see Fengshen", the display device may acquire the multimedia resources related to "Fengshen" and control the display to display them, such as "Fengshen"-related films and clips.
In this embodiment, after the first speech recognition result is corrected according to the information in the screenshot to obtain the second speech recognition result, the multimedia resources corresponding to the second speech recognition result can be displayed to the user, so that the user can select the resource to watch, which is more convenient. Furthermore, by acquiring and displaying the multimedia resources corresponding to the corrected second voice recognition result, the display device can accurately execute the operation intention of the voice instruction, giving a better user experience.
In some embodiments, after speech recognition of the speech instruction, resulting in a first speech recognition result, the controller is further configured to: under the condition that the target pinyin does not exist in the first mapping data, determining a multimedia resource corresponding to the first voice recognition result according to the first voice recognition result; and controlling the display to display the multimedia resource corresponding to the first voice recognition result.
In this embodiment, when the target pinyin does not exist in the first mapping data, the multimedia resource corresponding to the first speech recognition result may be displayed to the user, which is more convenient to use.
In the embodiment of the present application, since the user interface displayed by the display device may include multimedia resources recommended according to the user's usage habits as well as newly popular (trending) multimedia resources, a screenshot may be acquired when a voice instruction input by the user is received, so that the display device can perform image recognition and pinyin conversion on the screenshot and obtain the first mapping data reflecting the correspondence between at least one first keyword and the first pinyin of each first keyword. Meanwhile, the display device can determine the first voice recognition result from the voice instruction and correct the second keyword included in the first voice recognition result using the first mapping data obtained from the screenshot, so as to obtain a more accurate second voice recognition result. That is, the voice recognition method provided by the embodiments of the present application alleviates the problems that homophones are easily confused during recognition and that newly popular content has a high misrecognition rate, thereby improving the voice recognition accuracy of the display device and the user experience.
In order to facilitate understanding of the speech recognition method provided in the embodiment of the present application, a specific example will be described below with reference to fig. 12. As shown in fig. 12, the voice recognition method includes steps 1201 to 1215.
Step 1201, a voice command is received.
Step 1202, in response to a voice instruction, a screenshot is obtained.
Step 1203, extracting features of the screenshot to obtain image features corresponding to the screenshot.
And step 1204, performing text detection on the image features corresponding to the screen shots based on a text detection algorithm, and determining at least one text region in the screen shots.
And step 1205, performing text recognition on at least one text area in the screenshot based on a text recognition algorithm to obtain at least one first keyword.
In step 1206, pinyin conversion is performed on at least one first keyword to obtain at least one first pinyin corresponding to the at least one first keyword, thereby obtaining first mapping data.
In step 1207, feature extraction is performed on the voice command to obtain audio features.
And 1208, processing the audio features through the target acoustic model and the target language model to obtain a first voice recognition result.
Step 1209, performing word segmentation processing on the first voice recognition result to obtain a plurality of second keywords.
Step 1210, performing pinyin conversion on each second keyword in the plurality of second keywords to obtain a plurality of second pinyin corresponding to the plurality of second keywords.
Step 1211, selecting a current keyword from the plurality of second keywords one by one.
Step 1212, determining whether there is a target pinyin in the first mapping data that matches the second pinyin corresponding to the current keyword, if so, executing step 1213, otherwise, executing step 1214.
And 1213, replacing the current keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result.
Step 1214, it is determined whether the traversal of the plurality of second keywords in the first speech recognition result is completed, if yes, step 1215 is performed, otherwise, step 1211 is returned.
Step 1215, acquiring and displaying the multimedia resource corresponding to the second speech recognition result.
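Tying the steps together, the flow of fig. 12 can be sketched end-to-end in a few lines. This assumes the helper functions from the earlier sketches (recognize_keywords, extract_audio_features, decode, build_second_mapping, find_target_keyword) are in scope; all of them are illustrative stand-ins rather than the patent's actual modules:

from pypinyin import lazy_pinyin

def handle_voice_instruction(wav_path, screenshot_path, text_boxes,
                             acoustic_model, language_model):
    # Steps 1202-1206: screenshot -> first keywords -> first mapping data.
    first_keywords = recognize_keywords(screenshot_path, text_boxes)
    first_mapping = {",".join(lazy_pinyin(kw)): kw for kw in first_keywords}

    # Steps 1207-1208: voice instruction -> first voice recognition result.
    features = extract_audio_features(wav_path)
    first_result = decode(features, acoustic_model, language_model)

    # Steps 1209-1214: traverse the second keywords, match pinyin, replace.
    second_result = first_result
    for second_keyword, second_pinyin in build_second_mapping(first_result).items():
        target = find_target_keyword(second_pinyin, first_mapping)
        if target is not None:
            second_result = second_result.replace(second_keyword, target)

    # Step 1215: the corrected result drives the multimedia resource lookup.
    return second_result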
In this example, since the user interface displayed by the display device may include multimedia resources recommended according to the user's usage habits as well as newly popular (trending) multimedia resources, a screenshot may be acquired when a voice instruction input by the user is received, so that the display device can perform image recognition and pinyin conversion on the screenshot and obtain the first mapping data reflecting the correspondence between at least one first keyword and the first pinyin of each first keyword. Meanwhile, the display device can determine the first voice recognition result from the voice instruction and correct the second keywords included in the first voice recognition result using the first mapping data obtained from the screenshot, so as to obtain a more accurate second voice recognition result. In this way, the problems that homophones are easily confused during recognition and that newly popular content has a high misrecognition rate are alleviated, improving the voice recognition accuracy of the display device and the user experience.
Fig. 13 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention. As shown in fig. 13, the voice recognition apparatus 1300 may include: an obtaining module 1301, configured to obtain a screenshot in response to a voice instruction; the processing module 1302 is configured to perform image recognition and pinyin conversion on the screenshot to obtain first mapping data, where the first mapping data includes at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword; the voice recognition module 1303 is configured to perform voice recognition on the voice command to obtain a first voice recognition result, where the first voice recognition result includes a second keyword; the speech recognition correction module 1304 is configured to replace a second keyword in the first speech recognition result with a target keyword corresponding to the target pinyin when the target pinyin exists in the first mapping data, so as to obtain a second speech recognition result, where the target pinyin is a pinyin matched with the second pinyin corresponding to the second keyword in at least one first pinyin.
In some embodiments, the first keyword includes a plurality of first words, the first pinyin includes a plurality of third pinyin arranged in a first order, the plurality of third pinyin respectively corresponds to the plurality of first words one-to-one, the first order being an order in which positions of the plurality of first words in the first keyword are indicated; the second keyword comprises a plurality of second words, the second pinyin comprises a plurality of fourth pinyin arranged according to a second sequence, the plurality of fourth pinyin corresponds to the plurality of second words one by one, and the second sequence is the sequence of position indication of the plurality of second words in the second keyword; the voice recognition apparatus 1300 may further include: the first determining module is configured to determine that a target pinyin exists in the first mapping data, where the target pinyin includes a plurality of third pinyins corresponding to the plurality of fourth pinyins, where the plurality of third pinyins exists in the first mapping data and where the ordering of the plurality of third pinyins is consistent with the ordering of the plurality of fourth pinyins.
In some embodiments, the processing module 1302 is specifically configured to: extract features of the screenshot to obtain image features corresponding to the screenshot; perform text detection on the image features corresponding to the screenshot based on a text detection algorithm, and determine at least one text region in the screenshot; perform text recognition on the at least one text region in the screenshot based on a text recognition algorithm to obtain at least one first keyword; and perform pinyin conversion on the at least one first keyword to obtain the at least one first pinyin corresponding to the at least one first keyword.
In some embodiments, the voice recognition module 1303 is specifically configured to: extracting the characteristics of the voice instruction to obtain audio characteristics; and processing the audio features through the target acoustic model and the target language model to obtain a first voice recognition result.
In some embodiments, the speech recognition apparatus 1300 may further include: the word segmentation module is used for carrying out word segmentation processing on the first voice recognition result to obtain a second keyword; and the pinyin conversion module is used for performing pinyin conversion on the second keyword to obtain a second pinyin corresponding to the second keyword.
In some embodiments, the speech recognition apparatus 1300 may further include: the second determining module is used for determining multimedia resources corresponding to the second voice recognition result according to the second voice recognition result; and the display module is used for displaying the multimedia resources corresponding to the second voice recognition result.
In some embodiments, the speech recognition apparatus 1300 may further include: the third determining module is used for determining multimedia resources corresponding to the first voice recognition result according to the first voice recognition result under the condition that the target pinyin does not exist in the first mapping data; and the display module is used for displaying the multimedia resources corresponding to the first voice recognition result.
Embodiments of the present application provide a computer readable storage medium storing at least one executable instruction that, when executed on a display device/apparatus, causes the display device/apparatus to perform the speech recognition method of any of the method embodiments described above.
The executable instructions may specifically cause the display device/apparatus to: acquire a screenshot in response to a voice instruction; perform image recognition and pinyin conversion on the screenshot to obtain first mapping data; perform voice recognition on the voice instruction to obtain a first voice recognition result including a second keyword; and, in the case that a target pinyin exists in the first mapping data, replace the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result.
In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. In addition, embodiments of the present application are not directed to any particular programming language.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the present application may be practiced without these specific details. Similarly, in the above description of exemplary embodiments of the application, various features of embodiments of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. Wherein the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from those of the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may likewise be divided into a plurality of sub-modules or sub-units or sub-components; such combinations and divisions are possible except where at least some of the features and/or processes or elements concerned are mutually exclusive.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the execution order unless specifically stated.

Claims (10)

1. A display device, characterized by comprising:
the sound collector is configured to receive a voice instruction input by a user;
a controller coupled to the sound collector configured to:
responding to the voice command, and acquiring a screenshot;
performing image recognition and pinyin conversion on the screenshot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword;
performing voice recognition on the voice command to obtain a first voice recognition result, wherein the first voice recognition result comprises a second keyword;
and under the condition that target pinyin exists in the first mapping data, replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result, wherein the target pinyin is a pinyin matched with the second pinyin corresponding to the second keyword in the at least one first pinyin.
2. The display device of claim 1, wherein the first keyword comprises a plurality of first words, the first pinyin comprises a plurality of third pinyin arranged in a first order, the plurality of third pinyin respectively corresponds to the plurality of first words one to one, and the first order is an order in which positions of the plurality of first words in the first keyword are indicated;
The second keyword comprises a plurality of second words, the second pinyin comprises a plurality of fourth pinyin arranged according to a second sequence, the fourth pinyin corresponds to the second words one by one, and the second sequence is the sequence of position indication of the second words in the second keyword;
the controller is further configured to:
and determining that a target pinyin exists in the first mapping data and comprises a plurality of third pinyins corresponding to the fourth pinyins when the third pinyins corresponding to the fourth pinyins exist in the first mapping data and the sequence of the third pinyins is consistent with that of the fourth pinyins.
3. The display device of claim 1, wherein the controller is specifically configured to:
extracting features of the screen shots to obtain image features corresponding to the screen shots;
text detection is carried out on image features corresponding to the screen shots based on a text detection algorithm, and at least one text area in the screen shots is determined;
based on a text recognition algorithm, performing text recognition on at least one text region in the screenshot to obtain at least one first keyword;
And performing pinyin conversion on the at least one first keyword to obtain the at least one first pinyin corresponding to the at least one first keyword.
4. The display device of claim 1, wherein the controller is specifically configured to:
extracting the characteristics of the voice command to obtain audio characteristics;
and processing the audio features through a target acoustic model and a target language model to obtain the first voice recognition result.
5. The display device of claim 1, wherein the controller is further configured to:
word segmentation processing is carried out on the first voice recognition result to obtain the second keyword;
and performing pinyin conversion on the second keyword to obtain a second pinyin corresponding to the second keyword.
6. The display device of any of claims 1-5, wherein the display device further comprises a display, the controller further configured to:
determining a multimedia resource corresponding to the second voice recognition result according to the second voice recognition result;
and controlling the display to display the multimedia resource corresponding to the second voice recognition result.
7. The display device of any of claims 1-5, wherein the display device further comprises a display, the controller further configured to:
under the condition that the target pinyin does not exist in the first mapping data, determining a multimedia resource corresponding to the first voice recognition result according to the first voice recognition result;
and controlling the display to display the multimedia resource corresponding to the first voice recognition result.
8. A method of speech recognition, for application to a display device, the method comprising:
responding to the voice command, and acquiring a screenshot;
performing image recognition and pinyin conversion on the screenshot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword;
performing voice recognition on the voice command to obtain a first voice recognition result, wherein the first voice recognition result comprises a second keyword;
and under the condition that target pinyin exists in the first mapping data, replacing the second keyword in the first voice recognition result with the target keyword corresponding to the target pinyin to obtain a second voice recognition result, wherein the target pinyin is a pinyin matched with the second pinyin corresponding to the second keyword in the at least one first pinyin.
9. A speech recognition apparatus for use with a display device, the apparatus comprising:
the acquisition module is used for responding to the voice instruction and acquiring a screenshot;
the processing module is used for carrying out image recognition and pinyin conversion on the screen shot to obtain first mapping data, wherein the first mapping data comprises at least one first keyword and at least one first pinyin, and one first pinyin corresponds to one first keyword;
the voice recognition module is used for carrying out voice recognition on the voice command to obtain a first voice recognition result, wherein the first voice recognition result comprises a second keyword;
and the voice recognition correction module is used for replacing the second keyword in the first voice recognition result with a target keyword corresponding to the target pinyin under the condition that the target pinyin exists in the first mapping data to obtain a second voice recognition result, wherein the target pinyin is a pinyin matched with the second pinyin corresponding to the second keyword in the at least one first pinyin.
10. A computer readable storage medium having stored therein at least one executable instruction that, when executed on a display device, causes the display device to perform the operations of the speech recognition method of claim 8.
CN202311130206.1A 2023-09-04 2023-09-04 Display device, voice recognition method, voice recognition device and storage medium Pending CN117812323A (en)

Priority Applications (1)

CN202311130206.1A — priority date 2023-09-04, filing date 2023-09-04 — Display device, voice recognition method, voice recognition device and storage medium

Publications (1)

CN117812323A — publication date 2024-04-02

Family ID: 90433999

Country Status (1): CN — CN117812323A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination