CN115588431A - Voice recognition method and display device - Google Patents


Info

Publication number
CN115588431A
Authority
CN
China
Prior art keywords
character
rule
error correction
preset
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211275976.0A
Other languages
Chinese (zh)
Inventor
朱飞
胡胜元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202211275976.0A
Publication of CN115588431A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the application disclose a voice recognition method and a display device, relate to the field of smart devices, and can accurately recognize characters entered by a user through voice without increasing the consumption of computing resources. The specific scheme is as follows: receiving target character voice data from a display device; inputting the target character voice data into a preset speech recognition model to obtain an initial recognition result; correcting the initial recognition result according to a preset character error correction rule to obtain a target character corresponding to the target character voice data; and sending the target character to the display device so that the display device displays the target character.

Description

Voice recognition method and display device
Technical Field
The application relates to the field of smart devices, and in particular to a voice recognition method and a display device.
Background
At present, more and more smart devices (such as smart televisions, mobile phones, and tablets) offer voice recognition and voice interaction functions. In scenarios where an account password must be entered, the traditional scheme for devices such as smart televisions is to enter the password manually with the keys of a remote controller. If the account password is complex, the whole input process takes a long time, and the user experience is poor.
Because of this, some smart devices use their own voice recognition capability to let the user enter the account password by voice. However, most current speech recognition schemes on smart devices are sentence-level schemes: they support recognition of words and complex sentences but recognize single characters poorly. A character speech recognition model capable of accurately recognizing a single character therefore has to be developed separately, which is costly and difficult. Moreover, two speech recognition models then have to coexist on the server corresponding to the smart device, which occupies a large amount of computing resources and affects the user's use of the smart device.
Disclosure of Invention
Embodiments of the application provide a voice recognition method and a display device that can accurately recognize characters a user enters by voice without increasing the consumption of computing resources.
To achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, a speech recognition method applied to a server is provided. The method may include: receiving target character voice data from a display device; inputting the target character voice data into a preset speech recognition model to obtain an initial recognition result; correcting the initial recognition result according to a preset character error correction rule to obtain a target character corresponding to the target character voice data; and sending the target character to the display device so that the display device displays the target character.
Based on this technical scheme, when the display device needs to recognize received character voice data (such as target character voice data), it can send the target character voice data to the server. After obtaining the target character voice data, the server can perform initial recognition on it with the existing preset speech recognition model to obtain an initial recognition result, and then correct the initial recognition result based on the preset character error correction rule to obtain the target character corresponding to the target character voice data. The server can then send the target character to the display device. The whole voice recognition process is built on the existing speech recognition scheme already shared by the display device and the server, that is, the preset speech recognition model performs the initial recognition. On that basis, the preset character error correction rule corrects the initial recognition result to obtain a more accurate recognition result, namely the target character. The preset character error correction rule can be a set of error correction tables, which occupy very little computing resource. Therefore, compared with the prior art, the technical scheme provided by the application can accurately recognize characters entered by the user through voice without increasing the consumption of computing resources.
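As a rough illustration of this server-side flow, the sketch below chains initial recognition and table-based error correction; all function and object names here are assumptions made for the example, not identifiers from the patent.

```python
# Illustrative sketch only: the model and rule objects are assumed interfaces,
# not components defined by the patent.

def recognize_character(voice_data: bytes, model, error_correction_rules) -> str:
    """Recognize a single spoken character: initial recognition with an existing
    sentence-level speech model, then lightweight character error correction."""
    initial_result = model.transcribe(voice_data)                       # e.g. "two"
    target_character = error_correction_rules.correct(initial_result)   # e.g. "2"
    return target_character

def handle_request(voice_data: bytes, model, rules, send_to_display) -> None:
    """Server entry point: receive voice data from the display device,
    recognize the character, and send it back for display."""
    send_to_display(recognize_character(voice_data, model, rules))
```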
In a possible implementation of the first aspect, the preset character error correction rule includes at least one or more of the following sub-rules: a character mapping sub-rule, a phonetic-shape code matching sub-rule, and an association sub-rule.
The character mapping sub-rule includes: if the initial recognition result matches a first recognition result in a preset character mapping table, determining the first character associated with the first recognition result in the preset character mapping table as a first error correction result of the initial recognition result. The preset character mapping table indicates the association between preset characters and optional recognition results; the first recognition result is one of the optional recognition results, and the first character is the preset character associated with the first recognition result in the preset character mapping table.
The phonetic-shape code matching sub-rule includes: if the phonetic-shape code of the initial recognition result matches a first phonetic-shape code in a phonetic-shape code dictionary, determining the second character associated with the first phonetic-shape code in the phonetic-shape code dictionary as a second error correction result of the initial recognition result. The phonetic-shape code dictionary indicates the association between preset characters and their phonetic-shape codes; the first phonetic-shape code is one of the phonetic-shape codes of the preset characters, and the second character is the preset character to which the first phonetic-shape code belongs.
The association sub-rule includes: determining the character associated with a third character, obtained according to a preset association rule, as a third error correction result of the initial recognition result. The third character is a character obtained by the voice recognition method before the target character; the preset association rule corresponds to the target type of the third character and indicates the association between the target character and characters of that target type.
Based on this implementation, the technical scheme provided by the application can use a rich set of sub-rules to correct the initial recognition result, so that a more accurate recognition result of the target character voice data, namely the target character, can be obtained.
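To make the two lookup-based sub-rules more concrete, the sketch below shows one possible shape for the preset character mapping table and the phonetic-shape code dictionary; the table contents, the code strings, and the helper names are assumptions for illustration and are not taken from the patent.

```python
from typing import Callable, Optional

# Hypothetical contents; a real mapping table would be built from observed
# misrecognitions of the preset characters.
CHARACTER_MAPPING = {            # optional recognition result -> preset character
    "two": "2", "to": "2", "too": "2",
    "four": "4", "for": "4",
    "bee": "B", "be": "B",
}

PHONO_CODE_DICT = {              # preset character -> its phonetic-shape code
    "2": "code-2",               # placeholder code strings
    "4": "code-4",
    "B": "code-b",
}

def apply_character_mapping(initial_result: str) -> Optional[str]:
    """Character mapping sub-rule: direct lookup of the initial result."""
    return CHARACTER_MAPPING.get(initial_result)

def apply_phono_code_matching(initial_result: str,
                              encode: Callable[[str], str]) -> Optional[str]:
    """Phonetic-shape code matching sub-rule: encode the initial result and
    compare it with the phonetic-shape codes of the preset characters."""
    code = encode(initial_result)
    for character, preset_code in PHONO_CODE_DICT.items():
        if code == preset_code:
            return character
    return None
```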
In another possible implementation of the first aspect, when the preset character error correction rule includes the character mapping sub-rule, correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data includes: determining the first error correction result obtained according to the character mapping sub-rule as the target character.
When the preset character error correction rule includes the phonetic-shape code matching sub-rule, correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data includes: determining the second error correction result obtained according to the phonetic-shape code matching sub-rule as the target character.
When the preset character error correction rule includes the association sub-rule, correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data includes: determining the third error correction result obtained according to the association sub-rule as the target character.
Based on this implementation, the technical scheme provided by the application can obtain the error correction result of the initial recognition result in different ways, according to the different sub-rules contained in the preset character error correction rule. That is to say, whichever sub-rule the preset character error correction rule contains, the appropriate recognition result of the target character voice data, namely the target character, can be accurately determined.
In a possible implementation of the first aspect, when the preset character error correction rule includes any two or all of the character mapping sub-rule, the phonetic-shape code matching sub-rule, and the association sub-rule, correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data includes: sequentially selecting the sub-rules included in the preset character error correction rule, in a preset order, to correct the initial recognition result until an error correction result of the initial recognition result is obtained, and determining that error correction result as the target character.
Based on this implementation, when the preset character error correction rule includes multiple sub-rules, the different sub-rules can be used in turn, in the preset order, to correct the initial recognition result. Not all sub-rules of the preset character error correction rule are necessarily used in the process of obtaining the target character, so the technical scheme provided by the application can ensure that an accurate target character is obtained while reducing the waste of computing resources as much as possible.
In a possible implementation of the first aspect, the preset order is from the largest weight to the smallest, the weight of the character mapping sub-rule is greater than that of the phonetic-shape code matching sub-rule, the weight of the phonetic-shape code matching sub-rule is greater than that of the association sub-rule, and the preset character error correction rule includes the character mapping sub-rule, the phonetic-shape code matching sub-rule, and the association sub-rule. In this case, sequentially selecting the sub-rules in the preset order to correct the initial recognition result until an error correction result is obtained, and determining that error correction result as the target character, includes:
correcting the initial recognition result according to the character mapping sub-rule; if a first error correction result is obtained according to the character mapping sub-rule, determining the first error correction result as the target character; if no first error correction result is obtained according to the character mapping sub-rule, correcting the initial recognition result according to the phonetic-shape code matching sub-rule;
if a second error correction result is obtained according to the phonetic-shape code matching sub-rule, determining the second error correction result as the target character; if no second error correction result is obtained according to the phonetic-shape code matching sub-rule, correcting the initial recognition result according to the association sub-rule;
and if a third error correction result is obtained according to the association sub-rule, determining the third error correction result as the target character.
Based on this possible implementation, when the preset character error correction rule includes the character mapping sub-rule, the phonetic-shape code matching sub-rule, and the association sub-rule, and their weights decrease in that order, the character mapping sub-rule is tried first; if it succeeds, its error correction result is determined as the target character. If it fails, the initial recognition result is corrected with the phonetic-shape code matching sub-rule; if that succeeds, its error correction result is determined as the target character. If that also fails, the initial recognition result is corrected with the association sub-rule, and the resulting error correction result is determined as the target character. The combination of the character mapping sub-rule, the phonetic-shape code matching sub-rule, and the association sub-rule is bound to yield an error correction result of the initial recognition result, and not all sub-rules are necessarily used. That is to say, based on this implementation, the technical scheme provided by the application ensures that an accurate target character is obtained while reducing the waste of computing resources as much as possible.
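One way to realize this descending-weight order is a simple fall-through chain, as in the sketch below; it reuses the hypothetical helpers from the earlier sketch and assumes an `apply_association_rule` helper for the third sub-rule.

```python
from typing import Optional

def apply_association_rule(previous_character: Optional[str]) -> Optional[str]:
    """Placeholder for the association sub-rule: infer the target character from
    the previously obtained character according to a preset association rule."""
    return None  # stub for illustration

def correct_initial_result(initial_result: str,
                           encode,
                           previous_character: Optional[str] = None) -> Optional[str]:
    """Try the sub-rules in descending weight order and stop at the first hit,
    so lower-weight rules are only evaluated when the heavier ones fail."""
    result = apply_character_mapping(initial_result)              # highest weight
    if result is not None:
        return result
    result = apply_phono_code_matching(initial_result, encode)    # middle weight
    if result is not None:
        return result
    return apply_association_rule(previous_character)             # lowest weight
```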
In a second aspect, a speech recognition method applied to a display device is provided. The method may include: acquiring target character voice data; sending the target character voice data to a server; and receiving a target character sent by the server, where the target character is obtained by the server after it uses a preset speech recognition model to obtain an initial recognition result of the target character voice data and corrects that initial recognition result based on a preset character error correction rule.
Based on this technical scheme, when the display device needs to recognize received character voice data (such as target character voice data), it can send the target character voice data to the server. After obtaining the target character voice data, the server can perform initial recognition on it with the existing preset speech recognition model to obtain an initial recognition result, then correct the initial recognition result based on the preset character error correction rule to obtain the target character corresponding to the target character voice data, and finally send the target character to the display device. The whole voice recognition process is built on the existing speech recognition scheme already shared by the display device and the server, that is, the preset speech recognition model performs the initial recognition, and the preset character error correction rule then corrects the initial recognition result to obtain a more accurate recognition result, namely the target character. The preset character error correction rule can be a set of error correction tables, which occupy very little computing resource. Therefore, compared with the prior art, the technical scheme provided by the application can accurately recognize characters entered by the user through voice without increasing the consumption of computing resources.
In a possible implementation of the second aspect, after receiving the target character sent by the server, the method further includes: displaying the target character.
Based on this implementation, the display device can display the recognition result of the target character voice data, namely the target character, so that the user can learn the recognition result in time.
In a possible implementation of the second aspect, displaying the target character includes: if the target character includes a plurality of characters, displaying a selection popup, where the selection popup includes a plurality of selection options corresponding one-to-one to the characters included in the target character; and in response to a trigger operation on a first selection option in the selection popup, displaying the character in the target character that corresponds to the first selection option.
Based on this implementation, when the display device acquires the recognition result of the target character voice data, namely the target character, and the target character contains multiple characters, the server's recognition result is not unique, and the user may need to choose among the candidate results provided by the server. The display device can therefore display a selection popup for the user to pick the desired character; after the user triggers a selection option in the popup, the display device displays the corresponding character. In this way, even when the target character voice data cannot be recognized unambiguously, the final recognition result can be determined and displayed with the help of the user's selection, giving the user a better experience.
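On the display-device side, handling a single-character result versus a multi-candidate result could look roughly like the sketch below; the UI helpers (`display_character`, `show_selection_popup`) are placeholders for whatever widget framework the display device actually uses.

```python
from typing import List

def handle_target_character(candidates: List[str],
                            display_character, show_selection_popup) -> None:
    """Display the recognized character directly, or let the user pick one
    when the server returned several candidate characters."""
    if len(candidates) == 1:
        display_character(candidates[0])
    else:
        # One selection option per candidate character; display whichever
        # option the user triggers.
        chosen = show_selection_popup(candidates)
        display_character(chosen)
```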
In a third aspect, a server is provided that may include a communication interface, a processor, a memory, a bus; the memory is used for storing computer execution instructions, and the processor is connected with the memory through a bus; when the server is running, the processor executes the computer-executable instructions stored by the memory to cause the server to perform the speech recognition method as provided by the first aspect and possible implementations thereof.
In a fourth aspect, a display device is provided that includes a display screen, a memory, and one or more processors; the display screen and the memory are coupled with the processor; wherein the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the display device to perform the speech recognition method as provided in the second aspect and any of its possible implementations.
In a fifth aspect, a computer-readable storage medium is provided, which stores instructions that, when executed on a server, enable the server to perform the speech recognition method provided by the first aspect and possible implementations thereof.
In a sixth aspect, a computer-readable storage medium is provided, which stores instructions that, when run on a display device, cause the display device to execute the speech recognition method provided by the second aspect and possible implementations thereof.
In a seventh aspect, a computer program product is provided, which contains instructions that, when run on a server, cause the server to perform the speech recognition method as provided in the first aspect and possible implementations thereof.
In an eighth aspect, a computer program product is provided, which contains instructions that, when run on a display device, cause the display device to perform the speech recognition method as provided in the second aspect and possible implementations thereof.
In a ninth aspect, there is provided an apparatus (which may be a system-on-a-chip, for example) comprising a processor for enabling a server to carry out the functions referred to in the first aspect above. In one possible design, the apparatus further includes a memory for storing program instructions and data necessary for the server. When the device is a chip system, the device may be composed of a chip, or may include a chip and other discrete devices.
In a tenth aspect, there is provided an apparatus (which may be a system-on-a-chip, for example) comprising a processor for enabling a display device to carry out the functions recited in the second aspect above. In one possible design, the apparatus further includes a memory for storing program instructions and data necessary for the electronic device. When the device is a chip system, the device may be formed by a chip, and may also include a chip and other discrete devices.
In an eleventh aspect, a speech recognition system is provided that includes a server and a display device. The server is configured to execute the voice recognition method provided by the first aspect and possible implementations thereof, and the display device is configured to execute the voice recognition method provided by the second aspect and possible implementations thereof.
The technical effects brought by any one of the design manners in the third aspect to the eleventh aspect may be referred to the technical effects brought by the different design manners in the first aspect and the second aspect, and are not described herein again.
Drawings
Fig. 1 is a schematic diagram of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a control device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a display device according to an embodiment of the present application;
fig. 5 is a schematic diagram of a software architecture of a display device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 7 is a first flowchart illustrating a speech recognition method according to an embodiment of the present application;
fig. 8 is a schematic interface diagram of a television according to an embodiment of the present application;
fig. 9 is a second flowchart illustrating a speech recognition method according to an embodiment of the present application;
fig. 10 is a schematic view of a scene in which a television acquires character voice data according to an embodiment of the present application;
fig. 11 is a third schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 12 is a schematic diagram illustrating a preset speech recognition model according to an embodiment of the present application;
fig. 13 is a schematic diagram of error correction of a character mapping sub-rule according to an embodiment of the present application;
fig. 14 is a schematic diagram illustrating an error correction rule according to an embodiment of the present application;
fig. 15 is a schematic diagram illustrating a process of encoding a phonetic-shape code according to an embodiment of the present application;
fig. 16 is a fourth schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 17 is a fifth schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 18 is a sixth schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 19 is a first schematic view of a display interface of a television according to an embodiment of the present application;
fig. 20 is a second schematic view of a display interface of a television according to an embodiment of the present application;
fig. 21 is a third schematic view of a display interface of a television according to an embodiment of the present application;
fig. 22 is a seventh flowchart illustrating a speech recognition method according to an embodiment of the present application;
fig. 23 is a flowchart illustrating an eighth method for speech recognition according to an embodiment of the present application;
fig. 24 is a schematic structural diagram of another server provided in the embodiment of the present application;
fig. 25 is a schematic structural diagram of another display device provided in an embodiment of the present application.
Detailed Description
To make the purpose and embodiments of the present application clearer, the exemplary embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described exemplary embodiments are only some of the embodiments of the present application, not all of them.
It should be noted that the brief descriptions of terms in the present application are only intended to make the embodiments described below easier to understand and are not meant to limit the embodiments of the present application. Unless otherwise indicated, these terms should be understood in their ordinary and customary meaning.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above drawings are used to distinguish between similar objects or entities and do not necessarily describe a particular order, sequence, or chronology, unless otherwise indicated. It is to be understood that terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such a product or device.
The term "and/or" in this application merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this application generally indicates that the associated objects before and after it are in an "or" relationship.
All other embodiments obtained by a person skilled in the art from the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also separately constitute a complete embodiment.
At present, more and more smart devices (such as smart televisions, mobile phones, and tablets) offer voice recognition and voice interaction functions. In scenarios where an account password must be entered, the traditional scheme for devices such as smart televisions is to enter the password manually with the keys of a remote controller. If the account password is complex, the whole input process takes a long time, and the user experience is poor.
Because of this, some smart devices use their own voice recognition capability to let the user enter the account password by voice. However, most current speech recognition schemes on smart devices are sentence-level schemes: they support recognition of words and complex sentences but recognize single characters poorly. For example, when the smart device is a smart television, the cloud scheme currently adopted has the smart television send the voice data to be recognized to a cloud server for recognition. The server in the cloud can use a preset speech recognition model to recognize long sentences; a long sentence such as "I want to watch Spiderman" can be recognized accurately. But the voice data corresponding to the character "2" may be recognized as "two", "to", "rabbit", and so on.
Thus, a character speech recognition model capable of accurately recognizing a single character has to be developed separately. Such a character speech recognition model is developed for the preset characters; in the present application, the preset characters may include 26 letters, 10 Arabic numerals, and N special characters, i.e., 36 + N characters. Specifically, speech data of the preset characters is collected first, and a character speech recognition model for recognition and classification is then obtained through training. The resulting character speech recognition model can recognize a single character well, but it cannot recognize words or sentences, and its development cost and difficulty are high; especially in a multi-language scenario, the character speech recognition model becomes large.
Furthermore, even setting aside the cost and difficulty of developing the character speech recognition model, once it is developed, two speech recognition models end up coexisting in the speech recognition scheme of the smart device, which occupies a large amount of computing resources and affects the user's use of the smart device.
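For concreteness, the preset character set described here (26 letters, 10 Arabic numerals, and N special characters) could be constructed as in the snippet below; the particular special characters chosen are an assumption, since the text does not list them.

```python
import string

# 26 letters + 10 Arabic numerals + N special characters; the special
# characters below are assumed examples (here N = 4), not taken from the patent.
SPECIAL_CHARACTERS = ["@", "#", "_", "-"]
PRESET_CHARACTERS = list(string.ascii_uppercase) + list(string.digits) + SPECIAL_CHARACTERS

assert len(PRESET_CHARACTERS) == 36 + len(SPECIAL_CHARACTERS)
```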
Based on this, referring to fig. 1, an embodiment of the present application provides a speech recognition method that can be applied to the speech recognition scheme of a display device; compared with an existing speech recognition method, it adds one error correction module. Specifically, when there is a target character speech input (e.g., the pronunciation of "2"), the scheme first recognizes the target character speech with the existing preset speech recognition model for words and sentences and obtains an initial recognition result (e.g., "two"). The error correction module then corrects the initial recognition result based on a pre-established preset character error correction rule (e.g., a character mapping sub-rule and a phonetic-shape code matching sub-rule) to obtain error correction results; for example, the error correction result obtained by the character mapping sub-rule is 2, and the error correction result obtained by the phonetic-shape code matching sub-rule is also 2. Finally, based on a specific fusion condition, the error correction results obtained by the preset character error correction rule are fused to obtain the final result, i.e., the target character (e.g., 2) corresponding to the target character speech.
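The exact fusion condition is not spelled out at this point, so the rule in the sketch below is an assumed one, shown only to make the fusion step of fig. 1 concrete: if both sub-rules agree (as in the "2" example above), that character is used; otherwise the character mapping result is preferred.

```python
from typing import Optional

def fuse_correction_results(mapping_result: Optional[str],
                            phono_result: Optional[str]) -> Optional[str]:
    """Assumed fusion rule for illustration: agreement wins, otherwise prefer
    the character mapping result, then the phonetic-shape code result."""
    if mapping_result is not None and mapping_result == phono_result:
        return mapping_result                       # e.g. both give "2"
    return mapping_result if mapping_result is not None else phono_result
```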
The following describes a speech recognition method provided in an embodiment of the present application in detail with reference to the drawings.
Fig. 2 is a schematic diagram illustrating the structure of a speech recognition system to which the speech recognition method of an exemplary embodiment is applied. Referring to fig. 2, the speech recognition system includes a display device 01 and a server 02.
A user can control the display device 01 through a mobile terminal 100 or a control apparatus 200. The control apparatus 200 may be a remote controller, which communicates with the display device 01 through infrared protocol communication, Bluetooth protocol communication, or other wireless or wired methods to control the display device 01. The user can input user instructions through keys on the remote controller, voice input, control panel input, and the like to control the display device 01. In addition, the display device 01 may also directly receive the user's voice input or voice instructions through a built-in module for acquiring voice instructions (e.g., a MIC). In some embodiments, smart devices such as tablets, computers, and laptops may also be used to control the display device 01.
In some embodiments, the same or mutually matched software applications may be installed on the mobile terminal 100 and the display device 01 to establish connection and communication through a network protocol, thereby achieving one-to-one control operation and data communication. In this case, audio and video content displayed on the mobile terminal 100 may also be transmitted to the display device 01 to implement a synchronous display function.
Data communication between the display device 01 and the server 02 may be performed by wired or wireless communication methods. The server 02 may provide various contents and interactions to the display device 01. For example, the server 02 may store the preset speech recognition model and the preset character error correction rule required by the speech recognition method provided in the embodiments of the present application, so that the server 02 can provide speech recognition capability to the display device 01. Alternatively, the server 02 may cooperate with the display device 01 to implement the speech recognition scheme.
For example, in the embodiments of the present application, the display device may take various forms: it may be a television, a smart television, a laser projection device, a monitor, an electronic whiteboard, an electronic table, or any other device capable of voice input. The embodiments of the present application do not limit the specific form of the display device. In the embodiments of the present application, a television is taken as an example of the display device for illustration.
For example, in the embodiments of the present application, the server 02 may be a single server, a server cluster formed by multiple servers, or a cloud computing service center, which is not specifically limited in this application. The server 02 may be connected to at least one display device 01, and the number and types of display devices 01 are not specifically limited in the present application. Referring to fig. 1, in the embodiments of the present application, the preset speech recognition model and the error correction module may both be deployed in the server 02, and the target character voice data may be sent to the server 02 after being acquired by the display device 01. After recognizing and correcting the target character voice data, the server 02 may transmit the obtained target character to the display device 01. After receiving the target character, the display device can display it accordingly so that the user learns the recognition result in time.
Fig. 3 shows a block diagram of one possible configuration of the control apparatus 200. As shown in fig. 3, the control apparatus 200 includes a controller 210, a communication interface 230, a user input/output interface 240, a memory, and a power supply. The control apparatus 200 can receive the user's input operation instructions (e.g., voice instructions) and convert them into instructions that the display device 01 can recognize and respond to, serving as an interaction intermediary between the user and the display device 01.
By way of example, taking a display device as a television as an example, fig. 4 shows a schematic structural diagram of a display device 01 provided in an embodiment of the present application.
As shown in fig. 4, the display apparatus 01 includes at least one of a tuner demodulator 110, a communicator 120, a detector 130, an external device interface 140, a controller 150, a display 160, an audio output interface 170, a memory, a power supply, and a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to nth interfaces for input/output.
The display 160 includes a display screen component for presenting pictures and a driving component that drives image display; it receives image signals output from the controller and displays video content, image content, menu manipulation interface components, and user manipulation UI interfaces.
The display 160 may be a liquid crystal display, an OLED display, or a projection display, and may also be a projection device with a projection screen.
The communicator 120 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network or near-field communication protocol chips, and an infrared receiver. Through the communicator 120, the display device 01 can establish the transmission and reception of control signals and data signals with the external control apparatus 200 or the server 02.
A user interface may be used to receive control signals from the control apparatus 200 (e.g., an infrared remote controller).
The detector 130 is used to collect signals of the external environment or of interaction with the outside. For example, the detector 130 includes a light receiver, a sensor for collecting ambient light intensity; or an image collector, such as a camera, which can collect external environment scenes, user attributes, or user interaction gestures; or a sound collector, such as a microphone, for receiving external sounds.
The external device interface 140 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. Or may be a composite input/output interface formed by the plurality of interfaces.
The tuner demodulator 110 receives broadcast television signals through wired or wireless reception and demodulates audio and video signals, such as EPG data signals, from the many wireless or wired broadcast television signals.
In some embodiments, the controller 150 and the tuner demodulator 110 may be located in separate devices; that is, the tuner demodulator 110 may also be located in a device external to the main device where the controller 150 is located, such as an external set-top box.
The controller 150 controls the operation of the display device and responds to the user's operation through various software control programs stored in the memory. The controller 150 controls the overall operation of the display device 01. For example: in response to receiving a user command for selecting a UI object displayed on the display 160, the controller 150 may perform an operation related to the object selected by the user command.
In some embodiments, the controller includes at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), a random access memory (RAM), a read-only memory (ROM), first to nth interfaces for input/output, a communication bus, and the like.
The user may input a user command through a Graphic User Interface (GUI) displayed on the display 160, and the user input interface receives the user input command through the Graphic User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
A "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables the conversion of the internal form of information to a form acceptable to the user. A common presentation form of a User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the display device, where the control may include at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc. visual interface elements.
It will be appreciated that, in general, the implementation of display device functions requires the cooperation of software in addition to the support of the hardware described above.
In some embodiments, taking an operating system used by the display device 01 as an Android system as an example, as shown in fig. 5, the system of the display device 01 may be divided into four layers, which are, from top to bottom, an Application (Applications) layer (abbreviated as "Application layer"), an Application Framework (Application Framework) layer (abbreviated as "Framework layer"), an Android runtime (Android runtime) layer and a system library layer (abbreviated as "system runtime library layer"), and a kernel layer.
In some embodiments, at least one application program runs in the application layer. These applications may be window programs, system setting programs, or clock programs provided with the operating system, or applications developed by third-party developers. In the embodiment of the present application, the application layer may include a voice recognition application that is specifically used to call the communication interface of the display device 01 to send voice data received by the display device 01 to the server 02 for recognition. In specific implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an Application Programming Interface (API) and a programming framework for the application. The application framework layer includes some predefined functions or services. The application framework layer acts as a processing center that decides to let the applications in the application layer act. The application program can access the resources in the system and obtain the services of the system in execution through the API interface.
As shown in fig. 5, in the embodiment of the present application, the application framework layer includes a manager (Managers), a Content Provider (Content Provider), a View system (View system), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used for interacting with all activities running in the system; the Location Manager (Location Manager) is used for providing the system service or application with the access of the system Location service; a Package Manager (Package Manager) for retrieving various information related to an application Package currently installed on the device; a Notification Manager (Notification Manager) for controlling display and clearing of Notification messages; a Window Manager (Window Manager) is used to manage the icons, windows, toolbars, wallpapers, and desktop components on a user interface.
In some embodiments, the activity manager manages the lifecycle of the applications and the usual navigation and back functions, such as controlling the exit, opening, and back operations of applications. The window manager manages all window programs, for example obtaining the display screen size, determining whether there is a status bar, locking the screen, capturing the screen, and controlling changes of the display window (for example, shrinking the display window, shaking the display, distorting or deforming the display, and so on).
In some embodiments, the system runtime library layer provides support for the layer above it, i.e., the framework layer; when the framework layer is used, the Android operating system runs the C/C++ libraries included in the system runtime library layer to implement the functions required by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. The kernel layer contains at least one of the following drivers: an audio driver, a display driver, a Bluetooth driver, a camera driver, a Wi-Fi driver, a USB driver, an HDMI driver, sensor drivers (such as a fingerprint sensor, a temperature sensor, a pressure sensor, etc.), a MIC driver, a power driver, and so on.
For example, fig. 6 shows a schematic structural diagram of a server. Referring to fig. 6, the server includes one or more processors 201, a communication line 202, and at least one communication interface (fig. 6 merely takes one communication interface 203 and one processor 201 as an example), and may optionally further include a memory 204.
The processor 201 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
The communication line 202 may include a path for communication between different components.
The communication interface 203 may be a transceiver module for communicating with other devices or communication networks, such as Ethernet, a RAN, or a wireless local area network (WLAN). For example, the transceiver module may be a transceiver or the like. Optionally, the communication interface 203 may also be a transceiver circuit located in the processor 201 to implement signal input and output for the processor.
The memory 204 may be a device having a storage function, such as, but not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may exist independently and be connected to the processor via the communication line 202, or the memory may be integrated with the processor.
The memory 204 is used for storing computer-executable instructions for executing the present application, and is controlled by the processor 201 to execute. The processor 201 is configured to execute computer-executable instructions stored in the memory 204, thereby implementing the speech recognition method provided in the embodiment of the present application.
Alternatively, in this embodiment of the present application, the processor 201 may also execute a function related to processing in the speech recognition method provided in the following embodiments of the present application, and the communication interface 203 is responsible for communicating with another device (for example, a display device) or a communication network, which is not specifically limited in this embodiment of the present application.
Optionally, the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
In specific implementations, as one embodiment, the processor 201 may include one or more CPUs, such as CPU0 and CPU1 in fig. 6.
In specific implementations, as one embodiment, the server may include multiple processors, such as the processor 201 and the processor 207 in fig. 6. Each of these processors may be a single-core processor or a multi-core processor. A processor here may include, but is not limited to, at least one of the following: various computing devices that run software, such as a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller unit (MCU), or an artificial intelligence processor; each computing device may include one or more cores for executing software instructions to perform operations or processing.
In one embodiment, the server may further include an output device 205 and an input device 206. The output device 205 is in communication with the processor 201 and may display information in a variety of ways. For example, the output device 205 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 206 is in communication with the processor 201 and may receive user input in a variety of ways. For example, the input device 206 may be a mouse, keyboard, touch screen device, or sensing device, among others.
The server may be a general purpose device or a dedicated device. For example, the server may be a desktop computer, a portable computer, a network server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device having a similar structure as in fig. 6. The embodiment of the application does not limit the type of the server.
The voice data referred to in the present application may be data authorized by the user or sufficiently authorized by the parties.
The method in the following embodiments may be implemented in a display device or a server having the above-described hardware configuration and software configuration. In the following embodiments, a voice recognition method provided in the embodiments of the present application is described by taking a display device as a television as an example.
Referring to fig. 7, an embodiment of the present application provides a speech recognition method, which may include S701-S707:
S701, the television acquires target character voice data.
When the user needs to use some function or application on the television that requires entering an account and/or password (for example, logging in to the membership of a certain video application), the television displays a corresponding interface instructing the user to input the corresponding account and/or password.
Taking a scenario in which a password needs to be input as an example, after the user triggers the corresponding function icon or application icon in the television's user interface (UI), the television may display a password input interface 801 as shown in fig. 8. The interface 801 may include an input box 802, prompt information 803, and a keyboard 804. The prompt 803 instructs the user to input a password, such as "Please input a password by voice or Keyboard". The input box 802 displays the password the user has already entered, and the keyboard 804 contains the character options needed for a password, for example 26 letters, 10 digits, and N special characters. The specific style of the keyboard 804 may be determined by the configuration of the television and the user's settings; it may be a nine-grid keyboard as shown in fig. 8 or a keyboard like a computer keyboard, which is not limited in this application.
In the case where the user determines to input the password using voice, the user may input character voice data to the television set in any feasible voice input manner. And inputting the target character voice data into the television by the user, namely acquiring the target character voice data by the television.
In an implementable manner, a user can input character speech data via the control device. Taking the control device as a remote controller of a television as an example, a user can press a radio key (or MIC key) in the remote controller when voice input is required. In response to the hold operation, as shown in fig. 9 in conjunction with fig. 7, the remote controller may start receiving character voice data (e.g., target character voice data) of the user. After the user finishes speaking a certain character, the user can release the radio key. In response to the release operation, the remote controller may transmit the received character voice data to the television set.
In another implementation manner, in the case that the television itself has a sound-pickup microphone (MIC), referring to fig. 10, the user may input character voice data to the television through the MIC of the television. That is, the user can speak characters directly to the television. The television can receive the user's character voice data provided the MIC function has been turned on in advance. Referring to fig. 10, when the MIC function has been turned on in advance, a voice input prompt 1001 may be displayed in a target area (for example, the lower right corner) of the television's display interface. Of course, if the user inputs character voice data to the television through the remote controller, the television may also display a voice input prompt in the target area.
S702, the television sends target character voice data to the server.
In a character input scene that a user inputs an account or a password, if the user inputs target character voice data to the television, the television needs to identify the target character voice data to obtain a specific target character. Based on the above, the television can send the target character voice data to the server with voice recognition capability, so that the server can recognize the target character voice data and return the corresponding target character.
In one implementation, a voice recognition application may be present in the television, and when the television receives the target character voice data, it may first transmit the target character voice data to that application. Taking the example of a user inputting target character voice data to the television through the remote controller, as shown in fig. 11 in conjunction with fig. 9, the "television" here may specifically be the operating system (or the whole system) of the television, and the voice recognition application may also be referred to as a voice client. After receiving the target character voice data sent by the remote controller, the operating system of the television can send it to the voice client. The voice client can call the software and hardware of the television to send the target character voice data to the corresponding server. Here, the server may specifically be the platform server corresponding to the voice recognition application. The target character is subsequently sent from the server back to the television (i.e., S706) along the same path.
It should be noted that, when the operating system of the television sends the target character voice data to the voice client, it also sends a voice recognition instruction, which instructs the voice client to recognize the target character voice data. Illustratively, the target character voice data may be carried in the voice recognition instruction. The same applies when the voice client sends the target character voice data to the server.
As shown in fig. 12, in order to recognize the target character voice data accurately later, after acquiring the target character voice data the television first preprocesses it (also referred to as "acoustic front-end processing") using its own processing capability. Specifically, the preprocessing may comprise any one or more of: microphone-array processing, noise reduction, dereverberation, echo cancellation, and the like. The microphone-array processing may be used to remove ambient audio around the user's input and retain only the target character voice data.
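As a rough illustration of how such a front-end chain might be organized, the following Python sketch applies the stages in sequence. The individual stage functions are hypothetical placeholders (this application does not prescribe concrete signal-processing algorithms), so the sketch only conveys the ordering of the steps.

```python
import numpy as np

def beamform(frames: np.ndarray) -> np.ndarray:
    # Microphone-array processing: keep the talker's signal, suppress ambient sources.
    # Placeholder implementation: simply average the channels.
    return frames.mean(axis=0)

def reduce_noise(audio: np.ndarray) -> np.ndarray:
    return audio  # placeholder for noise suppression

def dereverberate(audio: np.ndarray) -> np.ndarray:
    return audio  # placeholder for dereverberation

def cancel_echo(audio: np.ndarray) -> np.ndarray:
    return audio  # placeholder for acoustic echo cancellation

def acoustic_front_end(frames: np.ndarray) -> np.ndarray:
    """Preprocess raw multi-channel audio before it is sent to the server."""
    return cancel_echo(dereverberate(reduce_noise(beamform(frames))))
```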
S703, the server receives the target character voice data from the television.
S704, the server inputs the target character voice data into a preset voice recognition model to obtain an initial recognition result.
In the embodiment of the present application, when the server obtains the target character voice data and determines that it needs to be recognized, the server first performs preliminary recognition. Preliminary recognition refers to calling an existing preset speech recognition model to recognize the target character voice data and obtain a recognized text result, namely the initial recognition result.
For a television that has an integrated voice assistant (i.e., the voice recognition application or voice client in the previous embodiment), the corresponding remote server will typically have a preset speech recognition model applicable at the sentence level. Illustratively, the preset speech recognition model may be any feasible ASR (automatic speech recognition) model developed by a manufacturer or company. This is not specifically limited in the present application.
For example, referring to fig. 12, the preset speech recognition model included in the cloud server may include the following four components: a speech signal processing and feature extraction module 1201, an acoustic model 1202, a language model 1203 and a decoding module 1204.
The voice signal processing and feature extraction module 1201 is mainly used for performing feature extraction on target character voice data which is sent to a server after being preprocessed by a television so as to obtain acoustic features in the target character voice data. The acoustic feature may specifically be a multidimensional vector.
The acoustic model 1202 is configured to process the acoustic features obtained by the speech signal processing and feature extraction module 1201 to obtain phonemes (or words) corresponding to the acoustic features, or to obtain character vectors or word vectors. For example, the acoustic model may be any one of a Hidden Markov Model (HMM), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and the like.
The language model 1203 is then used to determine reasonable combination relationships between multiple phonemes or words. Illustratively, the language model may be an N-gram language model, an RNN language model, a long-short-term memory (LSTM) language model, or the like.
The decoding module 1204 is used for decoding the phonemes output by the acoustic model 1202 into words or sentences readable by the user, under the constraints of the language model 1203.
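The following is a minimal structural sketch of how the four modules above could be composed. Because this application does not prescribe a specific feature extractor, acoustic model, or language model, every function body and return value below is a toy, illustrative assumption; only the composition order reflects the description.

```python
from typing import Dict, List, Tuple

def extract_features(audio: List[float]) -> List[List[float]]:
    # 1201: signal processing and feature extraction -> multidimensional acoustic vectors.
    return [[sample] for sample in audio]          # toy "features"

def acoustic_model(features: List[List[float]]) -> Tuple[str, ...]:
    # 1202: e.g. an HMM / CNN / RNN mapping features to phoneme hypotheses.
    return ("T", "UW")                              # toy output: phonemes of "two"

def decode(phonemes: Tuple[str, ...],
           language_model: Dict[Tuple[str, ...], str]) -> str:
    # 1203 + 1204: the language model constrains the decoder's word/sentence output.
    return language_model.get(phonemes, " ".join(phonemes).lower())

def recognize(audio: List[float]) -> str:
    lm = {("T", "UW"): "two"}                       # toy language model
    return decode(acoustic_model(extract_features(audio)), lm)

# recognize([0.1, 0.2]) -> "two" (an initial recognition result)
```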
It should be noted that the preset speech recognition models currently deployed on servers are generally at the sentence level, and the same pronunciation may produce different recognition results in different contexts. This depends on the language model and the decoding module in the preset speech recognition model. For example, in an English environment, if the user says "set volume to twenty two", the decoder will generally recognize the audio segment corresponding to "twenty two" as the number "22"; if the user only says "two", the decoder may recognize it as "2", "two", "to", or even "too". Therefore, if only the preset speech recognition model is used for character-level recognition, some pre-specified error correction rules are also needed to correct the recognition result, i.e., the subsequent S705 is performed.
S705, the server corrects the error of the initial recognition result according to a preset character error correction rule to obtain a target character corresponding to the target character voice data.
In order to ensure that the preset character error correction rule can accurately correct the initial recognition result, the preset character error correction rule includes at least any one or more of the following sub-rules: a character mapping sub-rule, a sound-shape code matching sub-rule, and an association sub-rule. In this way, the technical solution provided in this application can use these sub-rules to correct the initial recognition result, so that a more accurate recognition result of the target character voice data, namely the target character, can be obtained.
Wherein the character mapping sub-rule comprises: if the initial recognition result is matched with a first recognition result in a preset character mapping table, determining a first character associated with the first recognition result in the preset character mapping table as a first error correction result after error correction of the initial recognition result; the preset character mapping table is used for indicating the incidence relation between the preset characters and the optional recognition results; the first recognition result is one of all the selectable recognition results, and the first character is a preset character associated with the first recognition result in the preset character mapping table. Illustratively, the preset characters may include 26 letters (specifically, may include upper case characters and lower case characters), 10 Arabic numerals, and N special characters, i.e., 36+ N characters. The subsequent preset characters are treated the same.
In one implementation manner, the preset character mapping table in the character mapping sub-rule may be obtained using the preset speech recognition model. Specifically, for each preset character that needs to be recognized, multiple people can speak the corresponding voice into the preset speech recognition model multiple times to obtain the optional recognition results corresponding to that preset character. The mapping relationships between the preset characters and the optional recognition results are thereby obtained, and the preset character mapping table is constructed.
For example, taking English as an example, the speech of the character "2" may be repeatedly input into the preset speech recognition model, yielding optional recognition results such as "two", "to", and "too". In the finally constructed preset character mapping table, the entry for the character "2" may be {"two":2, "to":2, "too":2}.
Based on this, referring to fig. 13 (a), if the initial recognition result of the preset speech recognition model on the target character voice data is "two", "to", or "too", the server will use 2 as the first error correction result of the initial recognition result according to the character mapping sub-rule.
In addition, if recognition of upper-case and lower-case letters is supported, the user can prepend the pronunciation "big" to a letter's pronunciation when inputting the capital form of that letter. On this basis, taking a as an example, the following content in the preset character mapping table can be obtained in the above manner: {"a": a, "big a": A}, and the remaining letters are handled in the same way.
Based on this, referring to fig. 13 (b), if the initial recognition result of the preset speech recognition model on the target character voice data is "big a", the server will use A as the first error correction result of the initial recognition result according to the character mapping sub-rule.
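A minimal sketch of the character mapping sub-rule is given below, assuming the mapping-table entries quoted above; the table layout and the lookup function are illustrative, not the application's implementation.

```python
# Entries reproduce the examples above; a real table would cover all 36+N preset characters.
PRESET_CHAR_MAPPING_TABLE = {
    "two": "2", "to": "2", "too": "2",   # optional recognition results of the character "2"
    "a": "a", "big a": "A",              # "big <letter>" convention for capital letters
}

def apply_char_mapping(initial_result: str):
    """Return the first error correction result, or None when no entry matches."""
    return PRESET_CHAR_MAPPING_TABLE.get(initial_result.strip().lower())

# apply_char_mapping("Too")   -> "2"
# apply_char_mapping("big a") -> "A"
# apply_char_mapping("comma") -> None (this sub-rule cannot correct the result)
```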
The sound-shape code matching sub-rule comprises: if the sound-shape code of the initial recognition result matches a first sound-shape code in a sound-shape code dictionary, determining the second character associated with the first sound-shape code in the sound-shape code dictionary as the second error correction result after error correction of the initial recognition result. The sound-shape code dictionary is used to indicate the association relationship between preset characters and their sound-shape codes; the first sound-shape code is one of the sound-shape codes of all preset characters, and the second character is the preset character to which the first sound-shape code belongs in the sound-shape code dictionary.
The sound-shape code of an object (i.e., a character or a word) is obtained by encoding the pronunciation of the object according to specific rules. The sound-shape code encoding rules differ between language types. In this way, coarse-grained pronunciation normalization can be achieved for a given language type. One sound-shape code may correspond to a plurality of objects, while one object corresponds to only one sound-shape code.
In the embodiment of the present application, the sound-shape code dictionary is constructed according to the pronunciations of the preset characters and is specific to the language type (such as Chinese, English, Arabic, and the like). The sound-shape code dictionary includes the association relationship between each preset character and its sound-shape code. Taking English as an example, the sound-shape code of the character "," is KM, the sound-shape code of the character "2" is T, and the sound-shape code of the character "@" is AT. The specific content (i.e., associations) in the sound-shape code dictionary may include: {"KM": ","}, {"T": "2"}, {"AT": "@"}. That is, when the sound-shape code matching sub-rule is used to correct the initial recognition result: if the sound-shape code of the initial recognition result is KM, the second error correction result is ","; if the sound-shape code of the initial recognition result is T, the second error correction result is "2"; if the sound-shape code of the initial recognition result is AT, the second error correction result is "@".
Based on this, in the account or password character input scenario described in this application, referring to fig. 14, if the sound-shape code of the initial recognition result of the preset speech recognition model on the target character voice data is KM (for example, the initial recognition result is "come"), the server will use "," as the second error correction result of the initial recognition result according to the sound-shape code matching sub-rule.
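As an illustration only, the dictionary entries quoted above and the corresponding lookup can be sketched as follows; how the sound-shape code of the initial recognition result itself is computed is described in steps 1501-1504 below.

```python
# Illustrative sound-shape code dictionary covering the three example characters above.
PHONO_SHAPE_CODE_DICT = {"KM": ",", "T": "2", "AT": "@"}

def apply_phono_code_matching(code_of_initial_result: str):
    """Return the second error correction result, or None when no first sound-shape code matches."""
    return PHONO_SHAPE_CODE_DICT.get(code_of_initial_result)

# If the initial recognition result "come" encodes to "KM",
# apply_phono_code_matching("KM") -> ",".
```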
In addition, when the sound-shape code matching sub-rule is used to correct the initial recognition result, the initial recognition result must first be encoded to obtain its sound-shape code. Taking English as an example, referring to fig. 15, the procedure for encoding the initial recognition result into its sound-shape code may include steps 1501 to 1504:
1501. Delete or ignore non-English-letter characters in the initial recognition result and convert all letters to upper case to obtain a first result.
Non-English-letter characters may appear in the initial recognition result because of limitations in the training of the preset speech recognition model: once the user's pronunciation does not match the pronunciations in the training data, the model may fail to recognize it, or recognize it inaccurately. In that case the preset speech recognition model may output arbitrary characters, so non-English characters may appear (assuming the current recognition scenario is an English usage scenario, or the user has set the television's default language to English). Therefore, the non-English-letter characters need to be deleted to eliminate interference.
For example, if the initial recognition result is "comma", the first result may be COMMA.
1502. Preprocess the initial letter or letter combination of the first result according to preset processing rules to obtain a second result.
For example, the preset processing rules may include: if the letter combination AE is located at the beginning of the word, delete the first letter; a leading letter X is replaced with S.
The preset processing rules may be determined based on English pronunciation habits. For example, when AE is at the beginning of a word, the A is silent or lightly pronounced, so the A can be deleted; the remaining rules follow the same reasoning.
For example, if the first result is COMMA and the preset processing rule includes deleting O in the prefix CO, the second result may be CMMA.
1503. De-duplicate adjacent repeated letters in the second result to obtain a third result.
For example, if the second result is CMMA, one of the two Ms is removed, and the third result is CMA.
1504. Encode the third result according to a preset coding rule to obtain the sound-shape code of the initial recognition result.
The preset coding rule can be specified according to English pronunciation. For example, the five vowels A, E, I, O, U are retained when located at the beginning of the word and removed when located elsewhere; consonant letters are processed according to predetermined conversion rules, for example for C: C in the letter combinations -CIA- and -CH- is converted to X, C in -CI-, -CE- and -CY- is converted to S, C in the letter combinations -SCI-, -SCE- and -SCY- is deleted, and C is converted to K in all other cases. Illustratively, if the third result is CMA, the A may be omitted and the C converted to K according to the pronunciation, resulting in KM.
Of course, the above sound-shape code encoding rules may also take any other feasible form, which is not specifically limited in this application.
After the sound-shape code of the initial recognition result is obtained, a second error correction result of the initial recognition result can be determined according to the sound-shape code matching sub-rule.
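A simplified sketch of steps 1501-1504 is given below. Only the example rules quoted in the text are implemented; a complete encoder would carry many more processing and conversion rules, so the exact rule set here is an assumption for illustration.

```python
import re

def phono_shape_code(initial_result: str) -> str:
    # 1501: drop non-English-letter characters and convert to upper case.
    s = re.sub(r"[^A-Za-z]", "", initial_result).upper()
    # 1502: prefix preprocessing (illustrative subset of the preset processing rules).
    if s.startswith("AE"):
        s = s[1:]                      # A in a leading AE is silent
    if s.startswith("X"):
        s = "S" + s[1:]                # a leading X sounds like S
    if s.startswith("CO"):
        s = "C" + s[2:]                # the "delete O in the prefix CO" example above
    # 1503: collapse adjacent repeated letters.
    s = re.sub(r"(.)\1+", r"\1", s)
    # 1504: keep a vowel only at the beginning of the word; convert C to K in the default case.
    encoded = []
    for i, ch in enumerate(s):
        if ch in "AEIOU":
            if i == 0:
                encoded.append(ch)
            continue
        encoded.append("K" if ch == "C" else ch)
    return "".join(encoded)

# phono_shape_code("comma") -> "KM"; phono_shape_code("come") -> "KM"
```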
The association sub-rule includes: determining the character associated with a third character, obtained according to a preset association rule, as the third error correction result after error correction of the initial recognition result. The third character is a character obtained, according to the voice recognition method, before the target character is obtained; the preset association rule corresponds to the target type of the third character; and the preset association rule is used to indicate the association relationship between the target character and characters of the target type. In the embodiment of this application, the association sub-rule may determine a plurality of characters as the third error correction result according to the preset association rule for the third character. If the third error correction result is finally determined to be the target character, the electronic device displays all of these characters for the user to select from after receiving the target character.
In the embodiment of the present application, the association sub-rule may include a plurality of preset association rules, each corresponding to one type of character among the preset characters. Each preset association rule may be determined according to the input habits of the user, or according to the statistically observed input habits of most users. Exemplarily, the association sub-rule and corresponding examples in the embodiment of the present application are shown in Table 1 below:
TABLE 1 Association rules
Based on this, in the account or password character input scenario of this application, taking as an example that the third character, obtained the last time the character voice data input by the user was recognized according to the voice recognition method provided in this application, is "a", the server will use "b" and/or "s" as the third error correction result of the initial recognition result according to the association sub-rule.
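Since Table 1 is not reproduced here, the sketch below encodes only the single example used in this description (the third character "a" associating to "b" and "s"); the rule entries and the type classification are illustrative assumptions rather than the content of Table 1.

```python
# Preset association rules keyed by the target type of the third character.
PRESET_ASSOCIATION_RULES = {
    "letter": {"a": ["b", "s"]},    # example used in this description
    "digit": {},                    # assumed empty here
}

def apply_association(third_char):
    """Return the third error correction result (possibly several characters), or None."""
    if not third_char:
        return None                 # first character of the input session: no third character yet
    target_type = "digit" if third_char.isdigit() else "letter"
    return PRESET_ASSOCIATION_RULES.get(target_type, {}).get(third_char.lower())

# apply_association("a") -> ["b", "s"]; apply_association(None) -> None
```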
In some embodiments, after the server obtains the error correction result of the initial recognition result according to the preset character error correction rule, the obtained error correction result may be determined as the target character.
Based on this, when the preset character error correction rule includes the character mapping sub-rule, referring to fig. 16 in combination with fig. 7, S705 may specifically be S705A:
S705A, the server determines a first error correction result obtained according to the character mapping sub-rule as a target character.
In some implementations, because the preset character mapping table relied on by the character mapping sub-rule uses full-character matching and must be constructed manually according to the pronunciation of each character, certain errors or omissions are unavoidable. Moreover, given the complexity of different languages and the imperfection of the preset speech recognition model, the preset character mapping table of each language cannot completely represent the mapping relationships between the preset characters and all possible optional recognition results; some gaps inevitably exist. Therefore, in practice, if the preset character error correction rule includes the character mapping sub-rule, the server may not always find a first recognition result matching (or equal to) the initial recognition result in the preset character mapping table according to the character mapping sub-rule, in which case the first error correction result corresponding to the initial recognition result cannot be determined.
In this case, if the preset character error correction rule does not include other sub-rules, the server returns the initial recognition result to the television as the target character for display. If the user does not agree with the result, the user can subsequently modify it via the remote controller or the television's touch screen.
Alternatively, the server may send the initial recognition result to the television and simultaneously inform the television that the error correction of the initial recognition result is not performed. At this time, the television can display corresponding prompt information to inform the user, and display corresponding pop-up windows to indicate the user to determine whether to take the initial recognition result as the final recognition result. And if the user instructs the television through corresponding operation and determines that the initial recognition result is taken as the final recognition result of the target character voice data, the television displays the initial recognition result as the target character in the input box. If the user instructs the television through corresponding operation, and the initial recognition result is determined not to be used as the final recognition result of the target character voice data, the television instructs the user to input characters (or character voice data) again.
In the case that the preset character error correction rule includes the sound-shape code matching sub-rule, referring to fig. 17 in combination with fig. 7, S705 may specifically be S705B:
S705B, the server determines a second error correction result obtained according to the sound-shape code matching sub-rule as a target character.
In some embodiments, because users' pronunciations are not standardized, the sound-shape code of the initial recognition result (obtained after the preset speech recognition model recognizes the pronunciation of a certain character) may not match (or equal) any first sound-shape code in the sound-shape code dictionary; in that case the second character cannot be determined and the second error correction result cannot be obtained.
In this case, the server may proceed as in the foregoing case where the first error correction result cannot be obtained, and details are not repeated here.
In the case that the preset character error correction rule includes an association sub-rule, referring to fig. 18 in combination with fig. 7, S705 may specifically be S705C:
S705C, the server determines the third error correction result obtained according to the association sub-rule as the target character.
Since the association sub-rule corrects the initial recognition result based on the third character previously obtained by the server, once the third character exists, the server can always obtain the third error correction result according to the association sub-rule and determine it as the target character.
It should be noted that if the target character voice data input to the television by the current user is the character voice data input for the first time in the current account and/or password input scenario, the third character does not exist. In this case, the server cannot obtain the third error correction result. At this time, the specific implementation of the server is similar to the foregoing implementation that cannot obtain the first error correction result, and details thereof are not repeated here.
S706, the server sends the target character to the television.
After the server identifies the target character voice data and corrects errors to obtain the target character, the server can send the target character to the television so that the television can display the target character. Thereby enabling the user to know the voice recognition result in time.
And S707, the television receives the target character from the server and displays the target character.
In an implementation manner, the target character obtained by the server through recognizing and correcting the target character voice data includes only one character (for example, the target character obtained through correcting the error according to the character mapping sub-rule or the phono-configurational code matching sub-rule). Taking the target character as 2 as an example, with reference to fig. 8 and with reference to fig. 19, the tv set may display "2" at the current character position in the input box 802.
In another implementation, the target character obtained by the server after recognizing and correcting the target character voice data includes a plurality of characters (e.g., a target character obtained according to the association sub-rule). Taking the third character being a and the target character comprising b and s as an example, with reference to fig. 8 and (a) in fig. 20, the television may display a selection popup 2001 after receiving the target character. The selection popup 2001 may include a plurality of selection options, the selection options corresponding one-to-one to the characters in the target character. For example, as shown in fig. 20, the selection popup 2001 may include a selection option 2002 and a selection option 2003, the selection option 2002 corresponding to the character "b" and the selection option 2003 corresponding to the character "s".
The user may then perform a trigger operation on a first selection option, and in response to the user's trigger operation on the first selection option in the selection popup 2001, the television may display the character of the target character that corresponds to the first selection option. For example, if the first selection option is selection option 2002, the character corresponding to it is "b". At this time, referring to (b) in fig. 20, the television may display "b" at the current character position in the input box 802.
In addition, if none of the characters corresponding to the selection options in the selection popup 2001 is the character the user intended, the user can input the intended character using the keyboard 804 so that the television displays it. Alternatively, as shown in fig. 20 (a), the selection popup 2001 may also include a cancel option 2004. When triggered by the user, the cancel option 2004 may be used to trigger the television to re-receive character voice data input by the user. That is, if the user performs a trigger operation on the cancel option, the user may input character voice data to the television again so that the television, in conjunction with the server, re-recognizes it.
In addition, in the embodiment of the application, in order to enable the user to know the initial recognition result obtained by the preset speech recognition model, the server may also send the initial recognition result to the television when sending the target character to the television. When the television receives the initial recognition result, the television can display the initial recognition result in a specific area. Illustratively, referring to fig. 21 in conjunction with fig. 10, the specific area may be the vicinity of the target area, and the television may display the initial recognition result, for example, two, in the specific area.
Based on the technical solution provided in this application, when the display device needs to recognize received character voice data (such as target character voice data), it can send the target character voice data to the server. After obtaining the target character voice data, the server can perform initial recognition on it based on the existing preset speech recognition model to obtain an initial recognition result. The initial recognition result is then corrected based on the preset character error correction rule, so as to obtain the target character corresponding to the target character voice data. The server may then send the target character to the display device for display. The entire voice recognition process is carried out on top of the existing voice recognition scheme in which the display device and the server cooperate, that is, the preset speech recognition model is used for initial recognition. On this basis, the preset character error correction rule is used to correct the initial recognition result so as to obtain a more accurate recognition result, namely the target character. The preset character error correction rule may simply be several error correction tables, which occupy very few computing resources. Therefore, compared with the prior art, the technical solution provided in this application can accurately recognize the characters input by the user through voice without increasing the occupancy rate of computing resources.
Furthermore, because the preset character error correction rule is easy to change, once a subsequent user needs to recognize other characters, the corresponding error correction rule can be added to the preset character error correction rule (or the sub-rule included therein), which is convenient and easy to operate.
The foregoing S705A, S705B and S705C respectively address the case where the preset character error correction rule includes a single sub-rule; in practice, the preset character error correction rule may include a plurality of sub-rules. Based on this, in some embodiments, in the case that the preset character error correction rule includes any two or all of the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, referring to fig. 22 in conjunction with fig. 7, S705 may be S705D:
S705D, the server sequentially selects sub-rules included in the preset character error correction rule according to the preset sequence to correct the error of the initial recognition result until the error correction result of the initial recognition result is obtained, and determines the error correction result of the initial recognition result as the target character.
In this way, since not all sub-rules in the preset character error correction rule necessarily have to be used in the process of obtaining the target character, the technical solution provided in this application can ensure that an accurate target character is obtained while reducing the waste of computing resources as much as possible.
For example, the preset order may be the order of decreasing weight. Among the sub-rules that the preset character error correction rule may include, the character mapping sub-rule is a full-character matching method that makes use of the preset speech recognition model, so this sub-rule is best suited to correcting the initial recognition result and its weight may be the largest. Next, the sound-shape code matching sub-rule is based on the sound-shape code; compared with the preset association rules in the association sub-rule, which are derived from user habits or experience, the sound-shape code matching is at least performed on the basis of the initial recognition result (whose sound-shape code must be obtained), so its weight may be second to that of the character mapping sub-rule. The weight of the association sub-rule is the smallest.
Based on the above description, in some embodiments, when the preset order is the order from larger weight to smaller weight, the weight of the character mapping sub-rule is greater than that of the sound-shape code matching sub-rule, the weight of the sound-shape code matching sub-rule is greater than that of the association sub-rule, and the preset character error correction rule includes the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, then referring to fig. 23 in combination with fig. 22, S705D may include S7051D to S7056D:
S7051D, the server corrects the initial recognition result according to the character mapping sub-rule.
S7052D, if the server obtains the first error correction result according to the character mapping sub-rule, the server determines the first error correction result as the target character.
S7053D, if the server does not obtain the first error correction result according to the character mapping sub-rule, the server corrects the initial recognition result according to the sound-shape code matching sub-rule.
S7054D, if the server obtains a second error correction result according to the sound-shape code matching sub-rule, the server determines the second error correction result as the target character.
S7055D, if the server does not obtain a second error correction result according to the sound-shape code matching sub-rule, the server corrects the initial recognition result according to the association sub-rule.
S7056D, if the server obtains a third error correction result according to the association sub-rule, the server determines the third error correction result as the target character.
If the server does not obtain the third error correction result according to the association sub-rule, reference may be made to the relevant description after S705C in the foregoing embodiment for specific implementation, and details are not described here again.
Certainly, the technical solution corresponding to the above S7051D-S7056D is only one possible implementation manner; after the preset order is changed, any other feasible implementation manner may also be adopted in practice. In addition, the implementation when the preset character error correction rule includes two sub-rules can be derived by reference to the above technical solution, and is not described in detail here.
Based on the technical solution corresponding to S7051D-S7056D, when the preset character error correction rule includes the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, and the weights of these sub-rules decrease in that order, the character mapping sub-rule is preferentially used to correct the initial recognition result, and if this succeeds, the error correction result is determined as the target character. If it fails, the sound-shape code matching sub-rule is used to correct the initial recognition result, and if this succeeds, that error correction result is determined as the target character. If it fails again, the association sub-rule is used to correct the initial recognition result to obtain an error correction result, which is determined as the target character. The combination of the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule can therefore always produce an error correction result for the initial recognition result, while not all sub-rules are necessarily used. That is to say, based on this implementation manner, the technical solution provided in this application can ensure that an accurate target character is obtained while reducing the waste of computing resources as much as possible.
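Putting the pieces together, S7051D-S7056D can be sketched as trying the sub-rules in descending weight order and stopping at the first one that yields a result; the helper functions reuse the sketches above, and the fallback for the case where no sub-rule applies follows the behaviour described after S705A. This is an illustrative reading only, not the claimed implementation.

```python
def correct_initial_result(initial_result: str, third_char=None):
    """Apply the preset character error correction rule in the preset (weight) order."""
    candidates = (
        lambda: apply_char_mapping(initial_result),                              # largest weight
        lambda: apply_phono_code_matching(phono_shape_code(initial_result)),     # next
        lambda: apply_association(third_char),                                   # smallest weight
    )
    for sub_rule in candidates:
        result = sub_rule()
        if result is not None:
            return result            # error correction result -> target character
    return initial_result            # no sub-rule applied: return the initial result for the user to confirm
```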
In the foregoing embodiment, in the whole process, the television receives the target character voice data of the user, and then sends the target character voice data to the target server, so that the server performs recognition by using the preset voice recognition model existing in the server and performs error correction by using the preset character error correction rule in the error correction module, thereby obtaining the target character and returning the target character to the television.
In other embodiments, the preset speech recognition model and the error correction module may also be disposed in the television, in which case all of steps S701 to S707 are performed by the television itself; the specific implementation can easily be derived from the foregoing embodiments and is not described in detail here. In this case, voice recognition may be faster than with the voice recognition method provided in the foregoing embodiments, but the computing resource and storage resource requirements on the television are higher. Furthermore, in the voice recognition method provided in the foregoing embodiments, if the communication connection between the television and the server is broken due to a network problem or any other possible problem, the target character voice data cannot be recognized, whereas the method in which the television performs the voice recognition itself does not have this problem.
In addition, since the preset speech recognition model and the preset character error correction rules in the rule module may be updated according to user needs, it is more convenient to perform such updates on the server. Therefore, in order to update the preset speech recognition model and the preset character error correction rules stored in the television in time, the television may periodically (e.g., once a month, or at an interval set by the user) acquire the latest preset speech recognition model and preset character error correction rules from the server. In this way, the television can recognize the target character voice data input by the user more accurately.
The solutions provided in the embodiments of the present application have been described above mainly from the perspective of the method. To implement the above functions, corresponding hardware structures and/or software modules for performing the respective functions are included. Those skilled in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the server and the electronic device may be divided into the functional modules according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
Referring to fig. 24, an embodiment of the present application provides a server, which may include a communication module 241 and a processing module 242. The processing module 242 may include the error correction module mentioned in the foregoing embodiments.
Specifically, the communication module 241 is configured to receive target character voice data from the display device; the processing module 242 is configured to input the target character voice data received by the communication module 241 into a preset voice recognition model to obtain an initial recognition result; the processing module 242 is further configured to correct errors of the initial recognition result according to a preset character error correction rule to obtain a target character corresponding to the target character voice data; the communication module 241 is further configured to send the target character obtained by the processing module 242 to the display device.
In some implementable examples, the preset character error correction rules include at least any one or more of the following sub-rules: character mapping sub-rule, phono configurational code matching sub-rule and association sub-rule.
Wherein, the character mapping sub-rule comprises: if the initial recognition result is matched with a first recognition result in a preset character mapping table, determining a first character associated with the first recognition result in the preset character mapping table as a first error correction result after error correction of the initial recognition result; the preset character mapping table is used for indicating the incidence relation between the preset characters and the optional recognition results; the first recognition result is one of all the selectable recognition results, and the first character is a preset character which is associated with the first recognition result in a preset character mapping table;
the sound-shape code matching sub-rule comprises the following steps: if the sound-shape codes of the initial recognition result are matched with the first sound-shape codes in the sound-shape code dictionary, determining second characters, associated with the first sound-shape codes, in the sound-shape code dictionary as a second error correction result after the error correction of the initial recognition result; the sound-shape code dictionary is used for indicating the incidence relation between the preset characters and the sound-shape codes of the preset characters; the first phonetic-shape code is one of the phonetic-shape codes of all preset characters, and the second character is a preset character to which the first phonetic-shape code in the phonetic-shape code dictionary belongs;
the association sub-rule includes: determining the associated character of the third character obtained according to the preset associated rule as a third error correction result after the error correction of the initial recognition result; the third character is a character obtained before the target character is obtained according to a voice recognition method; the preset association rule corresponds to the target type of the third character; and the preset association rule is used for indicating the association relationship between the third error correction result and the characters of the target type.
In some implementable examples, in the case that the preset character error correction rule includes the character mapping sub-rule, the processing module 242 is specifically configured to: determine a first error correction result obtained according to the character mapping sub-rule as the target character; in the case that the preset character error correction rule includes the sound-shape code matching sub-rule, the processing module 242 is specifically configured to: determine a second error correction result obtained according to the sound-shape code matching sub-rule as the target character; in the case that the preset character error correction rule includes the association sub-rule, the processing module 242 is specifically configured to: determine a third error correction result obtained according to the association sub-rule as the target character.
In some implementable examples, in the case that the preset character error correction rule includes any two or all of the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, the processing module 242 is specifically configured to: select, in the preset order, the sub-rules included in the preset character error correction rule to correct the initial recognition result until an error correction result of the initial recognition result is obtained, and determine the error correction result of the initial recognition result as the target character.
In some implementable examples, in the case that the preset order is the order from larger weight to smaller weight, the weight of the character mapping sub-rule is greater than that of the sound-shape code matching sub-rule, the weight of the sound-shape code matching sub-rule is greater than that of the association sub-rule, and the preset character error correction rule includes the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, the processing module 242 is specifically configured to: correct the initial recognition result according to the character mapping sub-rule; if a first error correction result is obtained according to the character mapping sub-rule, determine the first error correction result as the target character; if no first error correction result is obtained according to the character mapping sub-rule, correct the initial recognition result according to the sound-shape code matching sub-rule; if a second error correction result is obtained according to the sound-shape code matching sub-rule, determine the second error correction result as the target character; if no second error correction result is obtained according to the sound-shape code matching sub-rule, correct the initial recognition result according to the association sub-rule; and if a third error correction result is obtained according to the association sub-rule, determine the third error correction result as the target character.
With regard to the server in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the foregoing embodiment of the speech recognition method, and will not be elaborated here.
Referring to fig. 25, an embodiment of the present application further provides a display device, which may include an obtaining module 251 and a sending module 252.
Specifically, the obtaining module 251 is configured to obtain target character voice data; a sending module 252, configured to send the target character voice data obtained by the obtaining module 251 to the server; the obtaining module 251 is further configured to receive a target character sent from a server; and the target character is obtained by the server after the initial recognition result of the voice data of the target character is obtained by using the preset voice recognition model and the initial recognition result is corrected based on the preset character error correction rule.
In a possible example, the display device further includes a display module 253, and after the obtaining module 251 receives the target character sent from the server, the display module 253 is used for displaying the target character.
In one possible example, the display module 253 is specifically configured to: if the target character comprises a plurality of characters, displaying a selection popup window; the selection popup window comprises a plurality of selection options, and the selection options correspond to characters included in the target characters one by one; and responding to the trigger operation of selecting the first selection option in the popup window, and displaying characters corresponding to the first selection option in the target characters.
With regard to the display device in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the foregoing embodiment of the speech recognition method, and will not be elaborated here.
It should be understood that the division of units or modules (hereinafter referred to as units) in the above apparatus is only a division of logical functions, and may be wholly or partially integrated into one physical entity or physically separated in actual implementation. And the units in the device can be realized in the form of software called by the processing element; or may be implemented entirely in hardware; part of the units can also be realized in the form of software called by a processing element, and part of the units can be realized in the form of hardware.
For example, each unit may be a processing element separately set up, or may be implemented by being integrated into a chip of the apparatus, or may be stored in a memory in the form of a program, and a function of the unit may be called and executed by a processing element of the apparatus. In addition, all or part of the units can be integrated together or can be independently realized. The processing element described herein, which may also be referred to as a processor, may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each unit above may be implemented by an integrated logic circuit of hardware in a processor element or in a form called by software through the processor element.
In one example, the units in the above apparatus may be one or more integrated circuits configured to implement the above method, such as: one or more ASICs, or one or more DSPs, or one or more FPGAs, or a combination of at least two of these integrated circuit forms.
As another example, when a unit in a device may be implemented in the form of a processing element scheduler, the processing element may be a general purpose processor, such as a CPU or other processor capable of invoking programs. As another example, these units may be integrated together, implemented in the form of a system on chip SOC.
In one implementation, the means for implementing the respective corresponding steps of the above method by the above apparatus may be implemented in the form of a processing element scheduler. For example, the apparatus may include a processing element and a memory element, the processing element invoking a program stored by the memory element to perform the speech recognition method described in the above method embodiments. The memory elements may be memory elements on the same chip as the processing elements, i.e. on-chip memory elements.
In another implementation, the program for performing the above method may be in a memory element on a different chip than the processing element, i.e. an off-chip memory element. At this time, the processing element calls or loads a program from the off-chip storage element onto the on-chip storage element to call and execute the voice recognition method described in the above method embodiment.
An embodiment of the present application further provides a server, where the server may include: communication interface, processor, memory, bus; the memory is used for storing computer execution instructions, and the processor is connected with the memory through a bus; when the server is running, the processor executes the computer-executable instructions stored by the memory to cause the server to perform the various functions or steps performed by the server as in the above-described method embodiments.
An embodiment of the present application further provides an electronic device, where the electronic device may include: a display screen, a memory, and one or more processors. The display screen, memory and processor are coupled. The memory is for storing computer program code comprising computer instructions. When the processor executes the computer instructions, the electronic device may perform the functions or steps performed by the electronic device (e.g., a television) in the above-described method embodiments.
For example, the embodiment of the present application also provides a chip, and the chip may be applied to the display device or the server. The chip includes one or more interface circuits and one or more processors; the interface circuit and the processor are interconnected through a line; the processor receives and executes computer instructions from the memory of the display device through the interface circuitry to implement the methods described in the method embodiments above.
Embodiments of the present application also provide a computer-readable storage medium having computer program instructions stored thereon. The computer program instructions, when executed by the server, cause the server to implement a speech recognition method as described above.
Embodiments of the present application further provide a computer-readable storage medium having computer program instructions stored thereon. The computer program instructions, when executed by the display device, cause the display device to implement the speech recognition method as described above.
Embodiments of the present application further provide a computer program product, which includes computer instructions executed by the server as described above; when the computer program product runs in the server, the server is enabled to implement the speech recognition method as described above.
The embodiment of the present application further provides a computer program product, which includes computer instructions executed by the display device as described above, and when the computer program product is executed in the display device, the display device is enabled to implement the speech recognition method as described above.
The embodiment of the present application further provides a speech recognition system, which includes the server and the display device in the foregoing embodiments. The server and the display device are configured to perform corresponding steps or functions in the voice recognition method in the foregoing embodiments.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product. The software product is stored in a program product, such as a computer-readable storage medium, and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the protection scope of the present application is not limited thereto; any change or substitution within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A speech recognition method applied to a server, the method comprising:
receiving target character voice data from a display device;
inputting the target character voice data into a preset voice recognition model to obtain an initial recognition result;
correcting the initial recognition result according to a preset character error correction rule to obtain a target character corresponding to the target character voice data;
and sending the target character to the display device, so that the display device displays the target character.
2. The method according to claim 1, wherein the preset character error correction rule comprises any one or more of the following sub-rules: a character mapping sub-rule, a sound-shape code matching sub-rule, and an association sub-rule;
wherein the character mapping sub-rule comprises: if the initial recognition result matches a first recognition result in a preset character mapping table, determining a first character associated with the first recognition result in the preset character mapping table as a first error correction result obtained by correcting the initial recognition result; the preset character mapping table is used for indicating an association relationship between preset characters and optional recognition results; the first recognition result is one of the optional recognition results, and the first character is the preset character associated with the first recognition result in the preset character mapping table;
the sound-shape code matching sub-rule comprises the following steps: if the sound-shape codes of the initial recognition result are matched with first sound-shape codes in a sound-shape code dictionary, determining second characters, associated with the first sound-shape codes, in the sound-shape code dictionary as a second error correction result after the initial recognition result is subjected to error correction; the sound-shape code dictionary is used for indicating the incidence relation between a preset character and the sound-shape code of the preset character; the first sound-shape code is one of sound-shape codes of all the preset characters, and the second character is a preset character to which the first sound-shape code belongs in the sound-shape code dictionary;
the association sub-rule includes: determining the associated character of the third character obtained according to a preset associated rule as a third error correction result after the error correction of the initial recognition result; the third character is a character obtained before the target character is obtained according to the voice recognition method; the preset association rule corresponds to a target type of the third character; the preset association rule is used for indicating the association relationship between the third error correction result and the characters of the target type.
3. The method according to claim 2, wherein:
when the preset character error correction rule includes the character mapping sub-rule, the correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data comprises: determining the first error correction result obtained according to the character mapping sub-rule as the target character;
when the preset character error correction rule includes the sound-shape code matching sub-rule, the correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data comprises: determining the second error correction result obtained according to the sound-shape code matching sub-rule as the target character;
and when the preset character error correction rule includes the association sub-rule, the correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data comprises: determining the third error correction result obtained according to the association sub-rule as the target character.
4. The method according to claim 2, wherein, in a case that the preset character error correction rule includes any two or all of the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, the correcting the initial recognition result according to the preset character error correction rule to obtain the target character corresponding to the target character voice data comprises:
and sequentially selecting the sub-rules included in the preset character error correction rule in a preset order to correct the initial recognition result until an error correction result of the initial recognition result is obtained, and determining the error correction result of the initial recognition result as the target character.
5. The method according to claim 4, wherein, when the preset order is a descending order of weights, the weight of the character mapping sub-rule is greater than the weight of the sound-shape code matching sub-rule, the weight of the sound-shape code matching sub-rule is greater than the weight of the association sub-rule, and the preset character error correction rule includes the character mapping sub-rule, the sound-shape code matching sub-rule, and the association sub-rule, the sequentially selecting the sub-rules included in the preset character error correction rule in the preset order to correct the initial recognition result until the error correction result of the initial recognition result is obtained, and determining the error correction result of the initial recognition result as the target character, comprises:
correcting the initial recognition result according to the character mapping sub-rule;
if the first error correction result is obtained according to the character mapping sub-rule, determining the first error correction result as the target character;
if the first error correction result is not obtained according to the character mapping sub-rule, performing error correction on the initial recognition result according to the sound-shape code matching sub-rule;
if the second error correction result is obtained according to the sound-shape code matching sub-rule, determining the second error correction result as the target character;
if the second error correction result is not obtained according to the sound-shape code matching sub-rule, performing error correction on the initial recognition result according to the association sub-rule;
and if the third error correction result is obtained according to the association sub-rule, determining the third error correction result as the target character. (An illustrative sketch of this error correction cascade is given after the claims.)
6. A speech recognition method, applied to a display device, the method comprising:
acquiring target character voice data;
sending the target character voice data to a server;
receiving a target character sent by the server, wherein the target character is obtained by the server by obtaining an initial recognition result of the target character voice data using a preset voice recognition model and then correcting the initial recognition result based on a preset character error correction rule.
7. The speech recognition method of claim 6, wherein after receiving the target character transmitted from the server, the method further comprises:
and displaying the target character.
8. The speech recognition method of claim 7, wherein the displaying the target character comprises:
if the target character comprises a plurality of characters, displaying a selection popup window, wherein the selection popup window comprises a plurality of selection options, and the selection options are in one-to-one correspondence with the characters included in the target character;
and in response to a trigger operation on a first selection option in the selection popup window, displaying the character corresponding to the first selection option in the target character. (A device-side sketch of this flow is given after the claims.)
9. A server, comprising a communication interface, a processor, a memory, and a bus, wherein the memory is configured to store computer-executable instructions, and the processor is connected to the memory through the bus; when the server runs, the processor executes the computer-executable instructions stored in the memory, so that the server performs the speech recognition method according to any one of claims 1-5.
10. A display device, comprising a display screen, a memory, and one or more processors, wherein the display screen, the memory, and the one or more processors are coupled; the memory stores computer program code comprising computer instructions that, when executed by the one or more processors, cause the display device to perform the speech recognition method according to any one of claims 6-8.
11. A speech recognition system, comprising a server configured to perform the speech recognition method according to any one of claims 1-5 and a display device configured to perform the speech recognition method according to any one of claims 6-8.
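
The error correction cascade recited in claims 2 to 5 can be illustrated with a short sketch. The Python code below is only an illustrative sketch under assumed data structures, not the patented implementation: the mapping table contents, the sound-shape code function, the association rules, and every name in it (CHAR_MAPPING_TABLE, SOUND_SHAPE_DICT, sound_shape_code, correct_character, and so on) are hypothetical stand-ins for the preset tables and rules that the claims leave unspecified.

```python
# Illustrative sketch of the weighted error correction cascade in claims 2-5.
# All table contents, the sound-shape code function, and the association rules
# below are hypothetical placeholders; the claims do not specify them.
from typing import Callable, Dict, Optional

# Character mapping sub-rule data: optional recognition result -> preset character.
CHAR_MAPPING_TABLE: Dict[str, str] = {
    "bee": "B",  # hypothetical example entries
    "sea": "C",
}

# Sound-shape code dictionary: sound-shape code -> the preset character it belongs to.
SOUND_SHAPE_DICT: Dict[str, str] = {
    "code_b": "B",  # hypothetical sound-shape codes
    "code_c": "C",
}

# Association rules keyed by the type of the previously obtained (third) character.
ASSOCIATION_RULES: Dict[str, Callable[[str], str]] = {
    "digit": lambda prev: str((int(prev) + 1) % 10),  # hypothetical rule: next digit
}


def sound_shape_code(text: str) -> str:
    """Hypothetical stand-in for computing the sound-shape code of a recognition result."""
    return "code_" + text[:1].lower()


def classify(char: str) -> str:
    """Hypothetical type classifier for the previously recognized (third) character."""
    return "digit" if char.isdigit() else "other"


def correct_character(initial_result: str, previous_char: Optional[str]) -> Optional[str]:
    """Apply the sub-rules in descending weight order (claim 5):
    character mapping, then sound-shape code matching, then association."""
    # 1. Character mapping sub-rule (claim 2, highest weight).
    mapped = CHAR_MAPPING_TABLE.get(initial_result)
    if mapped is not None:
        return mapped  # first error correction result

    # 2. Sound-shape code matching sub-rule.
    by_code = SOUND_SHAPE_DICT.get(sound_shape_code(initial_result))
    if by_code is not None:
        return by_code  # second error correction result

    # 3. Association sub-rule, driven by the type of the previously obtained character.
    if previous_char is not None:
        rule = ASSOCIATION_RULES.get(classify(previous_char))
        if rule is not None:
            return rule(previous_char)  # third error correction result

    return None  # no sub-rule produced a correction


if __name__ == "__main__":
    print(correct_character("bee", None))  # "B" via the character mapping table
    print(correct_character("bay", None))  # "B" via sound-shape code matching
    print(correct_character("zzz", "3"))   # "4" via the hypothetical association rule
```

Under these assumptions, the cascade stops at the first sub-rule that yields a result, which mirrors the descending order of weights recited in claim 5.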
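
Claims 6 to 8 describe the display device side: acquire the voice data, send it to the server, and display the returned target character, showing a selection popup when more than one candidate is returned. The sketch below is again only illustrative; the transport, the popup widget, and every name in it (acquire_voice_data, send_to_server, show_selection_popup, and so on) are assumed stand-ins, since the claims do not prescribe a concrete interface.

```python
# Illustrative device-side flow for claims 6-8. Every function that would touch
# real hardware or a real network is stubbed out; all names and return values
# here are assumptions rather than the patented implementation.
from typing import List


def acquire_voice_data() -> bytes:
    """Stub: a real display device would record audio from its microphone."""
    return b"fake-pcm-audio"


def send_to_server(voice_data: bytes) -> List[str]:
    """Stub: a real device would send the audio to the server and receive the
    target character(s) produced by the recognition model and error correction."""
    return ["B", "P"]  # pretend the server returned two candidate characters


def show_selection_popup(candidates: List[str]) -> str:
    """Stub: would display one selection option per candidate (claim 8) and
    return the character tied to the option the user triggers."""
    return candidates[0]


def display(character: str) -> None:
    """Stub: would draw the character on the display screen."""
    print(f"Displaying: {character}")


def handle_voice_input() -> None:
    voice_data = acquire_voice_data()    # claim 6: acquire target character voice data
    target = send_to_server(voice_data)  # claim 6: send to server, receive target character(s)
    if len(target) > 1:                  # claim 8: multiple candidates -> selection popup
        display(show_selection_popup(target))
    else:                                # claim 7: single result -> display it directly
        display(target[0])


if __name__ == "__main__":
    handle_voice_input()
```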

Priority Applications (1)

Application Number: CN202211275976.0A (published as CN115588431A); Priority Date: 2022-10-18; Filing Date: 2022-10-18; Title: Voice recognition method and display device

Applications Claiming Priority (1)

Application Number: CN202211275976.0A (published as CN115588431A); Priority Date: 2022-10-18; Filing Date: 2022-10-18; Title: Voice recognition method and display device

Publications (1)

Publication Number: CN115588431A; Publication Date: 2023-01-10

Family

ID=84779738

Family Applications (1)

Application Number: CN202211275976.0A (published as CN115588431A); Title: Voice recognition method and display device; Priority Date: 2022-10-18; Filing Date: 2022-10-18; Status: Pending

Country Status (1)

Country: CN; Link: CN115588431A

Similar Documents

Publication Publication Date Title
US20220043628A1 (en) Electronic device and method for generating short cut of quick command
AU2015375326B2 (en) Headless task completion within digital personal assistants
US20210082412A1 (en) Real-time feedback for efficient dialog processing
CN117056622A (en) Voice control method and display device
CN112511882B (en) Display device and voice call-out method
CN112885354B (en) Display device, server and display control method based on voice
CN112002321B (en) Display device, server and voice interaction method
CN114118064A (en) Display device, text error correction method and server
WO2022100283A1 (en) Display device, control triggering method and scrolling text detection method
US10529324B1 (en) Geographical based voice transcription
CN110543290B (en) Multimodal response
CN115273848A (en) Display device and control method thereof
CN112926420B (en) Display device and menu character recognition method
CN115588431A (en) Voice recognition method and display device
CN112256232B (en) Display device and natural language generation post-processing method
CN113035194B (en) Voice control method, display device and server
US20190050391A1 (en) Text suggestion based on user context
CN112885347A (en) Voice control method of display device, display device and server
CN114155846A (en) Semantic slot extraction method and display device
CN113593559A (en) Content display method, display equipment and server
CN109032379B (en) Language option display method and terminal
CN113079400A (en) Display device, server and voice interaction method
US20160283453A1 (en) Text correction using a second input
CN111914114A (en) Badcase mining method and electronic equipment
CN113076427B (en) Media resource searching method, display equipment and server

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination