WO2019001194A1 - Speech recognition method, apparatus, device and storage medium - Google Patents

Speech recognition method, apparatus, device and storage medium (语音识别方法、装置、设备及存储介质)

Info

Publication number
WO2019001194A1
WO2019001194A1 (PCT/CN2018/088646)
Authority
WO
WIPO (PCT)
Prior art keywords
result
keyword
candidate recognition
selection rule
voice
Prior art date
Application number
PCT/CN2018/088646
Other languages
English (en)
French (fr)
Inventor
郑平
饶丰
卢鲤
李涛
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to KR1020197028881A (patent KR102315732B1)
Priority to JP2019560155A (patent JP6820058B2)
Priority to EP18825077.3A (patent EP3648099B1)
Publication of WO2019001194A1
Priority to US16/547,097 (patent US11164568B2)


Classifications

    • G PHYSICS → G10 MUSICAL INSTRUMENTS; ACOUSTICS → G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING → G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L2015/081 Search algorithms, e.g. Baum-Welch or Viterbi
    • G10L2015/085 Methods for reducing search complexity, pruning
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • The embodiments of the present invention relate to the field of computers, and in particular to a speech recognition method, apparatus, device, and storage medium.
  • Speech recognition technology recognizes speech information as text information by means of a speech recognition device. It is widely used in scenarios such as voice dialing, voice navigation, smart home control, voice search, and dictation data entry.
  • The embodiments of the present invention provide a speech recognition method, apparatus, device, and storage medium, which can solve the problem of poor real-time performance when a target recognition result is selected from a plurality of candidate recognition results, caused by the long time a speech recognition device spends computing perplexity with an RNN language model. The technical solutions are as follows:
  • According to one aspect, a speech recognition method is provided, comprising:
  • acquiring a voice signal; recognizing the voice signal according to a speech recognition algorithm to obtain n candidate recognition results, where a candidate recognition result is text information corresponding to the voice signal and n is an integer greater than 1;
  • determining a target result among the n candidate recognition results according to the selection rule whose execution order is j among m selection rules, where the target result is the candidate recognition result with the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1;
  • when the target result is not determined according to the selection rule whose execution order is j, determining the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
  • According to another aspect, a candidate recognition result selection apparatus is provided, comprising:
  • a signal acquisition module configured to acquire a voice signal;
  • a speech recognition module configured to recognize the voice signal acquired by the signal acquisition module according to a speech recognition algorithm to obtain n candidate recognition results, where a candidate recognition result is text information corresponding to the voice signal and n is an integer greater than 1;
  • a determining module configured to determine a target result among the n candidate recognition results identified by the speech recognition module according to the selection rule whose execution order is j among m selection rules, where the target result is the candidate recognition result with the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1;
  • the determining module further configured to, when the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
  • According to another aspect, a speech recognition device is provided, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the speech recognition method provided by the first aspect.
  • According to another aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the speech recognition method provided by the first aspect.
  • The target result is selected from the n candidate recognition results by sequentially executing at least one of m selection rules, where the algorithmic complexity of each selection rule is lower than that of computing perplexity with the RNN language model. This solves the problem that computing perplexity with the RNN language model takes so long that selecting the target result from multiple candidate recognition results has poor real-time performance. When the target result can be determined after executing only one selection rule, and the complexity of that rule is lower than that of computing perplexity with the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
  • FIG. 1 is a schematic structural diagram of a voice recognition system according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a voice recognition method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a voice recognition method according to another embodiment of the present application.
  • FIG. 4 is a schematic diagram of a first correspondence and a second correspondence provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a voice recognition method according to another embodiment of the present application.
  • FIG. 6 is a flowchart of a voice recognition method according to another embodiment of the present application.
  • FIG. 7 is a block diagram of a voice recognition apparatus according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
  • Speech recognition device: an electronic device having the function of recognizing a speech signal as text information.
  • the voice recognition device may be a server installed with a voice recognition engine through which the voice recognition device recognizes the voice signal as text information.
  • Optionally, the voice signal received by the speech recognition device may be collected by the speech recognition device itself through an audio collection component, or it may be collected by a voice receiving device through its audio collection component and then sent to the speech recognition device, where the voice receiving device is an electronic device independent of the speech recognition device.
  • Optionally, the voice receiving device may be a mobile phone, a tablet computer, a smart speaker, a smart TV, a smart air purifier, a smart air conditioner, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and the like.
  • the voice recognition device may also be a mobile phone, a tablet computer, a smart speaker, a smart TV, a smart air purifier, a smart air conditioner, etc., which is not limited in this embodiment.
  • In this embodiment, the description takes as an example the case where the speech recognition device is a server that receives the voice signal sent by the voice receiving device.
  • Candidate recognition result: a piece of text information recognized by the speech recognition device from a given speech signal.
  • The speech recognition device may recognize multiple candidate recognition results from one speech signal. When at least two candidate recognition results are recognized, a target result needs to be selected from them; the target result is the candidate recognition result with the highest degree of matching with the speech signal. How to select the candidate recognition result that best matches the speech signal is therefore particularly important.
  • A typical speech recognition method in the related art works as follows: after the speech recognition device obtains n candidate recognition results, it computes the perplexity of each candidate recognition result according to a Recurrent Neural Network (RNN) language model, and determines the candidate recognition result with the minimum perplexity as the target result. The RNN language model is trained on a general corpus. Perplexity indicates the similarity between a candidate recognition result and the speech signal and is negatively correlated with that similarity; the target result is the candidate recognition result with the highest degree of matching with the speech signal among the n candidate recognition results, where n is an integer greater than 1.
  • FIG. 1 is a schematic structural diagram of a voice recognition system provided by an embodiment of the present application.
  • the system includes at least one voice receiving device 110 and a voice recognition device 120.
  • Optionally, the voice receiving device 110 may be a mobile phone, a tablet computer, a smart speaker, a smart TV, a smart air purifier, a smart air conditioner, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, a desktop computer, and the like, which is not limited in this embodiment.
  • An audio collection component 111 is installed in the voice receiving device 110.
  • the audio collection component 111 is for acquiring voice signals.
  • a connection is established between the voice receiving device 110 and the voice recognition device 120 via a wireless network or a wired network. After the voice receiving device 110 collects the voice signal through the audio collecting component 111, the voice signal is sent to the voice recognition device 120 through the connection.
  • The speech recognition device 120 is configured to recognize a voice signal as text information (candidate recognition results). Optionally, when multiple candidate recognition results are recognized, the speech recognition device 120 is configured to select a target result from the plurality of candidate recognition results and feed the target result back to the voice receiving device 110.
  • Optionally, the speech recognition device 120 may be implemented as a server or a server cluster, which is not limited in this embodiment. Optionally, when the physical hardware of a mobile terminal such as a mobile phone, tablet computer, smart speaker, smart TV, smart air purifier, smart air conditioner, e-book reader, MP3 player, MP4 player, or laptop portable computer supports the required operations, the speech recognition device 120 may also be implemented as at least one of these mobile terminals, which is not limited in this embodiment.
  • Optionally, the wireless or wired network described above uses standard communication techniques and/or protocols. The network is usually the Internet, but may be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, data exchanged over the network is represented using techniques and/or formats such as HyperText Markup Language (HTML) and Extensible Markup Language (XML). In addition, all or some links may be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the above.
  • For convenience of description, the embodiments of the present application are described by taking the speech recognition device as the execution subject of each embodiment.
  • FIG. 2 shows a flowchart of a voice recognition method provided by an exemplary embodiment of the present application. This embodiment is exemplified by the method applied to the voice recognition device. The following steps can be included:
  • Step 101 Acquire a voice signal.
  • the voice signal is sent by the voice receiving device to the voice recognition device; or it is collected by the voice recognition device; or is input to the voice recognition device through the mobile storage device.
  • Step 102 Identify a speech signal according to a speech recognition algorithm, and obtain n candidate recognition results.
  • the candidate recognition result refers to text information corresponding to the voice signal, and n is an integer greater than 1.
  • The speech recognition algorithm is used to recognize the speech signal as at least one piece of text information. Optionally, the speech recognition algorithm may be a parallel algorithm improved from the Viterbi algorithm, a serial algorithm improved from the Viterbi algorithm, or a tree-Trellis algorithm, which is not limited in this embodiment.
  • Optionally, the speech recognition algorithm has the function of preliminarily sorting the n candidate recognition results, so that the n candidate recognition results obtained by the speech recognition device carry a sequence identifier; when selecting the target result, the speech recognition device checks the candidates one by one in the order indicated by the sequence identifier to detect whether each is the target result.
  • Optionally, the speech recognition device may also recognize only one candidate recognition result, which is not limited in this embodiment.
  • Step 103: Determine a target result among the n candidate recognition results according to the selection rule whose execution order is j among the m selection rules.
  • The target result is the candidate recognition result with the highest degree of matching with the speech signal among the n candidate recognition results; m is an integer greater than 1, the initial value of j is 1, and 1 ≤ j ≤ m-1.
  • the execution order of the m selection rules is determined according to the algorithm complexity of each selection rule, and the complexity of the algorithm is positively correlated with the execution order. That is, the lower the complexity of the algorithm, the smaller the sequence number of the execution order, and the higher the execution order; the higher the complexity of the algorithm, the larger the sequence number of the execution sequence, and the later the execution order.
  • the algorithm complexity of the selection rule is negatively correlated with the speed of selecting the target result. That is, the higher the complexity of the algorithm, the slower the selection of the target result; the lower the complexity of the algorithm, the faster the selection of the target result.
  • Optionally, the algorithmic complexity of each selection rule is represented by a complexity level. Schematically, the complexity levels are identified as 1, 2, and 3, where a smaller value indicates lower algorithmic complexity.
  • Optionally, the execution order of the m selection rules is specified by the developer; since the algorithmic complexity of each of the m selection rules is lower than that of computing perplexity with the RNN language model, whichever selection rule is executed first, the speech recognition device selects the target result faster than it would by computing perplexity with the RNN language model.
  • the execution order can be represented by the execution order identifier.
  • the execution order is identified as #1, #2, #3, where #1 indicates that the execution order is 1, #2 indicates that the execution order is 2, and #3 indicates that the execution order is 3.
  • the execution order of the m selection rules is randomly selected.
  • Step 104: When the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
  • The speech recognition device may fail to determine the target result according to the selection rule whose execution order is j. In that case, it continues with the selection rule whose execution order is j+1, until the target result among the n candidate recognition results is determined, at which point the process ends.
  • Optionally, after determining the target result, the speech recognition device reorders the n candidate recognition results: the target result is placed first; the best of the remaining n-1 candidate recognition results is placed second; the best of the remaining n-2 candidate recognition results is placed third; and so on.
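The flow of Steps 103 and 104 can be sketched as a cascade that tries each selection rule in ascending order of algorithmic complexity and stops at the first rule that determines a target result. This is a minimal illustration only; the two stand-in rules below are hypothetical and do not reproduce the patent's actual matching logic:

```python
from typing import Callable, List, Optional

# Each selection rule takes the candidate list and returns the target
# result, or None if it cannot decide. Rules are ordered by ascending
# algorithmic complexity (execution order j = 1 .. m).
SelectionRule = Callable[[List[str]], Optional[str]]

def select_target(candidates: List[str], rules: List[SelectionRule]) -> Optional[str]:
    """Try rule j; if it fails to determine a target, fall through to j+1."""
    for rule in rules:             # execution order j = 1, 2, ..., m
        target = rule(candidates)
        if target is not None:     # rule j determined the target result
            return target
    return None                    # no rule decided

# Hypothetical stand-in rules, for illustration only.
def command_rule(cands):   # lowest complexity: exact command-lexicon hit
    lexicon = {"pause", "play", "previous", "next"}
    return next((c for c in cands if c in lexicon), None)

def dialog_rule(cands):    # highest complexity: here, just pick the first candidate
    return cands[0] if cands else None

print(select_target(["refill", "field", "fill", "pause"],
                    [command_rule, dialog_rule]))   # prints: pause
```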
  • In summary, in the speech recognition method provided by this embodiment, the target result is selected from the n candidate recognition results by sequentially executing at least one of the m selection rules, where the algorithmic complexity of each selection rule is lower than that of computing perplexity with the RNN language model. This solves the problem that computing perplexity with the RNN language model takes so long that selecting the target result from multiple candidate recognition results has poor real-time performance. Since the complexity of each selection rule is lower than that of computing perplexity with the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
  • Optionally, the m selection rules in this embodiment are determined according to different usage scenarios. Optionally, the m selection rules include at least two of a command selection rule, a function selection rule, and a dialog selection rule.
  • In a command scenario (i.e., the voice signal is a message in command form), the target result can be identified by the command selection rule among the m selection rules; in a functional scenario (i.e., the voice signal is a functional message), the target result can be identified by the function selection rule among the m selection rules; in a conversation scenario (i.e., the voice signal is a message in dialogue form), the target result can be identified by the dialog selection rule among the m selection rules.
  • A message in command form is used to instruct the voice receiving device to execute a certain command. Schematically, a message in command form may be "previous", "next", "pause", "play", and the like.
  • The variations of a message in command form are regular and limited in number. Taking the command-form message "previous" as an example, it may be varied as "go back to the previous one", "please play the previous one", "please play the first one", "please switch to the first one", "please switch to the previous one", and so on; the variations are regular and their number is limited.
  • Based on this, a command lexicon is preset in the speech recognition device. The command lexicon includes a plurality of command keywords, and the command selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether the command lexicon includes a command keyword matching the i-th candidate recognition result, where 1 ≤ i ≤ n.
  • A functional message is used to instruct the voice receiving device to execute a certain command according to at least one voice keyword; for example, the functional message "play Jay Chou's song". Based on this, a function template library and a voice lexicon are preset in the speech recognition device, and the function selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether the voice lexicon includes a lexicon keyword matching the voice keyword, where the voice keyword is at least one keyword in the i-th candidate recognition result.
  • A message in dialogue form is a message that is irregular and whose number of variations is unknown. Schematically, dialogue messages include "What are you doing?", "Are you free today?", "The movie is really beautiful", and so on. Based on this, a language model is pre-trained in the speech recognition device, and the dialog selection rule instructs the speech recognition device to determine the similarity between each candidate recognition result and the speech signal according to the trained language model, so as to select the target result.
  • the algorithm complexity of the command selection rule is lower than the algorithm complexity of the function selection rule, and the algorithm complexity of the function selection rule is lower than the algorithm complexity of the dialog selection rule.
  • the voice recognition device preferentially executes the command selection rule to select the target result; when the target result is not selected according to the command selection rule, the function selection rule is executed to select the target result; when the target result is not selected according to the function selection rule, Then execute the dialog selection rule to select the target result.
  • The algorithmic complexity of the command selection rule, the function selection rule, and the dialog selection rule is far lower than that of selecting the target result according to the RNN language model; even if the speech recognition device executes the command selection rule, the function selection rule, and the dialog selection rule in sequence to determine the target result, the total time consumed is still less than the total time consumed by selecting the target result according to the RNN language model.
  • Selecting the target result according to the command selection rule (see the embodiment shown in FIG. 3), selecting the target result according to the function selection rule (see the embodiment shown in FIG. 5), and selecting the target result according to the dialog selection rule (see the embodiment shown in FIG. 6) are introduced separately below.
  • FIG. 3 shows a flowchart of a voice recognition method provided by another embodiment of the present application.
  • This embodiment is exemplified by applying the speech recognition method to a speech recognition device.
  • the method can include the following steps:
  • Step 201: Detect whether the first correspondence in the command lexicon includes a command keyword matching the i-th candidate recognition result.
  • The first correspondence includes correspondences between index values and command keywords. Optionally, the first correspondence is implemented as a forward table. The forward table includes at least one key-value pair; the key in each key-value pair is a hash value (the index value) of a command keyword, and the value in each key-value pair is the command keyword.
  • This embodiment does not limit the number of key-value pairs in the first correspondence. Schematically, the number of key-value pairs in the first correspondence is 1000.
  • Step 202: When the first correspondence includes a command keyword matching the i-th candidate recognition result, determine the i-th candidate recognition result as the target result, and the process ends.
  • Optionally, when at least two of the n candidate recognition results match command keywords, the speech recognition device may use the first such candidate recognition result as the target result, or the speech recognition device may perform step 203 to select the target result again from among the at least two candidate recognition results.
  • Step 203: When the first correspondence does not include a command keyword matching any of the n candidate recognition results, detect whether the second correspondence in the command lexicon includes a keyword matching any word in the i-th candidate recognition result.
  • The second correspondence includes correspondences between index values and keywords, where a command keyword is composed of keywords. Optionally, the second correspondence is implemented as an inverted table. The inverted table includes at least one key-value pair; the key in each key-value pair is the hash value of a keyword, and the value in each key-value pair is at least one index value in the first correspondence corresponding to that keyword. Optionally, the key of each key-value pair in the second correspondence may also be the keyword itself.
  • Step 204: Look up, in the first correspondence, the command keyword corresponding to the index value that corresponds to the keyword in the second correspondence.
  • Since command keywords are composed of keywords, and different command keywords may include the same keyword, the speech recognition device may find at least one command keyword according to the index value corresponding to the keyword, that is, the value of the keyword's key-value pair in the second correspondence.
  • By combining the first correspondence and the second correspondence to detect a command keyword matching the i-th candidate recognition result, the speech recognition device does not need to store every variation of every command keyword; it only needs to store the keywords included in those variations to determine the corresponding command keyword, which saves storage space in the speech recognition device.
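The two tables of Steps 201 through 204 can be sketched as a pair of dictionaries: a forward table mapping an index value (here simply a hash of the full command keyword) to the keyword, and an inverted table mapping each constituent word's hash to the index values of the command keywords that contain it. The table contents below are illustrative assumptions, not the patent's actual data:

```python
# Forward table (first correspondence): index value -> command keyword.
forward = {
    hash("pause"): "pause",
    hash("play previous"): "play previous",
}

# Inverted table (second correspondence): word hash -> index values of every
# command keyword containing that word. Storing only the constituent words,
# rather than every phrasing variant, is what saves storage space.
inverted = {}
for idx, keyword in forward.items():
    for word in keyword.split():
        inverted.setdefault(hash(word), []).append(idx)

def lookup_by_word(word: str):
    """Steps 203-204: word hash -> index values -> command keywords."""
    return [forward[idx] for idx in inverted.get(hash(word), [])]

print(lookup_by_word("previous"))   # the keyword(s) containing "previous"
print(lookup_by_word("banana"))     # no match -> empty list
```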
  • Step 205: Determine the edit distance between the i-th candidate recognition result and the command keyword.
  • The edit distance (Levenshtein distance) indicates the number of operations required to convert the i-th candidate recognition result into the command keyword; the conversion operations include, but are not limited to, replacement, insertion, and deletion. Optionally, the speech recognition device may have found multiple command keywords in step 204; in that case it determines the edit distance between the i-th candidate recognition result and each command keyword.
  • Schematically, if the i-th candidate recognition result is "at stop" and the speech recognition device determines that the command keyword is "pause", the speech recognition device only needs a single replacement to convert "at stop" into "pause", so the edit distance between the i-th candidate recognition result and the command keyword is 1.
  • Step 206: When the edit distance is less than the preset value, determine the i-th candidate recognition result as the target result.
  • When the edit distance is less than the preset value, it indicates that the i-th candidate recognition result is similar to the command keyword, and in this case the i-th candidate recognition result is determined as the target result.
  • The preset value is usually small, and its value is not limited in this embodiment; illustratively, the preset value is 2.
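As an illustrative sketch (not part of the claimed embodiment), the edit distance and the threshold check described above can be implemented as follows; the preset value 2 matches the illustrative value given in this embodiment:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    replacements, insertions, and deletions that turn a into b."""
    prev = list(range(len(b) + 1))        # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                         # deleting i characters of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[-1] + 1,               # insertion
                           prev[j - 1] + (ca != cb))) # replacement or match
        prev = cur
    return prev[-1]

PRESET = 2  # illustrative preset value from this embodiment

def is_target(candidate: str, command_keyword: str) -> bool:
    """A candidate is taken as the target result when its edit distance
    to the command keyword is less than the preset value."""
    return edit_distance(candidate, command_keyword) < PRESET
```

With this threshold, the one-replacement variant "在停" is accepted for the command keyword "暂停", while an unrelated candidate is rejected.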
  • In the first correspondence, each key-value pair is composed of an index value and a command keyword; the second correspondence includes three key-value pairs, each composed of a hash value and an index value.
  • Suppose the four candidate recognition results are: "refill", "field", "fill", and "pause". The speech recognition device calculates the hash values of the four candidate recognition results: the hash value of "refill" is 1, that of "field" is 2, that of "fill" is 3, and that of "pause" is 4. Since the keys in the first correspondence include 4, "pause" is determined as the target result.
  • Suppose instead the four candidate recognition results are: "refill", "field", "fill", and "at stop". The speech recognition device calculates their hash values: the hash value of "refill" is 1, that of "field" is 2, that of "fill" is 3, and that of "at stop" is 5. Since the keys in the first correspondence include none of 1, 2, 3, and 5, the voice recognition device calculates the hash value of each word in each candidate recognition result.
  • For the candidate recognition result "at stop" ("在停"), the hash value of "在" is 11 and the hash value of "停" is 12. The keys in the second correspondence include 12, so the voice recognition device finds the index value 4 corresponding to 12 in the second correspondence, and then looks up the command keyword "暂停" ("pause") corresponding to the index value 4 in the first correspondence. The edit distance between "在停" and "暂停" is 1, which is less than the preset value 2, so "at stop" is determined as the target result.
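Combining the two correspondences, the per-word lookup in the example above can be sketched as follows. The table contents, hash function, and threshold are illustrative stand-ins, not the patent's concrete data structures:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (replacements, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# First correspondence: index value -> command keyword (hypothetical content).
FIRST = {4: "暂停"}
# Second correspondence (inverted table): hash(keyword) -> index values.
SECOND = {hash(w): [4] for w in "暂停"}

def match_command(candidate: str, preset: int = 2):
    """Return the command keyword matching the candidate, or None."""
    # Direct hit: the whole candidate is itself a command keyword.
    if candidate in FIRST.values():
        return candidate
    # Otherwise hash each word of the candidate into the inverted table.
    for word in candidate:
        for index in SECOND.get(hash(word), []):
            keyword = FIRST[index]
            if edit_distance(candidate, keyword) < preset:
                return keyword
    return None  # no command keyword found by this rule
```

Only the component keywords "暂" and "停" are stored in the inverted table, yet the variant "在停" still resolves to the command keyword "暂停", which illustrates the storage saving described above.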
  • Optionally, when no target result is selected according to the command selection rule, the voice recognition device continues to select the target result according to other selection rules; or determines the first candidate recognition result as the target result; or selects no target result and ends the process. The other selection rules are the function selection rule or the dialog selection rule.
  • the voice recognition device may also determine the candidate recognition result with the smallest edit distance as the target result.
  • In summary, the speech recognition method provided by the present application selects the target result among the n candidate recognition results by the command selection rule. When the target result can be determined by executing only the command selection rule, since the algorithm complexity of the command selection rule is lower than that of calculating perplexity according to an RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
  • In addition, by combining the first correspondence and the second correspondence to detect the command keyword matching the i-th candidate recognition result, the voice recognition device does not need to store all variants of the command keywords; storing only the keywords included in those variants is enough to determine the corresponding command keyword, which saves storage space of the voice recognition device.
  • Optionally, the voice recognition device sends the target result to the voice receiving device, and the voice receiving device performs a corresponding operation according to the command corresponding to the target result. For example, if the voice receiving device is a smart speaker and the target result is "pause", the smart speaker pauses the currently playing audio information after receiving the target result.
  • FIG. 5 shows a flowchart of a voice recognition method provided by another embodiment of the present application.
  • This embodiment is exemplified by applying the speech recognition method to a speech recognition device.
  • the method can include the following steps:
  • Step 401: Analyze the function template of the i-th candidate recognition result, 1 ≤ i ≤ n.
  • a function template library is preset in the voice recognition device, and the function template library includes at least one function template.
  • The function template is represented by a regular expression. Illustratively, a function template is "a song of (.+)".
  • This embodiment does not limit the number of function templates in the function template library; illustratively, the number of function templates in the function template library is 540.
  • regular expressions are used to retrieve and/or replace text information that conforms to a function template.
  • the speech recognition device analyzes the function template in the i-th candidate recognition result by matching the i-th candidate recognition result with each function template in the function template library.
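A minimal sketch of the function selection rule follows, with a hypothetical two-template library and a toy phonetic lexicon (the application cites roughly 540 templates and about 1 million lexicon keywords):

```python
import re

# Hypothetical function templates, expressed as regular expressions.
TEMPLATES = [
    re.compile(r"^我想听(.+)的歌$"),    # "I want to listen to (.+)'s songs"
    re.compile(r"^我想听(.+)的(.+)$"),  # "I want to listen to (.+)'s (.+)"
]

# Toy phonetic lexicon of lexicon keywords.
LEXICON = {"周杰伦", "童安格"}

def select_by_function_rule(candidate: str):
    """Return the voice keyword when some template matches the candidate
    and the captured keyword is found in the phonetic lexicon, else None."""
    for template in TEMPLATES:
        m = template.match(candidate)
        if m and m.group(1) in LEXICON:
            return m.group(1)
    return None
```

A candidate whose captured voice keyword is absent from the lexicon (e.g. "花样") is rejected, mirroring the worked example below.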
  • Step 402: Detect whether the phonetic lexicon includes a lexicon keyword that matches the voice keyword in the i-th candidate recognition result.
  • The i-th candidate recognition result includes a function template and at least one voice keyword: after the speech recognition device analyzes the function template of the i-th candidate recognition result, the remaining keywords in the i-th candidate recognition result are used as the voice keywords.
  • A phonetic lexicon is preset in the voice recognition device, and the phonetic lexicon includes at least one lexicon keyword. This embodiment does not limit the number of lexicon keywords in the phonetic lexicon; illustratively, the number of lexicon keywords in the phonetic lexicon is 1 million.
  • The voice recognition device matches the voice keyword in the i-th candidate recognition result against the lexicon keywords in the phonetic lexicon one by one.
  • Step 403: When the phonetic lexicon includes a lexicon keyword that matches the voice keyword in the i-th candidate recognition result, the i-th candidate recognition result is determined as the target result, and the process ends.
  • Optionally, when no target result is selected according to the function selection rule, the voice recognition device continues to select the target result according to other selection rules; or determines the first candidate recognition result as the target result; or selects no target result and ends the process. The other selection rules are the command selection rule or the dialog selection rule.
  • The case in which no target result is selected according to the function selection rule includes, but is not limited to, the following: the voice recognition device does not analyze a function template from any candidate recognition result; or the voice recognition device does not find, in the phonetic lexicon, a lexicon keyword that matches the voice keyword in any candidate recognition result.
  • For example, the speech recognition device obtains three candidate recognition results: 1. "I want to listen to pattern songs"; 2. "I want to listen to Tong Ange"; 3. "I want to listen to Tong Ange's songs".
  • The speech recognition device matches the above three candidate recognition results against the function templates in the function template library, and obtains: the function template of the first candidate recognition result is "I want to listen to (.+)'s song"; the function template of the second candidate recognition result is "I want to listen to (.+) (.+)"; the function template of the third candidate recognition result is "I want to listen to (.+)'s song".
  • For the first candidate recognition result, the voice keyword is "pattern song"; for the second candidate recognition result, the voice recognition device uses the first keyword as the voice keyword, that is, the voice keyword is "Tong Ange"; for the third candidate recognition result, the voice keyword is "Tong Ange".
  • The speech recognition device sequentially matches the voice keyword in each candidate recognition result against the lexicon keywords in the phonetic lexicon. When the voice keyword in the second candidate recognition result is matched against the lexicon keywords, the speech recognition device can determine a lexicon keyword that matches the voice keyword, and then determines the second candidate recognition result as the target result.
  • Optionally, for the second candidate recognition result, the voice recognition device may also use all keywords as voice keywords. In that case, if the phonetic lexicon includes a lexicon keyword matching "Tong Ange" but no lexicon keyword matching the other voice keyword, the speech recognition device sequentially matches the voice keywords in the candidate recognition results against the lexicon keywords; when the voice keyword in the third candidate recognition result matches a lexicon keyword, the speech recognition device determines the matching lexicon keyword and then determines the third candidate recognition result as the target result.
  • In summary, the speech recognition method provided by the present application selects the target result among the n candidate recognition results by the function selection rule. When the target result can be determined by executing only the function selection rule, since the algorithm complexity of the function selection rule is lower than that of calculating perplexity according to an RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
  • Optionally, the voice recognition device transmits the target result to the voice receiving device, and the voice receiving device performs a corresponding operation according to the voice keyword in the target result. For example, if the voice receiving device is a smart speaker and the target result is "play Jay Chou's songs", the smart speaker receives the target result, searches for Jay Chou's songs, and plays the audio information corresponding to the search result.
  • Optionally, the voice recognition device performs a search according to the voice keyword in the target result and sends the search result to the voice receiving device, and the voice receiving device plays the audio information corresponding to the search result. For example, if the voice receiving device is a smart speaker and the target result is "play Jay Chou's songs", the voice recognition device searches for Jay Chou's songs according to the voice keyword "Jay Chou" in the target result and sends the search result to the smart speaker, and the smart speaker plays the audio information corresponding to the search result.
  • FIG. 6 shows a flowchart of a voice recognition method provided by another embodiment of the present application.
  • This embodiment is exemplified by applying the speech recognition method to a speech recognition system.
  • the method can include the following steps:
  • Step 501: Calculate the perplexity of each candidate recognition result according to the language model.
  • Perplexity is used to indicate the degree of similarity between a candidate recognition result and the speech signal; perplexity is negatively correlated with similarity.
  • the language model is a mathematical model used to describe the intrinsic laws of natural language.
  • The language model is an N-gram language model generated from a specific corpus corresponding to at least one domain. The N-gram language model is used to determine the occurrence probability of the current word according to the occurrence probabilities of the previous N-1 words, where N is a positive integer. When N is 3, the 3-gram language model is also called the Tri-gram language model; when N is 2, the 2-gram language model is also called the Bi-gram language model.
  • the N-gram language model describes the nature and relationship of natural language basic units such as words, phrases and sentences through probability and distribution functions, and embodies the rules of generation and processing based on statistical principles in natural language.
  • In this embodiment, the speech recognition device calculates the perplexity of each candidate recognition result according to the 3-gram language model or the 2-gram language model as an example.
  • The 3-gram language model is represented by the following formula:
  • p(S) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|wn-2,wn-1)
  • where p(S) represents the occurrence probability of the candidate recognition result; p(w1) represents the occurrence probability of the first word in the candidate recognition result; p(w2|w1) represents the probability that the second word appears given the first word; p(w3|w1,w2) represents the probability that the third word appears given the first and second words; and p(wn|wn-2,wn-1) represents the probability that the n-th word appears given the previous two words, that is, the (n-2)-th and (n-1)-th words.
  • The 2-gram language model is represented by the following formula:
  • p(S) = p(w1)p(w2|w1)p(w3|w2)...p(wn|wn-1)
  • where p(S) represents the occurrence probability of the candidate recognition result; p(w1) represents the occurrence probability of the first word in the candidate recognition result; p(w2|w1) represents the probability that the second word appears given the first word; p(w3|w2) represents the probability that the third word appears given the second word; and p(wn|wn-1) represents the probability that the n-th word appears given the previous ((n-1)-th) word.
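The 2-gram factorization above can be made concrete with a toy maximum-likelihood model. The two-sentence corpus is purely illustrative; a real model would be trained on a large domain-specific corpus and smoothed:

```python
from collections import Counter

# Toy corpus standing in for the domain-specific corpus.
corpus = [["i", "want", "to", "listen"], ["i", "want", "a", "song"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total_words = sum(unigrams.values())

def p_sentence(sentence):
    """2-gram estimate: p(S) = p(w1) * product of p(wk | wk-1)."""
    p = unigrams[sentence[0]] / total_words
    for prev_w, cur_w in zip(sentence, sentence[1:]):
        p *= bigrams[(prev_w, cur_w)] / unigrams[prev_w]
    return p
```

Each factor p(wk | wk-1) is estimated as count(wk-1, wk) / count(wk-1), directly mirroring the formula.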
  • At least one field includes but is not limited to the following: weather field, music field, math field, sports field, computer field, home field, geographic field, natural field.
  • At least one field may also include other fields, which is not limited in this embodiment.
  • Optionally, the speech recognition device calculates the perplexity of each candidate recognition result according to the language model by a preset formula.
  • The perplexity can be regarded as the reciprocal of the geometric mean of the per-word occurrence probabilities predicted by the language model, so the occurrence probability of a candidate recognition result is negatively correlated with its perplexity: the greater the occurrence probability, the lower the perplexity; the smaller the occurrence probability, the higher the perplexity.
  • When calculating the perplexity of each candidate recognition result according to the language model by the preset formula, the speech recognition device first calculates the cross entropy of each candidate recognition result, and then determines the perplexity of the candidate recognition result according to the cross entropy and the preset formula.
  • The cross entropy is used to represent the difference between the model language determined by the language model and the candidate recognition result. The smaller the cross entropy, the smaller the difference between the model language and the candidate recognition result, and the higher the matching degree between the candidate recognition result and the speech signal; the larger the cross entropy, the larger the difference, and the lower the matching degree.
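Under the usual definition (an assumption here, since the patent does not spell out its preset formula), perplexity is two to the power of the cross entropy, i.e. the reciprocal of the geometric mean of the per-word probabilities:

```python
import math

def perplexity(word_probs):
    """Perplexity from per-word probabilities assigned by a language model.
    Lower perplexity means the model found the word sequence more likely."""
    # Cross entropy H = -(1/n) * sum(log2 p); perplexity = 2 ** H.
    n = len(word_probs)
    h = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** h
```

The monotone relation is what Step 502 relies on: the candidate whose per-word probabilities are highest yields the lowest perplexity.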
  • the language model may be other types, such as a neural network language model, which is not limited in this embodiment.
  • Step 502: Determine the minimum perplexity among the n candidate recognition results, and determine the i-th candidate recognition result corresponding to that minimum as the target result.
  • The lower the perplexity, the higher the degree of similarity between the candidate recognition result and the voice signal; therefore, the i-th candidate recognition result corresponding to the minimum perplexity is determined as the target result.
  • In summary, the speech recognition method provided by the present application selects the target result among the n candidate recognition results by the dialog selection rule. When the target result can be determined by executing only the dialog selection rule, since the algorithm complexity of the dialog selection rule is lower than that of calculating perplexity according to an RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
  • the voice recognition device transmits the target result to the voice receiving device, and the voice receiving device acquires the dialog information according to the target result.
  • For example, if the voice receiving device is a smart speaker and the target result is "what are you doing", the smart speaker generates the dialog information according to a dialog model after receiving the target result.
  • Optionally, the voice recognition device generates the dialog information according to the target result and sends the dialog information to the voice receiving device, and the voice receiving device plays the audio information corresponding to the dialog information. For example, if the voice receiving device is a smart speaker and the target result is "what are you doing", the voice recognition device generates dialog information according to the target result and sends it to the smart speaker, and the smart speaker plays the audio information corresponding to the dialog information.
  • It should be noted that any two of the embodiment shown in FIG. 3, the embodiment shown in FIG. 5, and the embodiment shown in FIG. 6 can be combined to form a new embodiment, or all three embodiments can be combined to form a new embodiment. When combined, the command selection rule is the first selection rule, the function selection rule is the second selection rule, and the dialog selection rule is the third selection rule.
  • FIG. 7 shows a block diagram of a speech recognition apparatus provided by an embodiment of the present application.
  • The apparatus has the function of performing the above method examples, and the function may be implemented by hardware, or by hardware executing corresponding software.
  • the device may include: a signal acquisition module 610, a voice recognition module 620, and a determination module 630.
  • a signal acquisition module 610 configured to acquire a voice signal
  • the voice recognition module 620 is configured to recognize the voice signal acquired by the signal acquisition module 610 according to a voice recognition algorithm to obtain n candidate recognition results, where a candidate recognition result refers to text information corresponding to the voice signal, and n is an integer greater than 1;
  • a determining module 630 configured to determine the target result of the n candidate recognition results obtained by the voice recognition module 620 according to the selection rule whose execution order is j among the m selection rules, where the target result refers to the candidate recognition result with the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1;
  • the determining module 630 is further configured to determine the target result of the n candidate recognition results according to the selection rule whose execution order is j+1 when the target result is not determined according to the selection rule whose execution order is j.
  • the execution order of the m selection rules is determined according to the complexity of the respective algorithms, and the execution order is positively correlated with the complexity of the algorithm.
  • the m selection rules include at least two of a command selection rule, a function selection rule, and a dialog selection rule.
  • The algorithm complexity of the command selection rule is lower than that of the function selection rule, and the algorithm complexity of the function selection rule is lower than that of the dialog selection rule.
  • the command selection rule is used to instruct the voice recognition device to detect whether the ith candidate recognition result is the target result according to whether the command vocabulary includes a command keyword matching the ith candidate recognition result, 1 ⁇ i ⁇ n;
  • the function selection rule is used to instruct the voice recognition device to detect whether the i-th candidate recognition result is the target result according to whether the phonetic lexicon includes a lexicon keyword matching the voice keyword, where the voice keyword is at least one keyword in the i-th candidate recognition result;
  • the dialog selection rule is used to instruct the voice recognition device to determine the degree of similarity of each candidate recognition result to the voice signal based on the trained language model to select the target result.
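The cascade that the determining module applies can be sketched as follows; the three rule functions are hypothetical stand-ins ordered by ascending algorithm complexity, not the patent's actual implementations:

```python
def select_target(candidates, rules):
    """Try each selection rule in execution order and return the first
    target result produced; a rule returns None when it selects nothing."""
    for rule in rules:
        target = rule(candidates)
        if target is not None:
            return target
    return None  # no rule selected a target result

# Hypothetical stand-ins for the command / function / dialog rules.
def command_rule(cs):
    return next((c for c in cs if c == "pause"), None)

def function_rule(cs):
    return next((c for c in cs if "song" in c), None)

def dialog_rule(cs):
    return min(cs, key=len)  # stand-in for "lowest perplexity wins"

RULES = [command_rule, function_rule, dialog_rule]
```

Cheaper rules run first, so the expensive perplexity-based dialog rule is only reached when neither the command nor the function rule selects a target.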
  • the determining module 630 includes: a first detecting unit and a first determining unit.
  • a first detecting unit configured to detect whether the first correspondence of the command lexicon includes a command keyword that matches the ith candidate identification result, 1 ⁇ i ⁇ n;
  • a first determining unit configured to determine, when the first correspondence includes a command keyword that matches the i-th candidate recognition result, the i-th candidate recognition result is a target result
  • the first correspondence includes at least a command keyword.
  • Optionally, the determining module 630 further includes: a second detecting unit, a keyword searching unit, a second determining unit, and a third determining unit;
  • a second detecting unit configured to detect, when the first correspondence does not include a command keyword matching any one of the n candidate recognition results, whether the second correspondence in the command lexicon includes a keyword that matches any word in the i-th candidate recognition result;
  • a keyword search unit configured to: when the second correspondence includes a keyword that matches a word in the i-th candidate recognition result, search for the first correspondence according to the index value corresponding to the keyword in the second correspondence The command keyword corresponding to the index value;
  • a second determining unit configured to determine an edit distance between the i-th candidate recognition result and the command keyword, where the edit distance is used to indicate the number of operations required to perform the conversion of the i-th candidate recognition result into the command keyword;
  • a third determining unit configured to determine, when the edit distance is less than the preset value, that the i-th candidate recognition result is the target result
  • the first correspondence includes a correspondence between the index value and the command keyword
  • the second correspondence includes a correspondence between the index value and the keyword
  • the determining module 630 includes: a template analyzing unit, a third detecting unit, and a fourth determining unit.
  • a template analysis unit configured to analyze a function template of the i-th candidate recognition result, 1 ⁇ i ⁇ n;
  • a third detecting unit configured to detect whether the phonetic lexicon includes a lexicon keyword that matches a voice keyword in the ith candidate recognition result
  • a fourth determining unit configured to determine the i-th candidate recognition result as the target result when the phonetic lexicon includes a lexicon keyword that matches the voice keyword in the i-th candidate recognition result, where the voice keyword is at least one keyword in the i-th candidate recognition result;
  • the i-th candidate recognition result includes a function template and a voice keyword.
  • Optionally, the determining module 630 includes: a perplexity calculation unit and a fifth determining unit.
  • The perplexity calculation unit is configured to calculate the perplexity of each candidate recognition result according to the language model;
  • the fifth determining unit is configured to determine the minimum perplexity among the n candidate recognition results and determine the i-th candidate recognition result corresponding to that minimum as the target result;
  • the perplexity is used to indicate the degree of similarity between the candidate recognition result and the voice signal, and is negatively correlated with the degree of similarity;
  • the language model is an N-gram language model generated from a specific corpus corresponding to at least one domain; the N-gram language model is used to determine the occurrence probability of the current word according to the occurrence probabilities of the previous N-1 words, and N is a positive integer.
  • The embodiment of the present application further provides a computer-readable storage medium, which may be the computer-readable storage medium included in the memory described above, or a computer-readable storage medium that exists separately and is not assembled into the voice recognition device.
  • The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the speech recognition method of the various method embodiments described above.
  • FIG. 8 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
  • The voice recognition device 700 includes a central processing unit (CPU) 701, a system memory 704 including a random access memory (RAM) 702 and a read-only memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701.
  • The voice recognition device 700 also includes a basic input/output system (I/O system) 706 that facilitates the transfer of information between devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
  • the basic input/output system 706 includes a display 708 for displaying information and an input device 709 such as a mouse or keyboard for user input of information.
  • the display 708 and input device 709 are both connected to the central processing unit 701 by an input/output controller 710 that is coupled to the system bus 705.
  • The basic input/output system 706 can also include the input/output controller 710 for receiving and processing input from a plurality of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 710 also provides output to a display screen, a printer, or another type of output device.
  • the mass storage device 707 is connected to the central processing unit 701 by a mass storage controller (not shown) connected to the system bus 705.
  • the mass storage device 707 and its associated computer readable medium provide non-volatile storage for the speech recognition device 700. That is, the mass storage device 707 may include a computer readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM).
  • the computer readable medium can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, tape cartridges, magnetic tape, disk storage or other magnetic storage devices.
  • The voice recognition device 700 can also run by connecting to a remote computer through a network such as the Internet. That is, the voice recognition device 700 can be connected to the network 712 through a network interface unit 711 connected to the system bus 705, or can be connected to other types of networks or remote computer systems (not shown) using the network interface unit 711.
  • the voice recognition device 700 further includes a memory and one or more programs, wherein one or more programs are stored in the memory and configured to be executed by one or more processors.
  • the one or more programs described above include instructions for performing the above described speech recognition method.
  • the embodiment of the present application further provides a voice recognition system, where the voice recognition system includes: a smart speaker and a server.
  • the smart speaker may be a voice collecting device as shown in FIG. 1
  • the server may be a voice recognition device as shown in FIG. 1.
  • the intelligent speaker is configured to collect a voice signal and send the voice signal to the server.
  • a server configured to: acquire the voice signal; recognize the voice signal according to a voice recognition algorithm to obtain n candidate recognition results, where a candidate recognition result refers to text information corresponding to the voice signal and n is an integer greater than 1; determine the target result of the n candidate recognition results according to the selection rule whose execution order is j among the m selection rules, where the target result refers to the candidate recognition result with the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1; when the target result is not determined according to the selection rule whose execution order is j, determine the target result of the n candidate recognition results according to the selection rule whose execution order is j+1; and send the target result to the smart speaker.
  • the server performs the identification of the target result according to the voice recognition method shown in any of the above FIGS. 3 to 6.
  • the smart speaker is also used to respond according to the target result.
  • The response includes, but is not limited to, at least one of the following: performing command execution according to the target result, performing a function response according to the target result, and performing a voice dialog according to the target result.
  • command execution is performed according to the target result, including at least one of the following command execution: play, pause, previous, next.
  • the functional response is performed according to the target result, including at least one of the following functional responses: playing a certain singer or a certain song name or a certain style of song, playing a certain host or a certain program name or a certain type of Music programs, voice navigation, schedule reminders, translations.
  • a voice dialogue is performed according to the target result, including at least one of the following dialogue scenarios: weather quiz, knowledge quiz, entertainment chat, and joke explanation.
  • a person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or may be instructed by a program to execute related hardware, and the program may be stored in a computer readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Abstract

A speech recognition method, apparatus, device, and storage medium, belonging to the field of computers. The method includes: acquiring a voice signal (101); recognizing the voice signal according to a speech recognition algorithm to obtain n candidate recognition results (102); determining a target result among the n candidate recognition results according to the selection rule whose execution order is j among m selection rules (103); and, when the target result is not determined according to the selection rule whose execution order is j, determining the target result among the n candidate recognition results according to the selection rule whose execution order is j+1 (104). The method solves the problem that computing perplexity according to an RNN language model takes a long time, resulting in poor real-time performance when selecting a target result from multiple candidate recognition results, and improves the real-time performance of selecting the target result from the n candidate recognition results.

Description

Speech Recognition Method, Apparatus, Device, and Storage Medium
This application claims priority to Chinese Patent Application No. 201710517737.4, entitled "Speech Recognition Method and Apparatus", filed with the State Intellectual Property Office of China on June 29, 2017, which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present application relate to the field of computers, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition technology refers to technology by which a speech recognition device recognizes speech information as text information; it is widely applied in scenarios such as voice dialing, voice navigation, smart home control, voice search, and dictation data entry.
Summary
Embodiments of the present application provide a speech recognition method, apparatus, device, and storage medium, which can solve the problem that selecting a target result from multiple candidate recognition results has poor real-time performance because computing perplexity according to an RNN language model takes a long time. The technical solutions are as follows:
According to one aspect of the present application, a speech recognition method is provided, the method including:
acquiring a voice signal;
recognizing the voice signal according to a speech recognition algorithm to obtain n candidate recognition results, where a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1;
determining a target result among the n candidate recognition results according to the selection rule whose execution order is j among m selection rules, where the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1;
when the target result is not determined according to the selection rule whose execution order is j, determining the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
According to another aspect of the present application, a candidate recognition result selection apparatus is provided, the apparatus including:
a signal acquisition module, configured to acquire a voice signal;
a speech recognition module, configured to recognize, according to a speech recognition algorithm, the voice signal acquired by the signal acquisition module to obtain n candidate recognition results, where a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1;
a determining module, configured to determine, according to the selection rule whose execution order is j among m selection rules, a target result among the n candidate recognition results recognized by the speech recognition module, where the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1;
the determining module being configured to, when the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
According to another aspect of the present application, a speech recognition device is provided, the speech recognition device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the speech recognition method provided in the first aspect.
According to another aspect of the present application, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the speech recognition method provided in the first aspect.
The beneficial effects of the technical solutions provided by the embodiments of the present application include at least the following:
A target result is selected from the n candidate recognition results produced by speech recognition by executing at least one of m selection rules in sequence, where the algorithmic complexity of each selection rule is lower than that of computing perplexity according to an RNN language model. This solves the problem that computing perplexity according to the RNN language model takes a long time, resulting in poor real-time performance when selecting a target result from multiple candidate recognition results. When the target result can be determined by executing only one selection rule, because that rule's algorithmic complexity is lower than that of computing perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application;
FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a flowchart of a speech recognition method according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a first correspondence and a second correspondence according to an embodiment of the present application;
FIG. 5 is a flowchart of a speech recognition method according to another embodiment of the present application;
FIG. 6 is a flowchart of a speech recognition method according to another embodiment of the present application;
FIG. 7 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the following further describes the implementations of the present application in detail with reference to the accompanying drawings.
First, several terms involved in the embodiments of the present application are introduced.
Speech recognition device: an electronic device having the function of recognizing a voice signal as text information.
Optionally, the speech recognition device may be a server installed with a speech recognition engine, through which the speech recognition device recognizes the voice signal as text information.
The voice signal received by the speech recognition device may be collected by the speech recognition device itself through an audio collection component; alternatively, it may be collected by a voice receiving device through an audio collection component and then sent to the speech recognition device. The voice receiving device may be an electronic device independent of the speech recognition device, for example, a mobile phone, a tablet computer, a smart speaker, a smart TV, a smart air purifier, a smart air conditioner, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and so on.
Optionally, the speech recognition device may also be a mobile phone, a tablet computer, a smart speaker, a smart TV, a smart air purifier, a smart air conditioner, or the like, which is not limited in this embodiment.
Optionally, the following description takes as an example the case where the speech recognition device is a server that receives a voice signal sent by a voice receiving device.
Candidate recognition result: for a given voice signal, at least one piece of text information recognized by the speech recognition device.
Optionally, when the speech recognition device obtains at least two candidate recognition results, a target result needs to be selected from the at least two candidate recognition results, where the target result is the candidate recognition result having the highest degree of matching with the voice signal.
In the related art, a voice signal with one pronunciation may correspond to multiple combinations of different characters; for example, "nihao" corresponds to the three combinations "你好", "拟好", and "倪浩". Therefore, the speech recognition device may recognize multiple candidate recognition results from one voice signal. When the speech recognition device recognizes multiple candidate recognition results, how to select the candidate recognition result that best matches the voice signal becomes particularly important.
The related art provides a typical speech recognition method: after obtaining n candidate recognition results, the speech recognition device computes the perplexity of each candidate recognition result according to a Recurrent Neural Network (RNN) language model, and determines the candidate recognition result corresponding to the minimum perplexity as the target result. The RNN language model is trained on a general corpus; the perplexity indicates the degree of similarity between a candidate recognition result and the voice signal and is negatively correlated with that similarity; the target result is the candidate recognition result among the n candidates that best matches the actually received voice signal, and n is an integer greater than 1.
Because computing perplexity according to the RNN language model takes a long time, the real-time performance of selecting the target result from the n candidate recognition results is poor.
Refer to FIG. 1, which shows a schematic structural diagram of a speech recognition system according to an embodiment of the present application. The system includes at least one voice receiving device 110 and a speech recognition device 120.
The voice receiving device 110 may be a mobile phone, a tablet computer, a smart speaker, a smart TV, a smart air purifier, a smart air conditioner, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, or a desktop computer, which is not limited in this embodiment.
An audio collection component 111 is installed in the voice receiving device 110. The audio collection component 111 is configured to collect voice signals.
A connection is established between the voice receiving device 110 and the speech recognition device 120 through a wireless or wired network. After collecting a voice signal through the audio collection component 111, the voice receiving device 110 sends the voice signal to the speech recognition device 120 through the connection.
The speech recognition device 120 is configured to recognize the voice signal as text information (candidate recognition results). Optionally, there are at least two pieces of such text information.
Optionally, when multiple candidate recognition results are recognized, the speech recognition device 120 further selects a target result from the multiple candidate recognition results.
Optionally, after selecting the target result, the speech recognition device 120 feeds the target result back to the voice receiving device 110.
Optionally, the speech recognition device 120 may be implemented as a server or a server cluster, which is not limited in this embodiment.
Optionally, when the physical hardware of a mobile terminal such as a mobile phone, tablet computer, smart speaker, smart TV, smart air purifier, smart air conditioner, e-book reader, MP3 player, MP4 player, or laptop portable computer supports running complex algorithms, the speech recognition device 120 may be implemented as at least one of the above mobile terminals, which is not limited in this embodiment.
Optionally, the above wireless or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but may be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, technologies and/or formats including HyperText Markup Language (HTML) and Extensible Markup Language (XML) are used to represent data exchanged over the network. In addition, all or some links may be encrypted using conventional encryption technologies such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication technologies may also be used in place of, or in addition to, the above data communication technologies.
Optionally, the embodiments of the present application are described taking the speech recognition device as the execution body of each embodiment.
Refer to FIG. 2, which shows a flowchart of a speech recognition method according to an exemplary embodiment of the present application. This embodiment is described with the method applied to a speech recognition device. The method may include the following steps:
Step 101: Acquire a voice signal.
Optionally, the voice signal is sent by a voice receiving device to the speech recognition device; or is collected by the speech recognition device; or is input to the speech recognition device through a removable storage apparatus.
Step 102: Recognize the voice signal according to a speech recognition algorithm to obtain n candidate recognition results.
A candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1.
The speech recognition algorithm is used to recognize the voice signal as at least one piece of text information. The speech recognition algorithm may be a parallel algorithm obtained by improving the Viterbi algorithm; or a serial algorithm obtained by improving the Viterbi algorithm; or a Tree-Trellis algorithm, which is not limited in this embodiment.
Optionally, the speech recognition algorithm has a function of preliminarily ranking the n candidate recognition results. In this case, the n candidate recognition results obtained by the speech recognition device carry order identifiers, so that when selecting the target result, the speech recognition device checks the candidates one by one, in the order indicated by the order identifiers, to determine whether each is the target result.
It should be added that the speech recognition device may also recognize only one candidate recognition result, which is not limited in this embodiment.
Step 103: Determine a target result among the n candidate recognition results according to the selection rule whose execution order is j among m selection rules.
The target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results; m is an integer greater than 1; the initial value of j is 1, and 1 ≤ j ≤ m−1.
Optionally, the execution order of the m selection rules is determined according to the algorithmic complexity of each selection rule, and the algorithmic complexity is positively correlated with the execution order. That is, the lower the algorithmic complexity, the smaller the order number and the earlier the rule is executed; the higher the algorithmic complexity, the larger the order number and the later the rule is executed.
The algorithmic complexity of a selection rule is negatively correlated with the speed of selecting the target result: the higher the complexity, the slower the selection; the lower the complexity, the faster the selection.
Optionally, the algorithmic complexity of each selection rule is represented by a complexity identifier. Illustratively, the complexity identifiers are 1, 2, and 3, where a smaller value indicates lower algorithmic complexity.
Optionally, the execution order of the m selection rules is specified by developers. Because the algorithmic complexity of each of the m selection rules is lower than that of computing perplexity according to the RNN language model, no matter which selection rule is executed first, the speech recognition device selects the target result faster than it would by computing perplexity according to the RNN language model.
In this case, the execution order may be represented by an execution order identifier. Illustratively, the execution order identifiers are #1, #2, and #3, where #1 indicates execution order 1, #2 indicates execution order 2, and #3 indicates execution order 3.
Optionally, the execution order of the m selection rules is selected randomly.
Step 104: When the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
The speech recognition device may fail to determine the target result according to the selection rule whose execution order is j. In that case, the speech recognition device continues to determine the target result according to the selection rule whose execution order is j+1, until the target result among the n candidate recognition results is determined, at which point the procedure ends.
Optionally, the speech recognition device reorders the n candidate recognition results, where the target result among the n candidates is ranked first; the target result among the remaining n−1 candidates (excluding the first) is ranked second; the target result among the remaining n−2 candidates (excluding the first and second) is ranked third; and so on.
In summary, the speech recognition method provided by the present application selects a target result from the n candidate recognition results produced by speech recognition by executing at least one of m selection rules in sequence, where the algorithmic complexity of each selection rule is lower than that of computing perplexity according to an RNN language model. This solves the problem that computing perplexity according to the RNN language model takes a long time, resulting in poor real-time performance when selecting a target result from multiple candidate recognition results. When the target result can be determined by executing only one selection rule, because that rule's algorithmic complexity is lower than that of computing perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
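The fallback across steps 103 and 104 amounts to a simple cascade: try each selection rule in order of increasing complexity and stop at the first rule that yields a target result. The following minimal sketch illustrates that control flow only; the rule functions, the `COMMANDS` set, and the candidate strings are hypothetical placeholders, not the patent's actual implementation:

```python
def select_target(candidates, rules):
    """Try each selection rule in execution order (j = 1, 2, ..., m);
    return the first target result any rule determines, else None."""
    for rule in rules:             # rules sorted by algorithmic complexity
        target = rule(candidates)  # a rule returns a candidate or None
        if target is not None:
            return target
    return None

# Hypothetical rules: an exact command match first, then a catch-all fallback.
COMMANDS = {"pause", "play", "next", "previous"}

def command_rule(candidates):
    return next((c for c in candidates if c in COMMANDS), None)

def fallback_rule(candidates):
    return candidates[0] if candidates else None

print(select_target(["zai ting", "pause"], [command_rule, fallback_rule]))  # pause
```

If the cheap `command_rule` succeeds, the more expensive rules are never run, which is exactly the source of the real-time gain described above.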
Optionally, the m selection rules in this embodiment are selection rules determined according to different usage scenarios. The m selection rules include at least two of a command selection rule, a function selection rule, and a dialogue selection rule. In a command scenario (that is, the voice signal is a command-type message), the target result can be recognized by the command selection rule among the m selection rules; in a function scenario (that is, the voice signal is a functional message), by the function selection rule; in a dialogue scenario (that is, the voice signal is a dialogue-type message), by the dialogue selection rule.
A command-type message instructs the voice receiving device to execute a command. For example, when the voice receiving device is a smart speaker, command-type messages may be "previous", "next", "pause", "play", and so on.
Command-type messages are usually irregular and limited in number. For example, the command message 上一首 ("previous") may vary as 上首, 请播放上一首, 请播放上首, 请切换上首, 请切换上一首, and so on; these variations follow no pattern, and the number of variations is limited.
Because command-type messages are irregular and limited in number, in this embodiment a command lexicon is preset in the speech recognition device. The command lexicon includes multiple command keywords. The command selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether the command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n.
A functional message instructs the voice receiving device to execute a command according to at least one speech keyword; for example, the functional message 播放周杰伦的歌 ("play Jay Chou's songs").
A functional message usually has a fixed-form function template and variable speech keywords. For example, in 播放周杰伦的歌, the function template is 播放()的歌 ("play ()'s songs") and the speech keyword is 周杰伦 (Jay Chou).
Because a functional message usually has a fixed-form function template and variable speech keywords, in this embodiment a function template library and a speech lexicon are preset in the speech recognition device. The function selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether the speech lexicon includes a lexicon keyword matching a speech keyword, where the speech keyword is at least one keyword in the i-th candidate recognition result.
A dialogue-type message is a message that is irregular and whose number of variations is unknown. For example, dialogue messages include "what are you doing", "are you free today", "the movie was great", and so on.
Because dialogue-type messages are irregular and their number of variations is unknown, in this embodiment a pre-trained language model is set in the speech recognition device, and the dialogue selection rule instructs the speech recognition device to select the target result by determining, according to the trained language model, the degree of similarity between each candidate recognition result and the voice signal.
Optionally, in this embodiment, the algorithmic complexity of the command selection rule is lower than that of the function selection rule, and the algorithmic complexity of the function selection rule is lower than that of the dialogue selection rule. Accordingly, the speech recognition device first executes the command selection rule to select the target result; when no target result is selected by the command selection rule, it then executes the function selection rule; when no target result is selected by the function selection rule, it then executes the dialogue selection rule.
Optionally, in this embodiment, the algorithmic complexity of the command selection rule, the function selection rule, and the dialogue selection rule is each far lower than that of selecting the target result according to the RNN language model. Therefore, even if the speech recognition device determines the target result only after executing the command, function, and dialogue selection rules in sequence, the total time consumed is still less than that of selecting the target result according to the RNN language model.
The following separately introduces selecting the target result according to the command selection rule (see the embodiment shown in FIG. 3), according to the function selection rule (see the embodiment shown in FIG. 5), and according to the dialogue selection rule (see the embodiment shown in FIG. 6).
Refer to FIG. 3, which shows a flowchart of a speech recognition method according to another embodiment of the present application. This embodiment is described with the speech recognition method applied to a speech recognition device. The method may include the following steps:
Step 201: Check whether the first correspondence of the command lexicon includes a command keyword matching the i-th candidate recognition result.
The first correspondence includes correspondences between index values and command keywords.
Optionally, the first correspondence is implemented as a forward table that includes at least one key-value pair, where the key of each pair is a hash value (an index value) and the value of each pair is a command keyword.
This embodiment does not limit the number of key-value pairs in the first correspondence; illustratively, the number is 1000.
The speech recognition device checks whether the first correspondence of the command lexicon includes a command keyword matching the i-th candidate recognition result as follows: compute the hash value of the i-th candidate recognition result, and check whether a key equal to that hash value exists in the first correspondence; if it exists, determine that the first correspondence includes a command keyword matching the i-th candidate recognition result, and perform step 202; if not, set i = i + 1 and repeat this step.
Optionally, the first correspondence may simply include at least one command keyword, and the speech recognition device matches the i-th candidate recognition result against each command keyword; if a command keyword exactly matching the i-th candidate recognition result exists in the first correspondence, perform step 202; if not, set i = i + 1 and repeat this step.
Step 202: Determine the i-th candidate recognition result as the target result; the procedure ends.
Optionally, when the first correspondence includes command keywords corresponding to at least two candidate recognition results, the speech recognition device may take the first candidate recognition result as the target result; alternatively, the speech recognition device performs step 203 to select the target result again from the at least two candidates.
Step 203: When the first correspondence does not include a command keyword matching any one of the n candidate recognition results, check whether the second correspondence in the command lexicon includes a key character matching any character in the i-th candidate recognition result.
The second correspondence includes correspondences between index values and key characters; a command keyword consists of key characters.
Optionally, the second correspondence is implemented as an inverted table that includes at least one key-value pair, where the key of each pair is the hash value of a key character and the value of each pair is at least one index value, in the first correspondence, corresponding to that key character.
The speech recognition device checks whether the second correspondence in the command lexicon includes a key character matching any character in the i-th candidate recognition result as follows: compute the hash value of each character in the i-th candidate recognition result; check whether the second correspondence includes a key equal to the hash value of any character; if it does, determine that the second correspondence includes a key character matching a character in the i-th candidate recognition result, and perform step 204; if not, set i = i + 1 and repeat this step.
Optionally, the key of each key-value pair in the second correspondence may also be the key character itself.
Step 204: According to the index value corresponding to the key character in the second correspondence, look up the command keyword corresponding to the index value in the first correspondence.
Because command keywords consist of key characters, and different command keywords may contain the same key character, the number of command keywords found by the speech recognition device according to the index value(s) corresponding to the key character (that is, the value of the key character's key-value pair in the second correspondence) is at least one.
In this embodiment, by combining the first correspondence and the second correspondence to detect a command keyword matching the i-th candidate recognition result, the speech recognition device does not need to store every variant form of every command keyword; it only needs to store the key characters shared by all variants to determine the corresponding command keyword, which saves storage space on the speech recognition device.
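The two-level lookup of steps 201 to 204 can be sketched with two plain dictionaries: a forward table from an index value to a command keyword, and an inverted table from a character to the index values of every keyword containing it. This is only an illustrative sketch; the toy keyword set is hypothetical, and an ordinary Python dict stands in for whatever hash structure the device actually uses:

```python
# Forward table (first correspondence): index value -> command keyword.
forward = {1: "暂停", 2: "播放", 3: "下一首"}

# Inverted table (second correspondence): character -> index values of
# every command keyword that contains that character.
inverted = {}
for idx, keyword in forward.items():
    for ch in keyword:
        inverted.setdefault(ch, []).append(idx)

def keywords_sharing_a_char(candidate):
    """Steps 203-204: collect the command keywords that share at least
    one character with the candidate recognition result."""
    hits = set()
    for ch in candidate:
        for idx in inverted.get(ch, []):
            hits.add(forward[idx])
    return hits

print(keywords_sharing_a_char("在停"))  # {'暂停'} — shares the character 停
```

Only the shared characters are stored once, which is the storage saving the paragraph above describes: variant phrasings of a command never need their own entries.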
Step 205: Determine the edit distance between the i-th candidate recognition result and the command keyword.
The edit distance (also called the Levenshtein distance) indicates the number of operations required to convert the i-th candidate recognition result into the command keyword. The conversion operations include, but are not limited to, substitution, insertion, and deletion.
The speech recognition device may determine multiple command keywords; in that case, it determines the edit distance between the i-th candidate recognition result and each command keyword.
For example, if the i-th candidate recognition result is 在停 and the command keyword determined by the speech recognition device is 暂停, the device only needs to replace 在 with 暂 to convert 在停 into 暂停, so the edit distance between the i-th candidate recognition result and the command keyword is 1.
Step 206: When the edit distance is less than a preset value, determine the i-th candidate recognition result as the target result.
When the edit distance is less than the preset value, the i-th candidate recognition result is highly similar to the command keyword, and the i-th candidate recognition result is determined as the target result.
The preset value is usually small, and this embodiment does not limit its value; illustratively, the preset value is 2.
Refer to the schematic diagram of the first correspondence and the second correspondence shown in FIG. 4, where the first correspondence includes 3 key-value pairs, each consisting of an index value and a command keyword, and the second correspondence includes 3 key-value pairs, each consisting of a hash value and an index value.
Suppose the speech recognition device recognizes 4 candidate recognition results: 再填, 在田, 在填, and 暂停. The device computes their hash values: 再填 has hash value 1, 在田 has 2, 在填 has 3, and 暂停 has 4. Since the keys of the first correspondence include 4, 暂停 is determined as the target result.
Suppose instead the speech recognition device recognizes 4 candidate recognition results: 再填, 在田, 在填, and 在停, with hash values 1, 2, 3, and 5 respectively. The keys of the first correspondence include none of 1, 2, 3, and 5, so the device computes the hash value of each character of each candidate recognition result. For the candidate 在停, the hash value of 在 is 11 and that of 停 is 12; the keys of the second correspondence include 12. According to the index value 4 corresponding to 12 in the second correspondence, the device looks up the command keyword 暂停 corresponding to index value 4 in the first correspondence. The edit distance between 在停 and 暂停 is 1, which is less than the preset value 2, so 在停 is determined as the target result.
Optionally, when the edit distances between all candidate recognition results and the command keywords are all greater than or equal to the preset value, no target result has been selected according to the command selection rule. In that case, the speech recognition device continues to select the target result according to another selection rule; or determines the first candidate recognition result as the target result; or selects no target result, and the procedure ends. The other selection rule is the function selection rule or the dialogue selection rule.
Optionally, the speech recognition device may also determine the candidate recognition result with the smallest edit distance as the target result.
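The edit distance used in steps 205 and 206 is the standard Levenshtein distance; a compact dynamic-programming sketch is shown below. This is a textbook implementation offered for illustration, not the patent's own code:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

# As in the example above: 在停 -> 暂停 needs one substitution.
print(edit_distance("在停", "暂停"))  # 1
```

With the illustrative preset value of 2, a distance of 1 is below the threshold, so the candidate would be accepted as the target result.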
In summary, the speech recognition method provided by the present application selects the target result from the n candidate recognition results according to the command selection rule. When the target result can be determined by executing only the command selection rule, because the algorithmic complexity of the command selection rule is lower than that of computing perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
In addition, by combining the first correspondence and the second correspondence to detect the command keyword matching the i-th candidate recognition result, the speech recognition device does not need to store every variant form of every command keyword; it only needs to store the key characters included in all variants to determine the corresponding command keyword, which saves storage space on the speech recognition device.
Optionally, the speech recognition device sends the target result to the voice receiving device, and the voice receiving device performs the corresponding operation according to the command corresponding to the target result. For example, if the voice receiving device is a smart speaker and the target result is "pause", the smart speaker pauses the currently playing audio upon receiving the target result.
Refer to FIG. 5, which shows a flowchart of a speech recognition method according to another embodiment of the present application. This embodiment is described with the speech recognition method applied to a speech recognition device. The method may include the following steps:
Step 401: Analyze the function template of the i-th candidate recognition result, 1 ≤ i ≤ n.
Optionally, a function template library including at least one function template is preset in the speech recognition device.
Optionally, a function template is represented by a regular expression (also called a rule expression). For example, the function template 一首(.+)的歌 ("a song by (.+)"). This embodiment does not limit the number of function templates in the function template library; illustratively, the number is 540.
A regular expression is used to retrieve and/or replace text information that conforms to a function template.
The speech recognition device analyzes the function template in the i-th candidate recognition result by matching the i-th candidate recognition result against each function template in the function template library.
Step 402: Check whether the speech lexicon includes a lexicon keyword matching the speech keyword in the i-th candidate recognition result.
The i-th candidate recognition result includes a function template and at least one speech keyword. After analyzing the function template of the i-th candidate recognition result, the speech recognition device takes the remaining keywords in the i-th candidate recognition result as the speech keywords.
A speech lexicon including at least one lexicon keyword is preset in the speech recognition device. This embodiment does not limit the number of lexicon keywords in the speech lexicon; illustratively, the number is 1,000,000.
The speech recognition device matches the speech keyword in the i-th candidate recognition result against the at least one lexicon keyword in the speech lexicon one by one. When the speech lexicon includes a lexicon keyword matching the speech keyword in the i-th candidate recognition result, perform step 403; when it does not, set i = i + 1 and repeat this step.
Step 403: Determine the i-th candidate recognition result as the target result; the procedure ends.
Optionally, when no target result is selected according to the function selection rule, the speech recognition device continues to select the target result according to another selection rule; or determines the first candidate recognition result as the target result; or selects no target result, and the procedure ends. The other selection rule is the command selection rule or the dialogue selection rule.
Failing to select a target result according to the function selection rule includes, but is not limited to, the following cases: the speech recognition device fails to analyze a function template for any candidate recognition result; or the speech recognition device fails to find, in the speech lexicon, a lexicon keyword matching the speech keywords in any candidate recognition result.
Suppose the speech recognition device obtains 3 candidate recognition results: 1. 我想听图案歌的歌; 2. 我想听童安格的咯; 3. 我想听童安格的歌. Matching these 3 candidates against the function templates in the template library yields: the function template of candidate 1 is 我想听(.+)的歌, that of candidate 2 is 我想听(.+)的(.+), and that of candidate 3 is 我想听(.+)的歌.
For candidate 1, the speech keyword is 图案歌; for candidate 2, the speech recognition device takes the first keyword as the speech keyword, that is, 童安格; for candidate 3, the speech keyword is 童安格.
The speech recognition device matches the speech keywords of the candidate recognition results against the lexicon keywords in the speech lexicon in turn. When matching the speech keyword of candidate 2 against the lexicon keywords, the device can determine a lexicon keyword matching the speech keyword, and candidate 2 is determined as the target result.
Optionally, for candidate 2 the speech recognition device may also take all keywords as speech keywords, that is, 童安格 and 咯. In that case, although the speech lexicon includes a lexicon keyword matching 童安格, it does not include one matching 咯. The device then matches the speech keywords of the candidates against the lexicon keywords in turn; when matching the speech keyword of candidate 3 against the lexicon keywords, the device can determine a lexicon keyword matching the speech keyword, and candidate 3 is determined as the target result.
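The template matching and keyword extraction walked through above can be sketched with Python's `re` module. The templates below mirror the worked example; the toy speech lexicon and the "first captured group is the speech keyword" simplification are assumptions of this sketch, not details fixed by the patent:

```python
import re

# Function templates as regular expressions; groups capture speech keywords.
TEMPLATES = [re.compile(r"^我想听(.+)的歌$"),
             re.compile(r"^我想听(.+)的(.+)$")]

LEXICON = {"童安格", "周杰伦"}  # toy speech lexicon

def match_by_function_rule(candidates):
    """Steps 401-403: return the first candidate whose first captured
    speech keyword appears in the speech lexicon."""
    for cand in candidates:
        for template in TEMPLATES:
            m = template.match(cand)
            if m and m.group(1) in LEXICON:
                return cand
    return None

cands = ["我想听图案歌的歌", "我想听童安格的咯", "我想听童安格的歌"]
print(match_by_function_rule(cands))  # 我想听童安格的咯
```

As in the example, candidate 2 wins because its captured keyword 童安格 is in the lexicon, while candidate 1's 图案歌 is not.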
In summary, the speech recognition method provided by the present application selects the target result from the n candidate recognition results according to the function selection rule. When the target result can be determined by executing only the function selection rule, because the algorithmic complexity of the function selection rule is lower than that of computing perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
Optionally, the speech recognition device sends the target result to the voice receiving device, and the voice receiving device performs the corresponding operation according to the speech keyword in the target result. For example, if the voice receiving device is a smart speaker and the target result is "play Jay Chou's songs", upon receiving the target result the smart speaker searches for Jay Chou's songs and plays the audio corresponding to the search results.
Optionally, the speech recognition device performs a search according to the speech keyword in the target result and sends the search results to the voice receiving device, which plays the audio corresponding to the search results. For example, if the voice receiving device is a smart speaker and the target result is "play Jay Chou's songs", the speech recognition device searches for Jay Chou's songs according to the speech keyword 周杰伦 in the target result and sends the search results to the smart speaker, which plays the corresponding audio.
Refer to FIG. 6, which shows a flowchart of a speech recognition method according to another embodiment of the present application. This embodiment is described with the speech recognition method applied to a speech recognition system. The method may include the following steps:
Step 501: Compute the perplexity of each candidate recognition result according to the language model.
Perplexity indicates the degree of similarity between a candidate recognition result and the voice signal; the perplexity is negatively correlated with that similarity.
A language model is a mathematical model used to describe the intrinsic regularities of natural language.
Optionally, in this embodiment, the language model is an N-gram language model generated from a specialized corpus corresponding to at least one domain. The N-gram language model determines the occurrence probability of the current word from the occurrence probabilities of its preceding N−1 words, where N is a positive integer. This embodiment does not limit the value of N; illustratively, N is 3, and the 3-gram language model is also called the tri-gram language model; illustratively, N is 2, and the 2-gram language model is also called the bi-gram language model.
The N-gram language model describes the properties and relationships of basic natural-language units such as words, phrases, and sentences through probabilities and distribution functions, reflecting the statistically based generation and processing rules present in natural language.
In this embodiment, the speech recognition device computing the perplexity of each candidate recognition result according to the 3-gram or 2-gram language model is taken as an example.
Optionally, the 3-gram language model is expressed by the following formula:
p(S) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn−1)
     = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|wn−1,wn−2)
where p(S) is the probability of the candidate recognition result; p(w1) is the occurrence probability of the 1st word in the candidate recognition result; p(w2|w1) is the probability of the 2nd word occurring given the 1st word; p(w3|w1,w2) is the probability of the 3rd word occurring given the 1st and 2nd words; and p(wn|wn−1,wn−2) is the probability of the n-th word occurring given its preceding word (the (n−1)-th) and the word before that (the (n−2)-th).
Optionally, the 2-gram language model is expressed by the following formula:
p(S) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn−1)
     = p(w1)p(w2|w1)p(w3|w2)...p(wn|wn−1)
where p(S) is the probability of the candidate recognition result; p(w1) is the occurrence probability of the 1st word; p(w2|w1) is the probability of the 2nd word occurring given the 1st word; p(w3|w2) is the probability of the 3rd word occurring given the 2nd word; and p(wn|wn−1) is the probability of the n-th word occurring given its preceding word (the (n−1)-th).
The at least one domain includes, but is not limited to, the following: weather, music, mathematics, sports, computers, home, geography, and nature. Of course, the at least one domain may also include other domains, which is not limited in this embodiment.
The speech recognition device computes the perplexity of each candidate recognition result from the language model through a preset formula.
The perplexity can be viewed as the geometric mean of the occurrence probabilities, predicted by the language model, of the candidate words following each word. Generally, the probability of a candidate recognition result is negatively correlated with perplexity: the higher the probability of the candidate recognition result, the lower the perplexity; the lower the probability, the higher the perplexity.
Optionally, when computing the perplexity of each candidate recognition result from the language model through the preset formula, the speech recognition device first computes the cross-entropy of each candidate recognition result and determines the perplexity of that recognition result from the cross-entropy and the preset formula.
The cross-entropy represents the difference between the model language determined by the language model and the candidate recognition result. The smaller the cross-entropy, the smaller the difference between the model language and the candidate recognition result, and the higher the degree of matching between the candidate recognition result and the voice signal; the larger the cross-entropy, the larger the difference and the lower the degree of matching.
Optionally, the language model may also be of another type, such as a neural network language model, which is not limited in this embodiment.
Step 502: Determine the minimum perplexity among the n candidate recognition results, and determine the i-th candidate recognition result corresponding to the minimum as the target result.
Since a lower perplexity indicates a higher degree of similarity between a candidate recognition result and the voice signal, the i-th candidate recognition result corresponding to the minimum perplexity is determined as the target result.
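The perplexity computation of steps 501 and 502 can be sketched for a bi-gram model: estimate each p(w_k|w_{k−1}) from counts over a corpus, multiply the conditional probabilities along the sentence, and take the inverse geometric mean. The tiny corpus, the candidate sentences, and the add-one smoothing below are illustrative assumptions of this sketch, not the patent's actual specialized corpus or formula:

```python
import math
from collections import Counter

corpus = ["你 在 做 什么", "你 在 哪里", "今天 天气 真 好"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

VOCAB = len(unigrams)

def perplexity(sentence):
    """Bi-gram perplexity with add-one smoothing: the inverse geometric
    mean of p(w_k | w_{k-1}) over the sentence."""
    words = ["<s>"] + sentence.split()
    log_p = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + VOCAB)
        log_p += math.log(p)
    return math.exp(-log_p / (len(words) - 1))

# The candidate closer to the corpus gets the lower perplexity
# and would be chosen as the target result in step 502.
assert perplexity("你 在 做 什么") < perplexity("拟 好 在 做")
```

In the dialogue selection rule the same comparison is run over the n candidate recognition results, and the one with the minimum perplexity is the target result.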
In summary, the speech recognition method provided by the present application selects the target result from the n candidate recognition results according to the dialogue selection rule. When the target result can be determined by executing only the dialogue selection rule, because the algorithmic complexity of the dialogue selection rule is lower than that of computing perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.
Optionally, the speech recognition device sends the target result to the voice receiving device, and the voice receiving device obtains dialogue information according to the target result. For example, if the voice receiving device is a smart speaker and the target result is "what are you doing", upon receiving the target result the smart speaker generates dialogue information according to a dialogue model.
Optionally, the speech recognition device generates dialogue information according to the target result and sends the dialogue information to the voice receiving device, which plays the audio corresponding to the dialogue information. For example, if the voice receiving device is a smart speaker and the target result is "what are you doing", the speech recognition device generates dialogue information according to the target result and sends it to the smart speaker, which plays the corresponding audio.
It should be noted that the embodiments shown in FIG. 3, FIG. 5, and FIG. 6 may be combined pairwise, or all three may be combined, to form new embodiments. Taking m = 3 as an example, the command selection rule is the first selection rule, the function selection rule is the second, and the dialogue selection rule is the third.
The following are apparatus embodiments of the present application, which may be used to perform the method embodiments of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present application.
Refer to FIG. 7, which shows a block diagram of a speech recognition apparatus according to an embodiment of the present application. The apparatus has the function of performing the above method examples, and the function may be implemented by hardware or by hardware executing corresponding software. The apparatus may include: a signal acquisition module 610, a speech recognition module 620, and a determining module 630.
The signal acquisition module 610 is configured to acquire a voice signal.
The speech recognition module 620 is configured to recognize, according to a speech recognition algorithm, the voice signal acquired by the signal acquisition module 610 to obtain n candidate recognition results, where a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1.
The determining module 630 is configured to determine, according to the selection rule whose execution order is j among m selection rules, a target result among the n candidate recognition results recognized by the speech recognition module 620, where the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1.
The determining module 630 is configured to, when the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to the selection rule whose execution order is j+1.
Optionally, the execution order of the m selection rules is determined according to their respective algorithmic complexity, and the execution order is positively correlated with the algorithmic complexity.
Optionally, the m selection rules include at least two of a command selection rule, a function selection rule, and a dialogue selection rule, where the algorithmic complexity of the command selection rule is lower than that of the function selection rule, and the algorithmic complexity of the function selection rule is lower than that of the dialogue selection rule.
The command selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether the command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n.
The function selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether the speech lexicon includes a lexicon keyword matching a speech keyword, where the speech keyword is at least one keyword in the i-th candidate recognition result.
The dialogue selection rule instructs the speech recognition device to select the target result by determining, according to the trained language model, the degree of similarity between each candidate recognition result and the voice signal.
Optionally, the determining module 630 includes: a first checking unit and a first determining unit.
The first checking unit is configured to check whether the first correspondence of the command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n.
The first determining unit is configured to determine the i-th candidate recognition result as the target result when the first correspondence includes the command keyword matching the i-th candidate recognition result.
The first correspondence includes at least the command keyword.
Optionally, the determining module 630 further includes: a second checking unit, a key character lookup unit, a second determining unit, and a third determining unit.
The second checking unit is configured to, when the first correspondence does not include a command keyword matching any one of the n candidate recognition results, check whether the second correspondence in the command lexicon includes a key character matching any character in the i-th candidate recognition result.
The key character lookup unit is configured to, when the second correspondence includes a key character matching a character in the i-th candidate recognition result, look up, in the first correspondence, the command keyword corresponding to the index value according to the index value corresponding to the key character in the second correspondence.
The second determining unit is configured to determine the edit distance between the i-th candidate recognition result and the command keyword, where the edit distance indicates the number of operations required to convert the i-th candidate recognition result into the command keyword.
The third determining unit is configured to determine the i-th candidate recognition result as the target result when the edit distance is less than a preset value.
The first correspondence includes correspondences between index values and command keywords, and the second correspondence includes correspondences between index values and key characters.
Optionally, the determining module 630 includes: a template analysis unit, a third checking unit, and a fourth determining unit.
The template analysis unit is configured to analyze the function template of the i-th candidate recognition result, 1 ≤ i ≤ n.
The third checking unit is configured to check whether the speech lexicon includes a lexicon keyword matching the speech keyword in the i-th candidate recognition result.
The fourth determining unit is configured to determine the i-th candidate recognition result as the target result when the speech lexicon includes a lexicon keyword matching the speech keyword in the i-th candidate recognition result, where the speech keyword is at least one keyword in the i-th candidate recognition result.
The i-th candidate recognition result includes the function template and the speech keyword.
Optionally, the determining module 630 includes: a perplexity computing unit and a fifth determining unit.
The perplexity computing unit is configured to compute the perplexity of each candidate recognition result according to the language model.
The fifth determining unit is configured to determine the minimum perplexity among the n candidate recognition results and determine the i-th candidate recognition result corresponding to the minimum as the target result.
The perplexity indicates the degree of similarity between a candidate recognition result and the voice signal, and the perplexity is negatively correlated with that similarity; the language model is an N-gram language model generated from a specialized corpus corresponding to at least one domain, and the N-gram language model determines the occurrence probability of the current word from the occurrence probabilities of its preceding N−1 words, where N is a positive integer.
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be the computer-readable storage medium included in the memory, or may exist separately without being assembled into the speech recognition device. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the speech recognition method provided by each of the above method embodiments.
FIG. 8 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application. The speech recognition device 700 includes a Central Processing Unit (CPU) 701, a system memory 704 including a Random Access Memory (RAM) 702 and a Read-Only Memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The speech recognition device 700 further includes a basic input/output system (I/O system) 706 that helps transfer information between components within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse or keyboard, for user input. Both the display 708 and the input device 709 are connected to the central processing unit 701 through an input/output controller 710 connected to the system bus 705. The basic input/output system 706 may further include the input/output controller 710 for receiving and processing input from multiple other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 710 also provides output to a display screen, printer, or other type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable media provide non-volatile storage for the speech recognition device 700. That is, the mass storage device 707 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Certainly, a person skilled in the art knows that the computer storage media are not limited to the above. The system memory 704 and the mass storage device 707 may be collectively referred to as memory.
According to various embodiments of the present application, the speech recognition device 700 may also be connected, through a network such as the Internet, to a remote computer on the network for operation. That is, the speech recognition device 700 may be connected to a network 712 through a network interface unit 711 connected to the system bus 705, or the network interface unit 711 may be used to connect to other types of networks or remote computer systems (not shown).
Specifically, in this embodiment of the present application, the speech recognition device 700 further includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors. The one or more programs include instructions for performing the above speech recognition method.
In an optional embodiment, an embodiment of the present application further provides a speech recognition system, including a smart speaker and a server. The smart speaker may be the voice collecting device shown in FIG. 1, and the server may be the speech recognition device shown in FIG. 1.
The smart speaker is configured to collect a voice signal and send the voice signal to the server.
The server is configured to: acquire the voice signal; recognize the voice signal according to a speech recognition algorithm to obtain n candidate recognition results, where a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1; determine a target result among the n candidate recognition results according to the selection rule whose execution order is j among m selection rules, where the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidates, m is an integer greater than 1, and the initial value of j is 1; when the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to the selection rule whose execution order is j+1; and send the target result to the smart speaker. Optionally, the server identifies the target result according to the speech recognition method shown in any one of FIG. 3 to FIG. 6.
The smart speaker is further configured to respond according to the target result. The response includes, but is not limited to, at least one of: executing a command according to the target result, performing a functional response according to the target result, and conducting a voice dialogue according to the target result.
Illustratively, executing a command according to the target result includes at least one of the following commands: play, pause, previous, next.
Illustratively, performing a functional response according to the target result includes at least one of the following functional responses: playing a certain singer, a certain song title, or songs of a certain style; playing a certain host, a certain program name, or a certain type of music program; voice navigation; schedule reminders; translation.
Illustratively, conducting a voice dialogue according to the target result includes at least one of the following dialogue scenarios: weather Q&A, knowledge Q&A, entertainment chat, and joke telling.
A person of ordinary skill in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware, and the program may be stored in a computer-readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (20)

  1. A speech recognition method, comprising:
    acquiring a voice signal;
    recognizing the voice signal according to a speech recognition algorithm to obtain n candidate recognition results, wherein a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1;
    determining a target result among the n candidate recognition results according to a selection rule whose execution order is j among m selection rules, wherein the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1; and
    when the target result is not determined according to the selection rule whose execution order is j, determining the target result among the n candidate recognition results according to a selection rule whose execution order is j+1.
  2. The method according to claim 1, wherein the execution order of the m selection rules is determined according to their respective algorithmic complexity, and the execution order is positively correlated with the algorithmic complexity.
  3. The method according to claim 1, wherein the m selection rules comprise at least two of a command selection rule, a function selection rule, and a dialogue selection rule, the algorithmic complexity of the command selection rule being lower than that of the function selection rule, and the algorithmic complexity of the function selection rule being lower than that of the dialogue selection rule;
    the command selection rule instructs a speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether a command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n;
    the function selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether a speech lexicon includes a lexicon keyword matching a speech keyword, the speech keyword being at least one keyword in the i-th candidate recognition result; and
    the dialogue selection rule instructs the speech recognition device to select the target result by determining, according to a trained language model, the degree of similarity between each candidate recognition result and the voice signal.
  4. The method according to claim 3, wherein the selection rule whose execution order is j comprises the command selection rule, and the determining a target result among the n candidate recognition results according to a selection rule whose execution order is j among m selection rules comprises:
    checking whether a first correspondence of the command lexicon includes the command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n; and
    when the first correspondence includes the command keyword matching the i-th candidate recognition result, determining the i-th candidate recognition result as the target result;
    wherein the first correspondence includes at least the command keyword.
  5. The method according to claim 4, wherein after the checking whether a first correspondence of the command lexicon includes the command keyword matching the i-th candidate recognition result, the method further comprises:
    when the first correspondence does not include a command keyword matching any one of the n candidate recognition results, checking whether a second correspondence in the command lexicon includes a key character matching any character in the i-th candidate recognition result;
    when the second correspondence includes a key character matching a character in the i-th candidate recognition result, looking up, in the first correspondence, the command keyword corresponding to an index value according to the index value corresponding to the key character in the second correspondence;
    determining an edit distance between the i-th candidate recognition result and the command keyword, the edit distance indicating the number of operations required to convert the i-th candidate recognition result into the command keyword; and
    when the edit distance is less than a preset value, determining the i-th candidate recognition result as the target result;
    wherein the first correspondence includes correspondences between index values and command keywords, and the second correspondence includes correspondences between index values and key characters.
  6. The method according to claim 3, wherein the selection rule whose execution order is j comprises the function selection rule, and the determining a target result among the n candidate recognition results according to a selection rule whose execution order is j among m selection rules comprises:
    analyzing a function template of the i-th candidate recognition result, 1 ≤ i ≤ n;
    checking whether the speech lexicon includes the lexicon keyword matching the speech keyword in the i-th candidate recognition result; and
    when the speech lexicon includes the lexicon keyword matching the speech keyword in the i-th candidate recognition result, determining the i-th candidate recognition result as the target result, the speech keyword being at least one keyword in the i-th candidate recognition result;
    wherein the i-th candidate recognition result includes the function template and the speech keyword.
  7. The method according to claim 3, wherein the selection rule whose execution order is j comprises the dialogue selection rule, and the determining a target result among the n candidate recognition results according to a selection rule whose execution order is j among m selection rules comprises:
    computing the perplexity of each candidate recognition result according to the language model; and
    determining the minimum perplexity among the n candidate recognition results, and determining the i-th candidate recognition result corresponding to the minimum as the target result;
    wherein the perplexity indicates the degree of similarity between the candidate recognition result and the voice signal, and the perplexity is negatively correlated with the similarity; the language model is an N-gram language model generated from a specialized corpus corresponding to at least one domain, the N-gram language model determines the occurrence probability of the current word from the occurrence probabilities of its preceding N−1 words, and N is a positive integer.
  8. A speech recognition apparatus, comprising:
    a signal acquisition module, configured to acquire a voice signal;
    a speech recognition module, configured to recognize, according to a speech recognition algorithm, the voice signal acquired by the signal acquisition module to obtain n candidate recognition results, wherein a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1; and
    a determining module, configured to determine, according to a selection rule whose execution order is j among m selection rules, a target result among the n candidate recognition results recognized by the speech recognition module, wherein the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1;
    the determining module being configured to, when the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to a selection rule whose execution order is j+1.
  9. The apparatus according to claim 8, wherein the execution order of the m selection rules is determined according to their respective algorithmic complexity, and the execution order is positively correlated with the algorithmic complexity.
  10. The apparatus according to claim 8, wherein the m selection rules comprise at least two of a command selection rule, a function selection rule, and a dialogue selection rule, the algorithmic complexity of the command selection rule being lower than that of the function selection rule, and the algorithmic complexity of the function selection rule being lower than that of the dialogue selection rule;
    the command selection rule instructs a speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether a command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n;
    the function selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether a speech lexicon includes a lexicon keyword matching a speech keyword, the speech keyword being at least one keyword in the i-th candidate recognition result; and
    the dialogue selection rule instructs the speech recognition device to select the target result by determining, according to a trained language model, the degree of similarity between each candidate recognition result and the voice signal.
  11. The apparatus according to claim 10, wherein the determining module comprises: a first checking unit and a first determining unit;
    the first checking unit is configured to check whether a first correspondence of the command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n;
    the first determining unit is configured to determine the i-th candidate recognition result as the target result when the first correspondence includes the command keyword matching the i-th candidate recognition result;
    wherein the first correspondence includes at least the command keyword.
  12. The apparatus according to claim 11, wherein the determining module further comprises: a second checking unit, a key character lookup unit, a second determining unit, and a third determining unit;
    the second checking unit is configured to, when the first correspondence does not include a command keyword matching any one of the n candidate recognition results, check whether a second correspondence in the command lexicon includes a key character matching any character in the i-th candidate recognition result;
    the key character lookup unit is configured to, when the second correspondence includes a key character matching a character in the i-th candidate recognition result, look up, in the first correspondence, the command keyword corresponding to an index value according to the index value corresponding to the key character in the second correspondence;
    the second determining unit is configured to determine an edit distance between the i-th candidate recognition result and the command keyword, the edit distance indicating the number of operations required to convert the i-th candidate recognition result into the command keyword;
    the third determining unit is configured to determine the i-th candidate recognition result as the target result when the edit distance is less than a preset value;
    wherein the first correspondence includes correspondences between index values and command keywords, and the second correspondence includes correspondences between index values and key characters.
  13. The apparatus according to claim 10, wherein the determining module comprises: a template analysis unit, a third checking unit, and a fourth determining unit;
    the template analysis unit is configured to analyze a function template of the i-th candidate recognition result, 1 ≤ i ≤ n;
    the third checking unit is configured to check whether the speech lexicon includes a lexicon keyword matching the speech keyword in the i-th candidate recognition result;
    the fourth determining unit is configured to determine the i-th candidate recognition result as the target result when the speech lexicon includes the lexicon keyword matching the speech keyword in the i-th candidate recognition result, the speech keyword being at least one keyword in the i-th candidate recognition result;
    wherein the i-th candidate recognition result includes the function template and the speech keyword.
  14. The apparatus according to claim 10, wherein the determining module comprises: a perplexity computing unit and a fifth determining unit;
    the perplexity computing unit is configured to compute the perplexity of each candidate recognition result according to the language model;
    the fifth determining unit is configured to determine the minimum perplexity among the n candidate recognition results and determine the i-th candidate recognition result corresponding to the minimum as the target result;
    wherein the perplexity indicates the degree of similarity between the candidate recognition result and the voice signal, and the perplexity is negatively correlated with the similarity; the language model is an N-gram language model generated from a specialized corpus corresponding to at least one domain, the N-gram language model determines the occurrence probability of the current word from the occurrence probabilities of its preceding N−1 words, and N is a positive integer.
  15. A speech recognition method, comprising:
    acquiring, by a speech recognition device, a voice signal;
    recognizing, by the speech recognition device, the voice signal according to a speech recognition algorithm to obtain n candidate recognition results, wherein a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1;
    determining, by the speech recognition device, a target result among the n candidate recognition results according to a selection rule whose execution order is j among m selection rules, wherein the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1; and
    when the target result is not determined according to the selection rule whose execution order is j, determining, by the speech recognition device, the target result among the n candidate recognition results according to a selection rule whose execution order is j+1.
  16. The method according to claim 15, wherein the execution order of the m selection rules is determined according to their respective algorithmic complexity, and the execution order is positively correlated with the algorithmic complexity.
  17. The method according to claim 15, wherein the m selection rules comprise at least two of a command selection rule, a function selection rule, and a dialogue selection rule, the algorithmic complexity of the command selection rule being lower than that of the function selection rule, and the algorithmic complexity of the function selection rule being lower than that of the dialogue selection rule;
    the command selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether a command lexicon includes a command keyword matching the i-th candidate recognition result, 1 ≤ i ≤ n;
    the function selection rule instructs the speech recognition device to detect whether the i-th candidate recognition result is the target result according to whether a speech lexicon includes a lexicon keyword matching a speech keyword, the speech keyword being at least one keyword in the i-th candidate recognition result; and
    the dialogue selection rule instructs the speech recognition device to select the target result by determining, according to a trained language model, the degree of similarity between each candidate recognition result and the voice signal.
  18. A speech recognition device, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the speech recognition method according to any one of claims 1 to 7.
  19. A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the speech recognition method according to any one of claims 1 to 7.
  20. A speech recognition system, comprising: a smart speaker and a server;
    the smart speaker being configured to collect a voice signal and send the voice signal to the server;
    the server being configured to: acquire the voice signal; recognize the voice signal according to a speech recognition algorithm to obtain n candidate recognition results, wherein a candidate recognition result is text information corresponding to the voice signal, and n is an integer greater than 1; determine a target result among the n candidate recognition results according to a selection rule whose execution order is j among m selection rules, wherein the target result is the candidate recognition result having the highest degree of matching with the voice signal among the n candidate recognition results, m is an integer greater than 1, and the initial value of j is 1; when the target result is not determined according to the selection rule whose execution order is j, determine the target result among the n candidate recognition results according to a selection rule whose execution order is j+1; and send the target result to the smart speaker;
    the smart speaker being configured to respond according to the target result.
PCT/CN2018/088646 2017-06-29 2018-05-28 语音识别方法、装置、设备及存储介质 WO2019001194A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020197028881A KR102315732B1 (ko) 2017-06-29 2018-05-28 음성 인식 방법, 디바이스, 장치, 및 저장 매체
JP2019560155A JP6820058B2 (ja) 2017-06-29 2018-05-28 音声認識方法、装置、デバイス、及び記憶媒体
EP18825077.3A EP3648099B1 (en) 2017-06-29 2018-05-28 Voice recognition method, device, apparatus, and storage medium
US16/547,097 US11164568B2 (en) 2017-06-29 2019-08-21 Speech recognition method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710517737.4 2017-06-29
CN201710517737.4A CN108288468B (zh) 2017-06-29 2017-06-29 语音识别方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/547,097 Continuation US11164568B2 (en) 2017-06-29 2019-08-21 Speech recognition method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2019001194A1 true WO2019001194A1 (zh) 2019-01-03

Family

ID=62831578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/088646 WO2019001194A1 (zh) 2017-06-29 2018-05-28 语音识别方法、装置、设备及存储介质

Country Status (6)

Country Link
US (1) US11164568B2 (zh)
EP (1) EP3648099B1 (zh)
JP (1) JP6820058B2 (zh)
KR (1) KR102315732B1 (zh)
CN (1) CN108288468B (zh)
WO (1) WO2019001194A1 (zh)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108600911B (zh) * 2018-03-30 2021-05-18 联想(北京)有限公司 一种输出方法及电子设备
CN109034418B (zh) * 2018-07-26 2021-05-28 国家电网公司 作业现场信息传输方法及系统
CN108922531B (zh) * 2018-07-26 2020-10-27 腾讯科技(北京)有限公司 槽位识别方法、装置、电子设备及存储介质
CN109256125B (zh) * 2018-09-29 2022-10-14 阿波罗智联(北京)科技有限公司 语音的离线识别方法、装置与存储介质
CN109634692A (zh) * 2018-10-23 2019-04-16 蔚来汽车有限公司 车载对话系统及用于其的处理方法和系统
CN111198936B (zh) * 2018-11-20 2023-09-15 北京嘀嘀无限科技发展有限公司 一种语音搜索方法、装置、电子设备及存储介质
CN109256133A (zh) * 2018-11-21 2019-01-22 上海玮舟微电子科技有限公司 一种语音交互方法、装置、设备及存储介质
CN109814831A (zh) * 2019-01-16 2019-05-28 平安普惠企业管理有限公司 智能对话方法、电子装置及存储介质
CN109920415A (zh) * 2019-01-17 2019-06-21 平安城市建设科技(深圳)有限公司 基于语音识别的人机问答方法、装置、设备和存储介质
CN109871441A (zh) * 2019-03-13 2019-06-11 北京航空航天大学 一种基于神经网络的导学问答系统及方法
US11158307B1 (en) * 2019-03-25 2021-10-26 Amazon Technologies, Inc. Alternate utterance generation
CN110570839A (zh) * 2019-09-10 2019-12-13 中国人民解放军陆军军医大学第一附属医院 基于人机交互的智能监护系统
KR102577589B1 (ko) * 2019-10-22 2023-09-12 삼성전자주식회사 음성 인식 방법 및 음성 인식 장치
CN110827802A (zh) * 2019-10-31 2020-02-21 苏州思必驰信息科技有限公司 语音识别训练和解码方法及装置
CN111028828A (zh) * 2019-12-20 2020-04-17 京东方科技集团股份有限公司 一种基于画屏的语音交互方法、画屏及存储介质
CN111554275B (zh) * 2020-05-15 2023-11-03 深圳前海微众银行股份有限公司 语音识别方法、装置、设备及计算机可读存储介质
CN112151022A (zh) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 语音识别的优化方法、装置、设备以及存储介质
CN112331207A (zh) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 服务内容监控方法、装置、电子设备和存储介质
CN112614490B (zh) * 2020-12-09 2024-04-16 北京罗克维尔斯科技有限公司 生成语音指令的方法、装置、介质、设备、系统及车辆
CN112669848B (zh) * 2020-12-14 2023-12-01 深圳市优必选科技股份有限公司 一种离线语音识别方法、装置、电子设备及存储介质
CN113744736B (zh) * 2021-09-08 2023-12-08 北京声智科技有限公司 命令词识别方法、装置、电子设备及存储介质
CN114386401A (zh) * 2022-01-13 2022-04-22 中国工商银行股份有限公司 数字人播报方法、装置、设备、存储介质和程序产品
WO2023163254A1 (ko) * 2022-02-28 2023-08-31 엘지전자 주식회사 Tv와 리모컨을 포함하는 시스템 및 그 제어 방법

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101345051A (zh) * 2008-08-19 2009-01-14 南京师范大学 带定量参数的地理信息系统语音控制方法
CN102395013A (zh) * 2011-11-07 2012-03-28 康佳集团股份有限公司 一种对智能电视机的语音控制方法和系统
US20160078866A1 (en) * 2014-09-14 2016-03-17 Speaktoit, Inc. Platform for creating customizable dialog system engines
CN105654946A (zh) * 2014-12-02 2016-06-08 三星电子株式会社 用于语音识别的设备和方法
CN106126714A (zh) * 2016-06-30 2016-11-16 联想(北京)有限公司 信息处理方法及信息处理装置
CN106531160A (zh) * 2016-10-26 2017-03-22 安徽省云逸智能科技有限公司 一种基于词网语言模型的连续语音识别系统
CN106649514A (zh) * 2015-10-16 2017-05-10 百度(美国)有限责任公司 用于受人启发的简单问答(hisqa)的系统和方法

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS57109997A (en) * 1980-12-26 1982-07-08 Tokyo Shibaura Electric Co Word information input device
JP2764277B2 (ja) * 1988-09-07 1998-06-11 株式会社日立製作所 音声認識装置
JP3822990B2 (ja) * 1999-01-07 2006-09-20 株式会社日立製作所 翻訳装置、記録媒体
DE10306022B3 (de) * 2003-02-13 2004-02-19 Siemens Ag Dreistufige Einzelworterkennung
US8489398B1 (en) * 2011-01-14 2013-07-16 Google Inc. Disambiguation of spoken proper names
JP2012208218A (ja) * 2011-03-29 2012-10-25 Yamaha Corp 電子機器
US20170109676A1 (en) * 2011-05-08 2017-04-20 Panaya Ltd. Generation of Candidate Sequences Using Links Between Nonconsecutively Performed Steps of a Business Process
KR101914548B1 (ko) * 2012-01-05 2018-11-02 엘지전자 주식회사 음성 인식 기능을 구비한 이동 단말기 및 그 검색 결과 제공 방법
US8908879B2 (en) * 2012-05-23 2014-12-09 Sonos, Inc. Audio content auditioning
KR101971513B1 (ko) * 2012-07-05 2019-04-23 삼성전자주식회사 전자 장치 및 이의 음성 인식 오류 수정 방법
CN103915095B (zh) * 2013-01-06 2017-05-31 华为技术有限公司 语音识别的方法、交互设备、服务器和系统
KR102072826B1 (ko) * 2013-01-31 2020-02-03 삼성전자주식회사 음성 인식 장치 및 응답 정보 제공 방법
JP6236805B2 (ja) * 2013-03-05 2017-11-29 日本電気株式会社 発話コマンド認識システム
JP5728527B2 (ja) * 2013-05-13 2015-06-03 日本電信電話株式会社 発話候補生成装置、発話候補生成方法、及び発話候補生成プログラム
US9208779B2 (en) * 2013-09-06 2015-12-08 Google Inc. Mixture of n-gram language models
CN103500579B (zh) * 2013-10-10 2015-12-23 中国联合网络通信集团有限公司 语音识别方法、装置及系统
JP6004452B2 (ja) * 2014-07-24 2016-10-05 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation 言語モデル用の学習テキストを選択する方法及び当該学習テキストを使用して言語モデルを学習する方法、並びに、それらを実行するためのコンピュータ及びコンピュータ・プログラム
WO2016167779A1 (en) * 2015-04-16 2016-10-20 Mitsubishi Electric Corporation Speech recognition device and rescoring device


Non-Patent Citations (1)

Title
See also references of EP3648099A4 *

Also Published As

Publication number Publication date
US11164568B2 (en) 2021-11-02
CN108288468B (zh) 2019-07-19
JP2020518861A (ja) 2020-06-25
JP6820058B2 (ja) 2021-01-27
EP3648099B1 (en) 2021-06-30
KR20190120353A (ko) 2019-10-23
CN108288468A (zh) 2018-07-17
EP3648099A4 (en) 2020-07-08
KR102315732B1 (ko) 2021-10-21
EP3648099A1 (en) 2020-05-06
US20190385599A1 (en) 2019-12-19

Similar Documents

Publication Publication Date Title
WO2019001194A1 (zh) Speech recognition method, apparatus, device and storage medium
KR102390940B1 (ko) Contextual biasing for speech recognition
US11016968B1 (en) Mutation architecture for contextual data aggregator
WO2017084334A1 (zh) Language identification method, apparatus, device and computer storage medium
JP7300435B2 (ja) Method, apparatus, electronic device and computer-readable storage medium for voice interaction
WO2018045646A1 (zh) Artificial-intelligence-based human-machine interaction method and apparatus
WO2017084185A1 (zh) Intelligent terminal control method and system based on semantic analysis, and intelligent terminal
CN108055617B (zh) Microphone wake-up method, apparatus, terminal device and storage medium
WO2020024620A1 (zh) Voice information processing method and apparatus, device and storage medium
CN110070859B (zh) Speech recognition method and apparatus
US11881209B2 (en) Electronic device and control method
US11532301B1 (en) Natural language processing
US11289075B1 (en) Routing of natural language inputs to speech processing applications
JP7063937B2 (ja) Method, apparatus, electronic device, computer-readable storage medium and computer program for voice interaction
US20230074681A1 (en) Complex natural language processing
WO2024045475A1 (zh) Speech recognition method, apparatus, device and medium
KR20210060897A (ko) Speech processing method and apparatus
CN112151015A (zh) Keyword detection method, apparatus, electronic device and storage medium
CN112669842A (zh) Human-machine dialogue control method, apparatus, computer device and storage medium
JP2022120024A (ja) Audio signal processing method, model training method, and apparatus, electronic device, storage medium and computer program therefor
US11626107B1 (en) Natural language processing
CN113111658B (zh) Method, apparatus, device and storage medium for verifying information
JP2020086332A (ja) Keyword extraction device, keyword extraction method, and program
CN111968646A (zh) Speech recognition method and apparatus
WO2021097629A1 (zh) Data processing method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18825077; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20197028881; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2019560155; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2018825077; Country of ref document: EP; Effective date: 20200129)