WO2021232746A1 - A speech recognition method, apparatus, device, and storage medium - Google Patents

A speech recognition method, apparatus, device, and storage medium

Info

Publication number
WO2021232746A1
Authority
WO
WIPO (PCT)
Prior art keywords
hot word
audio
hot
current decoding
related features
Prior art date
Application number
PCT/CN2020/133286
Other languages
English (en)
French (fr)
Inventor
熊世富
刘聪
魏思
刘庆峰
高建清
潘嘉
Original Assignee
科大讯飞股份有限公司 (iFLYTEK Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 (iFLYTEK Co., Ltd.)
Priority to US17/925,483 (publication US20230186912A1)
Priority to JP2022563214A (publication JP7407968B2)
Priority to EP20936660.8A (publication EP4156176A4)
Priority to KR1020227043996A (publication KR102668530B1)
Publication of WO2021232746A1


Classifications

    • G10L 15/08 Speech classification or search
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06N 3/045 Combinations of networks
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/197 Probabilistic grammars, e.g. word n-grams
    • G10L 15/26 Speech to text systems
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 15/063 Training
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 2015/088 Word spotting
    • G10L 2015/223 Execution procedure of a spoken command

Definitions

  • Speech recognition is the task of recognizing input speech data to obtain the text content corresponding to the speech.
  • end-to-end modeling methods have become a research hotspot in the field of speech recognition.
  • The existing attention-based end-to-end speech recognition framework shown in Figure 1 encodes the input speech, processes the encoded audio with an attention mechanism, and obtains the recognized text corresponding to the input speech after decoding and classification.
  • This kind of speech recognition method requires a lot of training data, which leads to the problem of over-confidence in the trained model.
  • The over-confidence manifests as very sharp posterior probability scores: high-frequency words are recognized very well and receive very high scores, while low-frequency words are recognized poorly and receive very low scores.
  • Hot words such as proper nouns, technical terms, and real-time hot words arising in daily social activity are low-frequency words relative to the model, so such a model recognizes hot words very poorly.
  • this application provides a voice recognition method, device, equipment and storage medium to solve the problem of poor recognition of hot words by existing voice recognition solutions.
  • the technical solutions are as follows:
  • a voice recognition method, including:
  • acquiring a speech to be recognized and a configured hot word library; determining, based on the speech to be recognized and the hot word library, the audio-related features required at the current decoding moment; determining, based on the audio-related features, the hot word-related features required at the current decoding moment from the hot word library; and determining, based on the audio-related features and the hot word-related features, the recognition result of the speech to be recognized at the current decoding moment.
  • Optionally, the determining the audio-related features required at the current decoding moment based on the speech to be recognized and the hot word library includes:
  • acquiring information of the decoded results before the current decoding moment; and, based on the decoded result information and the hot word library, determining the audio-related features required at the current decoding moment from the speech to be recognized.
  • Optionally, the process of determining the audio-related features required at the current decoding moment based on the speech to be recognized and the hot word library, determining the hot word-related features required at the current decoding moment from the hot word library based on the audio-related features, and determining the recognition result of the speech to be recognized at the current decoding moment based on the audio-related features and the hot word-related features includes:
  • the pre-trained speech recognition model is used to process the to-be-recognized speech and the hot vocabulary to obtain the recognition result of the to-be-recognized speech output by the speech recognition model, where:
  • the voice recognition model has the ability to receive and process the to-be-recognized voice and the hot vocabulary to output the recognition result of the to-be-recognized voice.
  • the speech recognition model includes an audio encoder module, a hot word encoder module, a joint attention module, a decoder module, and a classifier module;
  • the audio encoder module encodes the speech to be recognized to obtain an audio encoding result;
  • the hot word encoder module encodes each hot word in the hot word database to obtain a hot word encoding result;
  • the joint attention module receives and processes the audio encoding result and the hot word encoding result to obtain splicing features required at the current decoding moment, and the splicing features include audio-related features and hot word-related features;
  • the decoder module receives and processes the splicing features required at the current decoding time to obtain the output features of the decoder module at the current decoding time;
  • the classifier module uses the output characteristics of the decoder module at the current decoding time to determine the recognition result of the speech to be recognized at the current decoding time.
  • the joint attention module includes:
  • a first attention model and a second attention model, wherein:
  • the first attention model determines, from the audio encoding result, the audio-related features required at the current decoding moment based on the hot word encoding result and the state vector output by the decoder module at the current decoding moment, the state vector representing the decoded result information;
  • the second attention model is based on the audio related features, and determines the hot word related features required at the current decoding moment from the hot word encoding result;
  • the audio related features and the hot word related features are combined into the splicing features required at the current decoding moment.
  • Optionally, the first attention model determining, from the audio encoding result, the audio-related features required at the current decoding moment based on the state vector representing the decoded result information output by the decoder module at the current decoding moment and the hot word encoding result includes:
  • the state vector and the hot word encoding result are used as the input of a first attention model, and the first attention model determines the audio-related features required at the current decoding moment from the audio encoding result.
  • the second attention model determines from the hot word encoding result the hot word related characteristics required at the current decoding moment based on the audio related characteristics, including:
  • the audio related features are used as the input of the second attention model, and the second attention model determines the hot word related features required at the current decoding moment from the hot word encoding result.
  • the classification nodes of the classifier module include fixed common character nodes and dynamically expandable hot word nodes;
  • the classifier module uses the output characteristics of the decoder module at the current decoding time to determine the recognition result of the speech to be recognized at the current decoding time, including:
  • the classifier module uses the output characteristics of the decoder module at the current decoding moment to determine the probability score of each of the commonly used character nodes and the probability score of each of the hot word nodes;
  • according to the probability score of each commonly used character node and the probability score of each hot word node, the recognition result of the speech to be recognized at the current decoding moment is determined.
  • the dynamically expandable hot word node corresponds to the hot word in the hot word database in a one-to-one correspondence.
  • Optionally, the acquiring the voice to be recognized and the configured hot word library includes: determining a conversation scene of the voice to be recognized, and acquiring a hot word library related to the conversation scene.
  • Optionally, the acquiring the voice to be recognized and the configured hot word library includes: acquiring the voice produced by a user in a human-computer interaction scenario as the voice to be recognized, and acquiring a pre-configured hot word library composed of operation keywords in the user's voice control instructions.
  • Optionally, after the recognition result is obtained, the method further includes:
  • determining an interactive response that matches the recognition result, and outputting the interactive response.
  • a speech recognition device including:
  • the data acquisition unit is used to acquire the voice to be recognized and the configured hot word database
  • An audio-related feature acquisition unit configured to determine the audio-related features required at the current decoding moment based on the to-be-recognized speech and the hot vocabulary
  • the hot word related feature acquisition unit is configured to determine the hot word related features required at the current decoding moment from the hot word database based on the audio related features;
  • the recognition result obtaining unit is configured to determine the recognition result of the voice to be recognized at the current decoding moment based on the audio related features and the hot word related features.
  • a speech recognition device including: a memory and a processor;
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the voice recognition method described above.
  • a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, each step of the above voice recognition method is realized.
  • a computer program product which, when run on a terminal device, causes the terminal device to execute each step of the above-mentioned voice recognition method.
  • The speech recognition method provided by this application is configured with a hot word library, that is, with hot words that may exist in the speech to be recognized. During recognition, the audio-related features required at the current decoding moment are determined based on the speech to be recognized and the hot word library. Because hot word information participates in determining the audio-related features, if the speech segment at the current decoding moment contains a hot word, the determined audio-related features can cover the complete audio information corresponding to the hot word rather than only partial information. The hot word-related features required at the current decoding moment are then determined from the hot word library based on the audio-related features.
  • Such hot word-related features can accurately indicate whether the speech segment at the current decoding moment contains a hot word and which hot word it contains. Finally, the recognition result of the speech to be recognized at the current decoding moment is determined based on the audio-related features and the hot word-related features, so hot words are recognized more accurately.
  • Figure 1 illustrates the existing end-to-end speech recognition framework based on the attention mechanism
  • Figure 2 illustrates an improved end-to-end speech recognition framework based on the attention mechanism
  • FIG. 3 is a flowchart of a voice recognition method provided by an embodiment of this application.
  • FIG. 4 is another improved end-to-end speech recognition framework based on the attention mechanism provided by an embodiment of this application;
  • FIG. 5 is a schematic diagram of a hot word encoder with a bidirectional long short-term memory layer encoding a hot word, according to an example of an embodiment of the application;
  • FIG. 6 is a schematic structural diagram of a speech recognition device provided by an embodiment of this application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • the inventor further proposes a scheme to increase the probability of the hot word score at the model level, which is achieved by modifying the structure of the speech recognition model.
  • the schematic frame of the modified speech recognition model is shown in Figure 2.
  • the hot word encoder module Bias encoder is added, which can encode hot words. Further, the state information of the decoder is used to operate the audio encoding feature and the hot word encoding feature through the attention mechanism, respectively, to obtain the audio related features and the hot word related features required for decoding. Furthermore, it decodes and classifies based on the audio-related features and hot word-related features to obtain the recognized text corresponding to the input speech.
  • However, the state information of the decoder contains only the text and audio information of the already decoded results. Using such a state, which contains only historical information, as the query item of the attention mechanism means that the audio-related features obtained by attending over the audio encoding features are not necessarily complete, and the hot word-related features obtained by attending over the hot word encoding features are not necessarily accurate, so the final hot word recognition accuracy is not particularly high.
  • the inventor further proposes another improvement scheme to solve the above-mentioned problem.
  • the speech recognition method proposed in this case will be introduced in detail.
  • the speech recognition method of this case can be applied to any scene where speech recognition is required.
  • the voice recognition method can be implemented based on electronic devices, such as mobile phones, translators, computers, servers and other devices with data processing capabilities.
  • Step S100 Obtain a voice to be recognized and a configured hot vocabulary.
  • the voice that needs to be recognized for this voice recognition task is used as the voice to be recognized.
  • the configured hot word database can be obtained, and there are multiple hot words stored in the hot word database. It is understandable that the hot word database may be composed of hot words related to the speech recognition task, for example, all the hot words that may exist in the speech to be recognized, such as professional terms, are formed into a hot word database.
  • Step S110 Based on the to-be-recognized speech and the hot vocabulary, determine the audio-related features required at the current decoding moment.
  • the hot word database is taken into consideration, that is, the hot word database is involved in determining the audio-related features. In the calculation process, it has the function of detecting whether the character to be decoded at the current decoding moment is a hot word.
  • the finally obtained audio-related features can contain the complete audio information of the character currently to be decoded.
  • Step S120 Based on the audio related features, determine the hot word related features required at the current decoding moment from the hot word database.
  • the audio-related features required at the current decoding moment have already been determined. Therefore, based on the audio-related features, the hot word-related features required at the current decoding moment can be determined from the hot word library; the hot word-related features represent the content of the hot words that may appear at the current decoding moment.
  • Because the audio-related features can contain the complete audio information of the character currently being decoded, the hot word-related features determined from the hot word library at the current decoding moment adapt better to hot words of different lengths.
  • Step S130 Determine the recognition result of the voice to be recognized at the current decoding moment based on the audio related features and the hot word related features.
  • the characters to be decoded at the current moment can be decoded and recognized based on the two, so as to determine the recognition result of the speech to be recognized at the current decoding moment.
  • The speech recognition method provided by the embodiments of the present application is configured with a hot word library, that is, with hot words that may exist in the voice to be recognized. During recognition, the audio-related features required at the current decoding moment are determined based on the voice to be recognized and the hot word library. Because hot word information participates in determining the audio-related features, if the speech segment at the current decoding moment contains a hot word, the determined audio-related features can cover the complete audio information corresponding to the hot word rather than only local information. The audio-related features are then used to determine the hot word-related features required at the current decoding moment from the hot word library.
  • The hot word-related features determined in this way can accurately indicate whether the speech segment at the current decoding moment contains a hot word and which hot word it contains. Finally, based on the audio-related features and the hot word-related features, the recognition result of the speech to be recognized at the current decoding moment is determined, and hot words are recognized more accurately.
  • the embodiment of the present application introduces the implementation of the above step S100 to obtain the voice to be recognized and the configured hot word database.
  • the conversation scene of the voice to be recognized may be determined.
  • a hot word database related to the conversation scene can be obtained as the hot word database configured in this case.
  • this application can pre-configure the hot word library corresponding to each conversation scene, and after determining the conversation scene of the speech to be recognized, obtain the corresponding hot word library.
  • the user's interaction with the machine will involve the user's voice control instructions, that is, the user sends voice control instructions to the machine to achieve a predetermined purpose.
  • the user's voice controls the smart TV to perform related operations such as switching stations, increasing and decreasing the volume
  • the user's voice controlling the smart robot to play songs, query the weather, perform predetermined actions, and so on.
  • the present application can form a hot word database by operating keywords in the user's voice control instructions.
  • the embodiment of the present application can obtain the voice produced by the user in the human-computer interaction scenario as the voice to be recognized, and obtain the hot word library pre-configured for the human-computer interaction scenario, which consists of the operation keywords in the user's voice control instructions.
  • an interactive response that matches the recognition result can be determined based on the recognition result, and the interactive response is output.
  • step S110 based on the to-be-recognized speech and the hot vocabulary, the audio-related features required at the current decoding moment are determined.
  • the information of the decoded result before the current decoding moment may be obtained first.
  • the decoded result information may include text information and audio information of decoded characters.
  • the hot word database is taken into consideration, that is, the hot word database is involved in determining the audio-related features.
  • the hot word-related features required at the current decoding moment can then be determined from the hot word library, which adapts better to hot words of different lengths.
  • step S110 may include:
  • the audio features of the speech to be recognized can be determined first, and the audio features can be selected from the FilterBank feature of the filter bank, the Mel frequency cepstral coefficient MFCC feature, the perceptual linear prediction PLP feature, and the like. Further, based on the decoded result information and the hot vocabulary, the audio-related features required at the current decoding moment are determined from the audio features of the speech to be recognized.
  • the audio-related features carry complete audio information of the character to be decoded at the current decoding moment. On this basis, sufficient audio-related features can be provided for subsequent accurate recognition of hot words.
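  • As an illustrative sketch, FilterBank features of the kind mentioned above could be extracted with a library such as librosa; the 16 kHz sampling rate, 25 ms frames, 10 ms shift, and 80 Mel bands are assumptions:

```python
import librosa
import numpy as np

def filterbank_features(wav_path, n_mels=80, frame_len=0.025, frame_shift=0.010):
    """Compute log Mel filter-bank (FBank) features, one feature vector per frame."""
    audio, sr = librosa.load(wav_path, sr=16000)            # assume 16 kHz speech
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_mels=n_mels,
        n_fft=int(sr * frame_len), hop_length=int(sr * frame_shift))
    fbank = np.log(mel + 1e-6).T                            # shape: (K frames, n_mels)
    return fbank                                            # the sequence X = [x_1, ..., x_K]
```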
  • steps S110-S130 in the above embodiment determine the audio-related features and hot word-related features required at the current decoding moment and, based on these, the recognition result of the speech to be recognized at the current decoding moment. In an optional embodiment, this process can be implemented with a speech recognition model.
  • the speech recognition model provided in this embodiment is different from the traditional speech recognition model.
  • the speech recognition model in this embodiment is configured with the ability to receive and process the speech to be recognized and the hot word library and to output the recognition result of the speech to be recognized.
  • the speech recognition model may have information based on the decoded result before the current decoding time and a hot word database, determine the audio-related features required at the current decoding time from the speech to be recognized, and determine from the hot word database based on the audio-related features For the hot word related features required at the current decoding moment, based on the audio related features and the hot word related features, the ability to determine the recognition result of the speech to be recognized at the current decoding moment.
  • the voice recognition model can be used to process the to-be-recognized voice and the hot vocabulary obtained in step S100, and the voice recognition model can output the recognition result of the to-be-recognized voice.
  • the audio features of the voice to be recognized and the hot vocabulary can be input into the voice recognition model to obtain the recognition result of the voice to be recognized output by the model.
  • the speech recognition model can include an audio encoder module, a hot word encoder module, a joint attention module, a decoder module, and a classifier module. Each module cooperates to realize the process of processing the hot words in the received hot word database and the audio features of the speech to be recognized, and finally outputting the recognition result. Next, we will introduce each module separately.
  • The audio features of the speech to be recognized can be denoted as X = [x_1, x_2, ..., x_K], where x_k represents the audio feature vector of the k-th frame and K is the total number of speech frames of the speech to be recognized.
  • Audio encoder module
  • the audio encoder module encodes the speech to be recognized, and obtains the audio encoding result.
  • the audio encoder module can encode the audio feature X of the speech to be recognized, and obtain an audio feature vector sequence composed of audio feature vectors of each frame of speech after encoding.
  • the audio feature vector sequence obtained after encoding is expressed as H^x = [h^x_1, h^x_2, ..., h^x_K], where h^x_k is the encoded feature vector of the k-th frame of speech.
  • the audio encoder module can contain one or more coding layers.
  • the coding layer can be a unidirectional or bidirectional long short-term memory layer of a long short-term memory neural network, or a convolutional layer of a convolutional neural network.
  • The specific structure can be determined according to application requirements. For example, for speech recognition with real-time requirements, 3 to 5 unidirectional long short-term memory layers can be used, and for speech recognition without real-time requirements, 3 to 5 bidirectional long short-term memory layers can be used. Here, the real-time requirement refers to recognizing while the user is speaking, rather than waiting for the speech to end and producing the recognition result all at once.
  • For example, a 5-layer unidirectional long short-term memory stack can be chosen to process the input audio features X = [x_1, x_2, ..., x_K] and output the encoded audio feature vector sequence H^x.
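  • A minimal PyTorch sketch of such an audio encoder follows; the feature dimension, hidden size, and layer count are assumptions:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes frame-level audio features X into the audio feature vector sequence H^x."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=5):
        super().__init__()
        # Unidirectional LSTM stack, suitable when real-time (streaming) recognition is required.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, x):            # x: (batch, K, feat_dim), one feature vector per frame
        h_x, _ = self.lstm(x)        # h_x: (batch, K, hidden_dim), one encoded vector per frame
        return h_x
```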
  • Hot word encoder module
  • the hot word encoder module encodes each hot word in the hot word database to obtain the hot word encoding result.
  • the hot word encoder module can independently encode each hot word in the hot word database, and obtain a hot word feature vector sequence composed of hot word feature vectors independently encoded by each hot word.
  • The hot word library can be denoted as Z = [z_0, z_1, ..., z_N], where z_N is the N-th hot word.
  • z_0 is a special hot word "<no-bias>", which means "no hot word".
  • If the hot word selected during decoding is z_0, it means that the character to be decoded at the current decoding moment is not any hot word; this is used to handle the case where the speech segment being recognized is not a hot word.
  • The total number of hot words is therefore N+1. The hot word encoder encodes each hot word independently, and the obtained hot word feature vector sequence is expressed as H^z = [h^z_0, h^z_1, ..., h^z_N].
  • variable-length hot words can be uniformly coded into vectors of the same dimension.
  • the hot word encoder module may independently encode each hot word into a hot word feature vector of the same dimension according to a set dimension.
  • the hot word encoder module can include one or more coding layers, and the coding layer can be a one-way or two-way long and short-term memory layer in a long- and short-term memory neural network or a convolutional layer of a convolutional neural network.
  • a bidirectional long short-term memory layer, which can see all of the left and right context at the same time, encodes hot words better than a unidirectional one. Taking a bidirectional long short-term memory layer and the hot word "科大讯飞" as an example, the hot word is composed of the four characters "科 (ke)", "大 (da)", "讯 (xun)" and "飞 (fei)".
  • The hot word encoder module with a bidirectional long short-term memory layer encodes it as shown in Figure 5:
  • the left side is the forward part of the bidirectional long short-term memory layer, and the right side is the backward part.
  • The forward and backward output vectors of the last step are concatenated,
  • and the resulting vector h^z is the encoding vector representation of the hot word.
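  • A minimal PyTorch sketch of such a hot word encoder follows; the embedding and hidden sizes are assumptions, and padding handling is omitted for brevity:

```python
import torch
import torch.nn as nn

class HotwordEncoder(nn.Module):
    """Encodes each hot word (a character sequence) into one fixed-dimension vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                    # char_ids: (N+1, max_len) character ids,
        emb = self.embed(char_ids)                  # including a row for the <no-bias> token
        _, (h_n, _) = self.bilstm(emb)              # h_n: (2, N+1, hidden_dim)
        # Concatenate the last forward state and the last backward state, so hot words of
        # different lengths are all mapped to vectors of the same dimension.
        h_z = torch.cat([h_n[0], h_n[1]], dim=-1)   # H^z: (N+1, 2 * hidden_dim)
        return h_z
```

  • In this sketch the <no-bias> entry is simply treated as one more "hot word" row, matching the z_0 convention described above.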
  • the joint attention module receives and processes the audio encoding result and the hot word encoding result, and obtains the splicing feature required at the current decoding moment, and the splicing feature includes audio-related features and hot word-related features.
  • the joint attention module may include:
  • a first attention model and a second attention model, where:
  • the first attention model may be based on the state vector representing the decoded result information output by the decoder module at the current decoding moment, and the hot word encoding result, to determine the audio-related features required at the current decoding moment from the audio encoding result.
  • the state vector and the hot word encoding result can be used as the input of the first attention model, and the first attention model determines the audio-related features required at the current decoding moment from the audio encoding result.
  • the second attention model may be based on audio-related features to determine the hot-word related features required at the current decoding moment from the hot-word encoding result.
  • the audio-related features can be used as the input of the second attention model, and the second attention model determines the hot word related features required at the current decoding moment from the hot word encoding result.
  • the audio-related features and the hot word-related features are combined into the splicing features required at the current decoding moment.
  • the state vector output by the decoder module at the current decoding moment represents the decoded result information. Therefore, based on the state vector and the hot word encoding result, an attention operation can be performed on the audio encoding result to determine the audio-related features required at the current decoding moment. That is, the first attention model in this embodiment uses a joint attention mechanism over audio and hot words, allowing hot words to participate in the calculation of the audio-related features. Because hot word information is used, if the character to be decoded at the current decoding moment is a hot word, the audio-related features can capture the complete audio information corresponding to that hot word.
  • Further, the second attention model uses the audio-related features to perform an attention operation on the hot word encoding result to determine the hot word-related features required at the current decoding moment. Because the audio-related features contain the complete audio information of the hot word, the hot word-related features obtained in this way are also more accurate.
  • the attention mechanism uses a vector as the query item, performs attention mechanism operations on a set of feature vector sequences, and selects the feature vector that best matches the query item as the output, specifically: the query item and the feature vector sequence Calculate a matching coefficient for each feature vector in, and then multiply and sum these matching coefficients with the corresponding feature vector to obtain a new feature vector that is the feature vector that best matches the query item.
  • Specifically, the first attention model determines the audio-related features required at the current decoding moment from the audio feature vector sequence H^x based on the state vector d_t and the hot word feature vector sequence H^z,
  • and the second attention model then performs an attention operation on the hot word feature vector sequence H^z, with the audio-related features as the query item, to determine the hot word-related features required at the current decoding moment.
  • Specifically, the first attention model uses the combination of each hot word feature vector h^z_i in the hot word feature vector sequence H^z with the state vector d_t as the query item, and performs an attention operation over the audio feature vectors in the audio feature vector sequence H^x to obtain the matching coefficient matrix E_t. The matching coefficient matrix E_t contains the matching degree between every hot word and every frame of speech: its element e^t_{ij} indicates the matching degree between the i-th hot word and the j-th frame of speech, that is, the possibility that the j-th frame of speech belongs to the i-th hot word.
  • For example, e^t_{ij} can be computed as e^t_{ij} = <W_d·d_t + W_z·h^z_i, W_x·h^x_j>, where W_d, W_z, and W_x are model parameter matrices acting on the vectors d_t, h^z_i, and h^x_j respectively; the three matrices have the same number of rows D, so the projected vectors all have dimension D, and the operator <·,·> denotes the vector inner product.
  • The element of E_t in row i and column j is e^t_{ij}: a row of E_t gives the matching degrees between one hot word and the whole audio feature vector sequence, and a column of E_t gives the matching degrees between one frame's audio feature vector and the whole hot word feature vector sequence.
  • The first attention model then determines the audio-related features required at the current decoding moment from the audio feature vector sequence H^x according to the matching coefficient matrix E_t.
  • The process may include the following steps:
  • S1: determine the probability w_t of each hot word being the character to be decoded at the current decoding moment. Since the element in row i and column j of E_t indicates the possibility that the j-th frame of audio belongs to the i-th hot word, each column of E_t is softmax-normalized (over the hot words), and all the normalized column vectors are then averaged to obtain an (N+1)-dimensional vector, denoted w_t.
  • S2: based on the matching coefficient matrix E_t and the probability w_t of each hot word being the character to be decoded at the current decoding moment, determine the probability a_t of each frame of speech being the speech content required at the current decoding moment.
  • Specifically, each row of E_t is softmax-normalized (over the frames) to obtain a matrix A_t; the elements of w_t are then used as weighting coefficients for the row vectors of A_t, and the weighted sum of all the row vectors gives a K-dimensional vector, denoted a_t.
  • S3: the elements of a_t are used as weighting coefficients for the audio feature vectors at the corresponding positions in the audio feature vector sequence H^x, and the weighted sum of the audio feature vectors gives the audio-related feature required at the current decoding moment, denoted here as c^x_t.
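  • The following minimal sketch illustrates steps S1-S3; the additive-projection scoring function and all dimensions are assumptions:

```python
import torch
import torch.nn.functional as F

def first_attention(d_t, h_z, h_x, W_d, W_z, W_x):
    """
    Joint audio / hot word attention for one decoding step (illustrative sketch).
    d_t : (D_d,)       decoder state vector at the current decoding moment
    h_z : (N+1, D_z)   hot word feature vectors H^z (including <no-bias>)
    h_x : (K, D_x)     audio feature vectors H^x
    W_d, W_z, W_x : projection matrices whose outputs share the same dimension D
    Returns the audio-related feature c_x and the hot word probabilities w_t.
    """
    q = d_t @ W_d.T + h_z @ W_z.T          # (N+1, D): one query per hot word
    k = h_x @ W_x.T                        # (K, D):   one key per frame
    E = q @ k.T                            # (N+1, K): e_ij = match of hot word i with frame j
    # S1: softmax over hot words within each frame, then average over frames -> w_t.
    w_t = F.softmax(E, dim=0).mean(dim=1)  # (N+1,)
    # S2: each hot word's attention over frames, weighted by that hot word's probability -> a_t.
    A_t = F.softmax(E, dim=1)              # (N+1, K)
    a_t = w_t @ A_t                        # (K,)
    # S3: weighted sum of the audio feature vectors -> audio-related feature c_x.
    c_x = a_t @ h_x                        # (D_x,)
    return c_x, w_t
```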
  • The second attention model determines the hot word-related features required at the current decoding moment from the hot word feature vector sequence H^z based on the above audio-related features.
  • The process may include the following steps:
  • the second attention model uses the audio-related features as the query item and performs an attention operation on the hot word feature vector sequence H^z to obtain the hot word matching coefficient vector b_t.
  • The hot word matching coefficient vector b_t contains the probability of each hot word being the character to be decoded at the current decoding moment,
  • and can be denoted as b_t = [b^t_0, b^t_1, ..., b^t_N].
  • Specifically, a matching coefficient between the audio-related feature and each hot word feature vector is calculated by a small neural network, and these matching coefficients are then normalized by softmax to obtain b_t.
  • Using the elements of b_t as weights, the hot word feature vectors in H^z are weighted and summed to obtain the hot word-related feature required at the current decoding moment, denoted here as c^z_t.
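  • A minimal sketch of this second attention step follows; the exact form of the small matching network is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondAttention(nn.Module):
    """Scores each hot word against the audio-related feature (illustrative sketch)."""
    def __init__(self, audio_dim, hotword_dim, hidden_dim=128):
        super().__init__()
        # A small neural network that scores one (audio feature, hot word vector) pair.
        self.score = nn.Sequential(
            nn.Linear(audio_dim + hotword_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, c_x, h_z):                      # c_x: (audio_dim,), h_z: (N+1, hotword_dim)
        q = c_x.unsqueeze(0).expand(h_z.size(0), -1)  # repeat the query for every hot word
        scores = self.score(torch.cat([q, h_z], dim=-1)).squeeze(-1)
        b_t = F.softmax(scores, dim=0)                # (N+1,) probability of each hot word
        c_z = b_t @ h_z                               # hot word-related feature, (hotword_dim,)
        return c_z, b_t
```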
  • The audio-related feature c^x_t and the hot word-related feature c^z_t are spliced to obtain the splicing feature c_t required at the current decoding moment, and the splicing feature c_t is sent to the decoder module.
  • In addition, the hot word matching coefficient vector b_t determined above, which gives the probability of each hot word being the character to be decoded at the current decoding moment, can also be sent to the classifier module for use in classifying hot words.
  • the decoder module receives and processes the splicing features required by the current decoding moment output by the joint attention module to obtain the output features of the decoder module at the current decoding moment.
  • At the current decoding moment t, the decoder module can use the splicing feature c_{t-1} required at the previous decoding moment t-1 and the recognition result character of the previous decoding moment t-1 to calculate the state vector d_t at the current decoding moment t.
  • d_t has two functions. One is to be sent to the joint attention module, which executes the operations introduced in the above embodiment to calculate c_t for the current decoding moment.
  • The other is that the decoder module uses the state vector d_t at the current decoding moment together with the splicing feature c_t required at the current decoding moment to calculate the output feature of the decoder module at the current decoding moment.
  • the decoder module can include multiple neural network layers; in this application, two unidirectional long short-term memory layers can be used.
  • The first long short-term memory layer takes the recognition result character at time t-1 and the splicing feature c_{t-1} output by the joint attention module as input, and calculates the state vector d_t of the decoder module at the current decoding moment.
  • The decoder module then uses d_t and c_t as the input of the second long short-term memory layer and calculates the output feature of the decoder module.
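  • A minimal sketch of such a two-layer decoder follows (the character embedding and dimensions are assumptions); in a full decoding loop, d_t would first be handed to the joint attention module to produce c_t before the second layer runs:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Two unidirectional LSTM layers, as described above (illustrative sketch)."""
    def __init__(self, vocab_size, emb_dim=256, ctx_dim=768, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTMCell(emb_dim + ctx_dim, hidden_dim)     # produces the state vector d_t
        self.lstm2 = nn.LSTMCell(hidden_dim + ctx_dim, hidden_dim)  # produces the output feature

    def state(self, y_prev, c_prev, state1):
        """d_t from the previous output character y_{t-1} and the previous splicing feature c_{t-1}."""
        h1, c1 = self.lstm1(torch.cat([self.embed(y_prev), c_prev], dim=-1), state1)
        return h1, (h1, c1)                  # d_t and the updated first-layer state

    def output(self, d_t, c_t, state2):
        """Decoder output feature from d_t and the current splicing feature c_t."""
        h2, c2 = self.lstm2(torch.cat([d_t, c_t], dim=-1), state2)
        return h2, (h2, c2)
```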
  • Classifier module
  • the classifier module uses the output characteristics of the decoder module at the current decoding time to determine the recognition result of the speech to be recognized at the current decoding time.
  • Specifically, the classifier module can use the output feature of the decoder module at the current decoding moment to determine the recognition result of the speech to be recognized at the current decoding moment.
  • The output feature is jointly determined by the state vector d_t of the decoder module and the splicing feature c_t required at the current decoding moment. Because the splicing feature c_t contains the complete audio information of the potential hot word rather than only partial information, and the hot word-related features determined on this basis are also more accurate, the final output feature is more accurate, the recognition result determined from it is more accurate, and the recognition accuracy of hot words can be improved.
  • In this application, two implementations of the classifier module are provided.
  • One is to use an existing conventional static classifier.
  • the number of classification nodes in the static classifier always remains the same, and the nodes correspond to commonly used characters.
  • The classifier module can determine the probability score of each classification node's character based on the output feature, and then combine the results into the final recognition result.
  • However, this conventional static classifier expresses a hot word as a combination of commonly used characters and decodes the hot word character by character, which can easily cause false triggering of hot words in non-hot-word segments. For example, if an utterance contains characters that are pronounced the same as part of a hot word (such as "训飞", which sounds identical to the "讯飞" in the hot word "科大讯飞"), the recognition result obtained with the static classifier may wrongly output the hot word characters "讯飞": because the static classifier decodes and excites hot words character by character, every single character has the possibility of being excited.
  • the classification nodes of the classifier module include not only fixed commonly used character nodes, but also dynamically expandable hot word nodes, so as to directly recognize hot words.
  • The purpose of this is that there is no need to split a hot word, as in the prior art, into individual characters that are recognized and excited character by character.
  • For the same utterance, because "训飞" only matches the pronunciation of some of the characters in the hot word "科大讯飞",
  • its matching degree with the hot word as a whole is far from high, so the problem of misrecognizing the whole hot word does not occur.
  • the number of hot word nodes in the classifier module of this embodiment can be dynamically adjusted according to the scene. For example, if there are N hot words in the hot word database corresponding to the current scene, N hot word nodes of the same number can be set. Taking Chinese speech recognition as an example, using Chinese characters as the modeling unit, assuming that the number of commonly used Chinese characters is V, the number of fixed commonly used character nodes of the classifier module is V. When there are N hot words in the hot word database, Then the number of hot word nodes in the classifier module can be N, that is, the number of all classification nodes in the classifier module is V+N.
  • the process of speech recognition by the classifier module can include:
  • the classifier module uses the output feature of the decoder module at the current decoding moment to determine the probability score of each commonly used character node and the probability score of each hot word node, and then determines the final recognition result.
  • Optionally, the classifier module can use the output feature of the decoder module at the current decoding moment to determine the probability score of each commonly used character node, and separately determine the probability score of each hot word node.
  • Optionally, the classifier module can use the output feature of the decoder module at the current decoding moment to determine the probability score of each commonly used character node, and further use the hot word matching coefficient vector b_t introduced in the foregoing embodiment to determine the probability score of each hot word node.
  • Specifically, the probability scores of the fixed commonly used character nodes in the classifier module can be determined by a static classifier.
  • The static classifier uses the output feature of the decoder module at the current decoding moment to determine the probability score of each commonly used character node.
  • The static classifier outputs a V-dimensional probability distribution, which can be written as P_v(y_t) = softmax(W·h_t), where h_t denotes the output feature of the decoder module at the current decoding moment,
  • y_t represents the character to be decoded at the current decoding moment t,
  • and the matrix W is the model parameter of the static classifier; assuming the dimension of the output feature h_t is M, W is a V×M matrix, and the elements of P_v(y_t) are the probability scores of the commonly used characters corresponding to the commonly used character nodes.
  • the hot word classifier may use the hot word matching coefficient vector b t to determine the probability score of each hot word node.
  • the hot word matching coefficient vector b t contains the probability that each hot word is the character to be decoded at the current decoding moment, so this probability can be used as the probability score of the corresponding hot word node.
  • the recognition results of the speech to be recognized at the current decoding moment can be determined according to the probability scores of the two types of nodes.
  • the classifier module can also add a judge to decide which classifier result to use as the final result.
  • the judger outputs a scalar probability value, denoted here as g_t, that indicates the probability of taking the result of the hot word classifier (as opposed to the static classifier) as the final output result at the current decoding moment t.
  • Taking g_t as the probability that the hot word classifier's result is used as the final output at decoding moment t, it can be expressed, for example, as g_t = sigmoid(w_b·h_t),
  • where w_b is a model parameter, a weight vector with the same dimension as h_t, and sigmoid is the neural network activation function.
  • The judger can then determine the recognition result of the speech to be recognized at the current decoding moment according to the probability scores output by the two classifiers.
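  • A minimal sketch of one way the static classifier, hot word scores, and judger could be combined over the V common character nodes and N+1 hot word nodes; the gating combination shown is an assumption rather than a prescribed formula:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HotwordAwareClassifier(nn.Module):
    """Static classifier over V common characters plus dynamically sized hot word nodes (sketch)."""
    def __init__(self, feat_dim, num_chars):
        super().__init__()
        self.char_proj = nn.Linear(feat_dim, num_chars, bias=False)  # the V x M matrix W
        self.judge = nn.Linear(feat_dim, 1)                          # scalar gate g_t

    def forward(self, h_t, b_t):
        # h_t: (feat_dim,) decoder output feature; b_t: (N+1,) hot word matching coefficients.
        p_char = F.softmax(self.char_proj(h_t), dim=-1)   # P_v(y_t): scores of common character nodes
        g_t = torch.sigmoid(self.judge(h_t)).squeeze(-1)  # probability of trusting the hot word branch
        # Assumed combination: scale the two branches by the gate and concatenate,
        # giving one score per node over V + (N+1) classification nodes.
        scores = torch.cat([(1 - g_t) * p_char, g_t * b_t], dim=-1)
        return scores                                      # argmax gives the decoded character/hot word
```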
  • the text annotation sequence of the speech training data can be expressed as Y = [y_0, y_1, ..., y_T].
  • y_t represents the t-th character in the text annotation sequence,
  • and T+1 is the total number of characters in the annotated text.
  • y_0 is the sentence-start symbol "<s>"
  • and y_T is the sentence-end symbol "</s>".
  • For example, for the transcript "欢迎来到科大讯飞" ("Welcome to iFLYTEK"), Y = [<s>, 欢, 迎, 来, 到, 科, 大, 讯, 飞, </s>].
  • the audio features can be selected as the FilterBank feature of the filter bank, the Mel frequency cepstrum coefficient MFCC feature, and the perceptual linear prediction PLP feature.
  • this application can pre-set two parameters P and N, where P is the probability that a given sentence of training data is selected for hot word training and N is the maximum number of characters of a selected training hot word. Any sentence of training data is selected for hot word training with probability P, and at most N consecutive characters are selected from that sentence's text annotation sequence as the training hot word. Taking "欢迎来到科大讯飞" as an example, the hot words selected for training from this sentence can be annotated in different ways, for example:
  • a first type of annotation, in which "科大讯飞" is selected as the hot word for training, and a second type of annotation, in which "科大" is selected as the hot word for training.
  • the training hot words and audio features are used as training sample input, and the recognized text of the voice training data is used as the sample label to train the voice recognition model.
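  • A minimal sketch of this random hot word selection during training follows; allowing a minimum length of one character is an assumption:

```python
import random

def sample_training_hotword(chars, p_select, max_len):
    """
    Randomly pick a training hot word from one transcript (illustrative sketch).
    chars   : the characters of the text annotation sequence, without <s>/</s>
    p_select: the probability P that this sentence is used for hot word training
    max_len : the maximum number of characters N of a selected training hot word
    Returns the selected hot word string, or None if this sentence is not selected.
    """
    if not chars or random.random() > p_select:
        return None
    length = random.randint(1, min(max_len, len(chars)))    # number of consecutive characters
    start = random.randint(0, len(chars) - length)          # starting position in the transcript
    return "".join(chars[start:start + length])

# Example: sample_training_hotword(list("欢迎来到科大讯飞"), p_select=0.5, max_len=4)
# may return "科大讯飞", "科大", another consecutive span, or None.
```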
  • the embodiment of the present application also provides a voice recognition device.
  • the voice recognition device provided in the embodiment of the present application will be described below.
  • the voice recognition device described below and the voice recognition method described above can be referred to each other.
  • FIG. 6 shows a schematic structural diagram of a voice recognition device provided by an embodiment of the present application.
  • the voice recognition device may include:
  • the data acquisition unit 11 is used to acquire the voice to be recognized and the configured hot word database
  • the audio-related feature acquisition unit 12 is configured to determine the audio-related features required at the current decoding moment based on the to-be-recognized speech and the hot vocabulary;
  • the hot word related feature acquiring unit 13 is configured to determine the hot word related features required at the current decoding moment from the hot word database based on the audio related features;
  • the recognition result obtaining unit 14 is configured to determine the recognition result of the speech to be recognized at the current decoding moment based on the audio related features and the hot word related features.
  • the aforementioned audio-related feature acquisition unit may include:
  • the first audio-related feature acquiring subunit is used to acquire decoded result information before the current decoding moment
  • the second audio-related feature acquisition subunit is used to determine the audio-related features required for the current decoding moment from the to-be-recognized speech based on the decoded result information and the hot vocabulary.
  • the above-mentioned audio-related feature acquisition unit, hot word-related feature acquisition unit, and recognition result acquisition unit can be implemented through a voice recognition model.
  • Specifically, a pre-trained voice recognition model is used to process the voice to be recognized and the hot word library, to obtain the recognition result of the speech to be recognized output by the speech recognition model, where:
  • the voice recognition model has the ability to receive and process the to-be-recognized voice and the hot vocabulary to output the recognition result of the to-be-recognized voice.
  • the speech recognition model is configured with the ability to determine, based on the information of the decoded results before the current decoding moment and the hot word library, the audio-related features required at the current decoding moment from the audio features; to determine, based on the audio-related features, the hot word-related features required at the current decoding moment from the hot word library; and to determine, based on the audio-related features and the hot word-related features, the recognition result of the speech to be recognized at the current decoding moment.
  • the speech recognition model may include an audio encoder module, a hot word encoder module, a joint attention module, a decoder module, and a classifier module;
  • the audio encoder module encodes the to-be-recognized speech to obtain an audio encoding result.
  • the audio feature is encoded by the audio encoder module to obtain an audio feature vector sequence composed of audio feature vectors of each frame of speech.
  • the hot word encoder module encodes each hot word in the hot word database to obtain a hot word encoding result.
  • each of the hot words is independently encoded by the hot word encoder module to obtain a hot word feature vector sequence composed of hot word feature vectors after each hot word is independently encoded.
  • the joint attention module receives and processes the audio encoding result and the hot word encoding result to obtain splicing features required at the current decoding moment, and the splicing features include audio-related features and hot word-related features.
  • the decoder module receives and processes the splicing features required at the current decoding time to obtain the output features of the decoder module at the current decoding time.
  • the classifier module uses the output characteristics of the decoder module at the current decoding time to determine the recognition result of the speech to be recognized at the current decoding time.
  • the joint attention module may include:
  • a first attention model and a second attention model, wherein:
  • the first attention model determines, from the audio encoding result, the audio-related features required at the current decoding moment based on the hot word encoding result and the state vector output by the decoder module at the current decoding moment, the state vector representing the decoded result information.
  • the state vector and the hot word encoding result may be used as the input of the first attention model, and the first attention model determines the audio-related features required at the current decoding moment from the audio encoding result .
  • the second attention model determines from the hot word encoding result the hot word related characteristics required at the current decoding moment based on the audio related features.
  • the audio-related features may be used as the input of the second attention model, and the second attention model determines the hot-word-related features required at the current decoding moment from the hot-word encoding result.
  • the audio related features and the hot word related features are combined into the splicing features required at the current decoding moment.
  • the process of independently encoding each of the hot words by the hot word encoder module may include:
  • the hot word encoder module independently encodes each of the hot words into hot word feature vectors of the same dimension according to the set dimensions.
  • Specifically, the first attention model determines the audio-related features required at the current decoding moment from the audio feature vector sequence, based on the hot word feature vector sequence and the state vector representing the decoded result information output by the decoder module at the current decoding moment.
  • This process can include:
  • the first attention model uses the combination of each hot word feature vector in the hot word feature vector sequence and the state vector as the query item, and performs an attention operation on the audio feature vector sequence to obtain a matching coefficient matrix,
  • where the matching coefficient matrix includes the degree of matching between every hot word and every frame of speech;
  • and, according to the matching coefficient matrix, the audio-related features required at the current decoding moment are determined from the audio feature vector sequence.
  • the above-mentioned first attention model determines from the audio feature vector sequence the audio-related features required at the current decoding moment according to the matching coefficient matrix, which may include:
  • according to the matching coefficient matrix, determine the probability of each hot word being the character to be decoded at the current decoding moment;
  • according to the matching coefficient matrix and the probability of each hot word being the character to be decoded at the current decoding moment, determine the probability of each frame of speech being the speech content required at the current decoding moment;
  • and, using these probabilities as weights, take the weighted sum of the audio feature vectors of each frame of speech in the audio feature vector sequence to obtain the audio-related features required at the current decoding moment.
  • the above-mentioned second attention model determines the hot word related features required at the current decoding moment from the hot word feature vector sequence based on the audio related features, which may include:
  • the second attention model uses the audio-related features as the query item and performs an attention operation on the hot word feature vector sequence to obtain a hot word matching coefficient vector,
  • where the hot word matching coefficient vector contains the probability of each hot word being the character to be decoded at the current decoding moment;
  • and, using the elements of the hot word matching coefficient vector as weights, the hot word feature vectors in the hot word feature vector sequence are weighted and summed to obtain the hot word-related features required at the current decoding moment.
  • the above-mentioned joint attention module may also send the hot word matching coefficient vector to the classifier module; then the classifier module specifically uses the output characteristics of the decoder module at the current decoding moment and The hot word matching coefficient vector determines the recognition result of the speech to be recognized at the current decoding moment.
  • the classification nodes of the above-mentioned classifier module may include fixed common character nodes and dynamically expandable hot word nodes. On this basis,
  • the classifier module can use the output characteristics of the decoder module at the current decoding moment to determine the probability score of each common character node and the probability score of each hot word node; according to the probability score of each common character node and each The probability score of the hot word node determines the recognition result of the speech to be recognized at the current decoding moment.
  • the classifier module may use the output features of the decoder module at the current decoding moment to determine the probability score of each of the common character nodes;
  • the classifier module uses the hot word matching coefficient vector to determine the probability score of each hot word node;
  • according to the probability score of each common character node and the probability score of each hot word node, the recognition result of the speech to be recognized at the current decoding moment is determined.
  • the device of the present application may further include a model training unit for:
  • acquiring voice training data annotated with recognized text; acquiring the audio features of the voice training data; randomly selecting annotated segments from the annotated text of the voice training data as training hot words; and using the training hot words, the audio features and the recognized text of the voice training data to train a voice recognition model.
  • the process of acquiring the audio features of the voice to be recognized by the data acquiring unit may include:
  • acquiring any one of the following audio features of the voice to be recognized: the filter bank (FilterBank) feature, the Mel-frequency cepstral coefficient (MFCC) feature, or the perceptual linear prediction (PLP) feature.
  • FIG. 7 shows a schematic structural diagram of the electronic device.
  • the electronic device may include: at least one processor 1001, at least one communication interface 1002, at least one memory 1003, and at least one communication bus 1004;
  • the number of the processor 1001, the communication interface 1002, the memory 1003, and the communication bus 1004 is at least one, and the processor 1001, the communication interface 1002, and the memory 1003 communicate with each other through the communication bus 1004;
  • the processor 1001 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
  • the memory 1003 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory;
  • the memory stores a program, and the processor can call the program stored in the memory, the program being used for:
  • acquiring the speech to be recognized and the configured hot word library; determining, based on the speech to be recognized and the hot word library, the audio-related features required at the current decoding moment; determining, based on the audio-related features, the hot-word-related features required at the current decoding moment from the hot word library; and determining, based on the audio-related features and the hot-word-related features, the recognition result of the speech to be recognized at the current decoding moment.
  • the embodiments of the present application also provide a readable storage medium, the readable storage medium may store a program suitable for execution by a processor, the program being used for:
  • acquiring the speech to be recognized and the configured hot word library; determining, based on the speech to be recognized and the hot word library, the audio-related features required at the current decoding moment; determining, based on the audio-related features, the hot-word-related features required at the current decoding moment from the hot word library; and determining, based on the audio-related features and the hot-word-related features, the recognition result of the speech to be recognized at the current decoding moment.
  • the embodiments of the present application also provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute any one of the foregoing voice recognition methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

一种语音识别方法、装置、设备及存储介质,配置有热词库,在对待识别语音进行识别过程中,基于待识别语音及热词库,确定当前解码时刻所需的音频相关特征,由于音频相关特征确定过程利用了热词信息,如果当前解码时刻的语音片段中包含某个热词,则确定的音频相关特征中能够包含该热词对应的完整音频信息,进一步基于该音频相关特征从热词库中确定当前解码时刻所需的热词相关特征,热词相关特征能够准确表示当前解码时刻的语音片段是否包含热词以及具体包含哪个热词,最终基于音频相关特征和热词相关特征,确定待识别语音在当前解码时刻的识别结果,该识别结果对热词的识别更加准确。

Description

一种语音识别方法、装置、设备及存储介质 技术领域
本申请要求于2020年05月18日提交至中国国家知识产权局、申请号为202010418728.1、发明名称为“一种语音识别方法、装置、设备及存储介质”的专利申请的优先权,其全部内容通过引用结合在本申请中。
背景技术
语音识别即对输入的语音数据进行识别,以得到语音对应的识别文本内容。随着深度学习序列建模的发展,端到端建模方法成为语音识别领域的研究热点。
如图1示例的现有基于注意力机制的端到端语音识别框架,能够对输入语音进行编码,并基于注意力机制对编码音频进行处理,经过解码、分类得到输入语音对应的识别文本。这种语音识别方法对训练数据的需求量很大,导致训练后的模型出现过度自信(over-confidence)的问题,表现在模型上就是计算出的后验概率得分很尖锐,也即对于高频词的识别效果很好,得分很高;但是对于低频词的识别效果很差,得分很低。对于一些热词如,专业名词、专业术语、日常社会活动中产生的实时热点词汇,相对于模型来说就属于低频词,对于此类热词模型的识别效果很差。
发明内容
有鉴于此,本申请提供了一种语音识别方法、装置、设备及存储介质,用以解决现有语音识别方案对热词识别效果差的问题,其技术方案如下:
在本申请的第一方面中,提供了一种语音识别方法,包括:
获取待识别语音以及配置的热词库;
基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
优选地,所述基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征,包括:
获取当前解码时刻之前的已解码结果信息;
基于所述已解码结果信息及所述热词库，从所述待识别语音中确定当前解码时刻所需的音频相关特征。
优选地,所述基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果的过程,包括:
利用预先训练的语音识别模型处理所述待识别语音及所述热词库,得到语音识别模型输出的待识别语音的识别结果,其中:
所述语音识别模型具备接收并处理待识别语音及热词库,以输出待识别语音的识别结果的能力。
优选地,所述语音识别模型包括音频编码器模块、热词编码器模块、联合注意力模块、解码器模块及分类器模块;
所述音频编码器模块对所述待识别语音进行编码,得到音频编码结果;所述热词编码器模块对所述热词库中各热词进行编码,得到热词编码结果;
所述联合注意力模块接收并处理所述音频编码结果和所述热词编码结果,得到当前解码时刻所需的拼接特征,所述拼接特征包括音频相关特征和热词相关特征;
所述解码器模块接收并处理所述当前解码时刻所需的拼接特征,得到解码器模块当前解码时刻的输出特征;
所述分类器模块利用解码器模块当前解码时刻的输出特征,确定待识别语音在当前解码时刻的识别结果。
优选地,所述联合注意力模块包括:
第一注意力模型和第二注意力模型;
所述第一注意力模型基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量,以及所述热词编码结果,从所述音频编码结果中确定当前解码时刻所需的音频相关特征;
所述第二注意力模型基于所述音频相关特征,从所述热词编码结果中确定当前解码时刻所需的热词相关特征;
由所述音频相关特征和所述热词相关特征组合成当前解码时刻所需的拼接特征。
优选地,所述第一注意力模型基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量,以及所述热词编码结果,从所述音频编码结果中确定当前解码时刻所需的音频相关特征,包括:
将所述状态向量、所述热词编码结果作为第一注意力模型的输入,由所述第一注意力模型从所述音频编码结果中确定当前解码时刻所需的音频相关特征。
优选地,所述第二注意力模型基于所述音频相关特征,从所述热词编码结果中确定当前解码时刻所需的热词相关特征,包括:
将所述音频相关特征作为第二注意力模型的输入,由所述第二注意力模型从所述热词编码结果中确定当前解码时刻所需的热词相关特征。
优选地,所述分类器模块的分类节点包括固定的常用字符节点和可动态扩展的热词节点;
分类器模块利用解码器模块当前解码时刻的输出特征,确定待识别语音在当前解码时刻的识别结果,包括:
分类器模块利用解码器模块在当前解码时刻的输出特征,确定各所述常用字符节点的概率得分和各所述热词节点的概率得分;
根据各所述常用字符节点的概率得分和各所述热词节点的概率得分,确定待识别语音在当前解码时刻的识别结果。
优选地,所述可动态扩展的热词节点,与所述热词库中的热词一一对应。
优选地,所述获取待识别语音以及配置的热词库,包括:
获取待识别语音,并确定所述待识别语音的会话场景;
获取与所述会话场景相关的热词库。
优选地,所述获取待识别语音以及配置的热词库,包括:
获取人机交互场景下,用户所产出的语音,作为待识别语音;
获取预先配置的在人机交互场景下,由用户语音操控指令中的操作关键词组成的热词库。
优选地,还包括:
基于所述待识别语音的识别结果,确定与所述识别结果相匹配的交互响应,并输出该交互响应。
在本申请的第二方面,提供了一种语音识别装置,包括:
数据获取单元,用于获取待识别语音以及配置的热词库;
音频相关特征获取单元,用于基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
热词相关特征获取单元,用于基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
识别结果获取单元,用于基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
在本申请的第三方面,提供了一种语音识别设备,包括:存储器和处理器;
所述存储器,用于存储程序;
所述处理器,用于执行所述程序,实现如上所述的语音识别方法的各个步骤。
在本申请的第四方面,提供了一种可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时,实现如上的语音识别方法的各个步骤。
在本申请的第五方面中,提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行上述语音识别方法的各个步骤。
经由上述方案可知,本申请提供的语音识别方法,配置有热词库,也即待识别语音中可能存在的热词,进而在对待识别语音进行识别过程,基于待识别语音及热词库,确定当前解码时刻所需的音频相关特征,由于音频相关特征确定过程利用了热词信息,如果当前解码时刻的语音片段中包 含某个热词,则确定的音频相关特征中能够包含该热词对应的完整音频信息而非局部信息,进一步基于该音频相关特征从热词库中确定当前解码时刻所需的热词相关特征,由于音频相关特征中能够包含热词对应的完整的音频信息,因此确定的热词相关特征能够准确表示当前解码时刻的语音片段是否包含热词以及具体包含哪个热词,最终基于音频相关特征和热词相关特征,确定待识别语音在当前解码时刻的识别结果,该识别结果对热词的识别更加准确。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1示例的现有基于注意力机制的端到端语音识别框架;
图2示例了一种改进的基于注意力机制的端到端语音识别框架;
图3为本申请实施例提供的一种语音识别方法流程图;
图4为本申请实施例提供的另一种改进的基于注意力机制的端到端语音识别框架;
图5为本申请实施例示例的一种一层双向长短时记忆层的热词编码器对热词的编码示意图;
图6为本申请实施例提供的一种语音识别装置结构示意图;
图7为本申请实施例提供的电子设备的结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
为了解决现有语音识别方法对热词识别效果差的问题,本案发明人进行了研究,首先想到的就是对热词得分进行激励,也即对于语音识别模型输出的各候选识别字符中,属于热词的候选识别字符的得分进行激励,以达到提高热词识别率的目的。
但是，进一步研究发现，对于端到端的语音识别模型，其对于热词这种低频词的得分过低，导致热词在解码过程中很容易被裁减掉，恶劣的情况下甚至没有被激励的机会，从而无法真正实现对热词提升识别度的目的。
为此,发明人进一步提出了一种在模型层面提高热词得分概率的方案,通过修改语音识别模型的结构来实现。修改后的语音识别模型的示意框架如图2所示。
相比于现有语音识别模型,增加了热词编码器模块Bias encoder,其能够对热词进行编码。进一步,利用解码器Decoder的状态信息通过注意力机制分别对音频编码特征和热词编码特征进行操作,得到解码所需的音频相关特征和热词相关特征。进而基于音频相关特征和热词相关特征进行解码、分类得到输入语音对应的识别文本。
这种方案由于从模型结构层面即考虑了热词,因而相比于直接对模型输出的热词得分进行激励的方式,效果更好。
但是,经过发明人深入研究发现,不同热词的长度可能不一样,要准确地确定音频中是否包含热词、包含哪个热词,不同热词所需的信息是不一样的。而解码器的状态信息只包含已解码结果的历史文本和历史音频信息,单纯用一个只包含历史信息的状态信息作为注意力机制的查询项,对音频编码特征执行注意力操作所得到的音频相关特征并不一定是完整的,同时对热词编码特征执行注意力操作所得到的热词相关特征也不一定是准确的,进而导致最终的热词识别准确度也不是特别高。
为此,发明人进一步提出了另一种改进方案,来解决上述问题。接下来,对本案提出的语音识别方法进行详细介绍。
可以理解的是,本案的语音识别方法可以应用于任何需要进行语音识别的场景下。语音识别方法可以基于电子设备实现,如通过手机、翻译机、电脑、服务器等具备数据处理能力的设备。
接下来,结合附图3示例的流程,对本案的语音识别方法进行介绍,详细包括如下步骤:
步骤S100、获取待识别语音以及配置的热词库。
具体的,对于本次语音识别任务所需要识别的语音作为待识别语音。在语音识别之前,可以获取配置的热词库,热词库中存储有多个热词。可以理解的是,热词库可以是由与语音识别任务相关的热词组成,如将待识别语音中所有可能存在的热词,如专业术语等组成热词库。
此外,还可以直接调用已有的一些热词库作为本实施例中配置的热词库。步骤S110、基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征。
具体的,为了提高语音识别对热词的识别度,当待解码字符为一个潜在的热词时,需要获取该潜在的热词的完整音频信息。因此,本步骤中,为了确保得到的当前解码时刻所需的音频相关特征中包含有潜在热词的完整音频信息,将热词库考虑进来,也即让热词库参与到确定音频相关特征的计算过程中,起到检测当前解码时刻的待解码字符是否为热词的功能。
最终得到的音频相关特征能够包含当前待解码字符的完整音频信息。
步骤S120、基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征。
上一步骤中已经确定了当前解码时刻所需的音频相关特征,因此可以基于该音频相关特征,从热词库中确定出当前解码时刻所需的热词相关特征,该热词相关特征表示了当前解码时刻所可能出现的热词内容。
可以理解的是,由于音频相关特征能够包含当前待解码字符的完整音频信息,基于此从热词库中确定的当前解码时刻所需的热词相关特征,其更加能够适应热词长度不一的情况。
步骤S130、基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
在得到当前解码时刻所需的音频相关特征和热词相关特征之后,可以基于二者对当前时刻的待解码字符进行解码识别,从而确定待识别语音在当前解码时刻的识别结果。
本申请实施例提供的语音识别方法,配置有热词库,也即待识别语音中可能存在的热词,进而在对待识别语音进行识别过程,基于待识别语音及热词库,确定当前解码时刻所需的音频相关特征,由于音频相关特征确定过程利用了热词信息,如果当前解码时刻的语音片段中包含某个热词,则确定的音频相关特征中能够包含该热词对应的完整音频信息而非局部信息,进一步基于该音频相关特征从热词库中确定当前解码时刻所需的热词相关特征,由于音频相关特征中能够包含热词对应的完整的音频信息,因此确定的热词相关特征能够准确表示当前解码时刻的语音片段是否包含热词以及具体包含哪个热词,最终基于音频相关特征和热词相关特征,确定待识别语音在当前解码时刻的识别结果,该识别结果对热词的识别更加准确。
本申请实施例介绍了上述步骤S100,获取待识别语音以及配置的热词库的实现方式。
可选的,在获取到待识别语音后,可以确定待识别语音的会话场景。进一步,可以获取与该会话场景相关的热词库,作为本案中配置的热词库。
可以理解的是,不同的会话场景下产出的待识别语音中,包含的热词也可能不同,因此本申请可以预先确定与各会话场景相对应的热词库,进而在确定了待识别语音的会话场景后,获取对应的热词库。
另一种可选的方式,当本申请方案应用于人机交互场景下对语音进行识别时:
可以理解的是,在人机交互场景下,用户与机器进行交互,会涉及到用户语音操控指令,也即用户向机器下发语音操控指令,以实现预定的目的。例如,用户语音控制智能电视实现切换台、增减音量等相关操作,再比如,用户语音控制智能机器人实现播放歌曲、查询天气、执行预定动作等。
在此基础上,若要使得机器能够正确响应用户,则需要机器能够准确识别语音操控指令。为此,本申请可以将用户语音操控指令中的操作关键词组成热词库。
基于此,本申请实施例可以获取人机交互场景下,用户所产出的语音, 作为待识别语音,并获取预先配置的在人机交互场景下,由用户语音操控指令中的操作关键词组成的热词库。
在上述基础上,按照本申请方案确定了待识别语音的识别结果后,可以基于该识别结果,确定与该识别结果相匹配的交互响应,并输出该交互响应。
按照本实施例介绍的方案,能够准确识别人机交互过程中用户的操控指令,从而使得机器能够基于正确的识别结果做出匹配的交互响应。
本申请的另一个实施例中,对上述步骤S110,基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征。
具体的,待识别语音的各帧语音之间存在上下文关系,为了确定当前解码时刻所需的音频相关特征,本实施例中可以先获取当前解码时刻之前的已解码结果信息。已解码结果信息可以包含已解码字符的文本信息、音频信息。
此外,为了提高语音识别对热词的识别度,当待解码字符为一个潜在的热词时,需要获取该潜在的热词的完整音频信息。因此,本步骤中,为了确保得到的当前解码时刻所需的音频相关特征中包含有潜在热词的完整音频信息,将热词库考虑进来,也即让热词库参与到确定音频相关特征的计算过程中,起到检测当前解码时刻的待解码字符是否为热词的功能。在此基础上,后续可以基于该音频相关特征,从热词库中确定当前解码时刻所需的热词相关特征,其更加能够适应热词长度不一的情况。
由上可知,步骤S110确定音频相关特征的过程可以包括:
S1、获取当前解码时刻之前的已解码结果信息。
S2、基于已解码结果信息及热词库,从待识别语音中确定当前解码时刻所需的音频相关特征。
具体的,本实施例中可以先确定待识别语音的音频特征,其音频特征可以选用滤波器组FilterBank特征、梅尔频率倒谱系数MFCC特征、感知线性预测PLP特征等。进一步的,基于已解码结果信息及热词库,从待识别语音的音频特征中确定当前解码时刻所需的音频相关特征。
其中,音频相关特征携带有当前解码时刻的待解码字符的完整音频信息。在此基础上,才能够为后续准确识别热词提供足够的音频相关特征。
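To make the feature options above concrete, the following Python sketch extracts either FilterBank (log-Mel) or MFCC features with librosa; the 16 kHz sampling rate and the 25 ms / 10 ms framing are common defaults assumed for illustration, not values prescribed by this application, and PLP is omitted because librosa does not provide it.

```python
import librosa
import numpy as np

def extract_features(wav_path, feature_type="fbank", n_mels=80, n_mfcc=13):
    """Compute FilterBank (log-Mel) or MFCC features for one utterance.

    Returns an array of shape (frames, dims), i.e. X = [x_1, ..., x_k]."""
    y, sr = librosa.load(wav_path, sr=16000)           # assumed sampling rate
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)       # 25 ms window, 10 ms hop (assumed)
    if feature_type == "fbank":
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        feats = librosa.power_to_db(mel)
    elif feature_type == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                     n_fft=n_fft, hop_length=hop)
    else:
        raise ValueError("unsupported feature type: " + feature_type)
    return feats.T.astype(np.float32)
```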
在本申请的另一个实施例中,介绍了上述实施例中步骤S110-S130,确定当前解码时刻所需的音频相关特征、热词相关特征,并基于此确定待识别语音在当前解码时刻的识别结果的一种可选实现方式。
具体的,可以通过语音识别模型来实现。
当然,本实施例中提供的语音识别模型和传统的语音识别模型不同,本实施例中的语音识别模型被配置为,具备接收并处理待识别语音及热词库,以输出待识别语音的识别结果的能力。
具体的,语音识别模型可以具备基于当前解码时刻之前的已解码结果信息及热词库,从待识别语音中确定当前解码时刻所需的音频相关特征,并基于音频相关特征从热词库中确定当前解码时刻所需的热词相关特征,基于所述音频相关特征和所述热词相关特征,确定待识别语音在当前解码时刻的识别结果的能力。
在此基础上,本实施例中可以将利用语音识别模型处理前述步骤S100中获取的待识别语音及热词库,则语音识别模型即可输出待识别语音的识别结果。
具体的,可以将待识别语音的音频特征及热词库输入语音识别模型,得到模型输出的待识别语音的识别结果。
接下来,结合图4,对语音识别模型的框架进行介绍。
语音识别模型可以包括音频编码器模块、热词编码器模块、联合注意力模块、解码器模块及分类器模块。由各个模块配合实现对接收的热词库中各热词和待识别语音的音频特征进行处理,并最终输出识别结果的过程。接下来分别对各个模块进行介绍。
为了便于说明，定义待识别语音的音频特征为 $X=[x_1,x_2,\ldots,x_k]$，其中，$x_k$ 表示第k帧音频特征向量，k为待识别语音的总语音帧数目。
1、音频编码器模块:
音频编码器模块对待识别语音进行编码,得到音频编码结果。
具体的,音频编码器模块可以对待识别语音的音频特征X进行编码,得到编码后由每帧语音的音频特征向量组成的音频特征向量序列。
其中，编码后得到的音频特征向量序列表示为 $H^x=[h^x_1,h^x_2,\ldots,h^x_k]$，其中，$h^x_k$ 表示第k帧音频特征向量，即 $x_k$ 经过音频编码器模块编码后的结果。
音频编码器模块可以包含为一层或多层编码层,编码层可以采用单向或双向长短时记忆神经网络中长短时记忆层或卷积神经网络的卷积层,具体采用哪种结构可以根据应用需求确定。如对于有实时性要求的语音识别可以使用3~5层的单向长短时记忆层,对于没有实时性要求的语音识别可以使用3~5层的双向长短时记忆层。其中,实时性要求指的是边说边识别,而不是等语音说完了才一次性出识别结果。
本实施例中可以选择使用5层单向长短时记忆层对输入的音频特征 $X=[x_1,x_2,\ldots,x_k]$ 进行处理，输出一组编码后的音频特征向量序列 $H^x=[h^x_1,h^x_2,\ldots,h^x_k]$。
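As a rough illustration of the audio encoder described above, the following PyTorch sketch stacks five unidirectional LSTM layers; the feature dimension and hidden size are assumptions for illustration only, not values prescribed by this application.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Five stacked unidirectional LSTM layers turning the input features
    X = [x_1, ..., x_k] into the encoded sequence H^x = [h^x_1, ..., h^x_k]."""

    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)

    def forward(self, x):              # x: (batch, frames, feat_dim)
        h_x, _ = self.lstm(x)          # h_x: (batch, frames, hidden_dim)
        return h_x
```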
2、热词编码器模块:
热词编码器模块对热词库中各热词进行编码,得到热词编码结果。
具体的,热词编码器模块可以对热词库中每个热词进行独立编码,得到由各热词独立编码后的各热词特征向量组成的热词特征向量序列。
定义热词库中共有N+1个热词 $Z=[z_0,z_1,\ldots,z_N]$，其中，$z_N$ 为第N个热词，$z_0$ 是一个特殊的热词“<no-bias>”，其表示不存在热词，当解码过程选中的热词为 $z_0$ 时，表示当前解码时刻的待解码字符不是任何一个热词，用于处理正在识别的语音片段不是热词的情况。
热词的总个数为N+1，则热词编码器对每个热词独立编码，得到的热词特征向量序列表示为 $H^z=[h^z_0,h^z_1,\ldots,h^z_N]$，其中，$h^z_N$ 为第N个热词经过热词编码器模块独立编码之后的热词特征向量。
可以理解的是,不同热词包含的字符数可能不一样,比如定义“中科大”和“科大讯飞”都是热词,则包含的字符数分别为3和4。
为了便于模型处理,本实施例中可以将变长的热词,统一编码成相同维度的向量。具体的,热词编码器模块可以按照设定的维度,将各个热词分别独立编码成相同维度的热词特征向量。
热词编码器模块可以包含为一层或多层编码层,编码层可以采用单向或双向长短时记忆神经网络中长短时记忆层或卷积神经网络的卷积层。通常来讲,能同时看到左右全部信息的双向长短时记忆层对热词的编码效果好于单向长短时记忆层,如选择使用一层双向长短时记忆层,以热词“科大讯飞”为例,该热词由“科”、“大”、“讯”、“飞”四个字组成,一层双向长短时记忆层的热词编码器模块对它编码的过程如图5所示:
图5中的左边为双向长短时记忆层的正向部分，右边为反向部分，将正向和反向最后一步的输出向量 $\overrightarrow{h}$ 和 $\overleftarrow{h}$ 进行拼接，得到的向量 $h^z$ 即为热词的编码向量表示。
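The following PyTorch sketch mirrors the hot word encoder shown in FIG. 5 under stated assumptions: each hot word is embedded character by character, passed through one bidirectional LSTM, and the last forward and last backward outputs are concatenated into one fixed-dimension vector per hot word. The embedding layer, padding scheme and dimensions are assumptions, since the application does not specify how characters are vectorized.

```python
import torch
import torch.nn as nn

class HotwordEncoder(nn.Module):
    """One bidirectional LSTM; the last forward output and the last backward
    output are concatenated into a single fixed-size vector h^z per hot word,
    independent of the hot word length."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, hotword_ids, lengths):
        # hotword_ids: (N+1, max_len) padded character ids; lengths: (N+1,)
        packed = nn.utils.rnn.pack_padded_sequence(
            self.emb(hotword_ids), lengths.cpu(),
            batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)              # h_n: (2, N+1, hidden_dim)
        h_z = torch.cat([h_n[0], h_n[1]], dim=-1)    # concat forward/backward
        return h_z                                   # (N+1, 2 * hidden_dim)
```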
3、联合注意力模块:
联合注意力模块接收并处理音频编码结果和热词编码结果,得到当前解码时刻所需的拼接特征,该拼接特征包括音频相关特征和热词相关特征。
本实施例中介绍了联合注意力模块的一种可选架构,如图4所示,联合注意力模块可以包括:
第一注意力模型和第二注意力模型。其中:
第一注意力模型可以基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量,以及热词编码结果,从音频编码结果中确定当前解码时刻所需的音频相关特征。
具体的,可以将状态向量、热词编码结果作为第一注意力模型的输入, 由第一注意力模型从音频编码结果中确定当前解码时刻所需的音频相关特征。
第二注意力模型可以基于音频相关特征,从热词编码结果中确定当前解码时刻所需的热词相关特征。
具体的,可以将音频相关特征作为第二注意力模型的输入,由第二注意力模型从热词编码结果中确定当前解码时刻所需的热词相关特征。
最后,由所述音频相关特征和所述热词相关特征组合成当前解码时刻所需的拼接特征。
由上可知,解码器模块在当前解码时刻输出的状态向量能够表示已解码结果信息,因此可以基于该状态向量和热词编码结果,对音频编码结果进行注意力机制操作,确定当前解码时刻所需的音频相关特征。也即,本实施例中第一注意力模型使用了音频、热词联合注意力机制,让热词参与到音频相关特征的计算中,由于利用了热词信息,如果当前解码时刻的待解码字符为某个热词,则音频相关特征可以抽取到该热词对应的完整音频信息。
进一步,再利用音频相关特征对热词编码结果进行注意力机制操作,确定当前解码时刻所需的热词相关特征,因为音频相关特征包含了热词的完整音频信息,以此得到的热词相关特征也更加准确。
其中,注意力机制使用一个向量作为查询项(query),对一组特征向量序列进行注意力机制操作,选出与查询项最匹配的特征向量作为输出,具体为:将查询项与特征向量序列中每个特征向量计算一个匹配系数,然后将这些匹配系数与对应的特征向量相乘并求和,得到一个新的特征向量即为与查询项最匹配的特征向量。
定义当前时刻为t时刻，解码器模块在t时刻输出的状态向量为 $d_t$，则第一注意力模型基于状态向量 $d_t$ 和热词特征向量序列 $H^z$，从音频特征向量序列 $H^x$ 中确定当前解码时刻所需的音频相关特征 $c^x_t$；第二注意力模型以 $c^x_t$ 为查询项，对热词特征向量序列 $H^z$ 执行注意力机制操作，确定当前解码时刻所需的热词相关特征 $c^z_t$。
接下来,对第一注意力模型的实施方式进行详细说明:
首先，第一注意力模型分别以热词特征向量序列 $H^z$ 中的每一热词特征向量 $h^z_i$ 与状态向量 $d_t$ 的组合为查询项，对音频特征向量序列 $H^x$ 中每一个音频特征向量 $h^x_j$ 进行注意力机制操作，得到匹配系数矩阵 $E_t$，所述匹配系数矩阵 $E_t$ 中包含任一热词与任一帧语音的匹配度，$e^t_{i,j}$ 表示第i个热词与第j帧语音的匹配度，也就是第j帧语音是第i个热词的可能性。
其中，$e^t_{i,j}$ 的计算过程参照下式：
$e^t_{i,j}=\left\langle W_d d_t + W_z h^z_i,\ W_x h^x_j \right\rangle$
其中，$W_d$、$W_z$、$W_x$ 为模型参数，$W_d\in R^{D\times D_d}$、$W_z\in R^{D\times D_z}$、$W_x\in R^{D\times D_x}$，$D_d$、$D_z$、$D_x$ 分别表示向量 $d_t$、$h^z_i$、$h^x_j$ 的维度，三个矩阵的行数相同，均为D，操作符 $\langle\cdot,\cdot\rangle$ 表示求向量内积。
元素 $e^t_{i,j}$ 组成热词与语音帧的匹配系数矩阵 $E_t$，$E_t\in R^{K\times(N+1)}$。其中 $e^t_{i,j}$ 表示 $E_t$ 中第i行第j列的元素，$E_t$ 中的列向量表示某个热词与音频特征向量序列间的匹配度，$E_t$ 中的行向量表示某帧音频特征向量与热词特征向量序列的匹配度。
进一步，第一注意力模型根据上述匹配系数矩阵 $E_t$，从音频特征向量序列 $H^x$ 中确定当前解码时刻所需的音频相关特征 $c^x_t$。
具体的,该过程可以包括如下步骤:
S1、根据匹配系数矩阵 $E_t$，确定每个热词作为当前解码时刻的待解码字符的概率 $w_t$；
$E_t$ 中第i行第j列的元素表示第j帧音频是第i个热词的可能性，那么将 $E_t$ 中的每一行进行softmax归一化，然后再把所有行向量相加求平均，得到一个N+1维的行向量，记为 $w_t=[w^t_0,w^t_1,\ldots,w^t_N]$，其中，$w^t_i$ 表示当前解码时刻t所要解码的字符为第i个热词的可能性。也即，确定出当前解码时刻t，语音中最有可能出现的是哪个热词。
S2、根据匹配系数矩阵 $E_t$ 及每个热词作为当前解码时刻的待解码字符的概率 $w_t$，确定每一帧语音作为当前解码时刻所需的语音内容的概率 $a_t$；
具体的，将 $E_t$ 中的每一列进行softmax归一化，得到列向量归一化后的矩阵 $A_t$，然后将 $w_t$ 中的元素作为矩阵 $A_t$ 中列向量的加权系数，将矩阵 $A_t$ 中所有的列向量加权求和，得到一个K维的行向量，记为 $a_t=[a^t_1,a^t_2,\ldots,a^t_K]$，其中，$a^t_j$ 表示第j帧音频特征是当前解码时刻t解码所需的语音内容的可能性。
S3、以每一帧语音作为当前解码时刻所需的语音内容的概率 $a_t$ 作为加权系数，将音频特征向量序列 $H^x$ 中各帧语音的音频特征向量加权求和，得到当前解码时刻所需的音频相关特征 $c^x_t$。
具体的，以 $a_t$ 中的元素作为音频特征向量序列 $H^x$ 中对应位置音频特征向量的加权系数，将音频特征向量加权求和，得到音频相关特征向量 $c^x_t=\sum_{j=1}^{K}a^t_j h^x_j$。
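Steps S1 to S3 above can be sketched in Python/NumPy as follows; the projection to a shared dimension and the inner-product scoring follow the formula reconstructed above, and all array names and shapes are assumptions for illustration rather than the exact patented implementation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def first_attention(d_t, H_z, H_x, W_d, W_z, W_x):
    """d_t: (Dd,) decoder state; H_z: (N+1, Dz) hot word vectors;
    H_x: (K, Dx) audio vectors; W_*: projections to a shared dimension D."""
    q = H_z @ W_z.T + d_t @ W_d.T            # (N+1, D) one query per hot word
    k = H_x @ W_x.T                          # (K, D)   one key per frame
    E_t = k @ q.T                            # (K, N+1) frame-vs-hot-word scores

    # S1: row-wise softmax over hot words, averaged over frames -> w_t
    w_t = softmax(E_t, axis=1).mean(axis=0)  # (N+1,)
    # S2: column-wise softmax over frames, weighted by w_t -> a_t
    a_t = softmax(E_t, axis=0) @ w_t         # (K,)
    # S3: weighted sum of the audio vectors -> audio-related feature c^x_t
    c_x = a_t @ H_x                          # (Dx,)
    return c_x, w_t, E_t
```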
再进一步的,对第二注意力模型的实施方式进行详细说明:
第二注意力模型根据上述音频相关特征 $c^x_t$，从热词特征向量序列 $H^z$ 中确定当前解码时刻所需的热词相关特征 $c^z_t$。
具体的,该过程可以包括如下步骤:
S1、第二注意力模型以音频相关特征 $c^x_t$ 作为查询项，对热词特征向量序列 $H^z$ 进行注意力机制操作，得到热词匹配系数向量 $b_t$，热词匹配系数向量 $b_t$ 中包含每一热词作为当前解码时刻的待解码字符的概率。$b_t$ 记为 $b_t=[b^t_0,b^t_1,\ldots,b^t_N]$，其中，$b^t_i$ 表示第i个热词作为当前解码时刻的解码字符的概率。
具体的，将 $c^x_t$ 与每一个热词特征向量通过一个小型神经网络计算得到一个匹配系数，然后将这些匹配系数进行softmax归一化，得到 $b_t$。
S2、以每一热词作为当前解码时刻的待解码字符的概率 $b^t_i$ 作为加权系数，将热词特征向量序列 $H^z$ 中每个热词的热词特征向量加权求和，得到当前解码时刻所需的热词相关特征 $c^z_t=\sum_{i=0}^{N}b^t_i h^z_i$。
由于 $c^x_t$ 包含了潜在的热词的完整音频信息，而非热词的局部信息，基于此确定的热词相关特征 $c^z_t$ 也更加准确。
$c^x_t$ 与 $c^z_t$ 确定之后需要进行拼接，得到当前解码时刻所需的拼接特征 $c_t$，将拼接特征 $c_t$ 送入解码器模块。
进一步的，还可以将上述确定的当前解码时刻的热词匹配系数向量 $b_t$ 送入分类器模块供对热词分类使用。
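A minimal sketch of the second attention steps S1 and S2 above, assuming the unspecified "small neural network" scorer is a simple projected dot product (the projection matrices P_x and P_z are illustrative assumptions):

```python
import numpy as np

def second_attention(c_x, H_z, P_x, P_z):
    """c_x: (Dx,) audio-related feature; H_z: (N+1, Dz) hot word vectors.
    P_x, P_z project both sides to a shared size before a dot product,
    standing in for the small scoring network."""
    scores = (P_z @ H_z.T).T @ (P_x @ c_x)          # (N+1,) match scores
    e = np.exp(scores - scores.max())
    b_t = e / e.sum()                               # hot word matching coefficients
    c_z = b_t @ H_z                                 # (Dz,) hot-word-related feature
    c_t = np.concatenate([c_x, c_z])                # spliced feature sent to the decoder
    return c_t, b_t
```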
4、解码器模块:
解码器模块接收并处理联合注意力模块输出的当前解码时刻所需的拼接特征,得到解码器模块当前解码时刻的输出特征。
具体的，解码器模块可以利用当前解码时刻t的前一解码时刻t-1所需的拼接特征 $c_{t-1}$ 及前一解码时刻t-1的识别结果字符，计算得到当前解码时刻t的状态向量 $d_t$；
其中，$d_t$ 有两个作用，其一是发送给联合注意力模块，以供联合注意力模块执行上述实施例介绍的操作过程，计算得到当前解码时刻的 $c_t$；
其二，解码器模块利用当前解码时刻的状态向量 $d_t$ 和当前解码时刻所需的拼接特征 $c_t$，计算得到解码器模块在当前解码时刻的输出特征 $h^d_t$。
需要说明的是，解码器模块可以包含多个神经网络层，本申请中可以选用两层单向长短时记忆层。在解码当前时刻t的待解码字符时，第一层长短时记忆层以t-1时刻的识别结果字符和注意力模块输出的拼接特征 $c_{t-1}$ 为输入，计算得到解码器模块当前解码时刻的状态向量 $d_t$。解码器模块将 $d_t$ 与 $c_t$ 作为第二层长短时记忆层的输入，计算得到解码器模块的输出特征 $h^d_t$。
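A PyTorch sketch of the two-layer unidirectional LSTM decoder step described above; the symbol $h^d_t$ for the output feature and all dimensions are assumptions, and in the full model the joint attention module would compute $c_t$ from $d_t$ between the two layer calls.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Two unidirectional LSTM layers: layer 1 maps the previous output
    character and c_{t-1} to the state vector d_t; layer 2 maps d_t and c_t
    to the decoder output feature used by the classifier."""

    def __init__(self, vocab_size, emb_dim=256, ctx_dim=1024, hidden_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.layer1 = nn.LSTMCell(emb_dim + ctx_dim, hidden_dim)
        self.layer2 = nn.LSTMCell(hidden_dim + ctx_dim, hidden_dim)

    def state_step(self, prev_char, c_prev, state1=None):
        x1 = torch.cat([self.emb(prev_char), c_prev], dim=-1)
        h1, c1 = self.layer1(x1, state1)
        return h1, (h1, c1)                 # h1 is the state vector d_t

    def output_step(self, d_t, c_t, state2=None):
        x2 = torch.cat([d_t, c_t], dim=-1)
        h2, c2 = self.layer2(x2, state2)
        return h2, (h2, c2)                 # h2 is the decoder output feature
```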
5、分类器模块:
分类器模块利用解码器模块当前解码时刻的输出特征,确定待识别语音在当前解码时刻的识别结果。
具体的，分类器模块可以利用解码器模块在当前解码时刻的输出特征 $h^d_t$，确定待识别语音在当前解码时刻的识别结果。
由上可知，输出特征 $h^d_t$ 是根据解码器模块的状态向量 $d_t$ 和当前解码时刻所需的拼接特征 $c_t$ 联合确定的，而拼接特征 $c_t$ 中的 $c^x_t$ 包含了潜在的热词的完整音频信息，而非热词的局部信息，基于此确定的热词相关特征 $c^z_t$ 也更加准确，由此可以确定最终得到的输出特征 $h^d_t$ 也会更加准确，进而基于此确定的识别结果也更加准确，并且能够提高热词的识别准确率。
本申请的一个实施例中，提供了分类器模块的两种实现方式，一种是采用现有的常规静态分类器，该静态分类器中分类节点的数目始终维持不变，包含有常见的字符。分类器模块可以根据输出特征 $h^d_t$ 来确定各个分类节点字符的得分概率，进而组合成最终的识别结果。
但是,这种常规静态分类器将热词表示为常用字符的组合,对热词进行逐字符解码,很容易导致非热词片段的热词误触发。例如,对于待识别语音内容为“这个模型训飞了”的语音数据,假设热词为“科大讯飞”,则使用静态分类器的识别结果可能是“这个模型讯飞了”。因为“训飞”和热词“科大讯飞”中的“讯飞”二字发音一样,由于静态分类器对热词进行逐字解码、逐字激励,每个单字都存在被激励的可能,很可能将语音片段中与热词存在部分发音匹配的内容错误得激励为热词的一部分,也就是将“训飞”中的“训”错误识别为热词“科大讯飞”中的“讯”。
为此,本申请提供了一种分类器模块的新结构,分类器模块的分类节点既包含固定的常用字符节点,还包含可动态扩展的热词节点,从而实现直接对热词进行整词识别的目的,不需要像现有技术那样将热词拆分,单个字符逐字符进行识别、激励。仍以上述示例的例子说明,对于语音数据“这个模型训飞了”,由于“训飞”只是与热词“科大讯飞”中部分字符的读音相同,其与热词“科大讯飞”这个整词的匹配程度远远不高,因此不会发生整个热词误识别的问题。而当语音数据中包含某个热词时,按照本实施例的分类器模块,由于分类节点中包含了热词这个整词,因此可以直接识别出热词这个整词,很好的提升了热词识别效果。
本实施例的分类器模块中的热词节点个数可以根据场景而动态调整，如当前场景对应的热词库中有N个热词，则可以设置相同个数的N个热词节点。以中文语音识别为例，以汉字为建模单元，假设常用汉字个数为V个，则分类器模块的固定常用字符节点的个数为V，当热词库中共存在N个热词时，则分类器模块的热词节点的个数可以为N，也即分类器模块的所有分类节点的数目为V+N。
基于上述这种新型结构的分类器模块，分类器模块进行语音识别的过程可以包括:
分类器模块利用解码器模块在当前解码时刻的输出特征 $h^d_t$，确定各常用字符节点的概率得分和各热词节点的概率得分；进而确定最终的识别结果。
一种可选的方式下，分类器模块可以利用解码器模块在当前解码时刻的输出特征 $h^d_t$，分别确定各常用字符节点的概率得分，以及确定各热词节点的概率得分。
另一种可选的方式下，分类器模块可以利用解码器模块在当前解码时刻的输出特征 $h^d_t$，确定各常用字符节点的概率得分。进一步，利用前述实施例介绍的热词匹配系数向量 $b_t$，确定各热词节点的概率得分。
可以理解的是，对于分类器模块中的固定常用字符节点，其概率得分可以通过静态分类器来确定。具体的，静态分类器利用解码器模块在当前解码时刻的输出特征 $h^d_t$，确定各常用字符节点的概率得分。
静态分类器输出一个V维的概率分布，记为:
$P_v(y_t)=\mathrm{softmax}(W h^d_t)$
其中，$y_t$ 表示当前解码时刻t的待解码字符，矩阵W为静态分类器的模型参数，假设解码器模块的输出特征 $h^d_t$ 的维度为M，则W为一个V*M的矩阵，$P_v(y_t)$ 中的元素表示对应常用字符节点的常用字符的概率得分。
对于分类器模块中的动态可扩展热词节点，其概率得分可以通过热词分类器来确定。具体的，热词分类器可以利用热词匹配系数向量 $b_t$ 来确定各热词节点的概率得分。
前述已经介绍过，热词匹配系数向量 $b_t=[b^t_0,b^t_1,\ldots,b^t_N]$ 中包含每一热词作为当前解码时刻的待解码字符的概率，因此可以用该概率，作为对应热词节点的概率得分。其中，$b^t_i$ 表示第i个热词作为当前解码时刻的解码字符的概率，可以将此作为第i个热词节点的概率得分。第0个热词为“<no-bias>”表示“不是热词”，当i等于0时，$b^t_0$ 表示解码结果“不是热词”的概率得分。
在确定了常用字符节点和热词节点的概率得分之后,可以根据两种类型节点的概率得分,确定待识别语音在当前解码时刻的识别结果。
可以理解的是，由于同时存在静态分类器和热词分类器两个分类器，因此分类器模块还可以增加一个判断器，用于决定到底使用哪个分类器的结果作为最终结果，该判断器输出一个标量的概率值 $P^t_b$，表示当前解码时刻t使用热词分类器/静态分类器的结果作为最终输出结果的概率得分。
以 $P^t_b$ 表示当前解码时刻t使用热词分类器的结果作为最终输出结果的概率得分为例进行说明，$P^t_b$ 可以表示为:
$P^t_b=\mathrm{sigmoid}(w_b^{\top} h^d_t)$
其中，$w_b$ 为模型参数，是一个与 $h^d_t$ 维度相同的权重向量，sigmoid为神经网络激活函数。
则判断器可以根据两个分类器输出的概率得分，确定待识别语音在当前解码时刻的识别结果，具体可以包括:
对于N个热词中的第i个热词节点(i的取值范围为[1,N])，在静态分类器输出的概率分布中它的得分为0，在热词分类器中它的概率得分为 $b^t_i$，因此最终它的概率得分为 $P^t_b\cdot b^t_i$；
对于V个常用字符 $y_t$，在静态分类器输出的概率分布中它的得分为 $P_v(y_t)$，在热词分类器中它的概率得分为 $b^t_0$，因此最终它的概率得分为 $(1-P^t_b)\cdot P_v(y_t)$。
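Putting the static classifier, the hot word scores and the judger together, a rough NumPy sketch might look as follows; the sigmoid gate and the way the two score sets are mixed follow the reading given above and should be treated as assumptions rather than the exact patented combination.

```python
import numpy as np

def classify_step(h_dec, b_t, W_static, w_gate):
    """h_dec: (M,) decoder output feature; b_t: (N+1,) hot word coefficients
    with b_t[0] = <no-bias>; W_static: (V, M) static classifier weights;
    w_gate: (M,) judger weight vector."""
    logits = W_static @ h_dec
    p_static = np.exp(logits - logits.max())
    p_static /= p_static.sum()                       # (V,) common character scores

    p_bias = 1.0 / (1.0 + np.exp(-w_gate @ h_dec))   # P_b: trust the hot word classifier

    final = np.concatenate([
        (1.0 - p_bias) * p_static,                   # V fixed common character nodes
        p_bias * b_t[1:],                            # N dynamically added hot word nodes
    ])
    return final                                     # (V + N,) scores for all nodes
```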
在本申请的又一个实施例中,介绍了一种上述语音识别模型的训练方式。
由于本申请提出的语音识别模型需要具备支持任意热词识别的能力,这就说明在模型训练中不能限定热词。因此本申请可以从训练数据的文本标注中随机挑选标注片段作为热词,参与整个模型训练。具体流程可以包括:
S1、获取标注有识别文本的语音训练数据。
其中，语音训练数据的文本标注序列可以表示为 $Y=[y_0,y_1,\ldots,y_t,\ldots,y_T]$，其中，$y_t$ 表示文本标注序列中第t个字符，T+1为识别文本的总字符数目，$y_0$ 为句子开始符“<s>”，$y_T$ 为句子结束符“</s>”。
以中文语音识别为例，并用单个汉字作为建模单元。假设某句话的文本内容为“欢迎来到科大讯飞”，共有8个汉字，加上句子开始符和结束符，文本标注序列总共有10个字符，则文本标注序列表示为:
Y=[<s>,欢,迎,来,到,科,大,讯,飞,</s>]。
S2、获取所述语音训练数据的音频特征。
其中,音频特征可以选用滤波器组FilterBank特征、梅尔频率倒谱系数MFCC特征、感知线性预测PLP特征等。
S3、从所述语音训练数据的标注文本中随机挑选标注片段作为训练热词。
具体的,本申请可以预先设置P和N两个参数,P为某句训练数据是否挑选训练热词的概率,N为挑选的训练热词的最大字数。则任意一句训练数据被挑选出选取训练热词的概率为P,从该句训练数据的文本标注序列中,挑选最多连续N个字符作为训练热词。以“欢迎来到科大讯飞”为例,从该句话中挑选训练热词的标注对比如下表所示:
原始标注：<s> 欢 迎 来 到 科 大 讯 飞 </s>
第一种标注：<s> 欢 迎 来 到 科大讯飞 <bias> </s>
第二种标注：<s> 欢 迎 来 到 科大 <bias> 讯 飞 </s>
其中，第一种标注为：当“科大讯飞”被选为训练热词后的标注；第二种标注为：当“科大”被选为训练热词后的标注。
由上可知,当原始标注中“科”、“大”、“讯”、“飞”被挑选为训练热词,则需要把这四个字合并为整词“科大讯飞”,并且它的后面添加特殊标记符“<bias>”。“<bias>”的作用是引入训练错误,强迫模型训练时更新训练热词相关的模型参数,比如热词编码器模块。当“科大讯飞”或者“科大”被选为训练热词后,需要将它加入这次模型更新的训练热词列表中, 作为热词编码器模块的输入和分类器模块的训练热词分类节点。每次模型更新时训练热词挑选工作独立进行,初始时刻训练热词列表为空。
S4、利用所述训练热词、所述音频特征及语音训练数据的识别文本,训练语音识别模型。
具体的,以训练热词和音频特征作为训练样本输入,以语音训练数据的识别文本作为样本标签,训练语音识别模型。
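A minimal sketch of the training hot word selection described above, assuming a selection probability P and a maximum hot word length of N characters; the token name "<bias>" follows the text, while the function and parameter names are illustrative.

```python
import random

BIAS_TOKEN = "<bias>"   # marker appended after a selected training hot word

def select_training_hotword(chars, pick_prob=0.5, max_len=4, rng=random):
    """chars: transcript characters, e.g. ["欢","迎","来","到","科","大","讯","飞"].
    With probability pick_prob, merge a random span of at most max_len
    characters into one whole-word label followed by <bias>."""
    if not chars or rng.random() > pick_prob:
        return list(chars), None
    span_len = rng.randint(1, min(max_len, len(chars)))
    start = rng.randint(0, len(chars) - span_len)
    hotword = "".join(chars[start:start + span_len])
    labels = chars[:start] + [hotword, BIAS_TOKEN] + chars[start + span_len:]
    return labels, hotword      # hotword is also added to this update's hot word list

# e.g. select_training_hotword(list("欢迎来到科大讯飞")) may return
# (["欢","迎","来","到","科大讯飞","<bias>"], "科大讯飞")
```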
本申请实施例还提供了一种语音识别装置,下面对本申请实施例提供的语音识别装置进行描述,下文描述的语音识别装置与上文描述的语音识别方法可相互对应参照。
请参阅图6,示出了本申请实施例提供的语音识别装置的结构示意图,该语音识别装置可以包括:
数据获取单元11,用于获取待识别语音以及配置的热词库;
音频相关特征获取单元12,用于基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
热词相关特征获取单元13,用于基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
识别结果获取单元14,用于基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
可选的,上述音频相关特征获取单元可以包括:
第一音频相关特征获取子单元,用于获取当前解码时刻之前的已解码结果信息;
第二音频相关特征获取子单元，用于基于所述已解码结果信息及所述热词库，从所述待识别语音中确定当前解码时刻所需的音频相关特征。
可选的,上述音频相关特征获取单元、热词相关特征获取单元及识别结果获取单元的实现过程可以通过语音识别模型实现,具体的,利用预先训练的语音识别模型处理所述待识别语音及所述热词库,得到语音识别模型输出的待识别语音的识别结果,其中:
所述语音识别模型具备接收并处理待识别语音及热词库，以输出待识别语音的识别结果的能力。
具体的,语音识别模型可以具备基于当前解码时刻之前的已解码结果信息及热词库,从音频特征中确定当前解码时刻所需的音频相关特征,并基于音频相关特征从热词库中确定当前解码时刻所需的热词相关特征,基于所述音频相关特征和所述热词相关特征,确定待识别语音在当前解码时刻的识别结果的能力。
可选的,语音识别模型可以包括音频编码器模块、热词编码器模块、联合注意力模块、解码器模块及分类器模块;
其中,所述音频编码器模块对所述待识别语音进行编码,得到音频编码结果。
具体的,由所述音频编码器模块对所述音频特征进行编码,得到由每帧语音的音频特征向量组成的音频特征向量序列。
所述热词编码器模块对所述热词库中各热词进行编码,得到热词编码结果。
具体的,由所述热词编码器模块对每个所述热词进行独立编码,得到由各热词独立编码后的各热词特征向量组成的热词特征向量序列。
所述联合注意力模块接收并处理所述音频编码结果和所述热词编码结果,得到当前解码时刻所需的拼接特征,所述拼接特征包括音频相关特征和热词相关特征。
所述解码器模块接收并处理所述当前解码时刻所需的拼接特征,得到解码器模块当前解码时刻的输出特征。
所述分类器模块利用解码器模块当前解码时刻的输出特征,确定待识别语音在当前解码时刻的识别结果。
其中,可选的,所述联合注意力模块可以包括:
第一注意力模型和第二注意力模型。
所述第一注意力模型基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量,以及所述热词编码结果,从所述音频编码结果中确定当前解码时刻所需的音频相关特征。
具体的,可以将所述状态向量、所述热词编码结果作为第一注意力模 型的输入,由所述第一注意力模型从所述音频编码结果中确定当前解码时刻所需的音频相关特征。
所述第二注意力模型基于所述音频相关特征,从所述热词编码结果中确定当前解码时刻所需的热词相关特征。
具体的,可以将所述音频相关特征作为第二注意力模型的输入,由所述第二注意力模型从所述热词编码结果中确定当前解码时刻所需的热词相关特征。
由所述音频相关特征和所述热词相关特征组合成当前解码时刻所需的拼接特征。
可选的,上述热词编码器模块对每个所述热词进行独立编码的过程,可以包括:
所述热词编码器模块按照设定的维度,将各个所述热词分别独立编码成相同维度的热词特征向量。
可选的,上述第一注意力模型基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量及热词特征向量序列,从所述音频特征向量序列中确定当前解码时刻所需的音频相关特征的过程,可以包括:
第一注意力模型分别以所述热词特征向量序列中的每一热词特征向量与所述状态向量的组合为查询项,对所述音频特征向量序列进行注意力机制操作,得到匹配系数矩阵,所述匹配系数矩阵中包含任一热词与任一帧语音的匹配度;
根据所述匹配系数矩阵,从所述音频特征向量序列中确定当前解码时刻所需的音频相关特征。
可选的,上述第一注意力模型根据所述匹配系数矩阵,从所述音频特征向量序列中确定当前解码时刻所需的音频相关特征的过程,可以包括:
根据所述匹配系数矩阵,确定每个热词作为当前解码时刻的待解码字符的概率;
根据所述匹配系数矩阵及每个热词作为当前解码时刻的待解码字符的概率,确定每一帧语音作为当前解码时刻所需的语音内容的概率;
以每一帧语音作为当前解码时刻所需的语音内容的概率作为加权系 数,将所述音频特征向量序列中各帧语音的音频特征向量加权求和,得到当前解码时刻所需的音频相关特征。
可选的,上述第二注意力模型基于音频相关特征从热词特征向量序列中确定当前解码时刻所需的热词相关特征的过程,可以包括:
第二注意力模型以所述音频相关特征作为查询项,对所述热词特征向量序列进行注意力机制操作,得到热词匹配系数向量,所述热词匹配系数向量中包含每一热词作为当前解码时刻的待解码字符的概率;
以每一热词作为当前解码时刻的待解码字符的概率作为加权系数,将所述热词特征向量序列中每个热词的热词特征向量加权求和,得到当前解码时刻所需的热词相关特征。
可选的,上述联合注意力模块还可以将所述热词匹配系数向量发送至所述分类器模块;则所述分类器模块具体为,利用所述解码器模块在当前解码时刻的输出特征及所述热词匹配系数向量,确定待识别语音在当前解码时刻的识别结果。
可选的,上述分类器模块的分类节点可以包括固定的常用字符节点和可动态扩展的热词节点。基于此,
分类器模块可以利用解码器模块在当前解码时刻的输出特征,确定各所述常用字符节点的概率得分和各所述热词节点的概率得分;根据各所述常用字符节点的概率得分和各所述热词节点的概率得分,确定待识别语音在当前解码时刻的识别结果。
具体的,分类器模块可以利用解码器模块在当前解码时刻的输出特征,确定各所述常用字符节点的概率得分;
分类器模块利用所述热词匹配系数向量,确定各所述热词节点的概率得分;
根据各所述常用字符节点的概率得分和各所述热词节点的概率得分,确定待识别语音在当前解码时刻的识别结果。
可选的,本申请的装置还可以包括模型训练单元,用于:
获取标注有识别文本的语音训练数据;
获取所述语音训练数据的音频特征;
从所述语音训练数据的标注文本中随机挑选标注片段作为训练热词;
利用所述训练热词、所述音频特征及语音训练数据的识别文本,训练语音识别模型。
可选的,上述数据获取单元获取待识别语音的音频特征的过程,可以包括:
获取待识别语音的以下任意一项音频特征:
滤波器组FilterBank特征、梅尔频率倒谱系数MFCC特征、感知线性预测PLP特征。
本申请实施例还提供了一种电子设备,请参阅图7,示出了该电子设备的结构示意图,该电子设备可以包括:至少一个处理器1001,至少一个通信接口1002,至少一个存储器1003和至少一个通信总线1004;
在本申请实施例中,处理器1001、通信接口1002、存储器1003、通信总线1004的数量为至少一个,且处理器1001、通信接口1002、存储器1003通过通信总线1004完成相互间的通信;
处理器1001可能是一个中央处理器CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本发明实施例的一个或多个集成电路等;
存储器1003可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory)等,例如至少一个磁盘存储器;
其中,存储器存储有程序,处理器可调用存储器存储的程序,所述程序用于:
获取待识别语音以及配置的热词库;
基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
可选的,所述程序的细化功能和扩展功能可参照上文描述。
本申请实施例还提供一种可读存储介质,该可读存储介质可存储有适于处理器执行的程序,所述程序用于:
获取待识别语音以及配置的热词库;
基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
进一步地,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行上述语音识别方法中的任意一种实现方式。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间可以根据需要进行组合,且相同相似部分互相参见即可。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使 用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下,在其它实施例中实现。因此,本发明将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (16)

  1. 一种语音识别方法,其特征在于,包括:
    获取待识别语音以及配置的热词库;
    基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
    基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
    基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果。
  2. 根据权利要求1所述的方法,其特征在于,所述基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征,包括:
    获取当前解码时刻之前的已解码结果信息;
    基于所述已解码结果信息及所述热词库,从所述待识别语音中确定当前解码时刻所需的音频相关特征。
  3. 根据权利要求2所述的方法,其特征在于,所述基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;基于所述音频相关特征和所述热词相关特征,确定所述待识别语音在当前解码时刻的识别结果的过程,包括:
    利用预先训练的语音识别模型处理所述待识别语音及所述热词库,得到语音识别模型输出的待识别语音的识别结果,其中:
    所述语音识别模型具备接收并处理待识别语音及热词库,以输出待识别语音的识别结果的能力。
  4. 根据权利要求3所述的方法,其特征在于,所述语音识别模型包括音频编码器模块、热词编码器模块、联合注意力模块、解码器模块及分类器模块;
    所述音频编码器模块对所述待识别语音进行编码,得到音频编码结果;
    所述热词编码器模块对所述热词库中各热词进行编码,得到热词编码结果;
    所述联合注意力模块接收并处理所述音频编码结果和所述热词编码结果,得到当前解码时刻所需的拼接特征,所述拼接特征包括音频相关特征和热词相关特征;
    所述解码器模块接收并处理所述当前解码时刻所需的拼接特征,得到解码器模块当前解码时刻的输出特征;
    所述分类器模块利用解码器模块当前解码时刻的输出特征,确定待识别语音在当前解码时刻的识别结果。
  5. 根据权利要求4所述的方法,其特征在于,所述联合注意力模块包括:
    第一注意力模型和第二注意力模型;
    所述第一注意力模型基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量,以及所述热词编码结果,从所述音频编码结果中确定当前解码时刻所需的音频相关特征;
    所述第二注意力模型基于所述音频相关特征,从所述热词编码结果中确定当前解码时刻所需的热词相关特征;
    由所述音频相关特征和所述热词相关特征组合成当前解码时刻所需的拼接特征。
  6. 根据权利要求5所述的方法,其特征在于,所述第一注意力模型基于解码器模块在当前解码时刻输出的表示已解码结果信息的状态向量,以及所述热词编码结果,从所述音频编码结果中确定当前解码时刻所需的音频相关特征,包括:
    将所述状态向量、所述热词编码结果作为第一注意力模型的输入,由所述第一注意力模型从所述音频编码结果中确定当前解码时刻所需的音频相关特征。
  7. 根据权利要求5所述的方法,其特征在于,所述第二注意力模型基于所述音频相关特征,从所述热词编码结果中确定当前解码时刻所需的热词相关特征,包括:
    将所述音频相关特征作为第二注意力模型的输入,由所述第二注意力模型从所述热词编码结果中确定当前解码时刻所需的热词相关特征。
  8. 根据权利要求4所述的方法,其特征在于,所述分类器模块的分类节点包括固定的常用字符节点和可动态扩展的热词节点;
    分类器模块利用解码器模块当前解码时刻的输出特征,确定待识别语音在当前解码时刻的识别结果,包括:
    分类器模块利用解码器模块在当前解码时刻的输出特征,确定各所述常用字符节点的概率得分和各所述热词节点的概率得分;
    根据各所述常用字符节点的概率得分和各所述热词节点的概率得分,确定待识别语音在当前解码时刻的识别结果。
  9. 根据权利要求8所述的方法,其特征在于,所述可动态扩展的热词节点,与所述热词库中的热词一一对应。
  10. 根据权利要求1-9任一项所述的方法,其特征在于,所述获取待识别语音以及配置的热词库,包括:
    获取待识别语音,并确定所述待识别语音的会话场景;
    获取与所述会话场景相关的热词库。
  11. 根据权利要求1-9任一项所述的方法,其特征在于,所述获取待识别语音以及配置的热词库,包括:
    获取人机交互场景下,用户所产出的语音,作为待识别语音;
    获取预先配置的在人机交互场景下,由用户语音操控指令中的操作关键词组成的热词库。
  12. 根据权利要求11所述的方法,其特征在于,还包括:
    基于所述待识别语音的识别结果,确定与所述识别结果相匹配的交互响应,并输出该交互响应。
  13. 一种语音识别装置,其特征在于,包括:
    数据获取单元,用于获取待识别语音以及配置的热词库;
    音频相关特征获取单元,用于基于所述待识别语音及所述热词库,确定当前解码时刻所需的音频相关特征;
    热词相关特征获取单元,用于基于所述音频相关特征,从所述热词库中确定当前解码时刻所需的热词相关特征;
    识别结果获取单元,用于基于所述音频相关特征和所述热词相关特征, 确定所述待识别语音在当前解码时刻的识别结果。
  14. 一种语音识别设备,其特征在于,包括:存储器和处理器;
    所述存储器,用于存储程序;
    所述处理器,用于执行所述程序,实现如权利要求1~12中任一项所述的语音识别方法的各个步骤。
  15. 一种可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时,实现如权利要求1~12中任一项所述的语音识别方法的各个步骤。
  16. 一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行权利要求1至12中任一项所述的方法。
PCT/CN2020/133286 2020-05-18 2020-12-02 一种语音识别方法、装置、设备及存储介质 WO2021232746A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US17/925,483 US20230186912A1 (en) 2020-05-18 2020-12-02 Speech recognition method, apparatus and device, and storage medium
JP2022563214A JP7407968B2 (ja) 2020-05-18 2020-12-02 音声認識方法、装置、設備及び記憶媒体
EP20936660.8A EP4156176A4 (en) 2020-05-18 2020-12-02 SPEECH RECOGNITION METHOD, DEVICE AND APPARATUS AND STORAGE MEDIUM
KR1020227043996A KR102668530B1 (ko) 2020-05-18 2020-12-02 음성 인식 방법, 장치 및 디바이스, 및 저장 매체

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010418728.1 2020-05-18
CN202010418728.1A CN111583909B (zh) 2020-05-18 2020-05-18 一种语音识别方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021232746A1 true WO2021232746A1 (zh) 2021-11-25

Family

ID=72126794

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/133286 WO2021232746A1 (zh) 2020-05-18 2020-12-02 一种语音识别方法、装置、设备及存储介质

Country Status (5)

Country Link
US (1) US20230186912A1 (zh)
EP (1) EP4156176A4 (zh)
JP (1) JP7407968B2 (zh)
CN (1) CN111583909B (zh)
WO (1) WO2021232746A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005438A (zh) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 语音识别方法、语音识别模型的训练方法以及相关装置
CN117437909A (zh) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 基于热词特征向量自注意力机制的语音识别模型构建方法

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583909B (zh) * 2020-05-18 2024-04-12 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质
CN112037775B (zh) * 2020-09-08 2021-09-14 北京嘀嘀无限科技发展有限公司 语音识别方法、装置、设备及存储介质
CN112489651B (zh) * 2020-11-30 2023-02-17 科大讯飞股份有限公司 语音识别方法和电子设备、存储装置
CN112634904A (zh) * 2020-12-22 2021-04-09 北京有竹居网络技术有限公司 热词识别方法、装置、介质和电子设备
CN112767917B (zh) * 2020-12-31 2022-05-17 科大讯飞股份有限公司 语音识别方法、装置及存储介质
CN112951209B (zh) * 2021-01-27 2023-12-01 中国科学技术大学 一种语音识别方法、装置、设备及计算机可读存储介质
CN113470619B (zh) * 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 语音识别方法、装置、介质及设备
CN113436614B (zh) * 2021-07-02 2024-02-13 中国科学技术大学 语音识别方法、装置、设备、系统及存储介质
CN113808592A (zh) * 2021-08-17 2021-12-17 百度在线网络技术(北京)有限公司 通话录音的转写方法及装置、电子设备和存储介质
CN115631746B (zh) * 2022-12-20 2023-04-07 深圳元象信息科技有限公司 热词识别方法、装置、计算机设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (zh) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 语音识别方法及系统
US20150127594A1 (en) * 2013-11-04 2015-05-07 Google Inc. Transfer learning for deep neural network based hotword detection
CN105955953A (zh) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 一种分词系统
CN110415705A (zh) * 2019-08-01 2019-11-05 苏州奇梦者网络科技有限公司 一种热词识别方法、系统、装置及存储介质
CN111583909A (zh) * 2020-05-18 2020-08-25 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4520499A (en) * 1982-06-25 1985-05-28 Milton Bradley Company Combination speech synthesis and recognition apparatus
CN103310790A (zh) * 2012-03-08 2013-09-18 富泰华工业(深圳)有限公司 电子装置及语音识别方法
CN102968987A (zh) * 2012-11-19 2013-03-13 百度在线网络技术(北京)有限公司 一种语音识别方法及系统
US8719039B1 (en) * 2013-12-05 2014-05-06 Google Inc. Promoting voice actions to hotwords
CN105719649B (zh) * 2016-01-19 2019-07-05 百度在线网络技术(北京)有限公司 语音识别方法及装置
CN109523991B (zh) * 2017-09-15 2023-08-18 阿里巴巴集团控股有限公司 语音识别的方法及装置、设备
CN109559752B (zh) * 2017-09-27 2022-04-26 北京国双科技有限公司 语音识别方法和装置
CN108228565A (zh) * 2018-01-11 2018-06-29 廖良平 一种商品信息关键词的识别方法
CN108831456B (zh) * 2018-05-25 2022-04-15 深圳警翼智能科技股份有限公司 一种通过语音识别对视频标记的方法、装置及系统
CN108899030A (zh) * 2018-07-10 2018-11-27 深圳市茁壮网络股份有限公司 一种语音识别方法及装置
CN108984529B (zh) * 2018-07-16 2022-06-03 北京华宇信息技术有限公司 实时庭审语音识别自动纠错方法、存储介质及计算装置
US11295739B2 (en) 2018-08-23 2022-04-05 Google Llc Key phrase spotting
CN109215662B (zh) * 2018-09-18 2023-06-20 平安科技(深圳)有限公司 端对端语音识别方法、电子装置及计算机可读存储介质
US11093560B2 (en) 2018-09-21 2021-08-17 Microsoft Technology Licensing, Llc Stacked cross-modal matching
CN110047467B (zh) * 2019-05-08 2021-09-03 广州小鹏汽车科技有限公司 语音识别方法、装置、存储介质及控制终端
CN110517692A (zh) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 热词语音识别方法和装置
CN110956959B (zh) * 2019-11-25 2023-07-25 科大讯飞股份有限公司 语音识别纠错方法、相关设备及可读存储介质
CN110879839A (zh) * 2019-11-27 2020-03-13 北京声智科技有限公司 一种热词识别方法、装置及系统
CN111105799B (zh) * 2019-12-09 2023-07-07 国网浙江省电力有限公司杭州供电公司 基于发音量化和电力专用词库的离线语音识别装置及方法
CN111009237B (zh) * 2019-12-12 2022-07-01 北京达佳互联信息技术有限公司 语音识别方法、装置、电子设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (zh) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 语音识别方法及系统
US20150127594A1 (en) * 2013-11-04 2015-05-07 Google Inc. Transfer learning for deep neural network based hotword detection
CN105955953A (zh) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 一种分词系统
CN110415705A (zh) * 2019-08-01 2019-11-05 苏州奇梦者网络科技有限公司 一种热词识别方法、系统、装置及存储介质
CN111583909A (zh) * 2020-05-18 2020-08-25 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4156176A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005438A (zh) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 语音识别方法、语音识别模型的训练方法以及相关装置
CN117437909A (zh) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 基于热词特征向量自注意力机制的语音识别模型构建方法
CN117437909B (zh) * 2023-12-20 2024-03-05 慧言科技(天津)有限公司 基于热词特征向量自注意力机制的语音识别模型构建方法

Also Published As

Publication number Publication date
JP2023522083A (ja) 2023-05-26
EP4156176A4 (en) 2024-05-08
CN111583909A (zh) 2020-08-25
KR20230040951A (ko) 2023-03-23
JP7407968B2 (ja) 2024-01-04
EP4156176A1 (en) 2023-03-29
US20230186912A1 (en) 2023-06-15
CN111583909B (zh) 2024-04-12

Similar Documents

Publication Publication Date Title
WO2021232746A1 (zh) 一种语音识别方法、装置、设备及存储介质
US11043205B1 (en) Scoring of natural language processing hypotheses
CN108829757B (zh) 一种聊天机器人的智能服务方法、服务器及存储介质
US11086918B2 (en) Method and system for multi-label classification
CN110516253B (zh) 中文口语语义理解方法及系统
US11823678B2 (en) Proactive command framework
CN107578771B (zh) 语音识别方法及装置、存储介质、电子设备
US11081104B1 (en) Contextual natural language processing
US20240153489A1 (en) Data driven dialog management
CN111243579B (zh) 一种时域单通道多说话人语音识别方法与系统
CN111159485B (zh) 尾实体链接方法、装置、服务器及存储介质
US11580145B1 (en) Query rephrasing using encoder neural network and decoder neural network
US11276403B2 (en) Natural language speech processing application selection
US10872601B1 (en) Natural language processing
US11289075B1 (en) Routing of natural language inputs to speech processing applications
CN110164416B (zh) 一种语音识别方法及其装置、设备和存储介质
Elshaer et al. Transfer learning from sound representations for anger detection in speech
US11626107B1 (en) Natural language processing
US11854535B1 (en) Personalization for speech processing applications
CN113158062A (zh) 一种基于异构图神经网络的用户意图识别方法及装置
KR102668530B1 (ko) 음성 인식 방법, 장치 및 디바이스, 및 저장 매체
CN112735380B (zh) 重打分语言模型的打分方法及语音识别方法
US11947912B1 (en) Natural language processing
JP7333490B1 (ja) 音声信号に関連するコンテンツを決定する方法、コンピューター可読保存媒体に保存されたコンピュータープログラム及びコンピューティング装置
US11626105B1 (en) Natural language processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936660

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022563214

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020936660

Country of ref document: EP

Effective date: 20221219