WO2022227935A1 - Speech recognition method, apparatus, device, storage medium and program product - Google Patents

Speech recognition method, apparatus, device, storage medium and program product

Info

Publication number
WO2022227935A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
speech
word
feature
word graph
Application number
PCT/CN2022/082046
Other languages
English (en)
French (fr)
Inventor
张玺霖
刘博
刘硕
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Priority to EP22794411.3A (EP4231283A4)
Priority to US17/979,660 (US20230070000A1)
Publication of WO2022227935A1

Classifications

    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L19/04 Speech or audio signal analysis-synthesis techniques for redundancy reduction, or coding/decoding of speech or audio signals, using predictive techniques
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer technology, and in particular, to a speech recognition method, apparatus, device, storage medium and program product.
  • Speech recognition refers to converting the received speech information into text information.
  • Many applications provide speech-to-text services.
  • speech recognition includes streaming speech recognition and non-streaming speech recognition.
  • Streaming speech recognition has higher real-time requirements than non-streaming speech recognition.
  • common speech recognition systems include traditional speech recognition systems and E2E (End-to-End, end-to-end) speech recognition systems.
  • the traditional speech recognition system converts speech information into text information through the sequential mapping relationship between speech features, phonemes, words, and word strings; the traditional speech recognition system is a combination of multiple models such as an acoustic model, a pronunciation dictionary, and a language model.
  • the E2E speech recognition system uses a multi-head attention mechanism between the input end and the output end to realize the work content corresponding to the multiple models in the above-mentioned traditional speech recognition system.
  • the traditional speech recognition system includes multiple models; because information is lost as it is passed between the models, the recognition performance has certain limitations, resulting in low recognition accuracy.
  • Embodiments of the present application provide a speech recognition method, apparatus, device, storage medium, and program product.
  • the technical solution is as follows.
  • a speech recognition method is provided, the method is performed by a computer device, and the method includes:
  • acquiring voice content, where the voice content is the audio to be recognized;
  • performing feature extraction on the voice content to obtain intermediate features, where the intermediate features are used to indicate the audio expression characteristics of the voice content;
  • decoding the intermediate features based on an attention mechanism to obtain a first word graph, where the first word graph is used to indicate a first candidate sequence set composed of first candidate words predicted based on the attention mechanism;
  • performing feature mapping on the intermediate features based on the pronunciation of the voice content to obtain a second word graph, where the second word graph is used to indicate a second candidate sequence set composed of second candidate words obtained based on the pronunciation;
  • determining the recognition result of the speech content according to the connection relationship between the candidate words indicated by the first word graph and the second word graph.
  • a speech recognition device comprising:
  • an acquisition module for acquiring voice content, the voice content being the audio to be recognized
  • a processing module configured to perform feature extraction on the voice content to obtain intermediate features, where the intermediate features are used to indicate the audio expression characteristics of the voice content
  • the first generation module is configured to decode the intermediate features based on the attention mechanism to obtain a first word graph, where the first word graph is used to indicate a first candidate sequence set composed of first candidate words predicted based on the attention mechanism;
  • the second generation module is configured to perform feature mapping on the intermediate features based on the pronunciation of the voice content to obtain a second word graph, where the second word graph is used to indicate a second candidate sequence set composed of second candidate words obtained based on the pronunciation;
  • a determination module configured to determine the recognition result of the speech content according to the connection relationship between the candidate words indicated by the first word graph and the second word graph.
  • in another aspect, a computer device is provided, including a processor and a memory, where the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the speech recognition method described in any one of the embodiments of the present application.
  • in another aspect, a computer-readable storage medium is provided, where at least one piece of program code is stored in the computer-readable storage medium, and the program code is loaded and executed by a processor to implement the speech recognition method described in any one of the embodiments of the present application.
  • a computer program product includes at least one computer program, and the computer program is loaded and executed by a processor to implement the speech recognition method described in any of the foregoing embodiments.
  • feature extraction is performed on the speech content to obtain intermediate features that can indicate the audio expression characteristics of the speech content, and the intermediate features are then processed in two different ways to obtain two word graphs: the intermediate features are decoded based on the attention mechanism to obtain the first word graph, and feature mapping is performed based on the pronunciation of the speech content to obtain the second word graph.
  • the first word graph and the second word graph are respectively used to indicate the candidate sequence sets composed of the candidate words obtained by the two processing methods, and the recognition result is determined according to the connection relationship between the candidate words indicated by the first word graph and the second word graph, thereby realizing the function of converting speech content into text content.
  • since both the first word graph and the second word graph are obtained from the same intermediate features, server resources can be saved.
  • different processing methods are applied to the intermediate features, and the recognition result is then jointly determined from the word graphs obtained by the two processing methods, which improves the accuracy of speech recognition.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic diagram of a speech recognition application scenario provided by an exemplary embodiment of the present application.
  • FIG. 3 is a schematic diagram of a speech recognition application scenario provided by another exemplary embodiment of the present application.
  • FIG. 4 is a flowchart of a speech recognition method provided by an exemplary embodiment of the present application.
  • FIG. 5 is a schematic diagram of the form of a first word graph provided by an exemplary embodiment of the present application.
  • FIG. 6 is a schematic diagram of a confusion network form provided by an exemplary embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a speech recognition model provided by an exemplary embodiment of the present application.
  • FIG. 8 is a flowchart of a speech recognition method provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic diagram of a Hybrid speech recognition system provided by an exemplary embodiment of the present application.
  • FIG. 10 is a schematic diagram of an E2E speech recognition system provided by an exemplary embodiment of the present application.
  • FIG. 11 is a flowchart of a training method of a speech recognition model provided by an exemplary embodiment of the present application.
  • FIG. 12 is a structural block diagram of a speech recognition apparatus provided by an exemplary embodiment of the present application.
  • FIG. 13 is a structural block diagram of a speech recognition apparatus provided by another exemplary embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a server provided by an exemplary embodiment of the present application.
  • the implementation environment of the embodiment of the present application is described; please refer to FIG. 1 , the implementation environment includes a terminal 101 , a server 102 and a communication network 103 .
  • the terminal 101 may be an electronic device such as a mobile phone, a tablet computer, an e-book reader, a multimedia playback device, a wearable device, a laptop computer, a desktop computer, or an all-in-one voice recognition machine.
  • an application program for speech recognition is installed in the terminal 101, and the text conversion of the speech content to be recognized can be realized through the application program.
  • the speech recognition application may be traditional application software, cloud application software, may be implemented as a small program or application module in a host application, or may be a web platform, which is not limited herein.
  • the server 102 is used to provide the terminal 101 with a voice recognition service.
  • the terminal 101 transmits the to-be-recognized speech content to the server 102 through the communication network 103, and accordingly, the server 102 receives the to-be-recognized speech content uploaded by the terminal 101; the server 102 invokes the speech recognition model to recognize the to-be-recognized speech content, generates corresponding text content, and returns the text content to the terminal 101 through the communication network 103.
  • the server 102 is a physical server or a cloud server.
  • the above-mentioned server 102 may also be implemented as a node in a blockchain system.
  • the server 102 can establish a communication connection with the terminal 101 through the communication network 103 .
  • the network can be a wireless network or a wired network.
  • the speech recognition methods provided in the embodiments of the present application can be used for both streaming speech recognition services and non-streaming speech recognition services; in the embodiments of the present application, the method being used in a non-streaming speech recognition service is taken as an example for explanation.
  • the speech recognition method provided in this embodiment of the present application may be applied to at least one of the following scenarios, including but not limited to.
  • the speech recognition service is applied to the scenario of converting received speech information into text in social software; for example, the target object receives a piece of voice information in the social software, such as a voice message sent by another object during a chat, or swipes to a voice post published by another object in the feed interface.
  • the target object can use the voice recognition service to convert the voice content into text content for display, which ensures that the target object can obtain the message content of the voice content in time when it is inconvenient to receive information by playing the voice.
  • the target object receives the voice information 201 sent by other objects, and the target object can call the menu control 202 by long-pressing the control corresponding to the voice information 201.
  • the menu control 202 includes a sub-control for providing the speech-to-text service, and the target object converts the received voice information into text information by clicking the sub-control.
  • when the terminal receives the triggering operation on the above sub-control, the voice signal is uploaded to the server, the server performs voice recognition to convert it into text information and returns the text information to the terminal; the terminal receives the text information returned by the server and displays it in the preset area 203 of the chat interface 200 corresponding to the voice signal.
  • the voice recognition service can be applied to the voice input function provided by the input method software.
  • the target object performs voice input through the preset controls in the input method software
  • the terminal sends the collected voice signal to the server;
  • in response to the voice signal, the server processes it to obtain text information corresponding to the voice signal and returns the text information to the terminal; the terminal displays the text information as the content of the target object's voice input.
  • the server may return one piece of text information, or may return multiple pieces of similar text information determined from the voice information for the target object to choose from.
  • the target object can perform voice input through the voice input control 301.
  • when the target object clicks the voice input control 301, the terminal calls the microphone to record the target object's voice information; after the operation on the voice input control 301 ends, the terminal determines that recording of the voice information is completed and uploads the voice information to the server; the server feeds back a plurality of pieces of text information 302 obtained by recognition, and the terminal displays the plurality of pieces of text information 302; the target object can select, from the plurality of pieces of text information 302, the text information that matches his or her intention, and the text information selected by the target object is displayed in the input box 303.
  • the speech recognition service can be applied to the automatic subtitle generation function in the video software.
  • the target object publishes a video through the video software; before publishing the target video, the target video is uploaded to the video software, and the video software can provide some video processing functions for the target object, which can include the automatic subtitle generation function.
  • the server extracts the audio from the received target video, performs speech recognition on the audio, generates text information, and returns the text information to the terminal.
  • the target object can choose to add the generated text information to the target video as subtitles of the target video.
  • the speech recognition method provided in the embodiment of the present application may also be applied to other application scenarios, which is only described here by an example, and does not limit the specific application scenario.
  • the server may instruct the terminal to display authorization inquiry information on the terminal interface, and after receiving a confirmation operation based on the authorization inquiry information, the server confirms that it has obtained the authority to process the relevant information corresponding to the authorization inquiry information.
  • the authorization inquiry information may include at least one of message content authorization inquiry information or input voice authorization inquiry information; when the authorization inquiry information includes the message content authorization inquiry information, after the server receives the target object's confirmation operation for the authorization inquiry information, it is determined that the voice information received by the target object in the social software can be obtained; when the authorization inquiry information includes the input voice authorization inquiry information, after the server receives the target object's confirmation operation for the authorization inquiry information, it is determined that the voice content input by the target object can be obtained; this application does not limit the content of the authorization inquiry information.
  • FIG. 4 shows a flowchart of a speech recognition method provided by an exemplary embodiment of the present application.
  • the speech recognition method may be executed by a computer device; in this embodiment, the method being executed by the server in the above implementation environment is taken as an example for description, and the method includes the following steps.
  • Step 401 Acquire voice content, where the voice content is the audio to be recognized.
  • the server obtains the voice content, which is the audio to be recognized.
  • the terminal compresses the recorded audio, packages the compressed audio and the voice-to-text request using a network protocol, and sends it to the server through the communication network.
  • the server decompresses the compressed audio corresponding to the voice-to-text request to obtain the voice content to be recognized.
  • the server may also acquire the voice content from the database, which is not limited herein.
  • after acquiring the speech content, the server invokes the speech recognition model according to the speech-to-text request to recognize the speech content.
  • step 402 feature extraction is performed on the speech content to obtain intermediate features.
  • the speech recognition model includes a shared network (Shared Network) sub-model, and the shared network sub-model is used to perform feature extraction on speech content to obtain intermediate features that can indicate the audio expression characteristics of the speech content; that is, the server can perform feature extraction on the speech content through the shared network sub-model in the speech recognition model to obtain intermediate features.
  • the speech recognition model may be referred to as a speech recognition module, and the shared network sub-model may be referred to as a shared network sub-module.
  • the shared network sub-model includes at least one layer of convolutional neural networks (Convolutional Neural Networks, CNN); the server can perform feature extraction on the speech content through the at least one convolutional network layer included in the shared network sub-model to obtain intermediate sub-features; after that, feature weighting is performed on the intermediate sub-features to obtain the intermediate features.
  • the preprocessing includes converting the speech content into a speech feature sequence, that is, extracting, through signal processing technology, features in the form of feature vectors from the speech signal corresponding to the input speech content for subsequent processing by the shared network sub-model, so as to minimize the influence of environmental noise, channel, speaker and other factors on the features.
  • the preprocessing includes at least one of noise reduction processing, sampling processing, pre-emphasis processing, windowing and framing processing, and the like.
  • noise reduction processing reduces the noise of the speech signal through a preset filter to ensure the accuracy of recognizing the human voice in the speech signal; sampling processing converts the speech signal from an analog signal into a digital signal; pre-emphasis processing emphasizes the high-frequency part of the speech to remove the influence of lip radiation and increase the high-frequency resolution of the speech; windowing and framing processing uses a movable finite-length window to weight the speech signal and then applies relevant filter transformations or operations to each frame, so that the speech signal is processed into a number of short segments (analysis frames).
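  • the following is a minimal sketch of the pre-emphasis, framing, and windowing steps described above (noise reduction and resampling are omitted); the frame length, frame shift, and pre-emphasis values are illustrative assumptions and are not specified in this embodiment.

    import numpy as np

    def preprocess(signal: np.ndarray, sample_rate: int = 16000,
                   frame_len_ms: float = 25.0, frame_shift_ms: float = 10.0,
                   pre_emphasis: float = 0.97) -> np.ndarray:
        # Pre-emphasis: emphasize the high-frequency part of the speech signal.
        emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

        frame_len = int(sample_rate * frame_len_ms / 1000)
        frame_shift = int(sample_rate * frame_shift_ms / 1000)
        num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)

        window = np.hamming(frame_len)  # movable finite-length window
        frames = np.stack([
            emphasized[i * frame_shift: i * frame_shift + frame_len] * window
            for i in range(num_frames)
        ])
        return frames  # analysis frames, ready for feature extraction (e.g. filter banks)

    frames = preprocess(np.random.randn(16000))  # one second of audio -> frames of shape (98, 400)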
  • the speech feature sequence obtained after preprocessing the speech content is input into the shared network sub-model to obtain intermediate features.
  • the shared network sub-model includes at least one layer of convolutional neural network, and the at least one layer of convolutional neural network can perform feature extraction on the speech content to obtain intermediate sub-features, where the intermediate sub-features are a higher-level feature representation than the speech feature sequence.
  • the shared network sub-model also includes a Transformer (deep self-attention transformation network).
  • the Transformer obtains intermediate sub-features, and performs at least one weighting of the self-attention mechanism on the intermediate sub-features, thereby outputting intermediate features.
  • the shared network sub-model may also include networks such as LSTM (Long Short-Term Memory), BLSTM (Bi-directional Long Short-Term Memory), or DFSMN (Deep Feedforward Sequential Memory Networks) to process the intermediate sub-features to obtain the intermediate features, which is not limited here.
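  • a minimal sketch of such a shared network sub-model is given below; PyTorch is assumed only for illustration, and the layer sizes and depths are invented rather than taken from this embodiment. Convolutional layers extract intermediate sub-features from the speech feature sequence, and a Transformer encoder applies self-attention weighting to produce the intermediate features shared by both branches.

    import torch
    import torch.nn as nn

    class SharedNetwork(nn.Module):
        def __init__(self, feat_dim: int = 80, d_model: int = 256,
                     num_heads: int = 4, num_layers: int = 6):
            super().__init__()
            # CNN front-end: (batch, time, feat_dim) -> (batch, time/4, d_model)
            self.cnn = nn.Sequential(
                nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=num_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
            # speech_features: (batch, time, feat_dim), the preprocessed speech feature sequence
            x = self.cnn(speech_features.transpose(1, 2)).transpose(1, 2)  # intermediate sub-features
            return self.transformer(x)                                      # intermediate features

    intermediate = SharedNetwork()(torch.randn(1, 200, 80))  # -> shape (1, 50, 256)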
  • Step 403 decoding the intermediate features based on the attention mechanism to obtain a first word graph.
  • the attention mechanism adjusts the direction of attention and the weighting model according to the specific task goal.
  • the content that does not conform to the attention model is weakened or forgotten.
  • when the direction of attention is toward the input itself, it is called the self-attention mechanism.
  • when the input is divided into multiple heads to form multiple subspaces, and the attention mechanism is completed in each subspace before the results are recombined, it is called multi-head attention (Multi-Headed Attention, MHA).
  • the multi-head attention mechanism lets the model learn relevant information in different subspaces.
  • the first word graph is used to indicate the first candidate sequence set composed of the first candidate words predicted based on the attention mechanism.
  • the speech recognition model includes an E2E network sub-model, and the E2E network sub-model is used to decode the intermediate features based on the attention mechanism to obtain the first word graph; that is, the server can decode the intermediate features through the E2E network sub-model based on the attention mechanism to obtain the first word graph.
  • the E2E network sub-model can perform feature weighting on the channel indicating the expression of human voice in the intermediate features based on the attention mechanism, to obtain the first branch feature; and decode the first branch feature to obtain the first word graph.
  • the E2E network sub-model may be called an E2E network sub-module; the E2E network sub-model is used to indicate a recognition model for realizing end-to-end speech recognition through an attention mechanism-based neural network.
  • the E2E network sub-model includes an Attention processing layer, which serves as a hidden layer in the E2E network; during feature processing, the direction of attention and the weighting model are adjusted according to preset task goals, that is, by adding the feature weighting operation of the attention mechanism, speech features that do not conform to the direction of attention are weakened or forgotten, where the direction of attention is determined during the training process of the speech recognition model. Therefore, after receiving the intermediate features, the E2E network sub-model inputs them to the Attention processing layer to obtain the first branch feature.
  • the Attention processing layer can be implemented as an AED (Attention-based Encoder-Decoder, attention mechanism-based encoding-decoding) model, which is a model used to solve the sequence-to-sequence mapping problem.
  • in the AED model, MHA controls the unequal-length mapping between the encoding sequence and the decoding sequence, thereby completing the construction of the E2E speech recognition system.
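  • as a small illustration of this unequal-length mapping (PyTorch assumed; the dimensions and sequence lengths are invented), a decoder state sequence of one length can attend over intermediate features of another length through multi-head attention:

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
    encoder_feats = torch.randn(1, 50, 256)   # intermediate features, 50 frames
    decoder_states = torch.randn(1, 7, 256)   # 7 output positions being decoded
    context, attn_weights = mha(query=decoder_states, key=encoder_feats, value=encoder_feats)
    print(context.shape, attn_weights.shape)  # torch.Size([1, 7, 256]) torch.Size([1, 7, 50])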
  • the E2E network sub-model also includes a decoding network, which is used to decode the first branch feature to obtain the first word graph; in an example, the above Attention processing layer realizes the unequal-length mapping between the intermediate features and the first branch feature, and the decoding network decodes the first branch feature and determines the first candidate sequence set composed of the first candidate words, that is, multiple optimal candidate paths (N-best); the optimal candidate paths are used to generate the first word graph. In other words, the first branch feature is decoded by the decoder to obtain the first candidate sequence set, and the first candidate words corresponding to the first candidate sequence set are used as paths to generate the first word graph.
  • the decoder is pre-trained by the speech recognition system through training data.
  • a word graph (Lattice) is essentially a directed acyclic graph; each node in the graph represents a time point at which a candidate word determined from the first branch feature ends, and each edge represents a possible candidate word together with the candidate word's score, which is used to indicate the likelihood of the candidate word being included in the processing result.
  • schematically, FIG. 5 shows the form of a first word graph 500, where each edge 501 between nodes in the first word graph 500 represents a first candidate word and the score of the first candidate word.
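  • such a word graph can be sketched as the following illustrative data structure; the node indices, words, and scores are invented, and the best-path search shown here is simply a dynamic program over the acyclic graph:

    from dataclasses import dataclass, field

    @dataclass
    class Edge:
        start: int    # node (time point) at which the candidate word starts
        end: int      # node (time point) at which the candidate word ends
        word: str     # candidate word carried by this edge
        score: float  # likelihood of this candidate word appearing in the result

    @dataclass
    class WordGraph:
        num_nodes: int
        edges: list[Edge] = field(default_factory=list)

        def best_path(self) -> tuple[list[str], float]:
            # Simple dynamic program; nodes are assumed to be numbered in time order.
            best = {0: (0.0, [])}
            for edge in sorted(self.edges, key=lambda e: e.start):
                if edge.start in best:
                    score, words = best[edge.start]
                    candidate = (score + edge.score, words + [edge.word])
                    if edge.end not in best or candidate[0] > best[edge.end][0]:
                        best[edge.end] = candidate
            score, words = best[self.num_nodes - 1]
            return words, score

    graph = WordGraph(num_nodes=4, edges=[
        Edge(0, 1, "我", 0.9), Edge(1, 2, "爱", 0.7), Edge(1, 2, "哀", 0.3), Edge(2, 3, "你", 0.8),
    ])
    print(graph.best_path())  # best word sequence ['我', '爱', '你'] with accumulated score ≈ 2.4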
  • Step 404 Perform feature mapping on the intermediate features based on the pronunciation of the speech content to obtain a second word graph.
  • the second word graph is used to indicate the second candidate sequence set composed of the second candidate words obtained based on the pronunciation.
  • the speech recognition model further includes an acoustic processing sub-model, and the acoustic processing sub-model is used to perform feature mapping on the intermediate features based on the pronunciation to obtain the second word graph; that is, the server may, based on the pronunciation of the speech content, perform feature mapping on the intermediate features through the acoustic processing sub-model to obtain the second word graph.
  • the process of obtaining the second word graph by performing feature mapping on the intermediate features through the acoustic processing sub-model can be implemented as follows: after receiving the intermediate features, the acoustic processing sub-model inputs them into the fully connected layer to obtain the posterior probabilities of the phonemes of the speech to be recognized, and the target vocabulary set is determined based on the posterior probabilities of the phonemes; here, a phoneme is the smallest phonetic unit divided according to the natural properties of speech, and the fully connected layer ends with a softmax activation function.
  • the acoustic processing sub-model further includes a pronunciation dictionary unit.
  • the pronunciation dictionary unit may determine the target vocabulary set of the to-be-recognized speech according to the received posterior probability of the phonemes of the to-be-recognized speech.
  • a pronunciation dictionary is stored in the pronunciation dictionary unit, and the pronunciation dictionary records a vocabulary set and pronunciations corresponding to the vocabulary in the vocabulary set, that is, the pronunciation dictionary includes a mapping relationship between vocabulary and pronunciation.
  • the process of determining the target vocabulary set can be implemented as follows: determining the phoneme at each time point in the speech content according to the posterior probabilities of the phonemes, and determining, according to the pronunciation dictionary, the target vocabulary set that the phonemes at each time point can form.
  • the acoustic processing sub-model further includes a language model unit.
  • the language model unit is configured to determine the second candidate sequence set of the target vocabulary set based on the target vocabulary set determined by the pronunciation dictionary unit.
  • the language model unit may be composed of at least one of language models such as an n-gram language model, a feedforward neural network-based model, and a recurrent neural network-based model, or may be composed of other language models. This is not limited.
  • the language model calculates the possibility that a second candidate sequence exists when words from the target vocabulary set are combined into the second candidate sequence.
  • the form of the second word graph is the same as that of the first word graph, and details are not described herein.
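  • a simplified, illustrative sketch of this acoustic branch is shown below (not the implementation of this embodiment): per-frame phoneme posteriors from the softmax layer are collapsed into a phoneme sequence, a toy pronunciation dictionary maps phonemes to the target vocabulary set, and a toy bigram language model scores the resulting second candidate sequence; every dictionary and language-model entry here is invented for illustration.

    import itertools
    import numpy as np

    PHONEMES = ["n", "i", "h", "ao"]
    PRONUNCIATION_DICT = {("n", "i"): "你", ("h", "ao"): "好"}              # pronunciation -> word
    BIGRAM_LM = {("<s>", "你"): 0.4, ("你", "好"): 0.6, ("好", "</s>"): 0.9}

    def phoneme_sequence(posteriors: np.ndarray) -> list[str]:
        # posteriors: (time, num_phonemes), the output of the fully connected softmax layer
        frame_ids = posteriors.argmax(axis=1)
        return [PHONEMES[i] for i, _ in itertools.groupby(frame_ids)]       # merge repeated frames

    def words_from_phonemes(phones: list[str]) -> list[str]:
        words, i = [], 0
        while i < len(phones):                      # greedy dictionary lookup (toy strategy)
            key = tuple(phones[i:i + 2])
            if key in PRONUNCIATION_DICT:
                words.append(PRONUNCIATION_DICT[key])
                i += 2
            else:
                i += 1                              # skip phonemes with no dictionary entry
        return words

    def sequence_probability(words: list[str]) -> float:
        tokens = ["<s>"] + words + ["</s>"]
        prob = 1.0
        for prev, cur in zip(tokens, tokens[1:]):
            prob *= BIGRAM_LM.get((prev, cur), 1e-4)  # small floor for unseen bigrams
        return prob

    posteriors = np.eye(4)[[0, 0, 1, 2, 3, 3]]                 # frames: n n i h ao ao
    words = words_from_phonemes(phoneme_sequence(posteriors))  # -> ['你', '好']
    print(words, sequence_probability(words))                  # probability ≈ 0.216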
  • the intermediate features are simultaneously input into the E2E network sub-model and the acoustic processing sub-model in the speech recognition model; that is, the processes of acquiring the first word graph and the second word graph can be performed synchronously.
  • Step 405 Determine the recognition result of the speech content according to the connection relationship between the candidate words indicated by the first word graph and the second word graph.
  • the first word graph indicates the first candidate sequence set composed of the first candidate words predicted based on the attention mechanism, and the second word graph indicates the second candidate sequence set composed of the second candidate words predicted based on pronunciation; that is, the first word graph indicates the connection relationship between the first candidate words, and the second word graph indicates the connection relationship between the second candidate words.
  • the speech recognition model further includes a result generation sub-model, and the result generation sub-model is used to process the respective output results of the E2E network sub-model and the acoustic processing sub-model to generate a speech content recognition result.
  • the result generation sub-model may be called a result generation sub-module.
  • the result generation sub-model receives the first word graph and the second word graph, and determines a candidate sequence set according to the first word graph and the second word graph, where the candidate sequence set includes candidate sequences corresponding to the first candidate sequence set and candidate sequences corresponding to the second candidate sequence set.
  • the server may obtain n candidate sequences from the first candidate sequence set and the second candidate sequence set respectively, and determine the above 2n candidate sequences as the candidate sequence set, where n is a positive integer.
  • the server may determine the candidate sequence set according to the sequence scores of the first candidate sequences in the first candidate sequence set or the sequence scores of the second candidate sequences in the second candidate sequence set, where a sequence score is determined by the scores of the candidate words that compose the sequence.
  • the result generating sub-model may determine at least one candidate sequence from the above-mentioned candidate sequence set as the identification result.
  • the result generation sub-model can also generate a target confusion network according to the first word graph and the second word graph, and the recognition result is determined through the target confusion network; the target confusion network includes the connection probabilities between third candidate words that form candidate sequences, the third candidate words are determined from the first candidate words and the second candidate words, and the connection probabilities between the third candidate words are obtained by weighted merging of the first connection relationship between the first candidate words and the second connection relationship between the second candidate words.
  • the third candidate words corresponding to the target confusion network may be the union of the first candidate words and the second candidate words, or may be composed of a preset number of first candidate words and a preset number of second candidate words, where the two preset numbers may be the same or different; taking the case where the third candidate words are composed of a preset number of first candidate words and a preset number of second candidate words as an example, a preset number of candidate words are selected from the first candidate words and a preset number of candidate words are selected from the second candidate words according to a preset rule, the candidate words selected from the first candidate words and the candidate words selected from the second candidate words are merged to obtain the third candidate words, and the target confusion network is formed from the third candidate words; the preset rule can be determined according to the weights between the E2E network sub-model and the acoustic processing sub-model.
  • each edge between nodes in the target confusion network corresponds to a third candidate word and the score of the third candidate word; the score of the third candidate word is used to indicate the connection probability between the third candidate word and the preceding and following candidate words, and the connection probability is determined by the first connection relationship and the second connection relationship.
  • the way to determine the recognition result through the target confusion network is as follows: the nodes of the target confusion network are traversed from left to right, and the edges carrying the highest-scoring candidate word between every two adjacent nodes are spliced together to form a path; this path is the path with the highest score in the target confusion network, and the candidate sequence formed by the path is the recognition result of the speech content.
  • schematically, FIG. 6 shows the form of a confusion network 600; the confusion network 600 includes a plurality of nodes, and the lines 601 between the nodes correspond to the edges in the word graph, that is, each line 601 represents a candidate word and the score of the candidate word, where the score is used to indicate the connection probability between the candidate words.
  • when the target confusion network is the confusion network 600 shown in FIG. 6, the determined processing result is: ABBC.
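  • a minimal sketch of reading the result off such a network is shown below: each slot between two adjacent nodes holds candidate words with their connection probabilities, and the best path takes the highest-scoring edge in every slot from left to right; the scores used here are invented so that the best path reproduces the ABBC example.

    ConfusionNetwork = list[dict[str, float]]   # one {candidate word: connection probability} per slot

    def best_path(net: ConfusionNetwork) -> str:
        return "".join(max(slot, key=slot.get) for slot in net)

    net = [{"A": 0.7, "D": 0.3}, {"B": 0.6, "E": 0.4}, {"B": 0.8, "F": 0.2}, {"C": 0.9, "G": 0.1}]
    print(best_path(net))  # ABBC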
  • the method of generating a confusion network from a word graph includes: step a, selecting the path with the highest weight from the word graph as the initial confusion network, where the nodes in the path are the nodes of the confusion network; step b, gradually aligning and adding the other edges to the initial confusion network, where edges at the same position with the same word are merged into one and their weights are accumulated.
  • the result generation sub-model may also generate a first confusion network according to the first word graph, generate a second confusion network according to the second word graph, and perform weighted merging on the first confusion network and the second confusion network according to a preset weighting rule to obtain the target confusion network.
  • the preset weighting rule is preset by the system.
  • the weighted merging process for the first confusion network and the second confusion network includes: step a, multiplying each edge on the first word graph by a coefficient m and multiplying each edge on the second word graph by a coefficient (1-m); if m is greater than 0.5, the final processing result of the speech recognition model leans toward the processing result of the E2E network sub-model, and if m is less than 0.5, it leans toward the processing result of the acoustic processing sub-model; step b, merging the first word graph and the second word graph after they have been multiplied by their respective coefficients: the confusion network corresponding to the weighted second word graph is used as the initial confusion network, and starting from the initial confusion network, each edge on the weighted first word graph is traversed and aligned into the initial confusion network until all edges have been added and the merge is completed.
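  • a hedged sketch of the weighted merging in steps a and b is shown below, using the same slot-per-position confusion-network representation as the previous sketch and assuming the two networks are already aligned slot by slot (the alignment of edges on the word graphs is omitted here for brevity).

    ConfusionNetwork = list[dict[str, float]]

    def weighted_merge(first_cn: ConfusionNetwork, second_cn: ConfusionNetwork,
                       m: float = 0.5) -> ConfusionNetwork:
        target: ConfusionNetwork = []
        for first_slot, second_slot in zip(first_cn, second_cn):
            slot: dict[str, float] = {}
            for word, score in first_slot.items():    # edges from the E2E-side network, scaled by m
                slot[word] = slot.get(word, 0.0) + m * score
            for word, score in second_slot.items():   # edges from the acoustic-side network, scaled by (1 - m)
                slot[word] = slot.get(word, 0.0) + (1.0 - m) * score
            target.append(slot)                       # same-position, same-word edges accumulate
        return target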
  • schematically, as shown in FIG. 7, the speech information is input into the shared network sub-model 710, and the shared network sub-model 710 performs feature extraction on the speech information to obtain intermediate features, where the shared network sub-model 710 includes a convolutional neural network 711 and a Transformer 712.
  • the intermediate features are input to both the E2E network sub-model 720 and the acoustic processing sub-model 730.
  • the E2E network sub-model 720 processes the intermediate features, outputs the first word graph, and inputs the first word graph into the result generation sub-model 740, wherein the E2E network sub-model 720 includes an attention mechanism (Attention) processing layer 721 and a decoding network (Decoder) 722.
  • the acoustic processing sub-model 730 processes the intermediate features, outputs the second word graph, and inputs the second word graph into the result generation sub-model 740, wherein the acoustic processing sub-model 730 includes a fully connected layer (softmax) 731, a pronunciation dictionary unit ( Lexicon) 732 and Language Model Unit (LM) 733.
  • the result generating sub-model 740 generates a processing result according to the first word graph and the second word graph, where the processing result includes at least one piece of text information corresponding to the speech content.
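  • at a high level, this data flow can be sketched as the following orchestration; the branch functions are placeholders standing in for the sub-models described above, and none of the names are taken from this embodiment.

    from typing import Any, Callable

    WordGraph = Any  # stands in for the word graph structure sketched earlier

    def recognize(speech_features: Any,
                  shared_network: Callable[[Any], Any],
                  e2e_branch: Callable[[Any], WordGraph],
                  acoustic_branch: Callable[[Any], WordGraph],
                  combine: Callable[[WordGraph, WordGraph], str]) -> str:
        intermediate = shared_network(speech_features)      # extracted once, shared by both branches
        first_word_graph = e2e_branch(intermediate)         # attention-based decoding (721, 722)
        second_word_graph = acoustic_branch(intermediate)   # softmax + dictionary + LM (731-733)
        return combine(first_word_graph, second_word_graph) # confusion-network combination (740)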
  • in summary, in the speech recognition method provided in this embodiment, feature extraction is performed on the speech content to be recognized to obtain intermediate features that can indicate the audio expression characteristics of the speech content, and the intermediate features are then processed in two different ways to obtain two word graphs: the intermediate features are decoded based on the attention mechanism to obtain the first word graph, and feature mapping is performed based on the pronunciation of the speech content to obtain the second word graph.
  • the first word graph and the second word graph are respectively used to indicate the candidate sequence sets composed of the candidate words obtained by the two processing methods, and finally the recognition result is determined according to the connection relationship between the candidate words indicated by the first word graph and the second word graph, thereby realizing the function of converting speech content into text content.
  • since both the first word graph and the second word graph are obtained from the same intermediate features, server resources can be saved.
  • different processing methods are applied to the intermediate features, and the recognition result is then jointly determined from the word graphs obtained by the two processing methods, which improves the accuracy of speech recognition.
  • FIG. 8 shows a flowchart of a speech recognition method provided by an exemplary embodiment of the present application.
  • the speech recognition method can be executed by a computer device, and the method includes the following steps.
  • Step 801 acquiring voice content.
  • the server may acquire the voice content from the terminal, or may acquire the voice content from a database, which is not limited herein.
  • Step 802 perform feature extraction on the speech content to obtain intermediate features.
  • feature extraction may be performed on the voice content through a shared network sub-model in the voice recognition model to obtain intermediate features.
  • the intermediate feature is used to indicate the audio expression characteristics of the speech content; the intermediate feature is used to simultaneously input the end-to-end E2E network sub-model and the acoustic processing sub-model in the speech recognition model.
  • the speech content is preprocessed to obtain the speech feature sequence.
  • the speech feature sequence is extracted through a shared network including at least one layer of convolutional neural network and Transformer to obtain intermediate features.
  • the first word graph is obtained through steps 803 to 804
  • the second word graph is obtained through steps 805 to 808 .
  • Step 803 Based on the attention mechanism, feature weighting is performed on the channel indicating the expression of the human voice in the intermediate features to obtain the first branch feature.
  • the E2E network sub-model may perform feature weighting on the channel indicating the expression of the human voice in the intermediate features, so as to obtain the first branch feature.
  • the intermediate features are weighted according to the direction of attention in the speech recognition process to obtain the first branch feature.
  • Step 804 Decode the first branch feature to obtain a first word graph.
  • the first branch feature is decoded by the decoder; the decoder determines, according to the first branch feature, the first candidate words and the scores of the first candidate words at each time node corresponding to the speech information, and generates the first word graph according to the first candidate words and their scores; the first word graph is used to indicate the first candidate sequence set composed of the first candidate words predicted based on the attention mechanism.
  • Step 805 Input the intermediate features to the fully connected layer to obtain the posterior probability of the phoneme of the speech to be recognized.
  • the fully connected layer ends with a softmax activation function.
  • Step 806 Determine the target vocabulary set based on the posterior probability of the phoneme and the pronunciation dictionary.
  • the posterior probability of the phoneme of the speech to be recognized is used to determine which first candidate vocabulary is included in the speech content, and the target vocabulary is formed by the above-mentioned first candidate vocabulary. set. That is, a pronunciation dictionary is obtained, and the pronunciation dictionary includes the mapping relationship between vocabulary and pronunciation; according to the posterior probability of the phonemes determined by the fully connected layer, the phonemes of each time point in the speech content are determined; according to the pronunciation dictionary, the phoneme of each time point is determined.
  • the target vocabulary set that phonemes can form.
  • Step 807 Determine the probability of at least one second candidate sequence composed of the target vocabulary set.
  • the above target vocabulary set is input into the language model, and at least one second candidate sequence and the probability corresponding to each second candidate sequence are determined; the language model can be at least one of an n-gram language model, a feedforward neural network-based model, a recurrent neural network-based model, and the like; the language model calculates the possibility that a second candidate sequence exists when words from the target vocabulary set are combined into the second candidate sequence.
  • Step 808 Generate a second word graph based on the probability of at least one second candidate sequence.
  • a second word graph is generated from the second candidate vocabulary in the target vocabulary set, and the second word graph is used to indicate the second candidate sequence set composed of the second candidate vocabulary obtained based on the pronunciation.
  • Step 809 generate a first confusion network based on the first word graph.
  • that is, the path with the highest weight is selected from the first word graph as the first initial confusion network, the nodes in this path are the nodes of the confusion network, and the other edges are gradually aligned and added to the first initial confusion network; edges at the same position with the same first candidate word are merged into one and their weights are accumulated, finally obtaining the first confusion network.
  • Step 810 generating a second confusion network based on the second word graph.
  • that is, the path with the highest weight is selected from the second word graph as the second initial confusion network, the nodes in this path are the nodes of the confusion network, and the other edges are gradually aligned and added to the second initial confusion network; edges at the same position with the same second candidate word are merged into one and their weights are accumulated, finally obtaining the second confusion network.
  • Step 811 weighted and combined the first confusion network and the second confusion network to obtain a target confusion network.
  • that is, each edge on the first word graph is multiplied by the coefficient m and each edge on the second word graph is multiplied by the coefficient (1-m), where the value range of m is [0, 1]; if m is greater than 0.5, the final processing result of the speech recognition model leans toward the processing result of the E2E network sub-model, and if m is less than 0.5, it leans toward the processing result of the acoustic processing sub-model. The first word graph and the second word graph multiplied by their coefficients are then merged: the confusion network corresponding to the weighted second word graph is used as the initial confusion network, and starting from the initial confusion network, each edge on the weighted first word graph is traversed and aligned into the initial confusion network until all edges have been added and the merge is completed, obtaining the target confusion network.
  • Step 812 Determine the candidate sequence with the highest sum of connection probabilities among the third candidate words in the target confusion network as the recognition result.
  • the recognition of voice content is realized by setting up a shared network, which absorbs the advantages of the Hybrid voice recognition system and the E2E voice recognition system.
  • Hybrid speech recognition refers to the recognition method that converts speech information into text information by sequentially mapping speech features, phonemes, words, and word strings; the Hybrid speech recognition system is a combination of an Acoustic Model (AM), a Pronunciation Dictionary, a Language Model (LM), and other models.
  • as shown in FIG. 9, the Hybrid speech recognition system 900 includes an acoustic model 901, a pronunciation dictionary 902, and a language model 903; the server can perform feature extraction on the speech information to be recognized to obtain a speech feature sequence, and input the speech feature sequence into the Hybrid speech recognition system 900 to obtain the text information corresponding to the speech information output by the speech recognition system 900.
  • the acoustic model refers to a model used to calculate the mapping probability between speech features and phonemes, where phonemes are the smallest phonetic units divided according to the natural attributes of speech; the natural attributes of speech include physical attributes and physiological attributes. The physical attributes include pitch (the height of the sound, which is determined by the vibration frequency of the sounding body and is proportional to it), sound intensity (the strength of the sound, which is determined by the amplitude of the sounding body and is proportional to it), sound length (the length of the sound, which is determined by the duration of vibration of the sounding body and is proportional to it), and sound quality (the individual character or characteristic of the sound, also called timbre, which is determined by the form of vibration of the sounding body); the physiological attributes indicate the physiological vocalization position and vocalization action of speech.
  • in terms of physical attributes, phonemes are the smallest phonetic units divided from the perspective of sound quality, while in terms of physiological attributes, phonemes are the smallest phonetic units divided according to pronunciation actions, that is, one pronunciation action constitutes one phoneme; for example, the pronunciation "ā" corresponding to "ah" corresponds to one phoneme, and the pronunciation "ài" corresponding to "love" corresponds to two phonemes.
  • the pronunciation dictionary includes the vocabulary set that can be processed by the above-mentioned speech recognition system and the pronunciation corresponding to the vocabulary in the vocabulary set, and provides the mapping between the acoustic model modeling unit and the language model unit.
  • the language model refers to a model used to calculate the mapping probability between words and word strings, that is, used to estimate the possibility of the existence of the target text when the recognized words are combined into the target text.
  • E2E speech recognition refers to end-to-end speech recognition.
  • the E2E speech recognition system no longer contains independent models such as an acoustic model, a pronunciation dictionary, and a language model; instead, the input end (speech feature sequence) and the output end (word string sequence) are directly connected by a neural network, and the neural network takes over the work of all the original models; illustratively, the neural network can be a network model constructed based on the multi-head attention mechanism (Multi-Head Attention, MHA).
  • as shown in FIG. 10, the E2E speech recognition system 1000 includes an encoder (Encoder) 1001, an attention mechanism (Attention) model 1002, and a decoder (Decoder) 1003; the server can perform feature extraction on the speech information to be recognized to obtain a speech feature sequence, and input the speech feature sequence into the E2E speech recognition system 1000 to obtain the text information corresponding to the speech information output by the speech recognition system 1000.
  • in one example, a system implementing the speech recognition method provided by the embodiments of the present application (SNSC, Shared Network System Combination), a Hybrid speech recognition system, and an E2E speech recognition system are tested on the same physical machine, and the test results shown in Table 1 are obtained, where the word error rate indicates the number of wrongly recognized words per 100 recognized words, and the real time factor (RTF) is a value used to measure the decoding speed of the speech recognition system; when the real time factor is equal to or less than 1, the processing is considered real-time.
  • according to the test results, the SNSC system has a lower word error rate than the Hybrid speech recognition system and the E2E speech recognition system, and the real time factor measured for the SNSC system is lower than the sum of the real time factors of the Hybrid speech recognition system and the E2E speech recognition system and meets the service deployment requirement that the real time factor be less than 1; that is, the speech recognition method provided in the embodiments of the present application is efficient and accurate, and its low real time factor meets the conditions for service deployment.
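  • as a tiny illustration of the real time factor metric mentioned above, it is simply the decoding time divided by the audio duration; the numbers below are invented, since the measured values of Table 1 are not reproduced here.

    def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
        return decoding_seconds / audio_seconds

    print(real_time_factor(decoding_seconds=4.2, audio_seconds=10.0))  # 0.42 < 1 -> real-time capable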
  • in summary, in the speech recognition method provided in this embodiment, feature extraction is performed on the speech content to be recognized to obtain intermediate features that can indicate the audio expression characteristics of the speech content, and the intermediate features are then processed in two different ways to obtain two word graphs: the intermediate features are decoded based on the attention mechanism to obtain the first word graph, and feature mapping is performed based on the pronunciation of the speech content to obtain the second word graph.
  • the first word graph and the second word graph are respectively used to indicate the candidate sequence sets composed of the candidate words obtained by the two processing methods, and finally the recognition result is determined according to the connection relationship between the candidate words indicated by the first word graph and the second word graph, thereby realizing the function of converting speech content into text content.
  • since both the first word graph and the second word graph are obtained from the same intermediate features, server resources can be saved.
  • different processing methods are applied to the intermediate features, and the recognition result is then jointly determined from the word graphs obtained by the two processing methods, which improves the accuracy of speech recognition.
  • FIG. 11 shows a flowchart of a training method for a speech recognition model provided by an exemplary embodiment of the present application.
  • in the training method, each functional sub-model in the speech recognition model is trained to obtain a speech recognition model for recognizing speech content; the method includes the following steps.
  • Step 1101 Obtain initialization network parameters.
  • the initialization network parameters are initialization parameters for the shared network sub-model and the E2E network sub-model.
  • the shared network sub-model and the E2E network sub-model form a first training network.
  • the shared network sub-model and the E2E network sub-model can form an E2E speech recognition system.
  • the shared network sub-model is implemented as the encoder in the E2E speech recognition system, and the shared network sub-model consists of two parts: the convolutional neural network and the Transformer.
  • the E2E network sub-model consists of two parts, the Attention processing layer and the decoder; the initialization network parameters include the initial parameters of the convolutional neural network, Transformer, Attention processing layer and decoder.
  • the initialized network parameters are randomly generated by the system.
  • step 1102 the initialized network parameters are trained through a back-propagation algorithm to obtain shared network parameters and E2E network parameters.
  • the training of the initialized network parameters may also be performed by the gradient descent method or other training methods; the back-propagation algorithm is used as an example for description here.
  • Backpropagation Algorithm BP algorithm
  • the training data used to train and initialize network parameters can be the voice information-text information stored in the database It can also be the sample data of voice information-text information obtained from the network.
  • During training, the samples in the training data are submitted one by one to the first training network composed of the shared network sub-model and the E2E network sub-model; the first training network calculates an output y for a sample input x, and the error value between the target value in the sample and y is then obtained through the loss function.
  • The gradient of the loss function is then calculated and the weights of the first training network are updated as a whole; for each sample submitted to the neural network, the update function corresponding to the loss function updates all of the weights once, until the error values corresponding to all samples are less than a preset threshold, that is, the network is trained to convergence.
  • In the embodiment of the present application, the first training network is first trained with the cross entropy loss function (Cross Entropy Loss, CE Loss) until it converges, and then trained with the minimum word error rate loss function until it converges again; the training then ends, and the parameters corresponding to the first training network are obtained.
  • the parameters corresponding to the first training network include shared network parameters corresponding to the shared network sub-model and E2E network parameters corresponding to the E2E network sub-model.
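  • A minimal sketch of this first training stage follows, under the assumption of a data loader yielding (features, token-sequence) pairs with start/end symbols already added: the first training network is optimized with cross-entropy via back-propagation; the second pass with the minimum word error rate loss is only noted in a comment because that criterion is considerably more involved.

```python
# Stage one (hedged sketch): cross-entropy training of shared encoder + E2E head.
import torch
import torch.nn as nn

def train_first_network(shared, e2e_head, loader, epochs=10, lr=1e-4, pad_id=0):
    params = list(shared.parameters()) + list(e2e_head.parameters())
    optim = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss(ignore_index=pad_id)
    for epoch in range(epochs):
        total = 0.0
        for feats, tokens in loader:              # tokens: (batch, out_len) with <sos>/<eos>
            enc_out = shared(feats)               # intermediate features
            logits = e2e_head(enc_out, tokens[:, :-1])     # teacher forcing
            loss = ce(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
            optim.zero_grad()
            loss.backward()                       # back-propagation of the CE loss
            optim.step()
            total += loss.item()
        print(f"epoch {epoch}: mean CE loss {total / max(1, len(loader)):.4f}")
    # A second pass with a minimum word error rate criterion would follow here.
    return shared, e2e_head
```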
  • Step 1103 Based on the shared network parameters, the acoustic processing sub-model is trained to obtain acoustic processing parameters.
  • the shared network sub-model and the acoustic processing sub-model can form a Hybrid speech recognition system.
  • the shared network sub-model and the fully connected layer in the acoustic processing sub-model together act as the acoustic model part of the Hybrid speech recognition system.
  • the shared network sub-model and the acoustic processing sub-model together form the second training network to be trained.
  • the shared network parameters of the shared network sub-model that has been trained are used as part of the parameters of the second training network to participate in the training process of the second training network.
  • The training process of the second training network includes: on the basis of the determined shared network parameters, randomly initializing the fully connected layer, then training the second training network to convergence on the aligned corpus with the cross-entropy loss function, and then applying discriminative training on the prepared word graphs until it converges again; the training is then complete, and the acoustic processing parameters corresponding to the acoustic processing sub-model are obtained.
  • The training data used in the above process (the aligned corpus and the word graphs) can be read from the database.
  • That is, during the training of the speech recognition model, the first training network composed of the shared network sub-model and the E2E network sub-model is trained first, and then the second training network composed of the shared network sub-model and the acoustic processing sub-model is trained.
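  • Continuing with the same assumed components, the second training stage can be sketched as follows: the already-trained shared sub-model supplies the intermediate features, a freshly (randomly) initialized fully connected layer maps them to phoneme posteriors, and frame-level cross-entropy is applied on aligned corpora; the later discriminative training over word graphs is omitted from the sketch.

```python
# Stage two (hedged sketch): train the acoustic branch on top of the shared encoder.
import torch
import torch.nn as nn

def train_acoustic_branch(shared, num_phones, loader, epochs=5, lr=1e-4,
                          d_model=256, finetune_shared=True):
    fc = nn.Linear(d_model, num_phones)           # random initialization of the FC layer
    params = list(fc.parameters()) + (list(shared.parameters()) if finetune_shared else [])
    optim = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()                    # softmax is applied inside the loss
    for _ in range(epochs):
        for feats, frame_phone_ids in loader:     # labels aligned at the encoder frame rate
            enc_out = shared(feats)               # (batch, frames, d_model)
            logits = fc(enc_out)                  # (batch, frames, num_phones)
            loss = ce(logits.reshape(-1, num_phones), frame_phone_ids.reshape(-1))
            optim.zero_grad()
            loss.backward()
            optim.step()
    return fc
```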
  • Step 1104 build a speech recognition model based on the shared network parameters, E2E network parameters and acoustic processing parameters.
  • the shared network sub-model is constructed from the shared network parameters
  • the E2E network sub-model is constructed from the E2E network parameters
  • the acoustic processing sub-model is constructed from the acoustic processing parameters
  • The shared network sub-model, the E2E network sub-model, the acoustic processing sub-model and the result generation sub-model together make up the speech recognition model.
  • In summary, the training method of the speech recognition model provided in the embodiments of the present application first obtains the network parameters of the shared network sub-model and the E2E network sub-model from scratch using the training method of an E2E speech recognition system, and then uses the shared network parameters corresponding to the shared network sub-model in the training of the acoustic processing sub-model.
  • The shared network sub-model and the acoustic processing sub-model are trained as a Hybrid speech recognition system to obtain the network parameters of the acoustic processing sub-model, and the speech recognition model is then jointly constructed from the parameters obtained by the above training, so that the trained speech recognition model can both ensure the accuracy of speech recognition and save the server resources occupied by the whole speech recognition model in the process of realizing speech recognition.
  • It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • For example, the speech content and the model training data involved in this application are all obtained with full authorization.
  • FIG. 12 shows a structural block diagram of a speech recognition apparatus provided by an exemplary embodiment of the present application.
  • the apparatus includes the following modules:
  • an acquisition module 1210 configured to acquire voice content, where the voice content is audio to be recognized;
  • a processing module 1220 configured to perform feature extraction on the voice content to obtain intermediate features, where the intermediate features are used to indicate the audio expression characteristics of the voice content;
  • the first generation module 1230 is configured to decode the intermediate features based on the attention mechanism to obtain a first word graph, where the first word graph is used to indicate a first candidate sequence set composed of first candidate words predicted based on the attention mechanism;
  • the second generation module 1240 is configured to perform feature mapping on the intermediate features based on the pronunciation of the speech content to obtain a second word graph, where the second word graph is used to indicate a second candidate sequence set composed of second candidate words obtained based on the pronunciation;
  • the determining module 1250 is configured to determine the recognition result of the speech content according to the connection relationship between the candidate words indicated by the first word graph and the second word graph.
  • processing module 1220 is further configured to perform feature extraction on the speech content through at least one layer of convolutional network to obtain intermediate sub-features;
  • the processing module 1220 is further configured to perform feature weighting on the intermediate sub-features to obtain the intermediate features.
  • the first generation module 1230 further includes:
  • the first processing unit 1231 is configured to perform feature weighting on the channel indicating the expression of human voice in the intermediate feature based on the attention mechanism, to obtain the first branch feature;
  • the first decoding unit 1232 is configured to decode the first branch feature to obtain the first word graph.
  • the first decoding unit 1232 is further configured to decode the first branch feature through a decoder to obtain the first candidate sequence set;
  • the first generation module 1230 further includes:
  • the first generating unit 1233 is configured to use the corresponding first candidate words in the first candidate sequence set as paths to generate the first word graph.
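  • For illustration only, the toy sketch below shows one way an N-best list of scored candidate words could be stored as a small word graph, with nodes as positions and each edge carrying a candidate word and its score; the positional alignment and the data layout are assumptions of the sketch, not the lattice format of the embodiments.

```python
# Toy word graph built from an N-best list (illustrative data layout).
from collections import defaultdict

def nbest_to_word_graph(nbest):
    """nbest: list of candidate sequences, each a list of (word, score) pairs."""
    graph = defaultdict(list)                     # (start_node, end_node) -> [(word, score)]
    for seq in nbest:
        for pos, (word, score) in enumerate(seq):
            graph[(pos, pos + 1)].append((word, score))
    return graph

nbest = [[("we", 0.9), ("are", 0.8), ("here", 0.7)],
         [("we", 0.9), ("were", 0.5), ("here", 0.6)]]
print(dict(nbest_to_word_graph(nbest)))
```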
  • the second generating module 1240 further includes:
  • a second determining unit 1241 configured to determine the target vocabulary set of the speech to be recognized based on the intermediate feature
  • the second generating unit 1242 is configured to generate the second word graph based on the target vocabulary set.
  • the second determining unit 1241 is further configured to input the intermediate feature into a fully connected layer to obtain the posterior probability of the phonemes of the speech to be recognized, where a phoneme is used to indicate the smallest speech unit divided according to the natural properties of speech;
  • the second determining unit 1241 is further configured to determine the target vocabulary set based on the posterior probability of the phoneme.
  • the second generating module 1240 further includes:
  • the second obtaining unit 1243 is used to obtain a pronunciation dictionary, where the pronunciation dictionary includes the mapping relationship between vocabulary and pronunciation;
  • the second determining unit 1241 is further configured to determine the phonemes at each time point in the speech content according to the posterior probability of the phonemes;
  • the second determining unit 1241 is further configured to determine, according to the pronunciation dictionary, a target vocabulary set that can be composed of phonemes at the various time points.
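  • The chain from phoneme posteriors through the pronunciation dictionary to a target vocabulary set can be illustrated with the toy sketch below; the dictionary entries, the phone table and the posterior values are all invented for the example, and real decoding is of course more elaborate.

```python
# Toy mapping: per-frame phoneme posteriors -> phonemes -> dictionary words.
import numpy as np

pronunciation_dict = {            # word -> phoneme sequence (assumed entries)
    "we":  ("w", "iy"),
    "see": ("s", "iy"),
    "sea": ("s", "iy"),
}

def frame_phonemes(posteriors, phone_table):
    """posteriors: (frames, num_phones) array of per-frame phoneme posteriors."""
    return [phone_table[i] for i in np.argmax(posteriors, axis=1)]

def candidate_words(phoneme_seq, lexicon):
    # Collapse consecutive repeats, then keep any word whose pronunciation appears
    # as a contiguous run of the collapsed phoneme sequence.
    collapsed = [p for i, p in enumerate(phoneme_seq) if i == 0 or p != phoneme_seq[i - 1]]
    hits = set()
    for word, prons in lexicon.items():
        for start in range(len(collapsed) - len(prons) + 1):
            if tuple(collapsed[start:start + len(prons)]) == prons:
                hits.add(word)
    return hits

phone_table = ["w", "iy", "s"]
post = np.array([[0.1, 0.1, 0.8],   # -> "s"
                 [0.1, 0.8, 0.1],   # -> "iy"
                 [0.1, 0.8, 0.1]])  # -> "iy"
phones = frame_phonemes(post, phone_table)
print(phones, candidate_words(phones, pronunciation_dict))  # ['s', 'iy', 'iy'] and {'see', 'sea'}
```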
  • the second determining unit 1241 is further configured to determine the probability of at least one second candidate sequence composed of the target vocabulary set
  • the second generating unit 1242 is further configured to generate the second word graph based on the probability of the at least one second candidate sequence.
  • the determining module 1250 further includes:
  • the generating unit 1251 is configured to generate a target confusion network based on the first word graph and the second word graph, where the target confusion network includes connection probabilities between third candidate words that form candidate sequences, the third candidate words are determined from the first candidate words and the second candidate words, and the connection probabilities between the third candidate words are obtained by weighting and merging the first connection relationship between the first candidate words and the second connection relationship between the second candidate words;
  • the determining unit 1252 is configured to determine the candidate sequence with the highest sum of connection probabilities among the third candidate words in the target confusion network as the recognition result.
  • the generating unit 1251 is further configured to generate a first confusion network based on the first word graph, where the first confusion network includes connection probabilities between the first candidate words in the first candidate sequence set;
  • the generating unit 1251 is further configured to generate a second confusion network based on the second word graph, where the second confusion network includes connection probabilities between the second candidate words in the second candidate sequence set;
  • the determining unit 1252 is further configured to perform a weighted combination of the first confusion network and the second confusion network to obtain the target confusion network.
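  • The weighted combination and best-path selection performed by these units can be illustrated with the simplified sketch below; it assumes the two confusion networks have already been aligned to the same slots (real word-graph alignment is more involved), and the mixing weight m simply balances the contribution of the two branches.

```python
# Simplified fusion of two pre-aligned confusion networks and best-path readout.
def merge_confusion_networks(cn_e2e, cn_acoustic, m=0.5):
    """cn_*: list of slots; each slot is a dict mapping word -> connection probability."""
    merged = []
    for slot_a, slot_b in zip(cn_e2e, cn_acoustic):
        slot = {}
        for word in set(slot_a) | set(slot_b):
            slot[word] = m * slot_a.get(word, 0.0) + (1 - m) * slot_b.get(word, 0.0)
        merged.append(slot)
    return merged

def best_path(confusion_network):
    # Left to right, keep the highest-scoring edge of every slot.
    return [max(slot, key=slot.get) for slot in confusion_network]

cn1 = [{"A": 0.7, "B": 0.3}, {"B": 0.6, "C": 0.4}]   # from the first word graph
cn2 = [{"A": 0.5, "D": 0.5}, {"B": 0.9, "C": 0.1}]   # from the second word graph
target = merge_confusion_networks(cn1, cn2, m=0.5)
print(target)             # merged slot probabilities
print(best_path(target))  # ['A', 'B']
```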
  • In summary, the speech recognition apparatus provided in the embodiments of the present application performs feature extraction on the speech content to be recognized to obtain an intermediate feature that indicates the audio expression characteristics of the speech content, and then processes the intermediate feature in two different ways to obtain two word graphs.
  • The two processing ways include decoding the intermediate feature based on the attention mechanism to obtain a first word graph, and performing feature mapping on the intermediate feature based on the pronunciation of the speech content to obtain a second word graph; the first word graph and the second word graph respectively indicate the candidate sequence sets composed of the candidate words obtained by the two processing ways.
  • Finally, the recognition result is determined according to the connection relationship between the candidate words indicated by the first word graph and the second word graph, so as to realize the function of converting speech content into text content.
  • Since the first word graph and the second word graph are both obtained from the same intermediate feature, server resources can be saved; at the same time, different processing ways are applied to the intermediate feature and the recognition result is jointly determined from the word graphs obtained by the two processing ways, which improves the accuracy of speech recognition.
  • It should be noted that the speech recognition apparatus provided in the above embodiments is illustrated only with the division of the above functional modules as an example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the speech recognition device and the speech recognition method embodiments provided by the above embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiments, which will not be repeated here.
  • FIG. 14 shows a schematic structural diagram of a server provided by an exemplary embodiment of the present application.
  • the server may include the following structure.
  • The server 1400 includes a central processing unit (CPU) 1401, a system memory 1404 including a random access memory (RAM) 1402 and a read-only memory (ROM) 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401.
  • The server 1400 also includes a mass storage device 1406 for storing an operating system 1413, application programs 1414 and other program modules 1415.
  • Computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • the system memory 1404 and mass storage device 1406 described above may be collectively referred to as memory.
  • According to various embodiments of the present application, the server 1400 may also run by connecting, through a network such as the Internet, to a remote computer on the network. That is, the server 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or the network interface unit 1411 may be used to connect to other types of networks or remote computer systems (not shown).
  • the above-mentioned memory also includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
  • The embodiments of the present application further provide a computer device, the computer device includes a processor and a memory, the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the speech recognition method provided by the above method embodiments.
  • the computer device may be a terminal or a server.
  • Embodiments of the present application further provide a computer-readable storage medium, where at least one piece of program code is stored on the computer-readable storage medium, and the program code is loaded and executed by a processor to implement the speech recognition method provided by the above method embodiments.
  • Embodiments of the present application also provide a computer program product, where the computer program product includes at least one computer program.
  • the processor of the computer device reads the computer program from the computer program product, and the processor executes the computer program to implement the speech recognition method described in any of the above embodiments.
  • the computer-readable storage medium may include: a read-only memory, a random access memory, a solid-state drive (SSD, Solid State Drives), an optical disc, and the like.
  • the random access memory may include a resistive random access memory (ReRAM, Resistance Random Access Memory) and a dynamic random access memory (DRAM, Dynamic Random Access Memory).

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognition method, apparatus, device, storage medium and program product, relating to the field of computer technology. The method includes: acquiring speech content (401); performing feature extraction on the speech content to obtain an intermediate feature, the intermediate feature being used to indicate the audio expression characteristics of the speech content (402); decoding the intermediate feature based on an attention mechanism to obtain a first word graph (403); performing feature mapping on the intermediate feature based on the pronunciation of the speech content to obtain a second word graph (404); and determining a recognition result of the speech content according to the connection relationship between the candidate words indicated by the first word graph and the second word graph (405). With this method, different processing ways can be performed on the intermediate feature without wasting server resources, and the processing result is then jointly determined from the word graphs obtained by the two processing ways, thereby improving the accuracy of speech recognition.

Description

语音识别方法、装置、设备、存储介质及程序产品
本申请要求于2021年04月26日提交的申请号为202110451736.0、发明名称为“语音识别方法、装置、设备及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别涉及一种语音识别方法、装置、设备、存储介质及程序产品。
背景技术
语音识别是指将接收到的语音信息转化为文本信息,许多应用均提供有语音转文本服务;其中,语音识别包括流式语音识别和非流式语音识别,流式语音识别对实时性的要求高于非流式语音识别对实时性的要求。针对非流式语音识别,常见的语音识别系统有传统语音识别系统以及E2E(End-to-End,端到端)语音识别系统。
在相关技术中,传统语音识别系统是通过语音特征、音素、词语、词串之间的依次映射关系来将语音信息转化为文本信息;传统语音识别系统由声学模型、发音词典以及语言模型等多个模型组合而成。而E2E语音识别系统是将输入端和输出端之间通过多头注意力机制,来实现上述传统语音识别系统中的多个模型对应的工作内容。
然而,传统语音识别系统中包括多个模型,由于各个模型之间的信息传递存在信息损失,其对应的识别性能具有一定的局限性,使得识别准确率较低。
发明内容
本申请实施例提供了一种语音识别方法、装置、设备、存储介质及程序产品。该技术方案如下。
一方面,提供了一种语音识别方法,所述方法由计算机设备执行,所述方法包括:
获取语音内容,所述语音内容为待识别的音频;
对所述语音内容进行特征提取,得到中间特征,所述中间特征用于指示所述语音内容的音频表达特性;
基于注意力机制对所述中间特征进行解码,得到第一词图,所述第一词图用于指示基于所述注意力机制预测得到的第一候选词汇组成的第一候选序列集;
基于所述语音内容的发音情况对所述中间特征进行特征映射,得到第二词图,所述第二词图用于指示基于所述发音情况得到的第二候选词汇组成的第二候选序列集;
根据所述第一词图和所述第二词图指示的候选词汇之间的连接关系,确定所述语音内容的识别结果。
另一方面,提供了一种语音识别装置,所述装置包括:
获取模块,用于获取语音内容,所述语音内容为待识别的音频;
处理模块,用于对所述语音内容进行特征提取,得到中间特征,所述中间特征用于指示所述语音内容的音频表达特性;
第一生成模块,用于基于注意力机制对所述中间特征进行解码,得到第一词图,所述第一词图用于指示基于所述注意力机制预测得到的第一候选词汇组成的第一候选序列集;
第二生成模块,用于基于所述语音内容的发音情况对所述中间特征进行特征映射,得到 第二词图,所述第二词图用于指示基于所述发音情况得到的第二候选词汇组成的第二候选序列集;
确定模块,用于根据所述第一词图和所述第二词图指示的候选词汇之间的连接关系,确定所述语音内容的识别结果。
另一方面,提供了一种计算机设备,所述设备包括处理器和存储器,所述存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行以实现本申请实施例中任一所述的语音识别方法。
另一方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条程序代码,所述程序代码由处理器加载并执行以实现本申请实施例中任一所述的语音识别方法。
另一方面,提供了一种计算机程序产品,该计算机程序产品包括至少一条计算机程序,所述计算机程序由处理器加载并执行以实现上述实施例中任一所述的语音识别方法。
本申请的提供的技术方案至少包括以下有益效果。
针对待识别的语音内容,对语音内容进行特征提取,得到能够指示语音内容的音频表达特性的中间特征,然后通过两种不同的处理方式对该中间特征进行处理,得到两个词图,其中两种不同的处理方式包括基于注意力机制对中间特征进行解码,得到第一词图,以及基于语音内容的发音情况进行特征映射,得到第二词图,第一词图和第二词图分别用于指示通过上述两种处理方式得到的候选词汇组成的候选序列集,最后根据第一词图和第二词图指示的候选词汇之间的连接关系确定出识别结果,以实现将语音内容转换为文本内容的功能。由于第一词图和第二词图均为通过同一中间特征得到,因此能够节省服务器资源,同时,对中间特征执行不同的处理方式,再根据两种处理方式获得的词图共同确定处理结果,提高了语音识别的准确度。
附图说明
图1是本申请一个示例性实施例提供的实施环境示意图;
图2是本申请一个示例性实施例提供的语音识别应用场景示意图;
图3是本申请另一个示例性实施例提供的语音识别应用场景示意图;
图4是本申请一个示例性实施例提供的语音识别方法流程图;
图5是本申请一个示例性实施例提供的第一词图的形式示意图;
图6是本申请一个示例性实施例提供的混淆网络形式示意图;
图7是本申请一个示例性实施例提供的语音识别模型结构示意图;
图8是本申请一个示例性实施例提供的语音识别方法流程图;
图9是本申请一个示例性实施例提供的Hybrid语音识别系统示意图;
图10是本申请一个示例性实施例提供的E2E语音识别系统示意图;
图11是本申请一个示例性实施例提供的语音识别模型的训练方法流程图;
图12是本申请一个示例性实施例提供的语音识别装置结构框图;
图13是本申请另一个示例性实施例提供的语音识别装置结构框图;
图14是本申请一个示例性实施例提供的服务器的结构示意图。
具体实施方式
示意性的,对本申请实施例的实施环境进行说明;请参考图1,该实施环境中包括终端101、服务器102和通信网络103。
终端101可以是手机、平板电脑、电子书阅读器、多媒体播放设备、可穿戴设备、膝上型便携计算机、台式计算机或语音识别一体机等电子设备。示意性的,终端101中安装有用 于语音识别的应用程序,通过该应用程序可以实现对待识别语音内容的文本转换。该语音识别应用程序可以是传统应用软件,可以是云应用软件,可以实现为宿主应用程序中的小程序或应用模块,也可以是某个网页平台,在此不进行限定。
服务器102用于向终端101提供语音识别服务。终端101将待识别语音内容通过通信网络103传输至服务器102,相应的,服务器102接收终端101上传的待识别语音内容;服务器102调用语音识别模型对待识别语音内容进行识别,生成对应的文本内容,并将该文本内容通过通信网络103返回至终端101。可选的,服务器102为物理服务器或云服务器。
在一些实施例中,上述服务器102还可以实现为区块链系统中的节点。
服务器102可以通过通信网络103与终端101建立通信连接。该网络可以是无线网络,也可以是有线网络。
结合上述实施环境,对本申请实施例的应用场景进行说明。
本申请实施例中提供的语音识别方法,可用于流式语音识别服务,也可用于非流式语音识别服务;在本申请实施例中,以该方法应用于非流式语音识别服务中为例进行说明。示意性的,本申请实施例提供的语音识别方法可以应用于包括但不限于如下场景中的至少一种场景。
第一,该语音识别服务应用于社交软件中的对接收到的语音信息进行文本转换的场景;例如,目标对象在社交软件中接收到一条语音信息,例如,聊天过程中接收到其他对象发送的语音条、在动态界面中刷到其他对象发布的一条语音动态等。目标对象可通过该语音识别服务,将语音内容转换为文本内容进行显示,保证了目标对象在不方便通过播放语音的方式接收信息时时,也能及时获取到该语音内容的消息内容。如图2所示,在聊天界面200中,目标对象接收到其他对象发送的语音信息201,目标对象可以通过长按该语音信息201对应的控件,调取菜单控件202,菜单控件202中包括用于提供语音转文本服务的子控件,目标对象通过点击该子控件对接收到的语音信息进行文本信息的转换。终端在接收到对上述子控件的触发操作时,将该语音信号上传至服务器,由服务器进行语音识别,转换为文本信息,将文本信息返回至终端;终端接收服务器返回的文本信息,并在聊天界面200中的预设区域203中进行显示。
第二,该语音识别服务可以应用于输入法软件提供的语音输入功能中,例如,目标对象通过输入法软件中的预设控件进行语音输入,终端将采集到的语音信号发送至服务器,服务器对该语音信号进行处理,得到与语音信号对应的文本信息,将该文本信息返回至终端;终端将该文本信息作为目标对象语音输入的内容进行显示。示意性的,服务器可以返回一条文本信息,也可以返回由该语音信息确定的多条相似的文本信息提供给目标对象选择。如图3所示,在输入软件区域300中,目标对象可以通过语音输入控件301进行语音输入,当目标对象点击语音输入控件301后,终端会调用麦克风录制目标对象的语音信息,当目标对象再次点击语音输入控件301后,终端确定语音信息录制完毕,并将该语音信息上传至服务器,服务器反馈识别得到的多个文本信息302,终端对该多个文本信息302进行显示,目标对象可以从多个文本信息302中,对符合自己想法的文本信息进行选择,输入框303内就会显示多个文本信息302中被目标对象选择的文本信息。
第三,该语音识别服务可以应用于视频软件中的字幕自动生成功能中,例如,目标对象通过视频软件进行视频的发布,在发布目标视频之前,将该目标视频上传至视频软件,视频软件可以为目标对象提供一些视频处理功能,其中可以包括字幕自动生成功能,服务器通过对接收到的目标视频进行音频提取,并对该音频进行语音识别,生成文本信息,将该文本信息返回至终端,目标对象可以选择将生成的文本信息作为目标视频的字幕添加至目标视频中。
示意性的,本申请实施例提供的语音识别方法也可以应用于其他应用场景,在此仅进行举例说明,并不对具体应用场景进行限定。
在本申请实施例中,当涉及到对语音内容进行语音识别时,为了保证语音识别操作的合法性,服务器可以指示终端在终端界面上显示授权询问信息,在接收到基于该授权询问信息的确定操作后,服务器确认获取到与授权询问信息对应的相关信息的处理权限。
其中,该授权询问信息可以包括消息内容授权询问信息,或者输入语音授权询问信息中的至少一种;当授权询问信息中包含消息内容授权询问信息时,在服务器接收到目标对象对该授权询问信息的确定操作后,确定可以获取目标对象在社交软件中接收到语音信息;当授权询问信息中包含输入语音授权询问信息时,在服务器接收到目标对象对该授权询问信息的确定操作后,确定可以获取目标对象输入的语音内容;本申请对授权询问信息的内容不进行限制。
请参考图4,其示出了本申请一个示例性实施例提供的语音识别方法流程图,在本申请实施例中,该语音识别方法可以由计算机设备执行,以该方法由上述实施环境中的服务器执行为例进行说明,该方法包括如下步骤。
步骤401,获取语音内容,语音内容为待识别的音频。
服务器获取语音内容,该语音内容为待识别的音频。
可选的,终端将录制得到的音频进行压缩处理,将压缩音频和语音转文本请求利用网络协议打包,通过通信网络送往服务器。服务器在接收终端发送的语音转文本请求后,将与该语音转文本请求对应的压缩音频进行解压,得到上述待识别的语音内容。示意性的,服务器也可以从数据库中获取语音内容,在此不进行限定。
服务器在获取到语音内容后,根据语音转文本请求调用语音识别模型对该语音内容进行识别。
步骤402,对语音内容进行特征提取,得到中间特征。
中间特征用于指示语音内容的音频表达特性。
在本申请实施例中,语音识别模型包括共享网络(Shared Network)子模型,该共享网络子模型用于对语音内容进行特征提取,得到能够指示语音内容的音频表达特性的中间特征;也就是说,服务器可以通过语音识别模型中的共享网络子模型对语音内容进行特征提取,得到中间特征。其中,该语音识别模型可以被称为语音识别模块,该共享网络子模型可以称为共享网络子模块。
示意性的,共享网络子模型中包括至少一层卷积神经网络(Convolutional Neural Networks,CNN);服务器可以通过共享网络子模型中包含的至少一层卷积网络对语音内容进行特征抽取,得到中间子特征;之后,对中间子特征进行特征加权,得到中间特征。
示意性的,语音内容在被输入至共享网络子模型之前,需要进行预处理;该预处理包括将语音内容转化为语音特征序列,即通过信号处理技术,从与输入的语音内容对应的语音信号中提取的特征,通过特征向量的表示形式供共享网络子模型进行后续处理,以尽可能降低环境噪声、信道、说话人等因素对特征提取造成的影响。在一个示例中,该预处理包括降噪处理、采样处理、预加重处理、加窗分帧处理等处理中的至少一种。降噪处理为通过预设滤波器对语音信号进行降噪,以保证对语音信号中人声语音识别的准确性;采样处理为将作为模拟信号的语音信号转化为数字信号;预加重处理为对语音的高频部分进行加重,去除口唇辐射的影响,增加语音的高频分辨率;加窗分帧处理为采用可移动的有限长度窗口对语音信号进行加权,然后对各帧通过相关滤波器进行变换或运算,以实现将语音信号分为一些短段(分析帧)来进行处理。
将对语音内容进行预处理后得到的语音特征序列输入至共享网络子模型,得到中间特征。示意性的,该共享网络子模型包括至少一层卷积神经网络,至少一层卷积神经网络可以对语音内容进行特征抽取,以得到中间子特征,该中间子特征是相较于语音特征序列更高层次的特征表达。
该共享网络子模型还包括Transformer(深度自注意力变换网络),Transformer获取中间子特征,对该中间子特征进行至少一次增加自注意力机制的加权,从而输出中间特征。示意性的,该共享网络子模型还可以包括LSTM(Long Short-Term Memory,长短期记忆网络)、BLSTM(Bi-directional Long Short-Term Memory,双向长短期记忆网络)、DFSMN(Deep Feedforward Sequential Memory Networks,深度前馈顺序存储网络)等网络中的至少一种网络来对中间子特征进行处理,从而得到中间特征,在此不进行限定。
步骤403,基于注意力机制对中间特征进行解码,得到第一词图。
其中,在神经网络的隐藏层中,注意力机制(Attention Mechanism)根据具体任务目标,对关注的方向和加权模型进行调整。通过增加注意力机制的加权,使不符合注意力模型的内容弱化或者遗忘。若关注的方向基于自身,则称之为自身注意力(Self-Attention)机制。而将输入分为多个head(头),形成多个子空间,在每个子空间完成注意力机制之后重新组合,称之为多头注意力机制(Multi-Headed Attention,MHA),多头注意力机制可让模型在不同的子空间里学习到相关的信息。
第一词图用于指示基于注意力机制预测得到的第一候选词汇组成的第一候选序列集。
在本申请实施例中,语音识别模型包括E2E网络子模型,该E2E网络子模型用于对中间特征基于注意力机制进行解码处理,得到第一词图;也就是说,服务器可以基于注意力机制,通过E2E网络子模型对中间特征进行解码,得到第一词图。示意性的,E2E网络子模型可以基于注意力机制对中间特征中指示人声语音表达的通道进行特征加权,得到第一分支特征;对第一分支特征进行解码,得到第一词图。其中,该E2E网络子模型可以称为E2E网络子模块;E2E网络子模型用于指示通过基于注意力机制的神经网络实现端到端语音识别的识别模型。
示意性的,E2E网络子模型中包括Attention(注意力)处理层,Attention处理层作为整个E2E网络中的隐藏层,用于根据预设的任务目标对特征处理过程中关注的方向以及对特征进行加权的加权模型进行调整,即通过增加注意力机制的特征加权操作,使不符合关注的方向的语音特征弱化或遗忘,其中,该关注的方向是语音识别模型在被训练过程中确定的。因此,E2E网络子模型在接收到中间特征后,将该中间特征输入至Attention处理层,得到第一分支特征。
该Attention处理层可以实现为AED(Attention-based Encoder-Decoder,基于注意力机制的编码-解码)模型,该模型是用于解决序列到序列映射问题的模型,通过MHA控制编码序列与解码序列的不等长映射,完成E2E语音识别系统的构建。
示意性的,E2E网络子模型中还包括解码网络,该解码网络用于对第一分支特征进行解码,得到第一词图;在一个示例中,上述Attention处理层实现了中间特征与第一分支特征之间的不等长映射,解码网络对该第一分支特征进行解码,确定由第一候选词汇组成的第一候选序列集,即多条最优候选路径(N-best),根据上述多条最优候选路径生成第一词图,即通过解码器对第一分支特征进行解码,得到第一候选序列集;将第一候选序列集中对应的第一候选词汇作为路径,生成第一词图。该解码器是语音识别系统预先通过训练数据训练得到的。以词图(Lattice)的方式保存N-best能够防止占用过多的内容空间,Lattice本质上是一个有向无环图,图中的每个节点代表由第一分支特征确定的候选词汇的结束时间点,每条边代表一个可能的候选词汇,以及该候选词汇的得分,候选词汇的得分用于指示候选词汇被确定为处理结果中的词汇的可能性。在一个示例中,请参考图5,其示出了第一词图500的形式,其中,第一词图500中各个节点之间的边501表示为第一候选词汇和第一候选词的得分。
步骤404,基于语音内容的发音情况对中间特征进行特征映射,得到第二词图。
第二词图用于指示基于发音情况得到的第二候选词汇组成的第二候选序列集。
在本申请实施例中,语音识别模型还包括声学处理子模型,该声学处理子模型用于对中间特征基于发音情况进行特征映射,得到第二词图;也就是说,服务器可以基于语音内容的 发音情况,通过声学处理子模型对中间特征进行特征映射,得到第二词图。
示意性的,基于语音内容的发音情况,通过声学处理子模型对中间特征进行特征映射,得到第二词图的过程可以实现为:
基于中间特征确定待识别语音的目标词汇集;
基于目标词汇集生成第二词图。
示意性的,声学处理子模型在接收到中间特征后,将该中间特征输入至全连接层,得到待识别语音的音素的后验概率;基于该音素的后验概率,确定目标词汇集;其中,音素用于指示根据语音的自然属性划分的最小语音单位。在一个示例中,该全连接层由带软最大化激活函数(softmax)组成。
在本申请实施例中,声学处理子模型中还包括发音词典单元。发音词典单元可以根据接收到的待识别语音的音素的后验概率,确定出待识别语音的目标词汇集。示意性的,发音词典单元中存储有发音词典,该发音词典记录有词汇集合及与词汇集合中的词汇对应的发音,即该发音词典包括词汇与发音之间的映射关系。
基于此,基于音素的后验概率,确定目标词汇集的过程可以实现为:
获取发音词典;
根据音素的后验概率,确定语音内容中各个时间点的音素;
根据发音词典,确定各个时间点的音素组成的目标词汇集。
在本申请实施例中,声学处理子模型中还包括语言模型单元。语言模型单元用于基于由发音词典单元确定的目标词汇集,确定该目标词汇集的第二候选序列集。示意性的,该语言模型单元可以由n-gram语言模型、基于前馈神经网络的模型以及基于循环神经网络的模型等语言模型中的至少一种模型组成,也可以由其他语言模型组成,在此不进行限定。语言模型可以在确定由目标词汇集组成第二候选序列时,计算第二候选序列存在的可能性。
在本申请实施例中,第二词图的形式与第一词图的形式相同,在此不进行赘述。
在本申请实施例中,该中间特征用于同时输入语音识别模型中的E2E网络子模型以及声学处理模型;也就是说,获取第一词图和第二词图的过程可以同步进行。
步骤405,根据第一词图和第二词图指示的候选词汇之间的连接关系,确定语音内容的识别结果。
在本申请实施例中,第一词图指示基于注意力机制预测得到的第一候选词汇组成的第一候选序列集,第二词图指示基于发音情况预测得到的第二候选词汇组成的第二候选序列集。即,第一词图指示出了第一候选词汇之间的连接关系,第二词图指示出了第二候选词汇之间的连接关系。
在本申请实施例中,语音识别模型还包括结果生成子模型,该结果生成子模型用于对E2E网络子模型和声学处理子模型各自的输出结果进行处理,生成语音内容的识别结果。其中,该结果生成子模型可以称为结果生成子模块。
示意性的,结果生成子模型接收第一词图和第二词图,并根据第一词图和第二词图确定候选序列集合,在一个示例中,该候选序列集合包括第一候选序列集对应的候选序列和第二候选序列集对应的候选序列。或者,服务器可以从第一候选序列集和第二候选序列集中分别获取n个候选序列,将上述2n个候选序列确定为候选序列集合,其中,n为正整数。在一个示例中,服务器可以根据第一候选序列在第一候选序列集中的序列得分或第二候选序列在第二候选序列集中的序列得分确定候选序列集合,其中,该序列得分是由组成序列的候选词汇的得分确定的。结果生成子模型可以从上述候选序列集合中确定至少一个候选序列作为识别结果。
示意性的,结果生成子模型还可以根据第一词图和第二词图生成目标混淆网络,由目标混淆网络确定识别结果,该目标混淆网络中包括组成候选序列的第三候选词汇之间的连接概率,第三候选词汇是从第一候选词汇和第二候选词汇中确定的,第三候选词汇之间的连接概 率通过对第一候选词汇之间的第一连接关系和第二候选词汇之间的第二连接关系进行加权合并得到。其中,目标混淆网络对应的第三候选词汇,可以是第一候选词汇和第二候选词汇的并集,也可以由预设数量的第一候选词汇和预设数量的第二候选词汇组成,第一候选词的预设数量与第二候选词的预设数量可以相同或不同;其中,以第三候选词汇由预设数量的第一候选词汇和预设数量的第二候选词汇组成为例,按照预设规则从第一候选词汇中选择预设数量的候选词汇,从第二候选词汇中选择预设数量的候选词汇,将从第一候选词汇中选择的候选词汇和从第二候选词汇中选择的候选词汇取并集,得到第三候选词汇,并由第三候选词汇组成目标混淆网络;该预设规则可以根据E2E网络子模型和声学处理子模型之间的权重确定。其中,目标混淆网络中每个节点之间的每一条边对应为一个第三候选词汇及第三候选词汇的得分,第三候选词汇的得分用于指示该第三候选词汇与前后候选词汇之间的连接概率,该连接概率由第一连接关系和第二连接关系确定;该连接概率用以指示第三候选词汇与前后候选词之间具有连接关系的概率。
通过目标混淆网络确定识别结果的方式为:按照从左向右的顺序,遍历目标混淆网络的每个节点,并将两个节点之间的候选词汇对应得分最高的边互相拼接,形成一条路径,该路径即为目标混淆网络中得分最高的一条路径,而该路径所形成的候选序列,即为语音内容的识别结果。
请参考图6,其示出混淆网络600的形式,混淆网络600包括多个节点,节点之间的连线601与词图中的边相对应,即每条连线601代表候选词汇以及候选词汇的得分,该得分用于指示各个候选词汇之间的连接概率。在一个示例中,目标混淆网络为图6中示出的混淆网络600,则根据混淆网络600指示的各个候选词汇之间的连接概率,确定出的处理结果为:ABBC。
由词图生成混淆网络的方法包括:步骤a,从词图中选择一条权重最高的路径当作初始混淆网络,路径中的节点即为混淆网络中的节点;步骤b,逐步将其他的边对齐添加到上述初始混淆网络中,同位置且同词语的边合并为一条,并将权重进行累加。
在本申请实施例中,结果生成子模型还可以根据第一词图生成第一混淆网络,根据第二词图生成第二混淆网络,将第一混淆网络和第二混淆网络根据预设加权规则进行加权合并,得到目标混淆网络,示意性的,该预设加权规则由系统预设,在一个示例中,对第一混淆网络和第二混淆网络的加权合并过程包括:步骤a,对第一词图上的每一条边乘以系数m,对第二词图上的每一条边乘以系数(1-m)。其中m的取值范围为[0,1],例如,m=0.49或m=0.5等,若m大于0.5,则表示该语音识别模型的最终处理结果侧重E2E网络子模型的处理结果,若m小于0.5,则表示该语音识别模型的最终处理结果侧重声学处理子模型的处理结果;步骤b,对两个乘以系数之后第一词图和第二词图进行合并,在一个示例中,将加权后的第二词图对应的混淆网络作为初始混淆网络,并以该初始混淆网络为起点,遍历加权后的第一词图上的每一条边往初始混淆网络上进行对齐添加,直到全部添加则完成合并。
示意性的,如图7所示,其示出了上述语音识别模型700的结构,语音信息输入至共享网络子模型710,共享网络子模型710对该语音信息进行特征提取,得到中间特征,其中,共享网络子模型710中包括卷积神经网络711和Transformer712。中间特征被同时输入至E2E网络子模型720和声学处理子模型730。E2E网络子模型720对中间特征进行处理,输出第一词图,将第一词图输入至结果生成子模型740,其中,E2E网络子模型720包括注意力机制(Attention)处理层721和解码网络(Decoder)722。声学处理子模型730对中间特征进行处理,输出第二词图,将第二词图输入至结果生成子模型740,其中,声学处理子模型730包括全连接层(softmax)731、发音词典单元(Lexicon)732和语言模型单元(LM)733。由结果生成子模型740根据第一词图和第二词图生成处理结果,该处理结果包括至少一条与语音内容对应的文本信息。
综上所述,本申请实施例提供的语音识别方法,针对待识别的语音内容,对语音内容进 行特征提取,得到能够指示语音内容的音频表达特性的中间特征,然后通过两种不同的处理方式对该中间特征进行处理,得到两个词图,其中两种不同的处理方式包括基于注意力机制对中间特征进行解码,得到第一词图,以及基于语音内容的发音情况进行特征映射,得到第二词图,第一词图和第二词图分别用于指示通过上述两种处理方式得到的候选词汇组成的候选序列集,最后根据第一词图和第二词图指示的候选词汇之间的连接关系确定出识别结果,以实现将语音内容转换为文本内容的功能。由于第一词图和第二词图均为通过同一中间特征得到,因此能够节省服务器资源,同时,对中间特征执行不同的处理方式,再根据两种处理方式获得的词图共同确定处理结果,提高了语音识别的准确度。
请参考图8,其示出了本申请一个示例性实施例提供的语音识别方法流程图,该语音识别方法可以由计算机设备执行,该方法包括如下步骤。
步骤801,获取语音内容。
上述语音内容为待识别的音频。示意性的,服务器可以从终端获取语音内容,也可以从数据库中获取语音内容,在此不进行限定。
步骤802,对语音内容进行特征提取,得到中间特征。
在本申请实施例中,可以通过语音识别模型中的共享网络子模型对语音内容进行特征提取,得到中间特征。
中间特征用于指示语音内容的音频表达特性;该中间特征用于同时输入语音识别模型中的端到端E2E网络子模型和声学处理子模型。
对语音内容进行预处理,得到语音特征序列。将语音特征序列通过包括至少一层卷积神经网络和Transformer的共享网络进行特征提取,得到中间特征。
其中,基于该中间特征,通过步骤803~步骤804得到第一词图,通过步骤805~步骤808得到第二词图。
步骤803,基于注意力机制对中间特征中指示人声语音表达的通道进行特征加权,得到第一分支特征。
在本申请实施例中,可以基于注意力机制,通过E2E网络子模型对中间特征中指示人声语音表达的通道进行特征加权,得到第一分支特征。
通过注意力机制根据语音识别过程中关注的方向,对中间特征进行加权处理,得到第一分支特征。
步骤804,对第一分支特征进行解码,得到第一词图。
通过解码器对第一分支特征进行解码,解码器根据第一分支特征确定第一候选词汇,以及第一候选词汇在语音信息对应的各个时间节点中的得分,根据上述第一候选词汇以及第一候选词汇的得分生成第一词图,第一词图用于指示基于注意力机制预测得到的第一候选词汇组成的第一候选序列集。
步骤805,将中间特征输入至全连接层,得到待识别语音的音素的后验概率。
示意性的,该全连接层由带软最大化激活函数(softmax)组成。
步骤806,基于音素的后验概率和发音词典,确定目标词汇集。
示意性的,根据发音词典中记录的词汇与发音之间的映射关系,由待识别语音的音素的后验概率,确定语音内容中包括哪些第一候选词汇,由上述第一候选词汇组成目标词汇集。即获取发音词典,发音词典包括词汇与发音之间的映射关系;根据上述由全连接层确定的音素的后验概率,确定语音内容中各个时间点的音素;根据发音词典,确定各个时间点的音素所能够组成的目标词汇集。
步骤807,确定目标词汇集组成的至少一个第二候选序列的概率。
将上述目标词汇集输入至语言模型中,确定至少一个第二候选序列及至少一个第二候选序列对应的概率,示意性的,该语言模型可以是n-gram语言模型、基于前馈神经网络的模型、 基于循环神经网络的模型等语言模型中的至少一种。语言模型可以计算由目标词汇集组成第二候选序列时,第二候选序列存在的可能性。
步骤808,基于至少一个第二候选序列的概率,生成第二词图。
根据第二候选序列存在的可能性,将目标词汇集中的第二候选词汇生成第二词图,第二词图用于指示基于发音情况得到的第二候选词汇组成的第二候选序列集。
步骤809,基于第一词图生成第一混淆网络。
从第一词图中选择一条权重最高的路径当作第一初始混淆网络,词图路径中的节点即为混淆网络中的节点,逐步将其他的边对齐添加到上述第一初始混淆网络中,同位置且同第一候选词汇的边合并为一条,并将权重进行累加,最终得到第一混淆网络。
步骤810,基于第二词图生成第二混淆网络。
从第二词图中选择一条权重最高的路径当作第二初始混淆网络,词图路径中的节点即为混淆网络中的节点,逐步将其他的边对齐添加到上述第二初始混淆网络中,同位置且同第二候选词汇的边合并为一条,并将权重进行累加,最终得到第二混淆网络。
步骤811,将第一混淆网络和第二混淆网络进行加权合并,得到目标混淆网络。
对第一词图上的每一条边乘以系数m,对第二词图上的每一条边乘以系数(1-m)。其中m的取值范围为[0,1],若m大于0.5,则表示该语音识别模型的最终处理结果侧重E2E网络子模型的处理结果,若m小于0.5,则表示该语音识别模型的最终处理结果侧重声学处理子模型的处理结果,对两个乘以系数之后第一词图和第二词图进行合并。在一个示例中,将加权后的第二词图对应的混淆网络作为初始混淆网络,并以该初始混淆网络为起点,遍历加权后的第一词图上的每一条边往初始混淆网络上进行对齐添加,直到全部添加则完成合并,得到目标混淆网络。
步骤812,将目标混淆网络中第三候选词汇之间连接概率之和最高的候选序列,确定为识别结果。
按照从左向右的顺序,遍历目标混淆网络的每个节点,并将两个节点之间的候选词汇对得分最高的边互相拼接,形成一条路基,该路径即为目标混淆网络中得分最高的一条路径,而该路径所形成的候选序列,即为语音内容的识别结果。
在本申请实施例中,通过设置共享网络实现了对语音内容的识别,其吸收了Hybrid语音识别系统和E2E语音识别系统的优点。
其中,Hybrid语音识别(Hybrid Speech Recognition):是指通过对语音特征、音素、词语、词串进行依次映射,将语音信息转化为文本信息的识别方式;Hybrid语音识别系统由声学模型(Acoustic Model,AM)、发音词典(Pronunciation Dictionary)、语言模型(Language Model,LM)等多个模型组成。如图9所示,Hybrid语音识别系统900包括声学模型901、发音词典902、语言模型903;服务器可以通过对待识别的语音信息进行特征提取,得到语音特征序列,将语音特征序列输入至Hybrid语音识别系统900,获得语音识别系统900输出的语音信息对应的文本信息。
其中,声学模型是指用于计算语音特征与音素之间的映射概率的模型,音素是根据语音的自然属性划分出来的最小语音单位;其中,语音的自然属性包括物理属性和生理属性;物理属性包括音高(声音的高低,它决定于发音体的振动频率的大小,与发音体的振动频率成正比)、音强(声音的强弱,它决定于发音体振幅的大小,与发音体的振幅成正比)、音长(声音的长短,它决定于发音体的振动时间的长短,与发音体的振动时间成正比)、音质(声音的个性或特色,也叫音色,它决定于发音体振动的形式);生理属性即指示语音的生理发声位置以及发音动作。从物理属性来讲,音素是从音质角度划分出来的最小语音单位,而从生理属性来讲,音素是根据发音动作划分出来的最小语音单位,也就是说,一个发音动作构成一个音素,例如,“啊”对应的语音(ā)对应为具有一个音素,“爱”对应的语音“ài”对应为具有两个音素。发音词典包含上述语音识别系统所能处理的词汇集合及词汇集合中的词汇对 应的发音,提供了声学模型建模单元与语言模型单元间之间的映射。语言模型是指用于计算词语到词串之间的映射概率的模型,即用于估计识别得到的词汇组合成目标文本时,该目标文本存在的可能性。
E2E语音识别:是指端到端语音识别,E2E语音识别系统中不再有独立的声学模型、发音词典、语言模型等模型,而是从输入端(语音特征序列)到输出端(词串序列)直接通过一个神经网络相连,由该神经网络来承担原先所有模型的语音识别系统;示意性的,该神经网络可以是基于多头注意力机制(Multi-Head Attention,MHA)构建的网络模型。如图10所示,E2E语音识别系统1000包括编码器(Encoder)1001、注意力机制(Attention)模型1002以及解码器(Decoder)1003;服务器可以通过对待识别的语音信息进行特征提取,得到语音特征序列;将语音特征序列输入至E2E语音识别系统1000,得到语音识别系统1000输出的语音信息对应的文本信息。
在一个示例中,通过同一台物理机器对实现本申请实施例提供的语音识别方法的系统(SNSC,Shared Network System Combination)、Hybrid语音识别系统、E2E语音识别系统进行测试,得到如表一中的测试结果,其中,字错率表示识别每100个字中错误的字数,实时率(Real Time Factor,RTF)是用于度量语音识别系统解码速度的值,当实时率等于或小于1时,则认为该处理是实时的。由表一中的结果可知,SNSC系统相比于Hybrid语音识别系统和E2E语音识别系统具有较低的字错率,且SNSC系统所测得的实时率小于Hybrid语音识别系统和E2E语音识别系统的实时率之和,达到了实时率小于1的服务部署要求,也即,本申请实施例中提供的语音识别方法,具有高效精准的性能,且低实时率满足服务部署的条件。
表一
Figure PCTCN2022082046-appb-000001
综上所述,本申请实施例提供的语音识别方法,针对待识别的语音内容,对语音内容进行特征提取,得到能够指示语音内容的音频表达特性的中间特征,然后通过两种不同的处理方式对该中间特征进行处理,得到两个词图,其中两种不同的处理方式包括基于注意力机制对中间特征进行解码,得到第一词图,以及基于语音内容的发音情况进行特征映射,得到第二词图,第一词图和第二词图分别用于指示通过上述两种处理方式得到的候选词汇组成的候选序列集,最后根据第一词图和第二词图指示的候选词汇之间的连接关系确定出识别结果,以实现将语音内容转换为文本内容的功能。由于第一词图和第二词图均为通过同一中间特征得到,因此能够节省服务器资源,同时,对中间特征执行不同的处理方式,再根据两种处理方式获得的词图共同确定处理结果,提高了语音识别的准确度。
请参考图11,其示出了本申请一个示例性实施例提供的语音识别模型的训练方法流程图,在本申请实施例中,对语音识别模型中各个功能子模型进行训练,得到用于对语音内容进行识别的语音识别模型,该方法包括如下步骤。
步骤1101,获取初始化网络参数。
该初始化网络参数是针对共享网络子模型和E2E网络子模型的初始化参数,示意性的,共享网络子模型和E2E网络子模型组成第一训练网络。共享网络子模型与E2E网络子模型能够组成一个E2E语音识别系统,其中,共享网络子模型实现为E2E语音识别系统中的编码器(encoder),共享网络子模型由卷积神经网络和Transformer两个部分组成,而E2E网络子模型由Attention处理层和解码器(decoder)两个部分组成;初始化网络参数包括卷积神经网络、 Transformer、Attention处理层和解码器各自的初始参数。示意性的,该初始化网络参数由系统随机生成。
步骤1102,通过反向传播算法对初始化网络参数进行训练,得到共享网络参数和E2E网络参数。
示意性的,初始化网络参数的训练还可以通过梯度下降法或其他训练方法进行训练,在此仅以通过反向传播算法为例进行说明。反向传播算法(Backpropagation Algorithm,BP算法)是一种适合于多层神经元网络的学习算法,在一个示例中,用于训练初始化网络参数的训练数据可以是数据库中存储的语音信息-文本信息的样本数据,也可以是从网络中获取的语音信息-文本信息的样本数据。在训练过程中,将训练数据中的样本一个接一个递交给由共享网络子模型和E2E网络子模型组成的第一训练网络;第一训练网络对样本输入x计算输出y,然后通过损失函数得到样本中目标值与y之间的误差值,然后通过求取损失函数的梯度,并对第一训练网络的权值进行全体更新,对每一个提交给神经网络的样本用损失函数对应的更新函数对全体权值进行一次更新,直到所有样本对应的误差值都小于一个预设的阈值,即训练至收敛。在本申请实施例中,先通过交叉熵损失函数(Cross Entropy Loss,CE Loss)对第一训练网络进行训练,直至收敛,然后通过最小词错率损失函数进行训练,直至再次收敛,则训练结束,得到第一训练网络对应的参数。其中,第一训练网络对应的参数包括共享网络子模型对应的共享网络参数和E2E网络子模型对应的E2E网络参数。
步骤1103,基于共享网络参数,对声学处理子模型进行训练,得到声学处理参数。
共享网络子模型和声学处理子模型能够组成一个Hybrid语音识别系统,共享网络子模型以及声学处理子模型中的全连接层共同充当Hybrid语音识别系统中的声学模型部分。其中,共享网络子模型和声学处理子模型共同组成待训练的第二训练网络。将已经训练完成的共享网络子模型的共享网络参数作为第二训练网络的部分参数,参与至第二训练网络的训练过程中。第二训练网络的训练过程包括:在已确定共享网络参数的基础上,对全连接层完成随机初始化,然后通过在对齐好的语料上采用交叉熵损失函数将第二训练网络训练至收敛,再通过在准备好的词图上采用鉴别性训练直到再次收敛,即完成训练,得到声学处理子模型对应的声学处理参数。其中,上述过程中包含的训练数据(对齐的语料以及词图),可以从数据库中读取得到。
也就是说,在语音识别模型在训练过程中,先对由共享网络子模型和E2E网络子模型组成的第一训练网络进行训练,然后再对由共享网络子模型和声学处理子模型组成的第二训练网络进行训练。
步骤1104,基于共享网络参数、E2E网络参数和声学处理参数构建语音识别模型。
由共享网络参数构建共享网络子模型,由E2E网络参数构建E2E网络子模型,由声学处理参数构建声学处理子模型,最后由共享网络子模型、E2E网络子模型、声学处理子模型和结果生成子模型共同组成语音识别模型。
综上所述,本申请实施例提供的语音识别模型的训练方法,首先从零起步以E2E语音识别系统的训练方式得到共享网络子模型和E2E网络子模型的网络参数,然后将共享网络子模型部分对应的共享网络参数用于对声学处理子模型的训练中,将共享网络子模型和声学处理子模型作为一个Hybrid语音识别系统进行训练,得到声学处理子模型的网络参数,然后由上述训练得到的参数共同构建语音识别模型,使得训练得到的语音识别模型在实现语音识别的过程中,既能保证语音识别的准确性,也能够节省整个语音识别模型占用的服务器资源。
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守国家和地区的相关法律法规和标准。例如,本申请中涉及到的语音内容,模型训练数据都是在充分授权的情况 下获取的。
请参考图12,其示出了本申请一个示例性的实施例提供的语音识别装置结构框图,该装置包括如下模块:
获取模块1210,用于获取语音内容,所述语音内容为待识别的音频;
处理模块1220,用于对所述语音内容进行特征提取,得到中间特征,所述中间特征用于指示所述语音内容的音频表达特性;
第一生成模块1230,用于基于注意力机制对所述中间特征进行解码,得到第一词图,所述第一词图用于指示基于所述注意力机制预测得到的第一候选词汇组成的第一候选序列集;
第二生成模块1240,用于基于所述语音内容的发音情况对所述中间特征进行特征映射,得到第二词图,所述第二词图用于指示基于所述发音情况得到的第二候选词汇组成的第二候选序列集;
确定模块1250,用于根据所述第一词图和所述第二词图指示的候选词汇之间的连接关系,确定所述语音内容的识别结果。
在一个可选的实施例中,所述处理模块1220,还用于对所述语音内容通过至少一层卷积网络进行特征抽取,得到中间子特征;
所述处理模块1220,还用于对所述中间子特征进行特征加权,得到所述中间特征。
在一个可选的实施例中,请参考图13,所述第一生成模块1230,还包括:
第一处理单元1231,用于基于所述注意力机制对所述中间特征中指示人声语音表达的通道进行特征加权,得到第一分支特征;
第一解码单元1232,用于对所述第一分支特征进行解码,得到所述第一词图。
在一个可选的实施例中,所述第一解码单元1232,还用于通过解码器对所述第一分支特征进行解码,得到所述第一候选序列集;
所述第一生成模块1230,还包括:
第一生成单元1233,用于将所述第一候选序列集中对应的第一候选词汇作为路径,生成所述第一词图。
在一个可选的实施例中,所述第二生成模块1240,还包括:
第二确定单元1241,用于基于所述中间特征确定所述待识别语音的目标词汇集;
第二生成单元1242,用于基于所述目标词汇集生成所述第二词图。
在一个可选的实施例中,所述第二确定单元1241,还用于将所述中间特征输入至全连接层,得到所述待识别语音的音素的后验概率,所述音素用于指示根据语音的自然属性划分的最小语音单位;
所述第二确定单元1241,还用于基于所述音素的后验概率,确定所述目标词汇集。
在一个可选的实施例中,所述第二生成模块1240,还包括:
第二获取单元1243,用于获取发音词典,所述发音词典包括词汇与发音之间的映射关系;
所述第二确定单元1241,还用于根据所述音素的后验概率,确定所述语音内容中各个时间点的音素;
所述第二确定单元1241,还用于根据所述发音词典,确定所述各个时间点的音素所能够组成的目标词汇集。
在一个可选的实施例中,所述第二确定单元1241,还用于确定所述目标词汇集组成的至少一个第二候选序列的概率;
所述第二生成单元1242,还用于基于所述至少一个第二候选序列的概率,生成所述第二词图。
在一个可选的实施例中,所述确定模块1250,还包括:
生成单元1251,用于基于所述第一词图和所述第二词图生成目标混淆网络,所述目标混 淆网络中包括组成候选序列的第三候选词汇之间的连接概率,所述第三候选词汇是从所述第一候选词汇和所述第二候选词汇中确定的,所述第三候选词汇之间的连接概率通过对所述第一候选词汇之间的第一连接关系和所述第二候选词汇之间的第二连接关系进行加权合并得到;
确定单元1252,用于将所述目标混淆网络中第三候选词汇之间连接概率之和最高的候选序列,确定为所述识别结果。
在一个可选的实施例中,所述生成单元1251,还用于基于所述第一词图生成第一混淆网络,所述第一混淆网络中包括所述第一候选序列集中的所述第一候选词汇之间的连接概率;
所述生成单元1251,还用于基于所述第二词图生成第二混淆网络,所述第二混淆网络中包括所述第二候选序列集中的所述第二候选词汇之间的连接概率;
所述确定单元1252,还用于将所述第一混淆网络和所述第二混淆网络进行加权合并,得到所述目标混淆网络。
综上所述,本申请实施例提供的语音识别装置,针对待识别的语音内容,对语音内容进行特征提取,得到能够指示语音内容的音频表达特性的中间特征,然后通过两种不同的处理方式对该中间特征进行处理,得到两个词图,其中两种不同的处理方式包括基于注意力机制对中间特征进行解码,得到第一词图,以及基于语音内容的发音情况进行特征映射,得到第二词图,第一词图和第二词图分别用于指示通过上述两种处理方式得到的候选词汇组成的候选序列集,最后根据第一词图和第二词图指示的候选词汇之间的连接关系确定出识别结果,以实现将语音内容转换为文本内容的功能。由于第一词图和第二词图均为通过同一中间特征得到,因此能够节省服务器资源,同时,对中间特征执行不同的处理方式,再根据两种处理方式获得的词图共同确定处理结果,提高了语音识别的准确度。
需要说明的是:上述实施例提供的语音识别装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的语音识别装置与语音识别方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图14示出了本申请一个示例性实施例提供的服务器的结构示意图。该服务器可以包括如下结构。
服务器1400包括中央处理单元(Central Processing Unit,CPU)1401、包括随机存取存储器(Random Access Memory,RAM)1402和只读存储器(Read Only Memory,ROM)1403的系统存储器1404,以及连接系统存储器1404和中央处理单元1401的系统总线1405。服务器1400还包括用于存储操作系统1413、应用程序1414和其他程序模块1415的大容量存储设备1406。
不失一般性,计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、可擦除可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)、带电可擦可编程只读存储器(Electrically Erasable Programmable Read Only Memory,EEPROM)、闪存或其他固态存储其技术,CD-ROM、数字通用光盘(Digital Versatile Disc,DVD)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知计算机存储介质不局限于上述几种。上述的系统存储器1404和大容量存储设备1406可以统称为存储器。
根据本申请的各种实施例,服务器1400还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即服务器1400可以通过连接在系统总线1405上的网络接口单元1411连接 到网络1412,或者说,也可以使用网络接口单元1411来连接到其他类型的网络或远程计算机系统(未示出)。
上述存储器还包括一个或者一个以上的程序,一个或者一个以上程序存储于存储器中,被配置由CPU执行。
本申请的实施例还提供了一种计算机设备,该计算机设备包括处理器和存储器,该存储器中存储有至少一条计算机程序,至少一条计算机程序由处理器加载并执行以实现上述各方法实施例提供的语音识别方法。可选地,该计算机设备可以是终端,也可以是服务器。
本申请的实施例还提供了一种计算机可读存储介质,该计算机可读存储介质上存储有至少一条程序代码,程序代码由处理器加载并执行以实现上述各方法实施例提供的语音识别方法。
本申请的实施例还提供了一种计算机程序产品,该计算机程序产品包括至少一条计算机程序。计算机设备的处理器从计算机程序产品读取该计算机程序,处理器执行该计算机程序,以实现上述实施例中任一所述的语音识别方法。
可选地,该计算机可读存储介质可以包括:只读存储器、随机存取记忆体、固态硬盘(SSD,Solid State Drives)或光盘等。其中,随机存取记忆体可以包括电阻式随机存取记忆体(ReRAM,Resistance Random Access Memory)和动态随机存取存储器(DRAM,Dynamic Random Access Memory)。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。

Claims (20)

  1. 一种语音识别方法,所述方法由计算机设备执行,所述方法包括:
    获取语音内容,所述语音内容为待识别的音频;
    对所述语音内容进行特征提取,得到中间特征,所述中间特征用于指示所述语音内容的音频表达特性;
    基于注意力机制对所述中间特征进行解码,得到第一词图,所述第一词图用于指示基于所述注意力机制预测得到的第一候选词汇组成的第一候选序列集;
    基于所述语音内容的发音情况对所述中间特征进行特征映射,得到第二词图,所述第二词图用于指示基于所述发音情况得到的第二候选词汇组成的第二候选序列集;
    根据所述第一词图和所述第二词图指示的候选词汇之间的连接关系,确定所述语音内容的识别结果。
  2. 根据权利要求1所述的方法,所述对所述语音内容进行特征提取,得到中间特征,包括:
    通过至少一层卷积网络对所述语音内容进行特征抽取,得到中间子特征;
    对所述中间子特征进行特征加权,得到所述中间特征。
  3. 根据权利要求1所述的方法,所述基于注意力机制对所述中间特征进行解码,得到第一词图,包括:
    基于所述注意力机制对所述中间特征中指示人声语音表达的通道进行特征加权,得到第一分支特征;
    对所述第一分支特征进行解码,得到所述第一词图。
  4. 根据权利要求3所述的方法,所述对所述第一分支特征进行解码,得到所述第一词图,包括:
    通过解码器对所述第一分支特征进行解码,得到所述第一候选序列集;
    将所述第一候选序列集中对应的第一候选词汇作为路径,生成所述第一词图。
  5. 根据权利要求1所述的方法,所述基于所述语音内容的发音情况对所述中间特征进行特征映射,得到第二词图,包括:
    基于所述中间特征确定待识别语音的目标词汇集;
    基于所述目标词汇集生成所述第二词图。
  6. 根据权利要求5所述的方法,所述基于所述中间特征确定待识别语音的目标词汇集,包括:
    将所述中间特征输入至全连接层,得到所述待识别语音的音素的后验概率,所述音素用于指示根据语音的自然属性划分的最小语音单位;
    基于所述音素的后验概率,确定所述目标词汇集。
  7. 根据权利要求6所述的方法,所述基于所述音素的后验概率,确定所述目标词汇集,包括:
    获取发音词典,所述发音词典包括词汇与发音之间的映射关系;
    根据所述音素的后验概率,确定所述语音内容中各个时间点的音素;
    根据所述发音词典,确定所述各个时间点的音素组成的目标词汇集。
  8. 根据权利要求6所述的方法,所述基于所述目标词汇集生成所述第二词图,包括:
    确定所述目标词汇集组成的至少一个第二候选序列的概率;
    基于所述至少一个第二候选序列的概率,生成所述第二词图。
  9. 根据权利要求1至8任一所述的方法,所述根据所述第一词图和所述第二词图指示的候选词汇之间的连接关系,确定所述语音内容的识别结果,包括:
    基于所述第一词图和所述第二词图生成目标混淆网络,所述目标混淆网络中包括组成候选序列的第三候选词汇之间的连接概率,所述第三候选词汇是从所述第一候选词汇和所述第二候选词汇中确定的,所述第三候选词汇之间的连接概率通过对所述第一候选词汇之间的第一连接关系和所述第二候选词汇之间的第二连接关系进行加权合并得到;
    将所述目标混淆网络中所述第三候选词汇之间连接概率之和最高的候选序列,确定为所述识别结果。
  10. 根据权利要求9所述的方法,所述基于所述第一词图和所述第二词图生成目标混淆网络,包括:
    基于所述第一词图生成第一混淆网络,所述第一混淆网络中包括所述第一候选序列集中的所述第一候选词汇之间的连接概率;
    基于所述第二词图生成第二混淆网络,所述第二混淆网络中包括所述第二候选序列集中的所述第二候选词汇之间的连接概率;
    将所述第一混淆网络和所述第二混淆网络进行加权合并,得到所述目标混淆网络。
  11. 一种语音识别装置,所述装置包括:
    获取模块,用于获取语音内容,所述语音内容为待识别的音频;
    处理模块,用于对所述语音内容进行特征提取,得到中间特征,所述中间特征用于指示所述语音内容的音频表达特性;
    第一生成模块,用于基于注意力机制对所述中间特征进行解码,得到第一词图,所述第一词图用于指示基于所述注意力机制预测得到的第一候选词汇组成的第一候选序列集;
    第二生成模块,用于基于所述语音内容的发音情况对所述中间特征进行特征映射,得到第二词图,所述第二词图用于指示基于所述发音情况得到的第二候选词汇组成的第二候选序列集;
    确定模块,用于根据所述第一词图和所述第二词图指示的候选词汇之间的连接关系,确定所述语音内容的识别结果。
  12. 根据权利要求11所述的装置,所述处理模块,还用于对所述语音内容通过至少一层卷积网络进行特征抽取,得到中间子特征;
    所述处理模块,还用于对所述中间子特征进行特征加权,得到所述中间特征。
  13. 根据权利要求11所述的装置,所述第一生成模块,还包括:
    第一处理单元,用于基于所述注意力机制对所述中间特征中指示人声语音表达的通道进行特征加权,得到第一分支特征;
    第一解码单元,用于对所述第一分支特征进行解码,得到所述第一词图。
  14. 根据权利要求13所述的装置,所述第一解码单元,还用于通过解码器对所述第一分支特征进行解码,得到所述第一候选序列集;
    所述第一生成模块,还包括:
    第一生成单元,用于将所述第一候选序列集中对应的第一候选词汇作为路径,生成所述第一词图。
  15. 根据权利要求11所述的装置,所述第二生成模块,还包括:
    第二确定单元,用于基于所述中间特征确定所述待识别语音的目标词汇集;
    第二生成单元,用于基于所述目标词汇集生成所述第二词图。
  16. 根据权利要求15所述的装置,所述第二确定单元,还用于将所述中间特征输入至全连接层,得到所述待识别语音的音素的后验概率,所述音素用于指示根据语音的自然属性划分的最小语音单位;
    所述第二确定单元,还用于基于所述音素的后验概率,确定所述目标词汇集。
  17. 根据权利要求16所述的装置,所述第二生成模块,还包括:
    第二获取单元,用于获取发音词典,所述发音词典包括词汇与发音之间的映射关系;
    所述第二确定单元,还用于根据所述音素的后验概率,确定所述语音内容中各个时间点的音素;
    所述第二确定单元,还用于根据所述发音词典,确定所述各个时间点的音素所能够组成的目标词汇集。
  18. 一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行以实现如权利要求1至10任一所述的语音识别方法。
  19. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条程序代码,所述程序代码由处理器加载并执行以实现如权利要求1至10任一所述的语音识别方法。
  20. 一种计算机程序产品,所述计算机程序产品包括至少一条计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至10任一所述的语音识别方法。
PCT/CN2022/082046 2021-04-26 2022-03-21 语音识别方法、装置、设备、存储介质及程序产品 WO2022227935A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22794411.3A EP4231283A4 (en) 2021-04-26 2022-03-21 SPEECH RECOGNITION METHOD AND DEVICE AS WELL AS DEVICE, STORAGE MEDIUM AND PROGRAM PRODUCT
US17/979,660 US20230070000A1 (en) 2021-04-26 2022-11-02 Speech recognition method and apparatus, device, storage medium, and program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110451736.0 2021-04-26
CN202110451736.0A CN112863489B (zh) 2021-04-26 2021-04-26 语音识别方法、装置、设备及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/979,660 Continuation US20230070000A1 (en) 2021-04-26 2022-11-02 Speech recognition method and apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2022227935A1 true WO2022227935A1 (zh) 2022-11-03

Family

ID=75992905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/082046 WO2022227935A1 (zh) 2021-04-26 2022-03-21 语音识别方法、装置、设备、存储介质及程序产品

Country Status (4)

Country Link
US (1) US20230070000A1 (zh)
EP (1) EP4231283A4 (zh)
CN (1) CN112863489B (zh)
WO (1) WO2022227935A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112863489B (zh) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及介质
CN113380237A (zh) * 2021-06-09 2021-09-10 中国科学技术大学 增强局部依赖关系无监督预训练语音识别模型及训练方法
CN115662397B (zh) * 2022-12-29 2023-04-18 北京百度网讯科技有限公司 语音信号的处理方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140304205A1 (en) * 2013-04-04 2014-10-09 Spansion Llc Combining of results from multiple decoders
CN106104674A (zh) * 2014-03-24 2016-11-09 微软技术许可有限责任公司 混合语音识别
CN110534095A (zh) * 2019-08-22 2019-12-03 百度在线网络技术(北京)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
CN110808032A (zh) * 2019-09-20 2020-02-18 平安科技(深圳)有限公司 一种语音识别方法、装置、计算机设备及存储介质
CN110970031A (zh) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 语音识别系统及方法
CN112863489A (zh) * 2021-04-26 2021-05-28 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645270A (zh) * 2008-12-12 2010-02-10 中国科学院声学研究所 一种双向语音识别处理系统及方法
US10672388B2 (en) * 2017-12-15 2020-06-02 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for open-vocabulary end-to-end speech recognition
US11257481B2 (en) * 2018-10-24 2022-02-22 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
CN110164416B (zh) * 2018-12-07 2023-05-09 腾讯科技(深圳)有限公司 一种语音识别方法及其装置、设备和存储介质
CN112242144A (zh) * 2019-07-17 2021-01-19 百度在线网络技术(北京)有限公司 基于流式注意力模型的语音识别解码方法、装置、设备以及计算机可读存储介质
CN111933125B (zh) * 2020-09-15 2021-02-02 深圳市友杰智新科技有限公司 联合模型的语音识别方法、装置和计算机设备
CN112509564B (zh) * 2020-10-15 2024-04-02 江苏南大电子信息技术股份有限公司 基于连接时序分类和自注意力机制的端到端语音识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140304205A1 (en) * 2013-04-04 2014-10-09 Spansion Llc Combining of results from multiple decoders
CN106104674A (zh) * 2014-03-24 2016-11-09 微软技术许可有限责任公司 混合语音识别
CN110534095A (zh) * 2019-08-22 2019-12-03 百度在线网络技术(北京)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
CN110808032A (zh) * 2019-09-20 2020-02-18 平安科技(深圳)有限公司 一种语音识别方法、装置、计算机设备及存储介质
CN110970031A (zh) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 语音识别系统及方法
CN112863489A (zh) * 2021-04-26 2021-05-28 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4231283A4

Also Published As

Publication number Publication date
US20230070000A1 (en) 2023-03-09
CN112863489B (zh) 2021-07-27
EP4231283A1 (en) 2023-08-23
EP4231283A4 (en) 2024-05-22
CN112863489A (zh) 2021-05-28

Similar Documents

Publication Publication Date Title
WO2022227935A1 (zh) 语音识别方法、装置、设备、存储介质及程序产品
KR101183344B1 (ko) 사용자 정정들을 이용한 자동 음성 인식 학습
US7885817B2 (en) Easy generation and automatic training of spoken dialog systems using text-to-speech
JP2019522810A (ja) ニューラルネットワークベースの声紋情報抽出方法及び装置
US11043214B1 (en) Speech recognition using dialog history
CA3114572A1 (en) Conversational agent pipeline trained on synthetic data
US20190013008A1 (en) Voice recognition method, recording medium, voice recognition device, and robot
JP7230806B2 (ja) 情報処理装置、及び情報処理方法
WO2018192186A1 (zh) 语音识别方法及装置
WO2022213787A1 (zh) 音频编码方法、音频解码方法、装置、计算机设备、存储介质及计算机程序产品
WO2023245389A1 (zh) 歌曲生成方法、装置、电子设备和存储介质
JP2004310098A (ja) スイッチング状態空間型モデルによる変分推論を用いた音声認識の方法
CN110600013A (zh) 非平行语料声音转换数据增强模型训练方法及装置
CN114242033A (zh) 语音合成方法、装置、设备、存储介质及程序产品
WO2021169825A1 (zh) 语音合成方法、装置、设备和存储介质
JP2023511390A (ja) アテンションベースのジョイント音響およびテキストのオンデバイス・エンド・ツー・エンドモデル
CN115713939B (zh) 语音识别方法、装置及电子设备
US20230252971A1 (en) System and method for speech processing
Fadel et al. Which French speech recognition system for assistant robots?
Mirishkar et al. CSTD-Telugu corpus: Crowd-sourced approach for large-scale speech data collection
JP6306447B2 (ja) 複数の異なる対話制御部を同時に用いて応答文を再生する端末、プログラム及びシステム
JP4864783B2 (ja) パタンマッチング装置、パタンマッチングプログラム、およびパタンマッチング方法
CN113223513A (zh) 语音转换方法、装置、设备和存储介质
CN112951270A (zh) 语音流利度检测的方法、装置和电子设备
EP4205104B1 (en) System and method for speech processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22794411

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022794411

Country of ref document: EP

Effective date: 20230516

NENP Non-entry into the national phase

Ref country code: DE