WO2020119351A1 - Speech decoding method, apparatus, computer device and storage medium - Google Patents

Speech decoding method, apparatus, computer device and storage medium

Info

Publication number
WO2020119351A1
Authority
WO
WIPO (PCT)
Prior art keywords
decoding
token
score
pruning
state
Prior art date
Application number
PCT/CN2019/116686
Other languages
English (en)
French (fr)
Inventor
黄羿衡
简小征
贺利强
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2020119351A1 publication Critical patent/WO2020119351A1/zh
Priority to US17/191,604 priority Critical patent/US11935517B2/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/05 Word boundary detection
    • G10L15/063 Training
    • G10L15/083 Recognition networks
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/183 Natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L2015/085 Methods for reducing search complexity, pruning
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a speech decoding method, device, computer equipment, and storage medium.
  • Speech recognition technology, also called ASR (Automatic Speech Recognition), aims to convert the vocabulary content of human speech into computer-readable input, such as keys, binary codes, or character sequences, so as to achieve human-computer interaction.
  • Voice recognition technology has a wide range of application scenarios in modern life, and can be applied to scenarios such as car navigation, smart home, voice dialing, and simultaneous interpretation.
  • The decoder is the core of the speech recognition system. The decoder-based speech decoding process plays an important role in the entire speech recognition pipeline and directly affects the accuracy of the recognition results.
  • In the related art, the decoder-based speech decoding process is: obtain a high-order language model, generate a decoding network from the high-order language model with a general-purpose tool such as openfst, and then perform speech decoding based on that decoding network.
  • However, the memory footprint of the high-order language model is large, and the memory footprint of the decoding network generated from the high-order language model is much larger still.
  • This requires a large amount of storage and computing resources, and decoding is difficult to achieve in scenarios with limited computing resources. Therefore, a speech decoding method that takes both decoding speed and decoding accuracy into account is urgently needed.
  • a voice decoding method, apparatus, computer equipment, and storage medium are provided.
  • A speech decoding method is provided, executed by a computer device, where the speech includes a current audio frame and a previous audio frame. The method includes:
  • A speech decoding apparatus is provided for a computer device, where the speech includes a current audio frame and a previous audio frame. The apparatus includes:
  • An obtaining module, configured to obtain a target token corresponding to a minimum decoding score from a first token list, the first token list including a plurality of first tokens obtained by decoding the previous audio frame in different decoding networks, the first token including a state pair and a decoding score, the state pair being used to characterize the correspondence between the first state of the first token in the first decoding network corresponding to the low-order language model and the second state of the first token in the second decoding network corresponding to the differential language model;
  • a determining module, configured to determine pruning parameters for decoding the current audio frame according to the target token and the acoustic vector of the current audio frame, the pruning parameters being used to constrain the decoding process of the current audio frame; and
  • a decoding module, configured to decode the current audio frame according to the first token list, the pruning parameters, and the acoustic vector.
  • A computer device includes a processor and a memory.
  • The memory stores computer-readable instructions.
  • The computer-readable instructions, when executed by the processor, cause the processor to perform the steps of the speech decoding method.
  • A non-volatile computer-readable storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech decoding method.
  • FIG. 1 is an implementation environment involved in a voice decoding method provided by an embodiment of the present application
  • Figure 2 is a decoding principle diagram of the existing speech decoding method
  • FIG. 3 is a decoding principle diagram of a voice decoding method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a voice decoding method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a voice decoding process provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a voice decoding device according to an embodiment of the present application.
  • FIG. 7 shows a structural block diagram when a computer device provided by an exemplary embodiment of the present application is specifically implemented as a terminal.
  • <eps> represents the empty symbol.
  • ilabel represents the input symbol.
  • <s> represents the start symbol.
  • State.A indicates the state of the token in the first decoding network corresponding to the low-order language model.
  • State.B indicates the state of the token in the second decoding network corresponding to the differential language model.
  • WFST: Weighted Finite-State Transducer.
  • A token is a data structure that records the score and related information of a certain state at a certain moment in the decoding process. Starting from the initial state of the weighted finite-state transducer, the token is transferred along directed edges, and the state change during the transfer is reflected by the change of the input symbol. In the transfer process from the initial state to the final state, a series of states and edges is recorded in the token.
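  • As an illustrative sketch only (the field names are assumptions, not taken from this application), a token carrying the state pair <state.A, state.B> and the decoding score could be represented as follows:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    # state pair <state.A, state.B>: state_a is the state in the first decoding
    # network (low-order language model), state_b the state in the second
    # decoding network (differential language model)
    state_a: int
    state_b: int
    tot_cost: float = 0.0                         # accumulated decoding score
    trace: list = field(default_factory=list)     # states/edges passed through

# the initial token: both states are the initial state 0, and the score is 0
initial_token = Token(state_a=0, state_b=0, tot_cost=0.0)
```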
  • HCLG.fst is a decoding network composed of four fsts through a series of algorithms.
  • the four fsts are H.fst, C.fst, L.fst and G.fst.
  • G represents the language model, and its input and output types are the same.
  • a language model is a representation of language structure (including the rules between words and sentences, such as grammar, common collocation of words, etc.), and its probability is used to indicate the probability that a sequence of language units appears in a speech signal.
  • L represents a pronunciation dictionary
  • the input is a monophone (phoneme)
  • the output is a word.
  • the pronunciation dictionary contains a collection of words and their pronunciation.
  • C represents context dependency; the input is a triphone and the output is a monophone.
  • The context dependency is used to express the correspondence between triphones and monophones.
  • H represents an acoustic model, which is a differential representation of acoustics, linguistics, environmental variables, speaker gender, accent, etc.
  • Acoustic models include HMM (Hidden Markov Model)-based acoustic models, for example GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) and DNN-HMM (Deep Neural Network-Hidden Markov Model), as well as end-to-end acoustic models such as CTC-LSTM (Connectionist Temporal Classification-Long Short-Term Memory).
  • Each state of the acoustic model represents the probability distribution of the speech features of the speech unit in that state, and is connected into an ordered state sequence through the transition between states.
  • The decoding network, also called the search space, fuses various knowledge sources by WFST, including at least one of a language model, an acoustic model, a context-dependency model, and a pronunciation dictionary. For example, the monophone decoding network composed of L and G is denoted as the LG network, the context-dependent decoding network composed of C, L, and G is denoted as the CLG network, and the HCLG network further incorporates the hidden Markov model H.
  • the word point indicates the position where Chinese characters are output.
  • the real-time rate indicates the ratio of decoding time to audio time.
  • the speech recognition system is used for speech recognition, and mainly includes a preprocessing module, a feature extraction module, an acoustic model training, a language model training module and a decoder.
  • The preprocessing module processes the input original speech signal, filtering out unimportant information and background noise, and performs endpoint detection (finding the beginning and end of the speech signal), framing (the speech signal is approximately stationary within 10-30 ms, so it is divided into short segments for analysis), pre-emphasis (boosting the high-frequency part), and other processing.
  • The feature extraction module removes redundant information in the speech signal that is not useful for speech recognition, retains the information that reflects the essential characteristics of the speech, and expresses it in a certain form.
  • That is, the feature extraction module extracts key feature parameters reflecting the characteristics of the speech signal to form a feature vector sequence for subsequent processing.
  • the acoustic model training module is used to train the acoustic model parameters according to the characteristic parameters of the training speech library. During recognition, the feature parameters of the speech to be recognized can be matched with the acoustic model to obtain the recognition result.
  • the current mainstream speech recognition systems mostly use Hidden Markov Model HMM for acoustic model modeling.
  • the language model training module is used for grammatical and semantic analysis of the training text database, and the language model is obtained after training based on the statistical model.
  • the training methods of language models are mainly based on rule models and statistical models.
  • the language model is actually a probability model that calculates the occurrence probability of any sentence.
  • The process of establishing a language model can effectively combine knowledge of Chinese grammar and semantics to describe the internal relationships between words. Recognizing on the basis of the trained language model improves the recognition rate and narrows the search range.
  • the decoder can construct a decoding network based on the trained acoustic model, language model and pronunciation dictionary for the input speech signal, and use a search algorithm to search for the best path in the decoding network.
  • the best path searched by the decoder can output the word string of the voice signal with the greatest probability, so that the vocabulary content included in the voice signal can be determined.
  • The hardware environment of the decoder includes: two 14-core CPUs (E5-2680v4, 2.4 GHz), 256G memory, RAID (disk array), 2*300 SAS, 6*800G SSD (solid-state drives), 2*40G network ports (optical, multi-mode), and 8 GPUs.
  • Each GPU is a Tesla M40 24GB graphics card.
  • the voice decoding method provided in the embodiments of the present application can be applied to various scenarios that require the use of voice recognition functions, for example, smart home scenarios, voice input scenarios, car navigation scenarios, and simultaneous interpretation scenarios.
  • the implementation environment involved in the embodiments of the present application may include a terminal 101 and a server 102.
  • the terminal 101 may be a smart phone, a notebook computer, a tablet computer, and other devices.
  • the terminal 101 may obtain related data for voice recognition from the server 102 in advance, and store the obtained data in a memory.
  • The processor in the terminal 101 calls the data stored in the memory to decode the collected voice signal. The terminal 101 can also be installed with an application having a voice recognition function.
  • When the terminal 101 collects a voice signal, the collected voice signal is uploaded to the server 102 through the installed application, and the server 102 performs voice decoding to provide the corresponding voice service.
  • a voice recognition system is configured in the server 102, so that a voice recognition service can be provided to the terminal 101.
  • FIG. 2 is a schematic diagram of the speech decoding process of the related technology.
  • the related technology records the first decoding network corresponding to the low-order language model as WFST A , and the state of the token in WFST A as State.A.
  • the second decoding network corresponding to the higher-order language model is denoted as WFST B
  • the state of the token in WFST B is denoted as State.B.
  • Related technologies use cohyps (associated hypothesis set) to record the different hypotheses of State.A in WFST B and the corresponding states of these hypotheses.
  • The number of states of the high-order language model is several orders of magnitude larger than the number of states of the low-order language model.
  • The same state of the low-order language model may therefore correspond to many different states of the high-order language model. The related technology limits the number of cohyps, setting it to 15 according to an empirical value; this uniform limit leads to decoding results that are not completely equivalent to the high-order language model, resulting in a loss of accuracy.
  • an embodiment of the present application provides a voice decoding method.
  • The embodiment of the present application uses the state pair <state.A, state.B> to record the decoding state instead of limiting the total number of state.B corresponding to each state.A, so that a decoding result completely equivalent to that of the higher-order language model can be obtained without loss of accuracy.
  • An embodiment of the present application provides a voice decoding method. Referring to FIG. 4, the method process provided by the embodiment of the present application includes:
  • the terminal obtains a first decoding network corresponding to the low-order language model and a second decoding network corresponding to the differential language model.
  • the terminal may obtain a low-level language model from the server, and then use a model conversion tool (such as openfst, etc.) to generate a first decoding network corresponding to the low-level language model based on the obtained low-level language model.
  • the terminal may obtain a differential language model from the server, and then use a model conversion tool (such as openfst, etc.) to generate a second decoding network corresponding to the differential language model based on the acquired differential language model.
  • Before the terminal obtains the low-order language model and the differential language model from the server, the server needs to obtain the high-order language model first, and then obtain the low-order language model and the differential language model based on the high-order language model.
  • The acquisition process of the high-order language model is: the server acquires a large number of basic phonemes, performs grammatical analysis on each basic phoneme, and obtains the relationship between each basic phoneme and the other basic phonemes. Based on the analysis result, the server connects each basic phoneme to its low-order basic phoneme with a fallback edge; the input symbol and output symbol on the fallback edge are empty, and the weight value on the fallback edge is the backoff weight corresponding to the basic phoneme. The server then connects an edge that takes the low-order basic phoneme of each basic phoneme as the starting point and the basic phoneme itself as the end point.
  • The input symbol and output symbol of this edge are the basic phoneme, and the weight value on the edge is the log probability corresponding to the basic phoneme.
  • The server uses the network formed by the basic phonemes, the edges between the basic phonemes, and the fallback edges as the high-order language model.
  • The basic phonemes are commonly used words or sentences in the Chinese language database.
  • the basic phoneme can be expressed as ngram.
  • the basic morpheme includes 1st order ngram, 2nd order ngram, 3rd order ngram, etc.
  • each basic morpheme has a state identifier (state id).
  • The circles representing the basic phonemes are connected by directed edges, and the input symbol, output symbol, and weight value are marked on each edge.
  • The server can obtain the number of edges corresponding to each basic phoneme, and then allocate memory for the high-order language model based on that number of edges, thereby avoiding construction failure of the high-order language model caused by insufficient memory.
  • While the server builds the high-order language model, every time a preset number (such as 10 million) of basic phonemes has been analyzed, the basic phonemes already written to memory are written to disk and cleared from memory, until all the basic phonemes have been analyzed. This method can greatly reduce the memory consumed by building the high-order language model.
  • In other words, the ngrams include first-order ngrams, second-order ngrams, ..., n-order ngrams, and so on. A first pass is performed over each ngram (grammatically described or analyzed), recording the state id corresponding to each ngram and the number of edges corresponding to each ngram state.
  • For any ngram, a backoff edge is used to connect it to its low-order ngram state.
  • The input symbol on the backoff edge is empty, and the output symbol is also empty.
  • The weight value on the backoff edge is the backoff weight corresponding to the current ngram.
  • Whenever a batch of ngram states has been built, these states can be written to disk and the corresponding information cleared from memory, until parsing of all ngrams is complete.
  • the memory consumption is about 200G. Compared with the existing high-level language model construction method, a lot of memory is saved.
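  • The construction described above can be illustrated with the following sketch; the entry format, the `<eps>` label, and the helper names are assumptions for illustration, not the application's actual data layout:

```python
from collections import defaultdict

def build_ngram_graph(entries):
    """Build a WFST-like graph from ngram entries.

    Each entry is (history, word, logprob, backoff_weight), where `history`
    and the extended ngram (history + word) identify states.  For every ngram,
    a word edge is added from its low-order (history) state to the ngram
    state, and a fallback edge (empty symbols, backoff weight) is added from
    the ngram state to its lower-order state.
    """
    state_id = defaultdict(lambda: len(state_id))    # ngram context -> state id
    edges = []                                       # (src, dst, in, out, weight)

    for history, word, logprob, backoff in entries:
        src = state_id[history]
        ngram = history + (word,)
        dst = state_id[ngram]
        # word edge: input and output symbols are the word, weight is the log probability
        edges.append((src, dst, word, word, logprob))
        # fallback edge from the ngram state to its lower-order state
        lower = state_id[ngram[1:]]
        edges.append((dst, lower, "<eps>", "<eps>", backoff))
    return state_id, edges
```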
  • Based on the generated high-order language model, the server performs order-reduction processing on the high-order language model, removing some basic phonemes of lower importance, to obtain the low-order language model.
  • the server can obtain a differential language model by performing a differential calculation on the high-order language model and the low-order language model.
  • the applied formulas are as follows:
  • logP_diff(w|h) = logP_high(w|h) - logP_low(w|h)   (1)
  • backoff_diff(h) = backoff_high(h) - backoff_low(h)   (2)
  • where logP_diff(w|h) is the log probability of the differential language model, logP_high(w|h) is the log probability of the high-order language model, logP_low(w|h) is the log probability of the low-order language model, and backoff_diff(h), backoff_high(h), and backoff_low(h) are the corresponding backoff (fallback) scores.
  • The premise for expressing the differential language model by the above formulas (1) and (2) is that the ngram set of the low-order language model is a subset of the ngram set of the high-order language model.
  • In that case the differential language model can be expressed in the backoff form of formulas (1) and (2). If the ngram set of the high-order language model is not a superset of that of the low-order language model, then when the high-order language model backs off, the low-order language model does not necessarily back off; the differential language model can then no longer be expressed in the backoff form of formulas (1) and (2), so some potentially erroneous calculations may be produced during decoding.
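  • As a minimal sketch of formula (1) only (the dictionary layout and function names below are assumptions, not part of this application), the differential score of a word given a history can be computed as the difference of the two models' backoff scores:

```python
def ngram_score(lm, history, word):
    """Log probability of `word` given `history` with standard n-gram backoff.
    lm["logprob"] maps (history, word) -> log probability;
    lm["backoff"] maps history -> backoff weight.  Histories are tuples."""
    if (history, word) in lm["logprob"]:
        return lm["logprob"][(history, word)]
    if not history:                       # even the unigram is missing
        return float("-inf")
    # back off: add the backoff weight and shorten the history
    return lm["backoff"].get(history, 0.0) + ngram_score(lm, history[1:], word)

def diff_score(lm_high, lm_low, history, word):
    """Formula (1): score of the differential language model."""
    return ngram_score(lm_high, history, word) - ngram_score(lm_low, history, word)
```

  • As noted above, this difference only behaves like a backoff language model when the low-order ngram set is a subset of the high-order ngram set.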
  • the terminal decodes the previous audio frame according to the first decoding network and the second decoding network to obtain a first token list.
  • When a voice signal is collected through a device such as a microphone, the terminal divides the voice signal into multiple audio frames according to a preset duration and decodes the audio frames frame by frame. Before decoding, the terminal first initializes the token list to obtain an initial token. The first state state.A of the initial token in the first decoding network is the initial state, and its second state state.B in the second decoding network is also the initial state; that is, the state pair <state.A, state.B> of the initial token is <0,0>, and the decoding score of the initial token is 0.
  • the terminal obtains the first token list corresponding to the previous audio frame by decoding multiple audio frames based on the initial token.
  • The first token list includes a plurality of first tokens obtained by decoding the previous audio frame.
  • Each first token includes the state pair formed by decoding in the different decoding networks and its decoding score. The state pair is used to characterize the correspondence between the first state in the first decoding network corresponding to the low-order language model and the second state in the second decoding network corresponding to the differential language model.
  • The terminal inputs the previous audio frame into the first decoding network, and traverses all the edges whose input is empty, starting from state.A of the initial token. For any edge whose input is empty: if the edge has no word point, state.B of the initial token remains unchanged; if the edge has a word point, the terminal obtains the decoding score tot_cost in the first decoding network and the word at the word point, and, taking the current state.B as the starting state, queries whether the second decoding network has an edge whose input symbol is the same as that word.
  • If such an edge exists, state.B jumps to the next state of that edge, giving the updated state.B. The updated state.A and the updated state.B form a new state pair, and the decoding path formed in the second decoding network is re-scored.
  • The sum of the re-scoring score and the decoding score in the first decoding network is used as the new tot_cost. The initial token is then updated with the new tot_cost and the new state pair <state.A, state.B>, and the updated token is added to an updated token list, which may be called the new token list (newtokenlist).
  • The above process is performed recursively for the tokens in the obtained newtokenlist until no new token is added to the newtokenlist; for tokens that form the same state pair, only the token with the smaller decoding score is kept.
  • The terminal then copies the tokens in the newtokenlist to the first token list, which may be denoted curtokenlist, and clears the tokens in the newtokenlist.
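  • A rough sketch of that recursion follows, reusing the Token sketch above; the `epsilon_edges` and `rescore` helpers for the two decoding networks are hypothetical names, not the application's API:

```python
def expand_epsilon(cur_tokens, wfst_a, wfst_b):
    """Follow input-empty edges in the first decoding network recursively,
    re-scoring word-point edges in the second decoding network."""
    new_tokens = {}                 # state pair -> best (lowest-score) token
    work = list(cur_tokens)
    while work:
        tok = work.pop()
        for edge in wfst_a.epsilon_edges(tok.state_a):
            state_b, rescore = tok.state_b, 0.0
            if edge.word is not None:                  # edge with a word point
                state_b, rescore = wfst_b.rescore(tok.state_b, edge.word)
            new = Token(edge.next_state, state_b,
                        tok.tot_cost + edge.weight + rescore)
            key = (new.state_a, new.state_b)
            # for the same state pair, keep only the token with the smaller score
            if key not in new_tokens or new.tot_cost < new_tokens[key].tot_cost:
                new_tokens[key] = new
                work.append(new)                       # recurse on the new token
    return list(new_tokens.values())
```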
  • The process by which the terminal decodes the previous audio frame according to the first decoding network and the second decoding network to obtain the first token list is the same as the process by which the terminal decodes the current audio frame to obtain the second token list; for details, refer to the following process. The difference is that decoding the previous audio frame is based on the audio frame before it, while decoding the current audio frame is based on the previous audio frame.
  • the terminal obtains the target token corresponding to the minimum decoding score from the first token list.
  • the terminal obtains the optimal token with the smallest decoding score from the first token list according to the decoding score, and the optimal token is the target token.
  • the terminal determines pruning parameters when decoding the current audio frame according to the target token and the acoustic vector of the current audio frame.
  • The pruning parameters include a first pruning parameter, a second pruning parameter, a third pruning parameter, and so on.
  • The first pruning parameter, which can be denoted curcutoff, is used to determine, before decoding, whether to skip any first token in the first token list.
  • The second pruning parameter, which can be denoted am_cutoff, is used to determine whether to skip any first token in the first token list when decoding in the first decoding network.
  • The third pruning parameter, which can be denoted nextcutoff, is used to determine whether to skip any first token in the first token list when decoding in the second decoding network.
  • the terminal determines the pruning parameters when performing voice decoding on the current audio frame according to the target token and the acoustic vector of the current audio frame, as follows:
  • the terminal obtains a decoding score corresponding to the target token, and determines the first pruning parameter according to the decoding score corresponding to the target token and a preset value.
  • the terminal obtains the decoding score corresponding to the target token, and determines the sum of the decoding score corresponding to the target token and the preset value as the first pruning parameter.
  • the preset value can be set by the R&D personnel, and the preset value is generally 10.
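  • In sketch form (the variable names and the token list are illustrative assumptions; the beam value follows the preset value mentioned above):

```python
# first pruning parameter: best score in the previous frame plus a beam
beam = 10.0                                            # preset value, typically 10
best_token = min(cur_token_list, key=lambda t: t.tot_cost)   # target token
cur_cutoff = best_token.tot_cost + beam                # first pruning parameter
```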
  • The terminal inputs the acoustic vector into the first decoding network, takes the first state of the target token as the starting state, traverses the input non-empty edges in the first decoding network, and updates the second initial pruning parameter according to the first decoding path formed by each input non-empty edge, to obtain the second pruning parameter.
  • The second initial pruning parameter is generally set to infinity.
  • The update process of the second initial pruning parameter is: the terminal takes the first state state.A of the target token as the starting state and traverses the input non-empty edges; for the first decoding path formed by any input non-empty edge, the terminal obtains the first acoustic score of the first decoding path under the acoustic model, and updates the second initial pruning parameter according to the score determined by the decoding score corresponding to the target token, the first acoustic score, and the preset value, to obtain the second pruning parameter.
  • Specifically, the terminal obtains the total score of the decoding score corresponding to the target token, twice the first acoustic score, and the preset value. If the total score is less than the second initial pruning parameter, the second initial pruning parameter is updated to the total score; if the total score is greater than the second initial pruning parameter, it is not updated.
  • When the second initial pruning parameter has been updated against the first decoding paths formed by all input non-empty edges, the second pruning parameter is finally obtained.
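  • An illustrative sketch of this update, assuming hypothetical `nonempty_edges` and `acoustic_score` helpers and the variables from the previous sketch:

```python
am_cutoff = float("inf")            # second initial pruning parameter
for edge in wfst_a.nonempty_edges(best_token.state_a):
    acoustic = acoustic_score(acoustic_vector, edge.pdf_id)   # first acoustic score
    total = best_token.tot_cost + 2.0 * acoustic + beam
    if total < am_cutoff:
        am_cutoff = total           # after the loop: second pruning parameter
```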
  • The terminal re-scores, in the second decoding network, the first decoding path formed by each input non-empty edge, and updates the third initial pruning parameter according to the re-scoring result to obtain the third pruning parameter.
  • the third initial pruning parameter is generally set to infinity.
  • the update process of the third initial pruning parameters is:
  • the terminal obtains the first acoustic score of the first decoding path under the acoustic model.
  • the terminal obtains the first path score of the first decoding path in the first decoding network.
  • the terminal adds the weight value on each edge in the first decoding path to obtain the first path score.
  • the terminal obtains the first re-scoring score of the first decoding path in the second decoding network.
  • This step includes the following situations:
  • When there is no word point on the first decoding path formed by an input non-empty edge, the terminal has no word to query. In this case, re-scoring in the second decoding network is unnecessary,
  • and the first re-scoring score in the second decoding network is 0.
  • When there is a word point on the first decoding path formed by an input non-empty edge, the terminal obtains the word at the word point and, taking the second state of the target token as the starting state, queries whether the second decoding network has an edge whose input symbol is the same as that word. If no such edge is found in the second decoding network, the terminal backs off through the fallback edge and continues the query from the backed-off state; the weights accumulated along the traversed edges give the first re-scoring score.
  • The terminal updates the third initial pruning parameter according to the score determined by the decoding score corresponding to the target token, the first acoustic score, the first path score, the first re-scoring score, and the preset value, to obtain the third pruning parameter.
  • the terminal obtains the total score of the decoding score, the first acoustic score, the first path score, the first re-scoring score, and the preset value corresponding to the target token. If the total score is less than the third initial pruning parameter, the third The initial pruning parameter is updated to the total score. If the total score is greater than the third initial pruning parameter, the third initial pruning parameter is not updated. When the first decoding path formed by using all input non-empty edges continuously updates the third initial pruning parameter, the third pruning parameter can be finally obtained.
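  • An illustrative sketch of the third-parameter update, under the same assumptions as the previous sketches (the `nonempty_edges`, `acoustic_score`, and `rescore` helpers are hypothetical):

```python
next_cutoff = float("inf")          # third initial pruning parameter
for edge in wfst_a.nonempty_edges(best_token.state_a):
    acoustic = acoustic_score(acoustic_vector, edge.pdf_id)   # first acoustic score
    path = edge.weight              # first path score: weights along the path
    rescore = 0.0
    if edge.word is not None:       # only word-point edges are re-scored
        _, rescore = wfst_b.rescore(best_token.state_b, edge.word)
    total = best_token.tot_cost + acoustic + path + rescore + beam
    if total < next_cutoff:
        next_cutoff = total         # after the loop: third pruning parameter
```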
  • the terminal decodes the current audio frame according to the first token list, pruning parameters, and acoustic vector.
  • the terminal inputs the acoustic vector into the first decoding network, and traverses each first token in the first token list.
  • the terminal determines whether to skip the first token according to the decoding score and the first pruning parameter corresponding to the first token.
  • The terminal obtains the decoding score corresponding to the first token and compares it with the first pruning parameter. If the decoding score of the first token is greater than the first pruning parameter, the first token is skipped; if the decoding score of the first token is less than the first pruning parameter, it is determined that the first token is executed.
  • When it is determined to execute the first token, the terminal takes the first state of the first token as the starting state, traverses the input non-empty edges in the first decoding network, and determines whether to skip the first token according to the second decoding path formed by each input non-empty edge and the second pruning parameter.
  • Specifically, the terminal traverses the input non-empty edges in the first decoding network starting from the first state of the first token. For the second decoding path formed by any input non-empty edge, the terminal obtains the second acoustic score of the second decoding path under the acoustic model, and compares the score determined by the decoding score corresponding to the first token and the second acoustic score with the second pruning parameter. If that score is greater than the second pruning parameter, the first token is skipped; otherwise, the first token is executed.
  • That is, the terminal obtains the total score of the decoding score corresponding to the first token and twice the second acoustic score; if the total score is greater than the second pruning parameter, the first token is skipped, and if the total score is less than the second pruning parameter, the first token is executed.
  • When executing the first token, the terminal also updates the second pruning parameter. Specifically, the terminal obtains the total score of the decoding score corresponding to the first token, twice the second acoustic score, and the preset value. If the total score is less than the second pruning parameter, the second pruning parameter is updated to the total score; if the total score is greater than the second pruning parameter, it is not updated. After the second pruning parameter is updated, subsequent decisions on whether to skip any first token are based on the updated second pruning parameter.
  • the terminal takes the second state of the first token as the starting state, and re-scores the second decoding path formed by each input non-empty edge in the second decoding network, according to The re-scoring result and the third pruning parameter determine whether to skip the first token.
  • When it is determined according to the second pruning parameter that the first token is executed, the terminal takes the second state of the first token as the starting state and re-scores, in the second decoding network, the second decoding path formed by each input non-empty edge.
  • the terminal obtains the second acoustic score of the second decoding path under the acoustic model.
  • The terminal obtains the second path score of the second decoding path in the first decoding network. If the score determined by the decoding score of the first token, the second path score, and the second acoustic score is greater than the third pruning parameter, the first token is skipped; otherwise, the first token is executed.
  • That is, the terminal obtains the total score of the first token's decoding score, the second path score, and the second acoustic score. If the total score is greater than the third pruning parameter, the first token is skipped; if the total score is less than the third pruning parameter, the first token is executed.
  • When executing the first token, the terminal also obtains the total score of the first token's decoding score, the second path score, the second acoustic score, and the preset value. If this total score is less than the third pruning parameter, the third pruning parameter is updated to the total score; if it is greater than the third pruning parameter, the third pruning parameter is not updated.
  • After the third pruning parameter is updated, subsequent decisions on whether to skip any first token are based on the updated third pruning parameter.
  • Similarly, the terminal obtains the second acoustic score of the second decoding path under the acoustic model and the second path score of the second decoding path in the first decoding network. When there is a word point on the second decoding path formed by the input non-empty edge, the terminal obtains the word at the word point and, taking the second state of the first token as the starting state, queries whether the second decoding network has an edge whose input symbol is the same as that word. If no such edge is found in the second decoding network, the terminal backs off through the fallback edge, thereby obtaining the second re-scoring score.
  • Based on the acquired second re-scoring score, the terminal obtains the score determined by the first token's decoding score, second path score, second acoustic score, and second re-scoring score. If this score is greater than the third pruning parameter, the first token is skipped; otherwise, the first token is executed.
  • When executing the first token, the terminal obtains the total score determined by the first token's decoding score, the second path score, the second acoustic score, the second re-scoring score, and the preset value. If the total score is less than the third pruning parameter, the third pruning parameter is updated to the total score; if the total score is greater than the third pruning parameter, it is not updated. After the third pruning parameter is updated, subsequent decisions on whether to skip any first token are based on the updated third pruning parameter.
  • the terminal acquires the second token by performing a state jump on the first token, and the second token includes the updated status pair and its decoding score.
  • When it is determined according to the third pruning parameter that the first token is executed, the terminal performs a state jump on the first state of the first token according to the traversal result in the first decoding network to obtain an updated first state, and performs a state jump on the second state of the first token according to the re-scoring result in the second decoding network to obtain an updated second state. The updated first state and the updated second state form the state pair of the second token.
  • The decoding score corresponding to the second token is determined according to the decoding score corresponding to the first token, the path score in the first decoding network, the re-scoring score in the second decoding network, and the second acoustic score under the acoustic model.
  • the state transition of the first token includes the following situations:
  • When there is no word-point edge on the second decoding path formed by the input non-empty edge, the terminal jumps the first state of the first token to the next state of the input non-empty edge, and the second state of the first token remains unchanged.
  • When there is a word-point edge on the second decoding path formed by the input non-empty edge, the terminal jumps the first state of the first token to the next state of the word-point edge, and jumps the second state of the first token to the next state of the edge in the second decoding network whose input symbol is the same as the output word.
  • the terminal composes the second token corresponding to each first token into a second token list.
  • After obtaining a second token, the terminal adds the second token to the second token list, until every first token in the first token list has been traversed.
  • the second token list is the curtokenlist of the current audio frame.
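  • Putting the three pruning checks and the state jump together, a per-frame decoding loop might look like the following sketch (all helper names are assumptions for illustration, not the application's implementation):

```python
def decode_frame(cur_token_list, acoustic_vector, cuts, wfst_a, wfst_b, beam=10.0):
    """One-frame decoding sketch using the three pruning parameters."""
    cur_cutoff, am_cutoff, next_cutoff = cuts
    next_tokens = []                                     # second token list
    for tok in cur_token_list:
        if tok.tot_cost > cur_cutoff:                    # first pruning check
            continue
        for edge in wfst_a.nonempty_edges(tok.state_a):
            acoustic = acoustic_score(acoustic_vector, edge.pdf_id)
            if tok.tot_cost + 2.0 * acoustic > am_cutoff:    # second pruning check
                continue
            am_cutoff = min(am_cutoff, tok.tot_cost + 2.0 * acoustic + beam)
            state_b, rescore = tok.state_b, 0.0
            if edge.word is not None:                    # word point: re-score in WFST B
                state_b, rescore = wfst_b.rescore(tok.state_b, edge.word)
            total = tok.tot_cost + acoustic + edge.weight + rescore
            if total > next_cutoff:                      # third pruning check
                continue
            next_cutoff = min(next_cutoff, total + beam)
            # state jump: the new state pair and decoding score form the second token
            next_tokens.append(Token(edge.next_state, state_b, total))
    return next_tokens
```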
  • the terminal determines the second token with the smallest decoding score in the second token list as the decoding result of the current audio frame.
  • Based on the obtained second token list, the terminal obtains the second token with the smallest decoding score from the second token list, and determines that second token as the decoding result of the current audio frame.
  • The terminal can also dynamically expand the decoding path through the dictionary and then apply the language model to perform dynamic re-scoring and pruning. In this way, there is no need to generate TLG resources that combine the dictionary and the language model; the demand for resources is small, and only the G.fst network of the relevant language model needs to be generated.
  • Figure 5 is a schematic diagram of the decoding process using the decoding networks corresponding to different language models (for example, a TLG.fst built from the low-order language model).
  • After re-scoring, the re-scoring score of "weather" is 0.1, and the total score of the decoding path 0-1-2-4-6 is 2.1.
  • Since the total score 2.1 of "Today's Weather" is less than the total score 2.2 of "Today's Apocalypse", the final output decoding result is "Today's Weather".
  • the terminal may also send the collected voice data to the server, and obtain the voice decoding result from the server without directly decoding.
  • The method provided in the embodiment of the present application does not need to generate a decoding network corresponding to the high-order language model; it performs decoding based on the decoding networks corresponding to the low-order language model and the differential language model, which saves computing resources and storage resources on the premise of ensuring decoding accuracy. In addition, the current audio frame is decoded according to the decoding result of the previous audio frame, which improves the decoding speed.
  • an embodiment of the present application provides a voice decoding device, which includes:
  • the obtaining module 601 is configured to obtain the target token corresponding to the minimum decoding score from the first token list.
  • the first token list includes multiple first tokens obtained by decoding the previous audio frame;
  • each first token includes the state pair formed by decoding in the different decoding networks and its decoding score, the state pair being used to characterize the correspondence between the first state in the first decoding network corresponding to the low-order language model and the second state in the second decoding network corresponding to the differential language model;
  • the determining module 602 is used to determine pruning parameters when decoding the current audio frame according to the target token and the acoustic vector of the current audio frame, and the pruning parameters are used to constrain the decoding process of the current audio frame;
  • the decoding module 603 is configured to decode the current audio frame according to the first token list, pruning parameters, and acoustic vectors.
  • the determining module 602 is used to: obtain the decoding score corresponding to the target token, and determine the first pruning parameter according to the decoding score corresponding to the target token and a preset value; input the acoustic vector into the first decoding network, take the first state of the target token as the starting state, traverse the input non-empty edges in the first decoding network, and update the second initial pruning parameter according to the first decoding path formed by each input non-empty edge to obtain the second pruning parameter; and, taking the second state of the target token as the starting state, re-score in the second decoding network the first decoding path formed by each input non-empty edge, and update the third initial pruning parameter according to the re-scoring result to obtain the third pruning parameter.
  • the determining module 602 is configured to obtain the first acoustic score of the first decoding path under the acoustic model for the first decoding path formed by any input non-empty edge;
  • the second initial pruning parameter is updated according to the score determined by the decoding score corresponding to the target token, the first acoustic score, and the preset value to obtain the second pruning parameter.
  • the determining module 602 is configured to: for the first decoding path formed by any input non-empty edge, obtain the first acoustic score of the first decoding path under the acoustic model; obtain the first path score of the first decoding path in the first decoding network; obtain the first re-scoring score of the first decoding path in the second decoding network; and update the third initial pruning parameter according to the score determined by the decoding score corresponding to the target token, the first acoustic score, the first path score, the first re-scoring score, and the preset value, to obtain the third pruning parameter.
  • the decoding module 603 is used to: input the acoustic vector into the first decoding network and traverse each first token in the first token list; for any first token, determine whether to skip the first token according to the decoding score corresponding to the first token and the first pruning parameter; when it is determined to execute the first token, take the first state of the first token as the starting state, traverse the input non-empty edges in the first decoding network, and determine whether to skip the first token according to the second decoding path formed by each input non-empty edge and the second pruning parameter; when it is determined to execute the first token, take the second state of the first token as the starting state, re-score in the second decoding network the second decoding path formed by each input non-empty edge, and determine whether to skip the first token according to the re-scoring result and the third pruning parameter; and when it is determined to execute the first token, obtain a second token by performing a state jump on the first token, the second token including the updated state pair and its decoding score.
  • the decoding module 603 is used to: for the second decoding path formed by any input non-empty edge, obtain the second acoustic score of the second decoding path under the acoustic model; and if the score determined by the decoding score corresponding to the first token and the second acoustic score is greater than the second pruning parameter, skip the first token, and otherwise execute the first token.
  • the device further includes:
  • the update module is configured to update the second pruning parameter if the score determined according to the decoding score corresponding to the first token, the second acoustic score, and the preset value is less than the second pruning parameter.
  • the decoding module 603 is configured to: for the second decoding path formed by any input non-empty edge, obtain the second acoustic score of the second decoding path under the acoustic model; when there is no word point on the input non-empty edge, obtain the second path score of the second decoding path in the first decoding network; and if the score determined by the decoding score of the first token, the second path score, and the second acoustic score is greater than the third pruning parameter, skip the first token, and otherwise execute the first token.
  • the device further includes:
  • the update module is configured to update the third pruning parameter if the score determined according to the decoding score of the first token, the second path score, the second acoustic score, and the preset value is less than the third pruning parameter.
  • the decoding module 603 is configured to: for the second decoding path formed by any input non-empty edge, obtain the second acoustic score of the second decoding path under the acoustic model; when there is a word point on the input non-empty edge, obtain the second path score of the second decoding path in the first decoding network and the second re-scoring score of the second decoding path in the second decoding network; and if the score determined by the decoding score of the first token, the second path score, the second acoustic score, and the second re-scoring score is greater than the third pruning parameter, skip the first token, and otherwise execute the first token.
  • the device further includes:
  • the update module is configured to update the third pruning parameter if the score determined by the decoding score of the first token, the second path score, the second acoustic score, the second re-scoring score, and the preset value is less than the third pruning parameter.
  • the decoding module 603 is configured to: perform a state jump on the first state of the first token according to the traversal result in the first decoding network to obtain an updated first state; perform a state jump on the second state of the first token according to the re-scoring result in the second decoding network to obtain an updated second state; form the state pair of the second token from the updated first state and the updated second state; and determine the decoding score corresponding to the second token according to the decoding score corresponding to the first token, the path score in the first decoding network, the re-scoring score in the second decoding network, and the second acoustic score under the acoustic model.
  • The device provided by the embodiment of the present application does not need to generate a decoding network corresponding to the high-order language model; it performs decoding based on the decoding networks corresponding to the low-order language model and the differential language model, which saves computing resources and storage resources on the premise of ensuring decoding accuracy. In addition, the current audio frame is decoded according to the decoding result of the previous audio frame, which improves the decoding speed.
  • FIG. 7 shows a structural block diagram of a computer device 700 provided by an exemplary embodiment of the present application.
  • The computer device 700 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, or a desktop computer.
  • the computer device 700 may also be referred to as other names such as user equipment, portable terminal, laptop terminal, and desktop terminal.
  • the computer device 700 includes a processor 701 and a memory 702.
  • the processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • The processor 701 can be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 701 may also include a main processor and a coprocessor.
  • The main processor is a processor for processing data in the awake state, also known as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 701 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used to render and draw content that needs to be displayed on the display screen.
  • the processor 701 may further include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 702 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 702 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 702 is used to store at least one instruction, and the at least one instruction is executed by the processor 701 to implement the speech decoding method provided by the method embodiments in the present application.
  • the computer device 700 may optionally further include: a peripheral device interface 703 and at least one peripheral device.
  • the processor 701, the memory 702, and the peripheral device interface 703 may be connected by a bus or a signal line.
  • Each peripheral device may be connected to the peripheral device interface 703 through a bus, a signal line, or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 704, a touch display screen 705, a camera 706, an audio circuit 707, a positioning component 708, and a power supply 709.
  • the peripheral device interface 703 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 701 and the memory 702.
  • in some embodiments, the processor 701, the memory 702, and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral device interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 704 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 704 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 704 converts the electrical signal into an electromagnetic signal for transmission, or converts the received electromagnetic signal into an electrical signal.
  • the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 704 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity, wireless fidelity) networks.
  • the radio frequency circuit 704 may further include a circuit related to NFC (Near Field Communication), which is not limited in this application.
  • the display screen 705 is used to display a UI (User Interface).
  • the UI may include graphics, text, icons, video, and any combination thereof.
  • the display screen 705 also has the ability to collect touch signals on or above the surface of the display screen 705.
  • the touch signal can be input to the processor 701 as a control signal for processing.
  • the display screen 705 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • in some embodiments, there may be one display screen 705, which is disposed on the front panel of the computer device 700; in other embodiments, there may be at least two display screens 705, which are respectively disposed on different surfaces of the computer device 700 or adopt a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved surface or a folding surface of the computer device 700. The display screen 705 may even be set to a non-rectangular irregular shape, that is, a special-shaped screen.
  • the display screen 705 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode, organic light emitting diode) and other materials.
  • the camera component 706 is used to collect images or videos.
  • the camera assembly 706 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • in some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blur function, and the main camera and the wide-angle camera are fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions.
  • the camera assembly 706 may also include a flash.
  • the flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
  • the audio circuit 707 may include a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 701 for processing, or input them to the radio frequency circuit 704 to implement voice communication.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert the electrical signal from the processor 701 or the radio frequency circuit 704 into sound waves.
  • the speaker can be a traditional thin-film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 707 may further include a headphone jack.
  • the positioning component 708 is used to locate the current geographic location of the computer device 700 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 709 is used to supply power to various components in the computer device 700.
  • the power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
  • the rechargeable battery may support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • the computer device 700 further includes one or more sensors 710.
  • the one or more sensors 710 include, but are not limited to: an acceleration sensor 711, a gyro sensor 712, a pressure sensor 713, a fingerprint sensor 714, an optical sensor 715, and a proximity sensor 716.
  • the acceleration sensor 711 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the computer device 700.
  • the acceleration sensor 711 may be used to detect components of gravity acceleration on three coordinate axes.
  • the processor 701 may control the touch screen 705 to display the user interface in a landscape view or a portrait view according to the gravity acceleration signal collected by the acceleration sensor 711.
  • the acceleration sensor 711 can also be used for game or user movement data collection.
  • the gyro sensor 712 can detect the body direction and rotation angle of the computer device 700, and the gyro sensor 712 can cooperate with the acceleration sensor 711 to collect a 3D motion of the user on the computer device 700.
  • the processor 701 can realize the following functions according to the data collected by the gyro sensor 712: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 713 may be disposed on the side frame of the computer device 700 and/or the lower layer of the touch display 705.
  • the pressure sensor 713 can detect the user's grip signal on the computer device 700, and the processor 701 can perform left-right hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 713.
  • when the pressure sensor 713 is disposed on the lower layer of the touch display 705, the processor 701 controls an operable control on the UI according to the user's pressure operation on the touch display 705.
  • the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 714 is used to collect the user's fingerprint, and the processor 701 identifies the user's identity based on the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the user's identity based on the collected fingerprint. When the user's identity is recognized as a trusted identity, the processor 701 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings.
  • the fingerprint sensor 714 may be provided on the front, back, or side of the computer device 700. When a physical button or a manufacturer's logo is provided on the computer device 700, the fingerprint sensor 714 may be integrated with the physical button or the manufacturer's logo.
  • the optical sensor 715 is used to collect the ambient light intensity.
  • the processor 701 can control the display brightness of the touch display 705 according to the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 705 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 705 is decreased.
  • the processor 701 can also dynamically adjust the shooting parameters of the camera assembly 706 according to the ambient light intensity collected by the optical sensor 715.
  • the proximity sensor 716, also called a distance sensor, is usually provided on the front panel of the computer device 700.
  • the proximity sensor 716 is used to collect the distance between the user and the front of the computer device 700.
  • when the proximity sensor 716 detects that the distance between the user and the front of the computer device 700 gradually decreases, the processor 701 controls the touch display 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front of the computer device 700 gradually increases, the processor 701 controls the touch display 705 to switch from the screen-off state to the screen-on state.
  • those skilled in the art can understand that the structure shown in FIG. 7 does not constitute a limitation on the computer device 700, and the computer device may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • the computer device provided in the embodiments of the present application does not need to generate a decoding network corresponding to a high-order language model, and performs decoding based on the decoding network corresponding to a low-order language model and a differential language model, which saves computing resources and storage resources while ensuring decoding accuracy. In addition, the current audio frame is decoded according to the decoding result of the previous audio frame, which improves the decoding speed.
  • an embodiment of the present application provides a non-volatile computer-readable storage medium, in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor to implement the speech decoding method provided by the foregoing embodiments.
  • with the computer-readable storage medium, there is no need to generate a decoding network corresponding to a high-order language model; decoding is performed based on the decoding network corresponding to a low-order language model and a differential language model, which saves computing resources and storage resources while ensuring decoding accuracy. In addition, the current audio frame is decoded according to the decoding result of the previous audio frame, which improves the decoding speed.
  • when the speech decoding apparatus provided in the above embodiments performs speech decoding, the division into the above functional modules is merely used as an example for description. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the speech decoding apparatus is divided into different functional modules to complete all or some of the functions described above.
  • the speech decoding apparatus and the speech decoding method embodiments provided in the above embodiments belong to the same concept. For the specific implementation process, see the method embodiments, and details are not described here.
  • a person of ordinary skill in the art can understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A speech decoding method, executed by a computer device, where the speech includes a current audio frame and a previous audio frame. The method includes: obtaining, from a first token list, a target token corresponding to a minimum decoding score (403), where the first token list includes first tokens obtained by decoding the previous audio frame in different decoding networks, each first token includes a state pair and a decoding score, and the state pair represents a correspondence between a first state of the first token in a first decoding network corresponding to a low-order language model and a second state in a second decoding network corresponding to a differential language model; determining, according to the target token and an acoustic vector of the current audio frame, pruning parameters used when decoding the current audio frame (404); and decoding the current audio frame according to the first token list, the pruning parameters, and the acoustic vector (405).

Description

语音解码方法、装置、计算机设备及存储介质
本申请要求于2018年12月14日提交中国专利局,申请号为201811536173X、发明名称为“语音解码方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音识别技术领域,特别涉及一种语音解码方法、装置、计算机设备及存储介质。
背景技术
语音识别技术也称为ASR(Automatic Speech Recognition,自动语音识别),其目标是将人类的语音中的词汇内容转换为计算机可读的输入,包括按键、二进制编码或者字符序列等,从而实现人机交互。语音识别技术在现代生活中具有广泛的应用场景,可应用于车载导航、智能家居、语音拨号、同声传译等场景中。解码器作为语音识别系统的核心,基于解码器的语音解码过程在整个语音识别过程中发挥着重要作用,直接影响着识别结果的准确性。
目前,基于解码器的语音解码过程为:获取高阶语言模型,并采用通用的openfst工具在高阶语言模型上生成解码网络,进而基于该解码网络进行语音解码。
然而,高阶语言模型的内存较大,基于高阶语言模型所生成的解码网络的内存又比高阶语言模型的内存大的多,这就需要配置大量的存储资源及计算资源,在存储资源及计算资源有限的场景下,很难实现解码,因此,亟需一种兼顾解码速度及解码精度的语音解码方法。
发明内容
根据本申请提供的各种实施例,提供一种语音解码方法、装置、计算机设备和存储介质。
一种语音解码方法,由计算机设备执行,所述语音包括当前音频帧和上 一音频帧;所述方法包括:
从第一令牌列表中获取最小解码分数对应的目标令牌,所述第一令牌列表包括在不同解码网络中对所述上一音频帧进行解码得到的多个第一令牌,所述第一令牌包括状态对和解码分数,所述状态对用于表征所述第一令牌在低阶语言模型对应的第一解码网络中的第一状态与在差分语言模型对应的第二解码网络中的第二状态之间的对应关系;
根据所述目标令牌和所述当前音频帧的声学向量,确定对所述当前音频帧进行解码时的剪枝参数,所述剪枝参数用于对所述当前音频帧的解码过程进行约束;及
根据所述第一令牌列表、所述剪枝参数及所述声学向量,对所述当前音频帧进行解码。
一种语音解码装置,由计算机设备执行,所述语音包括当前音频帧和上一音频帧;所述装置包括:
获取模块,用于从第一令牌列表中获取最小解码分数对应的目标令牌,所述第一令牌列表包括在不同解码网络中对所述上一音频帧进行解码得到的多个第一令牌,所述第一令牌包括状态对和解码分数,所述状态对用于表征所述第一令牌在低阶语言模型对应的第一解码网络中的第一状态和差分语言模型对应的第二解码网络中的第二状态之间的对应关系;
确定模块,用于根据所述目标令牌和所述当前音频帧的声学向量,确定对所述当前音频帧进行解码时的剪枝参数,所述剪枝参数用于对所述当前音频帧的解码过程进行约束;及
解码模块,用于根据所述第一令牌列表、所述剪枝参数及所述声学向量,对所述当前音频帧进行解码。
一种计算机设备,包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行所述语音解码方法的步骤。
一种非易失性的计算机可读存储介质,所述存储介质中存储有计算机可 读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行所述语音解码方法的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种语音解码方法所涉及的实施环境;
图2是现有的语音解码方法的解码原理图;
图3是本申请实施例提供的一种语音解码方法的解码原理图;
图4是本申请实施例提供的一种语音解码方法的流程图;
图5是本申请实施例提供的一种语音解码过程的示意图;
图6是本申请实施例提供的一种语音解码装置结构示意图;
图7示出了本申请一个示例性实施例提供的计算机设备具体实现为终端时的结构框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
首先,对本申请中涉及的符号进行说明。
<eps>:代表空符号;
Ilabel:代表输入符号;
Olable:代表输出符号;
<s>:代表起始符号;
State.A:表示令牌在低阶语言模型对应的第一解码网络中的状态;
State.B:表示令牌在差分语言模型对应的第二解码网络中的状态。
接着,对本申请中涉及的重要名词进行解释。
1、WFST(Weighted Finaite-State Transducer,加权有限状态机)用于大规模的语音识别,其状态的变化可用输入符号和输出符号标记。
2、令牌(即token)是记录解码过程中某一时刻某个状态上的得分和信息的数据结构。从加权有限状态机的初始状态开始,令牌沿着具有方向的边进行转移,在转移过程中状态的变化可通过输入符号的变化体现。在从初始状态向终止状态的状态传递过程中,令牌中记录一系列的状态和边组成的路径。
3、HCLG.fst为一种解码网络,由四个fst经过一系列算法组合而成。四个fst分别是H.fst、C.fst、L.fst和G.fst。
其中,G表示语言模型,其输入和输出的类型相同。语言模型是语言结构(包括词语、句子之间的规律,例如语法、词语的常用搭配等)的表示,其概率用于表示语言单元的序列在一段语音信号中出现的概率。
L表示发音词典,其输入的是monophone(音素),输出的是词。发音词典中包含单词集合及其发音等。
C表示上下文相关,其输入的是triphone(三音子),输出的是monophone。上下文相关用于表示从三音子到音素之间的对应关系。
H表示声学模型,为对声学、语言学、环境的变量、说话人性别、口音等的差异化表示。声学模型包括基于HMM(Hidden Markov Model,隐马尔可夫模型)的声学模型,例如,GMM-HMM(Gaussian Mixture Model-Hidden Markov Model,高斯混合模型—隐马尔可夫模型)、DNN-HMM(Deep Neural Network-Hidden Markov Model,深度神经网络-隐马尔可夫模型)等,还包括基于端到端的声学模型,例如,CTC-LSTM(Connectionist Temporal Classification—Long Short-Term Memory,连接时序分类-长短时记忆)等。声学模型的每个状态表示语音单元的语音特征在该状态的概率分布,并通过状态与状态之间的转移连接成一条有序的状态序列。
4、解码网络也称为搜索空间,使用WSFT融合的各种知识源,包括语言模型、声学模型、上下文相关模型、发音词典模型中的至少一种,例如,包括L和G组成的单因素解码网络,记为LG;包括C、L、G组成的C-level 解码网络,记为CLG网络;采用隐马尔可夫模型表示的HCLG网络。
5、出词点表示有汉字输出的位置。
6、实时率表示解码时间占音频时间的比例。
接着,对本申请涉及的语音识别系统进行介绍。
语音识别系统用于语音识别,主要包括预处理模块、特征提取模块、声学模型训练、语言模型训练模块及解码器等。
其中,预处理模块用于对输入的原始语音信号进行处理,滤除掉不重要的信息以及背景噪声,并对语音信号的进行端点检测(找出语音信号的始末)、语音分帧(近似认为在10-30ms内是语音信号是短时平稳的,将语音信号分割为一段一段进行分析)以及预加重(提升高频部分)等处理。
特征提取模块用于去除语音信号中对于语音识别无用的冗余信息,保留能够反映语音本质特征的信息,并采用并用一定的形式表示出来。特征提取模块,也即是提取反映语音信号特征的关键特征参数形成特征矢量序列,以便用于后续处理。
声学模型训练模块用于根据训练语音库的特征参数训练出声学模型参数。在识别时可以将待识别的语音的特征参数同声学模型进行匹配,得到识别结果。目前的主流语音识别系统多采用隐马尔可夫模型HMM进行声学模型建模。
语言模型训练模块用于对训练文本数据库进行语法、语义分析,经过基于统计模型训练得到语言模型。语言模型的训练方法主要有基于规则模型和基于统计模型两种方法。语言模型实际上是计算任一句子出现概率的概率模型。语言模型的建立过程能够有效的结合汉语语法和语义的知识,描述词之间的内在关系,基于所训练的语言模型进行识别时,能够提高识别率,减少搜索范围。
解码器能够在语音识别过程中,针对输入的语音信号,根据己经训练的声学模型、语言模型及发音字典所构建解码网络,采用搜索算法在解码网络中搜寻最佳路径。解码器所搜索的最佳路径能够以最大概率输出该语音信号的词串,从而可确定出语音信号中包括的词汇内容。
在本申请实施例中,解码器的硬件环境包括:2个14核CPU(E5-2680v4), 256G内存,Raid(磁盘阵列),2*300 SAS,6*800G SSD(固态硬盘),2*40G网口(光口,多模),8*GPU,2.4GHz.每块GPU型号是Tesla M40 24GB显卡。
接着,对本申请的应用实施场景进行介绍。
本申请实施例提供的语音解码方法可应用于需要使用语音识别功能的各种场景,例如,智能家居场景、语音输入场景、车载导航场景、同声传译场景等。在上述应用场景下,本申请实施例所涉及的实施环境可以包括终端101和服务器102。
其中,终端101可以为智能手机、笔记本电脑、平板电脑等设备,该终端101可预先从服务器102上获取用于进行语音识别的相关数据,并将所获取到的数据存储在存储器中,当通过麦克风等设备采集到语音信号后,终端101内的处理器调用存储器中所存储的数据,对采集到的语音信号进行语音解码;终端101还可安装有具有语音识别功能的应用,当通过麦克风等设备采集到语音信号后,基于该安装的应用将所采集的语音信号上传至服务器102,由服务器102进行语音解码,从而获得相应的语音服务。
服务器102中配置有语音识别系统,从而能够向终端101提供语音识别服务。
接着,对比本申请与现有的语音解码过程的区别。
图2为相关技术进行语音解码过程的示意图,参见图2,相关技术将低阶语言模型对应的第一解码网络记为WFST A,并将token在WFST A中的状态记为State.A,将高阶语言模型对应的第二解码网络记为WFST B,并将token在WFST B中的状态记为State.B。相关技术采用cohyps(伴生假设集合)记录State.A在WFST B中的不同的假设及这些假设对应的状态。通常高阶语言模型的状态数比低阶语言模型的状态数多出几个数量级,低阶语言模型同样一个状态可能会对应很多不同的高阶语言模型的状态,而相关技术对cohyps的数量进行设置,按照经验值将cohyps的数量设置为15个,这个统一的限制导致不完全等价的解码结果,从而造成精度上的损失。
为了解决相关技术中存在的问题,本申请实施例提供了一种语音解码方法,参见图3,本申请实施例采用状态对<state.A,state.B>来记录解码状态, 而不对state.A对应的state.B的总数进行限制,从而能够得到与高阶语言模型完全等价的解码结果,而不会有精度上的损失。
本申请实施例提供了一种语音解码方法,参见图4,本申请实施例提供的方法流程包括:
401、终端获取低阶语言模型对应的第一解码网络和差分语言模型对应的第二解码网络。
对于第一解码网络,终端可从服务器获取低阶语言模型,进而基于所获取的低阶语言模型,采用模型转换工具(例如openfst等)生成低阶语言模型对应的第一解码网络。
对于第二解码网络,终端可从服务器获取差分语言模型,进而基于所获取的差分语言模型采用模型转换工具(例如openfst等)生成差分语言模型对应的第二解码网络。
而终端在从服务器上获取低阶语言模型和高阶语言模型之前,服务器需要先获取高阶语言模型,进而基于高阶语言模型获取低阶语言模型及差分语言模型。
具体地,高阶语言模型的获取过程为:服务器获取大量基础音素,对每个基础音素进行语法分析,得到每个基础音素与其他基础音素之间的阶级关系,进而基于分析结果采用回退边连接每个基础音素及其低阶基础音素,该回退边上的输入符号和输出符号为空,该回退边上的权重值为每个基础音素对应的回退权重值(backoff weight),然后服务器以每个基础音素的低阶基础音素为起点,以每个基础音素为终点,采用一条边进行连接,该条边的输入符号和输出符号为每个基础音素,该条边上的权重值为每个基础音素对应的对数概率,然后服务器将各个基础音素、基础音素之间的边及回退边所构成的网络作为高阶语言模型。其中,基础音素为汉语言数据库中常用的字、词语或者句子等。基础音素可以表示为ngram,根据所包含的字符数量,基础语素包括1阶ngram、2阶ngram、3阶ngram等。为了便于区分不同的基础音素,每个基础语素具有状态标识(state id),在获取高阶语言模型时,实际上是将表征每个基础音素的圆圈采用具有方向的边连接得到的,每条边上标注有输入符号、输出符号及权重值。
进一步地,在高阶语言模型的构建过程中,服务器可获取每个基础音素对应的边的数量,进而基于每个基础音素对应的边的数量,为高阶语言模型分配内存,从而避免因内存不足导致高阶语言模型构建失败。考虑到内存有限,服务器在构建高阶语言模型构建过程中,可在进行语法分析的基础音素数量达到预设数量时,例如1000万个,清空已写入内存上的基础音素,并将内存中的基础音素写入到磁盘中,直到分析完所有的基础音素。采用该种方法,可大大减小高阶语言模型所消耗的内存。
对于工业级别50G以上的高阶语言模型,在实际获取时,可采用如下方法:
1、获取大量ngram,这些ngram包括1阶ngram、2阶ngram..、n阶ngram等,对每个ngram进行第一遍parse(从语法上描述或分析),记录每个ngram对应的stateid(状态标识)和每个ngram state对应的边的数量。
2、对每个ngram进行第二遍parse,并根据ngram state对应的边的数量,预先分配出相应内存。
3、对于任一ngram,采用backoff(回退)边连接其低阶的ngram state。该backoff边上的输入字符为空,输出字符也为空,该backoff边上的权重值为当前ngram对应的backoff weight(回退权重值)。通过回退边的连接,可确定当前ngram state对应的低阶state(即历史state),然后采用一条边连接历史state对应的id到当前的ngram state对应的id,该条边上的输入符号和输出符号均为当前ngram,该条边上的权重值为该ngram对应的对数概率。
当采用上述方法parse到1000万个ngram时,可将这些ngram的state写入到磁盘上面,同时清空内存中已经写入的state对应的信息,直到parse完成所有的ngram。采用该方法生成100G以上的ngram对应的高阶语言模型,内存消耗在200G左右,相对现有的高阶语言模型的构建方法,节省了大量内存。
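As a rough illustration only, the following Python sketch shows one conventional way to realize the construction described above: a word arc from the lower-order history state to the n-gram state carrying the log probability, plus a backoff arc with empty labels carrying the backoff weight. The class and method names are assumptions; a production implementation would build the graph with openfst rather than Python dictionaries.

```python
from collections import defaultdict

EPS = "<eps>"  # empty symbol, as used in this application

class NgramGraph:
    def __init__(self):
        self.arcs = defaultdict(list)   # state id -> list of (in, out, weight, next)
        self.state_ids = {(): 0}        # history tuple -> state id

    def state_of(self, history):
        if history not in self.state_ids:
            self.state_ids[history] = len(self.state_ids)
        return self.state_ids[history]

    def add_ngram(self, ngram, log_prob, backoff_weight):
        """ngram is a tuple of words; its state is reached from the state of its
        lower-order history by an arc labelled with the last word."""
        hist_state = self.state_of(ngram[:-1])
        ngram_state = self.state_of(ngram)
        word = ngram[-1]
        # word arc: history state -> n-gram state, weight = log probability
        self.arcs[hist_state].append((word, word, log_prob, ngram_state))
        # backoff arc: n-gram state -> its lower-order state, empty labels
        lower_state = self.state_of(ngram[1:])
        self.arcs[ngram_state].append((EPS, EPS, backoff_weight, lower_state))
```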
基于所生成的高阶语言模型,服务器对高阶语言模型进行降阶处理,去除部分重要性较低的基础音素,得到低阶语言模型。
基于所生成的高阶语言模型和低阶语言模型,服务器通过对高阶语言模型和低阶语言模型进行差分计算,可得到差分语言模型。服务器进行差分计算时,所应用的公式如下:
log P_diff(w|h) = log P_2(w|h) - log P_1(w|h)        (1)
α_diff(h) = α_2(h) - α_1(h)                          (2)
其中，P_diff(w|h)是差分语言模型的概率，P_2(w|h)是高阶语言模型的概率，P_1(w|h)是低阶语言模型的概率，α是回退时的分数。
需要说明的是,差分语言模型能够采用上述公式(1)和公式(2)表示的前提是,低阶语言模型的ngram集合为高阶语言模型的ngram的子集,在满足该条件时,当高阶语言模型回退时,低阶语言模型必然回退,差分语言模型能够表示成公式(1)和(2)中回退的语言模型的形式。如果高阶语言模型的ngram集合不是低阶语言模型的超集,当高阶语言模型回退时,低阶语言模型不一定回退,此时差分语言模型将会无法表示成公式(1)和(2)中回退的语言模型的形式,从而在进行解码时候可能会产生一些潜在的错误计算。
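To make formulas (1) and (2) concrete, the sketch below shows a backed-off log-probability lookup and the differential score it implies. The dictionary-based model layout is an assumption introduced for illustration (the actual models are WFSTs), and it relies on the subset condition discussed above.

```python
def lm_logprob(model, word, history):
    """Backed-off log-probability lookup in a dict-based model (a sketch):
    model["logprob"][(history, word)] and model["backoff"][history] hold
    log-domain values."""
    cost = 0.0
    h = tuple(history)
    while True:
        if (h, word) in model["logprob"]:
            return cost + model["logprob"][(h, word)]
        if not h:
            return cost + model["logprob"].get(((), word), float("-inf"))
        cost += model["backoff"].get(h, 0.0)   # pay the backoff score alpha(h)
        h = h[1:]                              # shorten the history and retry

def differential_logprob(high, low, word, history):
    # formula (1): log P_diff(w|h) = log P_2(w|h) - log P_1(w|h)
    # (formula (2) subtracts the backoff scores in the same way)
    return lm_logprob(high, word, history) - lm_logprob(low, word, history)
```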
402、终端根据第一解码网络和第二解码网络对上一音频帧进行解码,得到第一令牌列表。
在语音识别场景下,当通过麦克风等设备采集到语音信号后,终端按照预设时长将语音信号切分为多个音频帧,并对多个音频帧逐帧进行解码。在进行解码之前,终端首先对令牌列表中包括的令牌进行初始化,得到初始令牌,该初始令牌在第一解码网络中对应的第一状态state.A为初始状态,该初始令牌在第二解码网络中对应的第二状态state.B也为初始状态,即该初始令牌中的状态对<state.A,state.B>为<0,0>,该初始令牌对应的解码分数也为0。然后终端基于初始令牌,通过对多个音频帧进行解码,获取上一音频帧对应的第一令牌列表。其中,第一令牌列表包括对上一音频帧进行解码得到的多个第一令牌,该第一令牌包括在不同解码网络中进行解码形成的状态对及其解码分数,该状态对用于表征低阶语言模型对应的第一解码网络中的第一状态和差分语言模型对应的第二解码网络中的第二状态之间的对应关系。
当上一帧音频帧为第一音频帧时,终端将上一音频帧输入到第一解码网络中,并从初始令牌的state.A开始遍历所有的输入为空的边。对于任一输入为空的边,如果该边为不存在出词点的边,则初始令牌中state.B的状态维持不变;如果该边为存在出词点的边,则获取第一解码网络中的解码分数tot_cost及出词点的词,并以state.B当前状态为起始状态,查询第二解码网络中是否存在输入符号和出词点的词相同的边,如果查询到第二解码网络中不 存在输入符号和出词点的词相同的边,则通过回退边进行回退,在回退状态下继续进行查询,直至查询到存在输入符号和输出符号相同的边;如果查询到存在输入符号和出词点的词相同的边,则将state.A跳转至出词点的边的下一状态,得到更新的state.A,将state.B跳转至输入符号和出词点的词相同的边的下一状态,得到更新的state.B,将更新的state.A和更新的state.B形成状态对,并对第二解码网络中形成的解码路径进行重评分,将重评分分数与第一解码网络中的解码分数之和作为新的tot_cost,进而将新的tot_cost和新的状态对<state.A,state.B>更新初始令牌,并将更新的令牌加入到更新的令牌列表中,该更新的令牌列表可以表示newtokenlist。
重复执行上述过程,直至遍历完所有输入为空的边。对于得到的newtokenlist中的令牌递归执行上述过程,直至没有新的令牌加入到newtokenlist,且同样的状态对没有解码分数更小的令牌形成。终端将newtokenlist中的令牌复制到第一令牌列表中,该第一令牌列表可以表示为curtokenlist,并清空将newtokenlist中的令牌。
当上一音频帧不是第一音频帧时,终端根据第一解码网络和第二解码网络对上一音频帧进行解码得到第一令牌列表的过程,与终端对当前音频帧进行解码得到第二令牌列表的过程相同,具体可参见下述过程,所不同的是,对上一音频帧进行解码时参考其上一音频帧,对当前音频帧进行解码时参考其上一音频帧。
403、终端从第一令牌列表中获取最小解码分数对应的目标令牌。
终端按照解码分数,从第一令牌列表中获取解码分数最小的最优令牌,该最优令牌即为目标令牌。
404、终端根据目标令牌和当前音频帧的声学向量,确定对当前音频帧进行解码时的剪枝参数。
其中,剪枝参数包括第一剪枝参数、第二剪枝参数及第三剪枝参数等,第一剪枝参数可用curcutoff表示,用于在基于第一令牌列表中的各个第一令牌进行解码之前,确定是否跳过任一第一令牌;第二剪枝参数可用am_cutoff表示,用于在基于第一令牌列表中的各个第一令牌在第一解码网络进行解码时,确定是否跳过任一第一令牌;第三剪枝参数可用nextcutoff表示,用于在基于第一令牌列表中的各个第一令牌在第二解码网络进行解码时,确定是否 跳过任一第一令牌。
终端根据目标令牌和当前音频帧的声学向量,确定对当前音频帧进行语音解码时的剪枝参数时,步骤如下:
4041、终端获取目标令牌对应的解码分数,根据目标令牌对应的解码分数与预设数值,确定第一剪枝参数。
终端获取目标令牌对应的解码分数,并将目标令牌对应的解码分数与预设数值之和,确定为第一剪枝参数。其中,预设数值可由研发人员设置,该预设数值一般为10。预设数值可以表示为config.beam,第一剪枝参数可以表示为curcutoff=tot_cost+config.beam。
4042、终端将声学向量输入到第一解码网络中,并以目标令牌的第一状态为起始状态,在第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第一解码路径,对第二初始剪枝参数进行更新,得到第二剪枝参数。
其中,第二初始剪枝参数一般设置为无穷大。对第二初始剪枝参数的更新过程为:终端以目标令牌的第一状态state.A为起始状态,遍历输入非空的边,对于任一条输入非空的边形成的第一解码路径,终端获取该第一解码路径在声学模型下的第一声学分数,并根据目标令牌对应的解码分数、第一声学分数及预设数值所确定的分数,对第二初始剪枝参数进行更新,得到第二剪枝参数。终端获取目标令牌对应的解码分数、两倍的第一声学分数及预设数值相加得到的总分数,如果该总分数小于第二初始剪枝参数,则对第二初始剪枝参数进行更新;如果该总分数大于第二初始剪枝参数,则不对第二初始剪枝参数进行更新。当采用所有输入非空的边形成的第一解码路径不断对第二初始剪枝参数进行更新,最终可得到第二剪枝参数。
4043、终端以目标令牌的第二状态为起始状态,在第二解码网络中对每条输入非空的边形成的第一解码路径进行重评分,根据重评分结果,对第三初始剪枝参数进行更新,得到第三剪枝参数。
其中,第三初始剪枝参数一般设置为无穷大。对第三初始剪枝参数的更新过程为:
40431、对于任一条输入非空的边形成的第一解码路径,终端获取第一解码路径在声学模型下的第一声学分数。
40432、终端获取第一解码路径在第一解码网络中的第一路径分数。
终端将第一解码路径中各条边上的权重值相加,得到第一路径分数。
40433、终端获取第一解码路径在第二解码网络中的第一重评分分数。
本步骤包括以下几种情况:
第一种情况、当输入非空的边形成的第一解码路径上不存在出词点,终端无法获取到待查询的词,此时无需在第二解码网络进行重评分,该第一解码路径在第二解码网络中的第一重评分分数为0。
第二种情况、当输入非空的边形成的第一解码路径上存在出词点,终端获取出词点的词,并以目标令牌的第二状态为起始状态,在第二解码网络中查询是否存在输入符号与出词点的词相同的边,如果查询到第二解码网络中不存在输入符号与出词点的词相同的边,则通过回退边进行回退,在回退状态下继续进行查询,直至查询到输入符号和输出符号相同的边,并将从第二状态到最终状态之间的各条边上的权重值作为第一重评分分数;如果查询到第二解码网络中存在输入符号与出词点的词相同的边,则获取与出词点的词相同的边的权重值,该权重值即为第一重评分分数。
40434、终端根据目标令牌对应的解码分数、第一声学分数、第一路径分数、第一重评分分数及预设数值所确定的分数,对第三初始剪枝参数进行更新,得到第三剪枝参数。
终端获取目标令牌对应的解码分数、第一声学分数、第一路径分数、第一重评分分数及预设数值的总分数,如果该总分数小于第三初始剪枝参数,则将第三初始剪枝参数更新为该总分数,如果该总分数大于第三初始剪枝参数,则不对第三初始剪枝参数进行更新。当采用所有输入非空的边形成的第一解码路径不断对第三初始剪枝参数进行更新,最终可得到第三剪枝参数。
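The three pruning parameters described in steps 4041 to 4043 can be summarized in the following sketch. The callables for enumerating arcs and scoring them are assumptions introduced for illustration, and `beam` plays the role of the preset value config.beam.

```python
import math

def compute_pruning_params(best_token, arcs_from_state_a, beam,
                           acoustic_score, path_score, rescore_score):
    """Derive curcutoff, am_cutoff and nextcutoff from the best (minimum-score)
    token of the previous frame, as running minima over every arc with a
    non-empty input label starting at the token's first state."""
    # 4041: first pruning parameter
    curcutoff = best_token.tot_cost + beam           # tot_cost + config.beam

    # 4042 / 4043: second and third pruning parameters
    am_cutoff = math.inf
    nextcutoff = math.inf
    for arc in arcs_from_state_a(best_token.state_a):
        ac = acoustic_score(arc)
        am_cutoff = min(am_cutoff,
                        best_token.tot_cost + 2.0 * ac + beam)
        nextcutoff = min(nextcutoff,
                         best_token.tot_cost + ac + path_score(arc)
                         + rescore_score(arc) + beam)
    return curcutoff, am_cutoff, nextcutoff
```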
405、终端根据第一令牌列表、剪枝参数及声学向量,对当前音频帧进行解码。
终端根据第一令牌列表、剪枝参数及声学向量,对当前音频帧进行解码时,步骤如下:
4051、终端将声学向量输入到第一解码网络中,并遍历第一令牌列表中的每个第一令牌。
4052、对于任一第一令牌,终端根据第一令牌对应的解码分数和第一剪枝参数,确定是否跳过第一令牌。
终端获取该第一令牌对应的解码分数,并将第一令牌的解码分数与第一剪枝参数进行比较,如果该第一令牌的解码分数大于第一剪枝参数,则跳过该第一令牌;如果该第一令牌的解码分数小于第一剪枝参数,则确定执行该第一令牌。
4053、当确定执行第一令牌,终端以第一令牌的第一状态为起始状态,在第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第二解码路径和第二剪枝参数,确定是否跳过第一令牌。
当根据第一剪枝参数确定执行该第一令牌时,终端以第一令牌的第一状态为起始状态,在第一解码网络中遍历输入非空的边。对于任一条输入非空的边形成的第二解码路径,获取第二解码路径在声学模型下的第二声学分数,并将根据第一令牌对应的解码分数和第二声学分数所确定的分数与第二剪枝参数进行比较,如果根据第一令牌对应的解码分数和第二声学分数所确定的分数大于第二剪枝参数,则跳过第一令牌,否则,执行第一令牌。具体地,终端获取第一令牌对应的解码分数和两倍的第二声学分数的总分数,如果该总分数大于第二剪枝参数,则跳过该第一令牌,如果该总分数小于第二剪枝,则执行该第一令牌。
进一步地,当根据第二剪枝参数确定执行该第一令牌时,如果根据第一令牌对应的解码分数、第二声学分数及预设数值所确定的分数小于第二剪枝参数,则终端更新第二剪枝参数。具体地,终端获取第一令牌对应的解码分数、两倍的第二声学分数及预设数值的总分数,如果该总分数小于第二剪枝参数,则将第二剪枝参数更新为该总分数,如果该总分数大于第二剪枝参数,则不对该第二剪枝参数进行更新。当第二剪枝参数更新后,在确定是否跳过任一第一令牌时,将根据更新的第二剪枝参数进行确定。
4054、当确定执行第一令牌,终端以第一令牌的第二状态为起始状态,在第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和第三剪枝参数,确定是否跳过第一令牌。
当根据第二剪枝参数确定执行第一令牌时,终端以第一令牌的第二状态为起始状态,在第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分。
第一种情况、对于任一条输入非空的边形成的第二解码路径,终端获取 第二解码路径在声学模型下的第二声学分数,当输入非空的边形成的第二解码路径上不存在出词点时,终端获取第二解码路径在第一解码网络中的第二路径分数,如果根据第一令牌的解码分数、第二路径分数及第二声学分数所确定的分数大于第三剪枝参数,则跳过第一令牌,否则,执行第一令牌。具体地,终端获取第一令牌的解码分数、第二路径分数及第二声学分数的总分数,如果该总分数大于第三剪枝参数,则跳过第一令牌;如果该总分数小于第三剪枝参数,则执行第一令牌。
进一步地,当根据第三剪枝参数确定执行第一令牌时,终端获取第一令牌的解码分数、第二路径分数、第二声学分数及预设数值的总分数,如果该总分数小于第三剪枝参数,则将第三剪枝参数更新为该总分数;如果该总分数大于第三剪枝参数,则不对该第三剪枝参数进行更新。当第三剪枝参数更新后,在确定是否跳过任一第一令牌时,将根据更新的第二三枝参数进行确定。
第二种情况、对于任一条输入非空的边形成的第二解码路径,终端获取第二解码路径在声学模型下的第二声学分数,第二解码路径在第一解码网络中的第二路径分数,当输入非空的边形成的第二解码路径上存在出词点时,终端获取出词点的词,并以第一令牌的第二状态为起始状态,在第二解码网络中查询是否存在输入符号与出词点的词相同的边,如果查询到第二解码网络中不存在输入符号与出词点的词相同的边,则通过回退边进行回退,在回退状态下继续进行查询,直至查询到输入符号和输出符号相同的边,并将从第二状态到最终状态之间的各条边上的权重值作为第二重评分分数;如果查询到第二解码网络中存在输入符号与出词点的词相同的边,则获取与出词点的词相同的边的权重值,该权重值即为第二重评分分数。基于所获取到的第二重评分分数,终端获取第一令牌的解码分数、第二路径分数、第二声学分数及第二重评分分数所确定的分数,如果根据第一令牌的解码分数、第二路径分数、第二声学分数及第二重评分分数所确定的分数大于第三剪枝参数,则跳过第一令牌,否则,执行第一令牌。
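The word-point re-scoring lookup with backoff described in this case can be sketched as follows; `second_net.arcs()` is an assumed interface, and the function returns both the accumulated re-scoring score and the new state in the second decoding network.

```python
def rescore_word(second_net, state_b, word, eps="<eps>"):
    """Starting from the token's state in the second decoding network, look
    for an arc whose input symbol equals the emitted word; if none exists,
    follow the backoff (epsilon-input) arc and retry, accumulating the arc
    weights traversed along the way. second_net.arcs(state) is assumed to
    yield (ilabel, weight, next_state) triples."""
    score = 0.0
    while True:
        backoff_arc = None
        for ilabel, weight, next_state in second_net.arcs(state_b):
            if ilabel == word:
                return score + weight, next_state   # re-scoring score, new state.B
            if ilabel == eps:
                backoff_arc = (weight, next_state)
        if backoff_arc is None:
            raise LookupError("no matching arc and no backoff arc")
        score += backoff_arc[0]        # pay the backoff weight
        state_b = backoff_arc[1]       # continue from the backed-off state
```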
进一步地,当根据第三剪枝参数确定执行第一令牌时,终端获取第一令牌的解码分数、第二路径分数、第二声学分数、第二重评分分数及预设数值确定的总分数,如果该总分数小于第三剪枝参数,则将第三剪枝参数更新后 该总分数;如果该总分数大于第三剪枝参数,则不对该第三剪枝参数进行更新。当第三剪枝参数更新后,在确定是否跳过任一第一令牌时,将根据更新的第三剪枝参数进行确定。
4055、当确定执行第一令牌,终端通过对第一令牌进行状态跳转,获取第二令牌,第二令牌包括更新的状态对及其解码分数。
当根据第三剪枝参数确定执行第一令牌时,终端根据第一解码网络中的遍历结果,对第一令牌中的第一状态进行状态跳转,得到更新后的第一状态,并根据第二解码网络中的重评分结果,对第一令牌中的第二状态进行状态跳转,得到更新后的第二状态,进而将更新后第一状态和更新后的第二状态组成的第二令牌的状态对,并将根据第一令牌对应的解码分数、在第一解码网络中的路径分数、在第二解码网络中的重评分分数及在声学模型下的第二声学得分,确定第二令牌对应的解码分数。
具体地,第一令牌进行状态跳转时,包括以下几种情况:
当输入非空的边形成的第二解码路径上不存在出词点的边时,终端将第一令牌的第一状态跳转至输入非空的边的下一状态,而第一令牌的第二状态保持不变。
当输入非空的边形成的第二解码路径上存在出词点的边时,终端将第一令牌的第一状态跳转至出词点的边的下一状态,将第一令牌中的第二状态跳转至输入符号与输出符号相同的边的下一状态。
4056、终端将每个第一令牌对应的第二令牌组成第二令牌列表。
当得到第二令牌后,终端将第二令牌加入到第二令牌列表中,直至遍历完第一令牌列表中的每个第一令牌。该第二令牌列表为当前音频帧的curtokenlist。
4057、终端将第二令牌列表中解码分数最小的第二令牌,确定为对当前音频帧的解码结果。
基于所得到的第二令牌列表,终端从第二令牌列表中获取解码分数最小的第二令牌,并将该第二令牌确定为对当前音频帧的解码结果。
需要说明的是,上述以根据上一音频帧的第一令牌列表、第一解码网络及第二解码网络,对当前音频帧进行解码为例,对于其他的音频帧进行解码可参照解码方式,此处不再赘述。
在本申请的另一个实施例中,终端还可通过词典动态扩张解码路径,然后应用语言模型进行动态重评分和剪枝过程。采用该种方式可以不用生成结合词典和语言模型的TLG资源,对资源的需求量小,且仅需要生成相关语言模型的G.fst网络。
图5为应用不同语言模型对应的解码网络进行解码过程的示意图,参见图5,在TLG.fst(低阶语言模型)上解码时,解码路径0-1-2-4-6对应的解码结果为“今天天气”,解码分数为0+0.8+1.2=2.0;解码路径0-1-2-4-7对应的解码结果为“今天天启”,解码分数为0+0.8+1.0=1.8,对比上述两个结果,“今天天启”的解码分数比“今天天气”的解码分数要小,结果更优。然而,经过G.fst(差分语言模型)进行重评分,“天气”的重评分分数为0.1,解码路径0-1-2-4-6的总分数为2.1分;“天启”的重评分分数为0.4,解码路径0-1-2-4-7的总分数为1.8+0.4=2.2。经过重评分之后,“今天天气”的总分数2.1小于“今天天启”的总分数2.2,最终输出的解码结果为“今天天气”。
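Restated as code, the comparison in this example works out as follows (scores taken directly from the description of 图5 above):

```python
# path 0-1-2-4-6 ("今天天气") vs path 0-1-2-4-7 ("今天天启") on TLG.fst
tlg_weather = 0 + 0.8 + 1.2          # 2.0 before re-scoring
tlg_tianqi  = 0 + 0.8 + 1.0          # 1.8 before re-scoring (initially better)
total_weather = tlg_weather + 0.1    # +0.1 re-scoring on G.fst -> 2.1
total_tianqi  = tlg_tianqi + 0.4     # +0.4 re-scoring on G.fst -> 2.2
best = "今天天气" if total_weather < total_tianqi else "今天天启"
print(best)                          # after re-scoring, "今天天气" wins
```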
需要说明的是,上述以终端进行语音解码为例,在实际应用中,终端还可以将采集到的语音数据发送至服务器,从服务器获取语音解码结果,而不直接进行解码。
本申请实施例提供的方法,无需生成高阶语言模型对应的解码网络,基于低阶语言模型和差分语言模型对应的解码网络进行解码,在确保解码精度的前提下,节省了计算资源和存储资源。且根据对上一音频帧的解码结果,对当前音频帧的解码进行解码,提高了解码速度。
参见图6,本申请实施例提供了一种语音解码装置,该装置包括:
获取模块601,用于从第一令牌列表中获取最小解码分数对应的目标令牌,第一令牌列表包括对上一音频帧进行解码得到的多个第一令牌,第一令牌包括在不同解码网络中进行解码形成的状态对及其解码分数,状态对用于表征低阶语言模型对应的第一解码网络中的第一状态和差分语言模型对应的第二解码网络中的第二状态之间的对应关系;
确定模块602,用于根据目标令牌和当前音频帧的声学向量,确定对当前音频帧进行解码时的剪枝参数,剪枝参数用于对当前音频帧的解码过程进行约束;
解码模块603,用于根据第一令牌列表、剪枝参数及声学向量,对当前音频帧进行解码。
在本申请的另一个实施例中,确定模块602,用于获取目标令牌对应的解码分数,根据目标令牌对应的解码分数与预设数值,确定第一剪枝参数;将声学向量输入到第一解码网络中,并以目标令牌的第一状态为起始状态,在第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第一解码路径,对第二初始剪枝参数进行更新,得到第二剪枝参数;以目标令牌的第二状态为起始状态,在第二解码网络中对每条输入非空的边形成的第一解码路径进行重评分,根据重评分结果,对第三初始剪枝参数进行更新,得到第三剪枝参数。
在本申请的另一个实施例中,确定模块602,用于对于任一条输入非空的边形成的第一解码路径,获取第一解码路径在声学模型下的第一声学分数;
根据目标令牌对应的解码分数、第一声学分数及预设数值所确定的分数,对第二初始剪枝参数进行更新,得到第二剪枝参数。
在本申请的另一个实施例中,确定模块602,用于对于任一条输入非空的边形成的第一解码路径,获取第一解码路径在声学模型下的第一声学分数;获取第一解码路径在第一解码网络中的第一路径分数;获取第一解码路径在第二解码网络中的第一重评分分数;根据目标令牌对应的解码分数、第一声学分数、第一路径分数、第一重评分分数及预设数值所确定的分数,对第三初始剪枝参数进行更新,得到第三剪枝参数。
在本申请的另一个实施例中,解码模块603,用于将声学向量输入到第一解码网络中,并遍历第一令牌列表中的每个第一令牌;,对于任一第一令牌,根据第一令牌对应的解码分数和第一剪枝参数,确定是否跳过第一令牌;当确定执行第一令牌,以第一令牌的第一状态为起始状态,在第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第二解码路径和第二剪枝参数,确定是否跳过第一令牌;当确定执行第一令牌,以第一令牌的第二状态为起始状态,在第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和第三剪枝参数,确定是否跳过第一令牌;当确定执行第一令牌,通过对第一令牌进行状态跳转,获取第二令牌,第二令牌包括更新的状态对及其解码分数;将每个第一令牌对应的第二令牌组成第 二令牌列表;将第二令牌列表中解码分数最小的第二令牌,确定为对当前音频帧的解码结果。
在本申请的另一个实施例中,解码模块603,用于对于任一条输入非空的边形成的第二解码路径,获取第二解码路径在声学模型下的第二声学分数;如果根据第一令牌对应的解码分数和第二声学分数所确定的分数大于第二剪枝参数,则跳过第一令牌,否则,执行第一令牌。
在本申请的另一个实施例中,该装置还包括:
更新模块,用于如果根据第一令牌对应的解码分数、第二声学分数及预设数值所确定的分数小于第二剪枝参数,则更新第二剪枝参数。
在本申请的另一个实施例中,解码模块603,用于对于任一条输入非空的边形成的第二解码路径,获取第二解码路径在声学模型下的第二声学分数;当输入非空的边上不存在出词点时,获取第二解码路径在第一解码网络中的第二路径分数;如果根据第一令牌的解码分数、第二路径分数及第二声学分数所确定的分数大于第三剪枝参数,则跳过第一令牌,否则,执行第一令牌。
在本申请的另一个实施例中,该装置还包括:
更新模块,用于如果根据第一令牌的解码分数、第二路径分数、第二声学分数及预设数值所确定的分数小于第三剪枝参数,则更新第三剪枝参数。
在本申请的另一个实施例中,解码模块603,用于对于任一条输入非空的边形成的第二解码路径,获取第二解码路径在声学模型下的第二声学分数;当输入非空的边上存在出词点时,获取第二解码路径在第一解码网络中的第二路径分数;获取第二解码路径在第二解码网络中的第二重评分分数;如果根据第一令牌的解码分数、第二路径分数、第二声学分数及第二重评分分数所确定的分数大于第三剪枝参数,则跳过第一令牌,否则,执行第一令牌。
在本申请的另一个实施例中,该装置还包括:
更新模块,用于如果根据第一令牌的解码分数、第二路径分数、第二声学分数、第二重评分分数及预设数值确定的分数小于第三剪枝参数,则更新第三剪枝参数。
在本申请的另一个实施例中,解码模块603,用于根据第一解码网络中的遍历结果,对第一令牌中的第一状态进行状态跳转,得到更新后的第一状态;根据第二解码网络中的重评分结果,对第一令牌中的第二状态进行状态 跳转,得到更新后的第二状态;将更新后第一状态和更新后的第二状态组成的第二令牌的状态对;根据第一令牌对应的解码分数、在第一解码网络中的路径分数、在第二解码网络中的重评分分数及在声学模型下的第二声学得分,确定第二令牌对应的解码分数。
综上所述,本申请实施例提供的装置,无需生成高阶语言模型对应的解码网络,基于低阶语言模型和差分语言模型对应的解码网络进行解码,在确保解码精度的前提下,节省了计算资源和存储资源。且根据对上一音频帧的解码结果,对当前音频帧的解码进行解码,提高了解码速度。
图7示出了本申请一个示例性实施例提供的计算机设备700的结构框图。该计算机设备700可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。计算机设备700还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,计算机设备700包括有:处理器701和存储器702。
处理器701可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器701可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器701也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器701可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器701还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器702可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器702还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施 例中,存储器702中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器701所执行以实现本申请中方法实施例提供的语音解码方法。
在一些实施例中,计算机设备700还可选包括有:外围设备接口703和至少一个外围设备。处理器701、存储器702和外围设备接口703之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口703相连。具体地,外围设备包括:射频电路704、触摸显示屏705、摄像头706、音频电路707、定位组件708和电源709中的至少一种。
外围设备接口703可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器701和存储器702。在一些实施例中,处理器701、存储器702和外围设备接口703被集成在同一芯片或电路板上;在一些其他实施例中,处理器701、存储器702和外围设备接口703中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路704用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路704通过电磁信号与通信网络以及其他通信设备进行通信。射频电路704将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路704包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路704可以通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于:城域网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路704还可以包括NFC(Near Field Communication,近距离无线通信)有关的电路,本申请对此不加以限定。
显示屏705用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏705是触摸显示屏时,显示屏705还具有采集在显示屏705的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器701进行处理。此时,显示屏705还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏705可以为一个,设置计算机设备700的前面板;在另一些实施例中,显示屏705可以为至少两个,分别设置在计算机设备700的不同 表面或呈折叠设计;在再一些实施例中,显示屏705可以是柔性显示屏,设置在计算机设备700的弯曲表面上或折叠面上。甚至,显示屏705还可以设置成非矩形的不规则图形,也即异形屏。显示屏705可以采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件706用于采集图像或视频。可选地,摄像头组件706包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件706还可以包括闪光灯。闪光灯可以是单色温闪光灯,也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,可以用于不同色温下的光线补偿。
音频电路707可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器701进行处理,或者输入至射频电路704以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在计算机设备700的不同部位。麦克风还可以是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器701或射频电路704的电信号转换为声波。扬声器可以是传统的薄膜扬声器,也可以是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅可以将电信号转换为人类可听见的声波,也可以将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路707还可以包括耳机插孔。
定位组件708用于定位计算机设备700的当前地理位置,以实现导航或LBS(Location Based Service,基于位置的服务)。定位组件708可以是基于美国的GPS(Global Positioning System,全球定位系统)、中国的北斗系统、俄罗斯的格雷纳斯系统或欧盟的伽利略系统的定位组件。
电源709用于为计算机设备700中的各个组件进行供电。电源709可以是交流电、直流电、一次性电池或可充电电池。当电源709包括可充电电池时,该可充电电池可以支持有线充电或无线充电。该可充电电池还可以用于 支持快充技术。
在一些实施例中,计算机设备700还包括有一个或多个传感器710。该一个或多个传感器710包括但不限于:加速度传感器711、陀螺仪传感器712、压力传感器713、指纹传感器714、光学传感器715以及接近传感器716。
加速度传感器711可以检测以计算机设备700建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器711可以用于检测重力加速度在三个坐标轴上的分量。处理器701可以根据加速度传感器711采集的重力加速度信号,控制触摸显示屏705以横向视图或纵向视图进行用户界面的显示。加速度传感器711还可以用于游戏或者用户的运动数据的采集。
陀螺仪传感器712可以检测计算机设备700的机体方向及转动角度,陀螺仪传感器712可以与加速度传感器711协同采集用户对计算机设备700的3D动作。处理器701根据陀螺仪传感器712采集的数据,可以实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
压力传感器713可以设置在计算机设备700的侧边框和/或触摸显示屏705的下层。当压力传感器713设置在计算机设备700的侧边框时,可以检测用户对计算机设备700的握持信号,由处理器701根据压力传感器713采集的握持信号进行左右手识别或快捷操作。当压力传感器713设置在触摸显示屏705的下层时,由处理器701根据用户对触摸显示屏705的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器714用于采集用户的指纹,由处理器701根据指纹传感器714采集到的指纹识别用户的身份,或者,由指纹传感器714根据采集到的指纹识别用户的身份。在识别出用户的身份为可信身份时,由处理器701授权该用户执行相关的敏感操作,该敏感操作包括解锁屏幕、查看加密信息、下载软件、支付及更改设置等。指纹传感器714可以被设置计算机设备700的正面、背面或侧面。当计算机设备700上设置有物理按键或厂商Logo时,指纹传感器714可以与物理按键或厂商Logo集成在一起。
光学传感器715用于采集环境光强度。在一个实施例中,处理器701可以根据光学传感器715采集的环境光强度,控制触摸显示屏705的显示亮度。 具体地,当环境光强度较高时,调高触摸显示屏705的显示亮度;当环境光强度较低时,调低触摸显示屏705的显示亮度。在另一个实施例中,处理器701还可以根据光学传感器715采集的环境光强度,动态调整摄像头组件706的拍摄参数。
接近传感器716,也称距离传感器,通常设置在计算机设备700的前面板。接近传感器716用于采集用户与计算机设备700的正面之间的距离。在一个实施例中,当接近传感器716检测到用户与计算机设备700的正面之间的距离逐渐变小时,由处理器701控制触摸显示屏705从亮屏状态切换为息屏状态;当接近传感器716检测到用户与计算机设备700的正面之间的距离逐渐变大时,由处理器701控制触摸显示屏705从息屏状态切换为亮屏状态。
本领域技术人员可以理解,图7中示出的结构并不构成对计算机设备700的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
本申请实施例提供的计算机设备,无需生成高阶语言模型对应的解码网络,基于低阶语言模型和差分语言模型对应的解码网络进行解码,在确保解码精度的前提下,节省了计算资源和存储资源。且根据对上一音频帧的解码结果,对当前音频帧的解码进行解码,提高了解码速度。
本申请实施例提供了一种非易失性的计算机可读存储介质,存储介质中存储有存储有计算机可读指令,计算机可读指令由处理器执行以实现上述各个实施例提供的语音解码方法。
本申请实施例提供的计算机可读存储介质,无需生成高阶语言模型对应的解码网络,基于低阶语言模型和差分语言模型对应的解码网络进行解码,在确保解码精度的前提下,节省了计算资源和存储资源。且根据对上一音频帧的解码结果,对当前音频帧的解码进行解码,提高了解码速度。
需要说明的是:上述实施例提供的语音解码装置在进行语音解码时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将语音解码装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提 供的语音解码装置与语音解码方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种语音解码方法,由计算机设备执行,所述语音包括当前音频帧和上一音频帧;所述方法包括:
    从第一令牌列表中获取最小解码分数对应的目标令牌,所述第一令牌列表包括在不同解码网络中对所述上一音频帧进行解码得到的多个第一令牌,所述第一令牌包括状态对和解码分数,所述状态对用于表征所述第一令牌在低阶语言模型对应的第一解码网络中的第一状态与在差分语言模型对应的第二解码网络中的第二状态之间的对应关系;
    根据所述目标令牌和所述当前音频帧的声学向量,确定对所述当前音频帧进行解码时的剪枝参数,所述剪枝参数用于对所述当前音频帧的解码过程进行约束;及
    根据所述第一令牌列表、所述剪枝参数及所述声学向量,对所述当前音频帧进行解码。
  2. 根据权利要求1所述的方法,其特征在于,所述剪枝参数包括第一剪枝参数、第二剪枝参数和第三剪枝参数;所述根据所述目标令牌和当前音频帧的声学向量,确定对所述当前音频帧进行语音解码时的剪枝参数,包括:
    获取所述目标令牌对应的解码分数,根据所述目标令牌对应的解码分数与预设数值,确定第一剪枝参数;
    将所述声学向量输入到所述第一解码网络中,并以所述目标令牌的第一状态为起始状态,在所述第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第一解码路径,对第二初始剪枝参数进行更新,得到第二剪枝参数;及
    以所述目标令牌的第二状态为起始状态,在所述第二解码网络中对每条输入非空的边形成的第一解码路径进行重评分,根据重评分结果,对第三初始剪枝参数进行更新,得到第三剪枝参数。
  3. 根据权利要求2所述的方法,其特征在于,所述根据每条输入非空的边形成的第一解码路径,对第二初始剪枝参数进行更新,得到第二剪枝参数,包括:
    对于根据每条输入非空的边形成的第一解码路径,获取所述第一解码路径在声学模型下的第一声学分数;及
    根据所述目标令牌对应的解码分数、所述第一声学分数及所述预设数值所确定的分数,对所述第二初始剪枝参数进行更新,得到所述第二剪枝参数。
  4. 根据权利要求2所述的方法,其特征在于,所述在所述第二解码网络中对每条输入非空的边形成的第一解码路径进行重评分,根据重评分结果,对第三初始剪枝参数进行更新,得到第三剪枝参数,包括:
    对于根据每条输入非空的边形成的第一解码路径,获取所述第一解码路径在声学模型下的第一声学分数;
    获取所述第一解码路径在所述第一解码网络中的第一路径分数;
    获取所述第一解码路径在所述第二解码网络中的第一重评分分数;及
    根据所述目标令牌对应的解码分数、所述第一声学分数、所述第一路径分数、所述第一重评分分数及所述预设数值所确定的分数,对所述第三初始剪枝参数进行更新,得到所述第三剪枝参数。
  5. 根据权利要求2所述的方法,其特征在于,所述根据所述第一令牌列表、所述剪枝参数及所述声学向量,对所述当前音频帧进行解码,包括:
    将所述声学向量输入到所述第一解码网络中,并遍历所述第一令牌列表中的每个第一令牌;
    对于任一第一令牌,根据所述第一令牌对应的解码分数和所述第一剪枝参数,确定是否跳过所述第一令牌;
    当根据所述第一剪枝参数确定执行所述第一令牌,以所述第一令牌的第一状态为起始状态,在所述第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第二解码路径和所述第二剪枝参数,确定是否跳过所述第一令牌;
    当根据所述第二剪枝参数确定执行所述第一令牌,以所述第一令牌的第二状态为起始状态,在所述第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和所述第三剪枝参数,确定是否跳过所述第一令牌;
    当根据所述第三剪枝参数确定执行所述第一令牌,通过对所述第一令牌进行状态跳转,获取第二令牌,所述第二令牌包括更新的状态对及其解码分数;
    将每个第一令牌对应的第二令牌组成第二令牌列表;及
    将所述第二令牌列表中解码分数最小的第二令牌,确定为对所述当前音频帧的解码结果。
  6. 根据权利要求5所述的方法,其特征在于,所述根据每条输入非空的边形成的第二解码路径和所述第二剪枝参数,确定是否跳过所述第一令牌,包括:
    对于根据每条输入非空的边形成的第二解码路径,获取所述第二解码路径在声学模型下的第二声学分数;及
    如果根据所述第一令牌对应的解码分数和所述第二声学分数所确定的分数大于所述第二剪枝参数,则跳过所述第一令牌,否则,执行所述第一令牌。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    如果根据所述第一令牌对应的解码分数、所述第二声学分数及预设数值所确定的分数小于所述第二剪枝参数,则更新所述第二剪枝参数。
  8. 根据权利要求5所述的方法,其特征在于,所述在所述第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和所述第三剪枝参数,确定是否跳过所述第一令牌,包括:
    对于根据每条输入非空的边形成的第二解码路径,获取所述第二解码路径在声学模型下的第二声学分数;
    当所述输入非空的边上不存在出词点时,获取所述第二解码路径在所述第一解码网络中的第二路径分数;所述出词点包括有汉字输出的位置;及
    如果根据所述第一令牌的解码分数、所述第二路径分数及所述第二声学分数所确定的分数大于所述第三剪枝参数,则跳过所述第一令牌,否则,执行所述第一令牌。
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括:
    如果根据所述第一令牌的解码分数、所述第二路径分数、所述第二声学分数及预设数值所确定的分数小于所述第三剪枝参数,则更新所述第三剪枝参数。
  10. 根据权利要求5所述的方法,其特征在于,所述在所述第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和所述第三剪枝参数,确定是否跳过所述第一令牌,包括:
    对于任一条输入非空的边形成的第二解码路径,获取所述第二解码路径 在声学模型下的第二声学分数;
    当所述输入非空的边上存在出词点时,获取所述第二解码路径在所述第一解码网络中的第二路径分数;
    获取所述第二解码路径在所述第二解码网络中的第二重评分分数;及
    如果根据所述第一令牌的解码分数、所述第二路径分数、所述第二声学分数及所述第二重评分分数所确定的分数大于所述第三剪枝参数,则跳过所述第一令牌,否则,执行所述第一令牌。
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:
    如果根据所述第一令牌的解码分数、所述第二路径分数、所述第二声学分数、所述第二重评分分数及所述预设数值确定的分数小于所述第三剪枝参数,则更新所述第三剪枝参数。
  12. 根据权利要求10所述的方法,其特征在于,所述通过对所述第一令牌进行状态跳转,获取第二令牌,包括:
    根据所述第一解码网络中的遍历结果,对所述第一令牌中的第一状态进行状态跳转,得到更新后的第一状态;
    根据所述第二解码网络中的重评分结果,对所述第一令牌中的第二状态进行状态跳转,得到更新后的第二状态;
    将更新后第一状态和更新后的第二状态组成第二令牌的状态对;及
    根据所述第一令牌对应的解码分数、在第一解码网络中的路径分数、在第二解码网络中的重评分分数及在声学模型下的第二声学得分,确定所述第二令牌对应的解码分数。
  13. 一种语音解码装置,所述语音包括当前音频帧和上一音频帧;所述装置包括:
    获取模块,用于从第一令牌列表中获取最小解码分数对应的目标令牌,所述第一令牌列表包括在不同解码网络中对所述上一音频帧进行解码得到的多个第一令牌,所述第一令牌包括状态对和解码分数,所述状态对用于表征所述第一令牌在低阶语言模型对应的第一解码网络中的第一状态与在差分语言模型对应的第二解码网络中的第二状态之间的对应关系;
    确定模块,用于根据所述目标令牌和所述当前音频帧的声学向量,确定 对所述当前音频帧进行解码时的剪枝参数,所述剪枝参数用于对所述当前音频帧的解码过程进行约束;及
    解码模块,用于根据所述第一令牌列表、所述剪枝参数及所述声学向量,对所述当前音频帧进行解码。
  14. 根据权利要求13所述的装置,其特征在于,所述剪枝参数包括第一剪枝参数、第二剪枝参数和第三剪枝参数;所述确定模块还用于获取所述目标令牌对应的解码分数,根据所述目标令牌对应的解码分数与预设数值,确定第一剪枝参数;
    将所述声学向量输入到所述第一解码网络中,并以所述目标令牌的第一状态为起始状态,在所述第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第一解码路径,对第二初始剪枝参数进行更新,得到第二剪枝参数;及
    以所述目标令牌的第二状态为起始状态,在所述第二解码网络中对每条输入非空的边形成的第一解码路径进行重评分,根据重评分结果,对第三初始剪枝参数进行更新,得到第三剪枝参数。
  15. 根据权利要求14所述的装置,其特征在于,所述解码模块还用于遍历所述第一令牌列表中的每个第一令牌;
    将所述声学向量输入到所述第一解码网络中,并遍历所述第一令牌列表中的每个第一令牌;
    对于任一第一令牌,根据所述第一令牌对应的解码分数和所述第一剪枝参数,确定是否跳过所述第一令牌;
    当根据所述第一剪枝参数确定执行所述第一令牌,以所述第一令牌的第一状态为起始状态,在所述第一解码网络中遍历输入非空的边,根据每条输入非空的边形成的第二解码路径和所述第二剪枝参数,确定是否跳过所述第一令牌;
    当根据所述第二剪枝参数确定执行所述第一令牌,以所述第一令牌的第二状态为起始状态,在所述第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和所述第三剪枝参数,确定是否跳过所述第一令牌;
    当根据所述第三剪枝参数确定执行所述第一令牌,通过对所述第一令牌 进行状态跳转,获取第二令牌,所述第二令牌包括更新的状态对及其解码分数;
    将每个第一令牌对应的第二令牌组成第二令牌列表;及
    将所述第二令牌列表中解码分数最小的第二令牌,确定为对所述当前音频帧的解码结果。
  16. 根据权利要求15所述的装置,其特征在于,所述解码模块还用于
    对于根据每条输入非空的边形成的第二解码路径,获取所述第二解码路径在声学模型下的第二声学分数;
    当所述输入非空的边上不存在出词点时,获取所述第二解码路径在所述第一解码网络中的第二路径分数;所述出词点包括有汉字输出的位置;及
    如果根据所述第一令牌的解码分数、所述第二路径分数及所述第二声学分数所确定的分数大于所述第三剪枝参数,则跳过所述第一令牌,否则,执行所述第一令牌。
  17. 根据权利要求15所述的装置,其特征在于,所述解码模块还用于
    所述在所述第二解码网络中对每条输入非空的边形成的第二解码路径进行重评分,根据重评分结果和所述第三剪枝参数,确定是否跳过所述第一令牌,包括:
    对于任一条输入非空的边形成的第二解码路径,获取所述第二解码路径在声学模型下的第二声学分数;
    当所述输入非空的边上存在出词点时,获取所述第二解码路径在所述第一解码网络中的第二路径分数;
    获取所述第二解码路径在所述第二解码网络中的第二重评分分数;及
    如果根据所述第一令牌的解码分数、所述第二路径分数、所述第二声学分数及所述第二重评分分数所确定的分数大于所述第三剪枝参数,则跳过所述第一令牌,否则,执行所述第一令牌。
  18. 根据权利要求17所述的装置,其特征在于,所述解码模块还用于
    根据所述第一解码网络中的遍历结果,对所述第一令牌中的第一状态进行状态跳转,得到更新后的第一状态;
    根据所述第二解码网络中的重评分结果,对所述第一令牌中的第二状态进行状态跳转,得到更新后的第二状态;
    将更新后第一状态和更新后的第二状态组成第二令牌的状态对;及
    根据所述第一令牌对应的解码分数、在第一解码网络中的路径分数、在第二解码网络中的重评分分数及在声学模型下的第二声学得分,确定所述第二令牌对应的解码分数。
  19. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如权利要求1至12中任一项所述的方法的步骤。
  20. 一种非易失性的计算机可读存储介质,其特征在于,所述存储介质中存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1至12中任一项所述的方法的步骤。
PCT/CN2019/116686 2018-12-14 2019-11-08 语音解码方法、装置、计算机设备及存储介质 WO2020119351A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/191,604 US11935517B2 (en) 2018-12-14 2021-03-03 Speech decoding method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811536173.X 2018-12-14
CN201811536173.XA CN110164421B (zh) 2018-12-14 2018-12-14 语音解码方法、装置及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/191,604 Continuation US11935517B2 (en) 2018-12-14 2021-03-03 Speech decoding method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020119351A1 true WO2020119351A1 (zh) 2020-06-18

Family

ID=67645236

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116686 WO2020119351A1 (zh) 2018-12-14 2019-11-08 语音解码方法、装置、计算机设备及存储介质

Country Status (3)

Country Link
US (1) US11935517B2 (zh)
CN (1) CN110164421B (zh)
WO (1) WO2020119351A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362812A (zh) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 一种语音识别方法、装置和电子设备
CN113808594A (zh) * 2021-02-09 2021-12-17 京东科技控股股份有限公司 编码节点处理方法、装置、计算机设备及存储介质

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164421B (zh) 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 语音解码方法、装置及存储介质
CN111968648B (zh) * 2020-08-27 2021-12-24 北京字节跳动网络技术有限公司 语音识别方法、装置、可读介质及电子设备
CN112259082B (zh) * 2020-11-03 2022-04-01 思必驰科技股份有限公司 实时语音识别方法及系统
CN113539242A (zh) * 2020-12-23 2021-10-22 腾讯科技(深圳)有限公司 语音识别方法、装置、计算机设备及存储介质
CN114220444B (zh) * 2021-10-27 2022-09-06 安徽讯飞寰语科技有限公司 语音解码方法、装置、电子设备和存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1455387A (zh) * 2002-11-15 2003-11-12 中国科学院声学研究所 一种语音识别系统中的快速解码方法
CN105513589A (zh) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN105575386A (zh) * 2015-12-18 2016-05-11 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN105654945A (zh) * 2015-10-29 2016-06-08 乐视致新电子科技(天津)有限公司 一种语言模型的训练方法及装置、设备
CN108288467A (zh) * 2017-06-07 2018-07-17 腾讯科技(深圳)有限公司 一种语音识别方法、装置及语音识别引擎
CN108305634A (zh) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 解码方法、解码器及存储介质
CN108682415A (zh) * 2018-05-23 2018-10-19 广州视源电子科技股份有限公司 语音搜索方法、装置和系统
CN110164421A (zh) * 2018-12-14 2019-08-23 腾讯科技(深圳)有限公司 语音解码方法、装置及存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030061046A1 (en) * 2001-09-27 2003-03-27 Qingwei Zhao Method and system for integrating long-span language model into speech recognition system
US7533023B2 (en) * 2003-02-12 2009-05-12 Panasonic Corporation Intermediary speech processor in network environments transforming customized speech parameters
US9424246B2 (en) * 2009-03-30 2016-08-23 Touchtype Ltd. System and method for inputting text into electronic devices
GB201020771D0 (en) * 2010-12-08 2011-01-19 Univ Belfast Improvements in or relating to pattern recognition
US9047868B1 (en) * 2012-07-31 2015-06-02 Amazon Technologies, Inc. Language model data collection
US9672810B2 (en) * 2014-09-26 2017-06-06 Intel Corporation Optimizations to decoding of WFST models for automatic speech recognition
CN105845128B (zh) * 2016-04-06 2020-01-03 中国科学技术大学 基于动态剪枝束宽预测的语音识别效率优化方法
DE102018108856A1 (de) * 2017-04-14 2018-10-18 Salesforce.Com, Inc. Tiefes verstärktes Modell für abstrahierungsfähige Verdichtung
US10943583B1 (en) * 2017-07-20 2021-03-09 Amazon Technologies, Inc. Creation of language models for speech recognition
US11106868B2 (en) * 2018-03-06 2021-08-31 Samsung Electronics Co., Ltd. System and method for language model personalization
US11017778B1 (en) * 2018-12-04 2021-05-25 Sorenson Ip Holdings, Llc Switching between speech recognition systems

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1455387A (zh) * 2002-11-15 2003-11-12 中国科学院声学研究所 一种语音识别系统中的快速解码方法
CN105654945A (zh) * 2015-10-29 2016-06-08 乐视致新电子科技(天津)有限公司 一种语言模型的训练方法及装置、设备
CN105513589A (zh) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN105575386A (zh) * 2015-12-18 2016-05-11 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN108288467A (zh) * 2017-06-07 2018-07-17 腾讯科技(深圳)有限公司 一种语音识别方法、装置及语音识别引擎
CN108305634A (zh) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 解码方法、解码器及存储介质
CN108682415A (zh) * 2018-05-23 2018-10-19 广州视源电子科技股份有限公司 语音搜索方法、装置和系统
CN110164421A (zh) * 2018-12-14 2019-08-23 腾讯科技(深圳)有限公司 语音解码方法、装置及存储介质

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808594A (zh) * 2021-02-09 2021-12-17 京东科技控股股份有限公司 编码节点处理方法、装置、计算机设备及存储介质
CN113362812A (zh) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 一种语音识别方法、装置和电子设备
CN113362812B (zh) * 2021-06-30 2024-02-13 北京搜狗科技发展有限公司 一种语音识别方法、装置和电子设备

Also Published As

Publication number Publication date
CN110164421B (zh) 2022-03-11
US11935517B2 (en) 2024-03-19
US20210193123A1 (en) 2021-06-24
CN110164421A (zh) 2019-08-23

Similar Documents

Publication Publication Date Title
WO2020119351A1 (zh) 语音解码方法、装置、计算机设备及存储介质
US11482208B2 (en) Method, device and storage medium for speech recognition
EP3792911B1 (en) Method for detecting key term in speech signal, device, terminal, and storage medium
US10956771B2 (en) Image recognition method, terminal, and storage medium
CN108829235B (zh) 语音数据处理方法和支持该方法的电子设备
US10978048B2 (en) Electronic apparatus for recognizing keyword included in your utterance to change to operating state and controlling method thereof
KR102389625B1 (ko) 사용자 발화를 처리하는 전자 장치 및 이 전자 장치의 제어 방법
CN110890093B (zh) 一种基于人工智能的智能设备唤醒方法和装置
US20220172737A1 (en) Speech signal processing method and speech separation method
EP3531416A1 (en) System for processing user utterance and controlling method thereof
KR102339819B1 (ko) 프레임워크를 이용한 자연어 표현 생성 방법 및 장치
WO2015171646A1 (en) Method and system for speech input
CN110570840B (zh) 一种基于人工智能的智能设备唤醒方法和装置
EP3738117B1 (en) System for processing user utterance and controlling method thereof
CN111833872B (zh) 对电梯的语音控制方法、装置、设备、系统及介质
US11468892B2 (en) Electronic apparatus and method for controlling electronic apparatus
EP3610479B1 (en) Electronic apparatus for processing user utterance
KR102369309B1 (ko) 파셜 랜딩 후 사용자 입력에 따른 동작을 수행하는 전자 장치
CN114333774A (zh) 语音识别方法、装置、计算机设备及存储介质
CN114360510A (zh) 一种语音识别方法和相关装置
CN112289302B (zh) 音频数据的合成方法、装置、计算机设备及可读存储介质
US20220223142A1 (en) Speech recognition method and apparatus, computer device, and computer-readable storage medium
CN116956814A (zh) 标点预测方法、装置、设备及存储介质
CN111640432B (zh) 语音控制方法、装置、电子设备及存储介质
US12008988B2 (en) Electronic apparatus and controlling method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19896417

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19896417

Country of ref document: EP

Kind code of ref document: A1