WO2018232591A1 - Sequence recognition processing - Google Patents

Sequence recognition processing

Info

Publication number
WO2018232591A1
Authority
WO
WIPO (PCT)
Prior art keywords
wfst
sequence
data structure
compact
recognition processing
Prior art date
Application number
PCT/CN2017/089187
Other languages
French (fr)
Inventor
Qiang Huo
Meng CAI
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Priority to PCT/CN2017/089187
Publication of WO2018232591A1

Classifications

    • G10L 15/08: Speech recognition; Speech classification or search
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G06F 40/20: Handling natural language data; Natural language analysis
    • G10L 2015/081: Speech classification or search; Search algorithms, e.g. Baum-Welch or Viterbi
    • G10L 2015/085: Speech classification or search; Methods for reducing search complexity, pruning

Definitions

  • the pruning strategy of the HWR/OCR decoder may be a combination of two pruning methods: standard beam pruning and histogram pruning.
  • the decoder may always record the current best path ⁇ opt for each frame.
  • a path p may be pruned when the cost between p and the optimal path ⁇ opt is greater than a threshold ⁇ .
  • the threshold ⁇ is the beam.
  • the histogram pruning may be applied after beam pruning to control the maximum number of active paths at a frame.
  • FIG. 7 is a block diagram of a computing device, according to an example embodiment.
  • multiple such computer systems are utilized in a distributed network to implement multiple components in a transaction-based environment.
  • An object-oriented, service-oriented, or other architecture may be used to implement such functions and communicate between the multiple systems and components.
  • One example computing device in the form of a computer 710 may include a processing unit 702, memory 704, removable storage 712, and non-removable storage 714.
  • Although the example computing device is illustrated and described as computer 710, the computing device may be in different forms in different embodiments.
  • the computing device may instead be a smartphone, a tablet, smartwatch, or other computing device including the same or similar elements as illustrated and described with regard to FIG. 7.
  • Devices such as smartphones, tablets, and smartwatches are generally collectively referred to as mobile devices.
  • Although the various data storage elements are illustrated as part of the computer 710, the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet.
  • memory 704 may include volatile memory 706 and non-volatile memory 708.
  • Computer 710 may include, or have access to, a computing environment that includes a variety of computer-readable media, such as volatile memory 706 and non-volatile memory 708, removable storage 712 and non-removable storage 714.
  • Computer storage includes random access memory (RAM) , read only memory (ROM) , erasable programmable read-only memory (EPROM) and electrically erasable programmable read-only memory (EEPROM) , flash memory or other memory technologies, compact disc read-only memory (CD ROM) , Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
  • Computer 710 may include or have access to a computing environment that includes input 716, output 718, and a communication connection 720.
  • the input 716 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 710, and other input devices.
  • the computer 710 may operate in a networked environment using a communication connection 720 to connect to one or more remote computers, such as database servers, web servers, and other computing devices.
  • An example remote computer may include a personal computer (PC) , server, router, network PC, a peer device or other common network node, or the like.
  • the communication connection 720 may be a network interface device such as one or both of an Ethernet card and a wireless card or circuit that may be connected to a network.
  • the network may include one or more of a Local Area Network (LAN) , a Wide Area Network (WAN) , the Internet, and other networks.
  • the communication connection 720 may also or alternatively include a transceiver device, such as a device that enables the computer 710 to wirelessly receive data from and transmit data to other devices.
  • Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 702 of the computer 710.
  • a hard drive (magnetic disk or solid state) , CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium.
  • various computer programs 725 or apps such as one or more applications and modules implementing one or more of the methods illustrated and described herein or an app or application that executes on a mobile device or is accessible via a web browser, may be stored on a non-transitory computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)

Abstract

Various embodiments herein each include at least one of systems, devices, methods, software, and data structures for sequence recognition processing (SRP). The SRP may include online or offline handwriting recognition (HWR) processing, optical character recognition (OCR) processing, automatic speech recognition (ASR) processing, and the like. One embodiment, in the form of a method includes compiling a decoding network to obtain a weighted finite-state transducer (WFST) data structure with transitions including an input label, an output label, a pointer to the next state, and a weight value. The method then removes the output labels from the WFST data structure to obtain a compact WFST data structure and outputs the compact WFST data structure for SRP.

Description

SEQUENCE RECOGNITION PROCESSING
BACKGROUND INFORMATION
The framework of a state-of-the-art segmentation-free sequence recognition processing (SRP) system, such as an online handwriting recognition (HWR) system, an offline HWR system, or an optical character recognition (OCR) system for printed text, often includes a character model, a language model, a lexicon, and a decoder. This is similar to an automatic speech recognition (ASR) system, although an ASR system typically uses a phoneme-based acoustic model. A decoder generates a recognition result by computing the best path in a decoding network. Different decoding algorithms and decoder implementations may have a great impact on the speed and memory footprint of a practical SRP system, such as an HWR system or an ASR system.
A highly successful technique widely used in decoders is the weighted finite-state transducer (WFST) . A WFST is a directed graph consisting of a set of states and a set of transitions (arcs) connecting the states. The WFST provides a unified representation of the language model, the lexicon, and the topology of the character model. Each of the components can be compiled into a WFST format. After the WFSTs of the components are generated, the individual WFSTs are often composed to generate a single WFST. Optimization operations such as determinization and minimization are applied to obtain the final WFST. This final WFST is used as the decoding network by the SRP decoder, such as an HWR or ASR decoder.
By the definition of a WFST, there are four attributes stored on each WFST transition. The four attributes are an input label, an output label, a weight, and a pointer to the next state. For the WFST decoding network of an English HWR system, the input labels are characters and the output labels are words. The weight on each WFST transition is the composed cost of the language model score, the lexicon score, and the transition score in the character model. Some efforts have also referred to the weights on the WFST transitions as the graph costs. The actual cost used during HWR decoding is a weighted sum of the graph cost and the character cost. The character cost is typically the negative log likelihood produced by the character model. The decoder tries to find the path with the lowest cost using Viterbi search with pruning. After the best path is found, the sequence of the output labels on the WFST transitions is used as the decoding result.
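As a rough illustration of how these costs combine (the function name and the weighting parameter below are illustrative assumptions, not values taken from this disclosure), a minimal sketch:

```python
import math

def transition_cost(graph_cost, char_posterior, char_weight=1.0):
    """Combine the graph cost on a WFST transition with the character cost.

    The character cost is the negative log likelihood from the character model;
    char_weight is an assumed scaling factor for the weighted sum.
    """
    character_cost = -math.log(char_posterior)
    return graph_cost + char_weight * character_cost

# e.g. a transition with graph cost 2.3 and a character posterior of 0.9
print(transition_cost(2.3, 0.9))
```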
The advantage of using WFST for decoding is that searching on a single, pre-compiled and optimized decoding network can be efficient and simple. However, WFST-based decoding may result in big memory consumption because the WFST is a fully-expanded decoding network. As a result, reducing the memory footprint while maintaining the efficiency of WFST-based decoders may make for a more practical SRP system, such as an HWR or ASR system.
SUMMARY
Various embodiments herein each include at least one of systems, devices, methods, software, and data structures for SRP. The SRP may include HWR processing, ASR processing, and the like.
One embodiment, in the form of a method includes compiling a decoding network to obtain a WFST data structure with transitions including an input label, an output label, a pointer to the next state, and a weight value. The method then removes the output labels from the WFST data structure to obtain a compact WFST data structure and outputs the compact WFST data structure for SRP.
Another method embodiment includes receiving, on a computing device, a sequence of input signals for sequence recognition processing based on a compact WFST and recording an alignment sequence for a computed path. The method may proceed by then splitting the recorded alignment sequence with <space> symbols to generate n sub-sequences and applying a connectionist temporal classification (CTC) criterion to each of the n sub-sequences to obtain n words, forming a word sequence Wn. The method may then return Wn or store it to a memory device.
A further embodiment is in the form of a computing device. The computing device may include an input, such as a network interface device or a touch screen, a processor, and a memory device. The memory device stores instructions executable by the processor to perform data processing activities. The data processing activities may include compiling a decoding network to obtain a weighted finite-state transducer data structure with transitions including an input label, an output label, a pointer to the next state, and a weight value. The data processing activities may further include removing the output labels from the WFST data structure to obtain a compact WFST data structure. The compact WFST data structure may then be stored in the memory device for SRP.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a logical block diagram of a system, according to an example embodiment.
FIG. 2 is a logical block diagram of a system according to an example embodiment.
FIG. 3 is an illustration of a lexicon WFST model, according to an example embodiment.
FIG. 4 is an illustration of a standard WFST model, according to an example embodiment.
FIG. 5 is an illustration of a compact WFST model, according to an example embodiment.
FIG. 6 is a block flow diagram of a method, according to an example embodiment.
FIG. 7 is a block diagram of a computing device, according to an example embodiment.
DETAILED DESCRIPTION
As mentioned above with regard to SRP systems, such as with regard to HWR, OCR, and ASR systems, reducing the memory footprint of the system while maintaining efficiency of WFST-based decoders may make for more practical SRP systems.
Previous work has made attempts to reduce the footprint of WFST-based decoders. The most well-known method is WFST-based decoding with on-the-fly language model rescoring. In these efforts, the WFST decoding network is built with a small language model. The decoder loads the small WFST and performs on-the-fly language model rescoring using a normal language model. The footprint of the small WFST decoding network plus the language model is significantly smaller than the single WFST built with the normal language model. Thus, the overall footprint of the decoder is reduced. With careful engineering, SRP decoding is even made possible on mobile devices. The on-the-fly language model rescoring method is suitable for a device-based SRP system. However, performing on-the-fly language model rescoring inevitably requires some extra computation compared with decoding on a single graph. Thus, the method of decoding on a single graph is better suited for a cloud-based or server-based SRP system. Moreover, an HWR or an OCR task has some unique properties that are different from an ASR task, which may enable some specific optimizations for the design of HWR/OCR decoders.
The various embodiments herein provide a compact WFST data structure for sequence decoding, such as may be used by SRP systems. WFSTs are directed graphs that may be used to represent language models, lexicons, context dependencies, and hidden Markov models in a unified framework. The WFST-based methods have been applied to various sequence decoding problems, such as speech recognition, offline handwriting recognition (which may instead be referred to as handwriting OCR) , printed OCR, online handwriting recognition (which may instead be referred to as ink recognition) , and the like.
According to its definition, the data structure of the WFST is represented by states and arcs. A WFST arc contains an input label, an output label, a weight value, and a pointer to the next state. The WFST-based sequence decoders have been known to be efficient, but may result in a large memory footprint. Some embodiments herein include a WFST that omits the output label of the WFST arc for some sequence decoding problems, in which the modeling unit sequences can be mapped to the system output sequences without ambiguity. These sequence decoding problems include handwriting OCR, printed OCR, ink recognition, and the like. Thus, the embodiments herein present a compact WFST data structure and related algorithms for the corresponding sequence decoding tasks. The decoders based on the compact WFST data structure have a smaller memory footprint and faster speed compared to conventional WFST-based decoders.
The compact WFST data structure in some such embodiments enables efficient sequence decoders. When using the same models and the same decoding beam, the decoder with the compact WFST produces exactly the same recognition results with a smaller memory footprint compared to conventional WFST-based decoders. As the cost of memory management is smaller, the speed of the decoder with the compact WFST is also faster than conventional decoders.
In the various embodiments herein, methods can be used in sequence decoding problems in which the modeling unit sequences can be mapped to the system output sequences without ambiguity. For example, the output symbols and modeling units may both be characters in handwriting OCR. Other sequence decoding problems that can benefit from the compact WFST data structure include printed OCR and ink recognition (i.e., handwriting input received on an electronic sensing surface, such as a touch screen of a personal computer, smartphone, tablet, smartwatch, or other device) .
Although speech recognition systems typically model phonemes and output characters, it is still possible for some speech recognition systems to benefit from the compact WFST data structure. These systems may use graphemes or even characters as the modeling units in the acoustic models. For example, it is possible to build a Chinese speech recognition system that models all the Chinese characters using connectionist temporal classification (CTC) -based acoustic models. Then the compact WFST data structure can be applied as well.
The various embodiments herein and as will be readily apparent are of likely interest to any companies, organizations, and other such entities and developers that develop and deploy practical systems for the abovementioned and related sequence decoding tasks. By reducing the memory footprint, such embodiments are likely to provide a technical cost savings (e.g., time, processing power, etc. ) for virtually any deployment, such as a cloud solution, even client-side or client only solutions on resource-constrained devices.
These and other embodiments are described herein with reference to the figures.
In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments in which the inventive subject matter may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice them, and it is to be understood that other embodiments may be utilized and that structural, logical, and electrical changes may be made without departing from the scope of the inventive subject matter. Such embodiments of the inventive subject matter may be referred to, individually and/or collectively, herein by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed.
The following description is, therefore, not to be taken in a limited sense, and the scope of the inventive subject matter is defined by the appended claims.
The functions or algorithms described herein are implemented in hardware, software or a combination of software and hardware in one embodiment. The software comprises computer executable instructions stored on computer readable media such as memory or other type of storage devices. Further, described functions may correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions are performed in one or more modules as desired, and the embodiments described are merely examples. The software is executed on a digital signal processor, ASIC, microprocessor, or other  type of processor operating on a system, such as a personal computer, server, a router, or other device capable of processing data including network interconnection devices.
Some embodiments implement the functions in two or more specific interconnected hardware modules or devices with related control and data signals communicated between and through the modules, or as portions of an application-specific integrated circuit. Thus, the exemplary process flow is applicable to software, firmware, and hardware implementations.
FIG. 1 is a logical block diagram of a system 100, according to an example embodiment. The system 100 is an example of various computing elements on which some embodiments may be deployed, in whole or in part. For example, some SRP systems may be deployed on a client device 102, on a server 106 to process sequence recognition processing requests received over a network 104 from a client device 102, or a hybrid thereof.
The client device 102 may be a smartphone, tablet, smartwatch, personal computer, laptop or convertible laptop (i.e., a laptop that may also be used as a tablet) , and the like. Generally, the client device 102 may be any device with which a user may interact to provide input that may be the subject of sequence recognition processing (e.g., handwriting, speech, etc. ) . Thus, the client device 102 includes one or more input devices through which such input may be received. Such input devices may include a touch screen, a touchpad, a microphone, a camera, and one or more other input devices that may be used to receive input that may be the subject of sequence recognition processing by an SRP system.
Upon receipt of input for sequence recognition processing, an SRP system is present to receive that input and then coordinate the processing thereof. In some embodiments, the SRP system may be deployed entirely on the client device 102. In other embodiments, the SRP system may be deployed in part on the client device 102 to receive the input, transmit a representation thereof over the network 104 to one or more servers 106, which may also be referred to as “the cloud” , and to receive SRP processing results over the network 104 from the one or more servers 106. Other embodiments may dynamically determine when to offload sequence recognition processing to the one or more servers 106 based on other factors, such as a current processing load on the client device 102, network 104 connectivity and latency, latency or load factors on the one or more servers 106, and other factors. In this way, sequence recognition processing may be optimized dynamically to ensure results may be obtained even when network 104 connectivity is not available (e.g., on an airplane, outside a WI-FI network, outside a wireless data network, etc. ) . Further, such embodiments may also optimize client device 102 responsiveness and help prevent network 104 and server 106 overloading to enhance the user experience and overall robustness of the system 100.
FIG. 2 is a logical block diagram of a system 200, according to an example embodiment. The system 200 includes a decoding network that takes into account labeled signals, a character model, a lexicon, a language model, and text to provide to a decoder that operates against received inputs for sequence recognition processing to obtain recognition results. In such embodiments, WFSTs are usually used as the decoding network to provide a contextual constraint in sequence recognition problems, such as speech recognition, handwriting OCR, printed OCR, ink recognition, etc. FIG. 2 shows the architecture block diagram of a typical handwriting OCR system. The compact WFST data structures of the various embodiments herein are applicable to problems in which the modeling unit sequences can be mapped to the system output sequences without ambiguity. The modeling units are defined as the units in the lexicon.
In the compact WFST building phase, the process, in some embodiments, may be represented as:
HCLG = addloop (min (det (rmeps (H ∘ C ∘ L ∘ G))))          (1)
P= rmolabel (HCLG)          (2)
where H is the character/acoustic model topology WFST, C is the WFST that represents the context dependency, L is the lexicon WFST and G is the language model WFST. The symbol “∘” represents WFST composition. The symbol det () represents WFST determinization. The symbol min () represents WFST minimization. The symbol rmeps () is the operation of removing epsilon arcs. The symbol addloop () is the operation that adds self-loops to the WFST.
In such embodiments, a standard WFST decoding network HCLG may be generated by applying the operations in Equation (1) . The operation in Equation (2) , rmolabel () , in such embodiments, removes the output labels from the WFST arcs. The rmolabel () operation is applied to the standard WFST decoding network HCLG to generate the compact WFST decoding network P. The network P has the same topology as the network HCLG. However, the arcs of P only contain the input label, the weight value, and the pointer to the next state. As a result, the footprint of P is smaller than that of HCLG. A comparison between the standard WFST and the compact WFST can be seen in FIG. 4 and FIG. 5.
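As an informal sketch of what the rmolabel () step might look like over an in-memory arc list (the Arc and CompactArc types below are illustrative assumptions, not the representation used in these embodiments):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Arc:                      # standard WFST transition
    ilabel: int                 # input label (e.g. a character id)
    olabel: int                 # output label (e.g. a word id)
    weight: float               # composed graph cost
    next_state: int             # pointer to the destination state

@dataclass
class CompactArc:               # compact WFST transition: no output label
    ilabel: int
    weight: float
    next_state: int

def rmolabel(arcs_per_state: List[List[Arc]]) -> List[List[CompactArc]]:
    """Drop the output labels from every arc while keeping the topology,
    input labels, and weights unchanged (cf. Equation (2))."""
    return [[CompactArc(a.ilabel, a.weight, a.next_state) for a in arcs]
            for arcs in arcs_per_state]
```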
In such embodiments, such as can be seen in FIG. 5, as there is no output label in the compact WFST P, the word boundaries may instead be recovered from the input-label information in the WFST P for some languages. For OCR and ink recognition tasks, this can be achieved in some embodiments by modeling the <space> symbol in the character models and the lexicon. The way to leverage the optional <space> symbol in a lexicon WFST is illustrated in FIG. 3. There are two paths for each lexicon item in the lexicon WFST. One path contains the character sequence without the <space> symbol. The other path contains the character sequence with a trailing <space> symbol.
Continuing with such embodiments, in the decoding phase, the optimal sequence is obtained by computing the lowest-cost path with respect to the following WFST:
S = U ∘ P          (3)
where P is the compact WFST decoding network. U is the WFST that represents the likelihood scores produced by the character models. Suppose there are T frames in an input signal, then U has T+1 states that form a chain. The number of arcs connecting adjacent states in U equals the number of output symbols of the character model. The arcs contain the character likelihood scores. The WFST U is never explicitly generated in a practical implementation, but the arcs of U are obtained on-demand to enable an on-the-fly composition according to Equation (3) . A breadth-first search algorithm with a beam is performed on the WFST S to get the optimal path. The input label sequences are recorded during the search.
The final output is obtained by doing an alignment on the input label sequence. The alignment is performed by traversing the input sequence using the WFST H ∘ C ∘ L. The word boundaries are determined by the <space> symbols in the input label sequences.
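A minimal sketch of this post-processing step, splitting the recorded alignment at <space> symbols and collapsing each sub-sequence under the CTC criterion to recover the word sequence (the symbol names and the toy alignment are illustrative assumptions):

```python
def ctc_collapse(labels, blank="<blank>"):
    """Apply the CTC collapsing rule: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for lab in labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

def alignment_to_words(alignment, space="<space>", blank="<blank>"):
    """Split a frame-level input-label alignment at <space> symbols and collapse
    each sub-sequence into a word, yielding the word sequence."""
    words, current = [], []
    for lab in alignment:
        if lab == space:
            if current:
                words.append("".join(ctc_collapse(current, blank)))
            current = []
        else:
            current.append(lab)
    if current:
        words.append("".join(ctc_collapse(current, blank)))
    return words

# A toy alignment for the text line "to be":
alignment = ["t", "t", "<blank>", "o", "<space>", "b", "<blank>", "e", "e"]
print(alignment_to_words(alignment))   # ['to', 'be']
```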
Thus, FIG. 6 is provided to show an example of a method 600 that may be performed in some embodiments. The method 600 includes compiling 602 a standard HCLG. The compiling may be performed according to Equation (1) to obtain a WFST decoding network model as illustrated in FIG. 4.
The method 600 may then remove 604 output labels. For example, the removing 604 of the output labels may be performed according to Equation (2) to obtain a compact WFST decoding network model as illustrated in FIG. 5.
The method 600 further includes decoding 606 to obtain an optimal sequence, such as according to Equation (3) . The method 600 may then perform an alignment 608 as discussed above.
In some HWR/OCR embodiments, a baseline system may be based on a deep bidirectional long short-term memory (DBLSTM) recurrent neural network (RNN) character model and an n-gram language model. The character model in some of these embodiments may be trained using a connectionist temporal classification (CTC) objective function and a stochastic gradient descent (SGD) optimization method. Such embodiments may denote the character model as the DBLSTM-CTC model. The inputs of the DBLSTM-CTC model may be the PCA features of the frames of the text line image or ink, while the outputs of the DBLSTM-CTC model may be all the character symbols and an extra <blank> symbol. The <blank> symbol must occur between two identical characters and may optionally occur between two different characters in the label sequence. The <blank> symbol may be used to model ambiguous frames between characters. Because of the insertion of the <blank> symbol, a softmax output of the DBLSTM-CTC may produce sharp spikes for the characters. This phenomenon can speed up decoding as a more aggressive adaptive beam search strategy can be effectively used for pruning.
The softmax outputs of the DBLSTM-CTC model may be used in some embodiments as posterior probabilities of the characters p (c|o) during decoding. The c represents characters and the o represents the frame vectors. In order to combine with the language model scores during decoding, the posterior probabilities of the character model may be converted into likelihood scores according to the following equation:
p (o|c) = p (c|o) · p (o) / p (c)^α          (4)
where p (o) is a constant that can be omitted. The p (c) is the prior probability estimated from the training set. The characters with the maximum CTC output values are simply picked as the alignment to calculate the p (c) . The α is a scaling factor between 0 and 1 to smooth the prior probability.
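For illustration, the conversion of Equation (4) in the log domain might look like the following sketch (the alpha value and the toy numbers are assumptions, not values from this disclosure):

```python
import math

def posterior_to_log_likelihood(log_posterior, log_prior, alpha=0.6):
    """Convert a CTC softmax log-posterior log p(c|o) into a log-likelihood score
    by dividing out the smoothed prior p(c)^alpha; the constant p(o) is dropped."""
    return log_posterior - alpha * log_prior

# Toy numbers: a character with posterior 0.8 and prior 0.05
log_lik = posterior_to_log_likelihood(math.log(0.8), math.log(0.05))
character_cost = -log_lik      # negative log likelihood used as the character cost
print(character_cost)
```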
The generation of the WFST in some embodiments follows a standard format. The overall procedure may be summarized as:
S = min (det (H ∘ C ∘ L ∘ G))          (5)
where L and G are the lexicon WFST and the language model WFST, respectively. These two WFSTs are compiled using a standard method. The C is the WFST that models the context dependency in the character model. As context dependency is not explicitly modeled in the HWR/OCR task in some embodiments, C may be a trivial WFST with the same input label and output label on each transition. The H is the WFST that models a hidden Markov model (HMM) . The CTC character model may also be viewed as an HMM with a single state for each character. The transition probability and the self-loop probability may both be set to 0.5. The ∘ symbol stands for the WFST composition. The det () symbol stands for the WFST determinization and the min () symbol stands for the WFST minimization. In the HWR/OCR lexicon, the <blank> symbol is inserted between the characters in the lexical items to match the CTC criterion. Some embodiments include optionally attaching the <space> symbol to the tail of every lexical item in order to model the space between two words. An illustration of the WFST L is given in FIG. 3.
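As a literal reading of this lexicon construction, the following sketch enumerates the two input-label paths for a single lexical item, with <blank> between characters and an optional trailing <space>; the actual WFST L additionally carries word output labels and weights, and the function below is an illustrative assumption:

```python
def lexicon_paths(word, blank="<blank>", space="<space>"):
    """Enumerate the two input-label paths for one lexical item in the lexicon
    WFST L: characters with <blank> inserted between them, without and with a
    trailing <space> (cf. FIG. 3)."""
    path = []
    for i, c in enumerate(word):
        if i > 0:
            path.append(blank)      # <blank> between characters, per the CTC criterion
        path.append(c)
    return [path, path + [space]]

for p in lexicon_paths("to"):
    print(p)
# ['t', '<blank>', 'o']
# ['t', '<blank>', 'o', '<space>']
```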
Just as with the various general embodiments above, HWR/OCR decoder embodiments may also be deployed in either cloud-based or device-based ways. For the cloud-based HWR/OCR task, the decoder may execute on powerful servers with advanced hardware. The cloud-based decoder may be designed to maximize processing efficiency. However, memory footprint is still an important issue for cloud-based decoders, as a smaller decoder footprint can save hardware cost while also contributing to less intensive hardware utilization. Some methods of decoding on a single WFST may be chosen for the cloud-based HWR/OCR decoder. Some such embodiments include a compact WFST data structure designed for the HWR/OCR task as described above and otherwise. Some such methods can significantly reduce the memory footprint compared with a standard method while producing the same decoding result.
According to Equation (5) , the output labels of the decoding network S are the words and the input labels of S are the characters. In the standard decoding scheme for HWR, OCR, or ASR, the output label sequence on the transitions of the best path is the decoding result, while the input label sequence is the alignment. The alignment is used to calculate the word boundaries together with the word sequence. Recording the WFST output label sequence for ASR may be performed for two reasons. First, the lexicon of speech recognition may contain word items with the same pronunciations, making it difficult, if not impossible, to get the word sequence using the alignments. Second, the alignments of the ASR decoding results may not necessarily contain a silence phoneme between two adjacent words, potentially resulting in ambiguity of word boundaries if only the alignments are given. However, these two issues do not exist for the HWR/OCR task. The concatenations of the characters naturally form the words in the HWR/OCR lexicon, and there is a space between two words in a text line for languages such as English. As a result, the word sequence in the HWR/OCR task can typically be completely recovered using the decoding alignment, as long as the <space> symbol is modeled in the WFST decoding network.
Based on the above analysis, it may not be necessary to record the word sequence for HWR/OCR decoding. Nor is it necessary to store the output labels on the transitions of the WFST for HWR/OCR. The HWR/OCR decoder only needs to produce the alignments using the input labels on the transitions of the WFST. A post-processing step can then be used to generate the word sequence from the alignments. Thus, the compact WFST data structure generally presented above for the HWR task is well suited to HWR/OCR decoding. Given a WFST decoding network, the compact WFST is generated in some embodiments by simply removing the output labels from the transitions, or arcs. An algorithm according to some embodiments for a compact WFST-based decoder is given in Algorithm 1.
[Algorithm 1: compact WFST-based decoding, rendered as images in the original publication]
As the output labels are not stored on the transitions of the decoding network or on the paths of the Viterbi search, the compact WFST data structure reduces the memory footprint compared with the standard WFST. Decoding is also slightly faster than with the standard method because the cost of memory management is smaller.
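For illustration, a toy sketch of the conversion from a standard WFST to the compact WFST by dropping output labels; the arc representation is an assumption of the sketch rather than the decoder's actual storage format.

from collections import namedtuple

# Standard WFST arc as recited in the claims: input label, output label,
# weight, and a pointer to the next state.
Arc = namedtuple("Arc", ["ilabel", "olabel", "weight", "next_state"])
# Compact WFST arc: the output label is simply dropped.
CompactArc = namedtuple("CompactArc", ["ilabel", "weight", "next_state"])

def to_compact(arcs_per_state):
    # arcs_per_state: list indexed by state, each entry a list of Arc
    # objects; the same topology is returned without output labels.
    return [[CompactArc(a.ilabel, a.weight, a.next_state) for a in arcs]
            for arcs in arcs_per_state]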
Besides the compact WFST data structure, several optimization strategies may be used to speed up some cloud-based HWR/OCR decoder embodiments. These strategies, in various embodiments, may include one or more of decoding network storage, an exhaustive hash table, and a pruning strategy.
With regard to decoding network storage, the decoding network may be stored as three arrays in memory. The first array in such embodiments may hold the non-epsilon transitions of the compact WFST. Each non-epsilon transition may store the input label, the weight, and the pointer to the next state. The second array may hold the epsilon transitions of the compact WFST. Each epsilon transition stores the weight and the pointer to the next state. The third array may hold the states of the compact WFST. Two pointers may be stored in each state; one points to the first non-epsilon transition from that state and the other points to the first epsilon transition from that state. This structure handles the non-epsilon transitions and the epsilon transitions separately.
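One possible in-memory layout matching the three-array description above, sketched with Python dataclasses standing in for flat arrays; the field names and the contiguity assumption are illustrative only.

from dataclasses import dataclass
from typing import List

@dataclass
class NonEpsArc:          # first array: non-epsilon transitions
    ilabel: int           # input label (character id)
    weight: float
    next_state: int       # pointer to the next state

@dataclass
class EpsArc:             # second array: epsilon transitions
    weight: float
    next_state: int

@dataclass
class State:              # third array: states
    # Index of this state's first non-epsilon arc and first epsilon arc in
    # the corresponding arrays; arcs of one state are assumed (for this
    # sketch) to be stored contiguously, so the next state's first index
    # marks the end of the range.
    first_non_eps: int
    first_eps: int

@dataclass
class CompactNetwork:
    non_eps_arcs: List[NonEpsArc]
    eps_arcs: List[EpsArc]
    states: List[State]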
With regard to the exhaustive hash table, during Viterbi search a hash table is typically needed to determine whether a next state has already been reached by any of the current transitions. The key of the hash table is the state index of the compact WFST and the value of the hash table is the pointer to the current token. A token in such embodiments typically records a path in the Viterbi algorithm. The time complexity of searching for an element in a hash table is O (1) , but the time complexity of inserting an element in a hash table is not O (1) , which degrades performance. Some such embodiments avoid using conventional hash tables in the cloud-based HWR/OCR decoder. In such embodiments, an array of flags and an array of pointers may be created, whose numbers of elements both equal the number of states in the decoding network. Inserting an element is equivalent to setting the corresponding flag to true and recording the pointer of the token. Searching for an element is equivalent to checking whether the flag is true and fetching the pointer of the token if available. In this way, the time complexity of inserting an element and of searching for an element are both O (1) . This data structure is referred to as an exhaustive hash table herein. The exhaustive hash table is practical for the cloud-based HWR/OCR decoder because such embodiments do not use a huge language model and thus the WFST may only have several million states.
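A small sketch of such an exhaustive hash table follows, assuming tokens are opaque objects; both insertion and lookup are single array accesses and therefore O (1) . The clear helper is an assumption added for per-frame reuse.

class ExhaustiveHashTable:
    # Flag array plus pointer array sized to the number of WFST states.

    def __init__(self, num_states):
        self.active = [False] * num_states   # one flag per state
        self.token = [None] * num_states     # pointer to the current token

    def insert(self, state_index, token):
        # Set the flag and record the token pointer.
        self.active[state_index] = True
        self.token[state_index] = token

    def lookup(self, state_index):
        # Check the flag and fetch the token pointer if available.
        return self.token[state_index] if self.active[state_index] else None

    def clear(self, touched_states):
        # Reset only the states touched in this frame so the per-frame cost
        # stays proportional to the number of active states.
        for s in touched_states:
            self.active[s] = False
            self.token[s] = None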
With regard to the pruning strategy, the pruning strategy of the HWR/OCR decoder may be a combination of two pruning methods: standard beam pruning and histogram pruning. For the beam pruning, the decoder may always record the current best path p_opt for each frame. A path p may be pruned when the cost difference between p and the optimal path p_opt is greater than a threshold η. The threshold η is the beam. The histogram pruning may be applied after the beam pruning to control the maximum number of active paths at a frame.
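A hedged sketch of combining beam pruning and histogram pruning over one frame's active paths, assuming each path carries an accumulated cost where lower is better; the names beam and max_active are illustrative, not taken from the disclosure.

import heapq

def prune_paths(paths, beam, max_active):
    # paths: list of (cost, token) pairs for the current frame.
    if not paths:
        return []
    best_cost = min(cost for cost, _ in paths)
    # Beam pruning: drop paths whose cost exceeds the best by more than the beam.
    survivors = [(c, t) for c, t in paths if c - best_cost <= beam]
    # Histogram pruning: keep only the max_active lowest-cost paths.
    if len(survivors) > max_active:
        survivors = heapq.nsmallest(max_active, survivors, key=lambda p: p[0])
    return survivors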
FIG. 7 is a block diagram of a computing device, according to an example embodiment. In one embodiment, multiple such computer systems are utilized in a distributed network to implement multiple components in a transaction-based environment. An object-oriented, service-oriented, or other architecture may be used to implement such functions and communicate between the multiple systems and components. One example computing device in the form of a computer 710 may include a processing unit 702, memory 704, removable storage 712, and non-removable storage 714. Although the example computing device is illustrated and described as computer 710, the computing device may take different forms in different embodiments. For example, the computing device may instead be a smartphone, a tablet, a smartwatch, or other computing device including the same or similar elements as illustrated and described with regard to FIG. 7. Devices such as smartphones, tablets, and smartwatches are generally collectively referred to as mobile devices. Further, although the various data storage elements are illustrated as part of the computer 710, the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet.
Returning to the computer 710, memory 704 may include volatile memory 706 and non-volatile memory 708. Computer 710 may include, or have access to, a computing environment that includes a variety of computer-readable media, such as volatile memory 706 and non-volatile memory 708, removable storage 712 and non-removable storage 714. Computer storage includes random access memory (RAM) , read only memory (ROM) , erasable programmable read-only memory (EPROM) and electrically erasable programmable read-only memory (EEPROM) , flash memory or other memory technologies, compact disc read-only memory (CD ROM) , Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
Computer 710 may include or have access to a computing environment that includes input 716, output 718, and a communication connection 720. The input 716 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 710, and other input devices. The computer 710 may operate in a networked environment using a communication connection 720 to connect to one or more remote computers, such as database servers, web servers, and other computing devices. An example remote computer may include a personal computer (PC) , server, router, network PC, a peer device or other common network node, or the like. The communication connection 720 may be a network interface device such as one or both of an Ethernet card and a wireless card or circuit that may be connected to a network. The network may include one or more of a Local Area Network (LAN) , a Wide Area Network (WAN) , the Internet, and other networks. In some embodiments, the communication connection 720 may also or alternatively include a transceiver device that enables the computer 710 to wirelessly receive data from and transmit data to other such devices.
Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 702 of the computer 710. A hard drive (magnetic disk or solid state) , CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium. For example, various computer programs 725 or apps, such as one or more applications and modules implementing one or more of the methods illustrated and described herein, or an app or application that executes on a mobile device or is accessible via a web browser, may be stored on a non-transitory computer-readable medium.
It will be readily understood to those skilled in the art that various other changes in the details, material, and arrangements of the parts and method stages which have been described and illustrated in order to explain the nature of the inventive subject matter may be made without departing from the principles and scope of the inventive subject matter as expressed in the subjoined claims.

Claims (20)

  1. A method for sequence recognition processing (SRP) on a computing device comprising:
    compiling a decoding network to obtain a weighted finite-state transducer (WFST) data structure with transitions including an input label, an output label, a pointer to the next state, and a weight value;
    removing the output labels from the WFST data structure to obtain a compact WFST data structure; and
    outputting the compact WFST data structure for SRP.
  2. The method of claim 1, wherein the compiling of the decoding network is performed as part of a process according to equation:
    S = min (det (H ∘ C ∘ L ∘ G) )
    where H is a WFST that models a hidden Markov model (HMM) , C is a WFST that models a context dependency of a character model, L is a lexicon WFST, G is a language model WFST, det () indicates WFST determinization, and min () indicates WFST minimization.
  3. The method of claim 1, further comprising:
    performing Viterbi search with pruning on the compact WFST to compute a path.
  4. The method of claim 3, further comprising:
    recording an alignment sequence for the computed path.
  5. The method of claim 4, further comprising:
    splitting the recorded alignment sequence with <space> symbols to generate n sub-sequences.
  6. The method of claim 5, further comprising:
    applying a connectionist temporal classification (CTC) criterion on each of the n sub-sequences to obtain n words, and forming a word sequence Wn; and
    returning Wn.
  7. The method of claim 1, wherein the computing device on which the method is performed is a networked server that provides the sequence recognition processing to client device processes over a network as a cloud-based service.
  8. The method of claim 1, wherein the sequence recognition processing is online or offline handwriting recognition processing, or optical character recognition processing.
  9. A method comprising:
    receiving, on a computing device, a sequence of input signals for sequence recognition processing based on a compact WFST;
    recording an alignment sequence for a computed path;
    splitting the recorded alignment sequence with <space> symbols to generate n sub-sequences;
    applying a connectionist temporal classification (CTC) criterion on each of the n sub-sequences to obtain n words, and forming a word sequence Wn; and
    returning Wn.
  10. The method of claim 9, wherein the sequence of input signals is received via a network from a client device.
  11. The method of claim 9, wherein the compact WFST is generated by a process including:
    compiling a decoding network to obtain a weighted finite-state transducer (WFST) data structure with transitions including an input label, an output label, a pointer to the next state, and a weight value; and
    removing the output labels from the WFST data structure to obtain the compact WFST data structure.
  12. The method of claim 11, wherein the compiling of the decoding network is performed as part of a process according to equation:
    S = min (det (H ∘ C ∘ L ∘ G) )
    where H is a WFST that models a hidden Markov model (HMM) , C is a WFST that models a context dependency of a character model, L is a lexicon WFST, G is a language model WFST, det () indicates WFST determinization, and min () indicates WFST minimization.
  13. The method of claim 9, wherein the path is computed by performing Viterbi search with pruning on the compact WFST.
  14. The method of claim 9, wherein the computing device is a mobile device.
  15. The method of claim 9, wherein the sequence recognition processing is online or offline handwriting recognition processing, or optical character recognition processing.
  16. A computing device comprising:
    an input;
    a processor; and
    a memory device storing instructions executable by the processor to perform data processing activities comprising:
    compiling a decoding network to obtain a weighted finite-state transducer (WFST) data structure with transitions including an input label, an output label, a pointer to the next state, and a weight value;
    removing the output labels from the WFST data structure to obtain a compact WFST data structure; and
    storing the compact WFST data structure in the memory device for sequence recognition processing (SRP) .
  17. The computing device of claim 16, wherein the compiling of the decoding network is performed as part of a process according to equation:
    S = min (det (H ∘ C ∘ L ∘ G) )
    where H is a WFST that models a hidden Markov model (HMM) , C is a WFST that models a context dependency of a character model, L is a lexicon WFST, G is a language model WFST, det () indicates WFST determinization, and min () indicates WFST minimization.
  18. The computing device of claim 16, wherein the data processing activities further comprise:
    recording an alignment sequence for a computed path;
    splitting the recorded alignment sequence with <space> symbols to generate n sub-sequences;
    applying a connectionist temporal classification (CTC) criterion on each of the n sub-sequences to obtain n words, and forming a word sequence Wn; and
    returning Wn.
  19. The computing device of claim 16, wherein:
    the input is a network interface device; and
    the computing device is a networked server that provides sequence recognition processing to client device processes via the network interface over a network as a cloud-based service.
  20. The computing device of claim 16, wherein:
    the computing device is a smartphone;
    the input is a microphone; and
    the sequence recognition processing is automated speech recognition processing.