WO2020179193A1 - Information processing apparatus and information processing method - Google Patents
- Publication number
- WO2020179193A1 (application PCT/JP2019/049771)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- arc
- storage device
- wfst
- graph
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
Definitions
- the technology disclosed in this specification (hereinafter referred to as “the present disclosure”) relates to an information processing apparatus and an information processing method for performing a graph search process.
- WFST Weighted Finite State Transducer
- the WFST model is built from text data and a corpus (language material that is a large-scale database of texts and utterances) collected for learning.
- a WFST model search process (hereinafter also referred to as “WFST search” in the present specification) is performed in order to find the text character string that most likely corresponds to the input speech.
- WFST search is a kind of graph search processing. To perform the search at high speed, it is common to load the entire WFST into the main memory at the time of execution (the main memory referred to here is the local memory of the CPU, also simply called “memory”). However, a WFST corresponding to a large vocabulary has a size of several tens of GB to several hundreds of GB, so the WFST search cannot be run unless the system has a large memory capacity. If the WFST is placed in an auxiliary storage device (hereinafter also simply referred to as “disk”) such as an HDD (Hard Disc Drive) or SSD (Solid State Drive) instead of in memory, memory usage can be reduced. However, since a disk has lower access speed and throughput than memory, the time required for the WFST search becomes significantly longer.
- auxiliary storage device hereinafter also simply referred to as “disk”
- HDD Hard Disc Drive
- SSD Solid State Drive
- Patent Document 1 Japanese Unexamined Patent Application Publication No. 2015-529350 Japanese Patent Publication No. 2017-527844
- An object of the technique according to the present disclosure is to provide an information processing apparatus and an information processing method for performing graph search processing on a graph of huge size.
- the first aspect of the technology according to the present disclosure is an information processing device that includes a calculation unit, a first storage device, and a second storage device.
- the graph information is divided into first graph information and second graph information, the first graph information is arranged in the first storage device, the second graph information is arranged in the second storage device, and the arithmetic unit executes a graph search process using the first graph information arranged in the first storage device and the second graph information arranged in the second storage device.
- the graph information is a WFST model that expresses an acoustic model, a pronunciation dictionary, and a language model in speech recognition. The language model is divided into a large part and a small part, and the first graph information is a small WFST model in which the smaller language model, which considers the connection of a first number of words or fewer, is combined with the acoustic model and the pronunciation dictionary.
- the second graph information is a large WFST model composed of a language model that considers the connection of an arbitrary number of words larger than the first number.
- when the calculation unit needs to refer to the second graph information while executing the search process using the first graph information, the calculation unit copies the necessary part of the second graph information from the second storage device to the first storage device and continues the search process.
- the arithmetic unit includes a first arithmetic unit including a GPU (Graphics Processing Unit) or other many-core arithmetic unit and a second arithmetic unit including a CPU (Central Processing Unit); the first storage device is a memory in the GPU, and the second storage device is a local memory of the CPU. The first arithmetic unit transitions the token in the small WFST model, and when a word is output from the transitioned arc and the state transition of the token in the large WFST model must be performed, the first arithmetic unit performs all of the search processing while copying the data necessary for that processing from the second storage device to the first storage device.
- a GPU Graphics Processing Unit
- CPU Central Processing Unit
- the arithmetic unit is composed of a CPU or a GPU
- the first storage device is a local memory of the arithmetic unit
- the second storage device is an auxiliary storage device. The arithmetic unit transitions the token in the small WFST model, and when a word is output from the transitioned arc and the state transition of the token in the large WFST model must be performed, the arithmetic unit performs that processing while copying the necessary data from the second storage device to the first storage device.
- the larger WFST model is composed of an arc array in which arcs are sorted by the state ID of the source state and by the input label. The first storage device stores data for accessing the arc array: an arc index that stores, for each state, the start position of that state's arcs on the arc array, and an input label array that stores the input labels of the arcs in the same order as the arc array. The arithmetic unit uses the arc index to find the start position on the arc array for the state ID of the source state of the target arc, searches the input label array from that start position for the input label of the target arc to identify the position where the target arc is stored on the arc array, and acquires the data of the target arc from the arc array in the second storage device.
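The access data described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the names `Arc`, `build_access_data`, and `find_arc_position` are hypothetical, plain Python lists stand in for both storage devices, and a sentinel entry at the end of the arc index and a binary search within each state's range are assumptions made for the sketch.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Optional, Tuple


class Arc(NamedTuple):
    """Hypothetical arc record: input label, output label, weight, destination state."""
    ilabel: int
    olabel: int
    weight: float
    next_state: int


def build_access_data(
    arcs_by_state: List[List[Arc]],
) -> Tuple[List[Arc], List[int], List[int]]:
    """Build the arc array (kept in the second storage device) plus the
    arc index and input label array (kept in the first storage device)."""
    arc_array: List[Arc] = []
    arc_index: List[int] = []
    input_labels: List[int] = []
    for arcs in arcs_by_state:  # states are visited in state-ID order
        arc_index.append(len(arc_array))  # start position of this state's arcs
        for arc in sorted(arcs, key=lambda a: a.ilabel):
            arc_array.append(arc)
            input_labels.append(arc.ilabel)
    arc_index.append(len(arc_array))  # sentinel (an assumption of this sketch)
    return arc_array, arc_index, input_labels


def find_arc_position(
    arc_index: List[int], input_labels: List[int], state: int, ilabel: int
) -> Optional[int]:
    """Return the position of the target arc on the arc array, or None.
    Only the in-memory access data is touched here."""
    lo, hi = arc_index[state], arc_index[state + 1]
    pos = bisect_left(input_labels, ilabel, lo, hi)
    if pos < hi and input_labels[pos] == ilabel:
        return pos
    return None
```

Once the position is known, only that one element of the arc array has to be read from the slower storage device, which is the point of keeping the index and the label array in the faster one.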
- a second aspect of the technology according to the present disclosure is an information processing apparatus including a calculation unit, a first storage device, and a second storage device.
- according to the technology of the present disclosure, it is possible to provide an information processing apparatus and an information processing method that divide graph information of a huge size into two parts, arrange the parts in two storage areas, and perform the graph search at high speed with less memory.
- FIG. 1 is a diagram showing a configuration example of a voice recognition system 100.
- FIG. 2 is a diagram showing an example of dividing the WFST model.
- FIG. 3 is a diagram showing a schematic configuration example of the voice recognition system 300 (first embodiment).
- FIG. 4 is a diagram showing a specific configuration example of the voice recognition system 300.
- FIG. 5 is a flowchart showing the overall processing procedure of voice recognition executed by the voice recognition system 300.
- FIG. 6 is a flowchart showing a detailed processing procedure of the graph search processing.
- FIG. 7 is a diagram for explaining a method of sending necessary arc information from the GPU 320 to the CPU 310 and causing the CPU 310 side to copy to the device memory 321.
- FIG. 8 is a diagram showing a mechanism for caching an arc of a large graph in the device memory 321 on the GPU 320 side.
- FIG. 9 is a diagram showing a configuration example of a speech recognition system 300 including a large graph cache.
- FIG. 10 is a flowchart showing the overall processing procedure of voice recognition executed by the voice recognition system 300 shown in FIG.
- FIG. 11 is a flowchart showing a detailed processing procedure of the graph search processing.
- FIG. 12 is a diagram showing a functional configuration example of the agent system 1200.
- FIG. 13 is a diagram showing how the arc extends from the state.
- FIG. 14 is a diagram showing the input/output relationship of the language model WFST.
- FIG. 15 is a diagram showing a schematic configuration example of a voice recognition system 1500 (second embodiment).
- FIG. 16 is a diagram showing a configuration example of a voice recognition system 1500 in which data for searching for an arc is arranged in a memory.
- FIG. 17 is a diagram showing a configuration example of WFST (large) access data.
- FIG. 18 is a diagram showing another configuration example of the WFST (large) access data.
- FIG. 19 is a diagram showing a specific functional configuration example of the voice recognition system 1500.
- FIG. 20 is a flowchart showing the overall processing procedure of voice recognition executed by the voice recognition system 1500.
- FIG. 21 is a flowchart showing an example of a detailed processing procedure of the WFST search processing.
- FIG. 22 is a flowchart showing another example of the detailed processing procedure of the WFST search processing.
- FIG. 23 is a flowchart showing a processing procedure for identifying a page in which a target arc is arranged on the arc array.
- FIG. 24 is a diagram showing a specific functional configuration example of the voice recognition system 1500 having the arc pre-reading function.
- FIG. 25 is a flowchart showing a detailed processing procedure of the WFST search process in the voice recognition system 1500 shown in FIG. 24.
- FIG. 26 is a flowchart showing a detailed processing procedure of the WFST search process in the voice recognition system 1500 shown in FIG. 24.
- FIG. 27 is a diagram showing a specific functional configuration example of the voice recognition system 2700.
- FIG. 28 is a flowchart showing an overall processing procedure of voice recognition executed by the voice recognition system 2700.
- FIG. 1 shows a schematic functional configuration example of the speech recognition system 100.
- the illustrated voice recognition system 100 includes a feature amount extraction unit 101, a DNN (Deep Neural Network) calculation unit 102, and a WFST search unit 103. It should be noted that not all speech recognition systems are configured as in FIG. 1 and other configurations may exist.
- DNN Deep Neural Network
- the feature amount extraction unit 101 receives voice data in units of, for example, 10 milliseconds from a voice input unit such as a microphone (not shown).
- the feature amount extraction unit 101 calculates the feature amount of voice by applying a Fourier transform to the input voice data or using a mel filter bank or the like.
- the required processing time of the feature amount extraction unit 101 is, for example, less than 1 millisecond.
- the DNN calculation unit 102 applies a pre-trained DNN model to the feature amount extracted by the feature amount extraction unit 101 and calculates a score (likelihood) for each state of the HMM (Hidden Markov Model).
- the processing time required by the DNN calculation unit 102 is, for example, about 1 millisecond.
- the WFST search unit 103 applies a pre-trained WFST model to the HMM state scores calculated by the DNN calculation unit 102, determines the most likely recognition result character string, and outputs the text of the recognition result.
- the processing time required for the WFST search unit 103 is, for example, about 1 to 30 milliseconds.
- the WFST is a finite state machine in which information of an input symbol, an output symbol, and a weight (transition probability) is attached to an arc.
- a speech recognition system is composed of an acoustic model showing phonemes and acoustic features, a pronunciation dictionary showing pronunciation of individual words, and a language model giving grammatical rules and a probability of chaining words.
- the state transition of the HMM used as the acoustic model, the pronunciation dictionary, and the N-gram model used as the language model can be expressed by the WFST model, respectively.
- in the acoustic model WFST, the input symbols are HMM states and the output symbols are phonemes.
- in the pronunciation dictionary WFST, the input symbols are phonemes and the output symbols are words.
- in the language model WFST, both the input and output symbols are words.
- the language model is used to express transition probabilities of connection between words.
- the WFST is configured as, for example, a network in which a phoneme string is embedded in a word and an HMM is embedded in a phoneme. In the WFST after combination, the input symbols are HMM states and the output symbols are words. In this way, the speech recognition process is reduced to a network search problem.
- the language model grows in size as a power of the vocabulary size.
- the number of states (nodes) and arcs (edges) of WFST increase to several billion, respectively, and the size becomes several tens of GB (for example, the number of states is 1.2 billion, arc).
- the WFST search unit 103 searches the WFST, that is, the network in which the WFSTs of the acoustic model, the pronunciation dictionary, and the language model are synthesized, for the path (optimal state transition process) that best fits the input voice signal, and decodes the input voice signal into the word string that matches it acoustically and linguistically.
- the WFST search unit 103 is required to search for an optimum word string at high speed.
- WFST search procedure The WFST search procedure is, for example, as follows.
- the WFST may be divided into two by splitting a large language model into a large part and a small part.
- a language model that considers the connection of two words is used as the smaller part, and a language model that considers the connection of four words is used as the larger part.
- a combination of an acoustic model, a pronunciation dictionary, and a small language model is a small WFST model
- a large language model is a large WFST model (or a language model WFST).
- a small WFST model is about several GB
- a large WFST model is about several tens GB.
- a token is transitioned in a small WFST model, and only when a word is output from the transitioned arc, the state transition of the token in a large WFST model is performed. Since the large WFST model has only the role of considering the connection of words, the transition occurs only when the words are output. By multiplying the tokens by the transition probabilities of the large WFST model, it is possible to consider the probability of long word connections that do not exist in the small WFST model.
- N-gram is often used as a language model.
- the N-gram is a model in which the probability that words are connected to each other is represented by an (N-1)-th order Markov process. If the vocabulary size is V, there are V^N possible connections of N words, so V^N arcs would be needed to represent them all in a WFST. Since it is unrealistic to create such a WFST, not all connections are actually modeled.
- the language model WFST learns from a large amount of sentences, but, for example, the connection of words whose appearance frequency is a certain number or less is removed from the model. If a connection to an unmodeled word occurs during the search, the token will transition to a state called backoff. Transitioning to the backoff state is equivalent to considering lower order connections.
- the input label of the arc of the language model WFST is a word.
- an arc whose input label is the input word is followed from the current state (in on-the-fly synthesis, the input word is the word output by the small WFST). If there is no arc with the input word, the token transitions to the backoff state and searches from there for an arc with the same input word. That is, when transitioning to the backoff state, a single input causes transitions over a plurality of arcs.
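The backoff behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the disclosure: the function name `follow_word`, the dictionary layout of `arcs` and `backoff`, and the presence of a weight on each backoff transition are all hypothetical choices made for the sketch, and weights are treated as costs that add up along the path.

```python
from typing import Dict, Hashable, Tuple

# Hypothetical layouts: arcs[state][word] -> (weight, next_state)
#                       backoff[state]    -> (backoff_weight, backoff_state)
ArcTable = Dict[Hashable, Dict[str, Tuple[float, Hashable]]]
BackoffTable = Dict[Hashable, Tuple[float, Hashable]]


def follow_word(
    arcs: ArcTable, backoff: BackoffTable, state: Hashable, word: str
) -> Tuple[float, Hashable, int]:
    """Follow the arc labelled `word` from `state`, backing off as needed.

    Returns (accumulated weight, destination state, number of arcs traversed).
    """
    total = 0.0
    traversed = 0
    while True:
        hit = arcs.get(state, {}).get(word)
        if hit is not None:
            weight, next_state = hit
            return total + weight, next_state, traversed + 1
        if state not in backoff:
            raise KeyError(f"no arc for {word!r} and no backoff from {state!r}")
        # No arc for this word here: move to the backoff state and retry.
        backoff_weight, state = backoff[state]
        total += backoff_weight
        traversed += 1
```

With one backoff step, a single input word traverses two arcs (the backoff arc and the word arc), which is the "plurality of arcs by a single input" case noted above.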
- FIG. 13 shows that the arc for each input label extends from the state “x”.
- arcs corresponding to the words “a”, “b”, “c”, and “y” are extended as input labels from the state “x”. For example, when the word “y" is input in the state "x", the arc corresponding to the input label "y" is searched.
- FIG. 14 shows the input/output relationship of the language model WFST.
- the input of the language model WFST is a state ID and a label, and its output is an arc.
- the arc is composed of the input label, the output label, the weight, and the state ID of the transition destination.
- the technology according to the present disclosure divides a huge WFST into two parts and arranges each in one of two storage areas to realize a WFST search at high speed with less memory.
- a first embodiment that realizes on-the-fly combining using a many-core arithmetic unit such as a GPU and a second embodiment that realizes on-the-fly combining by arranging WFST data divided into two in a memory and a disk will be described.
- as a third embodiment, a specific example to which the large-scale graph search technique according to the present disclosure is applied will be described.
- a many-core arithmetic unit such as GPU may be used (described above).
- manycore arithmetic units such as GPUs generally have a limited memory capacity.
- the main memory that can be accessed from the CPU (Central Processing Unit) can be expanded relatively easily to several hundred GB (gigabytes).
- the memory mounted on the GPU is at most several GB to a dozen GB. Due to the exhaustion of the device memory, it is difficult to perform the search processing of the large vocabulary speech recognition in which the size of the WFST model is several tens GB or more on a many-core arithmetic unit such as GPU.
- a data processing method for performing a WFST search by on-the-fly synthesis (described above) in a hybrid environment in which both a CPU and a GPU are used has been proposed (see Patent Document 1).
- the operation using the smaller WFST model is performed by the GPU, and the operation using the larger WFST model is performed by the CPU.
- the smaller WFST model is expanded in the memory of the GPU, and the large WFST model, which consumes a large amount of memory, is arranged in the main memory, so that the problem of insufficient device memory can be resolved.
- the state transition of the smaller WFST model is performed on the GPU, and the likelihood correction using the larger WFST model is performed on the CPU.
- processing is performed to acquire a specific arc extending from a certain state.
- Table 1 illustrates the data structure of the larger WFST model.
- the position of the corresponding Arc is searched for and referred to (see FIG. 13).
- the state transitions to a state called backoff, and the corresponding arc is searched for again from the backoff state.
- a binary search or a hash map is used for searching the position of the arc.
- the amount of calculation using the larger WFST model is large. Therefore, if the state transition of the smaller WFST model is executed on the GPU and the likelihood correction using the larger WFST model is executed on the CPU, the calculation on the CPU side becomes a bottleneck, and there is concern that performance (processing speed and throughput) will not improve sufficiently even though the GPU is introduced.
- the GPU executes the search processing of a large-scale graph for speech recognition.
- the large-scale graph mentioned here is the larger language model obtained by dividing the language model into two, that is, the language model WFST.
- the large scale means a size of, for example, several tens GB or more, which cannot be expanded in the device memory.
- the application range of the technology according to the present disclosure is not limited to the graph search processing of the GPU and the voice recognition.
- the GPU can be replaced with a many-core arithmetic unit having a limited memory capacity (having a memory capacity smaller than the graph size), and the graph search process of speech recognition can be replaced with a general graph search process.
- FIG. 3 schematically shows a configuration example of a voice recognition system 300 to which the technique proposed as the first embodiment is applied.
- the illustrated voice recognition system 300 includes a hybrid environment using the CPU 310 and the GPU 320.
- the CPU 310 includes a main memory 311 having a relatively large capacity (for example, about several tens of GB) as a local memory.
- the GPU 320 is composed of a many-core arithmetic unit and can execute graph search processing such as WFST at high speed by parallel processing of each core.
- the GPU 320 also includes a local memory (here, referred to as “device memory”) 321, but its capacity is smaller than that of the main memory, for example, about several GB.
- main memory 311 can also be accessed from the GPU 320.
- the CPU 310 executes the copying of the data on the main memory 311 to the device memory 321.
- the GPU 320 may access the main memory 311 at high speed by using a DMA (Direct Memory Access) function.
- DMA Direct Memory Access
- the voice recognition system 300 divides the WFST model into a large one and a small one and combines them by on-the-fly synthesis at the time of executing voice recognition.
- a language model having a large size is divided into large and small.
- the smaller language model considering the connection between the two words is combined with the acoustic model and the pronunciation dictionary to create the smaller WFST model.
- the smaller WFST model (small graph) is arranged in the device memory 321 having a relatively small capacity.
- the language model considering the connection of four words is the larger WFST model.
- the larger WFST model (large graph) has a size of about several tens GB and is arranged in the main memory 311.
- the state transition of the smaller WFST model is performed on the GPU, and the likelihood correction using the larger WFST model is performed on the CPU (described above).
- the WFST model search process is not performed on the CPU 310, but is basically performed only on the GPU 320.
- the GPU 320 basically transitions tokens in the small WFST model. When a word is output from the transitioned arc and a state transition in the large WFST model is required, the CPU 310 does not perform search processing; instead, only the data of the large WFST model (specifically, the input label, the output label, the weight, and the ID of the transition destination state of the arc) is copied from the main memory 311 to the device memory 321, and all search processing is performed on the GPU 320. As a result, the processes performed on the CPU 310 are basically limited to data transfer to the GPU 320 and control of the GPU 320. Therefore, the calculation resources of the GPU 320 can be used effectively, and the load on the CPU 310 side can be greatly reduced.
- (Advantage 1) Can be processed faster: By using an arithmetic unit having a large number of cores such as a GPU, it is possible to process a plurality of hypotheses (tokens) in parallel and reduce the processing time for search. Especially in voice recognition services such as voice agents, it is important to process quickly in order to respond quickly to the user.
- FIG. 4 shows a more specific functional configuration example of the voice recognition system 300.
- the voice recognition system 300 includes a hybrid environment using the CPU 310 and the GPU 320.
- the CPU 310 includes a main memory 311 having a relatively large capacity (for example, about several tens GB) as a local memory.
- the GPU 320 includes a small capacity device memory 321.
- a signal processing unit 401, a feature amount extraction unit 402, and a recognition result output processing unit 405 are arranged in the CPU 310.
- the GPU 320 includes an HMM score calculation unit 403 and a graph search unit 404. These functional modules indicated by reference numerals 401 to 405 may actually be software programs executed by the CPU 310 or GPU 320.
- the voice input unit 441 is composed of a microphone or the like and picks up a voice signal.
- the signal processing unit 401 performs predetermined digital processing on the audio signal received by the audio input unit 441.
- the feature amount extraction unit 402 extracts the feature amount of voice by using a known technique such as Fourier transform or mel filter bank.
- the feature amount extraction unit 402 is arranged on the CPU 310 side, but it may be executed by the GPU 320.
- the HMM score calculation unit 403 receives the information on the feature amount of the voice and calculates the score of each HMM state using the acoustic model 431.
- Gaussian Mixture Model (GMM) or DNN is used for the HMM.
- the acoustic model 431 is arranged in the GPU memory (device memory) 321 as shown in FIG.
- the processing of the HMM score calculation may be performed on the CPU 310 side, and in that case, the acoustic model 431 is arranged on the main memory 311.
- the graph search unit 404 receives the HMM state score and uses the small graph (smaller WFST model) 432 on the GPU memory (device memory) 321 and the large graph (larger WFST model) 421 on the main memory 311. Search processing by on-the-fly synthesis is performed.
- intermediate records, such as a hypothesis list of recognition results generated by the graph search unit 404 during the search process, are temporarily stored in the work area 433 on the device memory 321.
- these intermediate records may instead be saved in a work area on the main memory 311, or may be saved in both the device memory 321 and the main memory 311.
- the graph search unit 404 finally outputs the character string of the voice recognition result.
- the character string of the recognition result is sent from the work area 433 on the device memory 321 to the recognition result output processing unit 405 on the CPU 310 side.
- the recognition result output processing unit 405 performs processing for displaying or outputting the recognition result from the output unit 442 including a display and a speaker.
- the voice recognition system 300 may be configured as a device including at least one of the voice input unit 441 and the output unit 442.
- the CPU 310 and the GPU 320 may be installed in a server on the cloud, and the voice input unit 441 and the output unit 442 may be configured as a voice agent device (described later).
- FIG. 5 shows, in the form of a flowchart, the overall processing procedure of speech recognition executed by the speech recognition system 300 shown in FIG.
- the voice data that has been digitally processed by the signal processing unit 401 is divided into units of, for example, 10 milliseconds and input to the feature amount extraction unit 402.
- the feature amount extraction unit 402 extracts the feature amount of the voice from the voice data that has been digitally processed by the signal processing unit 401, using a known technique such as Fourier transform or mel filter bank (step S502).
- the HMM score calculation unit 403 receives information on the feature amount of the voice, and calculates the score of each HMM state using the acoustic model 431 (step S504).
- the graph search unit 404 receives the HMM state score and performs a search process by on-the-fly synthesis using the small graph (smaller WFST model) 432 on the GPU memory (device memory) 321 and the large graph (larger WFST model) 421 on the main memory 311 (step S505).
- in step S505, the graph search unit 404 first transitions tokens in the small graph.
- when a word is output from the transitioned arc, the arc information of the large graph is copied to the device memory 321 of the GPU 320, and the token transition on the large graph is performed. Then, after transitioning all hypotheses, the graph search unit 404 prunes the hypotheses as a whole. The details of this processing will be described later (see FIG. 6).
- steps S502 to S505 are repeatedly executed for the voice data divided every 10 milliseconds.
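The frame-by-frame repetition described above can be sketched as follows. This is only an illustration: the names `split_into_frames` and `recognize` are hypothetical, the 16 kHz sample rate used in the usage note is an assumption (the text states only the 10 millisecond unit), and `process_frame` stands in for steps S502 to S505 without implementing them. Dropping a trailing partial frame is also a choice made for the sketch.

```python
from typing import Callable, List, Sequence


def split_into_frames(
    samples: Sequence[float], sample_rate_hz: int, frame_ms: int = 10
) -> List[List[float]]:
    """Cut the audio into consecutive frames of `frame_ms` milliseconds.
    A trailing partial frame is dropped (an assumption of this sketch)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return [
        list(samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def recognize(
    samples: Sequence[float],
    sample_rate_hz: int,
    process_frame: Callable[[List[float]], None],
) -> int:
    """Run the per-frame steps once per frame until the input ends.
    Returns the number of frames processed."""
    count = 0
    for frame in split_into_frames(samples, sample_rate_hz):
        process_frame(frame)  # placeholder for steps S502 to S505
        count += 1
    return count
```

At an assumed 16 kHz sample rate, each 10 millisecond frame holds 160 samples, so 400 samples yield two full frames.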
- when the end of the input voice is reached (No in step S501), the character string of the voice recognition result produced by the graph search unit 404 is copied from the work area 433 on the device memory 321 to the main memory 311 on the CPU 310 side (step S506).
- the recognition result output processing unit 405 performs processing for displaying or outputting the recognition result from the output unit 442 including a display and a speaker (step S507).
- FIG. 6 shows, in the form of a flow chart, the detailed processing procedure of the graph search processing executed in step S505 in the flow chart shown in FIG.
- the graph search unit 404 transitions tokens in the small graph 432 (smaller WFST model) on the device memory 321 (step S601).
- when a word is output from the transitioned arc (Yes in step S602), the state transition of the token in the large graph (large WFST model) is performed.
- the graph search unit 404 calculates the address on the main memory 311 that stores the required large graph arc information, copies the arc information of the large graph from that address on the main memory 311 to the device memory 321 on the GPU 320 side (step S603), and transitions the token of the large graph on the device memory 321 (step S604). Then, after transitioning all the hypotheses, the graph search unit 404 prunes the hypotheses as a whole (step S605) and ends this processing. When no word is output from the transitioned arc (No in step S602), the graph search unit 404 likewise prunes the hypotheses as a whole (step S605) and ends this processing.
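Steps S601 to S605 can be sketched for one frame as follows. This is a simplified sequential illustration, not the parallel GPU procedure of the disclosure: the function name `search_frame`, the tuple layouts, the label `EPSILON` for "no word output", and beam pruning relative to the best cost are all assumptions, HMM state scores are omitted, and `fetch_large_arc` merely stands in for the copy from main memory to device memory.

```python
from typing import Callable, Dict, List, Tuple

EPSILON = 0  # hypothetical output label meaning "no word was output"

Token = Tuple[int, int, float]     # (small state, large state, cost)
SmallArc = Tuple[int, float, int]  # (output label, weight, next small state)


def search_frame(
    tokens: List[Token],
    small_arcs: Dict[int, List[SmallArc]],
    fetch_large_arc: Callable[[int, int], Tuple[float, int]],
    beam: float,
) -> List[Token]:
    """Advance every hypothesis by one step and prune the result."""
    expanded: List[Token] = []
    for small_state, large_state, cost in tokens:
        # S601: transition the token in the small graph.
        for olabel, weight, next_small in small_arcs.get(small_state, []):
            new_cost = cost + weight
            new_large = large_state
            if olabel != EPSILON:  # S602: a word was output
                # S603: obtain the needed arc of the large graph.
                lm_weight, new_large = fetch_large_arc(large_state, olabel)
                # S604: transition the token in the large graph.
                new_cost += lm_weight
            expanded.append((next_small, new_large, new_cost))
    if not expanded:
        return []
    # S605: prune the hypotheses as a whole (beam pruning is assumed here).
    best = min(c for _, _, c in expanded)
    return [t for t in expanded if t[2] <= best + beam]
```

Costs here are treated as negative log probabilities, so multiplying transition probabilities corresponds to adding weights.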
- the token transition along the arcs of the small graph is performed on the GPU 320 (by the graph search unit 404) using the small graph on the device memory 321.
- the arc information of the large graph is needed.
- the GPU 320 may calculate in advance the position (address information) in the main memory 311 where the necessary arc is arranged in the large graph.
- the CPU 310 does not need to perform a large graph search process such as a binary search or a hash table lookup, so the load on the CPU 310 can be reduced.
- the CPU 310 and the GPU 320 have a common page table, and when the GPU 320 accesses a page that is not on the device memory 321, the page is moved from the main memory 311 to the device memory 321.
- CUDA registered trademark
- the page can be moved from the main memory 311 to the device memory 321 by a driver of the GPU 320.
- the necessary arc information calculated in advance on the GPU 320 side is sent from the GPU 320 to the CPU 310.
- a list of necessary arc position information calculated on the GPU 320 side (for example, an arc array index on the main memory 311 or an arc address) is sent to the CPU 310.
- the necessary arc is copied to the device memory 321 based on the received list.
- Figure 7 illustrates the latter method.
- the GPU 320 sends a list of necessary arcs to the CPU 310 side.
- the GPU 320 transmits the arc list {1, 5, 7, 11, 19} and the arc ID to the CPU 310.
- the CPU 310 takes out the five arcs {1, 5, 7, 11, 19} from the large graph stored in the main memory 311 based on the received list, arranges them, returns them to the GPU 320 side, and copies them to the device memory 321.
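The exchange illustrated in FIG. 7 can be sketched as a simple gather on the CPU side. This is a minimal sketch only: the `Arc` record, the `gather_arcs` function, and the 20-arc toy graph are hypothetical names and data introduced for illustration, and the actual transfer between the main memory 311 and the device memory 321 is not modeled.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    """Hypothetical arc record: the four fields named in the description."""
    ilabel: int
    olabel: int
    weight: float
    next_state: int


def gather_arcs(large_graph: List[Arc], requested: List[int]) -> List[Arc]:
    """CPU side: pick the requested arcs out of the large graph and pack
    them contiguously, ready to be copied back to the device memory."""
    return [large_graph[i] for i in requested]


# Toy large graph of 20 arcs standing in for the graph in main memory (assumption).
large_graph = [Arc(i, i, 0.0, i + 1) for i in range(20)]

# Arc list computed on the GPU side, as in the FIG. 7 example.
packed = gather_arcs(large_graph, [1, 5, 7, 11, 19])
```

Because the GPU side has already computed the positions, the CPU side performs no search at all; it only indexes into the arc array.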
- F-4 Modification
- F-4-1. Modification example in which the GPU memory is provided with a cache of large graph arcs
- Generally, communication between the CPU 310 and the GPU 320 has a higher latency than normal memory access. Therefore, arcs of the large graph are saved (that is, cached) in the device memory 321 on the GPU 320 side to improve the processing speed. Because references to the same data often recur in the large graph owing to the nature of graph search in speech recognition, this method is considered to be effective.
- FIG. 8 illustrates a mechanism for caching an arc of a large graph in the device memory 321 on the GPU 320 side.
- the device memory 321 has a data structure in which the ID of the source state (the state before the transition) and the input label are input and the arc is returned.
- FIG. 9 shows a configuration example of a speech recognition system 300 including a large graph cache.
- a signal processing unit 401, a feature amount extraction unit 402, and a recognition result output processing unit 405 are arranged in the CPU 310.
- the GPU 320 includes an HMM score calculation unit 403 and a graph search unit 404. These functional modules indicated by reference numerals 401 to 405 may actually be software programs executed by the CPU 310 or GPU 320.
- the voice input unit 441 is composed of a microphone or the like and picks up a voice signal.
- the signal processing unit 401 performs predetermined digital processing on the audio signal received by the audio input unit 441.
- the feature amount extraction unit 402 extracts the feature amount of voice by using a known technique such as Fourier transform or a mel filter bank.
- the HMM score calculation unit 403 receives the information on the feature amount of the voice, and calculates the score of each HMM state by using the acoustic model 431 on the GPU memory (device memory) 321. GMM and DNN are used for HMM.
- the graph search unit 404 receives the HMM state scores and performs search processing by on-the-fly synthesis using the small graph (smaller WFST model) 432 and the large graph cache 901 on the GPU memory (device memory) 321, and the large graph (larger WFST model) 421 on the main memory 311.
- the graph search unit 404 first transitions tokens in the small graph. When a word is output from the small graph in the transition, the ID of the source state (the state before the transition) and the input label are input, the arc information of the large graph is acquired from the large graph cache 901, and the token of the large graph is transitioned. When a cache miss occurs in the large graph cache 901, the graph search unit 404 copies the arc information of the large graph from the main memory 311 to the device memory 321 of the GPU 320, caches it in the large graph cache 901, and performs the token transition of the large graph.
- the graph search unit 404 finally outputs the character string of the voice recognition result.
- the character string of the recognition result is sent from the work area 433 on the device memory 321 to the recognition result output processing unit 405 on the CPU 310 side.
- the recognition result output processing unit 405 performs processing for displaying or outputting the recognition result from the output unit 442 including a display and a speaker.
- FIG. 10 shows the overall processing procedure of voice recognition executed by the voice recognition system 300 shown in FIG. 9 in the form of a flowchart.
- the voice data after digital processing by the signal processing unit 401 is divided into, for example, 10-millisecond segments and input to the feature amount extraction unit 402.
- the feature amount extraction unit 402 extracts the feature amount of the voice based on the voice data that has been digitally processed by the signal processing unit 401, using a known technique such as Fourier transform or mel filter bank (step S1002).
- the HMM score calculation unit 403 receives information on the feature amount of the voice, and calculates the score of each HMM state using the acoustic model 431 (step S1004).
- the graph search unit 404 receives the HMM state scores and performs search processing by on-the-fly synthesis using the small graph (smaller WFST model) 432 and the large graph cache 901 on the GPU memory (device memory) 321, and the large graph (larger WFST model) 421 on the main memory 311 (step S1005).
- In step S1005, the graph search unit 404 first transitions tokens in the small graph.
- when a word is output from the small graph in the transition, the ID of the source state (the state before the transition) and the input label are input, the arc information of the large graph is acquired from the large graph cache 901, and the token of the large graph is transitioned.
- when a cache miss occurs, the graph search unit 404 searches the large graph (larger WFST model) 421 on the main memory 311, and acquires the target arc. Then, after transitioning all the hypotheses, the graph search unit 404 performs pruning of the hypotheses as a whole. Details of the graph search processing are given later (see FIG. 11).
- Until the end of the input voice is reached (Yes in step S1001), the processes of steps S1002 to S1005 are repeatedly executed for the voice data divided, for example, every 10 milliseconds.
- When the end of the input voice is reached (No in step S1001), the character string of the voice recognition result by the graph search unit 404 is copied from the work area 433 on the device memory 321 to the main memory 311 on the CPU 310 side (step S1006).
- the recognition result output processing unit 405 performs processing for displaying or outputting the recognition result from the output unit 442 including a display and a speaker (step S1007).
- FIG. 11 shows a detailed processing procedure of the graph search process executed in step S1005 in the flowchart shown in FIG. 10 in the form of a flowchart.
- the graph search unit 404 transitions tokens in the small graph 432 (smaller WFST model) on the device memory 321 (step S1101).
- when a word is output from the transitioned arc (Yes in step S1102), the source state (state before transition) and the input label are input, and it is checked whether or not the desired arc information of the large graph is in the large graph cache 901 (step S1103).
- if the desired arc information of the large graph is present in the large graph cache 901, that is, if there is a cache hit (Yes in step S1103), the arc information of the large graph is acquired from the large graph cache 901.
- the state transition of the token of the large graph (large WFST model) is performed (step S1104).
- on a cache miss (No in step S1103), the graph search unit 404 calculates the address on the main memory 311 at which the necessary arc information of the large graph is stored, copies the arc information of the large graph from that address on the main memory 311 to the device memory 321 on the GPU 320 side (step S1106), caches the arc information of the large graph in the large graph cache 901 (step S1107), and transitions the token of the large graph on the device memory 321 (step S1104).
- the graph search unit 404 prunes the hypotheses as a whole (step S1105) and ends this processing.
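The cache check in steps S1103 to S1107 can be sketched as a read-through cache keyed by the pair of source state ID and input label. This is a minimal sketch under stated assumptions: `LargeGraphCache`, `fetch_from_main_memory`, and the hit and miss counters are hypothetical names introduced here, and a Python dictionary stands in for whatever data structure is actually held on the device memory 321.

```python
from typing import Callable, Dict, Tuple

Arc = Tuple[int, int, float, int]  # (input label, output label, weight, next state)
Key = Tuple[int, int]              # (source state ID, input label)


class LargeGraphCache:
    """Read-through cache: returns the arc for (state, input label)."""

    def __init__(self, fetch: Callable[[int, int], Arc]) -> None:
        self._fetch = fetch              # slow path, e.g. a copy from main memory
        self._arcs: Dict[Key, Arc] = {}
        self.hits = 0
        self.misses = 0

    def get(self, state: int, ilabel: int) -> Arc:
        key = (state, ilabel)
        arc = self._arcs.get(key)
        if arc is not None:              # cache hit (Yes in step S1103)
            self.hits += 1
            return arc
        self.misses += 1                 # cache miss (No in step S1103)
        arc = self._fetch(state, ilabel) # corresponds to the copy in step S1106
        self._arcs[key] = arc            # corresponds to caching in step S1107
        return arc


def fetch_from_main_memory(state: int, ilabel: int) -> Arc:
    """Hypothetical stand-in for reading the arc from the large graph."""
    return (ilabel, ilabel, 0.5, state + 1)


cache = LargeGraphCache(fetch_from_main_memory)
first = cache.get(3, 7)   # miss: fetched and stored
second = cache.get(3, 7)  # hit: served from the cache
```

The sketch has no eviction; a real cache in limited device memory would also need a capacity limit and a replacement policy, which the description does not specify.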
- F-4-2 Modified Example of Expanding Large Graph in Other than Main Memory
- the large graph may be expanded to an external storage device such as an SSD, a memory of another system over the network, a memory of another device in the same system 300, or the like.
- the search process of a large-scale graph can be executed at high speed by a manycore arithmetic unit having a limited memory capacity.
- a speech recognition system to which the technique according to the first embodiment is applied can execute, in a hybrid environment equipped with a CPU and a GPU (or another many-core arithmetic unit), a large-scale graph search process using on-the-fly synthesis without overloading the CPU. This brings the following advantages.
- the technique described as the first embodiment can be applied to various cases in which the graph search processing capable of on-the-fly composition is applied to the hybrid environment.
- a WFST corresponding to a large vocabulary for speech recognition processing in which WFST data is arranged on a disk has a size of several tens of GB to several hundreds of GB, and a WFST search cannot be operated unless the system has a large memory capacity. Therefore, a method of arranging all the WFST data on the disk and performing the search process has been proposed (for example, see Non-Patent Document 4). Specifically, a node file (nodes-file) describing the position of an arc extending from each state (node) of the WFST, an arc file (arcs-file) describing arc information, and a word corresponding to an output symbol are described.
- the information of an arbitrary arc can be acquired by accessing the disk twice. Further, by holding (that is, caching) the arc once read from the disk in the memory for a while, the number of times the disk is accessed can be reduced and the increase in processing time due to the disk access can be suppressed.
- the offset data of the WFST data corresponds to the "node file" which is the information of the position of the arc extending from each of the above nodes.
- the "real-time processing" referred to here means, for example, that one second of voice is processed within one second.
- in voice recognition in a real service such as a voice agent, it is important to return a response to the user in real time.
- the disk IOPS (the number of I/O accesses that the disk can process per second)
- the number of times the disk is accessed is reduced and the processing can be performed at high speed, but this conflicts with the reduction of the memory usage.
- the data to be placed in the memory is carefully selected (that is, only useful data is placed in the memory)
- high-speed voice recognition processing is realized while suppressing the memory usage.
- FIG. 15 schematically shows the configuration of a voice recognition system 1500 to which the technique proposed as the second embodiment is applied.
- the illustrated voice recognition system 1500 includes a CPU 1510, a main storage device (hereinafter referred to as “memory”) 1520, and an auxiliary storage device (hereinafter referred to as “disk”) 1530.
- the voice recognition system 1500 divides the WFST model into a large model and a small model, and combines them by on-the-fly synthesis at the time of executing voice recognition.
- a language model having a large size is divided into large and small.
- the smaller language model considering the connection between the two words is combined with the acoustic model and the pronunciation dictionary to create the smaller WFST model.
- the smaller WFST model (small graph) is located in memory 1520, which has a relatively small capacity.
- the language model considering the connection of four words is the larger WFST model.
- the larger WFST model (large graph) is placed on disk 1530.
- the CPU 1510 performs a state transition of the smaller WFST model on the memory 1520, and performs a likelihood correction using the larger WFST model on the disk 1530.
- the CPU 1510 basically transitions tokens with the small WFST model arranged in the memory 1520. When a word is output from the transitioned arc and it becomes necessary to perform a state transition of the large WFST model, the CPU 1510 accesses the disk 1530, copies only the data necessary for processing in the large WFST model (specifically, the input label, output label, weight, and ID of the arc transition destination state) to the memory 1520, and performs all search processing.
- the WFST data (arc) used for voice recognition has a large bias in access frequency.
- the larger language model WFST is accessed only when a word is output by the smaller WFST, and the access frequency is low. That is, in on-the-fly synthesis, the infrequently accessed portion that occupies most of the WFST data can be separated into the larger language model WFST. Therefore, by allocating the smaller WFST model with high access frequency to the memory 1520 capable of high-speed processing and arranging the language model WFST with low access frequency and large size on the disk 1530, the number of accesses to the disk 1530 is reduced, and it is possible to realize a high-speed WFST search while reducing the memory usage amount.
- the search for the language model WFST is a process of extracting the corresponding arc from the state ID and label (input label) of the source state.
- the data of each arc of the language model is arranged on the disk 1530 as an array.
- the arc data includes the input label, the output label, the weight, and the state ID of the arc transition destination.
- the array of arcs arranged on the disk 1530 is also referred to as “arc array”.
- the arcs are arranged on the disk 1530 in the order of the state IDs in the source state, such as the arc in the state 0, the arc in the state 1, the arc in the state 2, and so on. Further, there are a plurality of arcs extending from each state, and arcs having the same source state are sorted by label (input label).
- the arcs having the same source state are arranged on the arc array of the disk 1530 in the order of labels (input labels). Binary search is possible because the arcs in the same state are sorted by the input label.
- an “arc index (Arc Indexes)” storing the start position (offset) on the arc array of the arcs in each state is arranged.
- the arc index arranges the start positions of the arcs of each state on the arc array sorted in the order of the state IDs. For example, when the arc extending from the state 5 starts from the 10th position in the arc array, the 5th element of the array of arc indexes becomes 10.
- an “input label array” in which labels (input labels) corresponding to the arcs on the arc array are arranged in the same manner as the arc array is also arranged.
- the arcs are sorted and arranged in the order of the state ID of the source state and the arcs having the same source state in the order of the label (input label). Therefore, even on the input label array, the input label of each arc is arranged in the order of the arcs arranged on the arc array. For example, if the label of the 10th arc on the arc array is 3, then the 10th element of the input label array is 3.
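The layout described above (arc array, arc index, and input label array) can be built from an unsorted arc list as follows. This is a minimal sketch: `build_access_data` and the `RawArc` tuple layout are hypothetical, and the extra end-of-array entry appended to the arc index is an assumption made here so that the arc range of the last state is also delimited.

```python
from typing import List, Tuple

# (source state ID, input label, output label, weight, next state)
RawArc = Tuple[int, int, int, float, int]


def build_access_data(arcs: List[RawArc], num_states: int):
    """Sort arcs by (source state ID, input label) and derive the two
    lookup arrays that are kept in memory."""
    ordered = sorted(arcs, key=lambda a: (a[0], a[1]))

    # arc_index[s] = start position of the arcs of state s on the arc array.
    # One extra trailing entry (an assumption of this sketch) marks the end.
    arc_index = [0] * (num_states + 1)
    for src, *_ in ordered:
        arc_index[src + 1] += 1
    for s in range(num_states):
        arc_index[s + 1] += arc_index[s]

    # input_labels[p] = input label of the arc at position p on the arc array.
    input_labels = [a[1] for a in ordered]
    return ordered, arc_index, input_labels


# Toy data: state 0 has input labels 0, 1, 3, 4 and state 1 has 0, 2, 7.
arcs = [
    (1, 7, 70, 0.1, 0), (0, 3, 30, 0.2, 1), (0, 0, 10, 0.3, 1),
    (1, 0, 40, 0.4, 0), (0, 4, 50, 0.5, 0), (1, 2, 60, 0.6, 1),
    (0, 1, 20, 0.7, 1),
]
arc_array, arc_index, input_labels = build_access_data(arcs, num_states=2)
```

Only `arc_index` and `input_labels` would be held in memory; `arc_array` corresponds to the data placed on the disk.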
- the arc index and the input label array arranged in the memory 1520 are used to determine the position on the disk 1530 of the arc corresponding to an arbitrary state ID and input label, without having to access the disk 1530.
- the start position on the arc array of the arcs of the source state of the target arc is specified by the arc index, and then the input label of the target arc is searched for in the input label array starting from the element at that start position. In this way, the position on the arc array located on the disk 1530 can be reached. That is, an arbitrary arc can be acquired with one disk access.
- FIG. 17 shows an arc array on the disk 1530 and an arc index and input label array arranged in the memory 1520 as specific examples of WFST (large) access data.
- the data of each arc is arranged in the order of the state IDs of the source states and the arcs having the same source state are sorted in the order of labels (input labels).
- the arc data shall include the input label, output label, weight, and state ID of the arc transition destination.
- the element written as “A (i) j ” represents the arc data of the j-th input label of the source state whose state ID is “i”.
- the first to fourth elements from the beginning store the data of the arcs whose source state has the state ID "0" and whose input labels are 0, 1, 3, and 4, respectively.
- the fifth to seventh elements store the data of the arcs whose source state has the state ID "1" and whose input labels are 0, 2, and 7, respectively. Binary search is possible because the arcs in the same state are sorted by the input label.
- the arc index 1702 arranged in the memory 1520 stores the start position on the arc array of the arc in each state.
- the arc index 1702 is array type data sorted by state ID. In the example shown in FIG. 17, the elements are 0, 4, 7, 13, 16, 21, ... in the order of the state IDs, and each element stores the start position on the arc array 1701 of the arcs of the corresponding state ID.
- the first element of the arc index 1702 stores 0, indicating the starting position of the arcs of state 0 on the arc array 1701.
- the second element stores 4, indicating the starting position of the arcs of state 1 on the arc array 1701.
- the input label array 1703 arranged in the memory 1520 stores labels (input labels) corresponding to arcs on the arc array 1701 arranged in the same manner as the arc array 1701. Therefore, each element of the input label array 1703 stores the input label of the arc of the element at the same position on the arc array 1701.
- the first to fourth elements from the beginning store the input labels 0, 1, 3, and 4 of each arc extending from the source state whose state ID is "0".
- the fifth to seventh elements respectively store the input labels 0, 2, and 7 possessed by each arc extending from the source state whose state ID is "1".
- when the CPU 1510 searches the language model WFST arranged in the format of the arc array 1701 on the disk 1530, it first refers to the arc index 1702 in the memory 1520 to specify the starting position on the arc array of the arcs of the source state of the target arc. Then, by searching for the input label of the target arc in the input label array 1703 starting from the element at that start position, it is possible to reach the corresponding element on the arc array 1701 arranged on the disk 1530. That is, an arbitrary arc can be acquired with one disk access.
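The in-memory part of this search can be sketched with a bounded binary search. This is a minimal sketch: `find_arc_position` is a hypothetical name, and the arc index is given one trailing entry (the start position of the next state) so that the end of each state's range is known, which is an assumption of this sketch. The values follow the FIG. 17 example for states 0 and 1.

```python
from bisect import bisect_left
from typing import List, Optional


def find_arc_position(arc_index: List[int], input_labels: List[int],
                      state: int, ilabel: int) -> Optional[int]:
    """Return the position on the arc array of the arc leaving `state`
    with input label `ilabel`, or None if there is no such arc.
    Only in-memory arrays are touched; no disk access is needed."""
    begin = arc_index[state]       # start of this state's arcs
    end = arc_index[state + 1]     # start of the next state's arcs
    pos = bisect_left(input_labels, ilabel, begin, end)
    if pos < end and input_labels[pos] == ilabel:
        return pos
    return None


arc_index = [0, 4, 7]                   # states 0 and 1, plus the start of state 2
input_labels = [0, 1, 3, 4, 0, 2, 7]    # labels of states 0 and 1 from FIG. 17

hit = find_arc_position(arc_index, input_labels, state=1, ilabel=2)
miss = find_arc_position(arc_index, input_labels, state=0, ilabel=2)
```

The returned position is the element offset on the arc array, so the single disk access then reads exactly that arc.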
- the method described in the section G-2 has a problem that the size of the WFST (large) access data arranged in the memory 1520 is large. In particular, since the input label array stores the input label corresponding to the arc of each element of the arc array, its data size becomes about one fourth of the arc array arranged on the disk 1530, and there is a concern that the purpose of reducing the usage of the memory 1520 may not be fully achieved.
- the data arranged on the disk 1530 is 16 GB.
- the data allocated in the memory 1520 is 4.4 GB.
- an arc is made up of four pieces of data: an input label, an output label, a weight, and a transition destination state ID. If each piece of data has a size of 4 bytes, the data size of one arc is 16 bytes. Therefore, when the number of arcs is 1 billion, the data size of the arc array is 16 GB.
- the data size of the input label array arranged in the memory 1520 is 4 GB
- the data size of the arc index is 0.4 GB, so the input label array accounts for most of the data arranged in the memory 1520.
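The figures in this example can be checked with a few lines of arithmetic. This is only a worked restatement of the numbers given above, assuming decimal gigabytes (1 GB = 10^9 bytes); the 0.4 GB arc index size is taken from the text as given rather than derived.

```python
GB = 10**9                      # decimal gigabyte (assumption)

FIELD_BYTES = 4                 # size of each field of an arc
FIELDS_PER_ARC = 4              # input label, output label, weight, next state ID
NUM_ARCS = 1_000_000_000        # 1 billion arcs

arc_bytes = FIELD_BYTES * FIELDS_PER_ARC          # bytes per arc
arc_array_bytes = NUM_ARCS * arc_bytes            # on disk
input_label_array_bytes = NUM_ARCS * FIELD_BYTES  # in memory, one label per arc
arc_index_bytes = 400_000_000                     # 0.4 GB, as stated in the text

in_memory_bytes = input_label_array_bytes + arc_index_bytes
label_share = input_label_array_bytes / in_memory_bytes  # fraction held by labels
```

With these numbers the input label array is one fourth the size of the arc array and roughly 91% of the in-memory access data.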
- FIG. 18 shows a specific example of WFST (large) access data for realizing the method proposed in this section.
- the arc array is arranged on the disk 1530, while the arc index and the input label array are arranged in the memory 1520, as in the example shown in FIG.
- the arc array 1801 on the disk 1530 arranges the data of each arc sorted in the order of the state IDs of the source states, with the arcs having the same source state sorted in the order of labels (input labels).
- detailed description of the arc array 1801 is omitted.
- the arc index 1802 arranged in the memory 1520 stores the starting position on the arc array of the arcs in each state.
- the arc index 1802 is array type data sorted by the state ID, as in the example shown in FIG.
- detailed description of the arc index 1802 is also omitted.
- the input label array 1703 stores the labels (input labels) corresponding to the arcs on the arc array 1701 in the same array as the arc array 1701.
- the input label array 1803 stores data for calculating the position of the page on which the target arc is arranged.
- the arc array 1801 is divided into groups of 256 arcs (that is, into pages), and only the leading input label of each group of 256 arcs on the arc array 1801 is stored in the input label array 1803.
- the 256 arcs correspond to 4 KB, or one page.
- since the input label array 1803 stores data for calculating the position of the page on which the target arc is arranged, the page including the target arc can be specified, and then one page of arc data can be read from the disk 1530 into the memory 1520.
- when the CPU 1510 searches the language model WFST arranged in the format of the arc array 1801 on the disk 1530, it first refers to the arc index 1802 in the memory 1520, specifies the start position on the arc array 1801 of the arcs of that state from the element corresponding to the state ID, and calculates the page range in which the target arc can exist. Next, with reference to the input label array 1803 on the memory 1520, the label at the top of each page where the target arc may exist is compared with the input label to identify the page where the target arc exists. Then, the disk 1530 is accessed, one page, that is, the data of 256 arcs, is read into the memory 1520, and the target arc is searched for among the 256 arcs.
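The page-based search described above can be sketched as follows. This is a minimal sketch under stated assumptions: `PagedArcStore`, its method names, and the read counter are hypothetical; a Python list stands in for the disk; the arc index carries one trailing end entry; and the demonstration uses pages of 4 arcs instead of 256 so that the toy data spans several pages. The sketch relies on the observation that every page after the first page of a state's range begins with an arc of that state, so those page-head labels are sorted and can be binary searched.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Arc = Tuple[int, int, float, int]  # (input label, output label, weight, next state)


class PagedArcStore:
    def __init__(self, arcs_on_disk: List[Arc], arc_index: List[int],
                 page_arcs: int = 256) -> None:
        self.disk = arcs_on_disk          # stands in for the arc array on disk
        self.arc_index = arc_index        # in memory; has a trailing end entry
        self.page_arcs = page_arcs
        # In memory: only the first input label of each page is kept.
        self.page_heads = [arcs_on_disk[i][0]
                           for i in range(0, len(arcs_on_disk), page_arcs)]
        self.disk_reads = 0

    def _read_page(self, page: int) -> List[Arc]:
        """Simulated disk access: one call reads one page."""
        self.disk_reads += 1
        start = page * self.page_arcs
        return self.disk[start:start + self.page_arcs]

    def find(self, state: int, ilabel: int) -> Optional[Arc]:
        begin, end = self.arc_index[state], self.arc_index[state + 1]
        if begin == end:
            return None                   # the state has no arcs
        first_page = begin // self.page_arcs
        last_page = (end - 1) // self.page_arcs
        # Pages first_page+1..last_page start with arcs of this state, so their
        # head labels are sorted; take the last page whose head is <= ilabel,
        # falling back to first_page when there is none.
        page = bisect_right(self.page_heads, ilabel,
                            first_page + 1, last_page + 1) - 1
        arcs = self._read_page(page)
        base = page * self.page_arcs
        lo = max(begin, base) - base      # clip the page to this state's arcs
        hi = min(end, base + len(arcs)) - base
        labels = [a[0] for a in arcs[lo:hi]]
        pos = bisect_left(labels, ilabel)
        if pos < len(labels) and labels[pos] == ilabel:
            return arcs[lo + pos]
        return None


# Toy data: output label = 100 * state + input label, to identify arcs.
disk = [(lab, 100 * s + lab, 0.0, 0)
        for s, labs in enumerate([[0, 1, 3, 4], [0, 2, 7], [1, 2, 5, 6, 8, 9]])
        for lab in labs]
store = PagedArcStore(disk, arc_index=[0, 4, 7, 13], page_arcs=4)

a = store.find(2, 6)   # state 2 spans three pages; found on the middle one
b = store.find(2, 1)   # found on the page shared with state 1
c = store.find(1, 3)   # state 1 has no arc with input label 3
```

Each call reads exactly one page, whether or not the arc exists, which matches the single disk access per lookup described in the text.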
- the size of the WFST (large) access data arranged in the memory 1520 can be reduced with almost no change in processing time from the method proposed in section G-2.
- the data amount of the input label array 1803 is 1/256 of the input label array 1703 shown in FIG.
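The page size and the reduction factor can be checked numerically. This is only a worked restatement, assuming the 16-byte arc and the 4 GB per-arc input label array from the example in section G-2, with decimal gigabytes.

```python
ARC_BYTES = 16                 # one arc: four 4-byte fields
ARCS_PER_PAGE = 256            # arcs read by one disk access

page_bytes = ARC_BYTES * ARCS_PER_PAGE                    # size of one page

full_label_array_bytes = 4 * 10**9                        # one label per arc
paged_label_array_bytes = full_label_array_bytes // ARCS_PER_PAGE  # one per page
```

Under these assumptions one page is 4096 bytes (4 KB), and the input label array shrinks from 4 GB to about 15.6 MB.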
- the number of disk accesses can be further reduced by rearranging the arc array 1801 so as to increase the useful arcs as much as possible among the 256 arcs read by one disk access.
- Increasing the number of useful arcs as much as possible means putting the arcs that are likely to be used at the same time on the same page (a group of 256 arcs).
- arcs extending from the same state (node) need to be collected and sorted in the order of labels when arranged in the arc array, so arcs that are likely to be used at the same time need to be rearranged collectively (that is, the state IDs need to be reassigned).
- as a method of rearranging arcs, a method based on the structure of the WFST can be mentioned. Specifically, it is a method of collectively arranging arcs extending from connected states (nodes) on the WFST as close to each other as possible.
- the arc of the language model may be pre-read. According to this method, it is possible to predict the arc that is likely to be read from the disk 1530 and read it into the memory 1520 in advance to reduce the processing time for the latency of the disk access. If the prediction is wrong, useless disk access will occur, but if the IOPS of the disk 1530 does not become a bottleneck, it is effective in reducing the processing time.
- the predictor of the access pattern of the language model can be learned by using a sequence model such as an HMM or an RNN (Recurrent Neural Network).
- a pre-trained model may be used, or learning may be performed online during the processing of the speech recognition system 1500.
- FIG. 19 shows a specific functional configuration example of the voice recognition system 1500 to which the technique proposed as the second embodiment is applied.
- a signal processing unit 1901, a feature amount extraction unit 1902, an HMM score calculation unit 1903, a WFST search unit 1904, and a recognition result output unit 1905 are arranged in the CPU 1900.
- These functional modules indicated by reference numerals 1901 to 1905 may actually be software programs executed by the CPU 1900.
- a many-core arithmetic unit such as a GPU may be used instead of the CPU, or a combination of the CPU and the GPU may realize the functional modules indicated by reference numerals 1901 to 1905.
- the voice input unit 1931 is composed of a microphone or the like and picks up a voice signal.
- the signal processing unit 1901 performs predetermined digital processing on the voice signal received by the voice input unit 1931.
- the feature amount extraction unit 1902 extracts a feature amount of voice using a known technique such as Fourier transform or mel filter bank.
- the HMM score calculation unit 1903 receives information on the feature amount of the voice, and calculates the score of each HMM state by using the acoustic model 1911 in the RAM 1910. GMM and DNN are used for HMM.
- the WFST search unit 1904 receives the HMM state scores and performs search processing by on-the-fly synthesis using the small graph (smaller WFST model) 1912 on the RAM (Random Access Memory) 1910 serving as the memory described above and the large graph (larger WFST model) 1921 on the SSD 1920 serving as the disk described above.
- a large graph (larger WFST model) 1921 on the SSD 1920 is an arc array.
- the data of each arc is arranged in the order of the state ID of the source state and the arcs having the same source state are sorted in the order of the label (input label) (described above).
- the WFST search unit 1904 can utilize the arc index and the input label array stored as the WFST model (large) access data 1914 in the RAM 1910 to access the arc array in the SSD 1920 at high speed.
- the language model arc cache 1913 on the RAM 1910 stores arcs once read from the SSD 1920 in page units. Further, in the work area 1915 in the RAM 1910, data such as a token at the time of searching the WFST is temporarily stored.
- signal processing or WFST search processing is repeated until there is no input of voice data from the voice input unit 1931 (in other words, until the end of the utterance). Then, when there is no input of voice data, the WFST search unit 1904 outputs the recognition result extracted from the probable hypothesis to the recognition result output unit 1905. Then, the recognition result output unit 1905 performs a process for displaying or outputting the recognition result from the output unit 1932 including a display, a speaker, and the like.
- the voice recognition system 1500 may be configured as a device including at least one of the voice input unit 1931 and the output unit 1932.
- the CPU 1900 and GPU 320 may be mounted in a server on the cloud, and the voice input unit 441 and the output unit 442 may be configured as a voice agent device (described later).
- FIG. 20 shows, in the form of a flowchart, the overall processing procedure of voice recognition executed by the voice recognition system 1500 shown in FIG.
- the voice data after digital processing by the signal processing unit 1901 is divided into, for example, 10-millisecond segments and input to the feature amount extraction unit 1902.
- the feature amount extraction unit 1902 extracts the feature amount of the voice based on the voice data that has been digitally processed by the signal processing unit 1901, using a known technique such as Fourier transform or mel filter bank (step S2002), and the feature amount data is input to the HMM score calculation unit 1903.
- the HMM score calculation unit 1903 receives the information of the voice feature amount, and calculates the score of each HMM state using the acoustic model 1911 (step S2003).
- the WFST search unit 1904 receives the HMM state scores and performs search processing by on-the-fly synthesis using the small graph (smaller WFST model) 1912 on the RAM 1910 and the large graph (larger WFST model) 1921 on the SSD 1920 (step S2004).
- In step S2004, the WFST search unit 1904 first transitions tokens in the small graph 1912 on the RAM 1910.
- the transition in the large graph 1921 on the SSD 1920 is performed.
- the WFST search unit 1904 identifies the page on which the necessary arc is arranged, using the arc index and the input label array stored as the WFST model (large) access data 1914 in the RAM 1910.
- the WFST search unit 1904 reads a page including the corresponding arc from the language model arc cache 1913 if it exists in the language model arc cache 1913, and otherwise reads it from the large graph 1921 on the SSD 1920.
- the WFST search unit 1904 searches for a target arc from the read page, and uses the data of the arc to perform the transition of the token on the large graph.
- steps S2002 to S2004 are repeatedly performed on the voice data divided every 10 milliseconds, for example.
- the WFST search unit 1904 selects a hypothesis that seems likely from the tokens in the work area 1915 of the RAM 1910 and outputs it as a recognition result. Then, the recognition result output unit 1905 performs a process for displaying or outputting the recognition result from the output unit 1932 including a display and a speaker (step S2005).
- FIG. 21 shows an example of a detailed processing procedure of the WFST search process executed in step S2004 in the flowchart shown in FIG. 20 in the form of a flowchart.
- the processing procedure shown in FIG. 21 shall follow the disk access method (see FIG. 17) described in Section G-2 described above.
- the WFST search unit 1904 transitions tokens using the small graph 1912 (smaller WFST model) on the RAM 1910 (step S2101).
- If no word is output from the transitioned arc (No in step S2102), the WFST search unit 1904 prunes the hypotheses as a whole (step S2107), and ends this processing.
- when a word is output from the transitioned arc (Yes in step S2102), the WFST search unit 1904 uses the WFST (large) access data 1914 to specify the position of the target arc on the WFST model (large) 1921 (step S2103).
- the WFST search unit 1904 first refers to the arc index in the WFST (large) access data 1914 to specify the start position on the arc array of the state ID of the source state of the target arc.
- the WFST search unit 1904 searches for the input label of the target arc in the input label array in the WFST (large) access data 1914, starting from the element at that start position, and thereby identifies the position of the target arc on the arc array.
- the WFST search unit 1904 checks whether or not the corresponding page (that is, the page including the data of the target arc) exists in the language model arc cache 1913 (step S2104).
- If the corresponding page already exists in the language model arc cache 1913 (Yes in step S2104), the WFST search unit 1904 reads the data of the target arc from the language model arc cache 1913 (step S2105) and performs the transition of the token on the large graph (step S2106).
- if the corresponding page does not exist in the language model arc cache 1913 (No in step S2104), the WFST search unit 1904 reads, from the WFST model (large) 1921 arranged in the SSD 1920, that is, the arc array, the page including the position specified in step S2103 (step S2108) and writes it to the language model arc cache 1913 (step S2109). Then, the WFST search unit 1904 searches for the target arc in the read page, and uses the arc data to perform a token transition on the large graph (step S2106).
- then, the hypotheses are pruned as a whole (step S2107), and this process is completed.
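The cache handling in steps S2104, S2105, S2108, and S2109 can be sketched as a page cache in front of the disk. This is a minimal sketch: `ArcPageCache`, `read_page_from_ssd`, and the counters are hypothetical names, and the least-recently-used eviction with a fixed capacity is an assumption made here, since the description only says that pages read once are kept in memory.

```python
from collections import OrderedDict
from typing import Callable, List


class ArcPageCache:
    """Page-granular cache with LRU eviction (eviction policy is an assumption)."""

    def __init__(self, read_page: Callable[[int], List[int]], capacity: int) -> None:
        self._read_page = read_page      # slow path: read one page from disk
        self._capacity = capacity        # maximum number of cached pages
        self._pages: "OrderedDict[int, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, page: int) -> List[int]:
        if page in self._pages:          # Yes in step S2104
            self._pages.move_to_end(page)   # mark as most recently used
            self.hits += 1
            return self._pages[page]
        self.misses += 1                 # No in step S2104
        data = self._read_page(page)     # corresponds to step S2108
        self._pages[page] = data         # corresponds to step S2109
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)  # evict the least recently used page
        return data


reads: List[int] = []


def read_page_from_ssd(page: int) -> List[int]:
    """Hypothetical stand-in for reading 256 arcs from the arc array on disk."""
    reads.append(page)
    return list(range(page * 256, (page + 1) * 256))


cache = ArcPageCache(read_page_from_ssd, capacity=2)
for p in [0, 1, 0, 2, 1]:
    cache.get(p)
```

In this access sequence only the second request for page 0 is a hit; page 1 is evicted when page 2 arrives and has to be read again.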
- FIG. 22 shows another example of the detailed processing procedure of the WFST search process executed in step S2004 in the flowchart shown in FIG. 20 in the form of a flowchart.
- the processing procedure shown in FIG. 22 is in accordance with the disk access method (see FIG. 18) described in the above section G-3.
- the WFST search unit 1904 transitions tokens in the small graph 1912 (smaller WFST model) on the RAM 1910 (step S2201).
- If no word is output from the transitioned arc (No in step S2202), the WFST search unit 1904 prunes the hypotheses as a whole (step S2208) and ends this processing.
- the WFST search unit 1904 uses the WFST (large) access data 1914 to specify the page in which the target arc is arranged on the WFST model (large) 1921 (step S2203).
- the WFST search unit 1904 first refers to the arc index in the WFST (large) access data 1914, identifies the start position on the arc array of the arcs of that state from the element corresponding to the state ID, and calculates the page range in which the arcs can exist. Next, by referring to the input label array in the WFST (large) access data 1914, it compares the leading label of each page in which the target arc may exist with the input label, and identifies the page in which the target arc exists.
- the WFST search unit 1904 checks whether the corresponding page exists in the language model arc cache 1913 (step S2204).
- the WFST search unit 1904 reads the data of the corresponding page, that is, 256 arcs, from the language model arc cache 1913 (step S2205), and searches for the target arc among the 256 arcs (step S2206).
- the WFST search unit 1904 reads the page including the position specified in step S2203 from the WFST model (large) 1921 arranged in the SSD 1920, that is, the arc array (step S2209), and writes it into the language model arc cache 1913 (step S2210).
- the WFST search unit 1904 searches for the target arc from the read page (step S2206), and transitions the token on the large graph (step S2207).
- the WFST search unit 1904 prunes the hypotheses as a whole (step S2208), and ends this process.
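The cache check and page read of steps S2204, S2205, S2209 and S2210 can be sketched as a page-granular cache in front of a seekable file. This is a minimal sketch under stated assumptions: the 20-byte arc record layout, the least-recently-used eviction policy, and the class name `ArcPageCache` are hypothetical, and an in-memory `io.BytesIO` object stands in for the arc array on the SSD.

```python
import io
import struct
from collections import OrderedDict

ARCS_PER_PAGE = 256                     # page size used in the text
ARC_FORMAT = "<iiifi"                   # hypothetical record: state, in, out, weight, next
ARC_SIZE = struct.calcsize(ARC_FORMAT)  # 20 bytes with this layout


class ArcPageCache:
    """Holds pages of arcs in memory; a miss reads one page from the disk file."""

    def __init__(self, disk, capacity_pages):
        self.disk = disk
        self.capacity = capacity_pages
        self.pages = OrderedDict()      # page number -> list of arc tuples
        self.disk_reads = 0

    def get_page(self, page_no):
        if page_no in self.pages:       # page already cached
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        # Cache miss: read one page of the arc array from the disk.
        self.disk.seek(page_no * ARCS_PER_PAGE * ARC_SIZE)
        raw = self.disk.read(ARCS_PER_PAGE * ARC_SIZE)
        arcs = [struct.unpack_from(ARC_FORMAT, raw, off)
                for off in range(0, len(raw), ARC_SIZE)]
        self.disk_reads += 1
        # Write the page into the cache, evicting the least recently
        # used page when the assumed capacity is exceeded.
        self.pages[page_no] = arcs
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)
        return arcs


# A fake "disk" holding 600 arcs, so the last page is only partly filled.
disk = io.BytesIO(b"".join(
    struct.pack(ARC_FORMAT, i // 10, i % 10, i, 0.0, 0) for i in range(600)))
cache = ArcPageCache(disk, capacity_pages=2)
```

Calling `cache.get_page(0)` twice performs a single disk read; the second call is served from memory, which corresponds to the Yes branch of step S2204.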
- the detailed processing procedure executed in step S2203 of the above flowchart is shown in the form of a flowchart.
- the WFST search unit 1904 first refers to the element corresponding to the state ID of the target arc and the next element in the arc index in the WFST (large) access data 1914, and calculates the page range in which the target arc may exist (step S2301).
- when the state ID of the target arc is “0”, the first element “0” and the second element (that is, the state ID “4”) “4” of the arc index are referred to. Since the arcs extending from the source state with the state ID “0” are within the 1st to 256th range, it can be specified that the target arc exists on page 0.
- the page range in which the target arc can exist spans multiple pages.
- when the start position on the arc array for the state ID of the target arc is the Nth, the first arc extending from the source state exists on page [N/256] (where [X] denotes the largest integer less than or equal to the real number X).
- the state ID of the source state of the target arc is the 10th of the arc index
- the 10th element is 300
- the following 11th element is 900
- the WFST search unit 1904 refers to the input label array in the WFST (large) access data 1914, and identifies the page in which the target arc exists by comparing the leading label of each page in the page range calculated in the preceding step S2301 with the input label of the target arc (step S2302).
- the page can be specified by comparing the leading label of each page with the input label of the target arc.
- the input label of the target arc is 100
- the page range in which the target arc can exist is between the first page and the third page
- the leading labels of the first page, the second page, and the third page in the input label array are 300, 50, and 150, respectively. The leading label of the first page is outside the range of input labels of the target arc's state and is an input label of the preceding state. Therefore, since the input label of the target arc lies between the leading label of the second page and the leading label of the third page, it can be specified that the target arc exists on the second page.
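The worked example above (arc index elements 300 and 900, leading labels 300, 50 and 150, input label 100) can be reproduced with a short Python sketch of steps S2301 and S2302. The function names and the dictionary representation of the access data are assumptions for illustration; only the values quoted in the text are used.

```python
ARCS_PER_PAGE = 256

# Values quoted in the text: the element for the target state is 300 and the
# following element is 900; the leading labels of pages 1, 2, 3 are 300, 50, 150.
# Dictionaries are used so that only the quoted entries need to be listed.
ARC_INDEX = {10: 300, 11: 900}
PAGE_HEAD_LABELS = {1: 300, 2: 50, 3: 150}


def page_range(arc_index, state_id):
    """Step S2301: first and last page that can hold arcs of the state."""
    start, end = arc_index[state_id], arc_index[state_id + 1]
    if start == end:                    # the state has no outgoing arcs
        return None
    return start // ARCS_PER_PAGE, (end - 1) // ARCS_PER_PAGE


def identify_page(arc_index, page_head_labels, state_id, input_label):
    """Step S2302: pick the page by comparing page leading labels."""
    pages = page_range(arc_index, state_id)
    if pages is None:
        return None
    first, last = pages
    start = arc_index[state_id]
    target = first
    for page in range(first, last + 1):
        if page * ARCS_PER_PAGE < start:
            # The leading label of this page belongs to the previous state,
            # so it says nothing about the target state's labels.
            continue
        if page_head_labels[page] <= input_label:
            target = page
        else:
            break
    return target
```

With these inputs `page_range(ARC_INDEX, 10)` gives pages 1 to 3 and `identify_page(ARC_INDEX, PAGE_HEAD_LABELS, 10, 100)` gives page 2, in agreement with the example in the text.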
- the WFST search unit 1904 reads the specified page from the large graph 1921 on the SSD 1920, that is, the arc array (step S2303).
- the WFST search unit 1904 searches for the target arc using the input label from the 256 arcs of the page read from the arc array on the SSD 1920 (step S2304).
- Each read arc has input label information (see, for example, FIG. 14).
- since the arc index is referred to in step S2301, the range of arcs extending from the source state with the state ID of the target arc is known among the 256 arcs. That is, the difference between the element corresponding to the state ID of the target arc and the next element is the number of arcs extending from that state. Therefore, by comparing the input labels within that range, the target arc can be uniquely specified (or it can be determined that the target arc does not exist).
- the WFST search unit 1904 checks whether or not the target arc exists (step S2305). If the target arc exists on the page read from SSD 1920 (Yes in step S2305), the WFST search unit 1904 ends this process.
- when the target arc does not exist in the read page (No in step S2305), the WFST search unit 1904 transitions to the backoff state. Specifically, 0 is set as the input label (step S2306), the process returns to step S2301, and the same processing as above is repeated. Label 0 indicates an arc that makes a backoff transition.
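The in-page search of steps S2304 and S2305 and the retry with label 0 of step S2306 can be sketched as follows. This is an illustrative sketch only: the page size is reduced to 4 arcs so that the toy data fits on a few lines, the arc tuples and function names are hypothetical, and `read_page` merely slices a Python list instead of reading from the SSD.

```python
ARCS_PER_PAGE = 4      # reduced from 256 so the toy data stays small
BACKOFF_LABEL = 0      # label 0 marks a backoff transition

# Hypothetical arcs (source_state, input_label, next_state), sorted by state and label.
ARC_ARRAY = [
    (0, 0, 2), (0, 2, 1), (0, 5, 1), (0, 6, 1), (0, 8, 1),
    (1, 0, 2), (1, 3, 0),
    (2, 2, 0), (2, 3, 1),
]
ARC_INDEX = [0, 5, 7, 9]   # start position per state, plus an assumed end sentinel
PAGE_HEAD_LABELS = [ARC_ARRAY[p][1] for p in range(0, len(ARC_ARRAY), ARCS_PER_PAGE)]


def read_page(page_no):
    """Stand-in for reading one page of the arc array from the disk."""
    lo = page_no * ARCS_PER_PAGE
    return ARC_ARRAY[lo:lo + ARCS_PER_PAGE]


def search_state(state_id, input_label):
    """Identify the page, read it, and search only the arcs of the state."""
    start, end = ARC_INDEX[state_id], ARC_INDEX[state_id + 1]
    if start == end:
        return None
    first, last = start // ARCS_PER_PAGE, (end - 1) // ARCS_PER_PAGE
    page_no = first
    for p in range(first, last + 1):
        # Ignore a leading label that belongs to the previous state.
        if p * ARCS_PER_PAGE >= start and PAGE_HEAD_LABELS[p] <= input_label:
            page_no = p
    base = page_no * ARCS_PER_PAGE
    for offset, arc in enumerate(read_page(page_no)):
        # Compare labels only inside [start, end), the range of this state.
        if start <= base + offset < end and arc[1] == input_label:
            return arc
    return None


def lookup_with_backoff(state_id, input_label):
    """Retry with label 0 when the target arc does not exist."""
    arc = search_state(state_id, input_label)
    if arc is None and input_label != BACKOFF_LABEL:
        arc = search_state(state_id, BACKOFF_LABEL)
    return arc
```

For example, `lookup_with_backoff(1, 8)` does not match the arc with label 8 that shares the page but belongs to state 0, and instead returns the backoff arc of state 1.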
- FIG. 24 shows a specific functional configuration example of a voice recognition system 1500 having an arc pre-reading function.
- a signal processing unit 2401, a feature amount extraction unit 2402, an HMM score calculation unit 2403, a WFST search unit 2404, and a recognition result output unit 2405 are arranged in the CPU 2400.
- These functional modules indicated by reference numerals 2401 to 2405 may actually be software programs executed by the CPU 2400.
- since each functional module indicated by reference numerals 2401 to 2405 basically has the same function or role as the functional module of the same name in the voice recognition system 1500 shown in FIG. 19, a detailed description is omitted here.
- the RAM 2410 corresponds to the above-mentioned memory
- the SSD 2420 corresponds to the above-mentioned disk.
- the RAM 2410 holds an acoustic model 2411 used for score calculation of the HMM states, a small graph (smaller WFST model) 2412, a language model arc cache 2413 in which arcs once read from the SSD 2420 are stored in page units, and WFST model (large) access data 2414 including an arc index and an input label array.
- a large graph (larger WFST model) 2421 is arranged on the SSD 2420.
- the language model access pattern model 2416 used for pre-reading the arc of the language model is further arranged on the RAM 2410.
- the pre-reading function of the language model arc is described below.
- the arc is read from the SSD 2420 into the language model arc cache 2413 in the RAM 2410 in advance before it is actually needed.
- the WFST search unit 2404 (or another functional module, not shown, executed by the CPU 2400) uses the language model access pattern model 2416 arranged in the RAM 2410 to predict the arc that is likely to be required next and pre-reads it.
- the language model access pattern model 2416 may use a sequence model such as a pre-trained HMM or LSTM (Long Short-Term Memory), or may be trained online while the speech recognition system 1500 is operating.
- the language model access pattern model 2416 takes the past arc access pattern (the immediately preceding access or a plurality of preceding accesses) as input, and outputs the arc (or page) that is most likely to be accessed next (or the top N such arcs).
- the pre-loaded arc is placed in the language model arc cache 2413 in RAM 2410.
- pre-loading may be in arc units or page units. If the language model arc cache 2413 is an arc unit, it is pre-read in arc units, and if the cache is in page units, it is pre-read in page units.
- FIGS. 25 and 26 show, in the form of flowcharts, detailed processing procedures of the WFST search processing executed by the WFST search unit 2404 in the speech recognition system 1500.
- arc pre-reading is performed in parallel with the WFST search processing.
- the illustrated processing procedure follows the disk access method (see FIG. 18) described in the above section G-3.
- the WFST search unit 2404 transitions tokens in the small graph 2412 (smaller WFST model) on the RAM 2410 (step S2501).
- If no word is output from the transitioned arc (No in step S2502), the WFST search unit 2404 prunes the hypotheses as a whole (step S2508) and ends this processing.
- Step S2503 is basically carried out according to the processing procedure shown in FIG.
- the WFST search unit 2404 checks whether or not the corresponding page exists in the language model arc cache 2413 (step S2504). If the corresponding page already exists in the language model arc cache 2413 (Yes in step S2504), the WFST search unit 2404 reads the corresponding page from the language model arc cache 2413 (step S2505) and searches for the target arc within that page (step S2506).
- the WFST search unit 2404 reads the page including the position specified in step S2503 from the WFST model (large) 2421 arranged in the SSD 2420, that is, the arc array (step S2509), and writes it into the language model arc cache 2413 (step S2510).
- the WFST search unit 2404 searches for the target arc from the read page (step S2506), and transitions the token on the large graph (step S2507).
- after transitioning all the hypotheses, the WFST search unit 2404 prunes the hypotheses as a whole (step S2508) and ends this process.
- the WFST search unit 2404 (or the functional module for pre-reading executed by the CPU 2400) carries out the arc pre-reading processing in parallel with the processing for identifying the page in which the target arc is arranged (step S2503).
- the WFST search unit 2404 inputs the page access pattern into the language model access pattern model 2416 (step S2511).
- the language model access pattern model 2416 takes the access pattern (one time before or a plurality of times before) of the past arc as an input, and outputs the page that is likely to be accessed next.
- the WFST search unit 2404 checks whether or not the page that is output from the language model access pattern model 2416 as likely to be accessed next exists in the language model arc cache 2413 (step S2512). If the page in question already exists in the language model arc cache 2413 (Yes in step S2512), there is no need for pre-reading, and this processing ends.
- the WFST search unit 2404 pre-reads the page output from the language model access pattern model 2416 in step S2511. That is, the corresponding page is read from the WFST model (large) 2421 arranged in the SSD 2420, that is, the arc array (step S2513), and written into the language model arc cache 2413 (step S2514).
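The pre-reading flow of steps S2511 to S2514 can be sketched in Python as below. Note the substitution: instead of the HMM or LSTM sequence model mentioned in the text, this sketch uses a simple first-order count model (the page most often observed after the previous page), chosen only because it fits in a few lines. The class `PageAccessModel`, the function `prefetch`, and the dictionary used as the cache are all hypothetical, and the sketch runs synchronously rather than in parallel with the search.

```python
from collections import Counter, defaultdict


class PageAccessModel:
    """First-order count model standing in for the access pattern model."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # previous page -> Counter of next pages
        self.prev = None

    def observe(self, page_no):
        """Record one page access so that later predictions can use it."""
        if self.prev is not None:
            self.counts[self.prev][page_no] += 1
        self.prev = page_no

    def predict(self):
        """Return the page most often seen after the last accessed page."""
        followers = self.counts.get(self.prev)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


def prefetch(model, cache, read_page):
    """Pre-read the predicted page unless it is already cached."""
    page_no = model.predict()                # ask the model for the next page
    if page_no is None or page_no in cache:  # nothing to pre-read
        return None
    cache[page_no] = read_page(page_no)      # read from disk, write to cache
    return page_no
```

A caller would invoke `model.observe(page_no)` for every page touched by the search and call `prefetch` between accesses. Running it on a separate thread, as the parallel execution in the text suggests, is left out of this sketch.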
- FIG. 27 shows a functional configuration example of the voice recognition system 2700 that realizes on-the-fly synthesis using a disk in a hybrid environment.
- the voice recognition system 2700 includes a CPU 2710 and a GPU 2720 as processors that execute processes related to the voice recognition process.
- the functional modules of the signal processing unit 2701, the feature amount extraction unit 2702, and the recognition result output unit 2705 are arranged.
- in the GPU 2720, the functional modules of the HMM score calculation unit 2703 and the WFST search unit 2704 are arranged.
- These functional modules shown by reference numbers 2701 to 2705 may actually be software programs executed by the CPU 2710 and the GPU 2720, respectively.
- the SSD 2740 is used as a disk, but the internal memory of the GPU 2720 (hereinafter referred to as “GPU memory”) 2730 is used as the memory.
- the voice input unit 2751 is composed of a microphone or the like, and inputs the collected voice signal to the CPU 2710.
- the signal processing unit 2701 performs predetermined digital processing on the audio signal.
- the feature amount extraction unit 2702 extracts the feature amount of the voice and outputs it to the GPU 2720.
- the HMM score calculation unit 2703 receives the voice feature amount information and calculates the score of each HMM state using the acoustic model 2731 in the GPU memory 2730. Then, the WFST search unit 2704 receives the HMM state scores and performs search processing by on-the-fly synthesis using the small graph (smaller WFST model) 2732 in the GPU memory 2730 and the large graph (larger WFST model) 2741 on the SSD 2740.
- the large graph (larger WFST model) 2741 on SSD2740 is an arc array.
- the WFST search unit 2704 can utilize the arc index and the input label array stored as the WFST model (large) access data 2734 in the GPU memory 2730 to access the arc array in the SSD 2740 at high speed (same as above).
- When the WFST search unit 2704 performs the WFST search process, the arcs once read from the SSD 2740 are stored in page units in the language model arc cache 2733 in the GPU memory 2730. Further, data such as tokens used during the WFST search is temporarily stored in the work area 2735 in the GPU memory 2730.
- when the WFST search unit 2704 performs the WFST search process, the arc pre-reading process is performed in parallel.
- the WFST search unit 2704 inputs a page access pattern into the language model access pattern model 2736 in the GPU memory 2730. Then, the WFST search unit 2704 reads the page that is output from the language model access pattern model 2736 as likely to be accessed next from the WFST model (large) 2741 in the SSD 2740, and writes it into the language model arc cache 2733 in the GPU memory 2730.
- the CPU 2710 and the GPU 2720 repeat the signal processing or the WFST search processing until there is no input of voice data from the voice input unit 2751 (in other words, until the end of the utterance). Then, when there is no input of voice data, the WFST search unit 2704 in the GPU 2720 outputs the recognition result extracted from the likely hypothesis to the recognition result output unit 2705 on the CPU 2710 side. Then, the recognition result output unit 2705 performs a process for displaying or outputting the recognition result from the output unit 2752 including a display, a speaker, or the like.
- FIG. 28 shows, in the form of a flowchart, the overall processing procedure of speech recognition executed by the speech recognition system 2700 shown in FIG.
- the voice data digitally processed by the signal processing unit 2701 is divided, for example, into 10-millisecond segments and input to the feature amount extraction unit 2702.
- the feature amount extraction unit 2702 extracts the feature amount of the voice from the voice data digitally processed by the signal processing unit 2701, using a known technique such as Fourier transform or mel filter bank (step S2802).
- the HMM score calculation unit 2703 receives the voice feature amount information and calculates the score of each HMM state using the acoustic model 2731 in the GPU memory 2730 (step S2804).
- the WFST search unit 2704 receives the HMM state scores and performs search processing by on-the-fly synthesis using the small graph (smaller WFST model) 2732 on the GPU memory 2730, the language model arc cache 2733, and the large graph (larger WFST model) 2741 on the SSD 2740 (step S2805).
- In step S2805, the WFST search unit 2704 first transitions tokens in the small graph.
- when a word is output from the small graph in the transition, the ID of the source state (the state before the transition) and the input label are used as input to acquire the arc information of the large graph from the language model arc cache 2733, and the token transition on the large graph is performed.
- the large graph (larger WFST model) 2741 on the SSD 2740 is searched to read the target arc.
- the WFST search unit 2704 may search the large graph (larger WFST model) 2741 according to the processing procedure shown in FIGS. 25 and 26, for example, and may perform pre-reading of arcs in parallel. Then, the WFST search unit 2704, after transitioning all the hypotheses, prunes the hypotheses as a whole.
- until the end of the input voice is reached (Yes in step S2801), the processes of steps S2802 to S2805 are repeatedly performed on the voice data divided, for example, every 10 milliseconds.
- when the end of the input voice is reached (Yes in step S2801), the character string of the voice recognition result obtained by the WFST search unit 2704 is copied from the work area 2735 on the GPU memory 2730 to the main memory on the CPU 2710 side (step S2806).
- the recognition result output unit 2705 on the CPU 2710 side executes a process for displaying or outputting the recognition result from the output unit 2752 including a display and a speaker (step S2807).
- by arranging the WFST data divided into two in the memory and on the disk and performing on-the-fly synthesis, an increase in processing time is suppressed compared with arranging all the WFST data on the disk, and real-time processing can be realized. This brings the following advantages.
- (A) A large-scale graph search can be executed in a system with a limited memory capacity.
- (B) Even if the WFST model is placed on the disk, the graph search process can be executed at high speed.
- (C) A larger WFST model can be used with the same memory usage.
- a voice agent that performs on/off of a television, channel selection and volume adjustment, change of temperature setting of a refrigerator, on/off of home electric appliances such as lights and air conditioners, and adjustment operations.
- the voice agent can also reply by voice when asked about weather forecasts, stock/exchange information, news, accept product orders, and read the contents of purchased books.
- the agent function is provided, for example, by linking an agent device installed around the user at home or the like with an agent service built on the cloud (see, for example, Patent Document 2).
- the agent device mainly provides a user interface such as a voice input for receiving a voice spoken by a user and a voice output for answering an inquiry from the user by voice.
- the agent service side performs recognition and semantic analysis of the voice input to the agent device. Further, the agent service side may also execute high-load processes such as information retrieval in response to a user inquiry and voice synthesis based on the processing result.
- FIG. 12 shows a functional configuration example of an agent system 1200 including a voice recognition system to which the technology according to the present disclosure is applied.
- the agent system 1200 includes an agent device 1201 and an agent service 1202.
- the agent device 1201 is installed around the user, for example, in the home.
- the agent device 1201 is interconnected with various home appliances, such as a TV 1211, a refrigerator 1212, and LED (Light Emitting Diode) lighting 1213, via a wired LAN (Local Area Network) such as Ethernet (registered trademark) or a wireless LAN such as Wi-Fi (registered trademark).
- the agent device 1201 includes an audio input unit such as a microphone and an output unit such as a speaker and a display.
- the agent service 1202 includes a voice recognition system 1204 and a semantic analysis unit 1203.
- the voice recognition system 1204 is assumed to have the functional configuration shown in any one of FIG. 4, FIG. 9, FIG. 19, FIG. 24, and FIG. 27, and the detailed description thereof will be omitted here.
- the agent service 1202 is configured as a server on the cloud, for example.
- the agent device 1201 and the agent service 1202 are interconnected via a wide area network such as the Internet.
- a system configuration in which the function of the agent service 1202 is incorporated in the agent device 1201 is also possible.
- the agent device 1201 transmits a voice signal obtained by picking up a voice command uttered by the user to the agent service 1202.
- Voice commands also include instructions for home appliances such as “turn on the TV,” “tell me what's in the refrigerator,” and “turn off the lights.”
- the voice recognition system 1204 converts the received voice signal into the text of the recognition result by the voice recognition process using on-the-fly synthesis. Then, the semantic analysis unit 1203 performs semantic analysis on the text of the recognition result, and returns the semantic analysis result to the agent device 1201.
- the result of the semantic analysis of the user's voice command includes operation commands for each home appliance such as turning on/off the television 1211, tuning and adjusting the volume, changing the temperature setting of the refrigerator 1212, turning on/off the LED lighting 1213 and adjusting the light amount.
- based on the result of the semantic analysis received from the agent service 1202, the agent device 1201 transmits, via the home network, operation signals for turning the television 1211 on/off, tuning, and adjusting the volume, operation signals for changing the temperature setting of the refrigerator 1212, and operation signals for turning the LED lighting 1213 on/off and adjusting the light amount.
- the embodiment applied to the WFST for voice recognition has been mainly described as an example of the graph search, but the application of the technology according to the present disclosure is not limited to this, and the technology can likewise be applied to graph search processes that perform other similar processing.
- the technique described as the first embodiment can be similarly applied to various cases in which the graph search processing capable of on-the-fly synthesis is applied to the hybrid environment of the CPU and the GPU.
- the technique described as the second embodiment can be applied not only to the combination of the main storage device and the auxiliary storage device, but also to any combination of storage devices having different access performance or capacity, such as the combination of the GPU memory and the auxiliary storage device.
- the application target of the technology according to the present disclosure is not limited to GPUs and graph search processing for speech recognition: the GPU can be replaced with a many-core arithmetic unit having a limited memory capacity (a memory capacity smaller than the graph size), and the graph search process for voice recognition can be replaced with a general graph search process.
- the WFST voice recognition system to which the technology according to the present disclosure is applied can be installed in various types of information processing devices or information terminals such as personal computers, smartphones, tablets, and voice agents.
- a calculation unit, a first storage device, and a second storage device are provided.
- the graph information is divided into first graph information and second graph information, Arranging the first graph information in the first storage device, Arranging the second graph information in the second storage device,
- the arithmetic unit performs a graph search process using the first graph information arranged in the first storage device and the second graph information arranged in the second storage device, Information processing device.
- the size of the first graph information is smaller than that of the second graph information.
- the first storage device has a smaller capacity than the second storage device, The information processing device according to (1) above.
- the graph information is a WFST model expressing an acoustic model, a pronunciation dictionary, and a language model in speech recognition
- the first graph is the smaller WFST model obtained by dividing the WFST model into two large and small ones
- the second graph is the larger WFST model.
- the small WFST model obtained by dividing the language model into two large and small ones and synthesizing the smaller language model considering the connection of words less than the first number with the acoustic model and the pronunciation dictionary is used as the first graph information.
- a large WFST model consisting of a language model considering the connection of an arbitrary number of words larger than the first number is used as the second graph information.
- when it becomes necessary to refer to the second graph information while executing the search process using the first graph information, the calculation unit copies the necessary part of the second graph information from the second storage device to the first storage device and continues the search process.
- the information processing apparatus according to any one of (1) to (4) above.
- the computing unit includes a first computing unit including a GPU or other many-core computing unit and a second computing unit including a CPU,
- the first storage device is a memory in the GPU, and the second storage device is a local memory of the CPU.
- the graph information is a WFST model
- the first arithmetic unit transitions tokens with a small WFST model, but when a word is output from the transitioned arc and it becomes necessary to perform a state transition of the token of a large WFST model, it is necessary for processing.
- the information processing device according to (6) above.
- the first arithmetic unit previously calculates a position on the second storage device in which a necessary arc is arranged in the second graph, The information processing device according to (6) above.
- the list of the necessary position information of the arc calculated in advance by the first arithmetic unit is transmitted to the second arithmetic unit,
- the second operation unit copies an arc, which is required during the graph search by the first operation unit, from the second storage device to the first storage device based on the list.
- the first storage device includes a cache that holds the second graph information.
- the information processing device according to (1) above.
- the cache has a data structure that takes the identification information of the source state and the input label as inputs and returns an arc.
- the information processing device according to (11) above.
- the feature amount extraction for calculating the feature amount of the input voice is executed.
- the first storage device is a local memory of the arithmetic unit
- the second storage device is an auxiliary storage device
- the calculation unit transitions tokens with a small WFST model, but when a word is output from the transitioned arc and it becomes necessary to perform a state transition of the token of a large WFST model, the data necessary for processing is input.
- the search process is performed while copying from the second storage device to the first storage device.
- the arithmetic unit is composed of a CPU or a GPU, The information processing apparatus according to (15) above.
- the calculation unit includes feature amount extraction for calculating the feature amount of the input voice, HMM score calculation for calculating the HMM state score from the feature amount, and the first graph information arranged in the first storage device. Executing a search process by on-the-fly synthesis using the second graph information arranged in the second storage device, The information processing apparatus according to (15) above.
- the first storage device holds data for accessing the larger WFST model in the second storage device,
- the arithmetic unit copies data necessary for processing from the second storage device to the first storage device based on the access data.
- the larger WFST model consists of an arc array in which arcs are sorted by the state ID of the source state and the input label
- the first storage device has, as the access data, an arc index that stores the start position of the arcs of each state on the arc array, and an input label array that stores the input labels corresponding to the arcs on the arc array in the same order as the arc array. The calculation unit specifies the start position on the arc array for the state ID of the source state of the target arc by the arc index, and searches for the input label of the target arc from the element at the start position on the input label array, thereby specifying the position where the target arc is stored on the arc array and acquiring the data of the target arc from the arc array of the second storage device.
- the information processing device according to (16).
- the larger WFST model consists of an arc array in which arcs are sorted by the state ID of the source state and the input label
- the first storage device stores, as the access data, an arc index that stores a start position of an arc in each state on the arc array, a division of the arc array for each page, and a head of the arc array of each page.
- the calculation unit calculates a page range in which a target arc exists based on the arc index, identifies a page in which the target arc exists from the page range based on the input label array, and acquires the specified page from the arc array of the second storage device,
- the information processing device according to (16).
- an information processing device including a calculation unit, a first storage device, and a second storage device, A step of arranging the first graph information obtained by dividing the graph information in the first storage device, and A step of arranging the second graph information obtained by dividing the graph information in the second storage device, and A step in which the arithmetic unit executes a graph search process using the first graph information arranged in the first storage device and the second graph information arranged in the second storage device;
- An information processing method including.
- the graph information is divided into the first graph information and the second graph information.
- the first graph information is arranged in the first memory of the first calculation unit.
- the second graph information is arranged in the second memory of the second calculation unit.
- the first arithmetic unit performs a graph search process using the first graph information arranged in the first memory and the second graph information arranged in the second memory.
- the size of the first graph information is smaller than that of the second graph information.
- the first memory has a smaller capacity than the second memory, The information processing device according to (101).
- the first arithmetic unit is a GPU or other many-core arithmetic unit
- the second arithmetic unit is a CPU.
- the information processing apparatus according to any of (101) and (102) above.
- the graph information is a WFST model expressing an acoustic model, a pronunciation dictionary, and a language model in speech recognition
- the first graph information is the smaller of two WFST models, one large and one small, obtained by dividing the WFST model
- the second graph information is the larger WFST model.
- the first graph information is the small WFST model obtained by dividing the language model into a larger and a smaller language model and composing the smaller language model, which considers connections of fewer words than a first number, with the acoustic model and the pronunciation dictionary.
- the second graph information is the large WFST model, composed of the language model that considers connections of an arbitrary number of words larger than the first number.
- the information processing device according to (104).
- the graph information is a WFST model
- the token is transitioned using the small WFST model, and when a word is output from the traversed arc and a state transition of the token in the large WFST model becomes necessary, the first arithmetic unit performs all of the search processing while copying the data required for that processing from the second memory to the first memory.
- the information processing apparatus according to (106) above.
- the first arithmetic unit calculates in advance the position in the second memory at which a required arc of the second graph information is arranged,
- the information processing apparatus according to any of (101) to (107).
- the first memory includes a cache that holds the second graph information.
- the information processing apparatus according to any of (101) to (110).
- the cache has a data structure that takes the source state identification information and the input label as inputs and returns an arc.
- the information processing device according to (111) above.
- feature amount extraction for calculating the feature amount of the input voice is executed.
- the information processing device according to any one of (101) to (112).
- the second arithmetic unit further executes a process for outputting a voice recognition result obtained by the search process executed by the first arithmetic unit,
- the information processing device according to (113).
- at least one of a voice input unit for inputting voice and an output unit for outputting a voice recognition result is further provided.
- WFST search unit, 2405 ... Recognition result output unit, 2410 ... RAM, 2411 ... Acoustic model, 2412 ... WFST model (small), 2413 ... Language model arc cache, 2414 ... WFST model (large) access data, 2415 ... Work area, 2416 ... Language model access pattern model, 2420 ... SSD, 2421 ... WFST model (large), 2431 ... Voice input unit, 2432 ... Output unit, 2700 ... Voice recognition system, 2701 ... Signal processing unit, 2702 ... Feature amount extraction unit, 2703 ... HMM score calculation unit, 2704 ... WFST search unit, 2705 ... Recognition result output unit, 2710 ... CPU, 2720 ... GPU, 2730 ... GPU memory, 2731 ... Acoustic model, 2732 ... WFST model (small), 2733 ... Language model arc cache, 2734 ... WFST model (large) access data, 2735 ... Work area, 2736 ... Language model access pattern model, 2740 ... SSD, 2741 ... WFST model (large), 2751 ... Voice input unit, 2752 ... Output unit
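The items above describe a cache in the first memory that takes a source state ID and an input label and returns an arc of the large WFST, and a token that advances in the small WFST and moves in the large WFST only when a word is output. The following Python sketch illustrates that idea under stated assumptions and is not the implementation from the specification: the names `Arc`, `ArcCache`, `advance_token`, and `EPSILON` are hypothetical, the least-recently-used eviction policy is an assumption (the source does not name a policy), and weights are simply added as if they were negative log probabilities.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

EPSILON = 0  # assumed label meaning "no word output" (hypothetical convention)


@dataclass(frozen=True)
class Arc:
    """One WFST arc; field names are illustrative, not taken from the source."""
    src: int       # source state ID
    ilabel: int    # input label
    olabel: int    # output label (a word ID, or EPSILON)
    weight: float  # arc weight, treated here as a cost that is summed
    dst: int       # destination state ID


class ArcCache:
    """Holds arcs of the large WFST in the first (small, fast) memory.

    Lookup key is (source state ID, input label). On a miss, `fetch` stands
    in for copying the arc from the second (large) memory.
    """

    def __init__(self, fetch: Callable[[int, int], Optional[Arc]], capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch
        self._capacity = capacity
        self._arcs: Dict[Tuple[int, int], Arc] = {}
        self.hits = 0
        self.misses = 0

    def get(self, src: int, ilabel: int) -> Optional[Arc]:
        key = (src, ilabel)
        arc = self._arcs.pop(key, None)
        if arc is not None:
            self._arcs[key] = arc  # re-insert to mark as most recently used
            self.hits += 1
            return arc
        self.misses += 1
        arc = self._fetch(src, ilabel)
        if arc is not None:
            if len(self._arcs) >= self._capacity:
                # dicts keep insertion order, so the first key is the oldest
                del self._arcs[next(iter(self._arcs))]
            self._arcs[key] = arc
        return arc


def advance_token(small_arc: Arc, large_state: int,
                  cache: ArcCache) -> Optional[Tuple[int, float]]:
    """Follow one small-WFST arc; move in the large WFST only if a word is output.

    Returns (new large-WFST state, cost of this step), or None when the large
    WFST has no arc for the output word from the current state.
    """
    if small_arc.olabel == EPSILON:
        return large_state, small_arc.weight
    large_arc = cache.get(large_state, small_arc.olabel)
    if large_arc is None:
        return None
    return large_arc.dst, small_arc.weight + large_arc.weight
```

In this sketch the `fetch` callable is the only place that touches the second memory, so the search loop itself never needs to know where the large model is stored.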
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/433,389 US20220147570A1 (en) | 2019-03-04 | 2019-12-19 | Information processing apparatus and information processing method |
| JP2021503424A JPWO2020179193A1 | 2019-03-04 | 2019-12-19 | |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2019039051 | 2019-03-04 | ||
| JP2019-039051 | 2019-03-04 | ||
| JP2019-182142 | 2019-10-02 | ||
| JP2019182142 | 2019-10-02 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020179193A1 (ja) | 2020-09-10 |
Family
ID=72338554
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2019/049771 Ceased WO2020179193A1 (ja) | 2019-03-04 | 2019-12-19 | Information processing apparatus and information processing method |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20220147570A1 |
| JP (1) | JPWO2020179193A1 |
| WO (1) | WO2020179193A1 |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11605376B1 (en) * | 2020-06-26 | 2023-03-14 | Amazon Technologies, Inc. | Processing orchestration for systems including machine-learned components |
| US20240356948A1 (en) * | 2023-04-21 | 2024-10-24 | Barracuda Networks, Inc. | System and method for utilizing multiple machine learning models for high throughput fraud electronic message detection |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2015011709A (ja) * | 2013-07-01 | 2015-01-19 | Palo Alto Research Center Incorporated | System and method for parallel search on explicitly represented graphs |
| JP2015014774A (ja) * | 2013-06-03 | 2015-01-22 | Nippon Telegraph and Telephone Corporation | WFST creation device for speech recognition, speech recognition device, WFST creation method for speech recognition, speech recognition method, and program |
| JP2015041055A (ja) * | 2013-08-23 | 2015-03-02 | Yahoo Japan Corporation | Speech recognition device, speech recognition method, and program |
| JP2015529350A (ja) * | 2012-09-07 | 2015-10-05 | Carnegie Mellon University | Hybrid GPU/CPU data processing method |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016099301A1 (en) * | 2014-12-17 | 2016-06-23 | Intel Corporation | System and method of automatic speech recognition using parallel processing for weighted finite state transducer-based speech decoding |
| US9972314B2 (en) * | 2016-06-01 | 2018-05-15 | Microsoft Technology Licensing, Llc | No loss-optimization for weighted transducer |
| US11366866B2 (en) * | 2017-12-08 | 2022-06-21 | Apple Inc. | Geographical knowledge graph |
2019
- 2019-12-19 JP JP2021503424A patent/JPWO2020179193A1/ja, not active (Abandoned)
- 2019-12-19 US US17/433,389 patent/US20220147570A1/en, not active (Abandoned)
- 2019-12-19 WO PCT/JP2019/049771 patent/WO2020179193A1/ja, not active (Ceased)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2015529350A (ja) * | 2012-09-07 | 2015-10-05 | Carnegie Mellon University | Hybrid GPU/CPU data processing method |
| JP2015014774A (ja) * | 2013-06-03 | 2015-01-22 | Nippon Telegraph and Telephone Corporation | WFST creation device for speech recognition, speech recognition device, WFST creation method for speech recognition, speech recognition method, and program |
| JP2015011709A (ja) * | 2013-07-01 | 2015-01-19 | Palo Alto Research Center Incorporated | System and method for parallel search on explicitly represented graphs |
| JP2015041055A (ja) * | 2013-08-23 | 2015-03-02 | Yahoo Japan Corporation | Speech recognition device, speech recognition method, and program |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2020179193A1 | 2020-09-10 |
| US20220147570A1 (en) | 2022-05-12 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| EP3966813B1 (en) | Online verification of custom wake word | |
| EP3966807B1 (en) | On-device custom wake word detection | |
| Price et al. | A low-power speech recognizer and voice activity detector using deep neural networks | |
| Franz et al. | Searching the web by voice | |
| CN107408111B (zh) | 端对端语音识别 | |
| KR101780760B1 (ko) | 가변길이 문맥을 이용한 음성인식 | |
| US6574597B1 (en) | Fully expanded context-dependent networks for speech recognition | |
| KR101970041B1 (ko) | 하이브리드 지피유/씨피유(gpu/cpu) 데이터 처리 방법 | |
| Arısoy et al. | Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition | |
| Price et al. | A 6 mW, 5,000-word real-time speech recognizer using WFST models | |
| CN110364171A (zh) | 一种语音识别方法、语音识别系统及存储介质 | |
| GB2453366A (en) | Automatic speech recognition method and apparatus | |
| CN112151003A (zh) | 并行语音合成方法、装置、设备以及计算机可读存储介质 | |
| CN112420026A (zh) | 优化关键词检索系统 | |
| Bai et al. | End-to-end keywords spotting based on connectionist temporal classification for mandarin | |
| Suyanto et al. | End-to-end speech recognition models for a low-resourced indonesian language | |
| US20230237990A1 (en) | Training speech processing models using pseudo tokens | |
| WO2020179193A1 (ja) | Information processing apparatus and information processing method | |
| Bataev et al. | NGPU-LM: GPU-Accelerated N-Gram Language Model for context-biasing in greedy ASR decoding | |
| Markovnikov et al. | Investigating joint CTC-attention models for end-to-end Russian speech recognition | |
| Ström | Continuous speech recognition in the WAXHOLM dialogue system | |
| Price | Energy-scalable speech recognition circuits | |
| Domokos et al. | Romanian phonetic transcription dictionary for speeding up language technology development | |
| Pinto et al. | Design and evaluation of an ultra low-power human-quality speech recognition system | |
| Tamburini | Playing with NeMo for building an automatic speech recogniser for Italian |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19918308; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021503424; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: PCT application non-entry in European phase | Ref document number: 19918308; Country of ref document: EP; Kind code of ref document: A1 |