KR20170091903A - Voice recongintion system and methode based on deep neural network - Google Patents
- Publication number
- KR20170091903A (application KR1020160012761A)
- Authority
- KR
- South Korea
- Prior art keywords
- neural network
- nodes
- present
- recognition system
- speech recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Description
The present invention relates to a speech recognition system and method.
Generally, the speech recognition system extracts the word W that maximizes the likelihood for a given feature parameter X, as shown in [Equation 1].
[Equation 1]
W = argmax_W Σ_M P(X|M) P(M|W) P(W)
In this case, the three probability models P(X|M), P(M|W), and P(W) are the acoustic model, the pronunciation model, and the language model, respectively.
The language model P(W) contains probability information about word connections.
The pronunciation model P(M|W) expresses which pronunciation symbols compose each word.
The acoustic model P(X|M) models the probability of observing the actual feature vector X given the pronunciation symbols.
Of the three probability models, the acoustic model P(X|M) can be calculated using a deep neural network.
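As an illustration, the decomposition of [Equation 1] can be sketched with toy log-probability tables. All names and numbers below are hypothetical stand-ins for trained models, not values from the patent:

```python
import math

# Toy log-probability tables standing in for trained models (hypothetical numbers).
log_acoustic = {("x1", "m_a"): -1.0, ("x1", "m_b"): -3.0}    # log P(X|M)
log_pron     = {("m_a", "cat"): -0.1, ("m_b", "dog"): -0.1}  # log P(M|W)
log_lm       = {"cat": -0.7, "dog": -0.7}                    # log P(W)
prons        = {"cat": ["m_a"], "dog": ["m_b"]}              # pronunciations M per word W

def best_word(X, words):
    """Extract argmax_W sum_M P(X|M) P(M|W) P(W), as in [Equation 1]."""
    def score(W):
        return sum(
            math.exp(log_acoustic[(X, M)] + log_pron[(M, W)] + log_lm[W])
            for M in prons[W]
        )
    return max(words, key=score)

print(best_word("x1", ["cat", "dog"]))  # prints "cat": its pronunciation scores higher acoustically
```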
In a traditional deep neural network, the probability y_j^L of the j-th node of the L-th (output) layer is obtained by the softmax function of [Equation 2].

[Equation 2]
y_j^L = exp(a_j^L) / Σ_{k=1}^{K} exp(a_k^L)

That is, the pre-activations a_k^L of all K nodes of the L-th layer must be computed in order to normalize the output value of each node. In this case, the output of a node is obtained as in [Equation 3].

[Equation 3]
y_j^l = σ(Σ_i w_{ij}^l y_i^{l-1} + b_j^l)

The amount of computation O(n) required to obtain the observation probabilities follows from [Equation 3].
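The normalization in [Equation 2] is the reason every output node must be computed: each denominator sums over all K nodes. A minimal softmax sketch (the max-shift is a standard numerical-stability detail, not part of the patent):

```python
import math

def softmax(a):
    """[Equation 2]: y_j = exp(a_j) / sum_k exp(a_k) over the K output nodes.
    Shifting by max(a) avoids overflow without changing the result."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

# Even if Viterbi decoding only needs one output node, all K must be evaluated
# to form the denominator.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```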
The decoding process is implemented as the argmax_W operation of the algorithm described in [Table 1].
In the speech recognition system, as shown in the eighth line of the algorithm described in [Table 1], only the observation probabilities for the active states j are required.
However, when the softmax scheme of [Equation 2] is used, all L hidden layers must be computed irrespective of the number of active states j. Therefore, the amount of computation of the neural network is always O(n) = I × M + L × (M × M) + M × K, irrespective of the number of active states required in the actual Viterbi decoding.
Where I is the dimension of the input vector, M is the number of nodes per hidden layer, and K is the number of output nodes.
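This fixed cost can be sketched as a small function (assuming the first term is the input-to-first-hidden product I × M; the layer sizes in the example are hypothetical):

```python
def dnn_multiplies(I, L, M, K):
    """Weight multiplications for one forward pass of a DNN with input
    dimension I, L hidden layers of M nodes each, and K output nodes.
    Assumes the first term is the input-to-hidden product I * M."""
    return I * M + L * (M * M) + M * K

# The cost is identical whether Viterbi decoding activates 1 state or all K states.
print(dnn_multiplies(I=440, L=5, M=1024, K=8000))  # prints 13885440
```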
As described above, in the conventional speech recognition system, since the amount of computation of the deep neural network is fixed irrespective of the number of active states, computation can be wasted unnecessarily.
In this regard, Korean Patent Laid-Open Publication No. 10-2006-0133610 (entitled "Cardiac Classification Method Using Hidden Markov Model") discloses a method of classifying heart sounds using hidden Markov models trained on heart sound data.
Embodiments of the present invention provide a neural network-based speech recognition system and method that can improve the speed of speech recognition by reducing the time required for computation of the neural network.
It should be understood, however, that the technical scope of the present invention is not limited to the above-described technical problems, and other technical problems may exist.
According to an aspect of the present invention, there is provided a deep neural network-based speech recognition system comprising: an input unit for receiving various information; a storage unit for storing programs and information; a control unit for processing information input through the input unit using the programs; and an output unit for outputting a result processed by the control unit, wherein the control unit expresses, as a subgraph, the nodes having a large degree of influence on the error of each output node of the neural network, and computes only those nodes during Viterbi decoding.
In order to improve the computation speed of the deep neural network in the deep neural network-based speech recognition system, the present invention expresses as a subgraph the nodes having a large influence on the error of each output node and computes only those nodes during Viterbi decoding, so that the speed of the neural network-based speech recognition system can be improved.
That is, the present invention can reduce the computational complexity of the deep neural network because the amount of calculation varies with the number of active states.
According to the present invention, since the time required for the calculation of the deep neural network is reduced, the speech recognition speed can be improved.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating the configuration of a deep neural network-based speech recognition system according to the present invention.
FIG. 2 is a diagram illustrating, in graph terms, a deep neural network applied to the deep neural network-based speech recognition system according to the present invention.
FIG. 3 is a diagram illustrating a deep neural network decomposed by the deep neural network-based speech recognition system according to the present invention.
The above and other objects, advantages, and features of the present invention, and the methods of achieving them, will become apparent from the following detailed description of embodiments taken in conjunction with the accompanying drawings.
The present invention may, however, be embodied in many different forms and should not be construed as being limited to the exemplary embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art; the scope of the present invention is defined by the claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, the singular includes the plural unless the context clearly indicates otherwise. The terms "comprises" and/or "comprising", as used herein, do not preclude the presence or addition of one or more components, steps, or operations other than those recited.
FIG. 1 is a diagram illustrating the configuration of a deep neural network-based speech recognition system according to the present invention.
The deep neural network-based speech recognition system according to the present invention (hereinafter simply referred to as the speech recognition system 10) extracts the word W that maximizes the likelihood for a given feature parameter X, as shown in [Equation 1].
In this case, the three probability models P(X|M), P(M|W), and P(W) are the acoustic model, the pronunciation model, and the language model, respectively. The language model P(W) contains probability information about word connections, the pronunciation model P(M|W) expresses which phonetic symbols compose each word, and the acoustic model P(X|M) models the probability of observing the actual feature vector X given the phonetic symbols.
Of the three probability models, the acoustic model P(X|M) can be calculated using a deep neural network.
In other words, one of the important factors in improving speech recognition performance is the acoustic model. The acoustic model aims to learn how to distinguish phonemes using large amounts of speech data.
In a typical conventional acoustic model, a Gaussian distribution for each phoneme is estimated from the training speech data. This approach has the advantage of fast model training; however, its performance saturates beyond a certain point despite increases in the training data.
As a method to solve this problem, a technique of generating an acoustic model using a deep neural network has been proposed.
In the deep neural network, the probability of the j-th node of the L-th output layer is obtained through the softmax function of [Equation 2] mentioned in the description of the background of the invention. That is, the outputs of all K nodes of the L-th layer must be computed in order to normalize the output value of each node. In this case, the output of a node is obtained as in [Equation 3] mentioned in the background. In addition, the decoding process is implemented by the argmax_W operation of [Table 1] mentioned in the background.
In the speech recognition system 10, only the observation probabilities for the active states j are required during decoding.
Conventionally, when the softmax scheme of [Equation 2] is used, all L hidden layers have to be computed regardless of the number of active states j. That is, since the amount of computation of the deep neural network is fixed irrespective of the number of active states, computation has to be expended unnecessarily.
However, the present invention provides a neural network-based speech recognition system capable of improving the speed of speech recognition by reducing the time required for the computation of the neural network.
In particular, in order to improve the computational speed of the deep neural network in the deep neural network-based speech recognition system 10, the nodes having a large influence on the error of each output node are expressed as a subgraph, and only those nodes are computed during Viterbi decoding.
That is, the present invention can reduce the computational complexity of the deep neural network because the amount of calculation varies with the number of active states.
Referring to FIG. 1, the speech recognition system 10 includes an input unit 11, a control unit 12, a storage unit 13, and an output unit 14.
Here, the input unit 11 receives various information, and the storage unit 13 stores programs and information.
The control unit 12 processes the information input through the input unit 11 using the programs stored in the storage unit 13.
The output unit 14 outputs the result processed by the control unit 12.
In particular, the control unit 12 expresses, as a subgraph, the nodes having a large influence on the error of each output node of the neural network, and computes only those nodes during Viterbi decoding.
The components shown in FIG. 1 may be implemented in software, or in hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), and may perform predetermined roles.
However, 'components' are not limited to software or hardware; each component may reside on an addressable storage medium and may be configured to execute on one or more processors.
Thus, by way of example, a component may include software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
The components and functions provided within those components may be combined into a smaller number of components or further separated into additional components.
Hereinafter, the speech recognition system 10 will be described in more detail with reference to the drawings.
FIG. 2 is a graphical representation of a deep neural network applied to the deep neural network-based speech recognition system 10 according to the present invention.
That is, in the present invention, the deep neural network is interpreted in terms of a graph. A general deep neural network such as the one shown in FIG. 2 can be expressed in the graph form of [Equation 4].

[Equation 4]
G = {I, H, F, E}

where I is the set of start (input) nodes, H is the set of hidden-layer nodes, F is the set of output nodes, and E is the set of connections between nodes. The set of all paths starting from a subset i of I and reaching a subset f of F through a subset h of H can be expressed as [Equation 5].

[Equation 5]
P(i, h, f)

In this case, the conventional deep neural network corresponding to [Equation 2] mentioned in the background art is represented by P(I, H, F).
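This graph view can be sketched directly. The node names below are hypothetical labels for a tiny network, not identifiers from the patent:

```python
# A minimal graph view of a DNN per [Equation 4]: G = {I, H, F, E}.
I_nodes = {"i0", "i1"}            # input (start) nodes
H_nodes = {"h0", "h1", "h2"}      # hidden-layer nodes
F_nodes = {"f0", "f1"}            # output nodes
E_edges = {("i0", "h0"), ("i0", "h1"), ("i1", "h1"), ("i1", "h2"),
           ("h0", "f0"), ("h1", "f0"), ("h1", "f1"), ("h2", "f1")}

def paths_to(f):
    """All input->hidden->output paths reaching output node f,
    i.e. the path set P(i, h, f) of [Equation 5] for one hidden layer."""
    return [(i, h, f) for i in I_nodes for h in H_nodes
            if (i, h) in E_edges and (h, f) in E_edges]

print(sorted(paths_to("f1")))
```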
However, in the case of speech recognition as described above, the score value need not be calculated for every output node in every frame.
Also, to obtain the score value of a particular output node, there is no need to calculate for all input and hidden nodes.
For example, it is sufficient to calculate only those nodes of the hidden layer that have a large effect on the specific output node to be sought.
Accordingly, in the present invention, the speech recognition system 10 decomposes the neural network G into subgraphs by output node, as in [Equation 6].

[Equation 6]
G = ∪ G_f, where f ∈ F

In other words, the neural network G is composed of the union of the subgraphs G_f decomposed by output node.
In the present invention, among the subgraphs decomposed by output node, only the subgraphs corresponding to the states activated during decoding are evaluated. Thus, the analysis speed can be improved.
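The effect of evaluating only the active subgraphs can be sketched as follows. The per-subgraph node counts are hypothetical, chosen only to show that the work scales with the number of active states:

```python
# Sketch: evaluate only the subgraphs G_f of active output states,
# per [Equation 6] G = union of G_f. The costs below are hypothetical.
subgraph_cost = {"f0": 120, "f1": 95, "f2": 110, "f3": 90}  # nodes evaluated per G_f

def decode_cost(active_states):
    """Total evaluation work: grows with the number of active states
    instead of being fixed at the full-network cost."""
    return sum(subgraph_cost[f] for f in active_states)

print(decode_cost({"f1"}))              # prints 95: one active state
print(decode_cost(set(subgraph_cost)))  # prints 415: all states, the conventional fixed cost
```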
In this case, how to construct each subgraph corresponding to the output node can be an important problem.
In general, deep neural networks train their model parameters in the direction that minimizes the error.
Thus, in order to construct the subgraph G_f for the output node f, the nodes that affect the error E(f) of the output node f must be selected.
In other words, the degree of influence of the i-th node of the l-th layer on the error of the f-th output node can be defined as in [Equation 7].

[Equation 7]

When the degree of influence on the error is defined by [Equation 7], the subgraph G_f can be obtained as shown in [Equation 8]: for each output node f, at most N nodes with the largest influence are kept.

[Equation 8]

To separate the presence or absence of the i-th node in the l-th layer, [Equation 3] can be changed into [Equation 9], in which an indicator marks whether the node is included.

[Equation 9]

Differentiating [Equation 7] yields [Equation 10].

[Equation 10]

When the parameter in [Equation 7] is set to zero, applying the approximation yields [Equation 11].

[Equation 11]

Therefore, the degree of influence can be approximated as in [Equation 12].

[Equation 12]
In summary, the deep neural network decomposition method applied to the speech recognition system 10 selects, for each output node, the nodes that contribute most to reducing that node's error.
In other words, the present invention decomposes the deep neural network of the speech recognition system 10 into a subgraph per output node built from the selected nodes.
For example, FIG. 3 schematically shows the case where, with N = 1, the single node with the highest error-reduction contribution is selected for each output node.
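This per-output-node selection can be sketched as a top-N pick over influence scores, in the spirit of [Equation 8]. The contribution values below are hypothetical:

```python
def build_subgraph(error_contrib, N):
    """Keep the N hidden nodes with the largest influence on the error
    of one output node f, as in [Equation 8].
    error_contrib: hidden-node name -> |contribution to E(f)| (hypothetical)."""
    return sorted(error_contrib, key=error_contrib.get, reverse=True)[:N]

contrib_f = {"h0": 0.02, "h1": 0.85, "h2": 0.10, "h3": 0.51}
print(build_subgraph(contrib_f, N=1))  # prints ['h1'] - as in FIG. 3 with N = 1
print(build_subgraph(contrib_f, N=2))  # prints ['h1', 'h3']
```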
According to one embodiment of the present invention, in order to improve the computational speed of the deep neural network in the deep neural network-based speech recognition system 10, the nodes having a large influence on the error of each output node are expressed as a subgraph and computed during Viterbi decoding.
That is, the present invention can reduce the computational complexity of the deep neural network because the amount of calculation varies with the number of active states.
Further, according to the present invention, since the time required for the calculation of the deep neural network is reduced, the speech recognition speed can be improved.
The speech recognition method in the speech recognition system 10 described above may also be implemented in the form of a computer program executed by a computer.
While the methods and systems of the present invention have been described in connection with specific embodiments, some or all of those elements or operations may be implemented using a computer system having a general purpose hardware architecture.
It will be understood by those skilled in the art that the foregoing description of the present invention is for illustrative purposes only, and that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. It is therefore to be understood that the above-described embodiments are illustrative in all aspects and not restrictive. For example, each component described as a single entity may be implemented in a distributed manner, and components described as distributed may be implemented in combined form.
The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.
10: speech recognition system 11: input unit
12: control unit 13: storage unit
14: output unit
Claims (1)
A deep neural network-based speech recognition system comprising:
an input unit for receiving various information;
a storage unit for storing programs and information;
a control unit for processing information input through the input unit using the programs; and
an output unit for outputting a result processed by the control unit,
wherein the control unit expresses, as a subgraph, nodes whose degree of influence on the errors of the output nodes of the neural network exceeds a predetermined threshold value, and calculates those nodes upon Viterbi decoding.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160012761A KR20170091903A (en) | 2016-02-02 | 2016-02-02 | Voice recongintion system and methode based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20170091903A true KR20170091903A (en) | 2017-08-10 |
Family
ID=59652430
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020160012761A KR20170091903A (en) | 2016-02-02 | 2016-02-02 | Voice recongintion system and methode based on deep neural network |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20170091903A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190100498A (en) * | 2018-02-06 | 2019-08-29 | 한국전자통신연구원 | Apparatus and method for correcting speech recognition result |
US11429180B2 (en) | 2019-01-04 | 2022-08-30 | Deepx Co., Ltd. | Trained model creation method for performing specific function for electronic device, trained model for performing same function, exclusive chip and operation method for the same, and electronic device and system using the same |