KR20170091903A - Voice recognition system and method based on deep neural network - Google Patents

Voice recognition system and method based on deep neural network

Info

Publication number
KR20170091903A
Authority
KR
South Korea
Prior art keywords
neural network
nodes
present
recognition system
speech recognition
Prior art date
Application number
KR1020160012761A
Other languages
Korean (ko)
Inventor
정훈
박전규
이성주
이윤근
최우용
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020160012761A priority Critical patent/KR20170091903A/en
Publication of KR20170091903A publication Critical patent/KR20170091903A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The purpose of the present invention is to provide a deep neural network-based speech recognition system that can improve speech recognition speed by reducing the time spent computing the deep neural network. To this end, the deep neural network-based speech recognition system according to the present invention comprises: an input unit receiving various information; a storage unit storing programs and information; a control unit processing the information input through the input unit by using the programs; and an output unit outputting the result processed by the control unit. The control unit represents, for each output node of the deep neural network, the nodes with a high degree of influence on that node's error as a subgraph, and computes those nodes during Viterbi decoding.

Description

TECHNICAL FIELD [0001] The present invention relates to a speech recognition system and method based on a deep neural network.

The present invention relates to a speech recognition system and method.

Generally, a speech recognition system extracts the word W that yields the maximum likelihood for a given feature parameter X, as shown in [Equation 1].

[Equation 1]

$\hat{W} = \underset{W}{\arg\max}\; P(X \mid M)\, P(M \mid W)\, P(W)$

In this case, the three probability models P(X|M), P(M|W), and P(W) are the acoustic model, the pronunciation model, and the language model, respectively.

The language model P(W) contains probability information about word-to-word connections.

The pronunciation model P(M|W) expresses which pronunciation symbols each word is composed of.

The acoustic model P(X|M) models the probability of observing the actual feature vector X given a pronunciation symbol.

Of the three probability models, the acoustic model P(X|M) can be calculated using a deep neural network.

In a conventional deep neural network, the probability $y_j^L$ of the j-th node of the L-th (output) layer is obtained by a softmax function, as shown in [Equation 2].

[Equation 2]

$y_j^L = \dfrac{\exp(a_j^L)}{\sum_{k=1}^{K} \exp(a_k^L)}$

That is, the pre-activations $a_k^L$ of all K nodes of the L-th output layer must be computed, and the output value of each node is then normalized by the softmax. In this case, the output of a node is obtained as shown in [Equation 3].

[Equation 3]

$a_j^{l} = \sum_{i} w_{ji}^{l}\, y_i^{l-1} + b_j^{l}, \qquad y_j^{l} = \sigma(a_j^{l})$

where $w_{ji}^{l}$ and $b_j^{l}$ are the weights and biases of the l-th layer and $\sigma$ is the activation function.

Based on [Equation 3], the observation probabilities are obtained with a computation amount of O(n).

The decoding process corresponds to the $\arg\max_W$ operation and is carried out by the Viterbi algorithm of [Table 1].

[Table 1] Viterbi decoding algorithm
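For illustration, the conventional computation of [Equation 2] and [Equation 3] can be sketched as follows. This is a minimal sketch assuming a fully connected network with sigmoid hidden units; the function name and data layout are illustrative, not taken from the patent.

```python
import numpy as np

def forward_all_outputs(x, weights, biases):
    """Conventional forward pass of a DNN acoustic model.

    Computes every hidden layer per [Equation 3] and the softmax over
    all K output nodes per [Equation 2], regardless of how many output
    states the Viterbi decoder actually needs this frame.
    """
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):   # the L hidden layers
        y = 1.0 / (1.0 + np.exp(-(W @ y + b)))    # y_j^l = sigma(a_j^l)
    a = weights[-1] @ y + biases[-1]              # output pre-activations a_k^L
    e = np.exp(a - a.max())                       # subtract max for numerical stability
    return e / e.sum()                            # softmax: y_j^L for all K nodes
```

Note that even when the Viterbi beam needs the observation probability of a single active state, this routine still evaluates all hidden layers and all K outputs; this fixed cost is what the invention targets.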

In the speech recognition system, as shown in the eighth line of the algorithm described in [Table 1], only the observation probabilities for the active states j are required.

However, when a softmax scheme such as [Equation 2] is used, all L hidden layers must be computed irrespective of the number of active states j.

That is, the L hidden layers must always be computed regardless of the number of active states. Therefore, the amount of computation of the deep neural network is always O(n) = I×M + L×(M×M) + M×K, irrespective of the number of active states actually required in Viterbi decoding.

Where I is the dimension of the input vector, M is the number of nodes per hidden layer, and K is the number of output nodes.
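As a concrete check of this fixed cost, consider hypothetical layer sizes (the values below are illustrative, not from the patent):

```python
# Hypothetical layer sizes, for illustration only.
I, M, L, K = 600, 2048, 6, 10000     # input dim, hidden width, hidden layers, output states

cost = I*M + L*(M*M) + M*K           # multiply-accumulates per frame (formula above)
print(f"{cost:,}")                   # 46,874,624 -- paid even if only one state is active
```

The M×K output-layer term and the L×M×M hidden-layer terms dominate, and none of them shrink when the decoder's beam is narrow.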

As described above, in the conventional speech recognition system, since the amount of computation of the deep neural network is fixed irrespective of the number of active states, the computation can be unnecessarily large.

In this regard, Korean Patent Laid-Open Publication No. 10-2006-0133610 (entitled "Cardiac Classification Method Using Hidden Markov Model") discloses a method of classifying heart sound data using hidden Markov models.

Embodiments of the present invention provide a deep neural network-based speech recognition system and method that can improve the speed of speech recognition by reducing the time required for the computation of the deep neural network.

It should be understood, however, that the technical scope of the present invention is not limited to the above-described technical problems, and other technical problems may exist.

According to an aspect of the present invention, there is provided a deep neural network-based speech recognition system comprising: an input unit for receiving various information; a storage unit for storing programs and information; a control unit for processing information input through the input unit using the programs; and an output unit for outputting a result processed by the control unit, wherein the control unit represents, for each output node of the deep neural network, the nodes having a large degree of influence on that node's error as a subgraph and computes only those nodes during Viterbi decoding.

According to any one of the above-described solutions, representing the nodes that strongly influence each output node's error as a subgraph and computing only those nodes during Viterbi decoding improves the computation speed of the deep neural network, and thus the speed of the deep neural network-based speech recognition system.

That is, the present invention can reduce the computational complexity of the deep neural network by making the amount of computation variable according to the number of active states.

According to the present invention, since the time required for the computation of the deep neural network is reduced, the speech recognition speed can be improved.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating the configuration of a deep neural network-based speech recognition system according to the present invention.
FIG. 2 is a diagram illustrating, in graph form, a deep neural network applied to the deep neural network-based speech recognition system according to the present invention.
FIG. 3 is a diagram illustrating a deep neural network decomposed by the deep neural network-based speech recognition system according to the present invention.

The above and other objects, advantages, and features of the present invention, and the methods of achieving them, will become apparent from the following detailed description of embodiments taken in conjunction with the accompanying drawings.

The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art; the scope of the present invention is defined by the claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, singular forms include plural forms unless the context clearly indicates otherwise. The terms "comprises" and/or "comprising," as used herein, specify the presence of the stated components, steps, and operations but do not preclude the presence or addition of one or more other components, steps, or operations.

FIG. 1 is a diagram illustrating the configuration of a deep neural network-based speech recognition system 10 according to the present invention.

The deep neural network-based speech recognition system according to the present invention (hereinafter simply referred to as the speech recognition system 10) extracts the word W that yields the maximum likelihood for a given feature parameter X, as shown in [Equation 1].

In this case, the three probability models P(X|M), P(M|W), and P(W) are the acoustic model, the pronunciation model, and the language model, respectively. The language model P(W) contains probability information about word-to-word connections, the pronunciation model P(M|W) expresses which pronunciation symbols each word is composed of, and the acoustic model P(X|M) models the probability of observing the actual feature vector X given a pronunciation symbol.

Of the three probability models, the acoustic model P(X|M) can be calculated using a deep neural network.

In other words, one of the important factors in improving speech recognition performance is the acoustic model. The acoustic model aims to learn how to distinguish phonemes using large amounts of speech data.

In a typical conventional acoustic model, a Gaussian distribution for each phoneme is estimated from the training speech data. Such a model has the advantage of fast training; however, once it reaches a certain limit, its performance saturates despite increases in training data.

As a method to solve this problem, a technique of generating an acoustic model using a deep neural network has been proposed.

In the deep neural network, the probability $y_j^L$ of the j-th node of the L-th output layer is obtained through the softmax function of [Equation 2] mentioned in the description of the related art.

That is, the pre-activations $a_k^L$ of all K nodes of the L-th output layer must be computed, and the output value of each node is then normalized by the softmax. In this case, the output of a node is obtained in the same manner as [Equation 3] mentioned in the related art.

In addition, the decoding process is implemented by the $\arg\max_W$ operation of [Table 1] mentioned in the related art.

In the speech recognition system 10, as shown in the eighth line of the algorithm described in [Table 1], only the observation probabilities for the active states j need be computed.

Conventionally, when a softmax scheme such as [Equation 2] is used, all L hidden layers have to be computed regardless of the number of active states j. That is, since the amount of computation of the deep neural network is fixed irrespective of the number of active states, it is unnecessarily large.

However, the present invention provides a deep neural network-based speech recognition system capable of improving the speed of speech recognition by reducing the time required for the computation of the deep neural network.

In particular, in order to improve the computation speed of the deep neural network in the speech recognition system 10, the nodes having a large influence on the error of each output node of the deep neural network are represented as a subgraph, and only those nodes are computed during Viterbi decoding, thereby improving the speed of the deep neural network-based speech recognition system 10.

That is, the present invention can reduce the computational complexity of the deep neural network by making the amount of computation variable according to the number of active states.

As shown in FIG. 1, the speech recognition system 10 according to the present invention includes an input unit 11 for receiving various kinds of information, a storage unit 13 for storing various programs and information, a control unit 12 for processing the information input through the input unit 11 using the programs, and an output unit 14 for outputting the result processed by the control unit 12.

Here, the storage unit 13 collectively refers to nonvolatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices.

For example, the storage unit 13 may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, or a solid-state drive (SSD); a magnetic computer storage device such as a hard disk drive (HDD); or an optical disc drive such as a CD-ROM or DVD-ROM.

The speech recognition system 10 may be constituted by, for example, a personal computer (PC) or the like.

The input unit 11 receives various information through a wired or wireless network, from a storage medium such as a USB device, or through a keyboard or scanner. The various kinds of information may include voice data.

The output unit 14 may be a monitor, a speaker, a printer, or the like.

The control unit 12 performs various functions described below.

The components shown in FIG. 1 may be implemented in software or in hardware such as an FPGA (field-programmable gate array) or ASIC (application-specific integrated circuit), and each may perform predetermined roles.

However, 'components' are not meant to be limited to software or hardware; each component may be configured to reside on an addressable storage medium and to execute on one or more processors.

Thus, by way of example, a component may include software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The components and functions provided within those components may be combined into a smaller number of components or further separated into additional components.

Hereinafter, the speech recognition system 10 according to an embodiment of the present invention will be described in more detail with reference to FIG. 2 and FIG. 3.

FIG. 2 is a graphical representation of a deep neural network applied to the deep neural network-based speech recognition system 10 according to the present invention, and FIG. 3 is an exemplary view showing the decomposed deep neural network.

That is, in the present invention, the deep neural network is interpreted as a graph. A general deep neural network as shown in FIG. 2 can be expressed in graph form as in [Equation 4].

[Equation 4]

G = {I, H, F, E}

Here, H is the set of hidden-layer nodes, I is the set of start (input) nodes, F is the set of output nodes, and E is the set of connections between the nodes. The set of all paths starting from a subset i of I and reaching a subset f of F through a subset h of H can be expressed as in [Equation 5].

[Equation 5]

P(i, h, f)

In this case, a conventional deep neural network, corresponding to [Equation 2] mentioned in the related art, is represented by P(I, H, F).
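To make the graph notation concrete, [Equation 4] might be represented with a structure like the following; this is a minimal sketch, and all names are illustrative rather than taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class DnnGraph:
    """A deep neural network viewed as a graph G = {I, H, F, E}."""
    start_nodes: set                             # I: input (start) node ids
    hidden_nodes: set                            # H: hidden-layer node ids
    output_nodes: set                            # F: output node ids
    edges: dict = field(default_factory=dict)    # E: (src, dst) -> weight

# The conventional network corresponds to P(I, H, F): every path from
# every input node through every hidden node to every output node is
# evaluated at every frame.
```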

However, in the case of speech recognition as described above, the score value need not be calculated for every output node every frame.

Also, to obtain the score value of a particular output node, there is no need to calculate for all input and hidden nodes.

For example, it is sufficient to calculate only those nodes of the hidden layer that have a large effect on the specific output node to be sought.

Accordingly, in the present invention, the control unit 12 decomposes the deep neural network into output-node-dependent subgraphs, as shown in [Equation 6].

[Equation 6]

G = {G_f}, where f ∈ F

In other words, the deep neural network G is the union of the subgraphs G_f decomposed by output node.

In the present invention, among the subgraphs decomposed by output node, only the subgraphs corresponding to the states activated during decoding are evaluated. Thus, the recognition speed can be improved.
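A decode-time sketch of this idea follows. It evaluates only the subgraphs G_f of the currently active states; the data layout, names, and the use of unnormalized scores are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def score_active_states(x, subgraphs, active_states):
    """Evaluate only the subgraphs G_f of the active output states f.

    `subgraphs[f]["layers"]` holds the reduced weight matrices of the
    hidden nodes retained for f, and `subgraphs[f]["out"]` the final
    connection into output node f (illustrative layout).
    """
    scores = {}
    for f in active_states:                      # only states on the Viterbi beam
        y = x
        for W, b in subgraphs[f]["layers"]:      # reduced hidden layers of G_f
            y = 1.0 / (1.0 + np.exp(-(W @ y + b)))
        w_out, b_out = subgraphs[f]["out"]
        scores[f] = float(w_out @ y + b_out)     # unnormalized score for state f
    return scores
```

Because the softmax of [Equation 2] normalizes over all K outputs, a practical system would have to either renormalize over the active states or work with unnormalized scores; either way, the per-frame cost now grows with the number of active states rather than with the full network.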

In this case, how to construct the subgraph corresponding to each output node is an important problem.

In general, a deep neural network trains its model parameters in the direction that minimizes error.

Thus, in order to construct the subgraph G_f for output node f, the nodes that affect the error E(f) of output node f must be selected.

In other words, the degree of influence of the i-th node of the l-th layer on the error of the f-th output node can be defined as in [Equation 7].

[Equation 7]

$\rho_{l,i}(f) = E(f)\big|_{\varepsilon_{l,i}=0} - E(f)\big|_{\varepsilon_{l,i}=1}$

where $\varepsilon_{l,i}$ is the presence indicator of the node introduced in [Equation 9] below.

When the degree of influence is defined as in [Equation 7], the subgraph G_f can be obtained as shown in [Equation 8].

[Equation 8]

$G_f = \{\, (l, i) \mid \rho_{l,i}(f) \text{ is among the } N \text{ largest values in layer } l \,\}$

That is, [Equation 8] retains, for each layer, at most N nodes with the largest influence $\rho_{l,i}(f)$.

To express the presence or absence of the i-th node of the l-th layer, [Equation 3] can be modified into [Equation 9].

[Equation 9]

$a_j^{l+1} = \sum_{i} \varepsilon_{l,i}\, w_{ji}^{l+1}\, y_i^{l} + b_j^{l+1}$

In [Equation 9], $\varepsilon_{l,i} = 1$ corresponds to the i-th node of the l-th layer being present, and $\varepsilon_{l,i} = 0$ corresponds to it being removed.

Differentiating [Equation 7] yields [Equation 10].

[Equation 10]

$\dfrac{\partial E(f)}{\partial \varepsilon_{l,i}}$

In the case of? = 0 in Equation (7), if the approximation is performed, Equation (11) can be obtained.

[Equation 11]

$E(f)\big|_{\varepsilon_{l,i}=0} \approx E(f)\big|_{\varepsilon_{l,i}=1} - \dfrac{\partial E(f)}{\partial \varepsilon_{l,i}}$

Therefore, $\rho_{l,i}(f)$ can be approximated as in [Equation 12].

[Equation 12]

$\rho_{l,i}(f) \approx -\dfrac{\partial E(f)}{\partial \varepsilon_{l,i}}$

In summary, the deep neural network decomposition method applied to the speech recognition system 10 according to the present invention generates, for each output node f, a subgraph composed of the N nodes per layer whose removal would most increase the error of f.

In other words, the present invention generates, for the output nodes of the deep neural network, subgraphs composed of the nodes with a high degree of influence on their errors, and by evaluating only those subgraphs the speed of the speech recognition system 10 can be improved.

For example, FIG. 3 schematically shows the case of N = 1, where the single node with the highest contribution to error reduction is selected per layer for each output node.
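For illustration, the selection implied by [Equation 12] can be sketched for a two-layer network as follows. The names and shapes are illustrative; under these assumptions, the influence of hidden node i on output node f reduces to a quantity proportional to |W2[f, i] * y_i|.

```python
import numpy as np

def top_n_influence(W1, b1, W2, x, f, n=1):
    """Pick the n hidden nodes most influential on output node f.

    With the gate of [Equation 9], dE(f)/d eps_i = dE/da_f * W2[f, i] * y_i.
    The scalar dE/da_f is common to every hidden node i, so the ranking
    only needs |W2[f, i] * y_i| (per [Equation 12], up to that factor).
    """
    y = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))     # hidden activations y_i
    influence = np.abs(W2[f] * y)                # proxy for rho_{l,i}(f)
    return np.argsort(influence)[-n:]            # indices of the top-n nodes of G_f

# Illustrative usage with random parameters (N = 1, as in FIG. 3):
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # 4 inputs -> 8 hidden nodes
W2 = rng.normal(size=(3, 8))                     # 8 hidden -> 3 output nodes
print(top_n_influence(W1, b1, W2, rng.normal(size=4), f=0, n=1))
```

The influence here depends on the input x; in practice one would presumably accumulate it over training data before freezing the subgraphs.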

According to one embodiment of the present invention, in order to improve the computation speed of the deep neural network in the deep neural network-based speech recognition system 10, the nodes with a high degree of influence on each output node's error are represented as a subgraph, and only those nodes are computed during Viterbi decoding, thereby improving the speed of the speech recognition system 10.

That is, the present invention can reduce the computational complexity of the deep neural network by making the amount of computation variable according to the number of active states.

Further, according to the present invention, since the time required for the computation of the deep neural network is reduced, the speech recognition speed can be improved.

The speech recognition method in the speech recognition system 10 according to an embodiment of the present invention may also be implemented as a computer program stored in a medium and executed by a computer, or as a recording medium containing computer-executable instructions. Computer-readable media may be any available media that can be accessed by a computer, and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media may also include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.

While the methods and systems of the present invention have been described in connection with specific embodiments, some or all of those elements or operations may be implemented using a computer system having a general purpose hardware architecture.

The foregoing description of the present invention is for illustrative purposes only, and those of ordinary skill in the art will understand that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. The above-described embodiments are therefore illustrative in all aspects and not restrictive. For example, each component described as a single entity may be implemented in a distributed manner, and components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as falling within the scope of the present invention.

10: speech recognition system 11: input unit
12: control unit 13: storage unit
14: output unit

Claims (1)

An input unit for receiving various information;
a storage unit for storing programs and information;
a control unit for processing information input through the input unit using the programs; and
an output unit for outputting a result processed by the control unit,
wherein the control unit represents, as a subgraph, the nodes whose degree of influence on the error of each output node of the deep neural network exceeds a predetermined threshold value, and computes those nodes upon Viterbi decoding.
KR1020160012761A 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network KR20170091903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160012761A KR20170091903A (en) 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160012761A KR20170091903A (en) 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network

Publications (1)

Publication Number Publication Date
KR20170091903A true KR20170091903A (en) 2017-08-10

Family

ID=59652430

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160012761A KR20170091903A (en) 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network

Country Status (1)

Country Link
KR (1) KR20170091903A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190100498A (en) * 2018-02-06 2019-08-29 한국전자통신연구원 Apparatus and method for correcting speech recognition result
US11429180B2 (en) 2019-01-04 2022-08-30 Deepx Co., Ltd. Trained model creation method for performing specific function for electronic device, trained model for performing same function, exclusive chip and operation method for the same, and electronic device and system using the same

