KR20170091903A - Voice recognition system and method based on deep neural network - Google Patents

Voice recognition system and method based on deep neural network

Info

Publication number
KR20170091903A
Authority
KR
South Korea
Prior art keywords
neural network
nodes
present
recognition system
speech recognition
Prior art date
Application number
KR1020160012761A
Other languages
Korean (ko)
Inventor
정훈
박전규
이성주
이윤근
최우용
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020160012761A priority Critical patent/KR20170091903A/en
Publication of KR20170091903A publication Critical patent/KR20170091903A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The purpose of the present invention is to provide a deep neural network-based speech recognition system that can improve speech recognition speed by reducing the time spent computing the deep neural network. To this end, the deep neural network-based speech recognition system according to the present invention comprises: an input unit receiving various information; a storage unit storing programs and information; a control unit processing the information input through the input unit by using the programs; and an output unit outputting the result processed by the control unit. The control unit represents, for each output node of the deep neural network, the nodes with a high degree of influence on that node's error as a subgraph, and computes those nodes during Viterbi decoding.

Description

TECHNICAL FIELD [0001] The present invention relates to a speech recognition system and method based on a deep neural network.

The present invention relates to a speech recognition system and method.

Generally, a speech recognition system extracts the word W that yields the maximum likelihood for a given feature parameter X, as shown in [Equation 1].

[Equation 1]

$\hat{W} = \underset{W}{\arg\max}\; P(X \mid M)\, P(M \mid W)\, P(W)$

In this case, the three probability models P(X|M), P(M|W), and P(W) are the acoustic model, the pronunciation model, and the language model, respectively.

The language model P(W) contains probability information about word-to-word connections.

The pronunciation model P(M|W) expresses which pronunciation symbols each word is composed of.

The acoustic model P(X|M) models the probability of observing the actual feature vector X given a pronunciation symbol.

Of the three probability models, the acoustic model P(X|M) can be calculated using a deep neural network.

In a conventional deep neural network, the probability $y_j^L$ of the j-th node of the L-th (output) layer is obtained by a softmax function, as shown in [Equation 2].

[Equation 2]

$y_j^L = \dfrac{\exp(a_j^L)}{\sum_{k=1}^{K} \exp(a_k^L)}$

That is, the pre-activations $a_k^L$ of all K nodes of the L-th output layer must be computed, and the output value of each node is then normalized by the softmax. In this case, the output of a node is obtained as shown in [Equation 3].

[Equation 3]

$a_j^{l} = \sum_{i} w_{ji}^{l}\, y_i^{l-1} + b_j^{l}, \qquad y_j^{l} = \sigma(a_j^{l})$

where $w_{ji}^{l}$ and $b_j^{l}$ are the weights and biases of the l-th layer and $\sigma$ is the activation function.

Based on [Equation 3], the observation probabilities are obtained with a computation amount of O(n).

The decoding process corresponds to the $\arg\max_W$ operation and is carried out by the Viterbi algorithm of [Table 1].

[Table 1] Viterbi decoding algorithm
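For illustration, the conventional computation of [Equation 2] and [Equation 3] can be sketched as follows. This is a minimal sketch assuming a fully connected network with sigmoid hidden units; the function name and data layout are illustrative, not taken from the patent.

```python
import numpy as np

def forward_all_outputs(x, weights, biases):
    """Conventional forward pass of a DNN acoustic model.

    Computes every hidden layer per [Equation 3] and the softmax over
    all K output nodes per [Equation 2], regardless of how many output
    states the Viterbi decoder actually needs this frame.
    """
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):   # the L hidden layers
        y = 1.0 / (1.0 + np.exp(-(W @ y + b)))    # y_j^l = sigma(a_j^l)
    a = weights[-1] @ y + biases[-1]              # output pre-activations a_k^L
    e = np.exp(a - a.max())                       # subtract max for numerical stability
    return e / e.sum()                            # softmax: y_j^L for all K nodes
```

Note that even when the Viterbi beam needs the observation probability of a single active state, this routine still evaluates all hidden layers and all K outputs; this fixed cost is what the invention targets.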

In the speech recognition system, as shown in the eighth line of the algorithm described in [Table 1], only the observation probabilities for the active states j are required.

However, when a softmax scheme such as [Equation 2] is used, all L hidden layers must be computed irrespective of the number of active states j.

That is, the L hidden layers must always be computed regardless of the number of active states. Therefore, the amount of computation of the deep neural network is always O(n) = I×M + L×(M×M) + M×K, irrespective of the number of active states actually required in Viterbi decoding.

Where I is the dimension of the input vector, M is the number of nodes per hidden layer, and K is the number of output nodes.
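As a concrete check of this fixed cost, consider hypothetical layer sizes (the values below are illustrative, not from the patent):

```python
# Hypothetical layer sizes, for illustration only.
I, M, L, K = 600, 2048, 6, 10000     # input dim, hidden width, hidden layers, output states

cost = I*M + L*(M*M) + M*K           # multiply-accumulates per frame (formula above)
print(f"{cost:,}")                   # 46,874,624 -- paid even if only one state is active
```

The M×K output-layer term and the L×M×M hidden-layer terms dominate, and none of them shrink when the decoder's beam is narrow.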

As described above, in the conventional speech recognition system, since the amount of computation of the deep neural network is fixed irrespective of the number of active states, the computation can be unnecessarily large.

In this regard, Korean Patent Laid-Open Publication No. 10-2006-0133610 (entitled "Cardiac Classification Method Using Hidden Markov Model") discloses a method of classifying heart sound data using hidden Markov models.

Embodiments of the present invention provide a deep neural network-based speech recognition system and method that can improve the speed of speech recognition by reducing the time required for the computation of the deep neural network.

It should be understood, however, that the technical scope of the present invention is not limited to the above-described technical problems, and other technical problems may exist.

According to an aspect of the present invention, there is provided a deep neural network-based speech recognition system comprising: an input unit for receiving various information; a storage unit for storing programs and information; a control unit for processing information input through the input unit using the programs; and an output unit for outputting a result processed by the control unit, wherein the control unit represents, for each output node of the deep neural network, the nodes having a large degree of influence on that node's error as a subgraph and computes only those nodes during Viterbi decoding.

According to any one of the above-described solutions, representing the nodes that strongly influence each output node's error as a subgraph and computing only those nodes during Viterbi decoding improves the computation speed of the deep neural network, and thus the speed of the deep neural network-based speech recognition system.

That is, the present invention can reduce the computational complexity of the deep neural network by making the amount of computation variable according to the number of active states.

According to the present invention, since the time required for the computation of the deep neural network is reduced, the speech recognition speed can be improved.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating the configuration of a deep neural network-based speech recognition system according to the present invention.
FIG. 2 is a diagram illustrating, in graph form, a deep neural network applied to the deep neural network-based speech recognition system according to the present invention.
FIG. 3 is a diagram illustrating a deep neural network decomposed by the deep neural network-based speech recognition system according to the present invention.

The above and other objects, advantages, and features of the present invention, and the methods of achieving them, will become apparent from the following detailed description of embodiments taken in conjunction with the accompanying drawings.

The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art; the scope of the present invention is defined by the claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, singular forms include plural forms unless the context clearly indicates otherwise. The terms "comprises" and/or "comprising," as used herein, specify the presence of the stated components, steps, and operations but do not preclude the presence or addition of one or more other components, steps, or operations.

FIG. 1 is a diagram illustrating the configuration of a deep neural network-based speech recognition system 10 according to the present invention.

The deep neural network-based speech recognition system according to the present invention (hereinafter simply referred to as the speech recognition system 10) extracts the word W that yields the maximum likelihood for a given feature parameter X, as shown in [Equation 1].

In this case, the three probability models P(X|M), P(M|W), and P(W) are the acoustic model, the pronunciation model, and the language model, respectively. The language model P(W) contains probability information about word-to-word connections, the pronunciation model P(M|W) expresses which pronunciation symbols each word is composed of, and the acoustic model P(X|M) models the probability of observing the actual feature vector X given a pronunciation symbol.

Of the three probability models, the acoustic model P(X|M) can be calculated using a deep neural network.

In other words, one of the important factors in improving speech recognition performance is the acoustic model. The acoustic model aims to learn how to distinguish phonemes using large amounts of speech data.

In a typical conventional acoustic model, a Gaussian distribution for each phoneme is estimated from the training speech data. Such a model has the advantage of fast training; however, once it reaches a certain limit, its performance saturates despite increases in training data.

As a method to solve this problem, a technique of generating an acoustic model using a deep neural network has been proposed.

In the deep neural network, the probability $y_j^L$ of the j-th node of the L-th output layer is obtained through the softmax function of [Equation 2] mentioned in the description of the related art.

That is, the pre-activations $a_k^L$ of all K nodes of the L-th output layer must be computed, and the output value of each node is then normalized by the softmax. In this case, the output of a node is obtained in the same manner as [Equation 3] mentioned in the related art.

In addition, the decoding process is implemented by the $\arg\max_W$ operation of [Table 1] mentioned in the related art.

In the speech recognition system 10, as shown in the eighth line of the algorithm described in [Table 1], only the observation probabilities for the active states j need be computed.

Conventionally, when a softmax scheme such as [Equation 2] is used, all L hidden layers have to be computed regardless of the number of active states j. That is, since the amount of computation of the deep neural network is fixed irrespective of the number of active states, it is unnecessarily large.

However, the present invention provides a deep neural network-based speech recognition system capable of improving the speed of speech recognition by reducing the time required for the computation of the deep neural network.

In particular, in order to improve the computation speed of the deep neural network in the speech recognition system 10, the nodes having a large influence on the error of each output node of the deep neural network are represented as a subgraph, and only those nodes are computed during Viterbi decoding, thereby improving the speed of the deep neural network-based speech recognition system 10.

That is, the present invention can reduce the computational complexity of the deep neural network by making the amount of computation variable according to the number of active states.

As shown in FIG. 1, the speech recognition system 10 according to the present invention includes an input unit 11 for receiving various kinds of information, a storage unit 13 for storing various programs and information, a control unit 12 for processing the information input through the input unit 11 using the programs, and an output unit 14 for outputting the result processed by the control unit 12.

Here, the storage unit 13 collectively refers to nonvolatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices.

For example, the storage unit 13 may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, or a solid-state drive (SSD); a magnetic computer storage device such as a hard disk drive (HDD); or an optical disc drive such as a CD-ROM or DVD-ROM.

The speech recognition system 10 may be constituted by, for example, a personal computer (PC) or the like.

The input unit 11 receives various information through a wired or wireless network, from a storage medium such as a USB device, or through a keyboard or scanner. The various kinds of information may include voice data.

The output unit 14 may be a monitor, a speaker, a printer, or the like.

The control unit 12 performs various functions described below.

The components shown in FIG. 1 may be implemented in software or in hardware such as an FPGA (field-programmable gate array) or ASIC (application-specific integrated circuit), and each may perform predetermined roles.

However, 'components' are not meant to be limited to software or hardware; each component may be configured to reside on an addressable storage medium and to execute on one or more processors.

Thus, by way of example, a component may include software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The components and functions provided within those components may be combined into a smaller number of components or further separated into additional components.

Hereinafter, the speech recognition system 10 according to an embodiment of the present invention will be described in more detail with reference to FIG. 2 and FIG. 3.

FIG. 2 is a graphical representation of a deep neural network applied to the deep neural network-based speech recognition system 10 according to the present invention, and FIG. 3 is an exemplary view showing the decomposed deep neural network.

That is, in the present invention, the deep neural network is interpreted as a graph. A general deep neural network as shown in FIG. 2 can be expressed in graph form as in [Equation 4].

[Equation 4]

G = {I, H, F, E}

Here, H is the set of hidden-layer nodes, I is the set of start (input) nodes, F is the set of output nodes, and E is the set of connections between the nodes. The set of all paths starting from a subset i of I and reaching a subset f of F through a subset h of H can be expressed as in [Equation 5].

[Equation 5]

P(i, h, f)

In this case, a conventional deep neural network, corresponding to [Equation 2] mentioned in the related art, is represented by P(I, H, F).
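To make the graph notation concrete, [Equation 4] might be represented with a structure like the following; this is a minimal sketch, and all names are illustrative rather than taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class DnnGraph:
    """A deep neural network viewed as a graph G = {I, H, F, E}."""
    start_nodes: set                             # I: input (start) node ids
    hidden_nodes: set                            # H: hidden-layer node ids
    output_nodes: set                            # F: output node ids
    edges: dict = field(default_factory=dict)    # E: (src, dst) -> weight

# The conventional network corresponds to P(I, H, F): every path from
# every input node through every hidden node to every output node is
# evaluated at every frame.
```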

However, in the case of speech recognition as described above, the score value need not be calculated for every output node every frame.

Also, to obtain the score value of a particular output node, there is no need to calculate for all input and hidden nodes.

For example, it is sufficient to calculate only those nodes of the hidden layer that have a large effect on the specific output node to be sought.

Accordingly, in the present invention, the control unit 12 decomposes the deep neural network into output-node-dependent subgraphs, as shown in [Equation 6].

[Equation 6]

G = {G_f}, where f ∈ F

In other words, the deep neural network G is the union of the subgraphs G_f decomposed by output node.

In the present invention, among the subgraphs decomposed by output node, only the subgraphs corresponding to the states activated during decoding are evaluated. Thus, the recognition speed can be improved.
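A decode-time sketch of this idea follows. It evaluates only the subgraphs G_f of the currently active states; the data layout, names, and the use of unnormalized scores are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def score_active_states(x, subgraphs, active_states):
    """Evaluate only the subgraphs G_f of the active output states f.

    `subgraphs[f]["layers"]` holds the reduced weight matrices of the
    hidden nodes retained for f, and `subgraphs[f]["out"]` the final
    connection into output node f (illustrative layout).
    """
    scores = {}
    for f in active_states:                      # only states on the Viterbi beam
        y = x
        for W, b in subgraphs[f]["layers"]:      # reduced hidden layers of G_f
            y = 1.0 / (1.0 + np.exp(-(W @ y + b)))
        w_out, b_out = subgraphs[f]["out"]
        scores[f] = float(w_out @ y + b_out)     # unnormalized score for state f
    return scores
```

Because the softmax of [Equation 2] normalizes over all K outputs, a practical system would have to either renormalize over the active states or work with unnormalized scores; either way, the per-frame cost now grows with the number of active states rather than with the full network.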

In this case, how to construct the subgraph corresponding to each output node is an important problem.

In general, a deep neural network trains its model parameters in the direction that minimizes error.

Thus, in order to construct the subgraph G_f for output node f, the nodes that affect the error E(f) of output node f must be selected.

In other words, the degree of influence of the i-th node of the l-th layer on the error of the f-th output node can be defined as in [Equation 7].

[Equation 7]

$\rho_{l,i}(f) = E(f)\big|_{\varepsilon_{l,i}=0} - E(f)\big|_{\varepsilon_{l,i}=1}$

where $\varepsilon_{l,i}$ is the presence indicator of the node introduced in [Equation 9] below.

When the degree of influence is defined as in [Equation 7], the subgraph G_f can be obtained as shown in [Equation 8].

[Equation 8]

$G_f = \{\, (l, i) \mid \rho_{l,i}(f) \text{ is among the } N \text{ largest values in layer } l \,\}$

That is, [Equation 8] retains, for each layer, at most N nodes with the largest influence $\rho_{l,i}(f)$.

To express the presence or absence of the i-th node of the l-th layer, [Equation 3] can be modified into [Equation 9].

[Equation 9]

$a_j^{l+1} = \sum_{i} \varepsilon_{l,i}\, w_{ji}^{l+1}\, y_i^{l} + b_j^{l+1}$

In [Equation 9], $\varepsilon_{l,i} = 1$ corresponds to the i-th node of the l-th layer being present, and $\varepsilon_{l,i} = 0$ corresponds to it being removed.

Differentiating [Equation 7] yields [Equation 10].

[Equation 10]

$\dfrac{\partial E(f)}{\partial \varepsilon_{l,i}}$

In the case of? = 0 in Equation (7), if the approximation is performed, Equation (11) can be obtained.

[Equation 11]

$E(f)\big|_{\varepsilon_{l,i}=0} \approx E(f)\big|_{\varepsilon_{l,i}=1} - \dfrac{\partial E(f)}{\partial \varepsilon_{l,i}}$

Therefore, $\rho_{l,i}(f)$ can be approximated as in [Equation 12].

[Equation 12]

$\rho_{l,i}(f) \approx -\dfrac{\partial E(f)}{\partial \varepsilon_{l,i}}$

In summary, the deep neural network decomposition method applied to the speech recognition system 10 according to the present invention generates, for each output node f, a subgraph composed of the N nodes per layer whose removal would most increase the error of f.

In other words, the present invention generates, for the output nodes of the deep neural network, subgraphs composed of the nodes with a high degree of influence on their errors, and by evaluating only those subgraphs the speed of the speech recognition system 10 can be improved.

For example, FIG. 3 schematically shows the case of N = 1, where the single node with the highest contribution to error reduction is selected per layer for each output node.
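For illustration, the selection implied by [Equation 12] can be sketched for a two-layer network as follows. The names and shapes are illustrative; under these assumptions, the influence of hidden node i on output node f reduces to a quantity proportional to |W2[f, i] * y_i|.

```python
import numpy as np

def top_n_influence(W1, b1, W2, x, f, n=1):
    """Pick the n hidden nodes most influential on output node f.

    With the gate of [Equation 9], dE(f)/d eps_i = dE/da_f * W2[f, i] * y_i.
    The scalar dE/da_f is common to every hidden node i, so the ranking
    only needs |W2[f, i] * y_i| (per [Equation 12], up to that factor).
    """
    y = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))     # hidden activations y_i
    influence = np.abs(W2[f] * y)                # proxy for rho_{l,i}(f)
    return np.argsort(influence)[-n:]            # indices of the top-n nodes of G_f

# Illustrative usage with random parameters (N = 1, as in FIG. 3):
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # 4 inputs -> 8 hidden nodes
W2 = rng.normal(size=(3, 8))                     # 8 hidden -> 3 output nodes
print(top_n_influence(W1, b1, W2, rng.normal(size=4), f=0, n=1))
```

The influence here depends on the input x; in practice one would presumably accumulate it over training data before freezing the subgraphs.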

According to one embodiment of the present invention, in order to improve the computation speed of the deep neural network in the deep neural network-based speech recognition system 10, the nodes with a high degree of influence on each output node's error are represented as a subgraph, and only those nodes are computed during Viterbi decoding, thereby improving the speed of the speech recognition system 10.

That is, the present invention can reduce the computational complexity of the deep neural network by making the amount of computation variable according to the number of active states.

Further, according to the present invention, since the time required for the computation of the deep neural network is reduced, the speech recognition speed can be improved.

The speech recognition method in the speech recognition system 10 according to an embodiment of the present invention may also be implemented as a computer program stored in a medium and executed by a computer, or as a recording medium containing computer-executable instructions. Computer-readable media may be any available media that can be accessed by a computer, and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media may also include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.

While the methods and systems of the present invention have been described in connection with specific embodiments, some or all of those elements or operations may be implemented using a computer system having a general purpose hardware architecture.

The foregoing description of the present invention is for illustrative purposes only, and those of ordinary skill in the art will understand that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. The above-described embodiments are therefore illustrative in all aspects and not restrictive. For example, each component described as a single entity may be implemented in a distributed manner, and components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as falling within the scope of the present invention.

10: speech recognition system 11: input unit
12: control unit 13: storage unit
14: output unit

Claims (1)

An input unit for receiving various information;
a storage unit for storing programs and information;
a control unit for processing information input through the input unit using the programs; and
an output unit for outputting a result processed by the control unit,
wherein the control unit represents, as a subgraph, the nodes whose degree of influence on the error of each output node of the deep neural network exceeds a predetermined threshold value, and computes those nodes upon Viterbi decoding.
KR1020160012761A 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network KR20170091903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160012761A KR20170091903A (en) 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160012761A KR20170091903A (en) 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network

Publications (1)

Publication Number Publication Date
KR20170091903A true KR20170091903A (en) 2017-08-10

Family

ID=59652430

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160012761A KR20170091903A (en) 2016-02-02 2016-02-02 Voice recognition system and method based on deep neural network

Country Status (1)

Country Link
KR (1) KR20170091903A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190100498A (en) * 2018-02-06 2019-08-29 한국전자통신연구원 Apparatus and method for correcting speech recognition result
US11429180B2 (en) 2019-01-04 2022-08-30 Deepx Co., Ltd. Trained model creation method for performing specific function for electronic device, trained model for performing same function, exclusive chip and operation method for the same, and electronic device and system using the same

