WO2024049025A1

WO2024049025A1 - Electronic device for training speech recognition model and control method thereof

Info

Publication number: WO2024049025A1
Application number: PCT/KR2023/011062
Authority: WO
Inventors: 김찬우
Original assignee: 삼성전자주식회사
Priority date: 2022-09-01
Filing date: 2023-07-28
Publication date: 2024-03-07
Also published as: KR20240031784A

Abstract

Provided are an electronic device for training a speech recognition model and a control method thereof. The control method of the electronic device for training a speech recognition model comprises the steps of: inputting a training speech sequence into a speech recognition model including a plurality of layers; obtaining a plurality of loss values respectively from output terminals of the plurality of layers; and training the speech recognition model on the basis of the plurality of loss values.

Description

Electronic device for learning a voice recognition model and method for controlling the same

This disclosure relates to an electronic device for training an end-to-end speech recognition model and a control method thereof.

Recently, with the development of artificial intelligence-related technology, voice recognition technology has been developed to recognize voices uttered by users.

Conventional speech recognition systems generally include an acoustic model (AM) that extracts acoustic features and predicts sub-word units such as phonemes, a pronunciation model (PM) that maps phoneme sequences to word sequences, and a language model (LM) that assigns probabilities to word sequences. In such conventional systems, the AM, PM, and LM were commonly trained independently on different data sets.

However, recently, an end-to-end speech recognition model has been developed that combines AM, PM, and LM components into a single neural network.

According to an embodiment of the present disclosure, a method of controlling an electronic device for training a voice recognition model includes: inputting a training voice sequence into a voice recognition model including a plurality of layers; obtaining a plurality of loss values at the output terminal of each of the plurality of layers; and training the voice recognition model based on the plurality of loss values.

According to an embodiment of the present disclosure, an electronic device for training a voice recognition model includes a memory that stores data about the voice recognition model, and at least one processor. When a training voice sequence is input to a voice recognition model including a plurality of layers, the at least one processor obtains a plurality of loss values from the output terminal of each of the plurality of layers, and trains the voice recognition model based on the plurality of loss values.

According to an embodiment of the present disclosure, in a non-transitory computer-readable recording medium including a program for executing a control method of an electronic device for training a voice recognition model, the control method includes: inputting a training voice sequence into a voice recognition model including a plurality of layers; obtaining a plurality of loss values at the output terminal of each of the plurality of layers; and training the voice recognition model based on the plurality of loss values.

FIG. 1 is a block diagram showing the configuration of an electronic device according to an embodiment of the present disclosure;

FIG. 2 is a diagram briefly explaining a voice recognition model according to an embodiment of the present disclosure;

FIG. 3A is a diagram showing the CTC model;

FIG. 3B is a diagram showing the RNN-T model;

FIG. 3C is a diagram showing the AED model;

FIG. 4A is a diagram for explaining a method of training a CTC model according to an embodiment of the present disclosure;

FIG. 4B is a diagram for explaining a method of training an RNN-T model according to an embodiment of the present disclosure;

FIG. 4C is a diagram for explaining a method of training an AED model according to an embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating a method of controlling an electronic device for learning a voice recognition model, according to an embodiment of the present disclosure.

Since the embodiments may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the scope to specific embodiments, and should be understood to include various modifications, equivalents, and/or alternatives to the embodiments of the present disclosure. In connection with the description of the drawings, similar reference numbers may be used for similar components.

In describing the present disclosure, if it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the gist of the present disclosure, the detailed description thereof will be omitted.

In addition, the following examples may be modified into various other forms, and the scope of the technical idea of the present disclosure is not limited to the following examples. Rather, these embodiments are provided to make the present disclosure more faithful and complete and to completely convey the technical idea of the present disclosure to those skilled in the art.

The terms used in this disclosure are merely used to describe specific embodiments and are not intended to limit the scope of rights. Singular expressions include plural expressions unless the context clearly dictates otherwise.

In the present disclosure, expressions such as “have,” “may have,” “include,” or “may include” indicate the presence of the corresponding feature (e.g., a component such as a numerical value, function, operation, or part) and do not rule out the existence of additional features.

In the present disclosure, expressions such as “A or B,” “at least one of A or/and B,” or “one or more of A or/and B” may include all possible combinations of the items listed together. For example, “A or B,” “at least one of A and B,” or “at least one of A or B” may refer to all of the following cases: (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.

Expressions such as “1st,” “2nd,” “first,” or “second” used in the present disclosure can modify various components regardless of order and/or importance, and are used only to distinguish one component from another without limiting the components.

When a component (e.g., a first component) is referred to as being “(operatively or communicatively) coupled with/to” or “connected to” another component (e.g., a second component), it should be understood that the component may be directly connected to the other component or may be connected through yet another component (e.g., a third component).

On the other hand, when a component (e.g., a first component) is said to be “directly coupled” or “directly connected” to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between the component and the other component.

The expression “configured to (or set to)” used in the present disclosure may be used interchangeably with, for example, “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of,” depending on the situation. The term “configured to (or set to)” does not necessarily mean “specifically designed to” in hardware.

Instead, in some contexts, the expression “a device configured to” may mean that the device is “capable of” operating together with other devices or components. For example, the phrase “a processor configured (or set) to perform A, B, and C” may refer to a dedicated processor for performing the corresponding operations (e.g., an embedded processor), or to a general-purpose processor (e.g., a CPU or application processor) capable of performing the corresponding operations by executing one or more software programs stored in a memory device.

In an embodiment, a 'module' or 'unit' performs at least one function or operation, and may be implemented as hardware or software, or as a combination of hardware and software. Additionally, a plurality of 'modules' or a plurality of 'units' may be integrated into at least one module and implemented with at least one processor, except for 'modules' or 'units' that need to be implemented with specific hardware.

Meanwhile, various elements and areas in the drawing are schematically drawn. Accordingly, the technical idea of the present invention is not limited by the relative sizes or spacing drawn in the attached drawings.

Hereinafter, with reference to the attached drawings, embodiments according to the present disclosure will be described in detail so that those skilled in the art can easily implement them.

FIG. 1 is a block diagram briefly illustrating the configuration of an electronic device 100 according to an embodiment of the present disclosure. The 'electronic device 100' according to the present disclosure refers to a device that can train a voice recognition model that receives a voice sequence corresponding to a user's voice as input and obtains a text sequence corresponding to the voice sequence. For example, the electronic device 100 may be a device such as a server, or may be a user terminal such as a smartphone or tablet PC.

As shown in FIG. 1, the electronic device 100 according to an embodiment of the present disclosure may include a memory 110 and at least one processor 120. However, the configuration shown in FIG. 1 is only an example, and other components may be added depending on the type of the electronic device 100. For example, if the electronic device 100 is implemented as a server, it may further include a communication interface for acquiring training data (e.g., a training voice sequence, a training voice signal, etc.), and if the electronic device 100 is implemented as a user terminal, it may further include a communication interface and an input interface (e.g., a microphone) for acquiring training data.

At least one instruction for controlling the electronic device 100 may be stored in the memory 110. Additionally, an operating system (O/S) for driving the electronic device 100 may be stored in the memory 110. Additionally, the memory 110 may store various software programs or applications for operating the electronic device 100 according to various embodiments of the present disclosure. Additionally, the memory 110 may include a semiconductor memory such as flash memory or a magnetic storage medium such as a hard disk.

Specifically, the memory 110 may store various software modules for operating the electronic device 100 according to various embodiments of the present disclosure, and the at least one processor 120 may control the operation of the electronic device 100 by executing the software modules stored in the memory 110. That is, the memory 110 is accessed by the at least one processor 120, and reading, writing, modifying, deleting, and updating of data may be performed by the at least one processor 120.

Meanwhile, in the present disclosure, the term memory 110 may be used to include the memory 110, a ROM (not shown) or RAM (not shown) in the processor 120, or a memory card (not shown) mounted on the electronic device 100 (e.g., a micro SD card or a memory stick).

In particular, in various embodiments according to the present disclosure, data about a voice recognition model may be stored in the memory 110. Here, the data about the voice recognition model may include information on the weights, various parameters, and nodes that make up the neural network included in the voice recognition model, and may also include training data for training the voice recognition model, input/output data of the voice recognition model, input/output data of the modules included in the voice recognition model, and the like. Additionally, the memory 110 may store information about a voice signal and voice sequence corresponding to the user's voice, and information about a text sequence corresponding to the voice sequence.

In addition, various information necessary for achieving the purpose of the present disclosure may be stored in the memory 110, and the information stored in the memory 110 may be updated as it is received from an external device or input by the user.

The at least one processor 120 controls the overall operation of the electronic device 100. Specifically, the at least one processor 120 is connected to the components of the electronic device 100, including the memory 110, and may control the overall operation of the electronic device 100 by executing at least one instruction stored in the memory 110 as described above.

The at least one processor 120 may train a voice recognition model that receives a voice sequence corresponding to a user's voice as input and obtains a text sequence corresponding to the voice sequence. In particular, in one embodiment of the present disclosure, when a training voice sequence is input to a voice recognition model including a plurality of layers, the at least one processor 120 may obtain a plurality of loss values from the output terminal of each of the plurality of layers and train the voice recognition model based on the plurality of loss values.

Here, 'voice recognition model' refers to a neural network model trained to recognize the user's voice and obtain text data corresponding to the user's voice. In particular, the speech recognition model according to the present disclosure can be configured to perform speech recognition for a preset language. The speech recognition model may be referred to as an automatic speech recognition (ASR) model. In particular, according to an embodiment of the present disclosure, the speech recognition model may be an end-to-end speech recognition model that directly predicts a text sequence (e.g., a phoneme sequence, a word sequence, etc.) corresponding to the input voice sequence.

Here, the term 'voice sequence' is used to specify a set of sequentially received voice signals when the user's voice generated by the user's utterance is sequentially received in the form of voice signals through an input means (e.g., a microphone). The voice sequence may be a signal that has undergone voice preprocessing (e.g., noise removal, time-frequency conversion, etc.).

FIG. 2 is a diagram briefly explaining a voice recognition model according to an embodiment of the present disclosure. Specifically, according to an embodiment of the present disclosure, the speech recognition model may be a sequence-to-sequence model including an encoder 210 for obtaining a hidden vector corresponding to a voice sequence and a decoder 220 for obtaining a text sequence based on the hidden vector.

The encoder 210 may be trained based on training data in a preset language (e.g., English or Korean), and may be trained to output a hidden vector corresponding to an input voice sequence. The encoder 210 may include a plurality of layers for obtaining hidden vectors corresponding to the voice sequence. Each layer may be implemented with LSTM (Long Short-Term Memory), but this is only an example, and the layers may also be implemented with GRU (Gated Recurrent Units), Conformer, CNN (Convolutional Neural Network), Transformer, etc.

The decoder 220 may output a text sequence corresponding to the voice sequence based on the hidden vector, which is the output value of the encoder 210. At this time, the decoder 220 may include various types of modules depending on the type of voice recognition. This will be explained with reference to FIGS. 3A to 3C.

In particular, the performance of the speech recognition model is basically determined by the performance of the encoder 210 rather than the decoder 220. Typically, the encoder of a speech recognition model contains 5 to 20 layers, while the decoder of a speech recognition model contains 1 to 2 layers. However, if the number of layers included in the encoder of the speech recognition model is increased too much, there is a problem in that the parameters of the layers do not converge.

FIGS. 3A to 3C are diagrams for explaining a voice recognition model according to various embodiments.

Figure 3a is a diagram for explaining a connectionist temporal classification (CTC) model. The CTC model is a speech recognition model that can obtain a text sequence by inputting a speech sequence without explicit alignment information between the input speech sequence and the text sequence.
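The CTC loss that this description relies on is not defined in the text. As a rough illustration only, the sketch below computes it with the standard CTC forward algorithm, which sums the probability of every frame-level alignment that collapses to the target labels. The function name `ctc_loss`, the use of index 0 as the blank symbol, and the toy probabilities are assumptions introduced here, not details taken from the disclosure.

```python
import math

def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (assumed blank index 0).

    probs:  one row per frame, each row holding per-class probabilities.
    target: list of class indices without blanks.
    Raises ValueError from math.log if no alignment is possible
    (for example, when the target is longer than the input).
    """
    # Extended label sequence with blanks interleaved: [b, y1, b, y2, ..., b]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    size = len(ext)

    # alpha[s] = probability of all partial alignments ending in ext[s]
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        prev, alpha = alpha, [0.0] * size
        for s in range(size):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one symbol
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

# Two frames, two classes (blank and one label), uniform probabilities.
loss = ctc_loss([[0.5, 0.5], [0.5, 0.5]], target=[1])
```

In this toy case three alignments collapse to the target (label-label, label-blank, blank-label), each with probability 0.25, so the likelihood is 0.75 and no alignment information has to be supplied by the caller.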

As shown in FIG. 3A, the CTC model may include an encoder 210 including a plurality of layers (310-1, 310-2, ..., 310-n) and a decoder 220 including a softmax module 320. The encoder 210 may obtain a hidden vector corresponding to the input voice sequence through the plurality of layers (310-1, 310-2, ..., 310-n). The decoder 220 may output a text sequence corresponding to the voice sequence at the current time, based on the input hidden vector, through the softmax module 320. Specifically, the softmax module 320 may identify, among a plurality of classes, the class corresponding to the voice sequence at the current time by normalizing the input hidden vector to values between 0 and 1, and may output a text sequence corresponding to the voice sequence according to the identification result.
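The normalization and class identification performed by the softmax module can be sketched as follows. This is a minimal illustration that assumes the hidden vector already has one entry per class; the class inventory, the input values, and the function names are hypothetical.

```python
import math

def softmax(hidden):
    """Normalize a vector of scores into values between 0 and 1 that sum to 1."""
    peak = max(hidden)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in hidden]
    total = sum(exps)
    return [e / total for e in exps]

def identify_class(hidden, classes):
    """Return the class with the highest normalized score, plus all scores."""
    probs = softmax(hidden)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs

# Hypothetical class inventory and hidden vector for a single time step.
classes = ["<blank>", "a", "b", "c"]
label, probs = identify_class([0.1, 2.0, -1.0, 0.5], classes)
```

Here `label` is the class whose score is largest after normalization, which corresponds to the text output for that time step.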

FIG. 3B is a diagram for explaining a recurrent neural network-transducer (RNN-T) model. As shown in FIG. 3B, the RNN-T model may include an encoder 210 including a plurality of layers (330-1, 330-2, ..., 330-n), and a decoder 220 including a prediction module 340, a joint module 350, and a softmax module 360.

The encoder 210 may obtain a hidden vector corresponding to the input voice sequence through a plurality of layers 330-1, 330-2,...330-n.

The prediction module 340 of the decoder 220 may include at least one layer and may convert the text sequence at time t-1 (or a previous time point) into a hidden vector and output it. For example, when the voice sequence at time t (or the current time) is converted into a first hidden vector and output by the encoder 210, the prediction module 340 may convert the text sequence at time t-1 into a second hidden vector and output it. Here, the terms 'first hidden vector' and 'second hidden vector' are used to distinguish and specify the hidden vector output through the encoder 210 and the hidden vector output through the prediction module 340. The term prediction module 340 may be replaced with the term 'prediction network module.'

The joint module 350 of the decoder 220 may output a logit vector based on the hidden vector output through the encoder 210 and the hidden vector output through the prediction module 340. For example, when the first hidden vector is output through the encoder 210 and the second hidden vector is output through the prediction module 340, the joint module 350 may output a logit vector corresponding to the voice sequence at time t based on the first hidden vector and the second hidden vector. The term joint module 350 may be replaced with the term 'joint network module.'
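The text does not say how the joint module combines the two hidden vectors. One common arrangement, shown here purely as an assumption, adds them element-wise, applies tanh, and projects the result to one logit per class. The function name `joint`, the weight matrix, and the vector values are all hypothetical.

```python
import math

def joint(first_hidden, second_hidden, out_weights):
    """Combine encoder and prediction-module hidden vectors into a logit vector."""
    # Assumed combination: element-wise sum followed by tanh.
    fused = [math.tanh(a + b) for a, b in zip(first_hidden, second_hidden)]
    # Assumed projection: one weight row per output class.
    return [sum(w * f for w, f in zip(row, fused)) for row in out_weights]

first_hidden = [0.5, -0.2]   # encoder output for the voice sequence at time t
second_hidden = [0.1, 0.4]   # prediction-module output for the text at time t-1
out_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three hypothetical classes
logits = joint(first_hidden, second_hidden, out_weights)
```

The resulting `logits` would then be passed to the softmax module to pick the output class for time t.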

The softmax module 360 of the decoder 220 may output a text sequence corresponding to the voice sequence at time t based on the input logit vector. Specifically, the softmax module 360 may identify, among a plurality of classes, the class corresponding to the voice sequence at the current time by normalizing the input logit vector to values between 0 and 1, and may output a text sequence corresponding to the voice sequence according to the identification result.

FIG. 3C is a diagram for explaining the attention-based encoder-decoder (AED) model. As shown in FIG. 3C, the AED model may include an encoder 210 including a plurality of layers (370-1, 370-2, ..., 370-n), and a decoder 220 including an attention module 380, a decoding module 390, and a softmax module 395.

The encoder 210 may obtain a hidden vector corresponding to the input voice sequence through a plurality of layers 370-1, 370-2,...370-n.

The attention module 380 may obtain attention information (e.g., a context vector) based on the hidden vector at time t obtained through the plurality of layers (370-1, 370-2, ..., 370-n) and the hidden vector at time t-1 obtained by the decoding module 390. The attention module 380 may then output the attention information to the decoding module 390.

The decoding module 390 may output a logit vector corresponding to the speech sequence at time t based on the attention information acquired at time t and the hidden vector acquired at time t-1.
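As a rough illustration of how attention information such as a context vector can be formed, the sketch below uses dot-product attention: each encoder hidden vector is scored against the decoder hidden vector from the previous step, the scores are normalized, and the encoder vectors are averaged with those weights. The scoring function, the assumption that several encoder hidden vectors are attended over, and all names and values are illustrative choices, not details stated in the disclosure.

```python
import math

def attention_context(encoder_hiddens, decoder_hidden):
    """Weight encoder hidden vectors by their similarity to the decoder state."""
    # Dot-product score between each encoder vector and the decoder state.
    scores = [sum(e * d for e, d in zip(h, decoder_hidden)) for h in encoder_hiddens]
    # Softmax over the scores gives the attention weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the encoder hidden vectors.
    dim = len(encoder_hiddens[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_hiddens)) for i in range(dim)]
    return context, weights

encoder_hiddens = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical encoder outputs
decoder_hidden = [2.0, 0.0]                 # hypothetical decoder state at t-1
context, weights = attention_context(encoder_hiddens, decoder_hidden)
```

The first encoder vector is more similar to the decoder state, so it receives the larger weight and dominates the context vector.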

The softmax module 395 may output a text sequence corresponding to the voice sequence at time t based on the input logit vector. Specifically, the softmax module 395 may identify, among a plurality of classes, the class corresponding to the voice sequence at the current time by normalizing the input logit vector to values between 0 and 1, and may output a text sequence corresponding to the voice sequence according to the identification result.

Meanwhile, conventionally, when training the speech recognition models of FIGS. 3A to 3C, the speech recognition model was trained using the loss value obtained at the output terminal of the decoder 220 or the output terminal of the encoder 210. Here, the loss value obtained at the output terminal of the encoder 210 may be a CTC loss value, and the loss value obtained at the output terminal of the decoder 220 may be a transducer loss (when the speech recognition model is an RNN-T model) or a cross-entropy (CE) loss (when the speech recognition model is an AED model), but is not limited thereto.

However, according to an embodiment of the present disclosure, at least one processor 120 may train a voice recognition model based on a plurality of loss values obtained at the output terminals of a plurality of layers included in the voice recognition model. At this time, the plurality of layers may be layers constituting the encoder 210 included in the voice recognition model.

In addition, a softmax module may be included at the output terminal of each of the plurality of layers, and the at least one processor 120 may obtain the plurality of loss values through the softmax module included at the output terminal of each of the plurality of layers.

Specifically, when the voice recognition model is a CTC model, as shown in FIG. 4A, a softmax module (410-1, 410-2, ..., 410-(n-1)) may be included at each output terminal of the plurality of layers (310-1, 310-2, ..., 310-(n-1)). Loss values (in particular, CTC loss values) can then be obtained by each softmax module (410-1, 410-2, ..., 410-(n-1)).

When the speech recognition model is an RNN-T model, as shown in FIG. 4B, a softmax module (420-1, 420-2, ..., 420-(n-1)) may be included at each output terminal of the plurality of layers (330-1, 330-2, ..., 330-(n-1)) included in the encoder 210. Loss values (in particular, CTC loss values) can then be obtained by each softmax module (420-1, 420-2, ..., 420-(n-1)).

When the voice recognition model is an AED model, as shown in FIG. 4C, a softmax module (430-1, 430-2, ..., 430-(n-1)) may be included at each output terminal of the plurality of layers (370-1, 370-2, ..., 370-(n-1)) included in the encoder 210. Loss values (in particular, CTC loss values) can then be obtained by each softmax module (430-1, 430-2, ..., 430-(n-1)).

At least one processor 120 may train a speech recognition model based on a plurality of acquired loss values. At this time, the plurality of loss values obtained may be CTC loss values obtained at the output terminal of the plurality of layers.

For example, at least one processor 120 may train a speech recognition model so that the final loss value obtained by adding the plurality of loss values obtained at the output terminals of the plurality of layers decreases.
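The idea of summing the losses obtained at each layer's output terminal can be sketched as follows. To keep the example short, a per-frame negative log-probability is used as a stand-in for the CTC loss named in the text; the function names and the toy probabilities are assumptions made for illustration.

```python
import math

def frame_nll(probs, labels):
    """Stand-in loss: mean negative log-probability of the labelled class per frame."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def layerwise_loss(layer_probs, labels, loss_fn=frame_nll):
    """Return the summed loss and the individual per-layer losses.

    layer_probs holds, for each layer, the output of the softmax module
    attached to that layer's output terminal (one probability row per frame).
    """
    per_layer = [loss_fn(probs, labels) for probs in layer_probs]
    return sum(per_layer), per_layer

# Two hypothetical layers, two frames, two classes.
layer_probs = [
    [[0.5, 0.5], [0.5, 0.5]],  # lower layer: still uninformative
    [[0.9, 0.1], [0.2, 0.8]],  # upper layer: closer to the labels
]
total, per_layer = layerwise_loss(layer_probs, labels=[0, 1])
```

Training would then adjust the model parameters so that `total` decreases, which pushes every layer, not only the last one, toward useful outputs.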

Specifically, the at least one processor 120 may train the speech recognition model so that the final loss value (L_total) obtained by Equation 1 below decreases.

Here, L_CTC(N-1) may be the CTC loss value obtained at the output terminal of the Nth layer.
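The equation referred to above is not reproduced in this text. Read together with the preceding description of the final loss as the sum of the loss values obtained at the layer output terminals, it can plausibly be written as below. The summation index i is introduced here for illustration, and whether the original sum also contains a decoder-side loss term is not stated, so this form should be taken as an assumption.

```latex
L_{total} = \sum_{i=0}^{N-1} L_{CTC(i)} = L_{CTC(0)} + L_{CTC(1)} + \cdots + L_{CTC(N-1)}
```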

Additionally, the at least one processor 120 may train the voice recognition model so that the final loss value (L_total) obtained by Equation 2 below decreases.

Here, L_CTC(N-1) is the CTC loss value obtained at the output terminal of the Nth layer, and A is a parameter between 0 and 1.

Here, A may converge to 0 as training progresses. Specifically, once the speech recognition model has been trained to a certain extent, the parameters of the layers converge, so gradually reducing the proportion of the loss values obtained at the output terminals of the middle and lower layers can help improve the performance of the speech recognition model.
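The weighted equation itself is not reproduced in this text, so its exact form is unknown. The sketch below shows one form that matches the description, in which A scales the losses of the middle and lower layers while the top-layer loss is left unscaled. This form, the function name, and the loss values are assumptions, not the equation of the disclosure.

```python
def weighted_total_loss(layer_losses, a):
    """Assumed form: A scales every layer loss except the top one."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("A must lie between 0 and 1")
    # layer_losses is ordered from the lowest layer to the top layer.
    *intermediate, top = layer_losses
    return a * sum(intermediate) + top

layer_losses = [0.9, 0.6, 0.3]  # hypothetical CTC losses, lowest layer first
early = weighted_total_loss(layer_losses, 0.7)  # intermediate layers still count
late = weighted_total_loss(layer_losses, 0.0)   # only the top-layer loss remains
```

Under this assumed form, letting A approach 0 gradually removes the influence of the middle and lower layers, which is the behavior the paragraph above describes.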

Additionally, when the final loss value becomes smaller than a threshold, the at least one processor 120 may train the speech recognition model with a decreased A. For example, if the final loss value falls below a first threshold (e.g., 0.2) while the speech recognition model is being trained with A set to 0.7, the at least one processor 120 may set A to 0.5 and continue training the speech recognition model. Additionally, if the final loss value falls below a second threshold (e.g., 0.1) while the speech recognition model is being trained with A set to 0.5, the at least one processor 120 may set A to 0.3 and continue training the speech recognition model.

That is, at least one processor 120 may reduce the A value by comparing the final loss value with a plurality of threshold values, thereby decreasing the A value as training of the speech recognition model progresses.
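The threshold-based schedule can be sketched with the example values given above (A of 0.7, 0.5, and 0.3, with thresholds of 0.2 and 0.1). The function name `next_a`, the schedule representation, and the sequence of loss values are hypothetical.

```python
def next_a(current_a, final_loss, schedule):
    """Return the A value to use after observing `final_loss`.

    schedule is a list of (threshold, new_a) pairs ordered from the
    largest threshold to the smallest. A is only ever lowered.
    """
    for threshold, new_a in schedule:
        if final_loss < threshold and new_a < current_a:
            current_a = new_a
    return current_a

# Thresholds and A values taken from the example in the text.
schedule = [(0.2, 0.5), (0.1, 0.3)]

a = 0.7
history = []
for final_loss in [0.9, 0.4, 0.19, 0.15, 0.09]:  # hypothetical loss curve
    a = next_a(a, final_loss, schedule)
    history.append(a)
```

In this sketch A never increases again if the loss later rises above a threshold. That is a design choice made here to keep A decreasing over training; the text does not specify this case.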

FIG. 5 is a diagram illustrating a control method of the electronic device 100 for learning a voice recognition model, according to an embodiment of the present disclosure.

First, the electronic device 100 inputs a training voice sequence into a voice recognition model including a plurality of layers (S510). Here, the voice recognition model may be one of the CTC model, the RNN-T model, and the AED model, but is not limited thereto. Additionally, the plurality of layers may be layers constituting an encoder included in the speech recognition model.

The electronic device 100 obtains a plurality of loss values from the output terminal of each of the plurality of layers (S520). Specifically, a softmax module may be included in the output terminal of each of the plurality of layers, and a plurality of loss values may be obtained according to the output results of each softmax module. At this time, the loss value may be a CTC loss value.

The electronic device 100 trains the voice recognition model based on the plurality of loss values (S530). Specifically, the electronic device 100 may train the voice recognition model so that the final loss value obtained by adding the plurality of loss values decreases. Alternatively, the electronic device 100 may train the voice recognition model so that the final loss value obtained by Equation 2 described above decreases. In this case, A in Equation 2 may converge to 0 as training progresses. Alternatively, when the final loss value becomes smaller than a threshold, the electronic device 100 may train the voice recognition model with a decreased A in Equation 2.

As described above, using a layerwise loss makes it easier for the parameters to converge when training a speech recognition model, which allows the speech recognition model to include very deep (i.e., a large number of) layers. Accordingly, the speech recognition performance of the speech recognition model can be further improved.

Functions related to artificial intelligence according to the present disclosure are operated through the processor and memory of the electronic device 100.

The processor may consist of one or multiple processors. At this time, one or more processors may include at least one of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU), but are not limited to the examples of the processors described above.

CPU is a general-purpose processor that can perform not only general calculations but also artificial intelligence calculations, and can efficiently execute complex programs through a multi-layer cache structure. CPUs are advantageous for serial processing, which allows organic connection between previous and next calculation results through sequential calculations. The general-purpose processor is not limited to the examples described above, except where specified as the CPU described above.

GPU is a processor for large-scale operations such as floating-point operations used in graphics processing, and can perform large-scale operations in parallel by integrating a large number of cores. In particular, GPUs may be more advantageous than CPUs in parallel processing methods such as convolution operations. Additionally, the GPU can be used as a co-processor to supplement the functions of the CPU. The processor for mass computation is not limited to the above-described example, except for the case where it is specified as the GPU.

NPU is a processor specialized in artificial intelligence calculations using artificial neural networks, and each layer that makes up the artificial neural network can be implemented in hardware (e.g., silicon). At this time, the NPU is designed specifically according to the company's requirements, so it has a lower degree of freedom than a CPU or GPU, but can efficiently process artificial intelligence calculations requested by the company. Meanwhile, as a processor specialized for artificial intelligence calculations, NPU can be implemented in various forms such as TPU (Tensor Processing Unit), IPU (Intelligence Processing Unit), and VPU (Vision processing unit). The artificial intelligence processor is not limited to the examples described above, except where specified as the NPU described above.

Additionally, one or more processors may be implemented as a System on Chip (SoC). At this time, in addition to one or more processors, the SoC may further include memory and a network interface such as a bus for data communication between the processor and memory.

If the SoC (System on Chip) included in the electronic device includes a plurality of processors, the electronic device may use some of the plurality of processors to perform artificial intelligence-related operations (e.g., operations related to training or inference of an artificial intelligence model). For example, the electronic device may perform artificial intelligence-related operations using at least one of a GPU, NPU, VPU, TPU, or hardware accelerator, among the plurality of processors, that is specialized for artificial intelligence operations such as convolution and matrix multiplication. However, this is only an example, and artificial intelligence-related operations may of course be processed using a general-purpose processor such as a CPU.

Additionally, electronic devices can perform calculations on functions related to artificial intelligence using multiple cores (eg, dual core, quad core, etc.) included in one processor. In particular, electronic devices can perform artificial intelligence operations such as convolution operations and matrix multiplication operations in parallel using multi-cores included in the processor.

One or more processors control input data to be processed according to predefined operation rules or artificial intelligence models stored in memory. Predefined operation rules or artificial intelligence models are characterized by being created through learning.

Here, being created through learning means that a predefined operation rule or artificial intelligence model with desired characteristics is created by applying a learning algorithm to a large number of learning data. This learning may be performed on the device itself that performs the artificial intelligence according to the present disclosure, or may be performed through a separate server/system.

An artificial intelligence model may be composed of multiple neural network layers. At least one layer has at least one weight value, and the operation of the layer is performed using the operation result of the previous layer and at least one defined operation. Examples of neural networks include Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Bidirectional Recurrent Deep Neural Network (BRDNN), Deep Q-Networks, and Transformer, and the neural network in this disclosure is not limited to the above-described examples except where specified.
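The layer-by-layer operation described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the fully connected layer type, the tanh activation, and the names `dense_layer`, `forward`, and `layers` are assumptions chosen only to show how each layer combines its weight values with the operation result of the previous layer, and how every layer's output can be kept for later use.

```python
import math

def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of the previous result plus bias, then tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the input through the layers in order and keep every layer's output."""
    outputs = []
    current = inputs
    for weights, biases in layers:
        # Each layer operates on the result of the previous layer.
        current = dense_layer(current, weights, biases)
        outputs.append(current)
    return outputs

# Two hypothetical layers: 2 inputs -> 2 units -> 1 unit.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
layer_outputs = forward([0.5, 0.5], layers)
```

Keeping the intermediate outputs, rather than only the last one, is what makes it possible to attach an output stage to each layer.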

A learning algorithm is a method of training a target device (e.g., a robot) using a large amount of training data so that the target device can make decisions or predictions on its own. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, and the learning algorithm in the present disclosure is not limited to the above-described examples except where specified.

Meanwhile, methods according to various embodiments of the present disclosure may be included and provided in a computer program product. A computer program product is a commodity that can be traded between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created in, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Methods according to various embodiments of the present disclosure may be implemented as software including instructions stored in a storage medium that can be read by a machine (e.g., a computer). The machine is a device capable of calling the stored instructions from the storage medium and operating according to the called instructions, and may include an electronic device (e.g., a TV) according to the disclosed embodiments.

Meanwhile, the machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' only means that the medium is a tangible device and does not contain signals (e.g., electromagnetic waves); the term does not distinguish between cases where data is stored semi-permanently in the storage medium and cases where it is stored temporarily. For example, a 'non-transitory storage medium' may include a buffer where data is temporarily stored.

When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly or by using other components under the control of the processor. Instructions may include code generated by a compiler or code executable by an interpreter.

In the above, preferred embodiments of the present disclosure have been shown and described, but the present disclosure is not limited to the specific embodiments described above. Various modifications can of course be made by those of ordinary skill in the technical field to which the disclosure pertains without departing from the gist of the present disclosure as claimed in the claims, and these modifications should not be understood separately from the technical idea or outlook of the present disclosure.

Claims

A method of controlling an electronic device 100 for training a speech recognition model, the method comprising:

inputting a training speech sequence into a speech recognition model including a plurality of layers;

obtaining a plurality of loss values respectively from the output terminals of the plurality of layers; and

training the speech recognition model based on the plurality of loss values.
According to claim 1,

the control method wherein the plurality of layers are layers constituting an encoder included in the speech recognition model.
According to claim 1,

the control method wherein each of the plurality of layers includes a softmax module at its output terminal, and

wherein the obtaining comprises

obtaining the plurality of loss values through the softmax module included in the output terminal of each of the plurality of layers.
According to claim 3,

the control method wherein each loss value is a connectionist temporal classification (CTC) loss value.
According to claim 4,

the control method wherein the training comprises

training the speech recognition model so that a final loss value obtained by adding the plurality of loss values decreases.
According to claim 4,

the control method wherein the training comprises

training the speech recognition model so that a final loss value obtained by the equation below decreases:

[Equation]

where L_total represents the final loss value, L_CTC(N) is the CTC loss value obtained at the output terminal of the Nth layer, and A is a parameter between 0 and 1.
According to claim 6,

the control method wherein A converges to 0 as training progresses.
According to claim 6,

the control method wherein the speech recognition model is trained by decreasing A when the final loss value becomes smaller than a threshold.
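The behavior of the parameter A recited above (it is decreased when the final loss value falls below a threshold, and it converges to 0 as training progresses) can be sketched as follows. The equation itself is not reproduced in this text, so the sketch does not compute the final loss; it only updates A. The function name `update_a`, the multiplicative `decay` factor, the threshold of 2.0, and the loss values are all hypothetical and are used only for illustration.

```python
def update_a(a, final_loss, threshold, decay=0.5):
    """Return the next value of A, shrinking it when the final loss is below the threshold."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("A is expected to lie between 0 and 1")
    if final_loss < threshold:
        # Assumed rule: multiply by a factor in (0, 1), so repeated
        # decreases drive A toward 0.
        return a * decay
    return a

a = 1.0
history = []
for final_loss in [4.0, 2.5, 1.8, 1.2, 0.9]:  # made-up per-epoch final loss values
    a = update_a(a, final_loss, threshold=2.0)
    history.append(a)
# history is [1.0, 1.0, 0.5, 0.25, 0.125]
```

A multiplicative decay is only one way to make A approach 0; a linear or step schedule would satisfy the same description.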
An electronic device 100 for training a speech recognition model, comprising:

a memory 110 that stores data for the speech recognition model; and

at least one processor 120,

wherein the at least one processor 120 is configured to:

when a training speech sequence is input to the speech recognition model including a plurality of layers, obtain a plurality of loss values respectively from the output terminals of the plurality of layers; and

train the speech recognition model based on the plurality of loss values.
According to claim 9,

the electronic device wherein the plurality of layers are layers constituting an encoder included in the speech recognition model.
According to claim 9,

the electronic device wherein each of the plurality of layers includes a softmax module at its output terminal, and

wherein the at least one processor 120

obtains the plurality of loss values through the softmax module included in the output terminal of each of the plurality of layers.
According to claim 11,

the electronic device wherein each loss value is a connectionist temporal classification (CTC) loss value.
According to claim 12,

the electronic device wherein the at least one processor 120

trains the speech recognition model so that a final loss value obtained by adding the plurality of loss values decreases.
According to claim 13,

the electronic device wherein the at least one processor 120

trains the speech recognition model so that a final loss value obtained by the equation below decreases:

[Equation]

where L_total represents the final loss value, L_CTC(N) is the CTC loss value obtained at the output terminal of the Nth layer, and A is a parameter between 0 and 1.
According to claim 14,

the electronic device wherein A converges to 0 as training progresses.
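As an illustration of the loss computation recited in the claims above (a CTC loss value obtained through a softmax at the output terminal of each layer, and a final loss value obtained by adding those values), the following is a minimal sketch. It is not the applicant's implementation. It assumes that each layer's output has already been projected to per-frame scores over the vocabulary (`per_layer_logits`, hypothetical), that index 0 is the CTC blank, and that the sequences are short enough to work with plain probabilities instead of log-probabilities. It computes loss values only; actual training would also need gradients, which a deep learning framework normally provides.

```python
import math

def softmax(logits):
    """Turn one frame of scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ctc_loss(frame_probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC (forward algorithm)."""
    # Extended label sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    alpha = [0.0] * len(ext)
    alpha[0] = frame_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for probs in frame_probs[1:]:
        prev = alpha
        alpha = [0.0] * len(ext)
        for s, symbol in enumerate(ext):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one position
            if s >= 2 and symbol != blank and symbol != ext[s - 2]:
                total += prev[s - 2]             # skip the blank between different labels
            alpha[s] = total * probs[symbol]
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0.0 else math.inf

def layer_ctc_losses(per_layer_logits, labels):
    """One CTC loss value per layer, each computed after that layer's softmax."""
    return [
        ctc_loss([softmax(frame) for frame in frames], labels)
        for frames in per_layer_logits
    ]

# Two hypothetical layers, two frames each, vocabulary {blank, 1}.
per_layer_logits = [
    [[0.0, 0.0], [0.0, 0.0]],   # lower layer: no preference yet
    [[0.0, 2.0], [0.0, 2.0]],   # upper layer: favors label 1
]
losses = layer_ctc_losses(per_layer_logits, labels=[1])
total = sum(losses)             # final loss value obtained by adding the per-layer values
```

In this toy input the upper layer yields the smaller loss because its softmax output already favors the target label, while the summed value reflects the contribution of both layers.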