WO2024053842A1 - Electronic device and control method thereof - Google Patents

Electronic device and control method thereof

Info

Publication number
WO2024053842A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
learned
character
voice
Application number
PCT/KR2023/010131
Other languages
French (fr)
Korean (ko)
Inventor
진시첸
김성수
박진환
한창우
Original Assignee
삼성전자주식회사 (Samsung Electronics Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 삼성전자주식회사 (Samsung Electronics Co., Ltd.)
Publication of WO2024053842A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING — G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
        • G06F3/16 Sound input; Sound output
        • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS — G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
        • G10L15/00 Speech recognition
        • G10L15/005 Language recognition
        • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
        • G10L15/063 Training
        • G10L15/18 Speech classification or search using natural language modelling
        • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
        • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
        • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to an electronic device and a control method thereof, and more specifically, to an efficient method of learning a voice recognition model and to a method of providing accurate voice recognition operations appropriate for the subject included in a voice using the learned voice recognition model.
  • a speech recognition model (Automatic Speech Recognition Model) is used to obtain information about the characters included in speech and to perform speech recognition operations.
  • when voice data is input, the voice recognition model obtains output data (or transcription data) corresponding to the voice data and identifies the meaning contained in the voice based on character data corresponding to the output data.
  • speech recognition models include probability-model-based models, which express speech and language information as individual probability models to perform speech recognition operations, and end-to-end models, which use a single deep neural network to reduce the complexity of the probability-model-based approach and perform speech recognition operations efficiently.
  • a language model (Language Model) is used to obtain a more natural, complete sentence by predicting the word that will appear after a specific word within a sentence.
  • the language model assigns a probability to each candidate word that may appear after a specific word, and a natural, complete sentence can be obtained based on that probability information.
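  • as an illustration of this next-word probability assignment, a language model can be reduced to a network that maps a prefix of tokens to a probability distribution over the vocabulary. The following is a minimal PyTorch sketch; the vocabulary size, network shapes, and module names are assumptions for illustration, not part of the disclosure.

```python
import torch
import torch.nn as nn

# Minimal next-word predictor: given a prefix of token ids, produce a
# probability distribution over the vocabulary for the word that follows.
class TinyLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)  # project hidden state to vocabulary logits

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))
        return torch.softmax(self.proj(h[:, -1]), dim=-1)  # P(next word | prefix)

lm = TinyLanguageModel()
probs = lm(torch.tensor([[3, 17, 42]]))  # probability of every candidate next word
next_word_id = probs.argmax(dim=-1)      # most likely continuation
```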
  • An electronic device for achieving the above-described object includes: a memory storing at least one instruction; and one or more processors executing the at least one instruction. The one or more processors acquire first voice data and first text data corresponding to the first voice data; learn a first prediction (Prediction) module, a projection (Projection) module, and a softmax (Softmax) module included in a first language model (Language Model) based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; learn the speech recognition model based on the first voice data and first transcription (Transcription) data obtained by inputting the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added; and learn a second language model based on second text data and second predicted text data obtained by inputting the second text data into a second language model to which the learned projection module and the learned softmax module are added. The first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may be a model that determines the character corresponding to an input voice by outputting a probability value for that character.
  • the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and the one or more processors may learn the transcription module and the integration module included in the voice recognition model based on the first voice data, the first text data, and the first transcription data while maintaining the structure and parameters of the learned first prediction module.
  • the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and the one or more processors may learn the second prediction module included in the second language model based on the second text data and the second predicted text data while maintaining the structure and parameters of the learned projection module and the learned softmax module.
  • the second text data may be text data related to a predefined domain.
  • the one or more processors may obtain second transcription data by inputting second voice data into the learned voice recognition model to which the learned first prediction module is added, and may identify the domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
  • the one or more processors may identify the second prediction module corresponding to the identified domain among a plurality of learned second prediction modules, and may obtain third transcription data by inputting the second voice data into the learned voice recognition model to which the identified second prediction module is added.
  • the learned speech recognition model to which the second prediction module is added may not include the first prediction module.
  • a method of controlling an electronic device according to an embodiment of the present disclosure includes: obtaining first voice data and first text data corresponding to the first voice data; learning a first prediction (Prediction) module, a projection (Projection) module, and a softmax (Softmax) module included in a first language model (Language Model) based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; learning the speech recognition model based on the first voice data and first transcription (Transcription) data obtained by inputting the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added; and learning a second language model based on second text data and second predicted text data obtained by inputting the second text data into a second language model to which the learned projection module and the learned softmax module are added. The first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may determine the character corresponding to an input voice by outputting a probability value for that character.
  • the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and the step of learning the speech recognition model may include learning the transcription module and the integration module included in the voice recognition model based on the first voice data, the first text data, and the first transcription data while maintaining the structure and parameters of the learned first prediction module.
  • the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and the step of learning the second language model may include learning the second prediction module included in the second language model based on the second text data and the second predicted text data while maintaining the structure and parameters of the learned projection module and the learned softmax module.
  • the second text data may be text data related to a predefined domain.
  • the control method may further include: obtaining second transcription data by inputting second voice data into the learned voice recognition model to which the learned first prediction module is added; and identifying the domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
  • identifying the domain may include identifying a second prediction module corresponding to the identified domain among a plurality of learned second prediction modules; and acquiring third transcription data by inputting the second voice data into the learned voice recognition model to which the identified second prediction module is added.
  • the learned speech recognition model to which the second prediction module is added may not include the first prediction module.
  • a recording medium according to an embodiment of the present disclosure stores a program for performing a control method that includes: obtaining first voice data and first text data corresponding to the first voice data; learning a first prediction (Prediction) module, a projection (Projection) module, and a softmax (Softmax) module included in a first language model (Language Model) based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; learning the speech recognition model based on the first voice data and first transcription data obtained by inputting the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added; and learning a second language model based on second text data and second predicted text data obtained by inputting the second text data into a second language model to which the learned projection module and the learned softmax module are added. The first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may be a model that determines the character corresponding to an input voice by outputting a probability value for that character.
  • FIG. 1 is a diagram illustrating a voice recognition model that can be used to control an electronic device according to an embodiment of the present disclosure.
  • FIG. 2 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.
  • FIG. 3 is a diagram illustrating an operation in which an electronic device learns a voice recognition model and a language model, according to an embodiment of the present disclosure.
  • FIG. 4 is a diagram illustrating an operation in which an electronic device performs voice recognition using a learned voice recognition model, according to an embodiment of the present disclosure.
  • FIG. 5 is a diagram illustrating a voice recognition model that can be used to control an electronic device according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart for explaining the operation of an electronic device according to an embodiment of the present disclosure.
  • expressions such as "have," "may have," "includes," or "may include" refer to the presence of the corresponding feature (e.g., a component such as a numerical value, function, operation, or part) and do not rule out the presence of additional features.
  • expressions such as "A or B," "at least one of A or/and B," or "one or more of A or/and B" may include all possible combinations of the items listed together.
  • for example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the cases of (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.
  • when a component (e.g., a first component) is referred to as being "connected to" or "coupled with" another component (e.g., a second component), it should be understood that the component can be connected to the other component directly or through yet another component (e.g., a third component). Conversely, when a component (e.g., a first component) is referred to as being "directly connected to" or "directly coupled with" another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between them.
  • the expression "configured to" used in the present disclosure may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the situation.
  • the term “configured (or set to)” may not necessarily mean “specifically designed to” in hardware.
  • the expression “a device configured to” may mean that the device is “capable of” working with other devices or components.
  • for example, the phrase "processor configured (or set) to perform A, B, and C" may refer to a dedicated processor for performing those operations (e.g., an embedded processor), or a general-purpose processor (e.g., a CPU or an application processor) capable of performing those operations by executing one or more software programs stored in a memory device.
  • a 'module' or 'unit' performs at least one function or operation, and may be implemented as hardware or software, or as a combination of hardware and software. Additionally, a plurality of 'modules' or a plurality of 'units' may be integrated into at least one module and implemented with at least one processor, except for 'modules' or 'units' that need to be implemented with specific hardware.
  • FIG. 1 is a diagram illustrating a voice recognition model that can be used to control an electronic device according to an embodiment of the present disclosure.
  • a voice recognition model that can be used to control an electronic device or operate as a component of an electronic device can output transcription data corresponding to the input voice data when voice data is input. Finally, the electronic device can obtain text data corresponding to the voice data based on the transcription data, and identify the meaning contained in the voice data based on the acquired text data.
  • speech recognition models include probability-model-based models, which express speech and language information as individual probability models to perform speech recognition operations, and end-to-end models, which reduce the complexity of the probability-model-based approach and perform speech recognition operations efficiently.
  • in embodiments of the present disclosure, an end-to-end speech recognition model using a single deep neural network can be used.
  • a voice recognition model may include an encoder and a decoder.
  • the encoder converts voice data into vector values in a virtual space, and the decoder outputs a character string based on the vector values output from the encoder, together with the probability values that the output string may have.
  • the character string output here may correspond to transcription data or text data corresponding to the voice data.
  • the encoder may be based on a Convolutional Neural Network (CNN) or a Deep Neural Network (DNN), and the decoder may be based on a Recurrent Neural Network (RNN), a CNN, or a DNN; however, they are not limited to these and can be implemented with various types of neural network models.
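  • the encoder/decoder split described above can be sketched as follows. This is a minimal illustration only: the audio front end (80-dimensional features), layer sizes, output alphabet, and greedy decoding are assumptions, not the configuration claimed in the disclosure.

```python
import torch
import torch.nn as nn

# Encoder: acoustic frames -> vector values in a latent ("virtual") space.
encoder = nn.Sequential(
    nn.Linear(80, 256),   # assumed 80-dim log-mel features per frame
    nn.ReLU(),
    nn.Linear(256, 256),
)

# Decoder head: latent vectors -> per-frame probabilities over output characters.
decoder = nn.Linear(256, 30)  # assumed 29 characters + 1 blank symbol

frames = torch.randn(1, 120, 80)                         # about 1.2 s of audio features
probs = torch.softmax(decoder(encoder(frames)), dim=-1)  # (1, 120, 30)
best_path = probs.argmax(dim=-1)                         # greedy character string
```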
  • An electronic device including the above-described voice recognition model may be an electronic device with various purposes and functions.
  • for example, the electronic device may be any of a variety of devices that perform computation through a processor, such as a computer, robot, speaker, display device, or other home appliance; the electronic device including a voice recognition model according to an embodiment of the present disclosure is not limited to the above-described devices.
  • the electronic device may be implemented as an electronic device having two or more functions of the above-described devices.
  • FIG. 2 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.
  • the electronic device 100 may include a memory 110 capable of storing at least one instruction and a processor 120 capable of executing at least one instruction.
  • however, the configuration of the electronic device 100 is not limited to the memory 110 and the processor 120; in addition to these, the electronic device 100 may further include components such as a user interface, a communication interface, a display, a speaker, a microphone, and a camera, as needed to perform various operations based on data acquired through the voice recognition model.
  • the memory 110 temporarily or non-temporarily stores various programs or data, and transmits the stored information to the processor 120 according to a call from the processor 120. Additionally, the memory 110 can store various information necessary for calculation, processing, or control operations of the processor 120 in an electronic format.
  • the memory 110 may include, for example, at least one of a main memory and an auxiliary memory.
  • the main memory may be implemented using semiconductor storage media such as ROM and/or RAM.
  • ROM may include, for example, conventional ROM, EPROM, EEPROM, and/or MASK-ROM.
  • RAM may include, for example, DRAM and/or SRAM.
  • the auxiliary memory may be implemented using at least one storage medium capable of storing data permanently or semi-permanently, such as a flash memory device, a Secure Digital (SD) card, a solid state drive (SSD), a hard disk drive (HDD), a magnetic drum, optical media such as a compact disk (CD), DVD, or laser disk, magnetic tape, a magneto-optical disk, and/or a floppy disk.
  • the memory 110 may store a voice recognition model and a language model. Specifically, the memory 110 may store information about the encoder and decoder included in the speech recognition model, information about the prediction module included in the language model, information about the projection module, and information about the softmax module. Information can be saved. The memory 110 may store information about the domain detection module.
  • the memory 110 may store voice data input to the voice recognition model, transcription data, or text data output from the voice recognition model.
  • the memory 110 may store text data input to the language model and predicted text data output from the language model.
  • Memory 110 may store information about the domain.
  • the memory 110 may store information about the neural network that constitutes the speech recognition model and language model. Specifically, the memory 110 can store structural information such as layers and nodes constituting a neural network and parameter information such as weights and thresholds.
  • the electronic device 100 may store various information generated, processed, or used in the process of learning and using the voice recognition model and language model.
  • the processor 120 controls the overall operation of the electronic device 100.
  • the processor 120 is connected to the configuration of the electronic device 100 including the memory 110 as described above, and executes at least one instruction stored in the memory 110 as described above, thereby controlling the electronic device ( 100) operations can be controlled overall.
  • the processor 120 may be implemented not only as a single processor 120 but also as a plurality of processors 120.
  • Processor 120 may be implemented in various ways.
  • the one or more processors 120 may include one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Accelerated Processing Unit (APU), a Many Integrated Core (MIC) processor, a Digital Signal Processor (DSP), a Neural Processing Unit (NPU), a hardware accelerator, or a machine learning accelerator.
  • One or more processors 120 may control one or any combination of other components of the electronic device 100 and may perform operations related to communication or data processing.
  • One or more processors 120 may execute one or more programs or instructions stored in the memory 110.
  • one or more processors 120 may perform a method according to an embodiment of the present disclosure by executing one or more instructions stored in the memory 110.
  • the plurality of operations may be performed by one processor 120 or may be performed by a plurality of processors 120.
  • for example, when the first operation, the second operation, and the third operation are performed by a method according to an embodiment, all three operations may be performed by the first processor 120, or the first operation and the second operation may be performed by the first processor 120 (e.g., a general-purpose processor) while the third operation is performed by the second processor 120 (e.g., a dedicated artificial intelligence processor).
  • the one or more processors 120 may be implemented as a single-core processor 120 including one core, or as one or more multi-core processors 120 including a plurality of cores (e.g., homogeneous multi-core or heterogeneous multi-core). When the one or more processors 120 are implemented as a multi-core processor 120, each of the plurality of cores included in the multi-core processor 120 may include internal processor memory such as an on-chip memory 110, and a common cache shared by the plurality of cores may be included in the multi-core processor 120.
  • additionally, each of the plurality of cores (or some of the plurality of cores) included in the multi-core processor 120 may independently read and execute program instructions for implementing a method according to an embodiment of the present disclosure, or all (or some) of the plurality of cores may be linked to read and execute those program instructions.
  • when a method according to an embodiment of the present disclosure involves a plurality of operations, the plurality of operations may be performed by one core among the plurality of cores included in the multi-core processor 120 or by multiple cores.
  • for example, when the first operation, the second operation, and the third operation are performed by a method according to an embodiment, all three operations may be performed by a first core included in the multi-core processor 120, or the first operation and the second operation may be performed by the first core while the third operation is performed by a second core included in the multi-core processor 120.
  • in embodiments of the present disclosure, the processor 120 may refer to a system-on-chip (SoC) in which one or more processors 120 and other electronic components are integrated, a single-core processor 120, a multi-core processor 120, or a core included in the single-core or multi-core processor 120, where the core may be implemented as a CPU, GPU, APU, MIC, DSP, NPU, hardware accelerator, or machine learning accelerator; however, embodiments of the present disclosure are not limited thereto.
  • One or more processors 120 may acquire first voice data and first text data corresponding to the first voice data.
  • the processor 120 can learn the first prediction module, the projection module, and the softmax module included in the first language model based on the first character data and the first predicted character data obtained by inputting the first character data into the first language model (Language Model).
  • the processor 120 can learn the speech recognition model based on the first speech data and the first transcription data obtained by inputting the first speech data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added.
  • the processor 120 can learn the second language model based on the second character data and the second predicted character data obtained by inputting the second character data into a second language model to which the learned projection module and the learned softmax module are added.
  • here, the first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may be a model that determines the character corresponding to an input voice by outputting a probability value for that character.
  • hereinafter, the control operations of the electronic device 100, including the processor 120's handling of the voice recognition model and the language model, will be described in more detail with reference to FIGS. 3 to 5.
  • FIG. 3 is a diagram illustrating an operation of the electronic device 100 learning a voice recognition model and a language model, according to an embodiment of the present disclosure.
  • the processor 120 may obtain first voice data and first text data corresponding to the first voice data.
  • the first text data may be transcription data obtained by transcribing the first voice data, but is not limited thereto, and may be text data corresponding to the meaning included in the first voice data.
  • the processor 120 may input first text data into the first language model 200 to obtain first predicted text data.
  • the language model is a model that determines the predicted character corresponding to the input character by outputting the probability value of the predicted character corresponding to the input character.
  • the processor 120 can learn the first prediction module 210, the projection module, and the softmax module 220 included in the first language model based on the first character data and the first predicted character data obtained by inputting the first character data into the first language model 200.
  • here, the prediction module 210 serves to predict and identify the characters that may appear after an input character, and the projection module and softmax module 220 serve to identify the vector values in a virtual space corresponding to the predicted characters and the probability values corresponding to those vector values.
  • the prediction module 210, projection module, and softmax module 220 included in the language model 200 may be composed of a neural network model with nodes, layers, and weights.
  • the neural network model constituting the prediction module 210, the projection module, and the softmax module 220 may be a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), or the like, but is not limited thereto and may be implemented with various types of neural network models.
  • the language model learning operation of the processor 120 may be performed based on the first character data, the first predicted character data, and loss (e.g., cross-entropy loss) information of the first language model 200. That is, the processor 120 can learn the first prediction module 210, the projection module, and the softmax module 220 by modifying their parameters (e.g., weights, thresholds, etc.) so as to minimize the loss of the language model 200.
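  • a minimal sketch of this loss-driven learning step is shown below, assuming PyTorch and illustrative module shapes: the prediction module is a recurrent network, the projection module maps hidden states to vocabulary logits, and cross-entropy loss (which folds in the softmax) is minimized over the first character data.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the modules of the first language model (200):
# prediction (210) -> projection -> softmax (220). Sizes are assumptions.
vocab_size, hidden = 1000, 256
embed = nn.Embedding(vocab_size, hidden)
prediction = nn.LSTM(hidden, hidden, batch_first=True)
projection = nn.Linear(hidden, vocab_size)
loss_fn = nn.CrossEntropyLoss()  # combines log-softmax and negative log-likelihood

params = (list(embed.parameters()) + list(prediction.parameters())
          + list(projection.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

def train_step(char_ids):
    # char_ids: (batch, seq) token ids of the first character data.
    inputs, targets = char_ids[:, :-1], char_ids[:, 1:]
    h, _ = prediction(embed(inputs))
    logits = projection(h)  # predicted-character scores at every position
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # modify weights to minimize the language-model loss
    optimizer.step()
    return loss.item()

print(train_step(torch.randint(0, vocab_size, (8, 32))))
```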
  • the processor 120 can learn the voice recognition model (Automatic Speech Recognition Model) 300 based on the first voice data and the first transcription data obtained by inputting the first voice data into the voice recognition model 300 to which the learned first prediction module 210 is added.
  • the voice recognition model is a model that determines the character corresponding to the input voice by outputting the probability value of the character corresponding to the input voice.
  • specifically, the speech recognition model 300 may include an encoder including a transcription module 310 and a decoder including an integration module 320 to which the learned first prediction module 210 is added.
  • the transcription module 310 may convert the input voice data into a vector value in a virtual space corresponding to the voice data.
  • the integration module 320 to which the learned first prediction module 210 is added can output a character string based on the vector values output from the transcription module 310, together with the probability values that the output string may have.
  • the character string may correspond to the first transcription data.
  • the transcription module 310 and integration module 320 included in the voice recognition model 300 may be composed of a neural network model with nodes, layers, and weights.
  • the neural network model constituting the transcription module 310 and the integration module 320 may be composed of a Convolutional Neural Network (CNN), Deep Neural Network (DNN), or Recurrent Neural Network (RNN). However, it is not limited to this and can be made up of various types of neural network models.
  • at this time, the processor 120 can learn the transcription module 310 and the integration module 320 included in the voice recognition model 300 based on the first voice data, the first text data corresponding to the first voice data, and the first transcription data output by the voice recognition model 300, while maintaining the structure and parameters of the learned first prediction module 210.
  • by learning the transcription module 310 and the integration module 320 while maintaining the structure and parameters of the first prediction module 210 learned as described above, the processor 120 can obtain a transcription module 310 and an integration module 320 adapted to the virtual vector space of the projection module and softmax module 220 included in the learned first language model 200. Accordingly, the processor 120 can learn the voice recognition model 300 more efficiently and improve the voice recognition accuracy of the voice recognition model 300. A sketch of this step follows.
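  • the following sketch illustrates how the prediction module can be held fixed while the transcription and integration modules learn, assuming a transducer-style combination of acoustic and linguistic vectors; the shapes and the joint computation are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

hidden, vocab_size = 256, 1000
transcription = nn.LSTM(80, hidden, batch_first=True)   # encoder side (310)
prediction = nn.LSTM(hidden, hidden, batch_first=True)  # learned module (210)
integration = nn.Linear(2 * hidden, vocab_size)         # integration module (320)

# Freeze the prediction module: its structure and parameters stay as learned.
for p in prediction.parameters():
    p.requires_grad = False

trainable = list(transcription.parameters()) + list(integration.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # only 310 and 320 update

def joint(audio_feats, char_embeds):
    enc, _ = transcription(audio_feats)   # (B, T, H) acoustic vectors
    pred, _ = prediction(char_embeds)     # (B, U, H) linguistic vectors
    T, U = enc.size(1), pred.size(1)
    # Combine every audio frame with every text position (transducer-style).
    enc = enc.unsqueeze(2).expand(-1, T, U, -1)
    pred = pred.unsqueeze(1).expand(-1, T, U, -1)
    return integration(torch.cat([enc, pred], dim=-1))  # (B, T, U, V) logits

# A transducer loss over these logits and the first transcription data would
# then drive optimizer.step(), updating only the unfrozen modules.
logits = joint(torch.randn(2, 50, 80), torch.randn(2, 10, hidden))
```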
  • the processor 120 can learn the second language model 400 based on the second character data and the second predicted character data obtained by inputting the second character data into the second language model 400 to which the learned projection module and the learned softmax module 220 are added.
  • the operation of the second language model 400, that is, of the second prediction module 410 included in the second language model 400 and of the learned projection module and the learned softmax module 220 added to it, is the same as described above for the first language model 200.
  • at this time, the second prediction module 410 can be learned for each domain so as to predict the character that will appear after an input character in relation to a predefined domain (Domain). That is, the processor 120 may learn the second prediction modules 410-1, 410-2, and 410-3 for each domain based on character data and predicted character data related to a specific domain.
  • a predefined domain can be the subject of the character data; for example, for character data such as "When cooking pasta, you must use milk to make the sauce" or "When making gamjatang, you must boil it for 40 minutes," the domain might be "Cooking."
  • the domain of text data such as "It will rain in the central region starting tomorrow afternoon” and “It will be sunny all day tomorrow” may be "Weather”.
  • the processor 120 learns the second prediction module 410 with character data corresponding to a specific domain, thereby enabling learning with a small amount of data for each specific domain and improving the prediction accuracy of the second prediction module 410.
  • additionally, by learning the second prediction module 410 included in the second language model 400 while maintaining the structure and parameters of the learned projection module and the learned softmax module 220, efficient learning is possible: the prediction modules 410-1, 410-2, and 410-3 can be learned using only character data, rather than the pairs of voice data and character data required for learning a general speech recognition model, as sketched below.
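  • a minimal sketch of this text-only, per-domain learning step follows; the domains, shapes, and data are illustrative assumptions. Only the domain-specific prediction module updates, while the projection (and, here, the embedding) stay fixed.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 256
projection = nn.Linear(hidden, vocab_size)  # learned earlier, now frozen
for p in projection.parameters():
    p.requires_grad = False
embed = nn.Embedding(vocab_size, hidden)
embed.weight.requires_grad = False          # kept fixed in this sketch as well

# One second prediction module (410-1, 410-2, ...) per predefined domain.
domain_predictors = {
    "cooking": nn.LSTM(hidden, hidden, batch_first=True),
    "weather": nn.LSTM(hidden, hidden, batch_first=True),
}
loss_fn = nn.CrossEntropyLoss()

def learn_domain(domain, char_ids):
    module = domain_predictors[domain]
    opt = torch.optim.Adam(module.parameters(), lr=1e-3)  # only 410-k updates
    inputs, targets = char_ids[:, :-1], char_ids[:, 1:]
    h, _ = module(embed(inputs))
    loss = loss_fn(projection(h).reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Text-only data: no paired audio is needed for this learning step.
learn_domain("cooking", torch.randint(0, vocab_size, (8, 32)))
```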
  • the processor 120 can learn the first language model 200, the speech recognition model 300, and the second language model 400.
  • a method of performing voice recognition using the learned model will be explained with Figure 4.
  • FIG. 4 is a diagram illustrating an operation in which the electronic device 100 performs voice recognition using a learned voice recognition model, according to an embodiment of the present disclosure.
  • the processor 120 may obtain second transcription data by inputting the second voice data into the voice recognition model 300 learned by adding the learned first prediction module 210.
  • the second voice data may be arbitrary voice data different from the first voice data, and the second transcription data may be text data corresponding to characters included in the second voice data.
  • the processor 120 may input the acquired second transcription data into the domain detection module 500 to identify the domain corresponding to the second transcription data.
  • the subject of the second voice data can be identified based on the domain identified through the domain detection module 500.
  • the processor 120 may identify the second prediction module 410 corresponding to the identified domain among the plurality of learned second prediction modules 410-1, 410-2, and 410-3.
  • the processor 120 can obtain third transcription data, or text data corresponding to the third transcription data, by inputting the second voice data into the learned voice recognition model 300 to which the identified second prediction module 410 is added.
  • the learned speech recognition model to which the second prediction module 410 is added may not include the first prediction module 210.
  • that is, final speech recognition is performed by simply changing the prediction module included in the speech recognition model 300 from the first prediction module 210 to the second prediction module 410 corresponding to a specific domain, without additional computation using a separate language model, so fast and simple voice recognition becomes possible.
  • additionally, the accuracy of voice recognition for each domain can be increased by allowing the prediction module, which is a common sub-model of the voice recognition model, to be exchanged; the inference flow is sketched below.
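  • the two-pass inference flow described above can be sketched as follows; transcribe(), detect_domain(), and the prediction attribute are hypothetical stand-ins introduced for illustration, not APIs defined by the disclosure.

```python
def recognize(second_voice_data, asr_model, first_prediction,
              domain_predictors, detect_domain, transcribe):
    # Pass 1: general-purpose recognition with the first prediction module.
    asr_model.prediction = first_prediction
    second_transcription = transcribe(asr_model, second_voice_data)

    # Identify the subject (domain) of the utterance from the transcript.
    domain = detect_domain(second_transcription)  # e.g., "cooking", "weather"

    # Pass 2: swap in the domain-specific second prediction module (410-k);
    # the first prediction module is no longer part of the model.
    asr_model.prediction = domain_predictors[domain]
    third_transcription = transcribe(asr_model, second_voice_data)
    return third_transcription
```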
  • FIG. 5 is a diagram illustrating a voice recognition model that can be used to control the electronic device 100 according to an embodiment of the present disclosure.
  • referring to FIG. 5, the domain related to the voice is identified first, final voice recognition is then performed based on the identified domain, and transcription data is output, so accurate voice recognition operations can be performed.
  • additionally, since the speech recognition model according to the present disclosure is learned with the structure and parameters of the learned prediction module, projection module, and softmax module fixed, accurate and efficient learning is possible with only a small amount of data.
  • FIG. 6 is a flowchart for explaining the operation of the electronic device 100 according to an embodiment of the present disclosure.
  • the electronic device 100 may obtain first voice data and first text data corresponding to the first voice data (S610).
  • the electronic device 100 can learn the first prediction module, the projection module, and the softmax module included in the first language model based on the first character data and the first predicted character data obtained by inputting the first character data into the first language model (S620).
  • the electronic device 100 can learn the voice recognition model based on the first voice data and the first transcription data obtained by inputting the first voice data into the voice recognition model to which the learned first prediction module is added (S630).
  • at this time, the electronic device 100 can learn the transcription module and the integration module included in the voice recognition model based on the first voice data, the first character data, and the first transcription data while maintaining the structure and parameters of the learned first prediction module.
  • the electronic device 100 can learn the second language model based on the second character data and the second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added (S640).
  • at this time, the electronic device 100 can learn the second prediction module included in the second language model based on the second character data and the second predicted character data while maintaining the structure and parameters of the learned projection module and the learned softmax module.
  • additionally, the electronic device 100 can obtain second transcription data by inputting second voice data into the voice recognition model learned with the learned first prediction module added, and can identify the domain corresponding to the second transcription data by inputting the obtained second transcription data into the domain detection module.
  • the electronic device 100 can identify the second prediction module corresponding to the identified domain among the plurality of learned second prediction modules, and can obtain third transcription data by inputting the second voice data into the learned voice recognition model to which the identified second prediction module is added.
  • the learned speech recognition model to which the second prediction module is added may not include the first prediction module.
  • Functions related to artificial intelligence are operated through the processor and memory of the electronic device.
  • the processor may consist of one or multiple processors.
  • one or more processors may include at least one of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU), but are not limited to the examples of the processors described above.
  • CPU is a general-purpose processor that can perform not only general calculations but also artificial intelligence calculations, and can efficiently execute complex programs through a multi-layer cache structure. CPUs are advantageous for serial processing, which allows organic connection between previous and next calculation results through sequential calculations.
  • the general-purpose processor is not limited to the above-described examples, except where specified as the above-described CPU.
  • GPU is a processor for large-scale operations such as floating-point operations used in graphics processing, and can perform large-scale operations in parallel by integrating a large number of cores.
  • GPUs may be more advantageous than CPUs in parallel processing methods such as convolution operations.
  • the GPU can be used as a co-processor to supplement the functions of the CPU.
  • the processor for mass computation is not limited to the above-described example, except for the case specified as the above-described GPU.
  • NPU is a processor specialized in artificial intelligence calculations using artificial neural networks, and each layer that makes up the artificial neural network can be implemented in hardware (e.g., silicon). At this time, the NPU is designed specifically according to the company's requirements, so it has a lower degree of freedom than a CPU or GPU, but can efficiently process artificial intelligence calculations requested by the company. Meanwhile, as a processor specialized for artificial intelligence calculations, NPU can be implemented in various forms such as TPU (Tensor Processing Unit), IPU (Intelligence Processing Unit), and VPU (Vision processing unit).
  • the artificial intelligence processor is not limited to the examples described above, except where specified as the NPU described above.
  • one or more processors may be implemented as a System on Chip (SoC).
  • in addition to the one or more processors, the SoC may further include a memory and a network interface, such as a bus, for data communication between the processor and the memory.
  • in this case, the electronic device may perform artificial-intelligence-related operations (for example, operations related to learning or inference of an artificial intelligence model) using some of the plurality of processors.
  • for example, the electronic device can perform artificial-intelligence-related operations using at least one of a GPU, NPU, VPU, TPU, or hardware accelerator among its plurality of processors that is specialized for artificial intelligence operations such as convolution and matrix multiplication.
  • this is only an example, and of course, calculations related to artificial intelligence can be processed using general-purpose processors such as CPUs.
  • electronic devices can perform calculations for functions related to artificial intelligence using multiple cores (e.g., dual-core, quad-core, etc.) included in one processor.
  • electronic devices can perform artificial intelligence operations such as convolution operations and matrix multiplication operations in parallel using multi-cores included in the processor.
  • One or more processors control input data to be processed according to predefined operation rules or artificial intelligence models stored in memory.
  • Predefined operation rules or artificial intelligence models are characterized by being created through learning.
  • being created through learning means that a predefined operation rule or artificial intelligence model with desired characteristics is created by applying a learning algorithm to a large number of learning data.
  • This learning may be performed on the device itself that performs the artificial intelligence according to the present disclosure, or may be performed through a separate server/system.
  • An artificial intelligence model may be composed of multiple neural network layers. At least one layer has at least one weight value, and the operation of the layer is performed using the operation result of the previous layer and at least one defined operation.
  • Examples of neural networks include a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), and Deep Q-Networks, but the neural networks in the present disclosure are not limited to the above examples except where specified.
  • a learning algorithm is a method of training a target device (eg, a robot) using a large number of learning data so that the target device can make decisions or make predictions on its own.
  • Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but the learning algorithm in the present disclosure is not limited to the above examples except where specified.
  • Computer program products are commodities and can be traded between sellers and buyers.
  • the computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones).
  • in the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be temporarily stored in, or temporarily created on, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are an electronic device and control method thereof. The electronic device disclosed herein: acquires first speech data and first text data corresponding to the first speech data; trains a first prediction module, a projection module, and a softmax module included in a first language model on the basis of the first text data and first predicted text data acquired by inputting the first text data to the first language model; trains a speech recognition model on the basis of the first speech data and first transcription data acquired by inputting the first speech data to an automatic speech recognition model to which the trained first prediction module has been added; and trains a second language model on the basis of second text data and second predicted text data acquired by inputting the second text data to the second language model to which the trained projection module and softmax module have been added.

Description

전자 장치 및 이의 제어 방법 Electronic devices and methods for controlling the same
본 개시는 전자 장치 및 이의 제어 방법에 관한 것으로, 더욱 상세하게는, 효율적인 음성인식 모델의 학습 방법과 학습된 음성인식 모델을 이용하여 음성에 포함된 주제에 맞는 정확한 음성인식 동작을 제공할 수 있는 전자 장치 및 이의 제어 방법에 관한 것이다.The present disclosure relates to an electronic device and a control method thereof, and more specifically, to an efficient voice recognition model learning method and a method that can provide accurate voice recognition operations appropriate for the subject included in the voice using the learned voice recognition model. It relates to electronic devices and their control methods.
음성에 포함된 문자에 대한 정보를 획득하여 음성인식 동작을 수행하기 위해 음성인식 모델(Automatic Speech Recognition Model)이 활용된다. 음성인식 모델은 음성 데이터가 입력되면 음성 데이터에 대응되는 출력 데이터(또는 전사(Transcription) 데이터)를 획득하고, 출력 데이터에 대응되는 문자 데이터에 기초하여 음성에 포함된 의미를 식별할 수 있도록 한다.A speech recognition model (Automatic Speech Recognition Model) is used to obtain information about characters included in speech and perform speech recognition operations. When voice data is input, the voice recognition model obtains output data (or transcription data) corresponding to the voice data and identifies the meaning contained in the voice based on character data corresponding to the output data.
음성인식 모델로는 음성/언어 정보를 개별 확률 모델로 표현하여 음성인식 동작을 수행하는 확률 모델 기반의 음성인식 모델과 확률 모델 기반의 음성인식 모델의 복잡도를 낮추고 효율적인 음성인식 동작을 수행할 수 있도록 단일 심층 신경망을 사용하는 종단형 음성인식 모델 등이 존재한다.The speech recognition model includes a probability model-based speech recognition model that expresses speech/language information as an individual probability model to perform speech recognition operations, and a probability model-based speech recognition model that reduces the complexity of the probability model-based speech recognition model and performs efficient speech recognition operations. There are longitudinal voice recognition models that use a single deep neural network.
문장 내에서 특정 단어 다음에 나올 단어를 예측하여 보다 자연스러운 완결된 문장을 획득하기 위해 언어 모델(Language Model)이 활용된다. 언어 모델은 특정 단어 다음에 나올 단어에 대한 확률을 할당하고, 단어에 대한 확률 정보에 기초하여 자연스러운 완결된 문장을 획득할 수 있다.A language model is used to obtain a more natural, complete sentence by predicting the word that will appear after a specific word within a sentence. The language model assigns a probability to the word that appears after a specific word, and can obtain a natural, complete sentence based on the probability information about the word.
여기서, 음성인식 모델에 언어 모델에 포함된 문자 예측 모듈(Prediction Module)을 도입하면 더욱 효율적이고 정확한 음성인식 동작을 수행할 수 있다.Here, by introducing a character prediction module included in the language model into the speech recognition model, more efficient and accurate speech recognition operations can be performed.
상술한 목적을 달성하기 위한 본 실시 예에 따른 전자 장치는, 적어도 하나의 인스트럭션을 저장하는 메모리; 및 상기 적어도 하나의 인스트럭션을 실행하는 하나 이상의 프로세서;를 포함하고, 상기 하나 이상의 프로세서는, 제1 음성 데이터 및 상기 제1 음성 데이터에 대응되는 제1 문자 데이터를 획득하고, 상기 제1 문자 데이터 및 상기 제1 문자 데이터를 제1 언어 모델(Language Model)에 입력하여 획득한 제1 예측 문자 데이터에 기초하여 상기 제1 언어 모델에 포함된 제1 예측(Prediction) 모듈, 투사(Projection) 모듈 및 소프트맥스(Softmax) 모듈을 학습하고, 상기 제1 음성 데이터 및 상기 제1 음성 데이터를 상기 학습된 제1 예측 모듈이 부가된 음성인식 모델(Automatic Speech Recognition Model)에 입력하여 획득한 제1 전사(Transcription) 데이터에 기초하여 상기 음성인식 모델을 학습하고, 제2 문자 데이터 및 상기 제2 문자 데이터를 상기 학습된 투사 모듈 및 상기 학습된 소프트맥스 모듈이 부가된 제2 언어 모델에 입력하여 획득한 제2 예측 문자 데이터에 기초하여 상기 제2 언어 모델을 학습하고, 상기 제1 언어 모델 및 제2 언어 모델은, 입력 문자에 대응되는 예측 문자의 확률 값을 출력하여 입력 문자에 대응되는 예측 문자를 판단하는 모델이고, 상기 음성인식 모델은, 입력 음성에 대응되는 문자의 확률 값을 출력하여 입력 음성에 대응되는 문자를 판단하는 모델일 수 있다.An electronic device according to this embodiment for achieving the above-described object includes: a memory storing at least one instruction; and one or more processors executing the at least one instruction, wherein the one or more processors acquire first voice data and first text data corresponding to the first voice data, and execute the first text data and A first prediction module, a projection module, and a soft software included in the first language model based on the first prediction character data obtained by inputting the first character data into a first language model. A first transcription obtained by learning the Softmax module and inputting the first voice data and the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added. ) Learning the speech recognition model based on data, and inputting the second text data and the second text data into a second language model to which the learned projection module and the learned softmax module are added, a second language obtained The second language model is learned based on the predicted character data, and the first language model and the second language model output a probability value of the predicted character corresponding to the input character to determine the predicted character corresponding to the input character. model, and the voice recognition model may be a model that determines the character corresponding to the input voice by outputting the probability value of the character corresponding to the input voice.
Meanwhile, the speech recognition model may include an encoder including a transcription module and a decoder including an integration module to which the trained first prediction module is added, and the one or more processors may train the transcription module and the integration module included in the speech recognition model based on the first speech data, the first text data, and the first transcription data while keeping the structure and parameters of the trained first prediction module fixed.

Meanwhile, the second language model may include a second prediction module, the trained projection module, and the trained softmax module, and the one or more processors may train the second prediction module included in the second language model based on the second text data and the second predicted text data while keeping the structure and parameters of the trained projection module and the trained softmax module fixed.

Meanwhile, the second text data may be text data related to a predefined domain.

Meanwhile, the one or more processors may obtain second transcription data by inputting second speech data into the trained speech recognition model to which the trained first prediction module is added, and may identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.

Meanwhile, the one or more processors may identify, among a plurality of trained second prediction modules, a second prediction module corresponding to the identified domain, and may obtain third transcription data by inputting the second speech data into the trained speech recognition model to which the identified second prediction module is added.

Meanwhile, the trained speech recognition model to which the second prediction module is added may not include the first prediction module.
A method of controlling an electronic device according to an embodiment of the present disclosure includes: obtaining first speech data and first text data corresponding to the first speech data; training a first prediction module, a projection module, and a softmax module included in a first language model based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; training a speech recognition model based on the first speech data and first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added; and training a second language model based on second text data and second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added, wherein the first and second language models are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character, and the speech recognition model may determine a character corresponding to an input speech by outputting a probability value of the character.
Meanwhile, the speech recognition model may include an encoder including a transcription module and a decoder including an integration module to which the trained first prediction module is added, and training the speech recognition model may include training the transcription module and the integration module included in the speech recognition model based on the first speech data, the first text data, and the first transcription data while keeping the structure and parameters of the trained first prediction module fixed.

Meanwhile, the second language model may include a second prediction module, the trained projection module, and the trained softmax module, and training the second language model may include training the second prediction module included in the second language model based on the second text data and the second predicted text data while keeping the structure and parameters of the trained projection module and the trained softmax module fixed.

Meanwhile, the second text data may be text data related to a predefined domain.

Meanwhile, the control method may further include: obtaining second transcription data by inputting second speech data into the trained speech recognition model to which the trained first prediction module is added; and identifying a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.

Meanwhile, identifying the domain may further include: identifying, among a plurality of trained second prediction modules, a second prediction module corresponding to the identified domain; and obtaining third transcription data by inputting the second speech data into the trained speech recognition model to which the identified second prediction module is added.

Meanwhile, the trained speech recognition model to which the second prediction module is added may not include the first prediction module.

In a non-transitory computer-readable recording medium according to an embodiment of the present disclosure, storing computer instructions that, when executed by a processor of an electronic device, cause the electronic device to perform operations, the operations include: obtaining first speech data and first text data corresponding to the first speech data; training a first prediction module, a projection module, and a softmax module included in a first language model based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; training a speech recognition model based on the first speech data and first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added; and training a second language model based on second text data and second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added, wherein the first and second language models are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character, and the speech recognition model may be a model that determines a character corresponding to an input speech by outputting a probability value of the character.
Aspects, features, and advantages of specific embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings.

FIG. 1 is a diagram illustrating a speech recognition model that can be used to control an electronic device, according to an embodiment of the present disclosure.

FIG. 2 is a block diagram illustrating the configuration of an electronic device, according to an embodiment of the present disclosure.

FIG. 3 is a diagram illustrating an operation in which an electronic device trains a speech recognition model and language models, according to an embodiment of the present disclosure.

FIG. 4 is a diagram illustrating an operation in which an electronic device performs speech recognition using a trained speech recognition model, according to an embodiment of the present disclosure.

FIG. 5 is a diagram illustrating a speech recognition model that can be used to control an electronic device, according to an embodiment of the present disclosure.

FIG. 6 is a flowchart illustrating the operation of an electronic device, according to an embodiment of the present disclosure.
Since the present embodiments may be variously modified and may have several forms, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the scope to the specific embodiments, and should be understood to include various modifications, equivalents, and/or alternatives of the embodiments of the present disclosure. In connection with the description of the drawings, similar reference numerals may be used for similar components.

In describing the present disclosure, when it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the gist of the present disclosure, the detailed description thereof is omitted.

In addition, the following embodiments may be modified in various other forms, and the scope of the technical idea of the present disclosure is not limited to the following embodiments. Rather, these embodiments are provided to make the present disclosure more thorough and complete and to fully convey the technical idea of the present disclosure to those skilled in the art.

The terms used in the present disclosure are used merely to describe specific embodiments and are not intended to limit the scope of rights. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In the present disclosure, expressions such as "have," "may have," "include," or "may include" indicate the presence of the corresponding feature (e.g., a component such as a numerical value, function, operation, or part) and do not exclude the presence of additional features.

In the present disclosure, expressions such as "A or B," "at least one of A and/or B," or "one or more of A and/or B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the cases of (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.

Expressions such as "first," "second," and the like used in the present disclosure may modify various components regardless of order and/or importance, and are used only to distinguish one component from another component without limiting those components.

When a component (e.g., a first component) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another component (e.g., a second component), it should be understood that the component may be directly connected to the other component or may be connected through yet another component (e.g., a third component).

On the other hand, when a component (e.g., a first component) is referred to as being "directly connected" or "directly coupled" to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between the component and the other component.

The expression "configured (or set) to" used in the present disclosure may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the situation. The term "configured (or set) to" may not necessarily mean only "specifically designed to" in terms of hardware.

Instead, in some situations, the expression "a device configured to" may mean that the device is "capable of" performing an operation together with other devices or components. For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing the corresponding operations, or a generic-purpose processor (e.g., a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in a memory device.

In the embodiments, a "module" or "unit" performs at least one function or operation, and may be implemented as hardware, software, or a combination of hardware and software. In addition, a plurality of "modules" or a plurality of "units" may be integrated into at least one module and implemented as at least one processor, except for a "module" or "unit" that needs to be implemented as specific hardware.

Meanwhile, the various elements and regions in the drawings are drawn schematically. Accordingly, the technical idea of the present disclosure is not limited by the relative sizes or spacings drawn in the accompanying drawings.

Hereinafter, embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them.
FIG. 1 is a diagram illustrating a speech recognition model that can be used to control an electronic device, according to an embodiment of the present disclosure.

Referring to FIG. 1, a speech recognition model that can be used to control an electronic device, or that can operate as a component of an electronic device, may output transcription data corresponding to input speech data when the speech data is input. The electronic device may then obtain text data corresponding to the speech data based on the transcription data, and may identify the meaning contained in the speech data based on the obtained text data.

As the speech recognition model, a probability-model-based speech recognition model, which expresses speech/language information as individual probability models to perform the speech recognition operation, or an end-to-end speech recognition model, which uses a single deep neural network to reduce the complexity of the probability-model-based approach and perform the speech recognition operation efficiently, may be used.

The speech recognition model may include an encoder and a decoder. The encoder converts speech data into vector values in a virtual space, and the decoder outputs a character string based on the vector values output from the encoder and the probability values that the output string can take. The output string may correspond to transcription data or text data corresponding to the speech data.

Here, the encoder may be based on a convolutional neural network (CNN) or a deep neural network (DNN), and the decoder may be based on a recurrent neural network (RNN), a CNN, or a DNN; however, they are not limited thereto and may be implemented with various types of neural network models.
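For readers who want a concrete picture, the encoder-decoder split described above can be sketched as follows. This is a minimal illustration only, assuming PyTorch and arbitrary feature and vocabulary sizes; it is not the implementation of the disclosed embodiments.

```python
# Minimal sketch of the encoder/decoder split; all sizes are assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps acoustic features (B, T, F) to vectors in a virtual space."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, feats):
        return self.net(feats)  # (B, T, hidden_dim)

class Decoder(nn.Module):
    """Outputs per-frame character probabilities from encoder vectors."""
    def __init__(self, hidden_dim=256, vocab_size=100):
        super().__init__()
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc):
        h, _ = self.rnn(enc)
        return self.out(h).log_softmax(dim=-1)  # (B, T, vocab_size)

feats = torch.randn(2, 50, 80)           # two utterances, 50 frames each
log_probs = Decoder()(Encoder()(feats))  # character log-probabilities
```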
The electronic device including the above-described speech recognition model may be an electronic device having various purposes and functions. For example, the electronic device may be any of various electronic devices that perform computation through a processor, such as a computer, a robot, a speaker, a display device, or another home appliance, and the electronic device including the speech recognition model according to an embodiment of the present disclosure is not limited to the above-described devices. In addition, the electronic device may be implemented as an electronic device having two or more functions of the above-described devices.

FIG. 2 is a block diagram illustrating the configuration of an electronic device, according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a memory 110 capable of storing at least one instruction and a processor 120 capable of executing the at least one instruction.

The configuration of the electronic device 100 is not limited to the above-described memory 110 and processor 120; in addition to the memory 110 and the processor 120, the electronic device 100 may further include components necessary to perform various operations based on data acquired through the speech recognition model, such as a user interface, a communication interface, a display, a speaker, a microphone, and a camera.
The memory 110 temporarily or non-temporarily stores various programs and data, and transfers the stored information to the processor 120 in response to a call from the processor 120. In addition, the memory 110 may store, in an electronic format, various kinds of information necessary for the computation, processing, or control operations of the processor 120.

The memory 110 may include, for example, at least one of a main memory and an auxiliary memory. The main memory may be implemented using a semiconductor storage medium such as a ROM and/or a RAM. The ROM may include, for example, a conventional ROM, an EPROM, an EEPROM, and/or a mask ROM. The RAM may include, for example, a DRAM and/or an SRAM. The auxiliary memory may be implemented using at least one storage medium capable of permanently or semi-permanently storing data, such as a flash memory device, a Secure Digital (SD) card, a solid state drive (SSD), a hard disk drive (HDD), a magnetic drum, an optical medium such as a compact disc (CD), a DVD, or a laser disc, a magnetic tape, a magneto-optical disc, and/or a floppy disc.

The memory 110 may store the speech recognition model and the language models. Specifically, the memory 110 may store information about the encoder and the decoder included in the speech recognition model, and may store information about the prediction module, the projection module, and the softmax module included in a language model. The memory 110 may also store information about the domain detection module.

The memory 110 may store speech data input to the speech recognition model and transcription data or text data output from the speech recognition model. The memory 110 may store text data input to a language model and predicted text data output from the language model. The memory 110 may store information about domains.

The memory 110 may store information about the neural networks constituting the speech recognition model and the language models. Specifically, the memory 110 may store structural information such as the layers and nodes constituting a neural network, and parameter information such as weights and thresholds.

However, the memory is not limited thereto, and may store various kinds of information that the electronic device 100 generates, processes, or uses in the course of training and using the speech recognition model and the language models.
The processor 120 controls the overall operation of the electronic device 100. Specifically, the processor 120 is connected to the components of the electronic device 100, including the memory 110 described above, and can control the overall operation of the electronic device 100 by executing the at least one instruction stored in the memory 110. In particular, the processor 120 may be implemented not only as a single processor but also as a plurality of processors.

The processor 120 may be implemented in various ways. For example, the one or more processors 120 may include one or more of a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a many integrated core (MIC) processor, a digital signal processor (DSP), a neural processing unit (NPU), a hardware accelerator, and a machine learning accelerator. The one or more processors 120 may control one or any combination of the other components of the electronic device 100, and may perform operations related to communication or data processing. The one or more processors 120 may execute one or more programs or instructions stored in the memory 110. For example, the one or more processors 120 may perform a method according to an embodiment of the present disclosure by executing one or more instructions stored in the memory 110.

When a method according to an embodiment of the present disclosure includes a plurality of operations, the plurality of operations may be performed by one processor or by a plurality of processors. For example, when a first operation, a second operation, and a third operation are performed by the method according to an embodiment, all of the first to third operations may be performed by a first processor, or the first and second operations may be performed by the first processor (e.g., a generic-purpose processor) and the third operation may be performed by a second processor (e.g., a dedicated artificial intelligence processor).

The one or more processors 120 may be implemented as a single-core processor including one core, or as one or more multi-core processors including a plurality of cores (e.g., homogeneous or heterogeneous multi-core). When the one or more processors 120 are implemented as a multi-core processor, each of the plurality of cores included in the multi-core processor may include a processor-internal memory such as an on-chip memory, and a common cache shared by the plurality of cores may be included in the multi-core processor. In addition, each of the plurality of cores (or some of the plurality of cores) included in the multi-core processor may independently read and execute program instructions for implementing a method according to an embodiment of the present disclosure, or all (or some) of the plurality of cores may read and execute such program instructions in cooperation.

When a method according to an embodiment of the present disclosure includes a plurality of operations, the plurality of operations may be performed by one core among the plurality of cores included in the multi-core processor, or by the plurality of cores. For example, when a first operation, a second operation, and a third operation are performed by the method according to an embodiment, all of the first to third operations may be performed by a first core included in the multi-core processor, or the first and second operations may be performed by the first core and the third operation may be performed by a second core included in the multi-core processor.

In the embodiments of the present disclosure, the processor 120 may mean a system on chip (SoC) in which one or more processors and other electronic components are integrated, a single-core processor, a multi-core processor, or a core included in a single-core or multi-core processor; here, the core may be implemented as a CPU, a GPU, an APU, a MIC processor, a DSP, an NPU, a hardware accelerator, or a machine learning accelerator, but the embodiments of the present disclosure are not limited thereto.
The one or more processors 120 (hereinafter, the 'processor 120') may obtain first speech data and first text data corresponding to the first speech data.

The processor 120 may train the first prediction module, the projection module, and the softmax module included in a first language model based on the first text data and first predicted text data obtained by inputting the first text data into the first language model.

The processor 120 may train a speech recognition model based on the first speech data and first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added.

The processor 120 may train a second language model based on second text data and second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added.

Here, the first and second language models are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character, and the speech recognition model may be a model that determines a character corresponding to an input speech by outputting a probability value of the character.

The control operation of the electronic device 100 including the speech recognition model and the language models of the processor 120 will be described in more detail with reference to FIGS. 3 to 5.
FIG. 3 is a diagram illustrating an operation in which the electronic device 100 trains a speech recognition model and language models, according to an embodiment of the present disclosure.

Referring to FIG. 3, the processor 120 may obtain first speech data and first text data corresponding to the first speech data. Here, the first text data may be transcription data obtained by transcribing the first speech data, but is not limited thereto, and may be text data corresponding to the meaning contained in the first speech data.

The processor 120 may obtain first predicted text data by inputting the first text data into the first language model 200.

Here, a language model is a model that determines a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character.

The processor 120 may train the first prediction module 210 and the projection and softmax modules 220 included in the first language model based on the first text data and the first predicted text data obtained by inputting the first text data into the first language model 200.

Here, the prediction module 210 serves to predict and identify the character that may follow the input characters, and the projection and softmax modules 220 serve to identify the vector value in the virtual space corresponding to the predicted character and the probability value corresponding to that vector value.

In addition, the prediction module 210 and the projection and softmax modules 220 included in the language model 200 may be composed of neural network models having nodes, layers, and weights.

The neural network models constituting the prediction module 210 and the projection and softmax modules 220 may be composed of a CNN, a DNN, or the like, but are not limited thereto and may be composed of various types of neural network models.

The language model training operation of the processor 120 may be performed based on the first text data, the first predicted text data, and loss information (e.g., a cross-entropy loss) of the first language model 200. That is, the processor 120 may train the prediction module 210 and the projection and softmax modules 220 by adjusting their parameters (e.g., weights, thresholds, etc.) so as to minimize the loss of the language model 200.
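As an illustration of this training step only: a next-character language model with this prediction/projection/softmax split, trained with a cross-entropy-style loss, might look like the following sketch. The class names, sizes, and optimizer are assumptions, not the disclosed implementation.

```python
# Sketch under assumed hyperparameters; not the disclosed implementation.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 100, 128, 256

class PredictionModule(nn.Module):
    """Predicts a representation of the next character from the prefix."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, U)
        h, _ = self.rnn(self.emb(tokens))
        return h                               # (B, U, HID)

class ProjectionSoftmax(nn.Module):
    """Maps hidden states to per-character log-probabilities."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HID, VOCAB)

    def forward(self, h):
        return self.proj(h).log_softmax(dim=-1)

prediction, proj_softmax = PredictionModule(), ProjectionSoftmax()
optim = torch.optim.Adam([*prediction.parameters(), *proj_softmax.parameters()])
loss_fn = nn.NLLLoss()                        # cross-entropy on log-probs

text = torch.randint(0, VOCAB, (8, 21))       # dummy character-ID sequences
inputs, targets = text[:, :-1], text[:, 1:]   # predict the next character
log_probs = proj_softmax(prediction(inputs))
loss = loss_fn(log_probs.reshape(-1, VOCAB), targets.reshape(-1))
optim.zero_grad(); loss.backward(); optim.step()
```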
The processor 120 may train the speech recognition model 300 based on the first speech data and the first transcription data obtained by inputting the first speech data into the speech recognition model 300 to which the trained first prediction module 210 is added.

Here, the speech recognition model is a model that determines a character corresponding to an input speech by outputting a probability value of the character corresponding to the input speech.

The speech recognition model 300 may include an encoder including a transcription module 310 and a decoder including an integration module 320 to which the trained first prediction module 210 is added.

The transcription module 310 may convert the input speech data into vector values in a virtual space corresponding to the speech data. The integration module 320, to which the trained first prediction module 210 is added, may output a character string based on the vector values output from the transcription module 310 and the probability values that the output string can take, and the output string may correspond to the first transcription data.
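This transcription/prediction/integration layout follows the general shape of transducer-style models. Purely as an assumed sketch (the additive combination and all names are illustrative, not taken from the disclosure), the integration module could combine the acoustic and prediction streams as follows, reusing the PredictionModule sketch above:

```python
# Assumed transducer-style modules; an illustration, not the disclosed design.
import torch
import torch.nn as nn

class TranscriptionModule(nn.Module):
    """Encoder: acoustic frames (B, T, F) -> latent vectors (B, T, H)."""
    def __init__(self, feat_dim=80, hid=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, hid))

    def forward(self, feats):
        return self.net(feats)

class IntegrationModule(nn.Module):
    """Joint step: combines acoustic and prediction vectors per (t, u) pair."""
    def __init__(self, hid=256, vocab=100):
        super().__init__()
        self.fc = nn.Linear(hid, vocab)

    def forward(self, acoustic, prediction):
        # (B, T, 1, H) + (B, 1, U, H) -> (B, T, U, H): every frame/prefix pair
        joint = torch.tanh(acoustic.unsqueeze(2) + prediction.unsqueeze(1))
        return self.fc(joint).log_softmax(dim=-1)   # (B, T, U, vocab)
```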
The transcription module 310 and the integration module 320 included in the speech recognition model 300 may be composed of neural network models having nodes, layers, and weights.

The neural network models constituting the transcription module 310 and the integration module 320 may be composed of a CNN, a DNN, an RNN, or the like. However, they are not limited thereto and may be composed of various types of neural network models.

The processor 120 may train the transcription module 310 and the integration module 320 included in the speech recognition model 300 based on the first speech data, the first text data corresponding to the first speech data, and the first transcription data output from the speech recognition model 300, while keeping the structure and parameters of the trained first prediction module 210 fixed.

By training the transcription module 310 and the integration module 320 while keeping the structure and parameters of the trained first prediction module 210 fixed as described above, the processor 120 can obtain a transcription module 310 and an integration module 320 adapted to the virtual vector space of the projection and softmax modules 220 included in the trained first language model 200. Accordingly, the processor 120 can train the speech recognition model 300 more efficiently and can increase the speech recognition accuracy of the speech recognition model 300.
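In a framework such as PyTorch, "keeping the structure and parameters fixed" is conventionally expressed by disabling gradients on the frozen module and giving the optimizer only the remaining parameters. A hedged sketch, reusing the illustrative classes from the snippets above:

```python
# Illustrative freezing recipe; module classes are the assumed sketches above.
prediction = PredictionModule()          # trained with the first language model
transcription = TranscriptionModule()
integration = IntegrationModule()

for p in prediction.parameters():        # freeze the trained prediction module
    p.requires_grad_(False)

# Only the transcription and integration modules receive gradient updates.
optim = torch.optim.Adam([*transcription.parameters(),
                          *integration.parameters()])
```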
The processor 120 may train the second language model 400 based on the second text data and the second predicted text data obtained by inputting the second text data into the second language model 400 to which the trained projection and softmax modules 220 are added.

Here, the operation of the second language model 400, as well as the operations and configurations of the second prediction module 410 included in the second language model 400 and of the trained projection and softmax modules 220 added to the second language model, are as described above in connection with the first language model 200.

Specifically, the processor 120 may train the second prediction module 410 included in the second language model 400 based on the second predicted text data obtained by inputting the second text data into the second language model 400 to which the trained projection and softmax modules 220 are added, and the second prediction module 410 may be trained per domain so as to predict the character that will follow the input characters in relation to a predefined domain. That is, the processor 120 may train the second prediction modules 410-1, 410-2, and 410-3 for the respective domains based on text data and predicted text data related to a specific domain.

The predefined domain may be the subject of the text data. For example, the domain of text data such as "When cooking pasta, you should make the sauce with milk." and "When making gamjatang, you should boil it thoroughly for 40 minutes." may be "cooking." As another example, the domain of text data such as "It will rain in the central region from tomorrow afternoon." and "Clear weather will continue all day tomorrow." may be "weather."

By training the second prediction module 410 with text data corresponding to a specific domain, the processor 120 makes training possible with a small amount of data for each specific domain and can increase the prediction accuracy of the second prediction module 410.

In addition, by training the second prediction module 410 included in the second language model 400 while keeping the structure and parameters of the trained projection and softmax modules 220 fixed as described above, the prediction modules 410-1, 410-2, and 410-3 can be trained with text data alone, rather than the speech-text data pairs required for training a typical speech recognition model, which enables efficient training.
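A sketch of what this text-only stage could look like, under the same assumptions as the earlier snippets: a fresh prediction module is trained per domain while the shared projection/softmax stack stays frozen. The helper name and domain labels are hypothetical.

```python
# Illustrative text-only domain adaptation; not the disclosed implementation.
# Reuses torch, nn, VOCAB, PredictionModule, and proj_softmax from above.
def train_domain_prediction_module(domain_texts, proj_softmax, epochs=3):
    """Trains a new prediction module on domain text; projection/softmax fixed."""
    prediction = PredictionModule()              # fresh, domain-specific
    for p in proj_softmax.parameters():          # shared stack stays fixed
        p.requires_grad_(False)
    optim = torch.optim.Adam(prediction.parameters())
    loss_fn = nn.NLLLoss()
    for _ in range(epochs):
        for text in domain_texts:                # text: (B, U) character IDs
            inputs, targets = text[:, :-1], text[:, 1:]
            log_probs = proj_softmax(prediction(inputs))
            loss = loss_fn(log_probs.reshape(-1, VOCAB), targets.reshape(-1))
            optim.zero_grad(); loss.backward(); optim.step()
    return prediction

# e.g. one module per domain, keyed by a hypothetical domain label
cooking_texts = [torch.randint(0, VOCAB, (8, 21))]
domain_modules = {"cooking": train_domain_prediction_module(cooking_texts,
                                                            proj_softmax)}
```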
As described above, the processor 120 can train the first language model 200, the speech recognition model 300, and the second language model 400. A method of performing speech recognition using the trained models is described with reference to FIG. 4.

FIG. 4 is a diagram illustrating an operation in which the electronic device 100 performs speech recognition using a trained speech recognition model, according to an embodiment of the present disclosure.

Referring to FIG. 4, the processor 120 may obtain second transcription data by inputting second speech data into the trained speech recognition model 300 to which the trained first prediction module 210 is added.

The second speech data is arbitrary speech data different from the first speech data, and the second transcription data may be text data corresponding to the characters contained in the second speech data.

The processor 120 may identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into the domain detection module 500. That is, the processor 120 can identify the subject of the second speech data based on the domain identified through the domain detection module 500.
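The disclosure does not pin down how the domain detection module 500 is realized; one assumed possibility is a small text classifier over the first-pass transcription, sketched below (architecture and labels are hypothetical):

```python
# One assumed realization of a domain detection module: a bag-of-characters
# classifier over the first-pass transcription. Purely illustrative.
import torch
import torch.nn as nn

class DomainDetector(nn.Module):
    def __init__(self, vocab=100, num_domains=3):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, 64)    # pools over each sequence
        self.fc = nn.Linear(64, num_domains)

    def forward(self, token_ids):                # (B, U) character IDs
        return self.fc(self.emb(token_ids))      # (B, num_domains) logits

detector = DomainDetector()
domain_id = detector(torch.randint(0, 100, (1, 21))).argmax(dim=-1)
```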
The processor 120 may identify, among the plurality of trained second prediction modules 410-1, 410-2, and 410-3, the second prediction module 410 corresponding to the identified domain.

The processor 120 may obtain third transcription data, or text data corresponding to the third transcription data, by inputting the second speech data into the trained speech recognition model 300 to which the identified second prediction module 410 is added.

In addition, the trained speech recognition model to which the second prediction module 410 is added may not include the first prediction module 210.

Accordingly, final speech recognition is performed by simply swapping the prediction module included in the speech recognition model 300 from the first prediction module 210 to the second prediction module 410 corresponding to the specific domain, without any additional computation using a language model, which enables fast and simple speech recognition.
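Putting the assumed pieces together, the two-pass flow (first-pass recognition, domain detection, prediction-module swap, second pass) might be wired up as in the following sketch, which reuses the objects from the earlier snippets; `recognize` stands in for an unspecified decoding routine and is hypothetical, as are the domain labels.

```python
# Assumed two-pass inference flow; `recognize` is a hypothetical, deliberately
# crude stand-in for a real decoding routine (e.g., beam search).
def recognize(transcription, prediction, integration, feats):
    """Placeholder decode: one prediction step, argmax over characters."""
    acoustic = transcription(feats)
    prefix = torch.zeros(feats.size(0), 1, dtype=torch.long)  # start token
    log_probs = integration(acoustic, prediction(prefix))     # (B, T, 1, V)
    return log_probs.argmax(dim=-1).flatten(1)                # (B, T) IDs

feats = torch.randn(1, 50, 80)
first_pass = recognize(transcription, prediction, integration, feats)

domain_id = detector(first_pass).argmax(dim=-1).item()
domain_name = ["cooking", "weather", "music"][domain_id]   # assumed labels

# Swap in the domain-specific prediction module and decode again.
second_pass = recognize(transcription,
                        domain_modules.get(domain_name, prediction),
                        integration, feats)
```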
In addition, by making the prediction module, a sub-model common to the speech recognition model, exchangeable, the accuracy of speech recognition for each domain can be increased.

FIG. 5 is a diagram illustrating a speech recognition model that can be used to control the electronic device 100, according to an embodiment of the present disclosure.

Referring to FIG. 5, when speech data is input to a speech recognition model according to an embodiment of the present disclosure, the domain related to the speech is identified first, and final speech recognition is then performed based on the identified domain to obtain the output transcription data, so that an accurate speech recognition operation can be performed.

In addition, since the speech recognition model according to the present disclosure is trained while the structures and parameters of the trained prediction module and the projection and softmax modules are fixed, accurate and efficient training is possible with only a small amount of data.

FIG. 6 is a flowchart illustrating the operation of the electronic device 100, according to an embodiment of the present disclosure.
The electronic device 100 may obtain first speech data and first text data corresponding to the first speech data (S610).

The electronic device 100 may train the first prediction module, the projection module, and the softmax module included in the first language model based on the first text data and the first predicted text data obtained by inputting the first text data into the first language model (S620).

The electronic device 100 may train the speech recognition model based on the first speech data and the first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added (S630).

The electronic device 100 may train the transcription module and the integration module included in the speech recognition model based on the first speech data, the first text data, and the first transcription data while keeping the structure and parameters of the trained first prediction module fixed.

The electronic device 100 may train the second language model based on the second text data and the second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added (S640).

The electronic device 100 may train the second prediction module included in the second language model based on the second text data and the second predicted text data while keeping the structure and parameters of the trained projection module and the trained softmax module fixed.

The electronic device 100 may obtain second transcription data by inputting second speech data into the trained speech recognition model to which the trained first prediction module is added, and may identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into the domain detection module.

The electronic device 100 may identify, among the plurality of trained second prediction modules, the second prediction module corresponding to the identified domain, and may obtain third transcription data by inputting the second speech data into the trained speech recognition model to which the identified second prediction module is added.

Here, the trained speech recognition model to which the second prediction module is added may not include the first prediction module.
Functions related to artificial intelligence according to the present disclosure operate through the processor and memory of the electronic device.

The processor may consist of one or more processors. The one or more processors may include at least one of a central processing unit (CPU), a graphics processing unit (GPU), or a neural processing unit (NPU), but are not limited to these examples.

A CPU is a general-purpose processor that can perform not only general computation but also artificial intelligence computation, and it can efficiently execute complex programs through a multi-level cache structure. A CPU is well suited to serial processing, in which sequential computation lets the result of one calculation feed directly into the next. The general-purpose processor is not limited to this example, except where it is explicitly specified as a CPU.

A GPU is a processor for large-scale operations, such as the floating-point operations used in graphics processing, and it can perform such operations in parallel by integrating a large number of cores. In particular, a GPU may be more advantageous than a CPU for parallel processing such as convolution operations. A GPU may also serve as a co-processor that supplements the functions of a CPU. The processor for large-scale operations is not limited to this example, except where it is explicitly specified as a GPU.

An NPU is a processor specialized for artificial intelligence computation using artificial neural networks, in which each layer of the neural network may be implemented in hardware (e.g., silicon). Because an NPU is designed to a vendor's required specifications, it offers less flexibility than a CPU or GPU, but it can efficiently process the artificial intelligence computations the vendor requires. As a processor specialized for artificial intelligence computation, an NPU may be implemented in various forms, such as a tensor processing unit (TPU), an intelligence processing unit (IPU), or a vision processing unit (VPU). The artificial intelligence processor is not limited to these examples, except where it is explicitly specified as an NPU.

In addition, the one or more processors may be implemented as a system on chip (SoC). Besides the one or more processors, the SoC may further include memory and a network interface, such as a bus, for data communication between the processor and the memory.

When the SoC included in the electronic device includes a plurality of processors, the electronic device may use some of those processors to perform artificial intelligence-related operations (for example, operations related to learning or inference of an artificial intelligence model). For example, the electronic device may perform artificial intelligence-related operations using at least one of a GPU, an NPU, a VPU, a TPU, or a hardware accelerator specialized for artificial intelligence operations such as convolution and matrix multiplication. This is only an example, however; such operations may of course also be processed by a general-purpose processor such as a CPU.

The electronic device may also perform operations for artificial intelligence-related functions using multiple cores (for example, dual-core or quad-core) included in a single processor. In particular, the electronic device can use the multiple cores to perform artificial intelligence operations such as convolution and matrix multiplication in parallel, as the sketch below illustrates.
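As a rough sketch of how such an operation might be dispatched (PyTorch assumed; the thread count is an arbitrary placeholder), a matrix multiplication can be routed to a specialized processor when one is available and otherwise run in parallel across CPU cores:

    import torch

    # Prefer a processor specialized for AI operations when available;
    # otherwise fall back to the general-purpose CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if device.type == "cpu":
        torch.set_num_threads(4)  # spread the work across multiple cores

    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    c = a @ b  # matrix multiplication executed in parallel on the chosen device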
One or more processors control input data to be processed according to predefined operation rules or an artificial intelligence model stored in memory. The predefined operation rules or artificial intelligence model are characterized by being created through learning.

Here, being created through learning means that a predefined operation rule or an artificial intelligence model with desired characteristics is created by applying a learning algorithm to a large amount of training data. Such learning may be performed on the device on which the artificial intelligence according to the present disclosure runs, or through a separate server or system.

An artificial intelligence model may consist of a plurality of neural network layers. Each layer has at least one weight value and performs its operation using the result of the previous layer and at least one defined operation. Examples of neural networks include a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, and the Transformer; the neural network in the present disclosure is not limited to these examples, except where specified.
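A minimal sketch of this layer-by-layer computation, assuming PyTorch and an arbitrary toy architecture, is shown below; each layer applies its weight values and a defined operation to the result of the previous layer:

    import torch
    import torch.nn as nn

    # Each layer holds weight values and computes its output from the
    # previous layer's result through a defined operation (here, an
    # affine transform followed by a ReLU).
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 8),
    )

    x = torch.randn(1, 16)  # input to the first layer
    y = model(x)            # each layer consumes the previous layer's result
    print(y.shape)          # torch.Size([1, 8])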
A learning algorithm is a method of training a target device (for example, a robot) using a large amount of training data so that the target device can make decisions or predictions on its own. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning; the learning algorithm in the present disclosure is not limited to these examples, except where specified.
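For example, a supervised learning algorithm in this sense adjusts a model's weight values from labeled training data; the sketch below is a minimal illustration with synthetic data standing in for real training data (PyTorch assumed):

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Synthetic labeled data standing in for real training data.
    inputs = torch.randn(64, 4)
    labels = torch.randint(0, 2, (64,))

    for _ in range(10):                       # a few supervised learning steps
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()                       # compute gradients
        optimizer.step()                      # update the weight values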
According to one embodiment, methods according to the various embodiments disclosed in this document may be provided as part of a computer program product. A computer program product is a commodity that can be traded between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or it may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least part of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created on, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

While preferred embodiments of the present disclosure have been shown and described above, the present disclosure is not limited to the specific embodiments described, and various modifications may be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims; such modifications should not be understood in isolation from the technical spirit or scope of the present disclosure.
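The claims that follow recite a three-stage training flow: (1) train the prediction, projection, and softmax modules of a first language model on character data; (2) attach the learned first prediction module, frozen, to the speech recognition model and train the remaining modules on voice data; (3) train a domain-specific second prediction module while the learned projection and softmax modules stay frozen. As a minimal, self-contained sketch of that staged freezing (PyTorch assumed; all sizes, modules, and data are synthetic placeholders, not the actual implementation):

    import torch
    import torch.nn as nn

    V, H, F = 100, 32, 40  # placeholder vocabulary, hidden, and feature sizes

    prediction1 = nn.Embedding(V, H)   # first prediction module
    projection = nn.Linear(H, H)       # projection module
    softmax_head = nn.Linear(H, V)     # softmax module (produces logits)
    loss_fn = nn.CrossEntropyLoss()

    def lm_logits(prediction, chars):
        # Language model path: prediction -> projection -> softmax.
        return softmax_head(projection(prediction(chars)))

    def freeze(*modules):
        # Keep structure and parameters fixed during later stages.
        for m in modules:
            for p in m.parameters():
                p.requires_grad = False

    chars1 = torch.randint(0, V, (8, 12))  # synthetic first character data

    # Stage 1: train the first language model (prediction, projection,
    # softmax) to predict the next character.
    opt = torch.optim.Adam([*prediction1.parameters(),
                            *projection.parameters(),
                            *softmax_head.parameters()], lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        logits = lm_logits(prediction1, chars1[:, :-1])
        loss_fn(logits.reshape(-1, V), chars1[:, 1:].reshape(-1)).backward()
        opt.step()

    # Stage 2: add the learned first prediction module (frozen) to the
    # speech recognition model and train its transcription and
    # integration modules on first voice / first transcription data.
    freeze(prediction1)
    transcription = nn.Linear(F, H)    # stand-in acoustic encoder
    integration = nn.Linear(2 * H, V)  # joins acoustic and linguistic streams
    speech1 = torch.randn(8, 12, F)    # synthetic first voice data
    opt = torch.optim.Adam([*transcription.parameters(),
                            *integration.parameters()], lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        joined = torch.cat([transcription(speech1[:, :-1]),
                            prediction1(chars1[:, :-1])], dim=-1)
        loss_fn(integration(joined).reshape(-1, V),
                chars1[:, 1:].reshape(-1)).backward()
        opt.step()

    # Stage 3: train a second, domain-specific prediction module with the
    # learned projection and softmax modules frozen, on second character data.
    freeze(projection, softmax_head)
    prediction2 = nn.Embedding(V, H)       # second prediction module
    chars2 = torch.randint(0, V, (8, 12))  # synthetic domain character data
    opt = torch.optim.Adam(prediction2.parameters(), lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        logits = lm_logits(prediction2, chars2[:, :-1])
        loss_fn(logits.reshape(-1, V), chars2[:, 1:].reshape(-1)).backward()
        opt.step()

The freezing in stages 2 and 3 mirrors the "maintaining the structure and parameters" language of the claims: only the newly attached modules receive gradient updates.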

Claims (15)

1. An electronic device comprising:
    a memory storing at least one instruction; and
    one or more processors configured to execute the at least one instruction,
    wherein the one or more processors are configured to:
    obtain first voice data and first character data corresponding to the first voice data,
    train a first prediction module, a projection module, and a softmax module included in a first language model, based on the first character data and on first predicted character data obtained by inputting the first character data into the first language model,
    train an automatic speech recognition model, based on the first voice data and on first transcription data obtained by inputting the first voice data into the speech recognition model to which the learned first prediction module is added, and
    train a second language model, based on second character data and on second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added,
    wherein the first language model and the second language model are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character, and
    wherein the speech recognition model is a model that determines a character corresponding to an input voice by outputting a probability value of the character corresponding to the input voice.
2. The electronic device of claim 1, wherein
    the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and
    the one or more processors are configured to train the transcription module and the integration module included in the speech recognition model, based on the first voice data, the first character data, and the first transcription data, while maintaining the structure and parameters of the learned first prediction module.
3. The electronic device of claim 1, wherein
    the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and
    the one or more processors are configured to train the second prediction module included in the second language model, based on the second character data and the second predicted character data, while maintaining the structure and parameters of the learned projection module and the learned softmax module.
4. The electronic device of claim 3, wherein
    the second character data is character data related to a predefined domain.
5. The electronic device of claim 1, wherein the one or more processors are configured to:
    obtain second transcription data by inputting second voice data into the learned speech recognition model to which the learned first prediction module is added, and
    identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
6. The electronic device of claim 5, wherein the one or more processors are configured to:
    identify, among a plurality of learned second prediction modules, a second prediction module corresponding to the identified domain, and
    obtain third transcription data by inputting the second voice data into the learned speech recognition model to which the identified second prediction module is added.
7. The electronic device of claim 6, wherein
    the learned speech recognition model to which the second prediction module is added does not include the first prediction module.
8. A method of controlling an electronic device, the method comprising:
    obtaining first voice data and first character data corresponding to the first voice data;
    training a first prediction module, a projection module, and a softmax module included in a first language model, based on the first character data and on first predicted character data obtained by inputting the first character data into the first language model;
    training an automatic speech recognition model, based on the first voice data and on first transcription data obtained by inputting the first voice data into the speech recognition model to which the learned first prediction module is added; and
    training a second language model, based on second character data and on second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added,
    wherein the first language model and the second language model are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character, and
    wherein the speech recognition model is a model that determines a character corresponding to an input voice by outputting a probability value of the character corresponding to the input voice.
9. The method of claim 8, wherein
    the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and
    training the speech recognition model comprises training the transcription module and the integration module included in the speech recognition model, based on the first voice data, the first character data, and the first transcription data, while maintaining the structure and parameters of the learned first prediction module.
10. The method of claim 8, wherein
    the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and
    training the second language model comprises training the second prediction module included in the second language model, based on the second character data and the second predicted character data, while maintaining the structure and parameters of the learned projection module and the learned softmax module.
11. The method of claim 10, wherein
    the second character data is character data related to a predefined domain.
12. The method of claim 8, further comprising:
    obtaining second transcription data by inputting second voice data into the learned speech recognition model to which the learned first prediction module is added; and
    identifying a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
13. The method of claim 12, wherein identifying the domain further comprises:
    identifying, among a plurality of learned second prediction modules, a second prediction module corresponding to the identified domain; and
    obtaining third transcription data by inputting the second voice data into the learned speech recognition model to which the identified second prediction module is added.
14. The method of claim 13, wherein
    the learned speech recognition model to which the second prediction module is added does not include the first prediction module.
15. A non-transitory computer-readable recording medium storing computer instructions which, when executed by a processor of an electronic device, cause the electronic device to perform operations comprising:
    obtaining first voice data and first character data corresponding to the first voice data;
    training a first prediction module, a projection module, and a softmax module included in a first language model, based on the first character data and on first predicted character data obtained by inputting the first character data into the first language model;
    training an automatic speech recognition model, based on the first voice data and on first transcription data obtained by inputting the first voice data into the speech recognition model to which the learned first prediction module is added; and
    training a second language model, based on second character data and on second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added,
    wherein the first language model and the second language model are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character, and
    wherein the speech recognition model is a model that determines a character corresponding to an input voice by outputting a probability value of the character corresponding to the input voice.
PCT/KR2023/010131 2022-09-08 2023-07-14 Electronic device and control method thereof WO2024053842A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2022-0114382 2022-09-08
KR1020220114382A KR20240035197A (en) 2022-09-08 2022-09-08 Electronic apparatus and method for controlling thereof

Publications (1)

Publication Number Publication Date
WO2024053842A1 true WO2024053842A1 (en) 2024-03-14

Family

ID=90191545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/010131 WO2024053842A1 (en) 2022-09-08 2023-07-14 Electronic device and control method thereof

Country Status (2)

Country Link
KR (1) KR20240035197A (en)
WO (1) WO2024053842A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
KR20210073252A (en) * 2019-12-10 2021-06-18 엘지전자 주식회사 Artificial intelligence device and operating method thereof
KR20210088461A (en) * 2020-10-15 2021-07-14 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method and apparatus for training network, device, storage medium and program
KR20220053491A (en) * 2020-10-22 2022-04-29 삼성전자주식회사 Electronic device and controlling method of electronic device
KR20220082790A (en) * 2021-06-09 2022-06-17 아폴로 인텔리전트 커넥티비티 (베이징) 테크놀로지 씨오., 엘티디. Method and apparatus for processing audio signal, method and apparatus for training model, electronic device , storage medium and computer program


Also Published As

Publication number Publication date
KR20240035197A (en) 2024-03-15

Similar Documents

Publication Publication Date Title
WO2021177730A1 (en) Apparatus for diagnosing disease causing voice and swallowing disorders and method for diagnosing same
WO2019098414A1 (en) Method and device for hierarchical learning of neural network, based on weakly supervised learning
WO2021071110A1 (en) Electronic apparatus and method for controlling electronic apparatus
WO2019050297A1 (en) Neural network learning method and device
WO2022146050A1 (en) Federated artificial intelligence training method and system for depression diagnosis
EP3942481A1 (en) Method of performing, by electronic device, convolution operation at certain layer in neural network, and electronic device therefor
WO2021149878A1 (en) Electronic device and control method thereof
WO2018212584A2 (en) Method and apparatus for classifying class, to which sentence belongs, using deep neural network
WO2022145611A1 (en) Electronic device and method for controlling same
WO2020060311A1 (en) Electronic device and method for providing or obtaining data for training thereof
WO2024053842A1 (en) Electronic device and control method thereof
WO2019198900A1 (en) Electronic apparatus and control method thereof
WO2022177091A1 (en) Electronic device and method for controlling same
EP3824384A1 (en) Electronic device and controlling method of electronic device
WO2024049025A1 (en) Electronic device for training speech recognition model and control method thereof
WO2021045434A1 (en) Electronic device and control method therefor
WO2024053825A1 (en) Electronic device for training voice recognition model, and control method therefor
WO2022191366A1 (en) Electronic device and method of controlling same
WO2019107624A1 (en) Sequence-to-sequence translation method and apparatus therefor
WO2022092445A1 (en) Electronic device and control method thereof
WO2024048990A1 (en) Electronic device and control method therefor
WO2024034830A1 (en) Electronic apparatus for clustering graph data on basis of gnn and control method therefor
WO2024101652A1 (en) Electronic device and control method therefor
WO2024101668A1 (en) Electronic device and method for controlling same
WO2022177089A1 (en) Electronic device and control method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23863323

Country of ref document: EP

Kind code of ref document: A1