WO2024053842A1 - Electronic device and control method thereof - Google Patents

Electronic device and control method thereof

Info

Publication number
WO2024053842A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
learned
character
voice
Application number
PCT/KR2023/010131
Other languages
French (fr)
Korean (ko)
Inventor
진시첸
김성수
박진환
한창우
Original Assignee
삼성전자주식회사 (Samsung Electronics Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 삼성전자주식회사 (Samsung Electronics Co., Ltd.)
Publication of WO2024053842A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING — G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
        • G06F3/16 Sound input; Sound output
        • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS — G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
        • G10L15/00 Speech recognition
        • G10L15/005 Language recognition
        • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
        • G10L15/063 Training
        • G10L15/18 Speech classification or search using natural language modelling
        • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
        • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
        • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to an electronic device and a control method thereof, and more specifically, to an efficient method of learning a voice recognition model and to a method of providing accurate voice recognition operations appropriate for the subject included in a voice using the learned voice recognition model.
  • a speech recognition model (Automatic Speech Recognition Model) is used to obtain information about the characters included in speech and to perform speech recognition operations.
  • when voice data is input, the voice recognition model obtains output data (or transcription data) corresponding to the voice data and identifies the meaning contained in the voice based on character data corresponding to the output data.
  • speech recognition models include probability-model-based models, which express speech and language information as individual probability models to perform speech recognition operations, and end-to-end models, which use a single deep neural network to reduce the complexity of the probability-model-based approach and perform speech recognition operations efficiently.
  • a language model (Language Model) is used to obtain a more natural, complete sentence by predicting the word that will appear after a specific word within a sentence.
  • the language model assigns a probability to each candidate word that may appear after a specific word, and a natural, complete sentence can be obtained based on that probability information.
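  • as an illustration of this next-word probability assignment, a language model can be reduced to a network that maps a prefix of tokens to a probability distribution over the vocabulary. The following is a minimal PyTorch sketch; the vocabulary size, network shapes, and module names are assumptions for illustration, not part of the disclosure.

```python
import torch
import torch.nn as nn

# Minimal next-word predictor: given a prefix of token ids, produce a
# probability distribution over the vocabulary for the word that follows.
class TinyLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)  # project hidden state to vocabulary logits

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))
        return torch.softmax(self.proj(h[:, -1]), dim=-1)  # P(next word | prefix)

lm = TinyLanguageModel()
probs = lm(torch.tensor([[3, 17, 42]]))  # probability of every candidate next word
next_word_id = probs.argmax(dim=-1)      # most likely continuation
```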
  • An electronic device for achieving the above-described object includes: a memory storing at least one instruction; and one or more processors executing the at least one instruction. The one or more processors acquire first voice data and first text data corresponding to the first voice data; learn a first prediction (Prediction) module, a projection (Projection) module, and a softmax (Softmax) module included in a first language model (Language Model) based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; learn the speech recognition model based on the first voice data and first transcription (Transcription) data obtained by inputting the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added; and learn a second language model based on second text data and second predicted text data obtained by inputting the second text data into a second language model to which the learned projection module and the learned softmax module are added. The first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may be a model that determines the character corresponding to an input voice by outputting a probability value for that character.
  • the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and the one or more processors may learn the transcription module and the integration module included in the voice recognition model based on the first voice data, the first text data, and the first transcription data while maintaining the structure and parameters of the learned first prediction module.
  • the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and the one or more processors may learn the second prediction module included in the second language model based on the second text data and the second predicted text data while maintaining the structure and parameters of the learned projection module and the learned softmax module.
  • the second text data may be text data related to a predefined domain.
  • the one or more processors may obtain second transcription data by inputting second voice data into the learned voice recognition model to which the learned first prediction module is added, and may identify the domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
  • the one or more processors may identify the second prediction module corresponding to the identified domain among a plurality of learned second prediction modules, and may obtain third transcription data by inputting the second voice data into the learned voice recognition model to which the identified second prediction module is added.
  • the learned speech recognition model to which the second prediction module is added may not include the first prediction module.
  • a method of controlling an electronic device according to an embodiment of the present disclosure includes: obtaining first voice data and first text data corresponding to the first voice data; learning a first prediction (Prediction) module, a projection (Projection) module, and a softmax (Softmax) module included in a first language model (Language Model) based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; learning the speech recognition model based on the first voice data and first transcription (Transcription) data obtained by inputting the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added; and learning a second language model based on second text data and second predicted text data obtained by inputting the second text data into a second language model to which the learned projection module and the learned softmax module are added. The first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may determine the character corresponding to an input voice by outputting a probability value for that character.
  • the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and the step of learning the speech recognition model may include learning the transcription module and the integration module included in the voice recognition model based on the first voice data, the first text data, and the first transcription data while maintaining the structure and parameters of the learned first prediction module.
  • the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and the step of learning the second language model may include learning the second prediction module included in the second language model based on the second text data and the second predicted text data while maintaining the structure and parameters of the learned projection module and the learned softmax module.
  • the second text data may be text data related to a predefined domain.
  • the control method may further include: obtaining second transcription data by inputting second voice data into the learned voice recognition model to which the learned first prediction module is added; and identifying the domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
  • identifying the domain may include identifying a second prediction module corresponding to the identified domain among a plurality of learned second prediction modules; and acquiring third transcription data by inputting the second voice data into the learned voice recognition model to which the identified second prediction module is added.
  • the learned speech recognition model to which the second prediction module is added may not include the first prediction module.
  • a recording medium according to an embodiment of the present disclosure stores a program for performing a control method that includes: obtaining first voice data and first text data corresponding to the first voice data; learning a first prediction (Prediction) module, a projection (Projection) module, and a softmax (Softmax) module included in a first language model (Language Model) based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; learning the speech recognition model based on the first voice data and first transcription data obtained by inputting the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added; and learning a second language model based on second text data and second predicted text data obtained by inputting the second text data into a second language model to which the learned projection module and the learned softmax module are added. The first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may be a model that determines the character corresponding to an input voice by outputting a probability value for that character.
  • FIG. 1 is a diagram illustrating a voice recognition model that can be used to control an electronic device according to an embodiment of the present disclosure.
  • FIG. 2 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.
  • FIG. 3 is a diagram illustrating an operation in which an electronic device learns a voice recognition model and a language model, according to an embodiment of the present disclosure.
  • FIG. 4 is a diagram illustrating an operation in which an electronic device performs voice recognition using a learned voice recognition model, according to an embodiment of the present disclosure.
  • FIG. 5 is a diagram illustrating a voice recognition model that can be used to control an electronic device according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart for explaining the operation of an electronic device according to an embodiment of the present disclosure.
  • expressions such as "have," "may have," "includes," or "may include" refer to the presence of the corresponding feature (e.g., a component such as a numerical value, function, operation, or part) and do not rule out the presence of additional features.
  • expressions such as "A or B," "at least one of A or/and B," or "one or more of A or/and B" may include all possible combinations of the items listed together.
  • for example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the cases of (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.
  • when a component (e.g., a first component) is referred to as being "connected to" or "coupled with" another component (e.g., a second component), it should be understood that the component can be connected to the other component directly or through yet another component (e.g., a third component). Conversely, when a component (e.g., a first component) is referred to as being "directly connected to" or "directly coupled with" another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between them.
  • the expression "configured to" used in the present disclosure may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the situation.
  • the term “configured (or set to)” may not necessarily mean “specifically designed to” in hardware.
  • the expression “a device configured to” may mean that the device is “capable of” working with other devices or components.
  • for example, the phrase "processor configured (or set) to perform A, B, and C" may refer to a dedicated processor for performing those operations (e.g., an embedded processor), or a general-purpose processor (e.g., a CPU or an application processor) capable of performing those operations by executing one or more software programs stored in a memory device.
  • a 'module' or 'unit' performs at least one function or operation, and may be implemented as hardware or software, or as a combination of hardware and software. Additionally, a plurality of 'modules' or a plurality of 'units' may be integrated into at least one module and implemented with at least one processor, except for 'modules' or 'units' that need to be implemented with specific hardware.
  • FIG. 1 is a diagram illustrating a voice recognition model that can be used to control an electronic device according to an embodiment of the present disclosure.
  • a voice recognition model that can be used to control an electronic device or operate as a component of an electronic device can output transcription data corresponding to the input voice data when voice data is input. Finally, the electronic device can obtain text data corresponding to the voice data based on the transcription data, and identify the meaning contained in the voice data based on the acquired text data.
  • speech recognition models include probability-model-based models, which express speech and language information as individual probability models to perform speech recognition operations, and end-to-end models, which reduce the complexity of the probability-model-based approach and perform speech recognition operations efficiently.
  • in embodiments of the present disclosure, an end-to-end speech recognition model using a single deep neural network can be used.
  • a voice recognition model may include an encoder and a decoder.
  • the encoder converts voice data into vector values in a virtual space, and the decoder outputs a character string based on the vector values output from the encoder, together with the probability values that the output string may have.
  • the character string output here may correspond to transcription data or text data corresponding to the voice data.
  • the encoder may be based on a Convolutional Neural Network (CNN) or a Deep Neural Network (DNN), and the decoder may be based on a Recurrent Neural Network (RNN), a CNN, or a DNN; however, they are not limited to these and can be implemented with various types of neural network models.
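  • the encoder/decoder split described above can be sketched as follows. This is a minimal illustration only: the audio front end (80-dimensional features), layer sizes, output alphabet, and greedy decoding are assumptions, not the configuration claimed in the disclosure.

```python
import torch
import torch.nn as nn

# Encoder: acoustic frames -> vector values in a latent ("virtual") space.
encoder = nn.Sequential(
    nn.Linear(80, 256),   # assumed 80-dim log-mel features per frame
    nn.ReLU(),
    nn.Linear(256, 256),
)

# Decoder head: latent vectors -> per-frame probabilities over output characters.
decoder = nn.Linear(256, 30)  # assumed 29 characters + 1 blank symbol

frames = torch.randn(1, 120, 80)                         # about 1.2 s of audio features
probs = torch.softmax(decoder(encoder(frames)), dim=-1)  # (1, 120, 30)
best_path = probs.argmax(dim=-1)                         # greedy character string
```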
  • An electronic device including the above-described voice recognition model may be an electronic device with various purposes and functions.
  • for example, the electronic device may be any of a variety of devices that perform computation through a processor, such as a computer, robot, speaker, display device, or other home appliance; the electronic device including a voice recognition model according to an embodiment of the present disclosure is not limited to the above-described devices.
  • the electronic device may be implemented as an electronic device having two or more functions of the above-described devices.
  • FIG. 2 is a block diagram for explaining the configuration of an electronic device according to an embodiment of the present disclosure.
  • the electronic device 100 may include a memory 110 capable of storing at least one instruction and a processor 120 capable of executing at least one instruction.
  • however, the configuration of the electronic device 100 is not limited to the memory 110 and the processor 120; in addition to these, the electronic device 100 may further include components such as a user interface, a communication interface, a display, a speaker, a microphone, and a camera, as needed to perform various operations based on data acquired through the voice recognition model.
  • the memory 110 temporarily or non-temporarily stores various programs or data, and transmits the stored information to the processor 120 according to a call from the processor 120. Additionally, the memory 110 can store various information necessary for calculation, processing, or control operations of the processor 120 in an electronic format.
  • the memory 110 may include, for example, at least one of a main memory and an auxiliary memory.
  • the main memory may be implemented using semiconductor storage media such as ROM and/or RAM.
  • ROM may include, for example, conventional ROM, EPROM, EEPROM, and/or MASK-ROM.
  • RAM may include, for example, DRAM and/or SRAM.
  • the auxiliary memory may be implemented using at least one storage medium capable of storing data permanently or semi-permanently, such as a flash memory device, a Secure Digital (SD) card, a solid state drive (SSD), a hard disk drive (HDD), a magnetic drum, optical media such as a compact disk (CD), DVD, or laser disk, magnetic tape, a magneto-optical disk, and/or a floppy disk.
  • the memory 110 may store a voice recognition model and a language model. Specifically, the memory 110 may store information about the encoder and decoder included in the speech recognition model, information about the prediction module included in the language model, information about the projection module, and information about the softmax module. Information can be saved. The memory 110 may store information about the domain detection module.
  • the memory 110 may store voice data input to the voice recognition model, transcription data, or text data output from the voice recognition model.
  • the memory 110 may store text data input to the language model and predicted text data output from the language model.
  • Memory 110 may store information about the domain.
  • the memory 110 may store information about the neural network that constitutes the speech recognition model and language model. Specifically, the memory 110 can store structural information such as layers and nodes constituting a neural network and parameter information such as weights and thresholds.
  • the electronic device 100 may store various information generated, processed, or used in the process of learning and using the voice recognition model and language model.
  • the processor 120 controls the overall operation of the electronic device 100.
  • the processor 120 is connected to the configuration of the electronic device 100 including the memory 110 as described above, and executes at least one instruction stored in the memory 110 as described above, thereby controlling the electronic device ( 100) operations can be controlled overall.
  • the processor 120 may be implemented not only as a single processor 120 but also as a plurality of processors 120.
  • Processor 120 may be implemented in various ways.
  • the one or more processors 120 may include one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Accelerated Processing Unit (APU), a Many Integrated Core (MIC) processor, a Digital Signal Processor (DSP), a Neural Processing Unit (NPU), a hardware accelerator, or a machine learning accelerator.
  • One or more processors 120 may control one or any combination of other components of the electronic device 100 and may perform operations related to communication or data processing.
  • One or more processors 120 may execute one or more programs or instructions stored in the memory 110.
  • one or more processors 120 may perform a method according to an embodiment of the present disclosure by executing one or more instructions stored in the memory 110.
  • the plurality of operations may be performed by one processor 120 or may be performed by a plurality of processors 120.
  • for example, when the first operation, the second operation, and the third operation are performed by a method according to an embodiment, all three operations may be performed by the first processor 120, or the first operation and the second operation may be performed by the first processor 120 (e.g., a general-purpose processor) while the third operation is performed by the second processor 120 (e.g., a dedicated artificial intelligence processor).
  • the one or more processors 120 may be implemented as a single-core processor 120 including one core, or as one or more multi-core processors 120 including a plurality of cores (e.g., homogeneous multi-core or heterogeneous multi-core). When the one or more processors 120 are implemented as a multi-core processor 120, each of the plurality of cores included in the multi-core processor 120 may include internal processor memory such as an on-chip memory 110, and a common cache shared by the plurality of cores may be included in the multi-core processor 120.
  • additionally, each of the plurality of cores (or some of the plurality of cores) included in the multi-core processor 120 may independently read and execute program instructions for implementing a method according to an embodiment of the present disclosure, or all (or some) of the plurality of cores may be linked to read and execute those program instructions.
  • when a method according to an embodiment of the present disclosure involves a plurality of operations, the plurality of operations may be performed by one core among the plurality of cores included in the multi-core processor 120 or by multiple cores.
  • for example, when the first operation, the second operation, and the third operation are performed by a method according to an embodiment, all three operations may be performed by a first core included in the multi-core processor 120, or the first operation and the second operation may be performed by the first core while the third operation is performed by a second core included in the multi-core processor 120.
  • in embodiments of the present disclosure, the processor 120 may refer to a system-on-chip (SoC) in which one or more processors 120 and other electronic components are integrated, a single-core processor 120, a multi-core processor 120, or a core included in the single-core or multi-core processor 120, where the core may be implemented as a CPU, GPU, APU, MIC, DSP, NPU, hardware accelerator, or machine learning accelerator; however, embodiments of the present disclosure are not limited thereto.
  • One or more processors 120 may acquire first voice data and first text data corresponding to the first voice data.
  • the processor 120 can learn the first prediction module, the projection module, and the softmax module included in the first language model based on the first character data and the first predicted character data obtained by inputting the first character data into the first language model (Language Model).
  • the processor 120 can learn the speech recognition model based on the first speech data and the first transcription data obtained by inputting the first speech data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added.
  • the processor 120 can learn the second language model based on the second character data and the second predicted character data obtained by inputting the second character data into a second language model to which the learned projection module and the learned softmax module are added.
  • here, the first language model and the second language model are models that determine the predicted character corresponding to an input character by outputting a probability value for the predicted character, and the speech recognition model may be a model that determines the character corresponding to an input voice by outputting a probability value for that character.
  • hereinafter, the control operations of the electronic device 100, including the processor 120's handling of the voice recognition model and the language model, will be described in more detail with reference to FIGS. 3 to 5.
  • FIG. 3 is a diagram illustrating an operation of the electronic device 100 learning a voice recognition model and a language model, according to an embodiment of the present disclosure.
  • the processor 120 may obtain first voice data and first text data corresponding to the first voice data.
  • the first text data may be transcription data obtained by transcribing the first voice data, but is not limited thereto, and may be text data corresponding to the meaning included in the first voice data.
  • the processor 120 may input first text data into the first language model 200 to obtain first predicted text data.
  • the language model is a model that determines the predicted character corresponding to the input character by outputting the probability value of the predicted character corresponding to the input character.
  • the processor 120 can learn the first prediction module 210, the projection module, and the softmax module 220 included in the first language model based on the first character data and the first predicted character data obtained by inputting the first character data into the first language model 200.
  • here, the prediction module 210 serves to predict and identify the characters that may appear after an input character, and the projection module and softmax module 220 serve to identify the vector values in a virtual space corresponding to the predicted characters and the probability values corresponding to those vector values.
  • the prediction module 210, projection module, and softmax module 220 included in the language model 200 may be composed of a neural network model with nodes, layers, and weights.
  • the neural network model constituting the prediction module 210, the projection module, and the softmax module 220 may be a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), or the like, but is not limited thereto and may be implemented with various types of neural network models.
  • the language model learning operation of the processor 120 may be performed based on the first character data, the first predicted character data, and loss (e.g., cross-entropy loss) information of the first language model 200. That is, the processor 120 can learn the first prediction module 210, the projection module, and the softmax module 220 by modifying their parameters (e.g., weights, thresholds, etc.) so as to minimize the loss of the language model 200.
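  • a minimal sketch of this loss-driven learning step is shown below, assuming PyTorch and illustrative module shapes: the prediction module is a recurrent network, the projection module maps hidden states to vocabulary logits, and cross-entropy loss (which folds in the softmax) is minimized over the first character data.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the modules of the first language model (200):
# prediction (210) -> projection -> softmax (220). Sizes are assumptions.
vocab_size, hidden = 1000, 256
embed = nn.Embedding(vocab_size, hidden)
prediction = nn.LSTM(hidden, hidden, batch_first=True)
projection = nn.Linear(hidden, vocab_size)
loss_fn = nn.CrossEntropyLoss()  # combines log-softmax and negative log-likelihood

params = (list(embed.parameters()) + list(prediction.parameters())
          + list(projection.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

def train_step(char_ids):
    # char_ids: (batch, seq) token ids of the first character data.
    inputs, targets = char_ids[:, :-1], char_ids[:, 1:]
    h, _ = prediction(embed(inputs))
    logits = projection(h)  # predicted-character scores at every position
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # modify weights to minimize the language-model loss
    optimizer.step()
    return loss.item()

print(train_step(torch.randint(0, vocab_size, (8, 32))))
```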
  • the processor 120 can learn the voice recognition model (Automatic Speech Recognition Model) 300 based on the first voice data and the first transcription data obtained by inputting the first voice data into the voice recognition model 300 to which the learned first prediction module 210 is added.
  • the voice recognition model is a model that determines the character corresponding to the input voice by outputting the probability value of the character corresponding to the input voice.
  • specifically, the speech recognition model 300 may include an encoder including a transcription module 310 and a decoder including an integration module 320 to which the learned first prediction module 210 is added.
  • the transcription module 310 may convert the input voice data into a vector value in a virtual space corresponding to the voice data.
  • the integration module 320 to which the learned first prediction module 210 is added can output a character string based on the vector values output from the transcription module 310, together with the probability values that the output string may have.
  • the character string may correspond to the first transcription data.
  • the transcription module 310 and integration module 320 included in the voice recognition model 300 may be composed of a neural network model with nodes, layers, and weights.
  • the neural network model constituting the transcription module 310 and the integration module 320 may be composed of a Convolutional Neural Network (CNN), Deep Neural Network (DNN), or Recurrent Neural Network (RNN). However, it is not limited to this and can be made up of various types of neural network models.
  • at this time, the processor 120 can learn the transcription module 310 and the integration module 320 included in the voice recognition model 300 based on the first voice data, the first text data corresponding to the first voice data, and the first transcription data output by the voice recognition model 300, while maintaining the structure and parameters of the learned first prediction module 210.
  • by learning the transcription module 310 and the integration module 320 while maintaining the structure and parameters of the first prediction module 210 learned as described above, the processor 120 can obtain a transcription module 310 and an integration module 320 adapted to the virtual vector space of the projection module and softmax module 220 included in the learned first language model 200. Accordingly, the processor 120 can learn the voice recognition model 300 more efficiently and improve the voice recognition accuracy of the voice recognition model 300. A sketch of this step follows.
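  • the following sketch illustrates how the prediction module can be held fixed while the transcription and integration modules learn, assuming a transducer-style combination of acoustic and linguistic vectors; the shapes and the joint computation are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

hidden, vocab_size = 256, 1000
transcription = nn.LSTM(80, hidden, batch_first=True)   # encoder side (310)
prediction = nn.LSTM(hidden, hidden, batch_first=True)  # learned module (210)
integration = nn.Linear(2 * hidden, vocab_size)         # integration module (320)

# Freeze the prediction module: its structure and parameters stay as learned.
for p in prediction.parameters():
    p.requires_grad = False

trainable = list(transcription.parameters()) + list(integration.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # only 310 and 320 update

def joint(audio_feats, char_embeds):
    enc, _ = transcription(audio_feats)   # (B, T, H) acoustic vectors
    pred, _ = prediction(char_embeds)     # (B, U, H) linguistic vectors
    T, U = enc.size(1), pred.size(1)
    # Combine every audio frame with every text position (transducer-style).
    enc = enc.unsqueeze(2).expand(-1, T, U, -1)
    pred = pred.unsqueeze(1).expand(-1, T, U, -1)
    return integration(torch.cat([enc, pred], dim=-1))  # (B, T, U, V) logits

# A transducer loss over these logits and the first transcription data would
# then drive optimizer.step(), updating only the unfrozen modules.
logits = joint(torch.randn(2, 50, 80), torch.randn(2, 10, hidden))
```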
  • the processor 120 can learn the second language model 400 based on the second character data and the second predicted character data obtained by inputting the second character data into the second language model 400 to which the learned projection module and the learned softmax module 220 are added.
  • the operation of the second language model 400, that is, of the second prediction module 410 included in the second language model 400 and of the learned projection module and the learned softmax module 220 added to it, is the same as described above for the first language model 200.
  • at this time, the second prediction module 410 can be learned for each domain so as to predict the character that will appear after an input character in relation to a predefined domain (Domain). That is, the processor 120 may learn the second prediction modules 410-1, 410-2, and 410-3 for each domain based on character data and predicted character data related to a specific domain.
  • a predefined domain can be the subject of the character data; for example, for character data such as "When cooking pasta, you must use milk to make the sauce" or "When making gamjatang, you must boil it for 40 minutes," the domain might be "Cooking."
  • the domain of text data such as "It will rain in the central region starting tomorrow afternoon” and “It will be sunny all day tomorrow” may be "Weather”.
  • the processor 120 learns the second prediction module 410 with character data corresponding to a specific domain, thereby enabling learning with a small amount of data for each specific domain and improving the prediction accuracy of the second prediction module 410.
  • additionally, by learning the second prediction module 410 included in the second language model 400 while maintaining the structure and parameters of the learned projection module and the learned softmax module 220, efficient learning is possible: the prediction modules 410-1, 410-2, and 410-3 can be learned using only character data, rather than the pairs of voice data and character data required for learning a general speech recognition model, as sketched below.
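  • a minimal sketch of this text-only, per-domain learning step follows; the domains, shapes, and data are illustrative assumptions. Only the domain-specific prediction module updates, while the projection (and, here, the embedding) stay fixed.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 256
projection = nn.Linear(hidden, vocab_size)  # learned earlier, now frozen
for p in projection.parameters():
    p.requires_grad = False
embed = nn.Embedding(vocab_size, hidden)
embed.weight.requires_grad = False          # kept fixed in this sketch as well

# One second prediction module (410-1, 410-2, ...) per predefined domain.
domain_predictors = {
    "cooking": nn.LSTM(hidden, hidden, batch_first=True),
    "weather": nn.LSTM(hidden, hidden, batch_first=True),
}
loss_fn = nn.CrossEntropyLoss()

def learn_domain(domain, char_ids):
    module = domain_predictors[domain]
    opt = torch.optim.Adam(module.parameters(), lr=1e-3)  # only 410-k updates
    inputs, targets = char_ids[:, :-1], char_ids[:, 1:]
    h, _ = module(embed(inputs))
    loss = loss_fn(projection(h).reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Text-only data: no paired audio is needed for this learning step.
learn_domain("cooking", torch.randint(0, vocab_size, (8, 32)))
```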
  • the processor 120 can learn the first language model 200, the speech recognition model 300, and the second language model 400.
  • a method of performing voice recognition using the learned model will be explained with Figure 4.
  • FIG. 4 is a diagram illustrating an operation in which the electronic device 100 performs voice recognition using a learned voice recognition model, according to an embodiment of the present disclosure.
  • the processor 120 may obtain second transcription data by inputting the second voice data into the voice recognition model 300 learned by adding the learned first prediction module 210.
  • the second voice data may be arbitrary voice data different from the first voice data, and the second transcription data may be text data corresponding to characters included in the second voice data.
  • the processor 120 may input the acquired second transcription data into the domain detection module 500 to identify the domain corresponding to the second transcription data.
  • the subject of the second voice data can be identified based on the domain identified through the domain detection module 500.
  • the processor 120 may identify the second prediction module 410 corresponding to the identified domain among the plurality of learned second prediction modules 410-1, 410-2, and 410-3.
  • the processor 120 can obtain third transcription data, or text data corresponding to the third transcription data, by inputting the second voice data into the learned voice recognition model 300 to which the identified second prediction module 410 is added.
  • the learned speech recognition model to which the second prediction module 410 is added may not include the first prediction module 210.
  • that is, final speech recognition is performed by simply changing the prediction module included in the speech recognition model 300 from the first prediction module 210 to the second prediction module 410 corresponding to a specific domain, without additional computation using a separate language model, so fast and simple voice recognition becomes possible.
  • additionally, the accuracy of voice recognition for each domain can be increased by allowing the prediction module, which is a common sub-model of the voice recognition model, to be exchanged; the inference flow is sketched below.
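  • the two-pass inference flow described above can be sketched as follows; transcribe(), detect_domain(), and the prediction attribute are hypothetical stand-ins introduced for illustration, not APIs defined by the disclosure.

```python
def recognize(second_voice_data, asr_model, first_prediction,
              domain_predictors, detect_domain, transcribe):
    # Pass 1: general-purpose recognition with the first prediction module.
    asr_model.prediction = first_prediction
    second_transcription = transcribe(asr_model, second_voice_data)

    # Identify the subject (domain) of the utterance from the transcript.
    domain = detect_domain(second_transcription)  # e.g., "cooking", "weather"

    # Pass 2: swap in the domain-specific second prediction module (410-k);
    # the first prediction module is no longer part of the model.
    asr_model.prediction = domain_predictors[domain]
    third_transcription = transcribe(asr_model, second_voice_data)
    return third_transcription
```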
  • FIG. 5 is a diagram illustrating a voice recognition model that can be used to control the electronic device 100 according to an embodiment of the present disclosure.
  • referring to FIG. 5, the domain related to the voice is identified first, final voice recognition is then performed based on the identified domain, and transcription data is output, so accurate voice recognition operations can be performed.
  • additionally, since the speech recognition model according to the present disclosure is learned with the structure and parameters of the learned prediction module, projection module, and softmax module fixed, accurate and efficient learning is possible with only a small amount of data.
  • FIG. 6 is a flowchart for explaining the operation of the electronic device 100 according to an embodiment of the present disclosure.
  • the electronic device 100 may obtain first voice data and first text data corresponding to the first voice data (S610).
  • the electronic device 100 can learn the first prediction module, the projection module, and the softmax module included in the first language model based on the first character data and the first predicted character data obtained by inputting the first character data into the first language model (S620).
  • the electronic device 100 can learn the voice recognition model based on the first voice data and the first transcription data obtained by inputting the first voice data into the voice recognition model to which the learned first prediction module is added (S630).
  • at this time, the electronic device 100 can learn the transcription module and the integration module included in the voice recognition model based on the first voice data, the first character data, and the first transcription data while maintaining the structure and parameters of the learned first prediction module.
  • the electronic device 100 can learn the second language model based on the second character data and the second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added (S640).
  • at this time, the electronic device 100 can learn the second prediction module included in the second language model based on the second character data and the second predicted character data while maintaining the structure and parameters of the learned projection module and the learned softmax module.
  • additionally, the electronic device 100 can obtain second transcription data by inputting second voice data into the voice recognition model learned with the learned first prediction module added, and can identify the domain corresponding to the second transcription data by inputting the obtained second transcription data into the domain detection module.
  • the electronic device 100 can identify the second prediction module corresponding to the identified domain among the plurality of learned second prediction modules, and can obtain third transcription data by inputting the second voice data into the learned voice recognition model to which the identified second prediction module is added.
  • the learned speech recognition model to which the second prediction module is added may not include the first prediction module.
  • Functions related to artificial intelligence are operated through the processor and memory of the electronic device.
  • the processor may consist of one or multiple processors.
  • one or more processors may include at least one of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU), but are not limited to the examples of the processors described above.
  • CPU is a general-purpose processor that can perform not only general calculations but also artificial intelligence calculations, and can efficiently execute complex programs through a multi-layer cache structure. CPUs are advantageous for serial processing, which allows organic connection between previous and next calculation results through sequential calculations.
  • the general-purpose processor is not limited to the above-described examples, except where specified as the above-described CPU.
  • GPU is a processor for large-scale operations such as floating-point operations used in graphics processing, and can perform large-scale operations in parallel by integrating a large number of cores.
  • GPUs may be more advantageous than CPUs in parallel processing methods such as convolution operations.
  • the GPU can be used as a co-processor to supplement the functions of the CPU.
  • the processor for mass computation is not limited to the above-described example, except for the case specified as the above-described GPU.
  • NPU is a processor specialized in artificial intelligence calculations using artificial neural networks, and each layer that makes up the artificial neural network can be implemented in hardware (e.g., silicon). At this time, the NPU is designed specifically according to the company's requirements, so it has a lower degree of freedom than a CPU or GPU, but can efficiently process artificial intelligence calculations requested by the company. Meanwhile, as a processor specialized for artificial intelligence calculations, NPU can be implemented in various forms such as TPU (Tensor Processing Unit), IPU (Intelligence Processing Unit), and VPU (Vision processing unit).
  • the artificial intelligence processor is not limited to the examples described above, except where specified as the NPU described above.
  • one or more processors may be implemented as a System on Chip (SoC).
  • in addition to the one or more processors, the SoC may further include a memory and a network interface, such as a bus, for data communication between the processor and the memory.
  • in this case, the electronic device may perform artificial-intelligence-related operations (for example, operations related to learning or inference of an artificial intelligence model) using some of the plurality of processors.
  • for example, the electronic device can perform artificial-intelligence-related operations using at least one of a GPU, NPU, VPU, TPU, or hardware accelerator among its plurality of processors that is specialized for artificial intelligence operations such as convolution and matrix multiplication.
  • this is only an example, and of course, calculations related to artificial intelligence can be processed using general-purpose processors such as CPUs.
  • electronic devices can perform calculations for functions related to artificial intelligence using multiple cores (e.g., dual-core, quad-core, etc.) included in one processor.
  • electronic devices can perform artificial intelligence operations such as convolution operations and matrix multiplication operations in parallel using multi-cores included in the processor.
  • One or more processors control input data to be processed according to predefined operation rules or artificial intelligence models stored in memory.
  • Predefined operation rules or artificial intelligence models are characterized by being created through learning.
  • being created through learning means that a predefined operation rule or artificial intelligence model with desired characteristics is created by applying a learning algorithm to a large number of learning data.
  • This learning may be performed on the device itself that performs the artificial intelligence according to the present disclosure, or may be performed through a separate server/system.
  • An artificial intelligence model may be composed of multiple neural network layers. At least one layer has at least one weight value, and the operation of the layer is performed using the operation result of the previous layer and at least one defined operation.
  • Examples of neural networks include a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), and Deep Q-Networks, but the neural networks in the present disclosure are not limited to the above examples except where specified.
  • a learning algorithm is a method of training a target device (eg, a robot) using a large number of learning data so that the target device can make decisions or make predictions on its own.
  • Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but the learning algorithm in the present disclosure is not limited to the above examples except where specified.
  • Computer program products are commodities and can be traded between sellers and buyers.
  • the computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones).
  • in the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be temporarily stored in, or temporarily created on, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are an electronic device and control method thereof. The electronic device disclosed herein: acquires first speech data and first text data corresponding to the first speech data; trains a first prediction module, a projection module, and a softmax module included in a first language model on the basis of the first text data and first predicted text data acquired by inputting the first text data to the first language model; trains a speech recognition model on the basis of the first speech data and first transcription data acquired by inputting the first speech data to an automatic speech recognition model to which the trained first prediction module has been added; and trains a second language model on the basis of second text data and second predicted text data acquired by inputting the second text data to the second language model to which the trained projection module and softmax module have been added.

Description

전자 장치 및 이의 제어 방법 Electronic devices and methods for controlling the same
본 개시는 전자 장치 및 이의 제어 방법에 관한 것으로, 더욱 상세하게는, 효율적인 음성인식 모델의 학습 방법과 학습된 음성인식 모델을 이용하여 음성에 포함된 주제에 맞는 정확한 음성인식 동작을 제공할 수 있는 전자 장치 및 이의 제어 방법에 관한 것이다.The present disclosure relates to an electronic device and a control method thereof, and more specifically, to an efficient voice recognition model learning method and a method that can provide accurate voice recognition operations appropriate for the subject included in the voice using the learned voice recognition model. It relates to electronic devices and their control methods.
음성에 포함된 문자에 대한 정보를 획득하여 음성인식 동작을 수행하기 위해 음성인식 모델(Automatic Speech Recognition Model)이 활용된다. 음성인식 모델은 음성 데이터가 입력되면 음성 데이터에 대응되는 출력 데이터(또는 전사(Transcription) 데이터)를 획득하고, 출력 데이터에 대응되는 문자 데이터에 기초하여 음성에 포함된 의미를 식별할 수 있도록 한다.A speech recognition model (Automatic Speech Recognition Model) is used to obtain information about characters included in speech and perform speech recognition operations. When voice data is input, the voice recognition model obtains output data (or transcription data) corresponding to the voice data and identifies the meaning contained in the voice based on character data corresponding to the output data.
음성인식 모델로는 음성/언어 정보를 개별 확률 모델로 표현하여 음성인식 동작을 수행하는 확률 모델 기반의 음성인식 모델과 확률 모델 기반의 음성인식 모델의 복잡도를 낮추고 효율적인 음성인식 동작을 수행할 수 있도록 단일 심층 신경망을 사용하는 종단형 음성인식 모델 등이 존재한다.The speech recognition model includes a probability model-based speech recognition model that expresses speech/language information as an individual probability model to perform speech recognition operations, and a probability model-based speech recognition model that reduces the complexity of the probability model-based speech recognition model and performs efficient speech recognition operations. There are longitudinal voice recognition models that use a single deep neural network.
문장 내에서 특정 단어 다음에 나올 단어를 예측하여 보다 자연스러운 완결된 문장을 획득하기 위해 언어 모델(Language Model)이 활용된다. 언어 모델은 특정 단어 다음에 나올 단어에 대한 확률을 할당하고, 단어에 대한 확률 정보에 기초하여 자연스러운 완결된 문장을 획득할 수 있다.A language model is used to obtain a more natural, complete sentence by predicting the word that will appear after a specific word within a sentence. The language model assigns a probability to the word that appears after a specific word, and can obtain a natural, complete sentence based on the probability information about the word.
여기서, 음성인식 모델에 언어 모델에 포함된 문자 예측 모듈(Prediction Module)을 도입하면 더욱 효율적이고 정확한 음성인식 동작을 수행할 수 있다.Here, by introducing a character prediction module included in the language model into the speech recognition model, more efficient and accurate speech recognition operations can be performed.
상술한 목적을 달성하기 위한 본 실시 예에 따른 전자 장치는, 적어도 하나의 인스트럭션을 저장하는 메모리; 및 상기 적어도 하나의 인스트럭션을 실행하는 하나 이상의 프로세서;를 포함하고, 상기 하나 이상의 프로세서는, 제1 음성 데이터 및 상기 제1 음성 데이터에 대응되는 제1 문자 데이터를 획득하고, 상기 제1 문자 데이터 및 상기 제1 문자 데이터를 제1 언어 모델(Language Model)에 입력하여 획득한 제1 예측 문자 데이터에 기초하여 상기 제1 언어 모델에 포함된 제1 예측(Prediction) 모듈, 투사(Projection) 모듈 및 소프트맥스(Softmax) 모듈을 학습하고, 상기 제1 음성 데이터 및 상기 제1 음성 데이터를 상기 학습된 제1 예측 모듈이 부가된 음성인식 모델(Automatic Speech Recognition Model)에 입력하여 획득한 제1 전사(Transcription) 데이터에 기초하여 상기 음성인식 모델을 학습하고, 제2 문자 데이터 및 상기 제2 문자 데이터를 상기 학습된 투사 모듈 및 상기 학습된 소프트맥스 모듈이 부가된 제2 언어 모델에 입력하여 획득한 제2 예측 문자 데이터에 기초하여 상기 제2 언어 모델을 학습하고, 상기 제1 언어 모델 및 제2 언어 모델은, 입력 문자에 대응되는 예측 문자의 확률 값을 출력하여 입력 문자에 대응되는 예측 문자를 판단하는 모델이고, 상기 음성인식 모델은, 입력 음성에 대응되는 문자의 확률 값을 출력하여 입력 음성에 대응되는 문자를 판단하는 모델일 수 있다.An electronic device according to this embodiment for achieving the above-described object includes: a memory storing at least one instruction; and one or more processors executing the at least one instruction, wherein the one or more processors acquire first voice data and first text data corresponding to the first voice data, and execute the first text data and A first prediction module, a projection module, and a soft software included in the first language model based on the first prediction character data obtained by inputting the first character data into a first language model. A first transcription obtained by learning the Softmax module and inputting the first voice data and the first voice data into a speech recognition model (Automatic Speech Recognition Model) to which the learned first prediction module is added. ) Learning the speech recognition model based on data, and inputting the second text data and the second text data into a second language model to which the learned projection module and the learned softmax module are added, a second language obtained The second language model is learned based on the predicted character data, and the first language model and the second language model output a probability value of the predicted character corresponding to the input character to determine the predicted character corresponding to the input character. model, and the voice recognition model may be a model that determines the character corresponding to the input voice by outputting the probability value of the character corresponding to the input voice.
Meanwhile, the speech recognition model may include an encoder including a transcription module and a decoder including an integration module to which the trained first prediction module is added, and the one or more processors may train the transcription module and the integration module included in the speech recognition model based on the first speech data, the first text data, and the first transcription data while keeping the structure and parameters of the trained first prediction module fixed.

Meanwhile, the second language model may include a second prediction module, the trained projection module, and the trained softmax module, and the one or more processors may train the second prediction module included in the second language model based on the second text data and the second predicted text data while keeping the structure and parameters of the trained projection module and the trained softmax module fixed.

Meanwhile, the second text data may be text data related to a predefined domain.

Meanwhile, the one or more processors may obtain second transcription data by inputting second speech data into the trained speech recognition model to which the trained first prediction module is added, and may identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.

Meanwhile, the one or more processors may identify, among a plurality of trained second prediction modules, a second prediction module corresponding to the identified domain, and may obtain third transcription data by inputting the second speech data into the trained speech recognition model to which the identified second prediction module is added.

Meanwhile, the trained speech recognition model to which the second prediction module is added may not include the first prediction module.
A method of controlling an electronic device according to an embodiment of the present disclosure includes: obtaining first speech data and first text data corresponding to the first speech data; training a first prediction module, a projection module, and a softmax module included in a first language model based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; training a speech recognition model based on the first speech data and first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added; and training a second language model based on second text data and second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added, wherein the first and second language models are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character, and the speech recognition model may determine a character corresponding to an input speech by outputting a probability value of the character.
Meanwhile, the speech recognition model may include an encoder including a transcription module and a decoder including an integration module to which the trained first prediction module is added, and training the speech recognition model may include training the transcription module and the integration module included in the speech recognition model based on the first speech data, the first text data, and the first transcription data while keeping the structure and parameters of the trained first prediction module fixed.

Meanwhile, the second language model may include a second prediction module, the trained projection module, and the trained softmax module, and training the second language model may include training the second prediction module included in the second language model based on the second text data and the second predicted text data while keeping the structure and parameters of the trained projection module and the trained softmax module fixed.

Meanwhile, the second text data may be text data related to a predefined domain.

Meanwhile, the control method may further include: obtaining second transcription data by inputting second speech data into the trained speech recognition model to which the trained first prediction module is added; and identifying a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.

Meanwhile, identifying the domain may further include: identifying, among a plurality of trained second prediction modules, a second prediction module corresponding to the identified domain; and obtaining third transcription data by inputting the second speech data into the trained speech recognition model to which the identified second prediction module is added.

Meanwhile, the trained speech recognition model to which the second prediction module is added may not include the first prediction module.

In a non-transitory computer-readable recording medium according to an embodiment of the present disclosure, storing computer instructions that, when executed by a processor of an electronic device, cause the electronic device to perform operations, the operations include: obtaining first speech data and first text data corresponding to the first speech data; training a first prediction module, a projection module, and a softmax module included in a first language model based on the first text data and first predicted text data obtained by inputting the first text data into the first language model; training a speech recognition model based on the first speech data and first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added; and training a second language model based on second text data and second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added, wherein the first and second language models are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character, and the speech recognition model may be a model that determines a character corresponding to an input speech by outputting a probability value of the character.
Aspects, features, and advantages of specific embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings.

FIG. 1 is a diagram illustrating a speech recognition model that can be used to control an electronic device, according to an embodiment of the present disclosure.

FIG. 2 is a block diagram illustrating the configuration of an electronic device, according to an embodiment of the present disclosure.

FIG. 3 is a diagram illustrating an operation in which an electronic device trains a speech recognition model and language models, according to an embodiment of the present disclosure.

FIG. 4 is a diagram illustrating an operation in which an electronic device performs speech recognition using a trained speech recognition model, according to an embodiment of the present disclosure.

FIG. 5 is a diagram illustrating a speech recognition model that can be used to control an electronic device, according to an embodiment of the present disclosure.

FIG. 6 is a flowchart illustrating the operation of an electronic device, according to an embodiment of the present disclosure.
Since the present embodiments may be variously modified and may have several forms, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the scope to the specific embodiments, and should be understood to include various modifications, equivalents, and/or alternatives of the embodiments of the present disclosure. In connection with the description of the drawings, similar reference numerals may be used for similar components.

In describing the present disclosure, when it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the gist of the present disclosure, the detailed description thereof is omitted.

In addition, the following embodiments may be modified in various other forms, and the scope of the technical idea of the present disclosure is not limited to the following embodiments. Rather, these embodiments are provided to make the present disclosure more thorough and complete and to fully convey the technical idea of the present disclosure to those skilled in the art.

The terms used in the present disclosure are used merely to describe specific embodiments and are not intended to limit the scope of rights. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In the present disclosure, expressions such as "have," "may have," "include," or "may include" indicate the presence of the corresponding feature (e.g., a component such as a numerical value, function, operation, or part) and do not exclude the presence of additional features.

In the present disclosure, expressions such as "A or B," "at least one of A and/or B," or "one or more of A and/or B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the cases of (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.

Expressions such as "first," "second," and the like used in the present disclosure may modify various components regardless of order and/or importance, and are used only to distinguish one component from another component without limiting those components.

When a component (e.g., a first component) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another component (e.g., a second component), it should be understood that the component may be directly connected to the other component or may be connected through yet another component (e.g., a third component).

On the other hand, when a component (e.g., a first component) is referred to as being "directly connected" or "directly coupled" to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between the component and the other component.

The expression "configured (or set) to" used in the present disclosure may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the situation. The term "configured (or set) to" may not necessarily mean only "specifically designed to" in terms of hardware.

Instead, in some situations, the expression "a device configured to" may mean that the device is "capable of" performing an operation together with other devices or components. For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing the corresponding operations, or a generic-purpose processor (e.g., a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in a memory device.

In the embodiments, a "module" or "unit" performs at least one function or operation, and may be implemented as hardware, software, or a combination of hardware and software. In addition, a plurality of "modules" or a plurality of "units" may be integrated into at least one module and implemented as at least one processor, except for a "module" or "unit" that needs to be implemented as specific hardware.

Meanwhile, the various elements and regions in the drawings are drawn schematically. Accordingly, the technical idea of the present disclosure is not limited by the relative sizes or spacings drawn in the accompanying drawings.

Hereinafter, embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them.
FIG. 1 is a diagram illustrating a speech recognition model that can be used to control an electronic device, according to an embodiment of the present disclosure.

Referring to FIG. 1, a speech recognition model that can be used to control an electronic device, or that can operate as a component of an electronic device, may output transcription data corresponding to input speech data when the speech data is input. The electronic device may then obtain text data corresponding to the speech data based on the transcription data, and may identify the meaning contained in the speech data based on the obtained text data.

As the speech recognition model, a probability-model-based speech recognition model, which expresses speech/language information as individual probability models to perform the speech recognition operation, or an end-to-end speech recognition model, which uses a single deep neural network to reduce the complexity of the probability-model-based approach and perform the speech recognition operation efficiently, may be used.

The speech recognition model may include an encoder and a decoder. The encoder converts speech data into vector values in a virtual space, and the decoder outputs a character string based on the vector values output from the encoder and the probability values that the output string can take. The output string may correspond to transcription data or text data corresponding to the speech data.

Here, the encoder may be based on a convolutional neural network (CNN) or a deep neural network (DNN), and the decoder may be based on a recurrent neural network (RNN), a CNN, or a DNN; however, they are not limited thereto and may be implemented with various types of neural network models.
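For readers who want a concrete picture, the encoder-decoder split described above can be sketched as follows. This is a minimal illustration only, assuming PyTorch and arbitrary feature and vocabulary sizes; it is not the implementation of the disclosed embodiments.

```python
# Minimal sketch of the encoder/decoder split; all sizes are assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps acoustic features (B, T, F) to vectors in a virtual space."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, feats):
        return self.net(feats)  # (B, T, hidden_dim)

class Decoder(nn.Module):
    """Outputs per-frame character probabilities from encoder vectors."""
    def __init__(self, hidden_dim=256, vocab_size=100):
        super().__init__()
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc):
        h, _ = self.rnn(enc)
        return self.out(h).log_softmax(dim=-1)  # (B, T, vocab_size)

feats = torch.randn(2, 50, 80)           # two utterances, 50 frames each
log_probs = Decoder()(Encoder()(feats))  # character log-probabilities
```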
The electronic device including the above-described speech recognition model may be an electronic device having various purposes and functions. For example, the electronic device may be any of various electronic devices that perform computation through a processor, such as a computer, a robot, a speaker, a display device, or another home appliance, and the electronic device including the speech recognition model according to an embodiment of the present disclosure is not limited to the above-described devices. In addition, the electronic device may be implemented as an electronic device having two or more functions of the above-described devices.

FIG. 2 is a block diagram illustrating the configuration of an electronic device, according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a memory 110 capable of storing at least one instruction and a processor 120 capable of executing the at least one instruction.

The configuration of the electronic device 100 is not limited to the above-described memory 110 and processor 120; in addition to the memory 110 and the processor 120, the electronic device 100 may further include components necessary to perform various operations based on data acquired through the speech recognition model, such as a user interface, a communication interface, a display, a speaker, a microphone, and a camera.
The memory 110 temporarily or non-temporarily stores various programs and data, and transfers the stored information to the processor 120 in response to a call from the processor 120. In addition, the memory 110 may store, in an electronic format, various kinds of information necessary for the computation, processing, or control operations of the processor 120.

The memory 110 may include, for example, at least one of a main memory and an auxiliary memory. The main memory may be implemented using a semiconductor storage medium such as a ROM and/or a RAM. The ROM may include, for example, a conventional ROM, an EPROM, an EEPROM, and/or a mask ROM. The RAM may include, for example, a DRAM and/or an SRAM. The auxiliary memory may be implemented using at least one storage medium capable of permanently or semi-permanently storing data, such as a flash memory device, a Secure Digital (SD) card, a solid state drive (SSD), a hard disk drive (HDD), a magnetic drum, an optical medium such as a compact disc (CD), a DVD, or a laser disc, a magnetic tape, a magneto-optical disc, and/or a floppy disc.

The memory 110 may store the speech recognition model and the language models. Specifically, the memory 110 may store information about the encoder and the decoder included in the speech recognition model, and may store information about the prediction module, the projection module, and the softmax module included in a language model. The memory 110 may also store information about the domain detection module.

The memory 110 may store speech data input to the speech recognition model and transcription data or text data output from the speech recognition model. The memory 110 may store text data input to a language model and predicted text data output from the language model. The memory 110 may store information about domains.

The memory 110 may store information about the neural networks constituting the speech recognition model and the language models. Specifically, the memory 110 may store structural information such as the layers and nodes constituting a neural network, and parameter information such as weights and thresholds.

However, the memory is not limited thereto, and may store various kinds of information that the electronic device 100 generates, processes, or uses in the course of training and using the speech recognition model and the language models.
The processor 120 controls the overall operation of the electronic device 100. Specifically, the processor 120 is connected to the components of the electronic device 100, including the memory 110 described above, and can control the overall operation of the electronic device 100 by executing the at least one instruction stored in the memory 110. In particular, the processor 120 may be implemented not only as a single processor but also as a plurality of processors.

The processor 120 may be implemented in various ways. For example, the one or more processors 120 may include one or more of a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a many integrated core (MIC) processor, a digital signal processor (DSP), a neural processing unit (NPU), a hardware accelerator, and a machine learning accelerator. The one or more processors 120 may control one or any combination of the other components of the electronic device 100, and may perform operations related to communication or data processing. The one or more processors 120 may execute one or more programs or instructions stored in the memory 110. For example, the one or more processors 120 may perform a method according to an embodiment of the present disclosure by executing one or more instructions stored in the memory 110.

When a method according to an embodiment of the present disclosure includes a plurality of operations, the plurality of operations may be performed by one processor or by a plurality of processors. For example, when a first operation, a second operation, and a third operation are performed by the method according to an embodiment, all of the first to third operations may be performed by a first processor, or the first and second operations may be performed by the first processor (e.g., a generic-purpose processor) and the third operation may be performed by a second processor (e.g., a dedicated artificial intelligence processor).

The one or more processors 120 may be implemented as a single-core processor including one core, or as one or more multi-core processors including a plurality of cores (e.g., homogeneous or heterogeneous multi-core). When the one or more processors 120 are implemented as a multi-core processor, each of the plurality of cores included in the multi-core processor may include a processor-internal memory such as an on-chip memory, and a common cache shared by the plurality of cores may be included in the multi-core processor. In addition, each of the plurality of cores (or some of the plurality of cores) included in the multi-core processor may independently read and execute program instructions for implementing a method according to an embodiment of the present disclosure, or all (or some) of the plurality of cores may read and execute such program instructions in cooperation.

When a method according to an embodiment of the present disclosure includes a plurality of operations, the plurality of operations may be performed by one core among the plurality of cores included in the multi-core processor, or by the plurality of cores. For example, when a first operation, a second operation, and a third operation are performed by the method according to an embodiment, all of the first to third operations may be performed by a first core included in the multi-core processor, or the first and second operations may be performed by the first core and the third operation may be performed by a second core included in the multi-core processor.

In the embodiments of the present disclosure, the processor 120 may mean a system on chip (SoC) in which one or more processors and other electronic components are integrated, a single-core processor, a multi-core processor, or a core included in a single-core or multi-core processor; here, the core may be implemented as a CPU, a GPU, an APU, a MIC processor, a DSP, an NPU, a hardware accelerator, or a machine learning accelerator, but the embodiments of the present disclosure are not limited thereto.
The one or more processors 120 (hereinafter, the 'processor 120') may obtain first speech data and first text data corresponding to the first speech data.

The processor 120 may train the first prediction module, the projection module, and the softmax module included in a first language model based on the first text data and first predicted text data obtained by inputting the first text data into the first language model.

The processor 120 may train a speech recognition model based on the first speech data and first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added.

The processor 120 may train a second language model based on second text data and second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added.

Here, the first and second language models are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character, and the speech recognition model may be a model that determines a character corresponding to an input speech by outputting a probability value of the character.

The control operation of the electronic device 100 including the speech recognition model and the language models of the processor 120 will be described in more detail with reference to FIGS. 3 to 5.
FIG. 3 is a diagram illustrating an operation in which the electronic device 100 trains a speech recognition model and language models, according to an embodiment of the present disclosure.

Referring to FIG. 3, the processor 120 may obtain first speech data and first text data corresponding to the first speech data. Here, the first text data may be transcription data obtained by transcribing the first speech data, but is not limited thereto, and may be text data corresponding to the meaning contained in the first speech data.

The processor 120 may obtain first predicted text data by inputting the first text data into the first language model 200.

Here, a language model is a model that determines a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character.

The processor 120 may train the first prediction module 210 and the projection and softmax modules 220 included in the first language model based on the first text data and the first predicted text data obtained by inputting the first text data into the first language model 200.

Here, the prediction module 210 serves to predict and identify the character that may follow the input characters, and the projection and softmax modules 220 serve to identify the vector value in the virtual space corresponding to the predicted character and the probability value corresponding to that vector value.

In addition, the prediction module 210 and the projection and softmax modules 220 included in the language model 200 may be composed of neural network models having nodes, layers, and weights.

The neural network models constituting the prediction module 210 and the projection and softmax modules 220 may be composed of a CNN, a DNN, or the like, but are not limited thereto and may be composed of various types of neural network models.

The language model training operation of the processor 120 may be performed based on the first text data, the first predicted text data, and loss information (e.g., a cross-entropy loss) of the first language model 200. That is, the processor 120 may train the prediction module 210 and the projection and softmax modules 220 by adjusting their parameters (e.g., weights, thresholds, etc.) so as to minimize the loss of the language model 200.
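As an illustration of this training step only: a next-character language model with this prediction/projection/softmax split, trained with a cross-entropy-style loss, might look like the following sketch. The class names, sizes, and optimizer are assumptions, not the disclosed implementation.

```python
# Sketch under assumed hyperparameters; not the disclosed implementation.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 100, 128, 256

class PredictionModule(nn.Module):
    """Predicts a representation of the next character from the prefix."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, U)
        h, _ = self.rnn(self.emb(tokens))
        return h                               # (B, U, HID)

class ProjectionSoftmax(nn.Module):
    """Maps hidden states to per-character log-probabilities."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HID, VOCAB)

    def forward(self, h):
        return self.proj(h).log_softmax(dim=-1)

prediction, proj_softmax = PredictionModule(), ProjectionSoftmax()
optim = torch.optim.Adam([*prediction.parameters(), *proj_softmax.parameters()])
loss_fn = nn.NLLLoss()                        # cross-entropy on log-probs

text = torch.randint(0, VOCAB, (8, 21))       # dummy character-ID sequences
inputs, targets = text[:, :-1], text[:, 1:]   # predict the next character
log_probs = proj_softmax(prediction(inputs))
loss = loss_fn(log_probs.reshape(-1, VOCAB), targets.reshape(-1))
optim.zero_grad(); loss.backward(); optim.step()
```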
The processor 120 may train the speech recognition model 300 based on the first speech data and the first transcription data obtained by inputting the first speech data into the speech recognition model 300 to which the trained first prediction module 210 is added.

Here, the speech recognition model is a model that determines a character corresponding to an input speech by outputting a probability value of the character corresponding to the input speech.

The speech recognition model 300 may include an encoder including a transcription module 310 and a decoder including an integration module 320 to which the trained first prediction module 210 is added.

The transcription module 310 may convert the input speech data into vector values in a virtual space corresponding to the speech data. The integration module 320, to which the trained first prediction module 210 is added, may output a character string based on the vector values output from the transcription module 310 and the probability values that the output string can take, and the output string may correspond to the first transcription data.
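This transcription/prediction/integration layout follows the general shape of transducer-style models. Purely as an assumed sketch (the additive combination and all names are illustrative, not taken from the disclosure), the integration module could combine the acoustic and prediction streams as follows, reusing the PredictionModule sketch above:

```python
# Assumed transducer-style modules; an illustration, not the disclosed design.
import torch
import torch.nn as nn

class TranscriptionModule(nn.Module):
    """Encoder: acoustic frames (B, T, F) -> latent vectors (B, T, H)."""
    def __init__(self, feat_dim=80, hid=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, hid))

    def forward(self, feats):
        return self.net(feats)

class IntegrationModule(nn.Module):
    """Joint step: combines acoustic and prediction vectors per (t, u) pair."""
    def __init__(self, hid=256, vocab=100):
        super().__init__()
        self.fc = nn.Linear(hid, vocab)

    def forward(self, acoustic, prediction):
        # (B, T, 1, H) + (B, 1, U, H) -> (B, T, U, H): every frame/prefix pair
        joint = torch.tanh(acoustic.unsqueeze(2) + prediction.unsqueeze(1))
        return self.fc(joint).log_softmax(dim=-1)   # (B, T, U, vocab)
```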
The transcription module 310 and the integration module 320 included in the speech recognition model 300 may be composed of neural network models having nodes, layers, and weights.

The neural network models constituting the transcription module 310 and the integration module 320 may be composed of a CNN, a DNN, an RNN, or the like. However, they are not limited thereto and may be composed of various types of neural network models.

The processor 120 may train the transcription module 310 and the integration module 320 included in the speech recognition model 300 based on the first speech data, the first text data corresponding to the first speech data, and the first transcription data output from the speech recognition model 300, while keeping the structure and parameters of the trained first prediction module 210 fixed.

By training the transcription module 310 and the integration module 320 while keeping the structure and parameters of the trained first prediction module 210 fixed as described above, the processor 120 can obtain a transcription module 310 and an integration module 320 adapted to the virtual vector space of the projection and softmax modules 220 included in the trained first language model 200. Accordingly, the processor 120 can train the speech recognition model 300 more efficiently and can increase the speech recognition accuracy of the speech recognition model 300.
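In a framework such as PyTorch, "keeping the structure and parameters fixed" is conventionally expressed by disabling gradients on the frozen module and giving the optimizer only the remaining parameters. A hedged sketch, reusing the illustrative classes from the snippets above:

```python
# Illustrative freezing recipe; module classes are the assumed sketches above.
prediction = PredictionModule()          # trained with the first language model
transcription = TranscriptionModule()
integration = IntegrationModule()

for p in prediction.parameters():        # freeze the trained prediction module
    p.requires_grad_(False)

# Only the transcription and integration modules receive gradient updates.
optim = torch.optim.Adam([*transcription.parameters(),
                          *integration.parameters()])
```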
The processor 120 may train the second language model 400 based on the second text data and the second predicted text data obtained by inputting the second text data into the second language model 400 to which the trained projection and softmax modules 220 are added.

Here, the operation of the second language model 400, as well as the operations and configurations of the second prediction module 410 included in the second language model 400 and of the trained projection and softmax modules 220 added to the second language model, are as described above in connection with the first language model 200.

Specifically, the processor 120 may train the second prediction module 410 included in the second language model 400 based on the second predicted text data obtained by inputting the second text data into the second language model 400 to which the trained projection and softmax modules 220 are added, and the second prediction module 410 may be trained per domain so as to predict the character that will follow the input characters in relation to a predefined domain. That is, the processor 120 may train the second prediction modules 410-1, 410-2, and 410-3 for the respective domains based on text data and predicted text data related to a specific domain.

The predefined domain may be the subject of the text data. For example, the domain of text data such as "When cooking pasta, you should make the sauce with milk." and "When making gamjatang, you should boil it thoroughly for 40 minutes." may be "cooking." As another example, the domain of text data such as "It will rain in the central region from tomorrow afternoon." and "Clear weather will continue all day tomorrow." may be "weather."

By training the second prediction module 410 with text data corresponding to a specific domain, the processor 120 makes training possible with a small amount of data for each specific domain and can increase the prediction accuracy of the second prediction module 410.

In addition, by training the second prediction module 410 included in the second language model 400 while keeping the structure and parameters of the trained projection and softmax modules 220 fixed as described above, the prediction modules 410-1, 410-2, and 410-3 can be trained with text data alone, rather than the speech-text data pairs required for training a typical speech recognition model, which enables efficient training.
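A sketch of what this text-only stage could look like, under the same assumptions as the earlier snippets: a fresh prediction module is trained per domain while the shared projection/softmax stack stays frozen. The helper name and domain labels are hypothetical.

```python
# Illustrative text-only domain adaptation; not the disclosed implementation.
# Reuses torch, nn, VOCAB, PredictionModule, and proj_softmax from above.
def train_domain_prediction_module(domain_texts, proj_softmax, epochs=3):
    """Trains a new prediction module on domain text; projection/softmax fixed."""
    prediction = PredictionModule()              # fresh, domain-specific
    for p in proj_softmax.parameters():          # shared stack stays fixed
        p.requires_grad_(False)
    optim = torch.optim.Adam(prediction.parameters())
    loss_fn = nn.NLLLoss()
    for _ in range(epochs):
        for text in domain_texts:                # text: (B, U) character IDs
            inputs, targets = text[:, :-1], text[:, 1:]
            log_probs = proj_softmax(prediction(inputs))
            loss = loss_fn(log_probs.reshape(-1, VOCAB), targets.reshape(-1))
            optim.zero_grad(); loss.backward(); optim.step()
    return prediction

# e.g. one module per domain, keyed by a hypothetical domain label
cooking_texts = [torch.randint(0, VOCAB, (8, 21))]
domain_modules = {"cooking": train_domain_prediction_module(cooking_texts,
                                                            proj_softmax)}
```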
As described above, the processor 120 can train the first language model 200, the speech recognition model 300, and the second language model 400. A method of performing speech recognition using the trained models is described with reference to FIG. 4.

FIG. 4 is a diagram illustrating an operation in which the electronic device 100 performs speech recognition using a trained speech recognition model, according to an embodiment of the present disclosure.

Referring to FIG. 4, the processor 120 may obtain second transcription data by inputting second speech data into the trained speech recognition model 300 to which the trained first prediction module 210 is added.

The second speech data is arbitrary speech data different from the first speech data, and the second transcription data may be text data corresponding to the characters contained in the second speech data.

The processor 120 may identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into the domain detection module 500. That is, the processor 120 can identify the subject of the second speech data based on the domain identified through the domain detection module 500.
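The disclosure does not pin down how the domain detection module 500 is realized; one assumed possibility is a small text classifier over the first-pass transcription, sketched below (architecture and labels are hypothetical):

```python
# One assumed realization of a domain detection module: a bag-of-characters
# classifier over the first-pass transcription. Purely illustrative.
import torch
import torch.nn as nn

class DomainDetector(nn.Module):
    def __init__(self, vocab=100, num_domains=3):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, 64)    # pools over each sequence
        self.fc = nn.Linear(64, num_domains)

    def forward(self, token_ids):                # (B, U) character IDs
        return self.fc(self.emb(token_ids))      # (B, num_domains) logits

detector = DomainDetector()
domain_id = detector(torch.randint(0, 100, (1, 21))).argmax(dim=-1)
```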
The processor 120 may identify, among the plurality of trained second prediction modules 410-1, 410-2, and 410-3, the second prediction module 410 corresponding to the identified domain.

The processor 120 may obtain third transcription data, or text data corresponding to the third transcription data, by inputting the second speech data into the trained speech recognition model 300 to which the identified second prediction module 410 is added.

In addition, the trained speech recognition model to which the second prediction module 410 is added may not include the first prediction module 210.

Accordingly, final speech recognition is performed by simply swapping the prediction module included in the speech recognition model 300 from the first prediction module 210 to the second prediction module 410 corresponding to the specific domain, without any additional computation using a language model, which enables fast and simple speech recognition.
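Putting the assumed pieces together, the two-pass flow (first-pass recognition, domain detection, prediction-module swap, second pass) might be wired up as in the following sketch, which reuses the objects from the earlier snippets; `recognize` stands in for an unspecified decoding routine and is hypothetical, as are the domain labels.

```python
# Assumed two-pass inference flow; `recognize` is a hypothetical, deliberately
# crude stand-in for a real decoding routine (e.g., beam search).
def recognize(transcription, prediction, integration, feats):
    """Placeholder decode: one prediction step, argmax over characters."""
    acoustic = transcription(feats)
    prefix = torch.zeros(feats.size(0), 1, dtype=torch.long)  # start token
    log_probs = integration(acoustic, prediction(prefix))     # (B, T, 1, V)
    return log_probs.argmax(dim=-1).flatten(1)                # (B, T) IDs

feats = torch.randn(1, 50, 80)
first_pass = recognize(transcription, prediction, integration, feats)

domain_id = detector(first_pass).argmax(dim=-1).item()
domain_name = ["cooking", "weather", "music"][domain_id]   # assumed labels

# Swap in the domain-specific prediction module and decode again.
second_pass = recognize(transcription,
                        domain_modules.get(domain_name, prediction),
                        integration, feats)
```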
In addition, by making the prediction module, a sub-model common to the speech recognition model, exchangeable, the accuracy of speech recognition for each domain can be increased.

FIG. 5 is a diagram illustrating a speech recognition model that can be used to control the electronic device 100, according to an embodiment of the present disclosure.

Referring to FIG. 5, when speech data is input to a speech recognition model according to an embodiment of the present disclosure, the domain related to the speech is identified first, and final speech recognition is then performed based on the identified domain to obtain the output transcription data, so that an accurate speech recognition operation can be performed.

In addition, since the speech recognition model according to the present disclosure is trained while the structures and parameters of the trained prediction module and the projection and softmax modules are fixed, accurate and efficient training is possible with only a small amount of data.

FIG. 6 is a flowchart illustrating the operation of the electronic device 100, according to an embodiment of the present disclosure.
The electronic device 100 may obtain first speech data and first text data corresponding to the first speech data (S610).

The electronic device 100 may train the first prediction module, the projection module, and the softmax module included in the first language model based on the first text data and the first predicted text data obtained by inputting the first text data into the first language model (S620).

The electronic device 100 may train the speech recognition model based on the first speech data and the first transcription data obtained by inputting the first speech data into the speech recognition model to which the trained first prediction module is added (S630).

The electronic device 100 may train the transcription module and the integration module included in the speech recognition model based on the first speech data, the first text data, and the first transcription data while keeping the structure and parameters of the trained first prediction module fixed.

The electronic device 100 may train the second language model based on the second text data and the second predicted text data obtained by inputting the second text data into the second language model to which the trained projection module and the trained softmax module are added (S640).

The electronic device 100 may train the second prediction module included in the second language model based on the second text data and the second predicted text data while keeping the structure and parameters of the trained projection module and the trained softmax module fixed.

The electronic device 100 may obtain second transcription data by inputting second speech data into the trained speech recognition model to which the trained first prediction module is added, and may identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into the domain detection module.

The electronic device 100 may identify, among the plurality of trained second prediction modules, the second prediction module corresponding to the identified domain, and may obtain third transcription data by inputting the second speech data into the trained speech recognition model to which the identified second prediction module is added.

Here, the trained speech recognition model to which the second prediction module is added may not include the first prediction module.
Functions related to artificial intelligence according to the present disclosure operate through the processor and memory of the electronic device.

The processor may consist of one or more processors. The one or more processors may include at least one of a central processing unit (CPU), a graphics processing unit (GPU), or a neural processing unit (NPU), but are not limited to these examples.

A CPU is a general-purpose processor that can perform not only general computation but also artificial intelligence computation, and it can efficiently execute complex programs through a multi-level cache structure. A CPU is well suited to serial processing, in which sequential computation lets the result of one calculation feed directly into the next. The general-purpose processor is not limited to this example, except where it is explicitly specified as a CPU.

A GPU is a processor for large-scale operations, such as the floating-point operations used in graphics processing, and it can perform such operations in parallel by integrating a large number of cores. In particular, a GPU may be more advantageous than a CPU for parallel processing such as convolution operations. A GPU may also serve as a co-processor that supplements the functions of a CPU. The processor for large-scale operations is not limited to this example, except where it is explicitly specified as a GPU.

An NPU is a processor specialized for artificial intelligence computation using artificial neural networks, in which each layer of the neural network may be implemented in hardware (e.g., silicon). Because an NPU is designed to a vendor's required specifications, it offers less flexibility than a CPU or GPU, but it can efficiently process the artificial intelligence computations the vendor requires. As a processor specialized for artificial intelligence computation, an NPU may be implemented in various forms, such as a tensor processing unit (TPU), an intelligence processing unit (IPU), or a vision processing unit (VPU). The artificial intelligence processor is not limited to these examples, except where it is explicitly specified as an NPU.

In addition, the one or more processors may be implemented as a system on chip (SoC). Besides the one or more processors, the SoC may further include memory and a network interface, such as a bus, for data communication between the processor and the memory.

When the SoC included in the electronic device includes a plurality of processors, the electronic device may use some of those processors to perform artificial intelligence-related operations (for example, operations related to learning or inference of an artificial intelligence model). For example, the electronic device may perform artificial intelligence-related operations using at least one of a GPU, an NPU, a VPU, a TPU, or a hardware accelerator specialized for artificial intelligence operations such as convolution and matrix multiplication. This is only an example, however; such operations may of course also be processed by a general-purpose processor such as a CPU.

The electronic device may also perform operations for artificial intelligence-related functions using multiple cores (for example, dual-core or quad-core) included in a single processor. In particular, the electronic device can use the multiple cores to perform artificial intelligence operations such as convolution and matrix multiplication in parallel, as the sketch below illustrates.
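As a rough sketch of how such an operation might be dispatched (PyTorch assumed; the thread count is an arbitrary placeholder), a matrix multiplication can be routed to a specialized processor when one is available and otherwise run in parallel across CPU cores:

    import torch

    # Prefer a processor specialized for AI operations when available;
    # otherwise fall back to the general-purpose CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if device.type == "cpu":
        torch.set_num_threads(4)  # spread the work across multiple cores

    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    c = a @ b  # matrix multiplication executed in parallel on the chosen device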
One or more processors control input data to be processed according to predefined operation rules or an artificial intelligence model stored in memory. The predefined operation rules or artificial intelligence model are characterized by being created through learning.

Here, being created through learning means that a predefined operation rule or an artificial intelligence model with desired characteristics is created by applying a learning algorithm to a large amount of training data. Such learning may be performed on the device on which the artificial intelligence according to the present disclosure runs, or through a separate server or system.

An artificial intelligence model may consist of a plurality of neural network layers. Each layer has at least one weight value and performs its operation using the result of the previous layer and at least one defined operation. Examples of neural networks include a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, and the Transformer; the neural network in the present disclosure is not limited to these examples, except where specified.
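A minimal sketch of this layer-by-layer computation, assuming PyTorch and an arbitrary toy architecture, is shown below; each layer applies its weight values and a defined operation to the result of the previous layer:

    import torch
    import torch.nn as nn

    # Each layer holds weight values and computes its output from the
    # previous layer's result through a defined operation (here, an
    # affine transform followed by a ReLU).
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 8),
    )

    x = torch.randn(1, 16)  # input to the first layer
    y = model(x)            # each layer consumes the previous layer's result
    print(y.shape)          # torch.Size([1, 8])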
A learning algorithm is a method of training a target device (for example, a robot) using a large amount of training data so that the target device can make decisions or predictions on its own. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning; the learning algorithm in the present disclosure is not limited to these examples, except where specified.
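For example, a supervised learning algorithm in this sense adjusts a model's weight values from labeled training data; the sketch below is a minimal illustration with synthetic data standing in for real training data (PyTorch assumed):

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Synthetic labeled data standing in for real training data.
    inputs = torch.randn(64, 4)
    labels = torch.randint(0, 2, (64,))

    for _ in range(10):                       # a few supervised learning steps
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()                       # compute gradients
        optimizer.step()                      # update the weight values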
According to one embodiment, methods according to the various embodiments disclosed in this document may be provided as part of a computer program product. A computer program product is a commodity that can be traded between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or it may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least part of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created on, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

While preferred embodiments of the present disclosure have been shown and described above, the present disclosure is not limited to the specific embodiments described, and various modifications may be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims; such modifications should not be understood in isolation from the technical spirit or scope of the present disclosure.
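The claims that follow recite a three-stage training flow: (1) train the prediction, projection, and softmax modules of a first language model on character data; (2) attach the learned first prediction module, frozen, to the speech recognition model and train the remaining modules on voice data; (3) train a domain-specific second prediction module while the learned projection and softmax modules stay frozen. As a minimal, self-contained sketch of that staged freezing (PyTorch assumed; all sizes, modules, and data are synthetic placeholders, not the actual implementation):

    import torch
    import torch.nn as nn

    V, H, F = 100, 32, 40  # placeholder vocabulary, hidden, and feature sizes

    prediction1 = nn.Embedding(V, H)   # first prediction module
    projection = nn.Linear(H, H)       # projection module
    softmax_head = nn.Linear(H, V)     # softmax module (produces logits)
    loss_fn = nn.CrossEntropyLoss()

    def lm_logits(prediction, chars):
        # Language model path: prediction -> projection -> softmax.
        return softmax_head(projection(prediction(chars)))

    def freeze(*modules):
        # Keep structure and parameters fixed during later stages.
        for m in modules:
            for p in m.parameters():
                p.requires_grad = False

    chars1 = torch.randint(0, V, (8, 12))  # synthetic first character data

    # Stage 1: train the first language model (prediction, projection,
    # softmax) to predict the next character.
    opt = torch.optim.Adam([*prediction1.parameters(),
                            *projection.parameters(),
                            *softmax_head.parameters()], lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        logits = lm_logits(prediction1, chars1[:, :-1])
        loss_fn(logits.reshape(-1, V), chars1[:, 1:].reshape(-1)).backward()
        opt.step()

    # Stage 2: add the learned first prediction module (frozen) to the
    # speech recognition model and train its transcription and
    # integration modules on first voice / first transcription data.
    freeze(prediction1)
    transcription = nn.Linear(F, H)    # stand-in acoustic encoder
    integration = nn.Linear(2 * H, V)  # joins acoustic and linguistic streams
    speech1 = torch.randn(8, 12, F)    # synthetic first voice data
    opt = torch.optim.Adam([*transcription.parameters(),
                            *integration.parameters()], lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        joined = torch.cat([transcription(speech1[:, :-1]),
                            prediction1(chars1[:, :-1])], dim=-1)
        loss_fn(integration(joined).reshape(-1, V),
                chars1[:, 1:].reshape(-1)).backward()
        opt.step()

    # Stage 3: train a second, domain-specific prediction module with the
    # learned projection and softmax modules frozen, on second character data.
    freeze(projection, softmax_head)
    prediction2 = nn.Embedding(V, H)       # second prediction module
    chars2 = torch.randint(0, V, (8, 12))  # synthetic domain character data
    opt = torch.optim.Adam(prediction2.parameters(), lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        logits = lm_logits(prediction2, chars2[:, :-1])
        loss_fn(logits.reshape(-1, V), chars2[:, 1:].reshape(-1)).backward()
        opt.step()

The freezing in stages 2 and 3 mirrors the "maintaining the structure and parameters" language of the claims: only the newly attached modules receive gradient updates.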

Claims (15)

1. An electronic device comprising:
    a memory storing at least one instruction; and
    one or more processors configured to execute the at least one instruction,
    wherein the one or more processors are configured to:
    obtain first voice data and first character data corresponding to the first voice data,
    train a first prediction module, a projection module, and a softmax module included in a first language model, based on the first character data and on first predicted character data obtained by inputting the first character data into the first language model,
    train an automatic speech recognition model, based on the first voice data and on first transcription data obtained by inputting the first voice data into the speech recognition model to which the learned first prediction module is added, and
    train a second language model, based on second character data and on second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added,
    wherein the first language model and the second language model are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character, and
    wherein the speech recognition model is a model that determines a character corresponding to an input voice by outputting a probability value of the character corresponding to the input voice.
2. The electronic device of claim 1, wherein
    the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and
    the one or more processors are configured to train the transcription module and the integration module included in the speech recognition model, based on the first voice data, the first character data, and the first transcription data, while maintaining the structure and parameters of the learned first prediction module.
3. The electronic device of claim 1, wherein
    the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and
    the one or more processors are configured to train the second prediction module included in the second language model, based on the second character data and the second predicted character data, while maintaining the structure and parameters of the learned projection module and the learned softmax module.
4. The electronic device of claim 3, wherein
    the second character data is character data related to a predefined domain.
5. The electronic device of claim 1, wherein the one or more processors are configured to:
    obtain second transcription data by inputting second voice data into the learned speech recognition model to which the learned first prediction module is added, and
    identify a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
6. The electronic device of claim 5, wherein the one or more processors are configured to:
    identify, among a plurality of learned second prediction modules, a second prediction module corresponding to the identified domain, and
    obtain third transcription data by inputting the second voice data into the learned speech recognition model to which the identified second prediction module is added.
7. The electronic device of claim 6, wherein
    the learned speech recognition model to which the second prediction module is added does not include the first prediction module.
8. A method of controlling an electronic device, the method comprising:
    obtaining first voice data and first character data corresponding to the first voice data;
    training a first prediction module, a projection module, and a softmax module included in a first language model, based on the first character data and on first predicted character data obtained by inputting the first character data into the first language model;
    training an automatic speech recognition model, based on the first voice data and on first transcription data obtained by inputting the first voice data into the speech recognition model to which the learned first prediction module is added; and
    training a second language model, based on second character data and on second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added,
    wherein the first language model and the second language model are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character, and
    wherein the speech recognition model is a model that determines a character corresponding to an input voice by outputting a probability value of the character corresponding to the input voice.
9. The method of claim 8, wherein
    the speech recognition model includes an encoder including a transcription module and a decoder including an integration module to which the learned first prediction module is added, and
    training the speech recognition model comprises training the transcription module and the integration module included in the speech recognition model, based on the first voice data, the first character data, and the first transcription data, while maintaining the structure and parameters of the learned first prediction module.
10. The method of claim 8, wherein
    the second language model includes a second prediction module, the learned projection module, and the learned softmax module, and
    training the second language model comprises training the second prediction module included in the second language model, based on the second character data and the second predicted character data, while maintaining the structure and parameters of the learned projection module and the learned softmax module.
11. The method of claim 10, wherein
    the second character data is character data related to a predefined domain.
12. The method of claim 8, further comprising:
    obtaining second transcription data by inputting second voice data into the learned speech recognition model to which the learned first prediction module is added; and
    identifying a domain corresponding to the second transcription data by inputting the obtained second transcription data into a domain detection module.
13. The method of claim 12, wherein identifying the domain further comprises:
    identifying, among a plurality of learned second prediction modules, a second prediction module corresponding to the identified domain; and
    obtaining third transcription data by inputting the second voice data into the learned speech recognition model to which the identified second prediction module is added.
14. The method of claim 13, wherein
    the learned speech recognition model to which the second prediction module is added does not include the first prediction module.
15. A non-transitory computer-readable recording medium storing computer instructions which, when executed by a processor of an electronic device, cause the electronic device to perform operations comprising:
    obtaining first voice data and first character data corresponding to the first voice data;
    training a first prediction module, a projection module, and a softmax module included in a first language model, based on the first character data and on first predicted character data obtained by inputting the first character data into the first language model;
    training an automatic speech recognition model, based on the first voice data and on first transcription data obtained by inputting the first voice data into the speech recognition model to which the learned first prediction module is added; and
    training a second language model, based on second character data and on second predicted character data obtained by inputting the second character data into the second language model to which the learned projection module and the learned softmax module are added,
    wherein the first language model and the second language model are models that determine a predicted character corresponding to an input character by outputting a probability value of the predicted character corresponding to the input character, and
    wherein the speech recognition model is a model that determines a character corresponding to an input voice by outputting a probability value of the character corresponding to the input voice.
PCT/KR2023/010131 2022-09-08 2023-07-14 Electronic device and control method thereof WO2024053842A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2022-0114382 2022-09-08
KR1020220114382A KR20240035197A (en) 2022-09-08 2022-09-08 Electronic apparatus and method for controlling thereof

Publications (1)

Publication Number Publication Date
WO2024053842A1 true WO2024053842A1 (en) 2024-03-14

Family

ID=90191545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/010131 WO2024053842A1 (en) 2022-09-08 2023-07-14 Electronic device and control method thereof

Country Status (2)

Country Link
KR (1) KR20240035197A (en)
WO (1) WO2024053842A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
KR20210073252A (en) * 2019-12-10 2021-06-18 엘지전자 주식회사 Artificial intelligence device and operating method thereof
KR20210088461A (en) * 2020-10-15 2021-07-14 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method and apparatus for training network, device, storage medium and program
KR20220053491A (en) * 2020-10-22 2022-04-29 삼성전자주식회사 Electronic device and controlling method of electronic device
KR20220082790A (en) * 2021-06-09 2022-06-17 아폴로 인텔리전트 커넥티비티 (베이징) 테크놀로지 씨오., 엘티디. Method and apparatus for processing audio signal, method and apparatus for training model, electronic device , storage medium and computer program


Also Published As

Publication number Publication date
KR20240035197A (en) 2024-03-15

Similar Documents

Publication Publication Date Title
WO2021177730A1 (en) Apparatus for diagnosing disease causing voice and swallowing disorders and method for diagnosing same
WO2019098414A1 (en) Method and device for hierarchical learning of neural network, based on weakly supervised learning
WO2021071110A1 (en) Electronic apparatus and method for controlling electronic apparatus
WO2019050297A1 (en) Neural network learning method and device
WO2022146050A1 (en) Federated artificial intelligence training method and system for depression diagnosis
EP3942481A1 (en) Method of performing, by electronic device, convolution operation at certain layer in neural network, and electronic device therefor
WO2021149878A1 (en) Electronic device and control method thereof
WO2018212584A2 (en) Method and apparatus for classifying class, to which sentence belongs, using deep neural network
WO2022145611A1 (en) Electronic device and method for controlling same
WO2020060311A1 (en) Electronic device and method for providing or obtaining data for training thereof
WO2024053842A1 (en) Electronic device and control method thereof
WO2019198900A1 (en) Electronic apparatus and control method thereof
WO2022177091A1 (en) Electronic device and method for controlling same
EP3824384A1 (en) Electronic device and controlling method of electronic device
WO2024049025A1 (en) Electronic device for training speech recognition model and control method thereof
WO2021045434A1 (en) Electronic device and control method therefor
WO2024053825A1 (en) Electronic device for training voice recognition model, and control method therefor
WO2022191366A1 (en) Electronic device and method of controlling same
WO2019107624A1 (en) Sequence-to-sequence translation method and apparatus therefor
WO2022092445A1 (en) Electronic device and control method thereof
WO2024048990A1 (en) Electronic device and control method therefor
WO2024034830A1 (en) Electronic apparatus for clustering graph data on basis of gnn and control method therefor
WO2024101652A1 (en) Electronic device and control method therefor
WO2024101668A1 (en) Electronic device and method for controlling same
WO2022177089A1 (en) Electronic device and control method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23863323

Country of ref document: EP

Kind code of ref document: A1