KR20220069777A

KR20220069777A - Electronic apparatus, control method thereof and electronic system

Info

Publication number: KR20220069777A
Application number: KR1020210027704A
Authority: KR
Inventors: 김은향; 김광윤; 김성수; 박준모; 다이리아 샌디아나; 한창우
Original assignee: 삼성전자주식회사
Priority date: 2020-11-20
Filing date: 2021-03-02
Publication date: 2022-05-27

Abstract

An electronic device is disclosed. The electronic device comprises: a microphone; a memory storing a first neural network model and a second neural network model; a communication interface; and a processor connected to the microphone, the memory, and the communication interface to control the electronic device. When a user's speech is received through the microphone, the processor inputs the speech to the first neural network model to acquire a calculation result, inputs the calculation result to the second neural network model to identify at least one device corresponding to the speech, and controls the communication interface to transmit the calculation result to the at least one device. The first neural network model is obtained by additionally training only a partial set of layers of a third neural network model, which is trained to identify text from speech, and is configured to include only those additionally trained partial layers. The second neural network model is trained to identify devices corresponding to speech. Accordingly, the electronic device can identify a device corresponding to the user's speech and enable the identified device to perform an operation corresponding to that speech.

Description

ELECTRONIC APPARATUS, CONTROL METHOD THEREOF AND ELECTRONIC SYSTEM

The present disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device that controls at least one of a plurality of devices based on a user's voice, a control method thereof, and an electronic system.

With the development of electronic technology, various types of devices have been developed and distributed, and devices equipped with various communication functions are widely used in most ordinary homes. Furthermore, communication functions are being added even to devices that previously had none, creating a multi device experience (MDE) environment.

In particular, techniques have recently been developed for recognizing a user's voice using a neural network model and controlling devices in an MDE environment based on the recognition result.

However, when a new device is added to the MDE environment, there is a limit to using the existing neural network model as it is. For example, when a new device is added, the existing neural network model has no information about the new device and may therefore malfunction. Alternatively, the neural network model must be retrained for the new device, which inconveniences the user.

To address this, voice recognition may be performed using a server. In this case, the above problem can be solved by updating the neural network model stored on the server periodically or in response to a specific event. However, using a server may cause problems such as leakage of personal information.

Accordingly, there is a need to develop a method that recognizes a user's voice without a server, using lightweight models on the devices included in the MDE environment, while flexibly maintaining voice recognition performance without inconveniencing the user even when a new device is added.

The present disclosure addresses the above-described need, and an object of the present disclosure is to provide an electronic device, a control method thereof, and an electronic system that maintain voice recognition performance and operate with minimal delay even when the device configuration changes in a multi device experience (MDE) environment.

According to an embodiment of the present disclosure for achieving the above object, an electronic device includes a microphone, a memory storing a first neural network model and a second neural network model, a communication interface, and a processor connected to the microphone, the memory, and the communication interface to control the electronic device. When a user voice is received through the microphone, the processor inputs the user voice to the first neural network model to obtain an operation result, inputs the operation result to the second neural network model to identify at least one device corresponding to the user voice, and controls the communication interface to transmit the operation result to the at least one device. The first neural network model may be a model in which only a partial set of layers of a third neural network model, trained to identify text from voice, is additionally trained and which is then configured to include only the additionally trained partial layers, and the second neural network model may be a model trained to identify a device corresponding to a voice.

In addition, the processor may input the user voice to the first neural network model in units of a preset time interval to obtain the operation result for each preset time unit, input the operation result obtained for each preset time unit to the second neural network model to identify the at least one device for each preset time unit, and control the communication interface to transmit the operation result obtained for each preset time unit to the at least one device identified for that time unit.
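
The per-time-unit processing described above can be sketched as a simple streaming loop. This is only an illustration under assumed names: `split_into_chunks`, `stream_voice`, and the `encode`, `route`, and `send` callables (standing in for the first neural network model, the second neural network model, and the communication interface) are hypothetical and do not appear in the disclosure.

```python
from typing import Callable, Iterator, List, Sequence


def split_into_chunks(samples: Sequence[float], sample_rate: int,
                      chunk_ms: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices that each cover chunk_ms milliseconds of audio."""
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def stream_voice(samples: Sequence[float], sample_rate: int, chunk_ms: int,
                 encode: Callable[[Sequence[float]], List[float]],
                 route: Callable[[List[float]], List[str]],
                 send: Callable[[str, List[float]], None]) -> int:
    """For every chunk: encode it, pick target devices, transmit the result.

    encode stands in for the first model, route for the second model,
    and send for the communication interface (all hypothetical).
    Returns the number of transmissions made.
    """
    sent = 0
    for chunk in split_into_chunks(samples, sample_rate, chunk_ms):
        result = encode(chunk)            # operation result for this time unit
        for device_id in route(result):   # devices identified for this time unit
            send(device_id, result)
            sent += 1
    return sent
```

Because each chunk is encoded, routed, and sent before the next chunk is read, a receiving device can start decoding before the utterance ends, which is one plausible reading of the delay reduction the disclosure aims for.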

The memory may further store information on a plurality of devices and information on a plurality of projection layers. The processor may identify, based on the information on the plurality of devices, information on a second dimension that the at least one device can process. When a first dimension of the operation result differs from the second dimension, the processor may change the operation result to the second dimension based on the projection layer, among the plurality of projection layers, that corresponds to the first dimension and the second dimension, and control the communication interface to transmit the operation result changed to the second dimension to the at least one device.
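
A minimal sketch of this dimension matching, assuming each projection layer is a plain linear map stored as a weight matrix and indexed by the pair (first dimension, second dimension). The names `project` and `fit_to_device` and the dictionary layout are hypothetical; the disclosure does not specify how projection layers are stored.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # shape: out_dim rows, each with in_dim weights


def project(vector: List[float], weights: Matrix) -> List[float]:
    """Apply a linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def fit_to_device(result: List[float], device_dim: int,
                  projections: Dict[Tuple[int, int], Matrix]) -> List[float]:
    """Return the operation result in the dimension the device can process."""
    src_dim = len(result)
    if src_dim == device_dim:
        return result  # dimensions already match, nothing to change
    key = (src_dim, device_dim)
    if key not in projections:
        raise KeyError(f"no projection layer for {src_dim} -> {device_dim}")
    return project(result, projections[key])
```

For example, with a stored (4, 2) projection, a four-dimensional operation result is reduced to two dimensions before transmission, while a result that already has two dimensions is sent unchanged.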

In addition, the memory may further store information on a plurality of devices and the remaining layers of the third neural network model. When the processor identifies, based on the information on the plurality of devices, that the at least one device has no voice recognition function, the processor may input the operation result to the remaining layers to obtain text corresponding to the user voice, and control the communication interface to transmit the obtained text to the at least one device.

The processor may also input the operation result to the second neural network model to obtain a score for each of a plurality of devices, and control the communication interface to transmit the operation result to each device whose score is equal to or greater than a threshold value.
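
The threshold rule above amounts to a filter over per-device scores. In the sketch below, the `scores` dictionary stands in for the output of the second neural network model, and the function name, device identifiers, and threshold value are all hypothetical.

```python
from typing import Dict, List


def select_devices(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return every device whose score is at or above the threshold."""
    return [device for device, score in scores.items() if score >= threshold]


# Hypothetical scores for one operation result.
targets = select_devices({"tv": 0.8, "aircon": 0.5, "fridge": 0.1}, threshold=0.5)
```

Note that more than one device can pass the threshold, in which case the same operation result is sent to each of them, and no device is selected if every score falls below it.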

In addition, the memory may further store information on a plurality of projection layers and the remaining layers of the third neural network model. When a first response is received from the at least one device after the operation result is transmitted to the at least one device, the processor may control the communication interface to transmit subsequently obtained operation results to the at least one device. When a second response is received from the at least one device after the operation result is transmitted, the processor may process the operation result with one of the plurality of projection layers or input it to the remaining layers.

When the second response is information on a dimension that the at least one device can process, the processor may change the dimension of the operation result based on the projection layer, among the plurality of projection layers, that corresponds to the dimension of the operation result and the dimension that the at least one device can process, and control the communication interface to transmit the dimension-changed operation result to the at least one device. When the second response is information indicating that the operation information cannot be processed, the processor may input the operation result to the remaining layers to obtain text corresponding to the user voice, and control the communication interface to transmit the obtained text to the at least one device.
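
The response handling above can be sketched as a small dispatch function. The `Response` type, its `kind` labels, and the `decode_to_text` callable (standing in for the remaining layers of the third neural network model) are hypothetical; the disclosure only distinguishes a first response from two variants of a second response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

Matrix = List[List[float]]


@dataclass
class Response:
    # "ok" models the first response; "need_dim" and "cannot_process"
    # model the two variants of the second response (labels are assumptions).
    kind: str
    dim: Optional[int] = None


def handle_response(resp: Response, result: List[float],
                    projections: Dict[Tuple[int, int], Matrix],
                    decode_to_text: Callable[[List[float]], str]
                    ) -> Tuple[str, Union[List[float], str]]:
    """Decide what to transmit next, returned as a (payload_type, payload) pair."""
    if resp.kind == "ok":
        return ("result", result)  # keep sending operation results as they are
    if resp.kind == "need_dim":
        # Device reported the dimension it can process: apply the matching projection.
        weights = projections[(len(result), resp.dim)]
        projected = [sum(w * x for w, x in zip(row, result)) for row in weights]
        return ("result", projected)
    if resp.kind == "cannot_process":
        # Device cannot use operation results at all: decode locally and send text.
        return ("text", decode_to_text(result))
    raise ValueError(f"unknown response kind: {resp.kind}")
```

The sender therefore degrades gracefully: it prefers sending the raw operation result, falls back to a projected result, and sends plain text only as a last resort.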

In addition, the first neural network model may be a model in which, with the weight values of the remaining layers of the third neural network model fixed, the partial layers are additionally trained based on a plurality of sample user voices corresponding to the electronic device and a plurality of sample texts corresponding to the plurality of sample user voices, and which is then configured to include only the additionally trained partial layers.
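
The idea of fixing some weights while additionally training others can be illustrated with a deliberately tiny toy model. This is not the training procedure of the disclosure: each "layer" here is a single scalar weight, the loss is a squared error on numeric targets rather than text, and the names `forward` and `fine_tune` are hypothetical.

```python
from typing import List, Sequence, Tuple


def forward(layers: Sequence[float], x: float) -> float:
    """Toy network: each layer multiplies its input by one scalar weight."""
    for w in layers:
        x = w * x
    return x


def fine_tune(layers: Sequence[float], trainable: Sequence[int],
              samples: Sequence[Tuple[float, float]],
              lr: float = 0.05, epochs: int = 200) -> List[float]:
    """Gradient descent on squared error that updates only the trainable layers."""
    layers = list(layers)
    for _ in range(epochs):
        for x, target in samples:
            err = forward(layers, x) - target
            grads = {}
            for i in trainable:
                others = 1.0  # product of all weights except layer i
                for j, w in enumerate(layers):
                    if j != i:
                        others *= w
                grads[i] = 2.0 * err * others * x
            for i, g in grads.items():
                layers[i] -= lr * g  # frozen layers are never touched
    return layers


# Layer 0 plays the partial layers; layer 1 plays the frozen remaining layers.
tuned = fine_tune([0.5, 2.0], trainable=[0], samples=[(1.0, 3.0), (0.5, 1.5)])
first_model = tuned[:1]  # keep only the additionally trained layer
```

After training, only the first weight has moved and the second is unchanged, so slicing off `tuned[:1]` mirrors the step of configuring a model that contains only the additionally trained layers.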

The at least one device may input the operation result to a fourth neural network model stored in the at least one device to obtain text corresponding to the user voice, and perform an operation corresponding to the obtained text. The fourth neural network model may be a model in which, with the weight values of the partial layers fixed, the remaining layers of the third neural network model are additionally trained based on a plurality of sample user voices corresponding to the at least one device and a plurality of sample texts corresponding to the plurality of sample user voices, and which is then configured to include only the additionally trained remaining layers.

According to an embodiment of the present disclosure for achieving the above object, an electronic system includes: an electronic device that, when a user voice is received, inputs the user voice to a first neural network model to obtain an operation result, inputs the operation result to a second neural network model to identify at least one device corresponding to the user voice, and transmits the operation result to the at least one device; and at least one device that inputs the operation result to a fourth neural network model to obtain text corresponding to the user voice and performs an operation corresponding to the obtained text. The first neural network model may be a model in which only a partial set of layers of a third neural network model, trained to identify text from voice, is additionally trained and which is then configured to include only the additionally trained partial layers. The second neural network model may be a model trained to identify a device corresponding to a voice. The fourth neural network model may be a model in which only the remaining layers of the third neural network model are additionally trained and which is then configured to include only the additionally trained remaining layers.

In addition, when a first dimension of the operation result differs from a second dimension of the input of the fourth neural network model, the at least one device may identify whether a projection layer corresponding to the first dimension exists, transmit a first signal to the electronic device if the projection layer exists, and transmit a second signal including information on the second dimension to the electronic device if the projection layer does not exist.
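
On the receiving device, this check can be sketched as follows. The function name, the dictionary reply format, and the set of locally available projection input dimensions are hypothetical. The paragraph above does not say what the device sends when the two dimensions already match; treating that case like the first signal is an assumption made here.

```python
from typing import Dict, Set, Union


def device_reply(result_dim: int, model_input_dim: int,
                 local_projection_dims: Set[int]) -> Dict[str, Union[str, int]]:
    """Build the signal a device sends back after receiving an operation result.

    local_projection_dims holds the input dimensions for which the device
    itself stores a projection layer (hypothetical representation).
    """
    if result_dim == model_input_dim:
        # Assumption: matching dimensions are acknowledged like the first signal.
        return {"signal": "first"}
    if result_dim in local_projection_dims:
        # Device can project locally, so the sender may keep transmitting as is.
        return {"signal": "first"}
    # No usable projection layer: report the dimension this device needs.
    return {"signal": "second", "dim": model_input_dim}
```

For instance, a device whose model expects 256-dimensional input and that stores a projection layer for 512-dimensional results replies with the first signal to a 512-dimensional result, but replies with the second signal carrying 256 to a 1024-dimensional one.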

When the first signal is received, the electronic device may continue transmitting the operation result to the at least one device. When the second signal is received, the electronic device may either project the operation result to the second dimension and transmit the projected operation result to the at least one device, or decode the operation result and transmit the decoded operation result to the at least one device.

Meanwhile, according to an embodiment of the present disclosure, a method of controlling an electronic device includes: when a user voice is received, inputting the user voice to a first neural network model to obtain an operation result; inputting the operation result to a second neural network model to identify at least one device corresponding to the user voice; and transmitting the operation result to the at least one device. The first neural network model may be a model in which only a partial set of layers of a third neural network model, trained to identify text from voice, is additionally trained and which is then configured to include only the additionally trained partial layers, and the second neural network model may be a model trained to identify a device corresponding to a voice.

In addition, the obtaining may include inputting the user voice to the first neural network model in units of a preset time interval to obtain the operation result for each preset time unit; the identifying may include inputting the operation result obtained for each preset time unit to the second neural network model to identify the at least one device for each preset time unit; and the transmitting may include transmitting the operation result obtained for each preset time unit to the at least one device identified for that time unit.

The method may further include: identifying, based on information on a plurality of devices, information on a second dimension that the at least one device can process; and, when a first dimension of the operation result differs from the second dimension, changing the operation result to the second dimension based on the projection layer, among a plurality of projection layers, that corresponds to the first dimension and the second dimension. The transmitting may include transmitting the operation result changed to the second dimension to the at least one device.

In addition, the method may further include, when it is identified based on the information on the plurality of devices that the at least one device has no voice recognition function, inputting the operation result to the remaining layers of the third neural network model to obtain text corresponding to the user voice. The transmitting may include transmitting the obtained text to the at least one device.

The identifying may include inputting the operation result to the second neural network model to obtain a score for each of a plurality of devices, and the transmitting may include transmitting the operation result to each device whose score is equal to or greater than a threshold value.

In addition, the method may further include: when a first response is received from the at least one device after the operation result is transmitted to the at least one device, transmitting subsequently obtained operation results to the at least one device; and, when a second response is received from the at least one device after the operation result is transmitted to the at least one device, processing the operation result with one of a plurality of projection layers or inputting it to the remaining layers of the third neural network model.

In the inputting, when the second response is information on a dimension that the at least one device can process, the dimension of the operation result may be changed based on the projection layer, among the plurality of projection layers, that corresponds to the dimension of the operation result and the dimension that the at least one device can process, and the dimension-changed operation result may be transmitted to the at least one device. When the second response is information indicating that the operation information cannot be processed, the operation result may be input to the remaining layers to obtain text corresponding to the user voice, and the obtained text may be transmitted to the at least one device.

The method may further include, by the at least one device, inputting the operation result to a fourth neural network model stored in the at least one device to obtain text corresponding to the user voice, and performing an operation corresponding to the obtained text. The fourth neural network model may be a model in which, with the weight values of the partial layers fixed, the remaining layers of the third neural network model are additionally trained based on a plurality of sample user voices corresponding to the at least one device and a plurality of sample texts corresponding to the plurality of sample user voices, and which is then configured to include only the additionally trained remaining layers.

According to the various embodiments of the present disclosure described above, the electronic device may identify a device corresponding to the user voice and cause the identified device to perform an operation corresponding to the user voice.

In addition, the electronic device uses a neural network model in which only the partial layers of a voice recognition neural network model, namely those for enhancing voice recognition, are additionally trained, which improves voice recognition performance, and provides the operation result of that model to the corresponding device. The device uses a neural network model in which only the remaining layers of the voice recognition neural network model, namely those for reflecting device characteristics, are additionally trained, and can thereby perform voice recognition optimized for the device.

FIG. 1 is a diagram illustrating an electronic system according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure.
FIG. 3 is a diagram for explaining additional training of a neural network model according to an embodiment of the present disclosure.
FIG. 4 is a diagram for explaining a device identification operation according to an embodiment of the present disclosure.
FIG. 5 is a diagram for explaining a projection layer according to an embodiment of the present disclosure.
FIG. 6 is a diagram for explaining an operation of an electronic device according to device type according to an embodiment of the present disclosure.
FIG. 7 is a flowchart for explaining an operation of an electronic device according to a response of a device according to an embodiment of the present disclosure.
FIG. 8 is a flowchart for explaining an operation of a device according to an embodiment of the present disclosure.
FIG. 9 is a flowchart for explaining a method of controlling an electronic device according to an embodiment of the present disclosure.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms that are currently in wide use, selected in consideration of their functions in the present disclosure. However, they may vary depending on the intention of those skilled in the art, legal precedent, the emergence of new technology, and the like. In certain cases a term has been arbitrarily selected by the applicant, and in that case its meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meaning and the overall content of the present disclosure, not simply on their names.

In this specification, expressions such as "have," "may have," "include," or "may include" indicate the presence of the corresponding feature (e.g., a numerical value, a function, an operation, or a component such as a part) and do not exclude the presence of additional features.

The expression "at least one of A and/or B" should be understood to mean "A," "B," or "A and B."

Expressions such as "first" and "second" used in this specification may modify various components regardless of order and/or importance, and are used only to distinguish one component from another without limiting the components.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, the term "user" may refer to a person who uses an electronic device or to a device that uses an electronic device (e.g., an artificial intelligence electronic device).

Hereinafter, various embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an electronic system 1000 according to an embodiment of the present disclosure. As shown in FIG. 1, the electronic system 1000 includes an electronic device 100 and a plurality of devices 200-1 to 200-3.

The electronic device 100 is a device that receives a user voice and may be a smartphone, a tablet PC, a wearable device, or the like. In FIG. 1, the electronic device 100 is illustrated as a smartphone for convenience of description, but it is not limited thereto; the electronic device 100 may be any device that can receive a user voice and control at least one of the plurality of devices 200-1 to 200-3.

The electronic device 100 may store a neural network model in which some layers of a voice recognition neural network model have been retrained, and may process a user voice using the stored neural network model. Here, the retraining of those layers may be retraining that fixes the weight values of the remaining layers of the voice recognition neural network model and then uses various voice samples to enhance voice recognition performance.

For example, when the electronic device 100 is a smartphone as shown in FIG. 1, the various voice samples may be voices uttered in front of a smartphone. That is, the voice samples used for training may be voices uttered in front of a smartphone together with the corresponding text.

The electronic device 100 may input the user voice to the retrained layers to obtain an operation result, and transmit the obtained operation result to at least one of the plurality of devices 200-1 to 200-3.

In this case, the electronic device 100 may identify at least one device corresponding to the user voice using a neural network model trained to identify a device corresponding to a voice, and transmit the obtained operation result to the identified device.

The plurality of devices 200-1 to 200-3 are devices that receive the operation result from the electronic device 100 and perform a corresponding operation. In FIG. 1, for convenience of description, the plurality of devices 200-1 to 200-3 are illustrated as an air conditioner, a refrigerator, and a TV, respectively. However, they are not limited thereto, and each of the plurality of devices 200-1 to 200-3 may be any device that can receive the operation result from the electronic device 100 and perform a corresponding operation.

Each of the plurality of devices 200-1 to 200-3 may store a neural network model obtained by retraining the remaining layers of the speech recognition neural network model, and may perform an operation corresponding to the calculation result using the stored neural network model. Here, the retraining of the remaining layers may be retraining that fixes the weight values of the some layers of the speech recognition neural network model and then uses device-specific voice samples so that speech recognition is optimized for the device.

For example, in the case of a TV, the remaining layers may be retrained using voice samples used to control the TV, such as "volume up," "volume down," "channel up," and "channel down," and the corresponding text. Alternatively, in the case of an air conditioner, the remaining layers may be retrained using voice samples used to control the air conditioner, such as "raise the temperature," "lower the temperature," and "sleep mode," and the corresponding text.

Each of the plurality of devices 200-1 to 200-3 may obtain text corresponding to the user's voice by inputting the calculation result into the retrained remaining layers, and may perform an operation corresponding to the obtained text.

As described above, the electronic device 100 and the plurality of devices 200-1 to 200-3 may each store a neural network model obtained by retraining specific layers of the speech recognition neural network model, and may perform an operation corresponding to the user's voice.

Although FIG. 1 illustrates three devices 200-1 to 200-3, the present disclosure is not limited thereto. For example, any number of devices may be added to or removed from the plurality of devices 200-1 to 200-3. In addition, only a single device, rather than the plurality of devices 200-1 to 200-3, may be connected to the electronic device 100.

Although FIG. 1 illustrates the electronic device 100 and the plurality of devices 200-1 to 200-3 as being directly connected, they may also be connected through a device such as an access point.

FIG. 2 is a block diagram illustrating a configuration of the electronic device 100 according to an embodiment of the present disclosure. As shown in FIG. 2, the electronic device 100 includes a microphone 110, a memory 120, a communication interface 130, and a processor 140.

The microphone 110 is a component that receives a user's voice and converts it into an audio signal. The microphone 110 is electrically connected to the processor 140 and may receive the user's voice under the control of the processor 140. Here, the user's voice may include sound generated by at least one of the electronic device 100 and other electronic devices around the electronic device 100, as well as noise around the electronic device 100.

For example, the microphone 110 may be formed integrally with the electronic device 100 on its upper side, front side, lateral side, or the like. Alternatively, the microphone 110 may be provided in a remote control or the like that is separate from the electronic device 100. In this case, the remote control may receive sound through the microphone 110 and provide the received sound to the electronic device 100.

The microphone 110 may include various components, such as a microphone that collects analog sound, an amplifier circuit that amplifies the collected sound, an A/D conversion circuit that samples the amplified sound and converts it into a digital signal, and a filter circuit that removes noise components from the converted digital signal.

The microphone 110 may include a plurality of sub-microphones. For example, the microphone 110 may include one sub-microphone on each of the front, rear, left, and right sides of the electronic device 100. However, the present disclosure is not limited thereto, and the electronic device 100 may include only one microphone 110.

Meanwhile, the microphone 110 may be implemented in the form of a sound sensor.

The memory 120 may refer to hardware that stores information such as data in an electrical or magnetic form so that the processor 140 or the like can access it. To this end, the memory 120 may be implemented as at least one of a non-volatile memory, a volatile memory, a flash memory, a hard disk drive (HDD), a solid state drive (SSD), a RAM, a ROM, and the like.

The memory 120 may store at least one instruction or module required for the operation of the electronic device 100 or the processor 140. Here, an instruction is a unit of code that directs the operation of the electronic device 100 or the processor 140, and may be written in machine language, which is a language that a computer can understand. A module may be a set of instructions that performs a specific task as a unit of work.

The memory 120 may store data, which is information in units of bits or bytes that can represent characters, numbers, images, and the like. For example, the memory 120 may store a first neural network model and a second neural network model. Here, the first neural network model is a model configured to include only the additionally trained layers after only some layers of a third neural network model, which is trained to identify text from a voice, have been additionally trained, and the second neural network model may be a model trained to identify the device corresponding to a voice.

The memory 120 may store a user voice processing module, information on a plurality of projection layers, information on a plurality of devices, the remaining layers of the third neural network model, and the like.

The memory 120 is accessed by the processor 140, and the processor 140 may read, write, modify, delete, and update instructions, modules, or data.

The communication interface 130 is a component that communicates with various types of external devices according to various types of communication methods. For example, the electronic device 100 may communicate with the plurality of devices 200-1 to 200-3 through the communication interface 130.

The communication interface 130 may include a Wi-Fi module, a Bluetooth module, an infrared communication module, a wireless communication module, and the like. Here, each communication module may be implemented in the form of at least one hardware chip.

The Wi-Fi module and the Bluetooth module perform communication using the Wi-Fi method and the Bluetooth method, respectively. When the Wi-Fi module or the Bluetooth module is used, various types of connection information, such as an SSID and a session key, may first be exchanged, a communication connection may be established using that information, and various types of information may then be transmitted and received. The infrared communication module performs communication according to Infrared Data Association (IrDA) technology, which wirelessly transmits data over a short distance using infrared light, which lies between visible light and millimeter waves.

In addition to the communication methods described above, the wireless communication module may include at least one communication chip that performs communication according to various wireless communication standards such as Zigbee, 3G (3rd Generation), 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), LTE-A (LTE Advanced), 4G (4th Generation), and 5G (5th Generation).

Alternatively, the communication interface 130 may include a wired communication interface such as HDMI, DP, Thunderbolt, USB, RGB, D-SUB, or DVI.

In addition, the communication interface 130 may include at least one of a local area network (LAN) module, an Ethernet module, and a wired communication module that performs communication using a pair cable, a coaxial cable, an optical fiber cable, or the like.

The processor 140 controls the overall operation of the electronic device 100. Specifically, the processor 140 may be connected to each component of the electronic device 100, such as the microphone 110, the memory 120, and the communication interface 130, to control the overall operation of the electronic device 100.

According to an embodiment, the processor 140 may be implemented as a digital signal processor (DSP), a microprocessor, or a timing controller (TCON). However, the present disclosure is not limited thereto, and the processor 140 may include one or more of a central processing unit (CPU), a micro controller unit (MCU), a micro processing unit (MPU), a controller, an application processor (AP), a communication processor (CP), and an ARM processor, or may be defined by the corresponding term. In addition, the processor 140 may be implemented as a system on chip (SoC) or large scale integration (LSI) in which a processing algorithm is embedded, or may be implemented in the form of a field programmable gate array (FPGA).

When a user's voice is received through the microphone 110, the processor 140 may input the user's voice into the first neural network model to obtain a calculation result, and may input the calculation result into the second neural network model to identify at least one device corresponding to the user's voice. Here, the first neural network model is a model configured to include only the additionally trained layers after only some layers of the third neural network model, which is trained to identify text from a voice, have been additionally trained; the second neural network model is a model trained to identify the device corresponding to a voice; and the third neural network model may be a neural network model trained to perform speech recognition. In particular, the first neural network model may be a model obtained by fixing the weight values of the remaining layers of the third neural network model, additionally training the some layers based on a plurality of sample user voices corresponding to the electronic device 100 and a plurality of sample texts corresponding to the plurality of sample user voices, and then keeping only the additionally trained layers. Through this additional training, the first neural network model may output, as the calculation result, a speech recognition result that better reflects the characteristics of the environment surrounding the electronic device 100.

The processor 140 may control the communication interface 130 to transmit the calculation result to the at least one device. For example, the processor 140 may input the calculation result into the second neural network model to obtain a score for each of the plurality of devices, and may control the communication interface 130 to transmit the calculation result to each device whose score is greater than or equal to a threshold value.
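The score-and-threshold selection described above can be sketched as follows. This is only an illustrative sketch: the single linear layer with a sigmoid stands in for the second neural network model, whose actual structure is not specified here, and the names `score_devices`, `select_devices`, and the threshold of 0.5 are assumptions made for the example.

```python
import numpy as np


def score_devices(result, weights, names):
    """Hypothetical stand-in for the second neural network model.

    Maps a calculation result (1-D vector) to one score per device using a
    single linear layer followed by a sigmoid. The real model is not
    described at this level of detail.
    """
    logits = weights @ result
    scores = 1.0 / (1.0 + np.exp(-logits))
    return dict(zip(names, scores.tolist()))


def select_devices(scores, threshold=0.5):
    """Return the devices whose score is greater than or equal to the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


names = ["air_conditioner", "refrigerator", "tv"]
result = np.array([1.0, 0.0])
# Assumed weights, chosen only so that the scores differ per device.
weights = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]])
targets = select_devices(score_devices(result, weights, names))
```

With these assumed weights, the air conditioner scores above 0.5, the refrigerator below 0.5, and the TV exactly 0.5, so the calculation result would be sent to the air conditioner and the TV.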

The processor 140 may input the user's voice into the first neural network model in units of a preset time interval to obtain a calculation result for each preset time unit, may input each such calculation result into the second neural network model to identify at least one device for each preset time unit, and may control the communication interface 130 to transmit the calculation result obtained for each preset time unit to the at least one device identified for that time unit. For example, the processor 140 may input the user's voice into the first neural network model in units of 25 ms to obtain a calculation result every 25 ms, may input each calculation result into the second neural network model to identify at least one device every 25 ms, and may control the communication interface 130 to transmit the calculation result obtained for each 25 ms unit to the at least one device identified for that unit.
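As a rough illustration of processing the voice in 25 ms units, the sketch below splits a sampled signal into consecutive frames. The 16 kHz sample rate, the non-overlapping framing, and the function name `split_frames` are assumptions made for the example; the description only states the 25 ms unit.

```python
import numpy as np


def split_frames(samples, sample_rate=16000, frame_ms=25):
    """Yield consecutive, non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is dropped. Sample rate and non-overlapping
    framing are assumptions for illustration.
    """
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


# One second of silence stands in for a recorded user voice.
audio = np.zeros(16000, dtype=np.float32)
frames = list(split_frames(audio))
```

Each frame would then be passed to the first neural network model, and the resulting calculation result to the second neural network model, before the next frame is handled.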

Meanwhile, the memory 120 may further store information on the plurality of devices and information on a plurality of projection layers. The processor 140 may identify, based on the information on the plurality of devices, information on a second dimension that the at least one device can process. When a first dimension of the calculation result differs from the second dimension, the processor 140 may convert the calculation result to the second dimension based on the projection layer, among the plurality of projection layers, that corresponds to the first dimension and the second dimension, and may control the communication interface 130 to transmit the converted calculation result to the at least one device.

For example, when the calculation result has 1536 dimensions but the device to which the calculation result is to be transmitted processes 512 dimensions, the processor 140 may convert the calculation result to 512 dimensions based on a 1536 × 512 projection layer stored in the memory 120, and may transmit the converted calculation result to that device.
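The dimension conversion can be pictured as a matrix multiplication with a stored projection matrix. The sketch below assumes the projection layer is a plain linear map without bias and that the stored projection layers are indexed by a (source dimension, target dimension) pair; both are assumptions for illustration, as are the names `project` and `projections`.

```python
import numpy as np


def project(result, projections, target_dim):
    """Convert a calculation result to target_dim using a stored projection.

    projections maps (source_dim, target_dim) to a weight matrix of that
    shape. If the dimensions already match, the result is returned as is.
    """
    source_dim = result.shape[-1]
    if source_dim == target_dim:
        return result
    return result @ projections[(source_dim, target_dim)]


rng = np.random.default_rng(0)
# Random weights stand in for a trained 1536 x 512 projection layer.
projections = {(1536, 512): rng.standard_normal((1536, 512))}
result = rng.standard_normal(1536)
converted = project(result, projections, 512)
```

Here `converted` has 512 dimensions and could be sent to a device that processes 512-dimensional input.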

Here, the information on the plurality of devices may be information received from each device when the electronic device 100 and the device were first connected. However, the present disclosure is not limited thereto, and the memory 120 may not store the information on the plurality of devices. In this case, the processor 140 may request, from the device to which the calculation result is to be transmitted, information on the dimension the device can process, receive that information, and determine whether to use a projection layer based on the received information.

Meanwhile, the memory 120 may further store information on the plurality of devices and the remaining layers of the third neural network model. When the processor 140 identifies, based on the information on the plurality of devices, that the at least one device is not equipped with a speech recognition function, the processor 140 may input the calculation result into the remaining layers to obtain text corresponding to the user's voice, and may control the communication interface 130 to transmit the obtained text to the at least one device. That is, when a device is not capable of processing the calculation result, the processor 140 may perform the remaining processing on the calculation result to obtain the text corresponding to the user's voice and provide the obtained text to the device. In this case, the device only needs to perform the operation corresponding to the text.

Alternatively, the memory 120 may further store information on the plurality of projection layers and the remaining layers of the third neural network model, and when a first response is received from the at least one device after the calculation result has been transmitted to the at least one device, the processor 140 may control the communication interface 130 to transmit subsequently obtained calculation results to the at least one device. For example, when an "ok" response is received from the at least one device after the calculation result has been transmitted, the processor 140 may transmit subsequently obtained calculation results to the at least one device.

Alternatively, when a second response is received from the at least one device after the calculation result has been transmitted to the at least one device, the processor 140 may process the calculation result with one of the plurality of projection layers or input the calculation result into the remaining layers.

Specifically, when the second response is information on a dimension that the at least one device can process, the processor 140 may change the dimension of the calculation result based on the projection layer, among the plurality of projection layers, that corresponds to the dimension of the calculation result and the dimension the at least one device can process, and may control the communication interface 130 to transmit the calculation result with the changed dimension to the at least one device. When the second response is information indicating that the device cannot process the calculation result, the processor 140 may input the calculation result into the remaining layers to obtain text corresponding to the user's voice, and may control the communication interface 130 to transmit the obtained text to the at least one device.
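The handling of the first and second responses can be summarized as a small dispatch routine. The sketch below is only one possible reading: the response format (`DeviceResponse` with a `kind` string and an optional `dim`) and the action labels are hypothetical, since the description does not define a message format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DeviceResponse:
    """Hypothetical response message; the actual format is not specified."""
    kind: str                  # "ok", "dimension", or "cannot_process"
    dim: Optional[int] = None  # processable dimension, for kind == "dimension"


def next_action(response: DeviceResponse,
                result_dim: int) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Decide what the electronic device does after receiving a response."""
    if response.kind == "ok":
        # First response: keep sending subsequent calculation results.
        return ("send_result", None)
    if response.kind == "dimension":
        # Second response with a dimension: pick the matching projection layer.
        return ("project_then_send", (result_dim, response.dim))
    if response.kind == "cannot_process":
        # Second response without capability: run the remaining layers locally.
        return ("decode_then_send_text", None)
    raise ValueError(f"unknown response kind: {response.kind}")
```

For instance, a response of `DeviceResponse("dimension", 512)` to a 1536-dimensional calculation result selects the (1536, 512) projection before retransmission.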

Meanwhile, the at least one device may input the calculation result into a fourth neural network model stored in the at least one device to obtain text corresponding to the user's voice, and may perform an operation corresponding to the obtained text. Here, the fourth neural network model may be a model obtained by fixing the weight values of the some layers, additionally training the remaining layers of the third neural network model based on a plurality of sample user voices corresponding to the at least one device and a plurality of sample texts corresponding to the plurality of sample user voices, and then keeping only the additionally trained remaining layers. Through this additional training, the fourth neural network model may output, as text, a speech recognition result that takes the control commands of the device into greater account.

Meanwhile, functions related to artificial intelligence according to the present disclosure are operated through the processor 140 and the memory 120.

The processor 140 may consist of one or a plurality of processors. Here, the one or the plurality of processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphics-dedicated processor such as a GPU or a vision processing unit (VPU), or an artificial-intelligence-dedicated processor such as an NPU.

The one or the plurality of processors control the processing of input data according to a predefined operation rule or an artificial intelligence model stored in the memory. Alternatively, when the one or the plurality of processors are artificial-intelligence-dedicated processors, they may be designed with a hardware structure specialized for processing a specific artificial intelligence model. The predefined operation rule or the artificial intelligence model is characterized by being created through learning.

Here, being created through learning means that a basic artificial intelligence model is trained by a learning algorithm using a large amount of training data, so that a predefined operation rule or an artificial intelligence model set to perform a desired characteristic (or purpose) is created. Such learning may be performed in the device itself in which the artificial intelligence according to the present disclosure is performed, or may be performed through a separate server and/or system. Examples of the learning algorithm include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but are not limited to these examples.

The artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values and performs a neural network operation through an operation between the operation result of the previous layer and the plurality of weight values. The plurality of weight values of the plurality of neural network layers may be optimized by the training result of the artificial intelligence model. For example, the plurality of weight values may be updated during the training process so that a loss value or a cost value obtained from the artificial intelligence model is reduced or minimized.
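To make the weight update concrete, the sketch below applies plain gradient descent to a single linear layer with a mean squared error loss. The choice of loss, optimizer, learning rate, and the toy data are all assumptions for illustration; the description does not fix any of them.

```python
import numpy as np


def train_step(w, x, y, lr=0.1):
    """One gradient-descent update of a linear layer under a mean squared error loss.

    Returns the updated weights and the loss measured before the update.
    """
    err = x @ w - y
    loss = float(np.mean(err ** 2))
    grad = 2.0 * x.T @ err / len(y)  # gradient of the mean squared error
    return w - lr * grad, loss


# Toy data: the target relation is y = 2 * x.
x = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])
w = np.zeros(1)

w, loss_before = train_step(w, x, y)
_, loss_after = train_step(w, x, y)
```

After one update the loss on this toy data falls from 10.0 to 2.5, which is the sense in which the weights are updated so that the loss value is reduced.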

The artificial neural network may include a deep neural network (DNN). Examples include a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), and deep Q-networks, but are not limited to these examples.

As described above, speech recognition performance can be improved through additional training of the neural network model, and even when a device is added, speech recognition can be performed without an additional training operation.

Hereinafter, the operation of the electronic device 100 will be described in more detail with reference to FIGS. 3 to 8. For convenience of description, FIGS. 3 to 8 describe individual embodiments. However, the individual embodiments of FIGS. 3 to 8 may be implemented in any combination.

FIG. 3 is a diagram for explaining additional training of a neural network model according to an embodiment of the present disclosure.

The upper part of FIG. 3 shows an example of the third neural network model 310 trained to perform speech recognition.

The lower left part of FIG. 3 shows an example of the first neural network model 320, which is configured to include only the additionally trained layers after only some layers near the input end of the third neural network model 310 have been additionally trained. The first neural network model 320 may be a model obtained by fixing the weight values of the remaining layers near the output end of the third neural network model 310, additionally training the some layers based on a plurality of sample user voices corresponding to the electronic device 100 and a plurality of sample texts corresponding to the plurality of sample user voices, and then keeping only the additionally trained layers.
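The idea of training only the input-side layers while the output-side layers stay fixed can be sketched with a toy two-layer linear network. Everything here is an assumption for illustration: the real third neural network model is a speech recognizer whose architecture, loss, and optimizer are not given, and `w_front` and `w_back` merely stand in for the some layers and the remaining layers.

```python
import numpy as np


def loss_fn(x, t, w_front, w_back):
    """Mean squared error of the full (front + back) network."""
    return float(np.mean((x @ w_front @ w_back - t) ** 2))


def train_front_only(x, t, w_front, w_back, lr=0.1):
    """One gradient step that updates w_front while w_back stays fixed."""
    y = x @ w_front @ w_back
    d_y = 2.0 * (y - t) / y.size   # gradient of the loss w.r.t. the output
    d_h = d_y @ w_back.T           # backpropagate through the frozen back layers
    d_front = x.T @ d_h            # gradient w.r.t. the front weights
    return w_front - lr * d_front  # w_back is intentionally not updated


x = np.eye(2)
t = np.array([[1.0], [0.0]])
w_front = np.zeros((2, 2))           # stands in for the some layers (input side)
w_back = np.array([[1.0], [1.0]])    # stands in for the remaining layers (frozen)

before = loss_fn(x, t, w_front, w_back)
w_front = train_front_only(x, t, w_front, w_back)
after = loss_fn(x, t, w_front, w_back)
```

After training, a model corresponding to the first neural network model 320 would consist of `w_front` alone, so its output `x @ w_front` is the intermediate calculation result rather than text.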

For example, when the electronic device 100 is a TV, some layers of the third neural network model 310 may be additionally trained using sample user voices that include both the sound of the TV itself and the user's voice, together with the corresponding text, and the first neural network model 320 may then be composed of the additionally trained layers of the third neural network model 310. Through this additional training, speech recognition performance can be maintained even when the input user's voice includes the sound of the TV itself. This effect applies equally to a smartphone, a refrigerator, a washing machine, and the like, as well as to a TV.

The lower right part of FIG. 3 shows an example of the fourth neural network model 330, which is configured to include only the additionally trained remaining layers after only the remaining layers near the output end of the third neural network model 310 have been additionally trained. The fourth neural network model 330 may be a model obtained by fixing the weight values of the some layers near the input end of the third neural network model 310, additionally training the remaining layers based on a plurality of sample user voices corresponding to the device and a plurality of sample texts corresponding to the plurality of sample user voices, and then keeping only the additionally trained remaining layers.

For example, when the device is a TV, the remaining layers may be retrained using voice samples used to control the TV, such as "volume up", "volume down", "channel up", and "channel down", together with the corresponding texts. Alternatively, when the device is an air conditioner, the remaining layers may be retrained using voice samples used to control the air conditioner, such as "raise the temperature", "lower the temperature", and "sleep mode", together with the corresponding texts. This additional training increases the likelihood that a user voice is output as text corresponding to each device.
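The split described for FIG. 3 can be pictured with a small structural sketch. It only tracks which layers are frozen and which are kept; no actual training is performed. The `Layer` class, the helper names, the six-layer model, and the split index of 4 are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Layer:
    """A placeholder for one layer; `trainable=False` means its weights are fixed."""
    name: str
    trainable: bool = True


def mark_trainable(layers: List[Layer], split: int, tune_front: bool) -> List[Layer]:
    """Freeze one side of the split and leave the other side trainable.

    tune_front=True  -> layers before `split` are trainable (first model 320)
    tune_front=False -> layers from `split` onward are trainable (fourth model 330)
    """
    return [replace(layer, trainable=((i < split) == tune_front))
            for i, layer in enumerate(layers)]


def extract_trained(layers: List[Layer]) -> List[str]:
    """Keep only the additionally trained layers, as the derived model does."""
    return [layer.name for layer in layers if layer.trainable]


# Hypothetical six-layer third model 310, split after the fourth layer.
third_model = [Layer(f"layer{i}") for i in range(6)]
first_model = extract_trained(mark_trainable(third_model, split=4, tune_front=True))
fourth_model = extract_trained(mark_trainable(third_model, split=4, tune_front=False))
print(first_model)
print(fourth_model)
```

In this sketch the two derived models partition the original layers, so the output of the first model can be fed to the fourth model.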

FIG. 4 is a diagram for explaining an operation of identifying a device according to an embodiment of the present disclosure.

As shown in FIG. 4, the processor 140 may input the user voice to the first neural network model in units of 25 ms to obtain an operation result every 25 ms, input the operation result obtained every 25 ms to the second neural network model to identify at least one device every 25 ms, and control the communication interface 130 to transmit the operation result obtained every 25 ms to the at least one device identified for that 25 ms interval.

That is, the processor 140 may input the user voice of interval 410 to the first neural network model to obtain an operation result corresponding to interval 410, and input that operation result to the second neural network model to identify the first device, the second device, and the third device as corresponding to interval 410. Here, the processor 140 may obtain a score for each of the plurality of devices and identify the first device, the second device, and the third device as the devices having a score equal to or greater than a threshold value. The processor 140 may transmit the operation result corresponding to interval 410 to the first device, the second device, and the third device.

The processor 140 may then input the user voice of interval 420 to the first neural network model to obtain an operation result corresponding to interval 420, and input that operation result to the second neural network model to identify the first device and the third device as corresponding to interval 420. The processor 140 may transmit the operation result corresponding to interval 420 to the first device and the third device.

The processor 140 may then input the user voice of interval 430 to the first neural network model to obtain an operation result corresponding to interval 430, and input that operation result to the second neural network model to identify the third device as corresponding to interval 430. The processor 140 may transmit the operation result corresponding to interval 430 to the third device.

In this manner, the processor 140 may identify the device to which the operation result is to be transmitted. That is, the device to which the operation result is transmitted may be the device corresponding to the user voice. Because each interval is very short, the final destination device may be determined before any device derives text. In FIG. 4, the first device receives operation results for 50 ms, but cannot derive text because no further operation results are received. Likewise, the second device cannot derive text. Ultimately, only the third device continues to receive operation results, obtains the text, and performs an operation corresponding to the text.
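The per-interval routing of FIG. 4 can be illustrated with a minimal sketch. The scores below are made up so that the selection matches intervals 410, 420, and 430; the threshold of 0.5, the device names, and the function names are assumptions, and the real scores would come from the second neural network model.

```python
from typing import Dict, List

THRESHOLD = 0.5  # hypothetical threshold; the disclosure does not give a value


def select_devices(scores: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Return the devices whose score is equal to or greater than the threshold."""
    return sorted(device for device, score in scores.items() if score >= threshold)


def route_stream(per_interval_scores: List[Dict[str, float]]) -> Dict[str, List[int]]:
    """Record, per device, the indices of the intervals whose result it receives."""
    received: Dict[str, List[int]] = {}
    for index, scores in enumerate(per_interval_scores):
        for device in select_devices(scores):
            received.setdefault(device, []).append(index)
    return received


# Made-up classifier scores for intervals 410, 420 and 430 (25 ms each).
interval_scores = [
    {"device1": 0.60, "device2": 0.55, "device3": 0.70},  # interval 410
    {"device1": 0.52, "device2": 0.30, "device3": 0.80},  # interval 420
    {"device1": 0.20, "device2": 0.10, "device3": 0.90},  # interval 430
]
routing = route_stream(interval_scores)
print(routing)
```

With these numbers, only `device3` receives every interval, which corresponds to the third device being the one that can go on to derive text.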

Meanwhile, because the third device processes the operation results received every 25 ms in real time, text can be obtained faster than in the related art. For example, when a conventional apparatus does not provide operation results per preset time interval and instead identifies the device corresponding to the user voice based on the entire duration of the user voice, the user voice is transmitted to the corresponding device only after the entire user voice has been processed. In this case, the delay may be the time required to process the entire user voice. In contrast, according to the present disclosure, the device receives and processes an operation result every 25 ms, so the delay may be about 25 ms rather than the time required to process the entire user voice. Accordingly, transmitting the operation results interval by interval can improve the speech recognition speed.

Although FIG. 4 illustrates 25 ms as the preset time interval, the present disclosure is not limited thereto, and various time intervals may be set according to hardware specifications, the time required to process the operation result of each time interval, and the like.

FIG. 5 is a diagram for describing a projection layer according to an embodiment of the present disclosure.

As shown in FIG. 5, when the operation result output by the first neural network model has 1536 dimensions and the device can process 512 dimensions, the processor 140 may convert the operation result to 512 dimensions through a 1536 × 512 projection layer and provide the converted operation result to the device.

However, the present disclosure is not limited thereto. When the processor 140 identifies that the device stores the 1536 × 512 projection layer, the processor 140 may provide the 1536-dimensional operation result to the device, and the device may change the dimension of the operation result.

Although FIG. 5 illustrates a 1536 × 512 projection layer, projection layers of various other shapes may also exist.
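As a rough illustration of the dimension change in FIG. 5, the sketch below treats the projection layer as a plain 1536 × 512 weight matrix applied to the operation result. The random weights, the absence of a bias term, and the function names are assumptions; a real projection layer would use trained weights.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # in_dim rows, out_dim columns


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> Matrix:
    """Build a placeholder weight matrix; stands in for trained projection weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.01, 0.01) for _ in range(out_dim)] for _ in range(in_dim)]


def project(state: Vector, weights: Matrix) -> Vector:
    """Map a vector of length in_dim to length out_dim (vector times matrix)."""
    if len(state) != len(weights):
        raise ValueError("state dimension does not match the projection input dimension")
    out_dim = len(weights[0])
    return [sum(state[i] * weights[i][j] for i in range(len(state)))
            for j in range(out_dim)]


encoder_state = [1.0] * 1536                 # stand-in for one operation result
projection = make_projection(1536, 512)      # the 1536 x 512 case of FIG. 5
projected = project(encoder_state, projection)
print(len(projected))  # 512
```

The same `project` function could run on either side: on the electronic device before transmission, or on the receiving device if it stores the projection layer itself.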

FIG. 6 is a diagram for explaining an operation of the electronic device 100 according to device type, according to an embodiment of the present disclosure.

The processor 140 of the electronic device 100 may input the user voice to the first neural network model (Encoder) to obtain an operation result (Encoder state). The processor 140 may then input the operation result to the second neural network model (Domain Classifier) to identify at least one device corresponding to the user voice. Here, the first neural network model is a model obtained by additionally training only some layers of the third neural network model, which is trained to identify text from speech, and configured to include only those additionally trained layers, and the second neural network model may be a model trained to identify a device corresponding to speech.

If the first device 200-1 is the device corresponding to the user voice but cannot process the operation result, the processor 140 may input the operation result to the remaining layers of the third neural network model (Decoder) to obtain text corresponding to the user voice, and provide the text to the first device 200-1.

The second device 200-2 and the third device 200-3 are devices capable of processing the operation result, and the processor 140 may transmit the operation result to at least one of the second device 200-2 or the third device 200-3 based on the output of the second neural network model. In FIG. 6, the third device 200-3 is the device corresponding to the user voice, and the processor 140 may transmit the operation result to the third device 200-3.

The third device 200-3 may input the operation result to the fourth neural network model to obtain text corresponding to the user voice, and perform an operation corresponding to the obtained text. Here, the fourth neural network model may be a model in which the weight values of the some layers are fixed, the remaining layers of the third neural network model are additionally trained based on a plurality of sample user voices corresponding to the at least one device and a plurality of sample texts corresponding to the plurality of sample user voices, and only the additionally trained remaining layers are then included.

FIG. 7 is a flowchart illustrating an operation of the electronic device 100 according to a response from a device, according to an embodiment of the present disclosure.

First, the processor 140 identifies whether the device probability (score) exceeds a threshold value (S710). If the device probability exceeds the threshold value, the processor 140 transmits the operation result (S720); if it does not, the processor 140 may perform no further operation.

After transmitting the operation result, the processor 140 may receive a response from the device (S730).

When a first response is received, the processor 140 may continue transmitting operation results (S740). For example, when an "ok" response is received, the processor 140 may continue transmitting operation results.

Alternatively, when a second response is received, the processor 140 may decode the operation result and transmit the decoding result (S750). For example, when a response indicating that the device cannot process the operation result is received, the processor 140 may obtain text from the operation result and transmit the obtained text to the device.

Alternatively, when a third response is received, the processor 140 may identify whether a projection layer exists (S760). For example, the processor 140 may receive, from the device, information indicating that the device can process the operation result but that its processable dimension differs from the dimension of the operation result. In this case, if no projection layer exists, the processor 140 may decode the operation result and transmit the decoding result (S750). If a projection layer exists, the processor 140 may project the operation result to change its dimension (S770) and transmit the projected operation result to the device (S780).
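The branching of FIG. 7 can be summarized as a small decision function. This is a sketch under assumptions: the `Response` enum, the `DeviceReply` structure, the string action labels, and the representation of stored projection layers as a set of `(input_dim, output_dim)` pairs are all hypothetical and do not describe an actual message format.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Set, Tuple


class Response(Enum):
    OK = auto()                  # first response
    CANNOT_PROCESS = auto()      # second response
    DIMENSION_MISMATCH = auto()  # third response


@dataclass
class DeviceReply:
    kind: Response
    device_dim: Optional[int] = None  # only meaningful for DIMENSION_MISMATCH


def next_action(score: float, threshold: float, reply: DeviceReply,
                result_dim: int, projections: Set[Tuple[int, int]]) -> str:
    """Return a label for the step the processor takes next (steps of FIG. 7)."""
    if score <= threshold:                        # S710: score does not exceed threshold
        return "no_action"
    if reply.kind is Response.OK:                 # S740
        return "keep_sending_result"
    if reply.kind is Response.CANNOT_PROCESS:     # S750
        return "decode_and_send_text"
    if (result_dim, reply.device_dim) in projections:  # S760 -> S770, S780
        return "project_and_send_result"
    return "decode_and_send_text"                 # S760 -> S750


stored = {(1536, 512)}  # hypothetical set of projection layers held in memory
print(next_action(0.9, 0.5, DeviceReply(Response.DIMENSION_MISMATCH, 512), 1536, stored))
```

When the device reports a dimension for which no projection layer is stored, the function falls back to decoding, matching the S760 to S750 path.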

FIG. 8 is a flowchart illustrating an operation of a device according to an embodiment of the present disclosure.

First, the device may receive an operation result from the electronic device 100 (S810). The device may then identify whether it has a decoder (S820). That is, if the device can process the operation result, it identifies whether the dimension it can process is the same as the dimension of the operation result (S830); if it cannot process the operation result, it may transmit "no" to the electronic device 100 (S870).

If the dimensions are the same, the device may transmit "ok" to the electronic device 100 (S850); if they are not the same, the device may identify whether a projection layer exists (S840).

If a projection layer exists, the device may transmit "ok" to the electronic device 100 (S850); if no projection layer exists, the device may transmit information on the dimension it can process (S860).

If the dimensions are the same, the device uses the operation result received from the electronic device 100 as is. If the dimensions differ and a projection layer exists, the device may change the dimension of the received operation result based on the projection layer before using it.
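The device-side checks of FIG. 8 can likewise be sketched as one function. The return values are hypothetical stand-ins for the "ok", "no", and dimension-information responses; the parameter names and the tuple encoding are assumptions for illustration only.

```python
from typing import Tuple, Union

Reply = Union[str, Tuple[str, int]]


def device_reply(has_decoder: bool, device_dim: int,
                 result_dim: int, has_projection: bool) -> Reply:
    """Decide what the device sends back after receiving an operation result (S810)."""
    if not has_decoder:               # S820: the result cannot be processed
        return "no"                   # S870
    if device_dim == result_dim:      # S830: dimensions match
        return "ok"                   # S850
    if has_projection:                # S840: device can convert the dimension itself
        return "ok"                   # S850
    return ("dimension", device_dim)  # S860: report the processable dimension


print(device_reply(True, 512, 1536, False))
```

Returning the processable dimension lets the sender look up a matching projection layer on its side, which is the case handled by the third response in FIG. 7.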

FIG. 9 is a flowchart for describing a method of controlling an electronic device according to an embodiment of the present disclosure.

First, when a user voice is received, the user voice is input to the first neural network model to obtain an operation result (S910). The operation result is then input to the second neural network model to identify at least one device corresponding to the user voice (S920). The operation result is then transmitted to the at least one device (S930). Here, the first neural network model is a model obtained by additionally training only some layers of the third neural network model, which is trained to identify text from speech, and configured to include only those additionally trained layers, and the second neural network model may be a model trained to identify a device corresponding to speech.

Here, the obtaining (S910) may include inputting the user voice to the first neural network model in units of a preset time interval to obtain an operation result per preset time interval, the identifying (S920) may include inputting the operation result obtained per preset time interval to the second neural network model to identify at least one device per preset time interval, and the transmitting (S930) may include transmitting the operation result obtained per preset time interval to the at least one device identified for that interval.

The method may further include identifying information on a second dimension that can be processed by the at least one device based on information on a plurality of devices, and, when a first dimension of the operation result differs from the second dimension, changing the operation result to the second dimension based on a projection layer, among a plurality of projection layers, corresponding to the first dimension and the second dimension. The transmitting (S930) may include transmitting the operation result changed to the second dimension to the at least one device.

The method may further include, when it is identified based on the information on the plurality of devices that the at least one device has no voice recognition function, inputting the operation result to the remaining layers of the third neural network model to obtain text corresponding to the user voice. The transmitting (S930) may include transmitting the obtained text to the at least one device.

In addition, the identifying (S920) may include inputting the operation result to the second neural network model to obtain a score for each of the plurality of devices, and the transmitting (S930) may include transmitting the operation result to a device whose obtained score is equal to or greater than a threshold value.

The method may further include, when a first response is received from the at least one device after the operation result is transmitted to the at least one device, transmitting subsequently obtained operation results to the at least one device, and, when a second response is received from the at least one device after the operation result is transmitted to the at least one device, processing the operation result with one of the plurality of projection layers or inputting the operation result to the remaining layers of the third neural network model.

Here, the inputting may include, when the second response is information on a dimension that can be processed by the at least one device, changing the dimension of the operation result based on a projection layer, among the plurality of projection layers, corresponding to the dimension of the operation result and the dimension that can be processed by the at least one device, and transmitting the operation result with the changed dimension to the at least one device; and, when the second response is information indicating that the operation result cannot be processed, inputting the operation result to the remaining layers to obtain text corresponding to the user voice and transmitting the obtained text to the at least one device.

The first neural network model may be a model in which the weight values of the remaining layers of the third neural network model are fixed, the some layers are additionally trained based on a plurality of sample user voices corresponding to the electronic device and a plurality of sample texts corresponding to the plurality of sample user voices, and only the additionally trained layers are then included.

The method may further include, by the at least one device, inputting the operation result to a fourth neural network model stored in the at least one device to obtain text corresponding to the user voice, and performing an operation corresponding to the obtained text. The fourth neural network model may be a model in which the weight values of the some layers are fixed, the remaining layers of the third neural network model are additionally trained based on a plurality of sample user voices corresponding to the at least one device and a plurality of sample texts corresponding to the plurality of sample user voices, and only the additionally trained remaining layers are then included.

For convenience of description, the above describes only the case in which the plurality of devices receive an operation result from the electronic device and perform an operation corresponding to the received operation result. That is, the above describes only the case in which each of the plurality of devices processes the operation result using the fourth neural network model, but the present disclosure is not limited thereto. For example, each of the plurality of devices may store the first neural network model and the second neural network model, receive a user voice and perform the same computation as the electronic device, and then transmit the operation result to the electronic device or to at least one device.

According to an embodiment of the present disclosure, the various embodiments described above may be implemented as software including instructions stored in a machine-readable storage medium that can be read by a machine (e.g., a computer). The machine is a device capable of calling a stored instruction from the storage medium and operating according to the called instruction, and may include an electronic device (e.g., electronic device A) according to the disclosed embodiments. When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly, or by using other components under the control of the processor. An instruction may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish whether data is stored in the storage medium semi-permanently or temporarily.

In addition, according to an embodiment of the present disclosure, the methods according to the various embodiments described above may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least a portion of the computer program product may be at least temporarily stored, or temporarily generated, in a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

In addition, according to an embodiment of the present disclosure, the various embodiments described above may be implemented in a recording medium readable by a computer or a similar device, using software, hardware, or a combination thereof. In some cases, the embodiments described herein may be implemented by the processor itself. In a software implementation, embodiments such as the procedures and functions described herein may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described herein.

Computer instructions for performing the processing operations of a device according to the various embodiments described above may be stored in a non-transitory computer-readable medium. When executed by a processor of a specific device, the computer instructions stored in the non-transitory computer-readable medium cause the specific device to perform the processing operations of the device according to the various embodiments described above. A non-transitory computer-readable medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short time, such as a register, a cache, or a memory. Specific examples of the non-transitory computer-readable medium include a CD, a DVD, a hard disk, a Blu-ray disc, a USB device, a memory card, and a ROM.

In addition, each of the components (e.g., a module or a program) according to the various embodiments described above may consist of a single entity or a plurality of entities, some of the sub-components described above may be omitted, or other sub-components may be further included in the various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs, identically or similarly, the functions performed by each corresponding component before integration. Operations performed by a module, a program, or another component according to the various embodiments may be executed sequentially, in parallel, repetitively, or heuristically, at least some operations may be executed in a different order or omitted, or other operations may be added.

Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the present disclosure pertains without departing from the gist of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or prospect of the present disclosure.

1000: electronic system 100: electronic device
110: microphone 120: memory
130: communication interface 140: processor
200-1 to 200-3: plurality of devices

Claims

1. An electronic device comprising:
a microphone;
a memory in which a first neural network model and a second neural network model are stored;
a communication interface; and
a processor connected to the microphone, the memory, and the communication interface to control the electronic device,
wherein the processor is configured to:
when a user voice is received through the microphone, input the user voice to the first neural network model to obtain an operation result, and input the operation result to the second neural network model to identify at least one device corresponding to the user voice, and
control the communication interface to transmit the operation result to the at least one device,
wherein the first neural network model is
a model configured to include only partial layers of a third neural network model, the third neural network model being trained to identify text from speech and only the partial layers having been additionally trained, and
wherein the second neural network model is
a model trained to identify a device corresponding to a voice.
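
The flow recited in claim 1 can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the two neural network models are replaced by hypothetical toy functions (`toy_encoder`, `toy_classifier`), and transmission through the communication interface is modeled by writing into a dictionary.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def route_voice(
    frame: Sequence[float],
    encoder: Callable[[Sequence[float]], Vector],
    device_classifier: Callable[[Vector], List[str]],
    send: Callable[[str, Vector], None],
) -> List[str]:
    """Encode a voice frame, identify target devices, forward the encoding."""
    result = encoder(frame)              # stands in for the first model
    targets = device_classifier(result)  # stands in for the second model
    for device_id in targets:
        send(device_id, result)          # the operation result is sent, not text
    return targets


# Hypothetical toy stand-ins; a real system would use trained networks.
def toy_encoder(frame: Sequence[float]) -> Vector:
    return [sum(frame) / len(frame), max(frame)]


def toy_classifier(result: Vector) -> List[str]:
    return ["tv"] if result[0] > 0.5 else ["speaker"]


sent: Dict[str, Vector] = {}  # models the communication interface
targets = route_voice([0.9, 0.7, 0.8], toy_encoder, toy_classifier, sent.__setitem__)
```

Passing the models and the transport in as callables keeps the routing step independent of any particular network or protocol, which is one possible design choice among many.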

2. The electronic device of claim 1,
wherein the processor is configured to:
input the user voice into the first neural network model in units of a preset time to obtain the operation result in units of the preset time,
input the operation result obtained in units of the preset time into the second neural network model to identify the at least one device in units of the preset time, and
control the communication interface to transmit the operation result obtained in units of the preset time to the at least one device identified in units of the preset time.
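
The per-time-unit processing of claim 2 might look like the sketch below. It assumes that the preset time unit corresponds to a fixed number of samples (`chunk_size`, a hypothetical parameter) and uses toy lambdas in place of the two models.

```python
from typing import Callable, List, Sequence, Tuple


def stream_voice(
    samples: Sequence[int],
    chunk_size: int,
    encoder: Callable[[Sequence[int]], List[int]],
    classifier: Callable[[List[int]], List[str]],
    send: Callable[[str, List[int]], None],
) -> None:
    """Run encode -> identify -> transmit once per fixed-size chunk."""
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        result = encoder(chunk)               # operation result for this unit
        for device_id in classifier(result):  # device identified for this unit
            send(device_id, result)


log: List[Tuple[str, List[int]]] = []
stream_voice(
    samples=[1, 1, 0, 0, 1],
    chunk_size=2,
    encoder=lambda chunk: [sum(chunk)],                        # toy encoder
    classifier=lambda r: ["tv"] if r[0] > 1 else ["speaker"],  # toy classifier
    send=lambda device_id, r: log.append((device_id, r)),
)
```

Because identification is repeated for every chunk, the target device may change partway through an utterance, as it does in this toy run.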

3. The electronic device of claim 1,
wherein the memory further stores information on a plurality of devices and information on a plurality of projection layers, and
wherein the processor is configured to:
identify, based on the information on the plurality of devices, information on a second dimension that can be processed by the at least one device,
when a first dimension of the operation result differs from the second dimension, change the operation result to the second dimension based on a projection layer, among the plurality of projection layers, that corresponds to the first dimension and the second dimension, and
control the communication interface to transmit the operation result changed to the second dimension to the at least one device.
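
A minimal sketch of the dimension change in claim 3, assuming each projection layer is a plain linear map stored under a hypothetical `(first_dimension, second_dimension)` key. The weight values and the `device_info` table are illustrative, not taken from the disclosure.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def project(vec: List[float], weights: Matrix) -> List[float]:
    """Apply a linear projection; weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def adapt_dimension(
    result: List[float],
    target_dim: int,
    projections: Dict[Tuple[int, int], Matrix],
) -> List[float]:
    """Return result unchanged if dimensions match, otherwise project it."""
    source_dim = len(result)
    if source_dim == target_dim:
        return result
    # Raises KeyError when no stored projection layer matches the two dimensions.
    return project(result, projections[(source_dim, target_dim)])


# Illustrative data: one 4 -> 2 projection layer and per-device dimensions.
projections = {(4, 2): [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]}
device_info = {"tv": 2, "speaker": 4}

out = adapt_dimension([0.1, 0.2, 0.3, 0.4], device_info["tv"], projections)
```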

4. The electronic device of claim 1,
wherein the memory further stores information on a plurality of devices and the remaining layers of the third neural network model, and
wherein the processor is configured to:
when it is identified, based on the information on the plurality of devices, that the at least one device does not have a voice recognition function, input the operation result to the remaining layers to obtain text corresponding to the user voice, and
control the communication interface to transmit the obtained text to the at least one device.

5. The electronic device of claim 1,
wherein the processor is configured to:
input the operation result to the second neural network model to obtain a score for each of a plurality of devices, and
control the communication interface to transmit the operation result to each device whose obtained score is equal to or greater than a threshold.
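
The score-based selection of claim 5 reduces to a threshold filter. In the sketch below, the scores and the threshold value are made-up numbers standing in for the output of the second neural network model.

```python
from typing import Dict, List


def select_devices(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep every device whose score is equal to or greater than the threshold."""
    return [device for device, score in scores.items() if score >= threshold]


# Hypothetical per-device scores, as the second model might output them.
scores = {"tv": 0.82, "speaker": 0.41, "air_conditioner": 0.67}
selected = select_devices(scores, threshold=0.6)
```

Since every device at or above the threshold is kept, a single utterance can be forwarded to more than one device.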

6. The electronic device of claim 1,
wherein the memory further stores information on a plurality of projection layers and the remaining layers of the third neural network model, and
wherein the processor is configured to:
when a first response is received from the at least one device after the operation result is transmitted to the at least one device, control the communication interface to transmit an operation result obtained thereafter to the at least one device, and
when a second response is received from the at least one device after the operation result is transmitted to the at least one device, process the operation result with one of the plurality of projection layers or input the operation result to the remaining layers.

7. The electronic device of claim 6,
wherein the processor is configured to:
when the second response is information on a dimension processable by the at least one device, change the dimension of the operation result based on a projection layer, among the plurality of projection layers, that corresponds to the dimension of the operation result and the dimension processable by the at least one device, and control the communication interface to transmit the operation result whose dimension has been changed to the at least one device, and
when the second response is information indicating that the operation result cannot be processed, input the operation result to the remaining layers to obtain text corresponding to the user voice, and control the communication interface to transmit the obtained text to the at least one device.
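
The response handling of claims 6 and 7 can be read as a three-way branch. The sketch below encodes the responses in a hypothetical `Response` record: `ok=True` stands for the first response, a set `dim` for a second response that reports a processable dimension, and neither for a second response saying the operation result cannot be processed. The `decoder` argument stands in for the remaining layers of the third neural network model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

Matrix = List[List[float]]


@dataclass
class Response:
    ok: bool                   # True models the "first response"
    dim: Optional[int] = None  # dimension reported in a "second response"


def next_payload(
    result: List[float],
    response: Response,
    projections: Dict[Tuple[int, int], Matrix],
    decoder: Callable[[List[float]], str],
) -> Union[List[float], str]:
    """Decide what to transmit next, given the device's response."""
    if response.ok:
        return result  # keep sending operation results as they are
    if response.dim is not None:
        weights = projections[(len(result), response.dim)]
        return [sum(w * x for w, x in zip(row, result)) for row in weights]
    return decoder(result)  # fall back to text


projections = {(2, 1): [[0.5, 0.5]]}  # illustrative 2 -> 1 projection layer
toy_decoder = lambda result: "turn on"  # hypothetical stand-in for decoding

same = next_payload([1.0, 2.0], Response(ok=True), projections, toy_decoder)
projected = next_payload([1.0, 2.0], Response(ok=False, dim=1), projections, toy_decoder)
text = next_payload([1.0, 2.0], Response(ok=False), projections, toy_decoder)
```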

8. The electronic device of claim 1,
wherein the first neural network model is
a model configured to include only the additionally trained partial layers, the partial layers having been additionally trained, with the weight values of the remaining layers of the third neural network model fixed, based on a plurality of sample user voices corresponding to the electronic device and a plurality of sample texts corresponding to the plurality of sample user voices.
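
The idea of training only the partial layers while the remaining layers stay fixed can be illustrated with a deliberately tiny example. The sketch below is an assumption-laden toy, not the training procedure of the disclosure: each "layer" is a single scalar weight, the sample voices and texts are replaced by numeric pairs, and plain gradient descent on a squared error is used.

```python
from typing import List, Tuple


def train_partial(
    w1: float,
    w2: float,
    samples: List[Tuple[float, float]],
    lr: float = 0.05,
    epochs: int = 200,
) -> float:
    """Update only w1 (the partial layer); w2 (the remaining layer) is fixed."""
    for _ in range(epochs):
        for x, target in samples:
            pred = w2 * (w1 * x)                    # forward pass through both layers
            grad_w1 = 2.0 * (pred - target) * w2 * x  # d(squared error)/d(w1)
            w1 -= lr * grad_w1                      # w2 receives no update
    return w1


w1, w2 = 0.5, 2.0                    # toy weights of the pretrained model
samples = [(1.0, 3.0), (2.0, 6.0)]   # toy targets follow target = 3 * x
new_w1 = train_partial(w1, w2, samples)


def first_model(x: float) -> float:
    """Keeps only the additionally trained layer, as the first model does."""
    return new_w1 * x
```

After training, `first_model` contains only the updated layer, while the fixed weight `w2` could be deployed separately, mirroring the split between the first model and the remaining layers.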

9. The electronic device of claim 1,
wherein the at least one device is configured to:
input the operation result into a fourth neural network model stored in the at least one device to obtain text corresponding to the user voice, and
perform an operation corresponding to the obtained text, and
wherein the fourth neural network model is
a model configured to include only the additionally trained remaining layers, the remaining layers of the third neural network model having been additionally trained, with the weight values of the partial layers fixed, based on a plurality of sample user voices corresponding to the at least one device and a plurality of sample texts corresponding to the plurality of sample user voices.

10. An electronic system comprising:
an electronic device configured to, when a user voice is received, input the user voice to a first neural network model to obtain an operation result, input the operation result to a second neural network model to identify at least one device corresponding to the user voice, and transmit the operation result to the at least one device; and
the at least one device, configured to input the operation result into a fourth neural network model to obtain text corresponding to the user voice and to perform an operation corresponding to the obtained text,
wherein the first neural network model is
a model configured to include only partial layers of a third neural network model, the third neural network model being trained to identify text from speech and only the partial layers having been additionally trained,
wherein the second neural network model is
a model trained to identify a device corresponding to a voice, and
wherein the fourth neural network model is
a model configured to include only the remaining layers of the third neural network model, only the remaining layers having been additionally trained.

11. The electronic system of claim 10,
wherein the at least one device is configured to:
when a first dimension of the operation result differs from a second dimension of an input of the fourth neural network model, identify whether a projection layer corresponding to the first dimension exists,
when the projection layer exists, transmit a first signal to the electronic device, and
when the projection layer does not exist, transmit a second signal including information on the second dimension to the electronic device.

12. The electronic system of claim 11,
wherein the electronic device is configured to:
when the first signal is received, maintain the operation of transmitting the operation result to the at least one device, and
when the second signal is received, either project the operation result into the second dimension and transmit the projected operation result to the at least one device, or decode the operation result and transmit the decoded operation result to the at least one device.
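
The device-side check of claim 11 can be sketched as below. The signal encoding (a tuple of a label and an optional dimension) and the set `local_projection_dims` are hypothetical. The claim does not state what the device sends when the dimensions already match, so returning the first signal in that case is an assumption of this sketch.

```python
from typing import Optional, Set, Tuple

Signal = Tuple[str, Optional[int]]


def device_signal(
    result_dim: int,
    input_dim: int,
    local_projection_dims: Set[int],
) -> Signal:
    """Decide which signal the receiving device returns to the electronic device."""
    if result_dim == input_dim:
        return ("first", None)  # assumption: matching dimensions need no change
    if result_dim in local_projection_dims:
        return ("first", None)  # the device can project the result locally
    return ("second", input_dim)  # report the dimension the device can process


has_layer = device_signal(result_dim=4, input_dim=2, local_projection_dims={4})
no_layer = device_signal(result_dim=4, input_dim=2, local_projection_dims=set())
```

On receiving the second signal, the electronic device would then project or decode the operation result before transmitting it, as recited in claim 12.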

13. A method for controlling an electronic device, the method comprising:
when a user voice is received, inputting the user voice into a first neural network model to obtain an operation result;
inputting the operation result into a second neural network model to identify at least one device corresponding to the user voice; and
transmitting the operation result to the at least one device,
wherein the first neural network model is
a model configured to include only partial layers of a third neural network model, the third neural network model being trained to identify text from speech and only the partial layers having been additionally trained, and
wherein the second neural network model is
a model trained to identify a device corresponding to a voice.

14. The method of claim 13,
wherein the obtaining comprises
inputting the user voice into the first neural network model in units of a preset time to obtain the operation result in units of the preset time,
wherein the identifying comprises
inputting the operation result obtained in units of the preset time into the second neural network model to identify the at least one device in units of the preset time, and
wherein the transmitting comprises
transmitting the operation result obtained in units of the preset time to the at least one device identified in units of the preset time.

15. The method of claim 13, further comprising:
identifying, based on information on a plurality of devices, information on a second dimension that can be processed by the at least one device; and
when a first dimension of the operation result differs from the second dimension, changing the operation result to the second dimension based on a projection layer, among a plurality of projection layers, that corresponds to the first dimension and the second dimension,
wherein the transmitting comprises
transmitting the operation result changed to the second dimension to the at least one device.

16. The method of claim 13, further comprising:
when it is identified, based on information on a plurality of devices, that the at least one device does not have a voice recognition function, inputting the operation result to the remaining layers of the third neural network model to obtain text corresponding to the user voice,
wherein the transmitting comprises
transmitting the obtained text to the at least one device.

17. The method of claim 13,
wherein the identifying comprises
inputting the operation result to the second neural network model to obtain a score for each of a plurality of devices, and
wherein the transmitting comprises
transmitting the operation result to each device whose obtained score is equal to or greater than a threshold.

18. The method of claim 13, further comprising:
when a first response is received from the at least one device after the operation result is transmitted to the at least one device, transmitting an operation result obtained thereafter to the at least one device; and
when a second response is received from the at least one device after the operation result is transmitted to the at least one device, processing the operation result with one of a plurality of projection layers or inputting the operation result into the remaining layers of the third neural network model.

19. The method of claim 18,
wherein the processing or inputting comprises:
when the second response is information on a dimension processable by the at least one device, changing the dimension of the operation result based on a projection layer, among the plurality of projection layers, that corresponds to the dimension of the operation result and the dimension processable by the at least one device, and transmitting the operation result whose dimension has been changed to the at least one device; and
when the second response is information indicating that the operation result cannot be processed, inputting the operation result to the remaining layers to obtain text corresponding to the user voice, and transmitting the obtained text to the at least one device.

20. The method of claim 13,
wherein the first neural network model is
a model configured to include only the additionally trained partial layers, the partial layers having been additionally trained, with the weight values of the remaining layers of the third neural network model fixed, based on a plurality of sample user voices corresponding to the electronic device and a plurality of sample texts corresponding to the plurality of sample user voices.