WO2020222384A1

WO2020222384A1 - Electronic device and control method therefor

Info

Publication number: WO2020222384A1
Application number: PCT/KR2019/016547
Authority: WO
Inventors: 박영환; 김명선; 김대현; 에거버나드; 최기영; 하순회; 강두석; 김희수; 토마스 헤리스바렌드; 배인표
Original assignee: Samsung Electronics Co., Ltd. (삼성전자주식회사); Seoul National University R&DB Foundation (서울대학교산학협력단)
Priority date: 2019-04-30
Filing date: 2019-11-28
Publication date: 2020-11-05
Also published as: KR20200126675A

Abstract

An electronic device and a control method therefor are provided. The present electronic device may comprise: a microphone; a memory having at least one instruction stored therein; and a processor for executing the at least one instruction, wherein the processor recognizes a user who has uttered a sound input through the microphone, and after the recognition of the user, when the recognized user's voice command is input through the microphone, the processor inputs the user's voice command to a first neural network of a first artificial intelligence model, inputs results output from the first neural network, to a second neural network of the first artificial intelligence model, acquires control information corresponding to the user's voice command on the basis of the results output from the first neural network and the second neural network, and performs a function corresponding to the acquired control information, wherein the first neural network is a neural network trained on the basis of data stored in a plurality of user databases, and the second neural network is a neural network additionally trained on the basis of data stored in a database for the recognized user.

Description

Electronic device and control method thereof

The present disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device that recognizes the user who uttered an input voice and performs a function corresponding to the recognized user's voice command, and a control method thereof.

In recent years, speech recognition systems using artificial intelligence systems are being used in various fields. Unlike the existing rule-based smart system, the artificial intelligence system is a system in which machines learn, judge, and become smart. As the artificial intelligence system is used, the recognition rate improves and users' tastes can be understood more accurately, and the existing rule-based smart system is gradually being replaced by a deep learning-based artificial intelligence system.

Artificial intelligence technology consists of machine learning (for example, deep learning) and component technologies using machine learning.

Machine learning is an algorithm technology that classifies and learns the features of input data by itself. Element technology is a technology that simulates functions of the human brain, such as cognition and judgment, using machine learning algorithms such as deep learning, and consists of technical fields such as linguistic understanding, reasoning/prediction, knowledge expression, and motion control. In particular, linguistic understanding is a technology that recognizes and applies/processes human language/text, and includes natural language processing, machine translation, dialogue systems, question answering, and speech recognition/synthesis.

However, since a high-performance computer server is used for the existing deep machine learning, a large amount of cost and space is required, and a very large database, which is difficult for general users to obtain, is also required. Therefore, there is a limitation in that it is difficult to utilize deep machine learning technology in general home appliance systems or portable devices.

Meanwhile, technology for recognizing the user who uttered an input voice by extracting a voice fingerprint from the user's voice has been continuously developed in recent years. In the past, however, speaker recognition was performed by storing, as a voice fingerprint, only a voice for which the user had provided feedback. Consequently, there was a limitation in that the user's feedback was continuously required in order to use a voice fingerprint supplemented with user information obtained through continued use.

The present disclosure was devised to solve the above-described problem, and an object of the present disclosure is to provide an electronic device that recognizes a user by using a voice fingerprint supplemented with continuously provided user information, obtains information on the user's voice command, and performs a function corresponding to the voice command, and a control method thereof.

An electronic device according to an embodiment of the present disclosure for achieving the above object includes a microphone, a display, a memory storing at least one instruction, and a processor executing the at least one instruction. The processor recognizes the user who uttered a voice input through the microphone and, when a voice command of the recognized user is input through the microphone after the user is recognized, inputs the user's voice command to a first neural network of a first artificial intelligence model, inputs the recognized user's database and the result output from the first neural network into a second neural network of the first artificial intelligence model, and acquires control information corresponding to the user's voice command based on the results output from the first neural network and the second neural network. The first neural network may be a neural network trained based on data stored in a plurality of user databases, and the second neural network may be a neural network additionally trained based on data stored in the user's database.

Meanwhile, a method of controlling an electronic device according to an embodiment of the present disclosure for achieving the above object includes: when a user voice is input, recognizing the user who uttered the input voice; when a voice command of the recognized user is input after the user is recognized, inputting the user's voice command to a first neural network of a first artificial intelligence model; inputting a result output from the first neural network to a second neural network of the first artificial intelligence model; acquiring control information corresponding to the user's voice command based on results output from the first neural network and the second neural network; and performing a function corresponding to the acquired control information. The first neural network may be a neural network trained based on data stored in a plurality of user databases, and the second neural network may be a neural network additionally trained based on data stored in the user's database.

As described above, according to various embodiments of the present disclosure, the electronic device continuously recognizes a user by using a voice fingerprint supplemented with user information and obtains control information corresponding to a user voice command through an artificial intelligence model including a plurality of neural networks, so that the user can conveniently use the voice command recognition system.

FIG. 1 is a diagram illustrating a use of an electronic device that recognizes a user who has spoken and performs a function according to a command of the recognized user according to an embodiment of the present disclosure.

FIG. 2A is a block diagram schematically illustrating a configuration of an electronic device according to an embodiment of the present disclosure.

FIG. 2B is a block diagram showing in detail the configuration of an electronic device according to an embodiment of the present disclosure.

FIG. 3 is a block diagram illustrating a process of extracting a voice fingerprint according to an embodiment of the present disclosure.

FIG. 4A is a view for explaining a process of comparing an extracted voice fingerprint and a pre-registered voice fingerprint according to an embodiment of the present disclosure.

FIG. 4B is a diagram for explaining a UI when recognizing a user who has spoken according to an embodiment of the present disclosure.

FIG. 4C is a diagram for explaining a process of additionally storing voice fingerprints when voice fingerprints are stored more than a preset number according to an embodiment of the present disclosure.

FIG. 5 is a flowchart illustrating a process of comparing and storing voice fingerprints according to an embodiment of the present disclosure.

FIG. 6 is a block diagram illustrating a process of recognizing a user's voice command according to an embodiment of the present disclosure.

FIG. 7 is a block diagram illustrating a process of recognizing a user's voice command according to an embodiment of the present disclosure.

FIG. 8 is a block diagram showing a configuration of an electronic device for learning and using an artificial intelligence model according to an embodiment of the present disclosure.

FIG. 9 is a block diagram showing a detailed configuration of a learning unit and a recognition unit according to an embodiment of the present disclosure.

FIG. 10 is a diagram for explaining a UI for checking whether recognized voice command information is correct according to an embodiment of the present disclosure.

FIG. 11 is a flowchart illustrating a method of controlling an electronic device according to an embodiment of the present disclosure.

Hereinafter, various embodiments of the present document will be described with reference to the accompanying drawings. However, this is not intended to limit the technology described in this document to a specific embodiment, and it should be understood to include various modifications, equivalents, and/or alternatives of the embodiments of this document. In connection with the description of the drawings, similar reference numerals may be used for similar elements.

In this document, expressions such as "have," "may have," "include," or "may contain" indicate the presence of corresponding features (eg, elements such as numbers, functions, actions, or parts) and do not exclude the presence of additional features.

In this document, expressions such as "A or B," "at least one of A or/and B," or "one or more of A or/and B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to all cases including (1) at least one A, (2) at least one B, or (3) both at least one A and at least one B.

Expressions such as "first" or "second" used in this document can modify various elements regardless of their order and/or importance, and are used only to distinguish one component from another without limiting the components.

When a component (eg, a first component) is referred to as being "(functionally or communicatively) coupled with/to" or "connected to" another component (eg, a second component), it should be understood that the component may be directly connected to the other component or may be connected through another component (eg, a third component). On the other hand, when a component (eg, a first component) is referred to as being "directly coupled" or "directly connected" to another component (eg, a second component), it may be understood that no other component (eg, a third component) exists between the component and the other component.

The expression "configured to (or set to)" as used in this document may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the situation. The term "configured to (or set to)" does not necessarily mean only "specifically designed to" in hardware. Instead, in some situations, the expression "a device configured to" may mean that the device "can" perform an operation together with other devices or parts. For example, the phrase "a subprocessor configured (or set) to perform A, B, and C" may mean a dedicated processor (eg, an embedded processor) for performing the operations, or a general-purpose processor (eg, a CPU or an application processor) capable of performing the operations by executing one or more software programs stored in a memory device.

Electronic devices according to various embodiments of the present document may include, for example, at least one of a smart phone, a tablet PC, a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a PDA, a portable multimedia player (PMP), an MP3 player, a medical device, a camera, or a wearable device. In some embodiments, the electronic device may be, for example, a television, a digital video disk (DVD) player, an audio device, a refrigerator, an air conditioner, a vacuum cleaner, a microwave oven, an air purifier, a set-top box, a media box (eg, Samsung HomeSync™, Apple TV™, or Google TV™), a game console (eg, Xbox™, PlayStation™), an electronic dictionary, an electronic key, or an electronic picture frame.

In another embodiment, the electronic device may include at least one of various medical devices (eg, various portable medical measuring devices such as a blood glucose meter, a heart rate meter, a blood pressure meter, or a body temperature meter), a navigation device, a global navigation satellite system (GNSS), avionics, security devices, industrial or home robots, drones, ATMs of financial institutions, point of sales (POS) devices of stores, or Internet of Things devices.

In this document, the term user may refer to a person using an electronic device or a device (eg, an artificial intelligence electronic device) using an electronic device.

Hereinafter, the present disclosure will be described in more detail with reference to the drawings.

First, as shown in (a) of FIG. 1, the electronic device 100 may receive a voice 10 from a user through the microphone 110. In particular, the voice 10 received from the user may be a trigger voice that initiates the extraction 30 of a voice fingerprint representing the user. The trigger voice may be 'Hi Bixby' as shown in (a) of FIG. 1, but this is only an example and the trigger voice can be freely set by the user.

Meanwhile, the electronic device 100 may pre-process the voice 10 received from the user before inputting it into the trained second artificial intelligence model 20 to obtain a voice fingerprint. According to an embodiment, the electronic device 100 may distinguish the actual user's voice from meaningless background noise in the input voice and delete the background noise. That is, the electronic device 100 may delete meaningless background noise from the input voice 10 so that only the user voice having actual meaning is input to the second artificial intelligence model 20. In addition, as an embodiment, the electronic device 100 may amplify only the amplitude of the frequency band in which the user's voice exists in the voice 10 input from the user and delete the remaining frequency bands. In addition, the electronic device 100 may divide the preprocessed voice into preset frame units and extract a feature vector or a feature matrix of the voice in each frame section.
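As an illustration only, the framing and feature-extraction steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the disclosure: the noise gate based on mean frame energy, the naive DFT, the equal-width bands, and all function names and default values (`split_into_frames`, `drop_background`, `band_energies`, `feature_matrix`, `energy_floor`, and so on) are hypothetical choices made for the example. The band-amplification step is not shown.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Split a sample list into fixed-length frames taken every `hop` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def mean_energy(frame):
    """Average squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def drop_background(frames, energy_floor):
    """Assumed noise gate: keep only frames whose mean energy reaches the floor."""
    return [f for f in frames if mean_energy(f) >= energy_floor]

def band_energies(frame, n_bands):
    """Spectral energy per band for one frame, using a naive DFT.

    Bins that do not fit into n_bands equal-width bands are ignored.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append((re * re + im * im) / n)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]

def feature_matrix(samples, frame_len=32, hop=16, energy_floor=1e-3, n_bands=4):
    """Build a feature matrix with one feature vector (row) per retained frame."""
    frames = drop_background(split_into_frames(samples, frame_len, hop), energy_floor)
    return [band_energies(f, n_bands) for f in frames]
```

Each row of the resulting matrix could then be supplied to a fingerprint model in place of the raw waveform; a practical system would more likely use an FFT library and perceptually motivated bands.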

The electronic device 100 may input the feature vector or feature matrix of the voice into the second artificial intelligence model 20 to extract a voice fingerprint representing the user (30). However, this is only an example, and the electronic device 100 may instead extract the voice fingerprint representing the user (30) by inputting the preprocessed voice as it is into the second artificial intelligence model 20.

The voice fingerprint 41 representing the user shown in (b) of FIG. 1 may be data obtained by analyzing the frequency distribution of the energy of the user's voice. This data may be in the form of a vector or a matrix, or may be in the form of an image that visually represents the spectrum in the frequency band of the voice.

The electronic device 100 may recognize a user who uttered the input user's voice by comparing the voice fingerprint 41 representing the user with the previously registered voice fingerprints 42, 43-1, and 43-2. Specifically, the electronic device 100 may calculate an error by comparing the voice fingerprint 41 representing the user with the previously registered voice fingerprints 42, 43-1, and 43-2. In addition, when the calculated error does not exceed the threshold value, the electronic device 100 may recognize the user who uttered the input voice.
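The comparison described above can be sketched as follows. The disclosure does not state how the error is computed, so the Euclidean distance used here is an assumption, as are the names `fingerprint_error` and `recognize` and the representation of registered fingerprints as a dictionary keyed by user.

```python
import math

def fingerprint_error(a, b):
    """Assumed error measure: Euclidean distance between two fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(extracted, registered, threshold):
    """Return (user, error) for the closest registered fingerprint.

    `registered` maps a user name to that user's reference fingerprint.
    The user is reported as None when the smallest error exceeds the threshold.
    """
    best_user, best_error = None, float("inf")
    for user, reference in registered.items():
        error = fingerprint_error(extracted, reference)
        if error < best_error:
            best_user, best_error = user, error
    if best_error <= threshold:
        return best_user, best_error
    return None, best_error
```

A caller would pass the fingerprint extracted from the trigger voice and act only when the returned user is not None.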

Meanwhile, when the calculated error does not exceed a preset value, the electronic device 100 may store the voice fingerprint 41 representing the user in the memory 130. In particular, the preset value may be smaller than the threshold value. In addition, when more than a preset number of voice fingerprints representing the user are stored in the memory 130, the electronic device 100 may delete the voice fingerprint that was stored earliest. That is, the electronic device 100 may include a buffer for storing the voice fingerprint 41 and may store and manage voice fingerprints in a first-in first-out (FIFO) manner.

Meanwhile, in case an additionally stored voice fingerprint is incorrect data, the electronic device 100 may always keep the voice fingerprint 42 representing the user that was registered first. Accordingly, when more than the preset number of voice fingerprints representing the user are stored, the electronic device 100 may delete the voice fingerprint that was stored earliest, excluding the first registered voice fingerprint 42.

In addition, the pre-registered voice fingerprint may be average data of the first registered voice fingerprint 42 representing the user and the additionally stored voice fingerprints 43-1 and 43-2 representing the user. In one embodiment, if a voice fingerprint is additionally stored while the preset number of voice fingerprints are already stored, the electronic device 100 may delete the oldest stored voice fingerprint 43-1, excluding the first registered voice fingerprint 42, and then obtain the average data of the stored voice fingerprints.
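The storage policy described above can be sketched as follows. This is a minimal sketch that assumes the FIFO description means the oldest additionally stored fingerprint is the one dropped, that the first registered fingerprint is kept outside the buffer, and that the reference is an element-wise mean. The class name `FingerprintStore` and its methods are hypothetical.

```python
from collections import deque

class FingerprintStore:
    """Keeps the first registered fingerprint plus a bounded FIFO of later ones."""

    def __init__(self, enrolled, max_extra):
        self.enrolled = list(enrolled)          # first registered fingerprint, never evicted
        self.extra = deque(maxlen=max_extra)    # a full deque drops its oldest entry on append

    def maybe_store(self, fingerprint, error, preset_value):
        """Store the fingerprint only when its error does not exceed the preset value."""
        if error <= preset_value:
            self.extra.append(list(fingerprint))
            return True
        return False

    def reference(self):
        """Element-wise average of the enrolled and additionally stored fingerprints."""
        stored = [self.enrolled, *self.extra]
        return [sum(column) / len(stored) for column in zip(*stored)]
```

Keeping the enrolled fingerprint outside the deque is one simple way to guarantee that it survives eviction while the remaining entries rotate.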

In addition, the threshold value may be changed in correspondence with a newly stored voice fingerprint representing the user. That is, since the user's voice may not always be the same and may vary over time, the electronic device 100 may change the threshold value according to an additionally stored voice fingerprint. Specifically, when the energy frequency distribution of the voice corresponding to the newly stored voice fingerprint differs from the energy frequency distribution of the voice corresponding to the previously registered voice fingerprint, the electronic device 100 may change the threshold value to correspond to the difference.

Meanwhile, when the electronic device 100 recognizes a user who uttered an input voice, the electronic device 100 may control the display 120 to display a UI 50 including a message indicating that the user has been recognized.

As shown in (c) of FIG. 1, after the user is recognized through the voice fingerprint, when the recognized user's voice command 60 is input, the electronic device 100 may obtain control information corresponding to the voice command by inputting the voice command into the first artificial intelligence model 70 including a plurality of neural networks. In particular, the first artificial intelligence model 70 may be a model using a first neural network trained on voice commands of a plurality of users and a second neural network trained on the voice commands of the user. The first neural network may also be referred to as a general neural network, a global neural network, or the like, and is described below as a general neural network. Likewise, the second neural network may also be referred to as a personal neural network, and is described below as a personal neural network.

Specifically, the electronic device 100 may input the inputted user's voice command 60 to the general neural network of the first artificial intelligence model 70. The general neural network may be a neural network learned based on data stored in a plurality of user databases. In the plurality of user databases, information of a plurality of users, a plurality of voice commands, and control information corresponding to the voice commands may be stored.

Further, the electronic device 100 may input the recognized user's database and the result output from the general neural network into the personal neural network of the first artificial intelligence model 70. The personal neural network can learn additionally based on data stored in the recognized user's database. The user's database may store recognized user information, a previously input voice command, and control information corresponding to the voice command.

In addition, control information for a user's voice command may be obtained based on a result output from a general neural network and a result output from a personal neural network based on data stored in the user database. That is, the electronic device 100 may recognize the voice command through control information on the user voice command (80).
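The data flow between the general neural network and the personal neural network can be sketched as follows. This is only an illustrative sketch under several assumptions: each network is reduced to a single linear layer with softmax, the user's database is represented by a small feature list, and the two outputs are combined by simple addition. The disclosure does not specify the architectures or how the results are combined, and the names `TwoStageModel` and `predict` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """One dense layer: a weight row and a bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

class TwoStageModel:
    """General network first, then a personal network fed with its output."""

    def __init__(self, general_w, general_b, personal_w, personal_b):
        self.general_w, self.general_b = general_w, general_b
        self.personal_w, self.personal_b = personal_w, personal_b

    def predict(self, command_features, user_features):
        # Stage 1: general network scores each candidate control action.
        general_out = softmax(linear(self.general_w, self.general_b, command_features))
        # Stage 2: personal network sees the general output plus user-specific data.
        personal_in = general_out + list(user_features)
        personal_out = softmax(linear(self.personal_w, self.personal_b, personal_in))
        # Assumed combination rule: add both outputs and take the best action.
        combined = [g + p for g, p in zip(general_out, personal_out)]
        return combined.index(max(combined)), general_out, personal_out
```

In this sketch only the personal weights would need to change per user, which is one way the general network could stay shared across users while the personal network adapts on the device.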

In addition, the electronic device 100 may perform a function corresponding to the acquired control information. For example, when a voice command 60 of 'tell me today's weather' is input to the first artificial intelligence model 70 and control information corresponding to the voice command is obtained, the electronic device 100 may control the display 120 to display the UI 90 including information on today's weather.

Meanwhile, the artificial intelligence model trained in the present disclosure may be constructed in consideration of the application field of the recognition model or the computing performance of the device. For example, the artificial intelligence model may be trained to receive a voice command as input and obtain control information corresponding to the voice command. The recognition model may be designed to simulate the human brain structure on a computer and may include a plurality of weighted network nodes that simulate neurons of a human neural network. The plurality of network nodes may each form a connection relationship so as to simulate the synaptic activity of neurons that send and receive signals through synapses. In this case, the artificial intelligence model may be a deep neural network (DNN), but this is only an example, and another artificial intelligence model may be used.

Also, the electronic device 100 may use an artificial intelligence agent to search for information corresponding to a voice command. The artificial intelligence agent is a dedicated program for providing AI (Artificial Intelligence)-based services (e.g., a voice recognition service, a secretary service, a translation service, a search service, etc.), and may be executed by an existing general-purpose processor (for example, a CPU) or a separate AI-dedicated processor (for example, a GPU).

Specifically, when a user voice is input after the user voice 10, that is, the trigger voice, is input or after a button provided on the electronic device 100 (for example, a button for executing the artificial intelligence agent) is pressed, the artificial intelligence agent may operate. In addition, the artificial intelligence agent may acquire information corresponding to the voice command based on the user's voice.

Of course, the artificial intelligence agent may be operated when a user voice is input after a preset trigger voice is input or after a button provided on the electronic device 100 (for example, a button for executing the artificial intelligence agent) is pressed. Alternatively, the artificial intelligence agent may already be running before the preset trigger voice is input or the button is pressed and a user voice is input. In this case, once the preset trigger voice is input or the button is pressed and a user voice is then input, the artificial intelligence agent of the electronic device 100 may perform a function of retrieving control information for the user's voice command. In addition, the artificial intelligence agent may be in a standby state before the preset trigger voice is input or the button is pressed and a user voice is input. Here, the standby state is a state of waiting to detect a predefined user input that controls the start of the operation of the artificial intelligence agent. When a user voice command is input after the preset trigger voice is input or the button is pressed while the artificial intelligence agent is in the standby state, the electronic device 100 may operate the artificial intelligence agent, and may search for and provide control information according to the user's voice command.

FIG. 2A is a block diagram schematically illustrating a configuration of an electronic device according to an embodiment of the present disclosure. Referring to FIG. 2A, the electronic device 100 may include a microphone 110, a display 120, a memory 130, and a processor 140. The components shown in FIG. 2A are examples for implementing embodiments of the present disclosure, and appropriate hardware/software components that are obvious to a person skilled in the art may be additionally included in the electronic device 100.

The microphone 110 is a component for receiving a user's voice and may be provided inside the electronic device 100. However, this is only an example, and the microphone 110 may be provided outside the electronic device 100 and electrically connected to it, or may be communicatively connected through the communication unit 160. In particular, the microphone 110 may be provided in a separate remote controller for controlling the electronic device 100, in which case the remote controller may be provided with a button for executing the artificial intelligence agent.

The display 120 may display various types of information under the control of the processor 140. In particular, when the electronic device 100 recognizes a user who uttered a voice input through the microphone 110, the display 120 may display a UI including a message indicating that the user has been recognized. In addition, the display 120 may display a UI for checking whether the information acquired by the electronic device 100 through the first artificial intelligence model 70 corresponds to information on a user's voice command.

In addition, the display 120 may be implemented as a touch screen together with a touch panel. However, it is not limited to the above-described implementation, and the display 120 may be implemented differently according to the type of the electronic device 100.

The memory 130 may store an instruction or data related to at least one other component of the electronic device 100. In particular, the memory 130 may be implemented as a non-volatile memory, a volatile memory, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 130 is accessed by the processor 140, and reading, writing, editing, deleting, and updating of data by the processor 140 may be performed. In the present disclosure, the term memory may include the memory 130, a ROM (not shown) in the processor 140, a RAM (not shown), or a memory card (not shown) mounted in the electronic device 100 (for example, a micro SD card or a memory stick). In addition, the memory 130 may store programs and data for configuring various screens to be displayed in the display area of the display 120.

In particular, the memory 130 may store a program for executing an artificial intelligence agent. In this case, the artificial intelligence agent is a personalized program for providing various services to the electronic device 100. In addition, the memory 130 may store the first artificial intelligence model 70 trained to obtain control information corresponding to a user's voice command. In addition, the memory 130 may store the second artificial intelligence model 20 trained to acquire a voice fingerprint, and a voice fingerprint representing a user extracted by the second artificial intelligence model 20.

The processor 140 is electrically connected to the memory 130 to control overall operations and functions of the electronic device 100. In particular, by executing at least one instruction stored in the memory 130, the processor 140 may extract a voice fingerprint representing the user based on the voice input through the microphone 110, and may recognize the user who uttered the input voice by comparing the extracted voice fingerprint with previously registered voice fingerprints. When a voice command of the recognized user is input through the microphone 110 after the user is recognized, the processor 140 may obtain control information corresponding to the user's voice command by inputting the voice command into the first artificial intelligence model 70 including a plurality of neural networks, and may perform a function corresponding to the acquired control information.

In particular, the processor 140 may pre-process the input voice when a user voice is input through the microphone 110. Specifically, the processor 140 may obtain only the voice part of the actual user by classifying a non-significant background noise part and a voice part of an actual user among the input voice, and deleting the background noise part. Further, the processor 140 may amplify only the amplitude of the frequency band in which the user's voice exists and delete the remaining bands. However, the above-described method is only an embodiment, and the processor 140 may pre-process the input voice in various ways.

In addition, the processor 140 may divide the preprocessed speech into a predetermined frame unit, and extract a characteristic vector or a characteristic matrix of the speech in the frame section. In addition, the processor 140 may extract a voice fingerprint representing a user by inputting a feature vector of a voice, a feature matrix, or the preprocessed voice itself into the second artificial intelligence model 20. The voice fingerprint representing the user may be data obtained by analyzing the frequency of energy of the user's voice. The data obtained by analyzing the frequency of the energy of the user's voice may be in the form of a vector or matrix, and may be in the form of an image that visually represents the spectrum in the frequency band of the voice.

Meanwhile, the processor 140 may recognize the user who uttered the input voice by comparing the extracted voice fingerprint with a previously registered voice fingerprint. Specifically, the processor 140 may calculate an error by comparing the voice fingerprint representing the user with the previously registered voice fingerprint, and when the calculated error does not exceed a threshold value, the processor 140 may recognize the user who uttered the input voice. When the calculated error does not exceed a preset value, the processor 140 may store the voice fingerprint representing the user in the memory 130. In addition, when more than a preset number of voice fingerprints representing the user are stored in the memory 130, the processor 140 may delete the voice fingerprint that was stored earliest.

Meanwhile, the previously registered voice fingerprint may be average data of the first registered voice fingerprint representing the user and the additionally stored voice fingerprints representing the user. Accordingly, when more than the preset number of voice fingerprints are stored, the processor 140 may delete the voice fingerprint that was stored earliest, excluding the first registered voice fingerprint. In addition, the processor 140 may obtain average data of the first registered voice fingerprint and the remaining voice fingerprints, and calculate an error between the average data and a newly extracted voice fingerprint representing the user.

In addition, the processor 140 may change the threshold value in response to the newly stored voice fingerprint representing the user. Specifically, when the energy frequency distribution of the voice corresponding to the newly stored voice fingerprint is different from the energy frequency distribution of the voice corresponding to the previously registered voice fingerprint, the processor 140 may change the threshold value to correspond to the difference.

In addition, the processor 140 may input the user's voice command into the first artificial intelligence model 70, and may input the result output from the general neural network, together with the recognized user's database, into the personal neural network of the first artificial intelligence model 70. Further, the processor 140 may obtain information on the user's voice command based on the results output from the general neural network and the personal neural network. Specifically, the processor 140 may input the user's voice command to the general neural network, which has been trained in advance based on data stored in a plurality of user databases. In addition, the processor 140 may input the control information corresponding to the voice command output from the general neural network, and the recognized user's database, to the personal neural network. Meanwhile, the personal neural network may be additionally trained based on data stored in the user database. Further, the processor 140 may obtain final control information for the user's voice command based on the results output from the general neural network and the personal neural network.

In addition, the processor 140 may control the display 120 to display a UI for checking whether the acquired control information corresponds to the user's voice command. When there is a user input indicating that the acquired control information does not match the user's voice command, the processor 140 may additionally train the personal neural network by inputting the acquired information and the user's voice command into the personal neural network. Accordingly, when the same voice command is input later, the processor 140 may acquire the correct control information for that voice command through the trained personal neural network.

In addition, the processor 140 may perform a function corresponding to the acquired control information. For example, when a user's voice command asking for today's weather is input and control information corresponding to today's weather is obtained, the processor 140 may control the display 120 to display a UI for today's weather.

FIG. 2B is a block diagram illustrating a detailed configuration of the electronic device 100 according to an embodiment of the present disclosure. As shown in FIG. 2B, the electronic device 100 may include a microphone 110, a display 120, a memory 130, a processor 140, a speaker 150, a communication unit 160, a camera 170, and an input unit 180. Meanwhile, since the microphone 110, the display 120, the memory 130, and the processor 140 have been described with reference to FIG. 2A, redundant descriptions will be omitted.

The speaker 150 is a component that outputs various notification sounds or voice messages as well as various audio data on which various processing tasks such as decoding, amplification, and noise filtering have been performed by an audio processing unit (not shown). However, the speaker 150 is merely an exemplary embodiment, and may be implemented as another output terminal capable of outputting audio data.

The communication unit 160 may communicate with an external device through various communication methods. The communication connection between the communication unit 160 and an external device may include communicating via a third device (eg, a repeater, a hub, an access point, a server, or a gateway).

Meanwhile, the communication unit 160 may include various communication modules to perform communication with an external device. As an example, the communication unit 160 may include a wireless communication module, for example a cellular communication module using at least one of LTE, LTE-A (LTE Advance), CDMA (code division multiple access), WCDMA (wideband CDMA), UMTS (universal mobile telecommunications system), WiBro (Wireless Broadband), or GSM (Global System for Mobile Communications). As another example, the wireless communication module may include at least one of WiFi (wireless fidelity), Bluetooth, Bluetooth low energy (BLE), Zigbee, NFC (near field communication), magnetic secure transmission, radio frequency (RF), or a body area network (BAN). In addition, the communication unit 160 may include a wired communication module, for example, at least one of universal serial bus (USB), high definition multimedia interface (HDMI), recommended standard 232 (RS-232), power line communication, or plain old telephone service (POTS). The network in which wireless communication or wired communication is performed may include at least one of a telecommunication network, for example, a computer network (e.g., LAN or WAN), the Internet, or a telephone network.

The camera 170 may photograph a user. In particular, the photographed picture of the user may be included in a UI displayed when the user is recognized. In addition, the camera 170 may be provided on at least one of the front and the rear of the electronic device 100. Meanwhile, the camera 170 may be provided inside the electronic device 100, but this is only an example; the camera 170 may instead exist outside the electronic device 100 and be connected to the electronic device 100 by wire or wirelessly.

The input unit 180 may receive various user inputs and transmit them to the processor 140. In particular, the input unit 180 may include a touch sensor, a (digital) pen sensor, a pressure sensor, a key, or a microphone. The touch sensor may use at least one of, for example, a capacitive type, a pressure sensitive type, an infrared type, or an ultrasonic type. The (digital) pen sensor may be part of, for example, a touch panel, or may include a separate recognition sheet. The key may include, for example, a physical button, an optical key, or a keypad.

In particular, the input unit 180 may obtain an input signal according to a user command input through the UI. In addition, the input unit 180 may transmit an input signal to the processor 140.

The processor 140 may include one or more of a central processing unit (CPU) that processes digital signals, a micro controller unit (MCU), a micro processing unit (MPU), a controller, an application processor (AP), a communication processor (CP), and an ARM processor, or may be defined by the corresponding term. Further, the processor 140 may be implemented as a system on chip (SoC) or large scale integration (LSI) in which a processing algorithm is embedded, or may be implemented in the form of a field programmable gate array (FPGA). The processor 140 may perform various functions by executing computer executable instructions stored in the memory 130. In addition, in order to perform an artificial intelligence function, the processor 140 may include at least one of a graphics processing unit (GPU), a neural processing unit (NPU), and a visual processing unit (VPU), which are separate AI-dedicated processors.

FIG. 3 is a block diagram illustrating a process of extracting a voice fingerprint according to an embodiment of the present disclosure. First, the electronic device 100 may receive a user voice through the microphone 110 (310). The input user voice may be a trigger voice for starting voice fingerprint extraction. The trigger voice may be 'Hi Bixby', but this is only an example and can be freely changed by the user.

When a trigger voice is input from the user, the electronic device 100 may preprocess the input voice (320). The electronic device 100 may preprocess the user's voice in various ways, such as amplifying the amplitude of the frequency band in which the actual voice is located and deleting the remaining bands. Alternatively, the electronic device 100 may remove background noise from the input voice and keep only the voice that carries actual meaning, so that it can be input into the trained second artificial intelligence model 20 to obtain a voice fingerprint. Further, the electronic device 100 may divide the preprocessed voice into frames of a predetermined unit and extract a feature vector or a feature matrix of the voice in each frame section.
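The framing step described above can be sketched as follows. This is a minimal illustration only: the function names, the frame and hop sizes, and the two per-frame features (mean energy and zero-crossing rate) are assumptions chosen for the example, not features named in the disclosure.

```python
def split_into_frames(samples, frame_size, hop_size):
    """Split a sample sequence into fixed-size, possibly overlapping frames."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frames.append(samples[start:start + frame_size])
    return frames


def frame_features(frame):
    """Hypothetical per-frame feature vector: [mean energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]


def feature_matrix(samples, frame_size=4, hop_size=2):
    """Stack the per-frame feature vectors into a feature matrix (list of rows)."""
    return [frame_features(f) for f in split_into_frames(samples, frame_size, hop_size)]


# Toy preprocessed signal; a real device would use microphone samples.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
matrix = feature_matrix(signal)
```

Each row of `matrix` corresponds to one frame section, so either a single row (feature vector) or the whole matrix could be passed on to a fingerprint model.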

In addition, the electronic device 100 may input the extracted feature vector, the feature matrix, or the preprocessed voice itself into the second artificial intelligence model 20 (330). That is, the electronic device 100 may input not only the feature vector and the feature matrix of the voice but also the preprocessed voice itself into the second artificial intelligence model 20 to extract the voice fingerprint. The second artificial intelligence model 20 is an artificial intelligence model that has been trained to acquire a voice fingerprint. Accordingly, the second artificial intelligence model 20 may extract a voice fingerprint, which is data obtained by analyzing the distribution of energy frequencies of the user's voice, based on the feature vector, the feature matrix, and the preprocessed voice of the input voice (340). The voice fingerprint may be image data indicating the distribution of the energy frequency of the user's voice, and may be expressed in the form of a vector or matrix. In addition, the electronic device 100 may recognize the user who uttered the input voice by comparing the extracted voice fingerprint with a previously registered voice fingerprint. The specific process will be described in detail with reference to FIGS. 4A to 4C and FIG. 5.

FIG. 4A is a diagram for describing a process of comparing an extracted voice fingerprint and a pre-registered voice fingerprint according to an embodiment of the present disclosure. As illustrated in FIG. 4A, in order to recognize the user who spoke, the electronic device 100 may compare the extracted voice fingerprint 410 representing the user with the previously registered voice fingerprints 420, 430, and 440. Specifically, the electronic device 100 calculates an error by comparing the voice fingerprint 410 representing the user with the previously registered voice fingerprints 420, 430, and 440, and when the calculated error does not exceed a threshold value, the electronic device 100 may recognize the user who uttered the input voice. In addition, the previously registered voice fingerprint may be average data of the first registered voice fingerprint 420 and the stored voice fingerprints 430 and 440. Accordingly, the electronic device 100 may calculate an error by comparing the average data of the first registered voice fingerprint 420 and the stored voice fingerprints 430 and 440 with the voice fingerprint 410 representing the user.

Further, the electronic device 100 may use the cosine similarity when calculating an error between voice fingerprints.

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \tag{1}$$

Equation (1) above is an equation to calculate the cosine similarity between voice fingerprints. For example, if the voice fingerprint representing the user is a and the previously registered voice fingerprint is b, the electronic device 100 may calculate the similarity and calculate the error using the above equation (1).
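As a minimal sketch of this comparison, the cosine similarity between two fingerprint vectors a and b can be computed as below. The conversion from similarity to error (one minus the similarity) is an assumption made for illustration; the disclosure does not state how the error is derived from the similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def fingerprint_error(a, b):
    """Assumed error measure: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(a, b)
```

With this convention, a small error means the extracted fingerprint points in nearly the same direction as the registered one, which is what the threshold comparison needs.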

(2)

Equation (2) above is an equation for calculating the cosine similarity when there are a plurality of previously registered voice fingerprints. For example, if the voice fingerprint representing the user is Xin, the number of pre-registered voice fingerprints is N, and the pre-registered voice fingerprints are Xq, the electronic device 100 may calculate the similarity using equation (2) and calculate the error from it.
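One possible reading of the multi-fingerprint case, consistent with the earlier statement that the registered fingerprint may be average data, is to compare the input fingerprint with the element-wise mean of the N registered fingerprints. The sketch below follows that reading; it is an assumption and is not presented as the exact form of equation (2). The function names are hypothetical.

```python
import math


def mean_fingerprint(registered):
    """Element-wise average of N registered fingerprint vectors."""
    n = len(registered)
    return [sum(column) / n for column in zip(*registered)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_to_registered(x_in, registered):
    """Assumed reading: similarity between x_in and the mean registered fingerprint."""
    return cosine_similarity(x_in, mean_fingerprint(registered))
```

An alternative reading would average the N individual similarities instead of averaging the vectors first; the two generally give different values.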

FIG. 4B is a diagram illustrating a UI displayed when the user who has spoken is recognized, according to an embodiment of the present disclosure. As illustrated in FIG. 4B, when recognizing the user who has spoken, the electronic device 100 may control the display 120 to display a UI including a message indicating that the user has been recognized and user information. The user information may be the user's name, a picture of the user taken through the camera 170, and the like, but this is only an example.

FIG. 4C is a diagram illustrating a process in which the electronic device 100 additionally stores a voice fingerprint when voice fingerprints are stored in excess of a preset number, according to an embodiment of the present disclosure. If the error calculated by comparing the voice fingerprint 410 representing the user with the previously registered voice fingerprints 420, 430, and 440 does not exceed a preset value, the electronic device 100 may store the voice fingerprint 410 representing the user in the memory 130. In addition, when the number of voice fingerprints representing the user stored in the memory 130 exceeds a preset number, the electronic device 100 may delete the last stored voice fingerprint 430 representing the user. In preparation for a case in which an incorrect voice fingerprint is stored, the electronic device 100 may delete the last stored voice fingerprint 430 while keeping the first registered voice fingerprint 420. In FIG. 4C, it is assumed that the preset number is three, but this is only an example, and the preset number may be changed according to the type of the electronic device 100 or a user setting.

In addition, the electronic device 100 may change the threshold value in response to the newly stored voice fingerprint 410 representing the user. Specifically, when the frequency distribution of the energy of the voice corresponding to the newly stored voice fingerprint 410 is different from the frequency distribution of the energy of the voice corresponding to the previously registered voice fingerprints 420, 430, and 440, the electronic device 100 may change the threshold value to correspond to the difference.

FIG. 5 is a flowchart illustrating a process of comparing and storing voice fingerprints according to an embodiment of the present disclosure.

First, the electronic device 100 may calculate an error by comparing a voice fingerprint representing a user with a previously registered voice fingerprint (S510). Specifically, the electronic device 100 may calculate an error by comparing the previously registered voice fingerprint, which is the average data of the first registered voice fingerprint and the voice fingerprints stored later, with the voice fingerprint representing the user. When the calculated error exceeds the threshold value, the electronic device 100 may fail to recognize the user (S525). When the calculated error does not exceed the threshold value, the electronic device 100 may recognize the user who uttered the input user voice (S530). Further, the electronic device 100 may control the display 120 to display a UI including information on the recognized user and a message indicating that the recognition has been completed. Also, the electronic device 100 may determine whether the calculated error exceeds a preset value that is smaller than the threshold value (S540). When it is determined that the calculated error does not exceed the preset value, the electronic device 100 may store the voice fingerprint representing the user in the memory 130 (S550). In addition, the electronic device 100 may determine whether more than a preset number of voice fingerprints are stored in the memory 130 (S560). The preset number may vary according to the type of the electronic device 100 and may be determined by a user. When it is determined that voice fingerprints are stored in the memory 130 in excess of the preset number, the electronic device 100 may delete the last stored voice fingerprint (S570). In particular, the electronic device 100 may delete the last stored voice fingerprint while keeping the first registered voice fingerprint. In addition, the electronic device 100 may change the threshold value in response to the newly stored voice fingerprint (S580). Specifically, when the energy frequency distribution of the voice corresponding to the newly stored voice fingerprint is different from the energy frequency distribution of the voice corresponding to the previously registered voice fingerprint, the electronic device 100 may change the threshold value to correspond to the difference.
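The flow of steps S510 to S570 can be sketched as follows. This is an illustrative sketch under several assumptions: the class and field names are hypothetical, the error is taken as one minus the cosine similarity, the numeric defaults are arbitrary, and the entry deleted in S570 is taken to be the oldest stored entry other than the first registered one (the text's "last stored" could be read differently). Step S580 is omitted because the text gives no concrete rule for adjusting the threshold.

```python
import math
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def cosine_error(a: Vector, b: Vector) -> float:
    """Assumed error measure: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


@dataclass
class FingerprintStore:
    first_registered: Vector
    threshold: float = 0.3      # hypothetical recognition threshold
    preset_value: float = 0.1   # hypothetical storing limit, smaller than threshold
    preset_number: int = 3      # maximum count, including the first registered one
    stored: List[Vector] = field(default_factory=list)

    def reference(self) -> Vector:
        """Average of the first registered and the later stored fingerprints."""
        return mean_vector([self.first_registered] + self.stored)

    def process(self, fingerprint: Vector) -> bool:
        error = cosine_error(fingerprint, self.reference())   # S510
        if error > self.threshold:
            return False                                       # S525: not recognized
        if error <= self.preset_value:                         # S540
            self.stored.append(fingerprint)                    # S550
            if 1 + len(self.stored) > self.preset_number:      # S560
                self.stored.pop(0)                             # S570: never the first registered
        return True                                            # S530: recognized
```

Keeping `first_registered` outside the `stored` list is one simple way to guarantee that the eviction step can never remove it.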

FIG. 6 is a block diagram illustrating a process of recognizing a user's voice command according to an embodiment of the present disclosure. After the user is recognized through the voice fingerprint, the electronic device 100 may receive the recognized user's voice command through the microphone 110 (610). The electronic device 100 may input the user's voice command to the first artificial intelligence model 70, which includes a plurality of neural networks. The first artificial intelligence model 70 may obtain control information corresponding to the user's voice command by using a general neural network trained on the voice commands of a plurality of users and a personal neural network trained on the voice commands of the recognized user.

Specifically, the electronic device 100 may input the inputted user's voice command to the general neural network 620 of the first artificial intelligence model. The general neural network 620 is a neural network learned based on data stored in a plurality of user databases. In the plurality of user databases, information on a plurality of users, voice commands previously input by the plurality of users, and data on functions performed according to the voice commands are stored. Accordingly, the electronic device 100 may obtain information corresponding to the voice command by inputting a user voice command to the general neural network 620 learned based on data stored in a plurality of user databases.

As an example, the general neural network 620 may extract feature points of an input user voice command. The feature points may be in the form of a vector, but this is only an example and they may be stored in various forms. Further, the general neural network 620 may extract control information for the voice command by comparing the extracted feature points of the user voice command with data stored in the plurality of user databases.

Further, the electronic device 100 may input the result output from the general neural network 620 and the recognized user's database 640 into the personal neural network 630 of the first artificial intelligence model 70. The personal neural network 630 is a neural network that is additionally trained based on data stored in the user's database 640. The user's database 640 stores user information and data on voice commands previously input by the user and the functions performed according to those voice commands. Accordingly, the personal neural network 630 may acquire information on a voice command based on the data stored in the user's database 640 and a newly input user voice command, and may be trained further. In addition, the personal neural network 630 may output control information for the user's voice command by connecting the result input from the general neural network 620 with the information it has acquired itself. That is, the electronic device 100 may obtain information on the user voice command based on the results output from the general neural network 620 and the personal neural network 630 (650).
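The data flow from 610 to 650 can be illustrated with the sketch below. The two functions are simple stand-ins for the general neural network 620 and the personal neural network 630, not neural networks; the command strings, action names, scores, and the fixed bonus of 0.5 are all hypothetical values chosen to show how a user-specific stage can revise a result learned from many users.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]


def general_network(command: str) -> Scores:
    """Stand-in for general network 620: scores learned from many users' data."""
    table = {
        "play music": {"open_music_app": 0.7, "open_radio_app": 0.3},
    }
    return table.get(command, {"unknown": 1.0})


def personal_network(command: str, general_out: Scores,
                     user_db: List[Tuple[str, str]]) -> Scores:
    """Stand-in for personal network 630: adjust scores with the user's history."""
    adjusted = dict(general_out)
    for past_command, past_action in user_db:
        if past_command == command:
            # Hypothetical rule: boost actions this user chose before.
            adjusted[past_action] = adjusted.get(past_action, 0.0) + 0.5
    return adjusted


def control_information(command: str, user_db: List[Tuple[str, str]]) -> str:
    """Run both stages and return the highest-scoring control action (650)."""
    general_out = general_network(command)
    final_scores = personal_network(command, general_out, user_db)
    return max(final_scores, key=final_scores.get)
```

For a user with no history the general result is returned unchanged, while a user whose database 640 records a different past action for the same command gets that action instead.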

FIG. 7 is a block diagram illustrating a process of recognizing a user's voice command according to an embodiment of the present disclosure.

The personal neural network 630 may include a neural layer 710 and a connection layer 720. The neural layer 710 is learned based on data stored in the user database 640 and a newly input voice command, and may output control information for the voice command. Information on a user voice command newly output from the neural layer 710 may be stored in the user database 640. Further, the neural layer 710 may additionally output information on a voice command based on a result output from the general neural network 620. The neural layer 710 may receive the final result output from the general neural network, but this is only an example, and the intermediate result output from the general neural network 620 may be input.

In an embodiment, when the first artificial intelligence model 70 is a convolutional neural network (CNN) model, the neural layer 710 may include a convolutional layer and a pooling layer. The convolutional layer may output a feature map by applying a filter that extracts features of the voice command. The pooling layer may reduce the size of the feature map output from the convolutional layer. Specifically, the feature map is composed of a plurality of submaps, and values corresponding to voice features are stored in each submap. The pooling layer can reduce the size of the feature map by keeping only the largest value or the average value among the values corresponding to the voice features stored in each submap. Accordingly, the neural layer 710 may output information on the voice command by passing the result output from the general neural network through the convolutional layer and the pooling layer.
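A one-dimensional toy version of the two layers is sketched below: a filter slides over the input to produce a feature map, and pooling keeps the largest or the average value of each submap. The function names, the filter values, and the submap size are assumptions for illustration; a real CNN layer would also have learned weights, multiple channels, and a nonlinearity.

```python
def convolve1d(signal, kernel):
    """Slide the filter over the signal (cross-correlation, as in CNN layers)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def pool(feature_map, size, mode="max"):
    """Reduce the feature map by keeping one value per submap of `size` entries."""
    submaps = [feature_map[i:i + size] for i in range(0, len(feature_map), size)]
    if mode == "max":
        return [max(s) for s in submaps]
    if mode == "avg":
        return [sum(s) / len(s) for s in submaps]
    raise ValueError("mode must be 'max' or 'avg'")


feature_map = convolve1d([1, 3, 2, 5, 4, 6], [1, 1])   # [4, 5, 7, 9, 10]
max_pooled = pool(feature_map, 2, "max")                # [5, 9, 10]
avg_pooled = pool(feature_map, 2, "avg")                # [4.5, 8.0, 10.0]
```

Both pooling modes shrink the five-entry feature map to three entries; max pooling keeps the strongest response in each submap, while average pooling smooths it.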

In addition, the connection layer 720 may connect the result output from the general neural network 620 and the result output from the neural layer 710 to output final control information for the voice command. Specifically, the connection layer 720 may output control information for the user's voice command, in a form that can be recognized by the electronic device 100, by integrating the information on the voice command output from the general neural network 620 and the information on the voice command output from the neural layer 710. Accordingly, the electronic device 100 may finally obtain the control information corresponding to the voice command output from the connection layer 720.
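One common way to "connect" two outputs is to concatenate them and apply a fully connected (linear) layer. The sketch below assumes that interpretation; the disclosure does not specify the internal structure of the connection layer 720, and the weights, bias, and action labels here are hypothetical.

```python
def connection_layer(general_out, neural_out, weights, bias):
    """Concatenate both outputs, then apply one linear layer (assumed structure)."""
    joined = list(general_out) + list(neural_out)
    return [
        sum(w * x for w, x in zip(row, joined)) + b
        for row, b in zip(weights, bias)
    ]


def pick_action(scores, actions):
    """Return the action label with the highest final score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return actions[best]


general_out = [0.25, 0.75]           # hypothetical output of general network 620
neural_out = [0.5, 1.0]              # hypothetical output of neural layer 710
weights = [[1, 0, 1, 0],             # action 0 sums the first entry of each output
           [0, 1, 0, 1]]             # action 1 sums the second entry of each output
final_scores = connection_layer(general_out, neural_out, weights, [0.0, 0.0])
action = pick_action(final_scores, ["show_weather", "play_music"])
```

In a trained model the weights would be learned, so the layer could weigh the personal result more or less heavily than the general one.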

FIG. 8 is a block diagram illustrating a configuration of an electronic device for learning and using an artificial intelligence model according to an embodiment of the present disclosure.

Referring to FIG. 8, the processor 800 may include at least one of a learning unit 810 and an acquisition unit 820. The processor 800 of FIG. 8 may correspond to the processor 140 of the electronic device 100 of FIGS. 2A and 2B or a processor of a data learning server (not shown).

The learning unit 810 may generate or train a first model for obtaining control information corresponding to a user's voice command and a second model for extracting a voice fingerprint representing the user. The learning unit 810 may generate a trained model having a recognition criterion using the collected training data.

For example, the learning unit 810 may generate, learn, or update a first model for acquiring control information corresponding to the user voice command by using the user voice command as input data. In addition, the learning unit 810 may generate, learn, or update a second model for extracting a voice fingerprint representing a user by using a feature vector, a feature matrix, and a preprocessed voice as input data. Meanwhile, according to another embodiment of the present invention, the first model and the second model may be implemented as an integrated model. That is, the integrated model may extract information on the user's voice command and a voice fingerprint representing the user by using the user's voice command as input data.

The acquisition unit 820 may acquire various types of information by using predetermined data as input data of the trained model.

For example, the acquisition unit 820 may acquire (or recognize, estimate) information on the user's voice command by using the user's voice command as input data of the learned first model. In addition, the acquisition unit 820 may acquire (or estimate, infer, or recognize) a voice fingerprint representing the user by using the feature vector, the feature matrix, and the preprocessed voice of the user voice as input data of the learned second model.

At least a portion of the learning unit 810 and at least a portion of the acquisition unit 820 may be implemented as a software module or manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the learning unit 810 and the acquisition unit 820 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU), and mounted on the aforementioned various electronic devices. Here, the dedicated hardware chip for artificial intelligence is a dedicated processor specialized in probability calculation, and has higher parallel processing performance than conventional general-purpose processors, so it can quickly process computation tasks in artificial intelligence fields such as machine learning. When the learning unit 810 and the acquisition unit 820 are implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer readable medium. In this case, the software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the software modules may be provided by the OS, and the others may be provided by a predetermined application.

Meanwhile, the learning unit 810 and the acquisition unit 820 may be mounted on one electronic device or may be mounted on separate electronic devices, respectively. For example, one of the learning unit 810 and the acquisition unit 820 may be included in the electronic device 100 and the other may be included in an external server. In addition, through a wired or wireless connection, model information built by the learning unit 810 may be provided to the acquisition unit 820, and data input to the acquisition unit 820 may be provided to the learning unit 810 as additional learning data.

FIG. 9 is a block diagram showing a detailed configuration of a learning unit and an acquisition unit according to an embodiment of the present disclosure.

Referring to FIG. 9A, the learning unit 810 according to some embodiments may include a learning data acquisition unit 810-1 and a model learning unit 810-4. In addition, the learning unit 810 may selectively further include at least one of a training data preprocessor 810-2, a training data selection unit 810-3, and a model evaluation unit 810-5.

The training data acquisition unit 810-1 may acquire training data required for the first model and the second model. In an embodiment of the present disclosure, the training data acquisition unit 810-1 may acquire a user voice command, a feature vector of a voice, a feature matrix, and a preprocessed voice as training data. The training data may be data collected or tested by the learning unit 810 or the manufacturer of the learning unit 810.

The model learning unit 810-4 may use the training data to train the model to have a criterion for how to obtain control information for a user's voice command and a voice fingerprint representing the user. For example, the model learning unit 810-4 may train an artificial intelligence model through supervised learning, using at least a portion of the training data as a criterion for determination. Alternatively, the model learning unit 810-4 may train the artificial intelligence model through unsupervised learning, in which the model learns by itself from the training data without special guidance and thereby finds a criterion for determining the situation. In addition, the model learning unit 810-4 may train the artificial intelligence model through reinforcement learning, for example, using feedback on whether a result of situation determination according to learning is correct. In addition, the model learning unit 810-4 may train the artificial intelligence model using, for example, a learning algorithm including error back-propagation or gradient descent.
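As a minimal worked example of supervised learning with gradient descent, the sketch below fits a single weight w so that w * x approximates the labels y by repeatedly stepping against the gradient of the mean squared error. The one-parameter model, the data, the learning rate, and the step count are assumptions chosen for illustration and are unrelated to the actual models of the disclosure.

```python
def train_linear(xs, ys, learning_rate=0.05, steps=200):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of (1/n) * sum((w*x - y)^2) is (2/n) * sum((w*x - y) * x)
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= learning_rate * grad
    return w


# Labeled training pairs generated by y = 2x; the learned weight approaches 2.
learned_w = train_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a multi-layer network, error back-propagation computes the same kind of gradient for every weight, and the update rule stays the same.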

When there are a plurality of pre-built artificial intelligence models, the model learning unit 810-4 may determine an artificial intelligence model having a high correlation between the input training data and the basic training data as an artificial intelligence model to be trained. In this case, the basic training data may be pre-classified by data type, and the artificial intelligence model may be pre-built for each data type. For example, the basic learning data may be classified according to various criteria such as an area in which the training data is generated, a time when the training data is generated, a size of the training data, a genre of the training data, and a creator of the training data.

When the artificial intelligence model is trained, the model learning unit 810-4 may store the learned artificial intelligence model. In this case, the model learning unit 810-4 may store the learned artificial intelligence model in the memory 130 of the electronic device 100. Alternatively, the model learning unit 810-4 may store the learned artificial intelligence model in a memory of a server connected to the electronic device 100 through a wired or wireless network.

The learning unit 810 may further include a training data preprocessing unit 810-2 and a training data selection unit 810-3 in order to improve the recognition result of the artificial intelligence model or to save the resources or time required for generating the artificial intelligence model.

The training data preprocessing unit 810-2 may preprocess the acquired data so that the acquired data can be used for learning for acquiring control information corresponding to a voice command and extracting a voice fingerprint. The training data preprocessing unit 810-2 may process the acquired data into a preset format so that the model learning unit 810-4 can use the acquired data for learning for acquiring control information corresponding to a voice command and extracting a voice fingerprint.

The learning data selection unit 810-3 may select data necessary for learning from data acquired by the learning data acquisition unit 810-1 or data preprocessed by the learning data preprocessor 810-2. The selected training data may be provided to the model learning unit 810-4. The learning data selection unit 810-3 may select learning data required for learning from acquired or preprocessed data according to a preset selection criterion. In addition, the training data selection unit 810-3 may select training data according to a preset selection criterion by learning by the model learning unit 810-4.

The learning unit 810 may further include a model evaluation unit 810-5 in order to improve the recognition result of the artificial intelligence model.

The model evaluation unit 810-5 may input evaluation data to the artificial intelligence model and, when the recognition result output for the evaluation data does not satisfy a predetermined criterion, may cause the model learning unit 810-4 to train the model again. In this case, the evaluation data may be predefined data for evaluating an artificial intelligence model.

For example, the model evaluation unit 810-5 may evaluate that the predetermined criterion is not satisfied when, among the recognition results of the trained artificial intelligence model for the evaluation data, the number or ratio of evaluation data whose recognition result is not accurate exceeds a preset threshold.
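The ratio-based criterion can be sketched as below. The function name, the label strings, and the 20% limit in the usage lines are hypothetical; the disclosure only states that the criterion fails when the number or ratio of inaccurate results exceeds a preset threshold.

```python
def needs_retraining(predictions, expected, max_error_ratio):
    """Return True when the share of inaccurate results exceeds the preset limit."""
    if len(predictions) != len(expected) or not expected:
        raise ValueError("predictions and expected must be non-empty and equal in length")
    wrong = sum(1 for p, e in zip(predictions, expected) if p != e)
    return wrong / len(expected) > max_error_ratio


# One of four evaluation items is wrong: ratio 0.25 exceeds a limit of 0.2.
retrain = needs_retraining(
    ["weather", "music", "alarm", "music"],
    ["weather", "music", "alarm", "call"],
    max_error_ratio=0.2,
)
```

When `retrain` is true, the evaluation unit would hand the model back to the model learning unit for further training.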

On the other hand, when there are a plurality of trained artificial intelligence models, the model evaluation unit 810-5 may evaluate whether each of the trained artificial intelligence models satisfies the predetermined criterion, and may determine a model that satisfies the predetermined criterion as the final artificial intelligence model. In this case, when there are a plurality of models that satisfy the predetermined criterion, the model evaluation unit 810-5 may determine any one model, or a preset number of models in descending order of evaluation score, as the final artificial intelligence model.
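Selecting the final model or models among several candidates can be sketched as a filter followed by a sort. The model names, the scores, and the minimum score used as the criterion are hypothetical values for illustration.

```python
def select_final_models(scores, min_score, count=1):
    """Keep models meeting the criterion, then take the top `count` by score."""
    passing = {name: score for name, score in scores.items() if score >= min_score}
    ranked = sorted(passing, key=passing.get, reverse=True)
    return ranked[:count]


evaluation_scores = {"model_a": 0.91, "model_b": 0.78, "model_c": 0.95}
best_one = select_final_models(evaluation_scores, min_score=0.8)            # ['model_c']
best_two = select_final_models(evaluation_scores, min_score=0.8, count=2)   # ['model_c', 'model_a']
```

An empty result means that no candidate met the criterion, in which case further training would be needed.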

Referring to FIG. 9B, the acquisition unit 820 according to some embodiments may include an input data acquisition unit 820-1 and a providing unit 820-4.

In addition, the acquisition unit 820 may selectively further include at least one of the input data preprocessor 820-2, the input data selection unit 820-3, and the model update unit 820-5.

The input data acquisition unit 820-1 may acquire data necessary for acquiring control information corresponding to a user's voice command and for extracting a voice fingerprint representing the user. The providing unit 820-4 may apply the input data acquired by the input data acquisition unit 820-1 to the learned artificial intelligence model as an input value to obtain control information corresponding to the user's voice command and to extract a voice fingerprint representing the user. The providing unit 820-4 may obtain a recognition result by applying data selected by the input data preprocessor 820-2 or the input data selection unit 820-3, described later, to the artificial intelligence model as an input value. The recognition result may be determined by the artificial intelligence model.

In an embodiment, the providing unit 820-4 may obtain (or estimate) control information corresponding to the user's voice command by applying the user's voice command acquired by the input data acquisition unit 820-1 to the learned first model.

As another example, the providing unit 820-4 may obtain (or estimate) a voice fingerprint representing the user by applying a feature vector, a feature matrix, and a preprocessed voice of the user's voice acquired by the input data acquisition unit 820-1 to the learned second model.

The acquisition unit 820 may further include the input data preprocessor 820-2 and the input data selection unit 820-3 in order to improve the recognition result of the artificial intelligence model or to save resources or time for providing the recognition result.

The input data preprocessor 820-2 may preprocess the acquired data so that it can be input to the first and second models. The input data preprocessor 820-2 may process the acquired data into a predefined format so that the providing unit 820-4 can use it to obtain control information corresponding to the user's voice command and to extract a voice fingerprint representing the user.

The input data selection unit 820-3 may select data necessary for situation determination from the data acquired by the input data acquisition unit 820-1 or the data preprocessed by the input data preprocessor 820-2. The selected data may be provided to the providing unit 820-4. The input data selection unit 820-3 may select some or all of the acquired or preprocessed data according to a preset selection criterion for situation determination. The input data selection unit 820-3 may also select data according to a selection criterion established through learning by the model learning unit 810-4.
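
The flow from preprocessing through selection to the providing unit can be pictured with a minimal sketch. Everything in it is an assumption for illustration: peak normalization stands in for the "predefined format", a fixed amplitude level stands in for the "preset selection criterion", and `model` is any callable standing in for the learned artificial intelligence model.

```python
def preprocess(samples):
    """Scale raw amplitude samples into [-1, 1] (hypothetical predefined format)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


def select(samples, min_level=0.1):
    """Keep samples at or above a preset level (hypothetical selection criterion)."""
    return [s for s in samples if abs(s) >= min_level]


def provide(samples, model):
    """Apply the preprocessed and selected data to the model as an input value."""
    return model(select(preprocess(samples)))


# `len` is a placeholder model that only reports how many samples it received.
result = provide([0, 5, -10, 0.5], model=len)
```

With these inputs the normalized samples are 0.0, 0.5, -1.0, and 0.05, so only two of them pass the assumed selection level.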

The model update unit 820-5 may control the artificial intelligence model to be updated based on an evaluation of the recognition result provided by the providing unit 820-4. For example, the model update unit 820-5 may provide the recognition result provided by the providing unit 820-4 to the model learning unit 810-4, thereby requesting the model learning unit 810-4 to further train or update the artificial intelligence model.

FIG. 10 is a diagram for explaining a UI for checking whether recognized voice command information is correct according to an embodiment of the present disclosure. The electronic device 100 may perform a function corresponding to the control information acquired for the user's voice command. In an embodiment, when a voice command asking for today's weather is input through the microphone 110, the electronic device 100 may acquire information about today's weather and control the display 120 to display a UI (User Interface) including that information.

In addition, the electronic device 100 may control the display 120 to display a UI 1010 for checking whether the acquired information corresponds to the user's voice command. As shown in FIG. 10, in an embodiment, the electronic device 100 may display information about today's weather together with the UI 1010 for checking whether the weather information is the information requested by the user.

If information indicating that the acquired control information does not match the user's voice command is input through the UI 1010, the electronic device 100 may input the acquired control information and the user's voice command to the personal neural network. Further, the electronic device 100 may additionally train the personal neural network based on the acquired control information and the user's voice command. Accordingly, the electronic device 100 may obtain correct information for the voice command.
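
One way to picture the feedback path is a buffer that keeps only the pairs the user flagged as mismatches, to be handed later to the personal neural network. This is a sketch under stated assumptions: the class `PersonalFeedbackBuffer`, its `record` method, and the example commands are hypothetical, and the disclosure does not specify how the flagged pair is turned into a training signal, so no training step is shown.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalFeedbackBuffer:
    """Collects (voice command, control information) pairs flagged as mismatches."""

    samples: list = field(default_factory=list)

    def record(self, voice_command, control_info, user_confirmed):
        # Only a "does not match" answer from the UI produces a sample
        # for additional training of the personal neural network.
        if not user_confirmed:
            self.samples.append((voice_command, control_info))
        return len(self.samples)


buffer = PersonalFeedbackBuffer()
buffer.record("what's the weather today", "show_weather", user_confirmed=True)
pending = buffer.record("play my morning list", "show_weather", user_confirmed=False)
```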

When a user voice is input, the electronic device 100 may extract a voice fingerprint representing the user based on the input voice (S1110). In particular, the user voice may be a trigger voice for initiating voice fingerprint extraction. The electronic device 100 may extract the voice fingerprint representing the user by inputting the input voice into the second artificial intelligence model 20 trained to obtain a voice fingerprint. The electronic device 100 may recognize the user who uttered the input voice by comparing the voice fingerprint representing the user with a previously registered voice fingerprint (S1120). Specifically, the electronic device 100 may calculate an error by comparing the voice fingerprint representing the user with the previously registered voice fingerprint, which is average data of the initially registered voice fingerprint and the stored voice fingerprints. When the calculated error does not exceed a threshold value, the electronic device 100 may recognize the user who uttered the input voice.
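
The comparison in steps S1110 to S1120 might be sketched as follows. The disclosure does not state how the error is computed, so Euclidean distance is used here purely as an assumption, as are the function names, the per-user dictionary, and the example vectors. The voice fingerprint itself is assumed to be a fixed-length list of numbers produced elsewhere.

```python
import math


def average_fingerprint(fingerprints):
    """Element-wise average of the initially registered and stored fingerprints."""
    n = len(fingerprints)
    return [sum(column) / n for column in zip(*fingerprints)]


def fingerprint_error(a, b):
    """Euclidean distance between two fingerprints (assumed error measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(candidate, registered, threshold):
    """Return the best-matching user, or None if the error exceeds the threshold.

    registered: mapping of user name to a list of fingerprints for that user.
    """
    best_user, best_error = None, math.inf
    for user, prints in registered.items():
        error = fingerprint_error(candidate, average_fingerprint(prints))
        if error < best_error:
            best_user, best_error = user, error
    # The user is recognized only when the error does not exceed the threshold.
    return best_user if best_error <= threshold else None


registered = {
    "alice": [[1.0, 0.0], [0.8, 0.2]],
    "bob": [[0.0, 1.0]],
}
matched = recognize([0.9, 0.1], registered, threshold=0.3)
unmatched = recognize([5.0, 5.0], registered, threshold=0.3)
```

In this example the first candidate lies close to the averaged fingerprint for "alice", while the second is far from every registered fingerprint and is not recognized.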

In addition, after the user is recognized, when the recognized user's voice command is input, the electronic device 100 may acquire control information corresponding to the user's voice command by inputting the voice command into the first artificial intelligence model 70 including a plurality of neural networks (S1130). The first artificial intelligence model 70 may acquire the control information by using a general neural network trained on voice commands of a plurality of users and a personal neural network trained on the voice commands of the recognized user. In addition, the electronic device 100 may perform a function corresponding to the acquired control information (S1140).
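
The chaining of the general and personal neural networks in step S1130 can be illustrated with a toy forward pass, following the neural layer and connection layer arrangement recited in the claims. This is a minimal sketch, not the disclosed implementation: the layer sizes, the ReLU activation, the reading of "connects" as concatenation followed by a linear layer, and all weight values are assumptions chosen so the arithmetic is easy to follow.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def general_network(features, params):
    """Stand-in for the network trained on data from a plurality of users."""
    return relu(linear(params["w"], params["b"], features))


def personal_network(general_out, params):
    """Stand-in for the network additionally trained on the recognized user's data."""
    # Neural layer: consumes the general network's output.
    neural_out = relu(linear(params["neural_w"], params["neural_b"], general_out))
    # Connection layer: joins the general output with the neural layer output.
    combined = general_out + neural_out
    return linear(params["conn_w"], params["conn_b"], combined)


general_params = {"w": [[1, 0], [0, 1]], "b": [0, 0]}
personal_params = {
    "neural_w": [[1, 1]],
    "neural_b": [0],
    "conn_w": [[1, 0, 0], [0, 0, 1]],
    "conn_b": [0, 0],
}
scores = personal_network(general_network([1.0, 2.0], general_params), personal_params)
```

With these toy weights the general network passes the features through unchanged, the neural layer sums them, and the connection layer outputs one value from each source.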

Various embodiments of the present disclosure may be implemented as software including instructions stored in a storage medium readable by a machine (e.g., a computer). The machine is a device capable of calling the stored instructions from the storage medium and operating according to the called instructions, and may include an electronic device (e.g., the electronic device 100) according to the disclosed embodiments. When an instruction is executed by a processor, the processor may perform a function corresponding to the instruction directly or by using other components under the control of the processor. The instruction may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' means that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently or temporarily in the storage medium.

According to an embodiment, a method according to the various embodiments disclosed in this document may be provided as part of a computer program product. Computer program products can be traded between sellers and buyers as commodities. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily generated in a storage medium such as a server of a manufacturer, a server of an application store, or a memory of a relay server.

Each of the constituent elements (e.g., a module or a program) according to various embodiments may be composed of a single entity or a plurality of entities, and some of the aforementioned sub-elements may be omitted, or other sub-elements may be further included in various embodiments. Alternatively or additionally, some constituent elements (e.g., a module or a program) may be integrated into one entity, which may perform the functions performed by each corresponding constituent element prior to the integration identically or similarly. Operations performed by modules, programs, or other components according to various embodiments may be executed sequentially, in parallel, repetitively, or heuristically, or at least some operations may be executed in a different order or omitted, or other operations may be added.

Claims

An electronic device comprising:

a microphone;

a memory storing at least one instruction; and

a processor configured to execute the at least one instruction,

wherein the processor, by executing the at least one instruction:

recognizes a user who uttered a voice input through the microphone,

after the user is recognized, when a voice command of the recognized user is input through the microphone, inputs the user's voice command to a first neural network of a first artificial intelligence model,

inputs a result output from the first neural network to a second neural network of the first artificial intelligence model,

acquires control information corresponding to the user's voice command based on results output from the first neural network and the second neural network, and

performs a function corresponding to the acquired control information,

wherein the first neural network is a neural network trained based on data stored in a plurality of user databases, and

the second neural network is a neural network additionally trained based on data stored in a database of the recognized user.
The electronic device of claim 1,

wherein the processor:

extracts a voice fingerprint representing the user based on the voice input through the microphone, and

recognizes the user who uttered the input voice by comparing the voice fingerprint representing the user with a previously registered voice fingerprint.
The electronic device of claim 2,

wherein the processor:

preprocesses the input voice, and

extracts the voice fingerprint representing the user by inputting the preprocessed voice into a second artificial intelligence model trained to obtain the voice fingerprint,

wherein the voice fingerprint representing the user is data obtained by analyzing a frequency distribution of energy of the user's voice.
The electronic device of claim 2,

wherein the input voice is a trigger voice for initiating extraction of the voice fingerprint representing the user.
The electronic device of claim 2,

wherein the processor:

calculates an error by comparing the voice fingerprint representing the user with the previously registered voice fingerprint,

recognizes the user who uttered the input voice when the calculated error does not exceed a threshold value,

stores the voice fingerprint representing the user in the memory when the error does not exceed a preset value, and

deletes the last stored voice fingerprint representing the user when the number of voice fingerprints representing the user stored in the memory exceeds a preset number.
The electronic device of claim 5,

wherein the previously registered voice fingerprint is average data of an initially registered voice fingerprint representing the user and the stored voice fingerprints representing the user.
The electronic device of claim 5,

wherein the threshold value is changed in response to a voice fingerprint representing the user being newly stored.
The electronic device of claim 1,

wherein the second neural network includes a neural layer and a connection layer,

the neural layer receives the result output from the first neural network and outputs information on the voice command, and

the connection layer connects the result output from the first neural network and a result output from the neural layer to output final control information on the voice command.
The electronic device of claim 1,

wherein the processor controls a display to display a UI for checking whether the acquired control information corresponds to information about the user's voice command.
The electronic device of claim 9,

wherein, when there is a user input indicating that the acquired control information does not match the information about the user's voice command, the processor trains the second neural network by inputting the acquired control information and the user's voice command into the second neural network.
A control method of an electronic device, the method comprising:

recognizing, when a user voice is input, a user who uttered the input user voice;

inputting, when a voice command of the recognized user is input after the user is recognized, the user's voice command into a first neural network of a first artificial intelligence model;

inputting a result output from the first neural network to a second neural network of the first artificial intelligence model;

acquiring control information corresponding to the user's voice command based on results output from the first neural network and the second neural network; and

performing a function corresponding to the acquired control information,

wherein the first neural network is a neural network trained based on data stored in a plurality of user databases, and

the second neural network is a neural network additionally trained based on data stored in a database of the recognized user.
The method of claim 11,

wherein the recognizing comprises:

extracting a voice fingerprint representing the user based on the input voice; and

recognizing the user who uttered the input user voice by comparing the voice fingerprint representing the user with a previously registered voice fingerprint.
The method of claim 11,

wherein the recognizing comprises:

preprocessing the input voice; and

extracting a voice fingerprint representing the user by inputting the preprocessed voice into a second artificial intelligence model trained to obtain the voice fingerprint,

wherein the voice fingerprint representing the user is data obtained by analyzing a frequency distribution of energy of the user voice.
The method of claim 12,

wherein the input voice is a trigger voice for initiating extraction of the voice fingerprint representing the user.
The method of claim 12,

Calculating an error by comparing the voice fingerprint representing the user with the previously registered voice fingerprint;

Recognizing a user who uttered the input voice when the calculated error does not exceed a threshold value;

Storing a voice fingerprint representing the user when the error does not exceed a preset value; And

Deleting the last stored voice fingerprint representing the user when the number of stored voice fingerprints representing the user exceeds a preset number.