WO2019212221A1

WO2019212221A1 - Voice input authentication device and method

Info

Publication number: WO2019212221A1
Application number: PCT/KR2019/005174
Authority: WO
Inventors: 허준호; 김형식; 이자즈 아메드무함마드; 곽일엽; 김일주; 제상준
Original assignee: 삼성전자 주식회사
Priority date: 2018-05-04
Filing date: 2019-04-30
Publication date: 2019-11-07

Abstract

Disclosed are a method for authenticating a voice input provided by a user and a method for detecting a voice input that is likely to be an attack. The voice input authentication method comprises the steps of: receiving the voice input; obtaining, from the voice input, signal characteristic data indicating signal characteristics of the voice input; and authenticating the voice input by applying the obtained signal characteristic data to a first learning model for determining an attribute of the voice input, wherein the first learning model is trained to determine the attribute of the voice input on the basis of a voice spoken by a person and a voice output from a device.

Description

Voice input authentication device and method

The present disclosure relates to a device and a method for authenticating a voice input, and more particularly, to a device and a method for authenticating a voice input based on signal characteristics of the voice input.

An artificial intelligence (AI) system is a computer system that implements human-level intelligence. Unlike a conventional rule-based smart system, an AI system is one in which the machine learns, makes judgments, and becomes smarter on its own. The more an AI system is used, the more its recognition rate improves and the more accurately it understands the user's preferences. Existing rule-based smart systems are therefore gradually being replaced by deep learning-based AI systems.

AI technology consists of machine learning (deep learning) and element technologies that utilize machine learning. Machine learning is an algorithmic technology that classifies and learns the characteristics of input data by itself. Element technologies simulate functions of the human brain, such as cognition and judgment, by using machine learning algorithms such as deep learning, and cover technical areas such as linguistic understanding, visual understanding, inference/prediction, knowledge representation, and motion control.

The various fields in which artificial intelligence technology is applied are as follows. Linguistic understanding is a technology for recognizing and applying/processing human language and characters, and includes natural language processing, machine translation, dialogue systems, question answering, speech recognition/synthesis, and the like. Visual understanding is a technology for recognizing and processing objects in the manner of human vision, and includes object recognition, object tracking, image retrieval, person recognition, scene understanding, spatial understanding, and image enhancement. Inference/prediction is a technology for judging information and logically inferring and predicting from it, and includes knowledge/probability-based inference, optimization prediction, preference-based planning, and recommendation. Knowledge representation is a technology that automatically processes human experience information into knowledge data, and includes knowledge construction (data generation/classification) and knowledge management (data utilization). Motion control is a technology for controlling the autonomous driving of a vehicle and the movement of a robot, and includes movement control (navigation, collision, driving), operation control (action control), and the like.

On the other hand, with the development of artificial intelligence technology, the user can operate or use various devices or services through voice input, and accordingly, the importance of security and authentication for the voice input provided by the user is highlighted.

Some embodiments may provide a device and method for authenticating voice input using a learning model that can distinguish whether a voice input is spoken from a person or output from a device based on signal characteristic data.

In addition, some embodiments may provide a device and method for authenticating a user using a learning model that can authenticate a user based on a voice input pattern.

In addition, some embodiments may provide a device and method for further authenticating a user using a learning model capable of authenticating the user through a question and answer.

As a technical means for achieving the above technical problem, a voice input authentication device according to an embodiment of the present disclosure may include a microphone for receiving a voice input, a memory for storing one or more instructions, and a processor for executing the one or more instructions. By executing the one or more instructions, the processor may obtain, from the voice input, signal characteristic data representing a signal characteristic of the voice input, and may authenticate the voice input by applying the obtained signal characteristic data to a first learning model for determining an attribute of the voice input. The first learning model may be trained to determine the attribute of the voice input based on a voice spoken by a person and a voice output from a device.

In addition, a voice input authentication method according to an embodiment of the present disclosure may include receiving a voice input, obtaining, from the voice input, signal characteristic data representing a signal characteristic of the voice input, and authenticating the voice input by applying the obtained signal characteristic data to a first learning model for determining an attribute of the voice input. The first learning model may be trained to determine the attribute of the voice input based on a voice spoken by a person and a voice output from a device.

In addition, the computer-readable recording medium according to an embodiment of the present disclosure may be a computer-readable recording medium that records a program for executing the above-described method.

According to the device and the method for authenticating the voice input according to the present disclosure, it is possible to more effectively defend against attacks by the external attacker's voice input.

The present disclosure can be easily understood by the following detailed description and the accompanying drawings in which reference numerals refer to structural elements.

FIG. 1 is a block diagram of a device in accordance with some embodiments.

FIG. 2 is an exemplary flowchart illustrating a method for a device to authenticate a voice input, in accordance with some embodiments.

FIG. 3 is an exemplary flowchart for describing a method for a device to obtain signal characteristic data, according to some embodiments.

FIGS. 4A to 4D are diagrams for exemplarily describing signal characteristic data.

FIG. 5 is an exemplary flowchart for describing a method of authenticating a voice input by a device according to some embodiments.

FIG. 6 is an exemplary flowchart illustrating a method for a device to authenticate a voice input, in accordance with some embodiments.

FIG. 7 is an exemplary flowchart for describing a method in which a device additionally authenticates a user, according to some embodiments.

FIG. 8 is an exemplary flowchart for describing a method in which a device additionally authenticates a user, according to some embodiments.

FIG. 9 is an exemplary flowchart for describing a method of authenticating a voice input by a device according to some embodiments.

FIG. 10 is an exemplary flowchart for describing a method of authenticating a voice input by a device according to some embodiments.

FIG. 11 is a diagram for describing a function of a device, according to some embodiments.

FIG. 12 is a block diagram of a device in accordance with some embodiments.

FIG. 13 is a block diagram of a device in accordance with some embodiments.

FIG. 14 is a block diagram of a processor in accordance with some embodiments.

FIG. 15 is a block diagram of a data learner according to an exemplary embodiment.

FIG. 16 is a block diagram of a data recognizer according to some example embodiments.

FIG. 17 is a diagram illustrating an example of learning and recognizing data through interworking between a device and a server, according to some embodiments.

The voice input authentication method according to the present disclosure comprises the steps of: receiving a voice input; obtaining, from the voice input, signal characteristic data indicating a signal characteristic of the voice input; and authenticating the voice input by applying the obtained signal characteristic data to a first learning model for determining an attribute of the voice input, wherein the first learning model may be trained to determine the attribute of the voice input based on a voice spoken by a person and a voice output from a device.

Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art may easily implement the present disclosure. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. In addition, in order to clearly describe the present disclosure in the drawings, irrelevant parts are omitted, and like reference numerals designate like parts throughout the specification.

Some embodiments of the present disclosure may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented in various numbers of hardware and / or software configurations that perform particular functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for a given function. In addition, for example, the functional blocks of the present disclosure may be implemented in various programming or scripting languages. The functional blocks may be implemented in algorithms running on one or more processors. In addition, the present disclosure may employ the prior art for electronic configuration, signal processing, and / or data processing.

In addition, the connecting lines or connecting members between the components shown in the drawings are merely illustrative of functional connections and / or physical or circuit connections. In an actual device, the connections between components may be represented by various functional connections, physical connections, or circuit connections that are replaceable or added.

In addition, the terms "... unit", "module", and the like described herein refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software. The "unit" and "module" may be implemented by a program that is stored in an addressable storage medium and executed by a processor.

On the other hand, the embodiments of the present disclosure disclosed in the specification and drawings are merely presented specific examples to easily explain the technical contents of the present disclosure and aid in understanding the present disclosure, and are not intended to limit the scope of the present disclosure. That is, it will be apparent to those skilled in the art that other modifications based on the technical spirit of the present disclosure can be implemented. In addition, each embodiment may be combined with each other if necessary to operate. For example, portions of one embodiment of the present disclosure and another embodiment of the present disclosure may be combined with each other to operate an apparatus.

FIG. 1 is a block diagram of a device in accordance with some embodiments.

Referring to FIG. 1, the device 10 includes a processor 11, a memory 12, and a microphone 13.

The user 1 may provide a voice input to the device 10. The voice input may be a voice spoken by the user and may include text information. The user 1 may control the device 10 or an electronic device connected thereto using a voice input. The voice input may be provided in a variety of forms and languages. For example, the user 1 may speak a voice input such as "check newly received mail" to the device 10.

The device 10 may receive a voice input using the microphone 13. At this time, the microphone 13 may receive a voice input by converting the ambient sound into electrical data. In an embodiment, the device 10 may include a plurality of microphones 13. The microphone 13 may provide the received voice input to the processor 11.

The processor 11 controls the overall operation of the device 10. The processor 11 may be configured to process at least one instruction by performing basic arithmetic, logic and input / output operations. The above-described instructions may be provided from the memory 12 to the processor 11. In other words, the processor 11 may be configured to execute an instruction according to a program code stored in a recording device such as the memory 12. Alternatively, the instruction may be received by the device 10 through a communication module (not shown) and provided to the processor 11.

The processor 11 may authenticate the received voice input and control the device 10 or an electronic device connected thereto using the voice input based on the authentication result. For example, if the voice input of “Bixby, check newly received mail” is authenticated, the processor 11 may control the device 10 to check newly received mail.

Meanwhile, the processor 11 may refuse to allow the user 1 to control the device 10 or an electronic device connected thereto using an unauthenticated voice input. For example, if the voice input of “Bixby, check newly received mail” is not authenticated, the processor 11 may not perform the mail checking operation of the device 10.

In an embodiment, the processor 11 may obtain signal characteristic data from a voice input. The signal characteristic data is data representing electrical signal characteristics of the voice input. In an embodiment, the signal characteristic data may be data obtained by analyzing a voice input based on at least one of frequency, time, or power. The processor 11 may apply the signal characteristic data to the first learning model to authenticate the voice input.

More specifically, the processor 11 may determine the attribute of the voice input by applying the signal characteristic data to the first learning model. In an embodiment, the attribute of the voice input may indicate whether the voice input is spoken by a person or output from a device. In the present disclosure, a voice input spoken by a person means a voice uttered through human vocal cords. In the present disclosure, a voice input output from a device refers to a voice that is electronically synthesized or recorded and output through a speaker, a recorder, a playback device, or the like. The processor 11 may use the attribute of the voice input to determine whether the received voice input comes from the user or from an external attack that uses a device.

In an embodiment, the processor 11 may apply the signal characteristic data to the first learning model to determine the attribute of the voice input and the reliability of the determined attribute. The reliability may be the probability that the determined attribute of the voice input matches reality. The reliability may be expressed in various forms. For example, the processor 11 may determine that the voice input is spoken by a person with 90% reliability. Alternatively, the processor 11 may determine that the voice input is output from a device with a reliability corresponding to a specific step among predetermined steps.
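As an illustration of expressing reliability as one of several predetermined steps, the following minimal sketch maps a probability to a step number. The function name `reliability_step`, the equal-width steps, and the default of five steps are assumptions made for this example; the disclosure does not specify how the steps are defined.

```python
def reliability_step(probability: float, num_steps: int = 5) -> int:
    """Map a probability in [0, 1] to a step in 1..num_steps (assumed equal-width steps)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Step k covers [(k - 1) / num_steps, k / num_steps); 1.0 falls in the top step.
    return min(int(probability * num_steps) + 1, num_steps)
```

With five steps, a reliability of 90% falls in the highest step, while a reliability of 5% falls in the lowest.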

The processor 11 may authenticate the voice input based on the determined attribute of the voice input. For example, if it is determined that the voice input is spoken by a person, the processor 11 may authenticate the voice input. If it is determined that the voice input is output from the device, the processor 11 may not authenticate the voice input.

On the other hand, based on the reliability, it may be difficult for the processor 11 to determine whether the voice input is spoken by a person or output from the device. In this case, the processor 11 may determine whether first user authentication is required based on the determined reliability.

If it is determined that the first user authentication is required, the processor 11 applies the voice input to the second learning model to authenticate the user.

The first user authentication is an operation of authenticating the user who spoke a voice input based on a voice input pattern of the user. In the present disclosure, the voice input pattern may be a pattern determined based on the voice inputs that a user provides to control the device or on the situations in which the user provides them. For example, the voice input pattern may indicate the usage behavior of a user who inputs voice through the processor 11. The processor 11 may apply the voice input to the second learning model to deny authentication of a user who attempts abnormal use.

In an embodiment, the processor 11 may provide context information to the second learning model to train the second learning model based on the voice input pattern.

The context information may include at least one of surrounding environment information of the device 10, state information of the device 10, state information of the user, usage history information of the device 10, and schedule information of the user, but is not limited thereto.

On the other hand, the processor 11 may determine that the second user authentication is required when the user authentication by the first user authentication is denied or difficult to determine. If it is determined that the second user authentication is required, the processor 11 may additionally authenticate the user.

The second user authentication is an operation of authenticating a user who has spoken a voice input by using an additional input provided from the user. In an embodiment, the processor 11 may additionally authenticate the user based on the cipher text received from the user. In another embodiment, the processor 11 may additionally authenticate the user by using a third learning model for authenticating the user through a query response. In another embodiment, the processor 11 may additionally authenticate the user by using biometrics such as fingerprint recognition or face recognition.
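The escalation described above (attribute determination, then first user authentication, then second user authentication) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `Attribute` and `authenticate_voice_input`, the 0.9 threshold, and the two callables that stand in for the second learning model and for the additional authentication are all hypothetical.

```python
from enum import Enum
from typing import Callable, Optional


class Attribute(Enum):
    HUMAN = "human"    # voice spoken by a person
    DEVICE = "device"  # voice output from a device (recorded or synthesized)


def authenticate_voice_input(
    attribute: Attribute,
    reliability: float,
    first_user_auth: Callable[[], Optional[bool]],
    second_user_auth: Callable[[], bool],
    threshold: float = 0.9,  # hypothetical value
) -> bool:
    """Return True if the voice input is authenticated."""
    if reliability >= threshold:
        # Confident determination: accept human speech, reject device output.
        return attribute is Attribute.HUMAN

    # Low reliability: fall back to pattern-based first user authentication.
    # The callable returns True, False, or None when it is hard to determine.
    first_result = first_user_auth()
    if first_result is True:
        return True

    # Denied or undetermined: require an additional input from the user
    # (for example a password, a query response, or biometrics).
    return second_user_auth()
```

Passing the later stages as callables keeps them from running unless the earlier stage could not settle the decision.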

The device 10 may be a smartphone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a micro server, a global positioning system (GPS) device, an e-book device, a digital broadcasting terminal, a navigation device, a kiosk, an MP3 player, a digital camera, a home appliance, or another mobile or non-mobile computing device. In addition, the device 10 may be a wearable device, such as a watch, glasses, a hair band, or a ring, that has a communication function and a data processing function. However, the present disclosure is not limited thereto, and the device 10 may include any kind of device capable of receiving a voice input of the user 1 and authenticating the received voice input.

In addition, the device 10 may communicate with a server and another device (not shown) through a predetermined network in order to use various context information. In this case, the network may include a local area network (LAN), a wide area network (WAN), a value added network (VAN), a mobile radio communication network, a satellite communication network, or a combination thereof. The network is a data communication network in a comprehensive sense that allows the network components to communicate smoothly with one another, and may include the wired Internet, the wireless Internet, and a mobile wireless communication network. Wireless communication may include, for example, wireless LAN (Wi-Fi), Bluetooth, Bluetooth Low Energy, ZigBee, Wi-Fi Direct (WFD), ultra-wideband (UWB), infrared communication (IrDA), and near field communication (NFC), but is not limited thereto.

In operation S210, the device 10 may receive a voice input. The device 10 may receive a voice input provided from a user using at least one microphone.

In operation S220, the device 10 may obtain signal characteristic data from the voice input. The signal characteristic data may be data representing electrical signal characteristics of the voice input. In an embodiment, the signal characteristic data may be data obtained by analyzing the voice input based on at least one of frequency, time, or power. For example, the signal characteristic data may include a spectrogram or the per-frequency cumulative power of the voice input. However, this is exemplary, and the type of the signal characteristic data of the present disclosure is not limited to the kinds described above.

In operation S230, the device 10 may apply signal characteristic data to the first learning model to authenticate the voice input.

More specifically, the device 10 may determine the attribute of the voice input by applying the signal characteristic data to the first learning model. In an embodiment, the attribute of the voice input may indicate whether the voice input is spoken by a person or output from a device. By using the attribute of the voice input, the device 10 may determine whether the received voice input comes from the user or from an external attack that uses a device.

In an embodiment, the device 10 may apply the signal characteristic data to the first learning model to determine the attribute of the voice input and the reliability of the determined attribute. The reliability may be expressed in various forms.

In an embodiment, the device 10 may authenticate the voice input based on the determined attribute of the voice input and its reliability. For example, if it is determined that the voice input is spoken by a person, the device 10 may compare the reliability with a threshold stored in the memory and may authenticate the voice input based on the comparison result. Alternatively, when it is determined that the voice input is output from a device, the device 10 may compare the reliability with a threshold stored in the memory and may not authenticate the voice input based on the comparison result.
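A minimal sketch of this threshold comparison is shown below. The function name `authenticate_by_attribute`, the string labels, and the 0.8 default threshold are assumptions for illustration. The sketch also makes a conservative choice that the disclosure does not state: an input classified as device output is never authenticated, regardless of reliability.

```python
def authenticate_by_attribute(attribute: str, reliability: float,
                              threshold: float = 0.8) -> bool:
    """Authenticate only human speech whose reliability reaches the threshold."""
    if attribute not in ("human", "device"):
        raise ValueError("attribute must be 'human' or 'device'")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    if attribute == "human":
        # Spoken by a person: authenticate when the model is confident enough.
        return reliability >= threshold
    # Output from a device: not authenticated in this sketch (assumption).
    return False
```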

The first learning model may be preset and stored in the device 10. In this case, a server that generates and operates the first learning model may provide the first learning model to the device 10, and the device 10 may store and manage, in the device 10, the first learning model received from the server.

In addition, the preset first learning model may be stored in a server. In this case, the device 10 may provide the signal characteristic data to the server, and receive the attribute of the voice input determined based on the signal characteristic data from the server.

The preset first learning model may be a learning model trained using at least one artificial intelligence algorithm among a machine learning algorithm, a neural network algorithm, a genetic algorithm, a deep learning algorithm, and a classification algorithm.

In an embodiment, the preset first learning model may be pre-trained to determine the attribute of a voice input based on a voice spoken by a person and a voice output from a device. For example, a plurality of voices may be provided, including voices of various lengths and contents spoken by users and voices output from various devices, and the preset first learning model may be trained by using signal characteristic data obtained from the provided voices as training data.
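To make the idea of training on labeled signal characteristic data concrete, the following sketch uses a nearest-centroid classifier. This is a deliberately simple stand-in chosen for illustration and is not the learning model of the disclosure. The function names, the two-element feature vectors, and the distance-based score are assumptions, and the score is a crude illustrative value rather than a calibrated probability.

```python
import math
from typing import Dict, List, Sequence, Tuple


def train_centroids(samples: List[Tuple[Sequence[float], str]]) -> Dict[str, List[float]]:
    """Average the feature vectors of each label ('human' or 'device')."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(centroids: Dict[str, List[float]],
             features: Sequence[float]) -> Tuple[str, float]:
    """Return the nearest label and a crude score in (0, 1]."""
    distances = {label: math.dist(c, features) for label, c in centroids.items()}
    best = min(distances, key=distances.get)
    total = sum(distances.values())
    if total == 0:
        # Every centroid coincides with the input: no basis for a preference.
        return best, 1.0 / len(distances)
    return best, 1.0 - distances[best] / total
```

For example, with made-up placeholder vectors (not measured data) such as `[0.8, 0.2]` and `[0.7, 0.3]` labeled `"human"` and `[0.3, 0.7]` and `[0.2, 0.8]` labeled `"device"`, an input of `[0.72, 0.28]` is assigned to `"human"`.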

FIG. 3 is an exemplary flowchart for describing a method for a device to obtain signal characteristic data, according to some embodiments. In an embodiment, the signal characteristic data includes information on per-frequency cumulative power.

In operation S221, the device 10 may convert the voice input into the frequency dimension. In an embodiment, the device 10 may obtain a spectrogram for the voice input. The spectrogram described above may include information about each frequency and its corresponding power with respect to time.

In an embodiment, the device 10 may obtain the above-described spectrogram by using a Fourier transform. For example, the device 10 may obtain the spectrogram described above using a short-time Fourier transform (STFT). However, the embodiment of obtaining the spectrogram for the voice input in the present disclosure is not limited to the examples described above.

In operation S222, the device 10 may obtain information about cumulative power for each frequency. The spectrogram may include information about each frequency and its corresponding power with respect to time. The device 10 may obtain cumulative power information for each frequency by calculating cumulative power for a predetermined time of each frequency. The predetermined time described above may be a time at which the voice input is spoken.
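Operations S221 and S222 can be sketched with a short-time Fourier transform followed by a sum over time for each frequency bin. The version below is a minimal pure-Python illustration; the function names, the Hann window, and the frame and hop sizes are assumptions, and a practical implementation would use an FFT library rather than the direct DFT shown here.

```python
import cmath
import math
from typing import List


def stft_power(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Return a power spectrogram: one list of per-bin powers for each frame."""
    # Hann window (an assumed choice) to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    num_bins = frame_len // 2 + 1  # bins from 0 Hz up to the Nyquist frequency
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        powers = []
        for k in range(num_bins):
            # Direct DFT of the windowed frame at bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            powers.append(abs(acc) ** 2)
        spectrogram.append(powers)
    return spectrogram


def cumulative_power_per_frequency(spectrogram: List[List[float]]) -> List[float]:
    """Sum the power of each frequency bin over all frames (operation S222)."""
    if not spectrogram:
        return []
    num_bins = len(spectrogram[0])
    return [sum(frame[k] for frame in spectrogram) for k in range(num_bins)]


# Usage: a 1000 Hz tone sampled at 8000 Hz; bin k corresponds to k * 8000 / 64 Hz.
sample_rate, frame_len, hop = 8000, 64, 32
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
cumulative = cumulative_power_per_frequency(stft_power(tone, frame_len, hop))
peak_hz = cumulative.index(max(cumulative)) * sample_rate / frame_len
```

Here the summation runs over the whole signal, which corresponds to taking the duration of the utterance as the predetermined time.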

FIGS. 4A to 4D are diagrams for exemplarily describing signal characteristic data.

FIG. 4A is a diagram illustrating a spectrogram of a voice spoken by a person. FIG. 4B is a diagram illustrating a spectrogram of a voice output from a device. In each spectrogram, frequency is expressed in Hz, time in seconds, and power in dB.

FIG. 4C is a diagram illustrating the cumulative power for each frequency of the voice of FIG. 4A. FIG. 4D is a diagram illustrating the cumulative power for each frequency of the voice of FIG. 4B. The voices used in FIGS. 4A to 4D are exemplary, and the technical spirit of the present disclosure is not limited thereto.

Referring to FIGS. 4A to 4D, a voice spoken by a person and a voice output from a device exhibit different electrical signal characteristics. For example, the voice spoken by a person and the voice output from a device show different forms of attenuation in the cumulative power for each frequency. In addition, the power of the voice spoken by a person and the power of the voice output from a device are concentrated in different frequency bands, and the relative magnitudes of the high-frequency and low-frequency bands also differ between the two.
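One way to quantify the relative magnitude of the high-frequency and low-frequency bands is a band power ratio computed from the per-frequency cumulative power. The sketch below is illustrative only: the function name `band_power_ratio_db` and the split frequency are assumptions, and the disclosure does not state that the first learning model uses this particular feature.

```python
import math
from typing import Sequence


def band_power_ratio_db(cumulative_power: Sequence[float],
                        sample_rate: float, split_hz: float) -> float:
    """Return 10 * log10(high-band power / low-band power).

    The bins are assumed to be evenly spaced from 0 Hz to the Nyquist frequency.
    """
    num_bins = len(cumulative_power)
    if num_bins < 2:
        raise ValueError("need at least two frequency bins")
    bin_hz = (sample_rate / 2) / (num_bins - 1)
    low = sum(p for k, p in enumerate(cumulative_power) if k * bin_hz < split_hz)
    high = sum(p for k, p in enumerate(cumulative_power) if k * bin_hz >= split_hz)
    if low <= 0 or high <= 0:
        raise ValueError("both bands must contain nonzero power")
    return 10 * math.log10(high / low)
```

A negative result means the power is concentrated below the split frequency, and a positive result means it is concentrated above it.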

Accordingly, the first learning model may be pre-trained, based on voices spoken by a person and voices output from a device, to determine the attribute of a voice input using signal characteristic data obtained from those voices.

FIG. 5 is an exemplary flowchart for describing a method of authenticating a voice input by a device according to some embodiments. Referring to FIG. 5, the device 10 may selectively authenticate a user using the second learning model.

In operation S510, the device 10 may receive a voice input. The device 10 may receive a voice input provided from a user using at least one microphone.

In operation S520, the device 10 may obtain signal characteristic data from the voice input. The signal characteristic data is data representing electrical signal characteristics of the voice input. In an embodiment, the signal characteristic data may include a spectrogram or the per-frequency cumulative power of the voice input. However, this is exemplary, and the type of the signal characteristic data of the present disclosure is not limited to the kinds described above.

In operation S530, the device 10 may determine whether the first user authentication is required by applying the signal characteristic data to the first learning model.

In an embodiment, the device 10 may apply the signal characteristic data to the first learning model to determine the property of the voice input and the reliability thereof. In an embodiment, the attribute of the voice input may indicate whether the voice input is spoken by a person or output from a device.

The device 10 may authenticate the voice input based on the determined attribute of the voice input. For example, if it is determined that the voice input is spoken by a person, the device 10 may authenticate the voice input. If it is determined that the voice input is output from the device, the device 10 may not authenticate the voice input.

In an embodiment, the device 10 may authenticate the voice input based on the determined reliability. For example, if it is determined that the voice input is spoken by a person, the device 10 may compare the reliability with a threshold stored in the memory and authenticate the voice input based on the comparison result. Alternatively, when it is determined that the voice input is output from the apparatus, the device 10 may compare the reliability with a threshold stored in the memory and may not authenticate the voice input based on the comparison result.

On the other hand, depending on the reliability, it may be difficult for the device 10 to determine whether the voice input is spoken by a person or output from a device. In this case, the device 10 may determine whether first user authentication is required based on the determined reliability. For example, the device 10 may determine that the first user authentication is required when the determined reliability falls within a predetermined range, for example, 90% or less.

In operation S540, if it is determined that the first user authentication is required, the device 10 may selectively apply a voice input to the second learning model to authenticate the user.

The first user authentication is an operation of authenticating the user who spoke a voice input based on a voice input pattern of the user. In the present disclosure, the voice input pattern may be a pattern determined based on the voice inputs that a user provides to control a device or on the situations in which the user provides them. The device 10 may apply the voice input to the second learning model to deny authentication of a user who attempts abnormal use.

In this case, the device 10 may perform first user authentication by applying the user's voice input and information representing the current situation to the second learning model.

The second learning model may be a learning model learned based on a voice input pattern of the user. For example, the second learning model may be previously trained using contextual information indicating a user's voice input and a context in which the user inputs the voice as training data.

The surrounding environment information of the device 10 refers to environment information within a predetermined radius from the device 10, and may include weather information, temperature information, humidity information, illuminance information, noise information, sound information, time information, and the like. For example, the second learning model may be trained so that the user who spoke the voice input is not authenticated if the voice input is provided at a place different from the learned place. Alternatively, the second learning model may be trained so that the user is not authenticated if the voice input is provided at a time different from the learned time. However, the data for training the second learning model is not limited thereto.

The state information of the device 10 may include mode information indicating an operation mode of the device 10 (for example, a sound mode, a vibration mode, a silent mode, a power saving mode, a blocking mode, a multi-window mode, an automatic rotation mode, and the like), location information of the device 10, time information, activation information of communication modules (for example, Wi-Fi ON / Bluetooth OFF / GPS ON / NFC ON, and the like), network connection status information of the device 10, information on applications executed on the device 10 (for example, application identification information, application type, application usage time, and application usage cycle), and the like.

The state information of the user is information about the user's characteristics, movements, and lifestyle patterns, and may include information about the user's gender, walking state, exercise state, driving state, sleep state, mood state, and the like. For example, the second learning model may be trained so that the user is not authenticated if a gesture that the user does not frequently use during voice input is recognized. However, the state information of the user included in the context information is not limited thereto.

The usage history information of the user's device 10 is information about a history of the user's use of the device 10, and may include a history of application execution, a history of functions executed in an application, the user's call history, the user's text message history, the frequency of words included in voice inputs, the average number of times voice input is used, the average time between the user's voice input operations, and the like.

The schedule information of the user is information about a past schedule and a scheduled schedule of the user. The schedule information may be provided by a user's prior input. Alternatively, the device 10 may receive schedule information from a server or other electronic devices connected through a network.

In authenticating a user, which context information is used may be determined according to learning based on predetermined criteria. For example, supervised learning that takes a predetermined voice input and predetermined context information as input values may be used for user authentication, as may unsupervised learning that discovers a pattern for user authentication by learning, without any guidance, the kind of context information required for user authentication. Further, for example, reinforcement learning that uses feedback on whether the result of user authentication according to the learning is correct may be used for user authentication.

Meanwhile, the device 10 may perform speech to text (STT) conversion on the voice input. The device 10 may extract a user command from the voice input converted through the STT. The device 10 may apply at least one of the language, type, length, and content of the extracted user command to the second learning model.

In operation S550, the device 10 authenticates a voice input based on the user authentication result. For example, if the user is authenticated, the device 10 may authenticate the voice input even if the attribute of the voice input is determined to be low reliability. If the user is not authenticated, the device 10 may not authenticate the voice input.

FIG. 6 is an exemplary flow diagram illustrating a method for a device to authenticate voice input, in accordance with some embodiments. Referring to FIG. 6, the device 10 may selectively perform additional user authentication.

In operation S610, the device 10 may receive a voice input. The device 10 may receive a voice input provided from a user using at least one microphone.

In operation S620, the device 10 may obtain signal characteristic data from the voice input. In an embodiment, the signal characteristic data may include a spectrogram or a per-frequency cumulative power of the voice input. However, this is exemplary, and the type of the signal characteristic data of the present disclosure is not limited to the above-described kinds.
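As a rough illustration of one such characteristic, the following sketch computes a per-frequency cumulative power by summing the power spectrum of each frame over time. The function names, the frame size, and the use of a plain discrete Fourier transform are assumptions made for illustration only; the disclosure does not specify how the signal characteristic data is computed.

```python
import cmath
import math


def frame_power_spectrum(frame):
    """Return |DFT|^2 for the non-negative frequency bins of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def cumulative_power_per_frequency(samples, frame_size):
    """Sum the power of each frequency bin over all complete frames."""
    totals = [0.0] * (frame_size // 2 + 1)
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        for k, power_k in enumerate(frame_power_spectrum(frame)):
            totals[k] += power_k
    return totals


# Hypothetical input: a tone completing 4 cycles per 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(128)]
power = cumulative_power_per_frequency(tone, frame_size=32)
peak_bin = max(range(len(power)), key=power.__getitem__)
```

In this toy input the energy concentrates in a single bin; a vector such as `power` is the kind of data that could be passed to the first learning model.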

In operation S630, the device 10 may determine whether the first user authentication is required by applying the signal characteristic data to the first learning model.

In an embodiment, the device 10 may authenticate the voice input based on the determined reliability. On the other hand, depending on the reliability, it may be difficult for the device 10 to determine whether the voice input is spoken by a person or output from a device. In this case, the device 10 may determine whether first user authentication is required based on the determined reliability.

In operation S640, when it is determined that the first user authentication is required, the device 10 may determine whether the second user authentication is required by applying the voice input to the second learning model.

The first user authentication is an operation of authenticating a user who has spoken a voice input based on a voice input pattern of the user. In the present disclosure, the voice input pattern is indicative of the usage behavior of the user using voice input through the device 10. The device 10 may apply the voice input to the second learning model to deny authentication of a user who attempts abnormal use. On the other hand, the device 10 may determine that the second user authentication is required when the user authentication by the first user authentication is denied or difficult to determine.
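The three possible outcomes of the first user authentication (authenticated, denied, or difficult to determine) can be sketched with a simple frequency-based stand-in for the second learning model. The class name, the choice of hour and command type as features, and both cutoff values are hypothetical and are not taken from the disclosure.

```python
from collections import Counter


class UsagePatternModel:
    """Toy stand-in for the second learning model (assumption, not the patent's model)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, hour, command_type):
        """Record one past voice input made by the user."""
        self.counts[(hour, command_type)] += 1
        self.total += 1

    def familiarity(self, hour, command_type):
        """Fraction of past voice inputs that match this hour and command type."""
        if self.total == 0:
            return 0.0
        return self.counts[(hour, command_type)] / self.total

    def first_user_authentication(self, hour, command_type,
                                  accept_at=0.2, deny_below=0.05):
        """Return 'authenticated', 'denied', or 'undetermined'."""
        score = self.familiarity(hour, command_type)
        if score >= accept_at:
            return "authenticated"
        if score < deny_below:
            return "denied"
        return "undetermined"


model = UsagePatternModel()
for _ in range(8):
    model.observe(7, "alarm")
model.observe(22, "music")
model.observe(12, "call")
```

Under this sketch, a "denied" or "undetermined" result would correspond to the cases in which the second user authentication is required.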

In operation S650, if it is determined that the second user authentication is required, the device 10 may additionally authenticate the user.

The second user authentication is an operation of authenticating a user who has spoken a voice input by using an additional input provided from the user. In an embodiment, the device 10 may additionally authenticate the user based on the cipher text received from the user. In another embodiment, the device 10 may additionally authenticate the user by using a third learning model for authenticating the user through a query response. In another embodiment, the device 10 may additionally authenticate the user by using biometrics such as fingerprint recognition or face recognition.

In operation S660, the device 10 may authenticate the voice input based on the additional user authentication result. For example, if the user is additionally authenticated, the device 10 may authenticate a voice input. If the user is not authenticated, the device 10 may not authenticate the voice input.

FIG. 7 is an exemplary flowchart for describing a method in which a device additionally authenticates a user, according to some embodiments. Referring to FIG. 7, the device 10 may authenticate a user based on a cipher text received from the user.

In operation S710, the device 10 may output a preset word. In an embodiment, the preset word may be a word randomly generated or selected by the device 10. The preset word may be a plurality of words. The plurality of words may not be associated with each other. That is, the plurality of words may be generated or selected independently of each other. Since the word output from the device 10 is not a word selected by the user, it is difficult for an external attacker to predict.

In operation S720, the device 10 may receive a cipher text including the output word from the user. For example, when the device 10 outputs "home" and "dog", the cipher text received by the device 10 may be "my dog".

In operation S730, the device 10 may further authenticate the user based on the received cipher text. For example, the device 10 may compare the cipher text received from the user with a previously set cipher text, and authenticate the user based on the comparison result.
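Operations S710 to S730 can be sketched as follows. The enrollment table that maps each word to a previously set cipher text, the normalization step, and all function names are assumptions introduced for illustration; the disclosure states only that the received cipher text is compared with a previously set one.

```python
import hmac
import secrets


def pick_challenge_words(vocabulary, count):
    """S710: choose `count` distinct words, independently of any user choice."""
    return secrets.SystemRandom().sample(list(vocabulary), count)


def normalize(text):
    """Lower-case the text and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def verify_response(response, challenge_words, enrolled_phrases):
    """S720-S730: accept a response that contains a challenged word and
    equals the cipher text previously set for that word."""
    answer = normalize(response)
    for word in challenge_words:
        expected = enrolled_phrases.get(word)
        if expected is None or word not in answer.split():
            continue
        # Constant-time comparison avoids leaking how much of the phrase matched.
        if hmac.compare_digest(answer.encode(), normalize(expected).encode()):
            return True
    return False


# Hypothetical enrollment data.
enrolled = {"home": "sweet home", "dog": "my dog", "tree": "old tree"}
challenge = pick_challenge_words(enrolled, 2)
```

With the words "home" and "dog" output by the device, the response "my dog" would be accepted in this sketch, while a phrase enrolled for a word that was not output would be rejected.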

FIG. 8 is an exemplary flowchart for describing a method in which a device additionally authenticates a user, according to some embodiments. Referring to FIG. 8, the device 10 may authenticate a user by using a third learning model for authenticating the user through a query response.

In operation S810, the device 10 may obtain context information. The context information may include at least one of surrounding environment information of the device 10, status information of the device 10, status information of the user, usage history information of the device 10, and schedule information of the user, but is not limited thereto.

In operation S820, the device 10 may apply context information to a third learning model for authenticating a user through a query response. The device 10 may apply the context information to the third learning model to provide the user with at least one question and provide the response received therefrom to the third learning model.

In operation S830, the device 10 may additionally authenticate the user based on the query response using the third learning model of operation S820.

In operation S910, the device 10 may receive a voice input. The device 10 may receive a voice input provided from a user using at least one microphone.

In operation S920, the device 10 may obtain signal characteristic data from the voice input. The signal characteristic data is data representing electrical signal characteristics of the voice input.

In operation S930, the device 10 may determine the attribute and the reliability of the voice input based on the signal characteristic data. In an embodiment, the attribute of the voice input may indicate whether the voice input is spoken by a person or output from a device. The reliability may be the probability that the determined attribute of the voice input matches the actual attribute. In an embodiment, the device 10 may apply the signal characteristic data to the first learning model to determine the attribute of the voice input and the reliability thereof. Alternatively, the device 10 may compare the signal characteristic data with a plurality of signal characteristic data previously stored in the memory 13 to determine the attribute of the voice input and the reliability thereof.
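The alternative of comparing against previously stored signal characteristic data can be sketched as a nearest-neighbor vote. The use of Euclidean distance, the value of `k`, and the definition of reliability as the share of agreeing neighbors are assumptions made for illustration; the disclosure does not specify the comparison method.

```python
import math
from collections import Counter


def classify_by_stored_samples(features, stored, k=3):
    """Compare `features` with stored (feature_vector, label) pairs.

    Returns (attribute, reliability), where reliability is the fraction
    of the k nearest stored samples that share the winning label.
    """
    nearest = sorted(stored, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    attribute, count = votes.most_common(1)[0]
    return attribute, count / len(nearest)


# Hypothetical stored samples: two-dimensional feature vectors with labels.
stored = [
    ([0.0, 0.0], "human"),
    ([0.0, 1.0], "human"),
    ([1.0, 0.0], "human"),
    ([10.0, 10.0], "device"),
    ([9.0, 10.0], "device"),
]
attribute, reliability = classify_by_stored_samples([9.0, 9.0], stored)
```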

In operation S940 and operation S950, the device 10 may determine whether user authentication is required based on a result of comparing the determined reliability with a previously stored threshold value. For example, the device 10 may determine that user authentication is required if the determined reliability is less than or equal to the threshold.

In operation S960, if it is determined that user authentication is required, the device 10 may authenticate the user based on the voice input pattern. In the present disclosure, the voice input pattern is indicative of the usage behavior of the user using voice input through the device 10. In an embodiment, the device 10 may deny authentication of a user who attempts abnormal use by applying a voice input to a learning model.

In operation S970, the device 10 may authenticate a voice input. For example, if the user is authenticated, the device 10 may authenticate the voice input even if the attribute of the voice input is determined to be low reliability. If the user is not authenticated, the device 10 may not authenticate the voice input.

In operation S1010, the device 10 may receive a voice input. The device 10 may receive a voice input provided from a user using at least one microphone.

In operation S1020, the device 10 may obtain signal characteristic data from the voice input. The signal characteristic data is data representing electrical signal characteristics of the voice input.

In operation S1030, the device 10 may determine the attribute and the reliability of the voice input based on the signal characteristic data. In an embodiment, the attribute of the voice input may indicate whether the voice input is spoken by a person or output from a device. The reliability may be the probability that the determined attribute of the voice input matches the actual attribute.

In operation S1040 and operation S1050, the device 10 may determine whether the first user authentication is required based on a result of comparing the determined reliability with a previously stored threshold value. For example, the device 10 may determine that the first user authentication is required if the determined reliability is equal to or smaller than the threshold.

In operations S1060 and S1070, if it is determined that the first user authentication is required, the device 10 may authenticate the user based on the voice input pattern. In the present disclosure, the voice input pattern is indicative of the usage behavior of the user using voice input through the device 10. In an embodiment, the device 10 may deny authentication of a user who attempts abnormal use by applying a voice input to a learning model. Also, the device 10 may determine whether the second user authentication is required based on the result of the first user authentication. For example, the device 10 may determine that the second user authentication is required if the user is not authenticated in the first user authentication.

In operation S1080, if it is determined that the second user authentication is required, the device 10 may additionally authenticate the user.

In operation S1090, the device 10 may authenticate a voice input. For example, if the user is authenticated, the device 10 may authenticate the voice input even if the attribute of the voice input is determined to be low reliability. If the user is not authenticated, the device 10 may not authenticate the voice input.
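The decision flow of operations S1040 to S1090 can be sketched as follows. The function and parameter names are hypothetical, the two authentication steps are passed in as callables, and rejecting an input that is confidently attributed to a device is an assumption rather than something the flow above states explicitly.

```python
from typing import Callable


def authenticate_voice_input(
    attribute: str,
    reliability: float,
    threshold: float,
    first_user_auth: Callable[[], bool],
    second_user_auth: Callable[[], bool],
) -> bool:
    """Return True if the voice input is authenticated."""
    # S1040/S1050: reliability above the threshold needs no user authentication.
    # Assumption: a confident "device" attribute is rejected outright.
    if reliability > threshold:
        return attribute == "human"
    # S1060/S1070: reliability at or below the threshold triggers the
    # first user authentication based on the voice input pattern.
    if first_user_auth():
        return True
    # S1080/S1090: the second user authentication uses an additional input.
    return second_user_auth()


# Hypothetical call: low reliability, but the usage pattern matches the user.
result = authenticate_voice_input("device", 0.5, 0.7, lambda: True, lambda: False)
```

Passing the authentication steps as callables keeps the later, more intrusive steps from running unless the earlier ones fail.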

FIG. 11 is a diagram for describing a function of the device 100, according to some embodiments.

The device 100 authenticates the voice input based on the attribute of the voice input using the first learning model 120 trained based on the signal characteristic. Through this, the device 100 may continuously detect and prevent an attack from the outside using the voice output from the device.

The device 100 uses the second learning model 130, trained based on the voice input pattern, to authenticate the user based on the user's behavior in using voice input. Through this, the device 100 may additionally prevent an external attack that uses a voice spoken by a person other than the user.

In addition, the device 100 may receive an additional input from the user to authenticate the user (140). Through this, the device 100 may authenticate the voice input even when the user attempts an unusual voice input, and may effectively prevent an external attack that uses a voice spoken by a person other than the user.

Meanwhile, the device 100 is connected to various external electronic devices and servers through a network. The device 100 may collect various information including context information from connected devices and a server. The information collected by the device 100 includes at least one of sound information, text information, location information, and time information (110). The device 100 may train the first or second learning models 120 and 130 using the collected information.

FIGS. 12 and 13 are block diagrams of a device 1000 according to some embodiments.

As illustrated in FIG. 13, the device 1000 according to some embodiments may include a user input unit 1100, an output unit 1200, a controller 1300, and a communication unit 1500. However, not all of the components illustrated in FIG. 13 are essential components of the device 1000. The device 1000 may be implemented by more components than those illustrated in FIG. 13, and the device 1000 may be implemented by fewer components than those illustrated in FIG. 13.

For example, as illustrated in FIG. 13, the device 1000 according to some embodiments may further include a sensing unit 1400, an A/V input unit 1600, and a memory 1700 in addition to the user input unit 1100, the output unit 1200, the controller 1300, and the communication unit 1500.

The user input unit 1100 is a means by which a user inputs data for controlling the device 1000. For example, the user input unit 1100 may include a key pad, a dome switch, a touch pad (contact capacitive type, pressure resistive layer type, infrared sensing type, surface ultrasonic conduction type, integral tension measurement type, piezo effect type, etc.), a jog wheel, a jog switch, a microphone, and the like, but is not limited thereto. In an embodiment, the user input unit 1100 may include the A/V input unit 1600 illustrated in FIG. 13.

The user input unit 1100 may receive a voice input provided from the user.

The output unit 1200 may output an audio signal, a video signal, or a vibration signal, and may include a display unit 1210, a sound output unit 1220, and a vibration motor 1230.

The display unit 1210 displays and outputs information processed by the device 1000. For example, the display unit 1210 may display a user interface for providing a voice input authentication result to the user.

Meanwhile, when the display unit 1210 and the touch pad form a layer structure and are configured as a touch screen, the display unit 1210 may be used as an input device in addition to an output device. The display unit 1210 may include at least one of a liquid crystal display, a thin-film-transistor liquid crystal display, an organic light-emitting diode, a flexible display, a three-dimensional (3D) display, and an electrophoretic display. In addition, the device 1000 may include two or more display units 1210 according to an implementation form of the device 1000. In this case, the two or more display units 1210 may be disposed to face each other using a hinge.

The sound output unit 1220 outputs audio data received from the communication unit 1500 or stored in the memory 1700. In addition, the sound output unit 1220 outputs a sound signal related to a function (for example, a call signal reception sound, a message reception sound, and a notification sound) performed by the device 1000. The sound output unit 1220 may include a speaker, a buzzer, and the like.

The vibration motor 1230 may output a vibration signal. For example, the vibration motor 1230 may output a vibration signal corresponding to the output of audio data or video data (eg, a call signal reception sound, a message reception sound, etc.). In addition, the vibration motor 1230 may output a vibration signal when a touch is input to the touch screen.

The controller 1300 generally controls the overall operation of the device 1000. For example, the controller 1300 may execute programs stored in the memory 1700 to control the user input unit 1100, the output unit 1200, the sensing unit 1400, the communication unit 1500, and the A/V input unit 1600 overall.

In detail, the controller 1300 may authenticate the received voice input and control the device 1000 or an electronic device connected thereto through the voice input based on the authentication result.

In an embodiment, the controller 1300 may obtain signal characteristic data from a voice input. The signal characteristic data is data representing electrical signal characteristics of the voice input. In an embodiment, the signal characteristic data may be data obtained by analyzing a voice input based on at least one of frequency, time, or power. The controller 1300 may apply the signal characteristic data to the first learning model to determine the property of the voice input and the reliability thereof. In an embodiment, the attribute of the voice input may indicate whether the voice input is spoken by a person or output from a device.

In an embodiment, the controller 1300 may determine whether first user authentication is required based on the determined reliability. The first user authentication is an operation of authenticating a user who has spoken a voice input based on a voice input pattern of the user. In the present disclosure, the voice input pattern indicates usage behavior in which the user uses voice input.

If it is determined that the first user authentication is required, the controller 1300 may authenticate the user by applying the voice input to the second learning model. If the authentication by the first user authentication is denied or difficult to be determined, the controller 1300 may determine that the second user authentication is required, and additionally authenticate the user.

The sensing unit 1400 may detect a state of the device 1000 or a state around the device 1000 and transmit the detected information to the controller 1300.

The sensing unit 1400 may include at least one of a geomagnetic sensor 1410, an acceleration sensor 1420, a temperature/humidity sensor 1430, an infrared sensor 1440, a gyroscope sensor 1450, a position sensor (e.g., GPS) 1460, a barometric pressure sensor 1470, a proximity sensor 1480, and an RGB sensor (illuminance sensor) 1490, but is not limited thereto. Since functions of the respective sensors can be intuitively deduced by those skilled in the art from the names, detailed descriptions thereof will be omitted.

The communication unit 1500 may include one or more components that allow communication between the device 1000 and the HMD apparatus or between the device 1000 and the server. For example, the communication unit 1500 may include a short-range wireless communication unit 1510, a mobile communication unit 1520, and a broadcast receiving unit 1530.

The short-range wireless communication unit 1510 may include a Bluetooth communication unit, a Bluetooth low energy (BLE) communication unit, a near field communication unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an infrared (IrDA, Infrared Data Association) communication unit, a WFD (Wi-Fi Direct) communication unit, a UWB (ultra wideband) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The mobile communication unit 1520 transmits and receives a wireless signal to and from at least one of a base station, an external terminal, and a server on a mobile communication network. Here, the wireless signal may include various types of data according to transmission and reception of a voice call signal, a video call signal, or a text/multimedia message.

The broadcast receiving unit 1530 receives a broadcast signal and/or broadcast-related information from the outside through a broadcast channel. The broadcast channel may include a satellite channel and a terrestrial channel. According to an implementation example, the device 1000 may not include the broadcast receiving unit 1530.

In addition, the communication unit 1500 may transmit / receive information for using the context information with the HMD device, the server, and the peripheral device.

The A / V input unit 1600 is for inputting an audio signal or a video signal, and may include a camera 1610 and a microphone 1620. The camera 1610 may obtain an image frame such as a still image or a moving image through an image sensor in a video call mode or a photographing mode. The image captured by the image sensor may be processed by the controller 1300 or a separate image processor (not shown).

The image frame processed by the camera 1610 may be stored in the memory 1700 or transmitted to the outside through the communication unit 1500. Alternatively, the image frame may be used to determine the voice active condition and the voice inactive condition of the controller 1300. Two or more cameras 1610 may be provided according to the configuration aspect of the terminal.

The microphone 1620 receives a voice input. In addition, the microphone 1620 receives an external sound signal and processes it as electrical voice data. For example, the microphone 1620 may receive an acoustic signal from an external device or speaker. The microphone 1620 may use various noise removing algorithms for removing noise generated in the process of receiving an external sound signal including a voice input.

The memory 1700 may store a program for processing and controlling the controller 1300, and may store data input to or output from the device 1000.

The memory 1700 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk.

Programs stored in the memory 1700 may be classified into a plurality of modules according to their functions. For example, the programs stored in the memory 1700 may be classified into a UI module 1710, a touch screen module 1720, a notification module 1730, and the like.

The UI module 1710 may provide a specialized UI, GUI, or the like that is linked with the device 1000 for each application. The touch screen module 1720 may detect a touch gesture on a user's touch screen and transmit information about the touch gesture to the controller 1300. The touch screen module 1720 according to some embodiments may recognize and analyze a touch code. The touch screen module 1720 may be configured as separate hardware including a controller.

Various sensors may be provided inside or near the touch screen to detect a touch or proximity touch of the touch screen. An example of a sensor for sensing a touch of a touch screen is a tactile sensor. The tactile sensor refers to a sensor that senses the contact of a specific object to the extent that a person feels or more. The tactile sensor may sense various information such as the roughness of the contact surface, the rigidity of the contact object, the temperature of the contact point, and the like.

In addition, an example of a sensor for sensing a touch of a touch screen is a proximity sensor.

The proximity sensor refers to a sensor that detects the presence or absence of an object approaching a predetermined detection surface or an object present in the vicinity without using a mechanical contact by using an electromagnetic force or infrared rays. Examples of the proximity sensor include a transmission photoelectric sensor, a direct reflection photoelectric sensor, a mirror reflection photoelectric sensor, a high frequency oscillation proximity sensor, a capacitive proximity sensor, a magnetic proximity sensor, and an infrared proximity sensor. The user's touch gesture may include tap, touch and hold, double tap, drag, pan, flick, drag and drop, and swipe.

The notification module 1730 may generate a signal for notifying occurrence of an event of the device 1000. Examples of events occurring in the device 1000 include call signal reception, message reception, key signal input, and schedule notification. The notification module 1730 may output a notification signal in the form of a video signal through the display unit 1210, in the form of an audio signal through the sound output unit 1220, or in the form of a vibration signal through the vibration motor 1230.

FIG. 14 is a block diagram of a processor 1300 in accordance with some embodiments.

Referring to FIG. 14, a processor 1300 according to some embodiments may include a data learner 1310 and a data recognizer 1320.

According to an embodiment, at least a part of the data learner 1310 and at least a part of the data recognizer 1320 may be implemented as a software module or manufactured in the form of a hardware chip and mounted on a device.

The data learner 1310 may learn a criterion for determining a property of a voice input and a reliability thereof. The data learner 1310 may learn a criterion about what data to use to determine the attribute of the voice input, the reliability thereof, and to authenticate the voice input. In addition, the data learner 1310 may use the provided data to learn a criterion about how to determine a property of a voice input, how to determine its reliability, and how to determine a criterion for voice input authentication. The data learner 1310 acquires data to be used for learning and applies the acquired data to a data recognition model to be described later, so as to learn a criterion regarding a voice input property and reliability thereof.

In addition, the data learner 1310 may learn a criterion for authenticating a user based on a voice input pattern. The data learner 1310 may learn a criterion about which context data to use to authenticate a user. In addition, the data learner 1310 may learn a criterion on how to authenticate a user using the provided context data. The data learner 1310 acquires data to be used for learning, and applies the acquired data to a data recognition model to be described later, thereby learning a criterion for user authentication.

The data learner 1310 may provide functions of the learning models used by the device in FIGS. 1 to 10, and the functions of those learning models may be implemented by one or more data learners 1310.

The data recognizer 1320 may determine the attribute of the voice input and the reliability thereof, or authenticate the user, based on data. The data recognizer 1320 may determine the attribute of the voice input and the reliability thereof from predetermined data, or authenticate the user, by using the trained data recognition model. The data recognizer 1320 may obtain predetermined data according to criteria preset by learning, and use the data recognition model with the obtained data as an input value. In addition, the result value output by the data recognition model using the obtained data as an input value may be used to update the data recognition model.

At least one of the data learner 1310 and the data recognizer 1320 may be manufactured in the form of at least one hardware chip and mounted on the device 1000. For example, at least one of the data learner 1310 and the data recognizer 1320 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics processor (e.g., a GPU) and mounted on the aforementioned various devices 1000. Here, the dedicated hardware chip for artificial intelligence is a dedicated processor specialized in probability calculation, and has higher parallel processing performance than a conventional general-purpose processor, so that it can quickly process arithmetic tasks in the field of artificial intelligence such as machine learning.

The data learner 1310 and the data recognizer 1320 may be mounted in one device or may be mounted in separate devices, respectively. For example, one of the data learner 1310 and the data recognizer 1320 may be included in the device, and the other may be included in the server. In addition, model information constructed by the data learner 1310 may be provided to the data recognizer 1320 via a wired or wireless connection, and data input to the data recognizer 1320 may be provided to the data learner 1310 as additional learning data.

Meanwhile, at least one of the data learner 1310 and the data recognizer 1320 may be implemented as a software module. When at least one of the data learner 1310 and the data recognizer 1320 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium. In this case, at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by an operating system (OS), and others may be provided by a predetermined application.

FIG. 15 is a block diagram of a data learner 1310 according to some embodiments.

Referring to FIG. 15, the data learner 1310 may include a data acquirer 1310-1, a preprocessor 1310-2, a training data selector 1310-3, a model learner 1310-4, and a model evaluator 1310-5.

The data acquirer 1310-1 may determine attributes of the voice input and reliability thereof, or acquire data necessary for authenticating a user. The data acquirer 1310-1 may acquire, for example, predetermined user voice or context information.

The preprocessor 1310-2 may preprocess the acquired data so that the acquired data can be used for learning to determine the attribute of the voice input and the reliability thereof, or to authenticate the user. The preprocessor 1310-2 may process the acquired data into a preset format so that the model learner 1310-4, to be described later, can use the acquired data for learning to determine the attribute of the voice input and the reliability thereof, or to authenticate the user.

The training data selector 1310-3 may select data required for learning from the preprocessed data. The selected data may be provided to the model learner 1310-4. The training data selector 1310-3 may select data required for learning from preprocessed data according to a property of a voice input and a reliability thereof, or based on preset criteria for authenticating a user. In addition, the training data selector 1310-3 may select data according to preset criteria by learning by the model learner 1310-4 to be described later.

The model learner 1310-4 may learn, based on the training data, a criterion for determining the attribute of the voice input and its reliability, or for authenticating the user. In addition, the model learner 1310-4 may learn a criterion regarding which training data should be used to determine the attribute of the voice input and its reliability, or to authenticate the user.

In addition, the model learner 1310-4 may use the training data to train the data recognition model used to determine the attribute of the voice input and its reliability, or to authenticate the user. In this case, the data recognition model may be a pre-built model. For example, the data recognition model may be a model built in advance from basic training data (e.g., sample data).

The data recognition model may be constructed in consideration of the application field of the recognition model, the purpose of learning, or the computing performance of the device. The data recognition model may be designed to simulate the structure of the human brain on a computer. The data recognition model may include a plurality of weighted network nodes that simulate the neurons of a human neural network. The network nodes may be connected to one another so as to simulate the synaptic activity by which neurons send and receive signals through synapses. The data recognition model may include, for example, a neural network model or a deep learning model developed from the neural network model. In the deep learning model, the network nodes may be located at different depths (or layers) and may exchange data according to a convolutional connection relationship. The data recognition model may include, for example, a deep neural network (DNN), a recurrent neural network (RNN), a bidirectional recurrent deep neural network (BRDNN), or the like.

According to various embodiments of the present disclosure, when there are a plurality of pre-built data recognition models, the model learner 1310-4 may determine, as the data recognition model to be trained, a data recognition model whose basic training data is highly correlated with the input training data. In this case, the basic training data may be classified in advance by data type, and a data recognition model may be built in advance for each data type. For example, the basic training data may be classified in advance according to various criteria, such as the region where the training data was generated, the time at which the training data was generated, the size of the training data, the genre of the training data, the creator of the training data, and the types of objects in the training data.
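The selection among pre-built models can be pictured with a small Python sketch. It is only illustrative: the disclosure does not say how the correlation between input training data and basic training data is measured, so the sketch assumes each data set is described by a set of classification tags (region, genre, creator, and so on) and uses tag overlap as a stand-in. The names `PrebuiltModel`, `tag_overlap`, and `pick_model_to_train` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrebuiltModel:
    """A pre-built model plus tags describing its basic training data (assumed format)."""
    name: str
    basic_data_tags: frozenset


def tag_overlap(a, b):
    """Jaccard similarity of two tag sets, an assumed proxy for 'correlation'."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_model_to_train(models, input_data_tags):
    """Return the pre-built model whose basic training data best matches the input data."""
    return max(models, key=lambda m: tag_overlap(m.basic_data_tags, input_data_tags))


models = [
    PrebuiltModel("model_a", frozenset({"region:KR", "genre:command"})),
    PrebuiltModel("model_b", frozenset({"region:US", "genre:dictation"})),
]
chosen = pick_model_to_train(models, {"region:KR", "genre:command", "creator:user1"})
print(chosen.name)
```

Any other similarity measure could be substituted for `tag_overlap` without changing the selection step itself.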

In addition, the model learner 1310-4 may train the data recognition model using, for example, a learning algorithm including error back-propagation or gradient descent.

In addition, the model learner 1310-4 may train the data recognition model through, for example, supervised learning that uses the training data as an input value. The model learner 1310-4 may also train the data recognition model through, for example, unsupervised learning, which discovers a criterion for situation determination by learning, without any guidance, the kinds of data needed for situation determination. The model learner 1310-4 may also train the data recognition model through, for example, reinforcement learning, which uses feedback on whether the result of a situation determination made according to the learning is correct.
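As a rough illustration of supervised training by gradient descent, the following Python sketch fits a single sigmoid unit to labeled toy data. It is not the disclosed data recognition model: the one-dimensional feature, the labels (0 for a voice spoken by a person, 1 for a voice output from a device), the learning rate, and the function names `train` and `predict` are all assumptions made for the example. For a single unit, error back-propagation reduces to the loss gradient computed in the inner loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability that feature vector x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(samples, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the cross-entropy loss of one sigmoid unit."""
    n = len(samples)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = predict(w, b, x) - y  # gradient of the loss w.r.t. the pre-activation
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Step against the averaged gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical one-dimensional feature; labels: 0 = person, 1 = device.
samples = [[0.1], [0.2], [0.8], [0.9]]
labels = [0, 0, 1, 1]
w, b = train(samples, labels)
print(predict(w, b, [0.9]) > 0.5, predict(w, b, [0.1]) < 0.5)
```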

In addition, when the data recognition model has been trained, the model learner 1310-4 may store the trained data recognition model. In this case, the model learner 1310-4 may store the trained data recognition model in a memory of the electronic device including the data recognizer 1320, which will be described later. Alternatively, the model learner 1310-4 may store the trained data recognition model in a memory of a server connected to the electronic device through a wired or wireless network.

In this case, the memory in which the trained data recognition model is stored may also store, for example, commands or data related to at least one other element of the electronic device. The memory may also store software and/or programs. The programs may include, for example, a kernel, middleware, an application programming interface (API), and/or an application program (or "application").

The model evaluator 1310-5 may input evaluation data into the data recognition model and, if the recognition result output for the evaluation data does not satisfy a predetermined criterion, may cause the model learner 1310-4 to train the model again. In this case, the evaluation data may be preset data for evaluating the data recognition model.

For example, the model evaluator 1310-5 may evaluate the trained data recognition model as not satisfying the predetermined criterion when the number or ratio of evaluation data for which the recognition result is inaccurate exceeds a preset threshold. For example, when the predetermined criterion is defined as a ratio of 2% and the trained data recognition model outputs incorrect recognition results for more than 20 of a total of 1,000 evaluation data, the model evaluator 1310-5 may evaluate the trained data recognition model as unsuitable.
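The 2% example above can be checked with a few lines of Python. This is a minimal sketch; the function name `fails_criterion` and the list-based representation of recognition results are assumptions made for illustration.

```python
def fails_criterion(predictions, expected, max_error_ratio=0.02):
    """Return True when the share of incorrect results exceeds the threshold."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and equal in length")
    errors = sum(1 for p, e in zip(predictions, expected) if p != e)
    return errors / len(expected) > max_error_ratio


expected = [1] * 1000
exactly_20_wrong = [0] * 20 + [1] * 980
more_than_20_wrong = [0] * 21 + [1] * 979

print(fails_criterion(exactly_20_wrong, expected))    # False: 20 of 1,000 does not exceed 2%
print(fails_criterion(more_than_20_wrong, expected))  # True: 21 of 1,000 exceeds 2%
```

A `True` result corresponds to the case in which the model evaluator would have the model learner train the model again.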

Meanwhile, when there are a plurality of trained data recognition models, the model evaluator 1310-5 may evaluate whether each trained data recognition model satisfies the predetermined criterion, and may determine a model satisfying the predetermined criterion as the final data recognition model. In this case, when a plurality of models satisfy the predetermined criterion, the model evaluator 1310-5 may determine, as the final data recognition model, any one model or a preset number of models in descending order of evaluation score.
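Choosing the final model or models from several trained candidates might look like the following Python sketch. It assumes that the predetermined criterion can be expressed as a minimum evaluation score, which the disclosure does not specify; `select_final_models`, `min_score`, and `top_n` are hypothetical names.

```python
def select_final_models(scores, min_score, top_n=1):
    """Keep models that meet the criterion, then return the top_n by evaluation score."""
    passing = {name: score for name, score in scores.items() if score >= min_score}
    ranked = sorted(passing, key=passing.get, reverse=True)
    return ranked[:top_n]


scores = {"m1": 0.97, "m2": 0.99, "m3": 0.90}
print(select_final_models(scores, min_score=0.95))           # ['m2']
print(select_final_models(scores, min_score=0.95, top_n=2))  # ['m2', 'm1']
```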

Meanwhile, at least one of the data acquirer 1310-1, the preprocessor 1310-2, the training data selector 1310-3, the model learner 1310-4, and the model evaluator 1310-5 in the data learner 1310 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data acquirer 1310-1, the preprocessor 1310-2, the training data selector 1310-3, the model learner 1310-4, and the model evaluator 1310-5 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU), and mounted on various electronic devices.

In addition, the data acquirer 1310-1, the preprocessor 1310-2, the training data selector 1310-3, the model learner 1310-4, and the model evaluator 1310-5 may be mounted on a single electronic device, or may be mounted on separate electronic devices, respectively. For example, some of the data acquirer 1310-1, the preprocessor 1310-2, the training data selector 1310-3, the model learner 1310-4, and the model evaluator 1310-5 may be included in the electronic device, and the rest may be included in a server.

In addition, at least one of the data acquirer 1310-1, the preprocessor 1310-2, the training data selector 1310-3, the model learner 1310-4, and the model evaluator 1310-5 may be implemented as a software module. When at least one of the data acquirer 1310-1, the preprocessor 1310-2, the training data selector 1310-3, the model learner 1310-4, and the model evaluator 1310-5 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by the OS, and the rest may be provided by a predetermined application.

FIG. 16 is a block diagram of the data recognizer 1320 according to some embodiments.

Referring to FIG. 16, the data recognizer 1320 according to an exemplary embodiment may include a data acquirer 1320-1, a preprocessor 1320-2, a recognition data selector 1320-3, a recognition result provider 1320-4, and a model updater 1320-5.

The data acquirer 1320-1 may acquire data necessary for determining the attribute of the voice input and its reliability, or for authenticating the user. The preprocessor 1320-2 may preprocess the acquired data so that it can be used to determine the attribute of the voice input and its reliability, or to authenticate the user. The preprocessor 1320-2 may process the acquired data into a preset format so that the recognition result provider 1320-4, described later, can use the acquired data for this purpose.

The recognition data selector 1320-3 may select, from the preprocessed data, data necessary for determining the attribute of the voice input and its reliability, or for authenticating the user. The selected data may be provided to the recognition result provider 1320-4. The recognition data selector 1320-3 may select some or all of the preprocessed data according to preset criteria for determining the attribute of the voice input and its reliability, or for authenticating the user. In addition, the recognition data selector 1320-3 may select data according to criteria established through learning by the model learner 1310-4, described above.

The recognition result provider 1320-4 may apply the selected data to the data recognition model to determine the attribute of the voice input and its reliability, or to authenticate the user. The recognition result provider 1320-4 may provide a recognition result according to the purpose of the data recognition. The recognition result provider 1320-4 may apply the selected data to the data recognition model by using the data selected by the recognition data selector 1320-3 as an input value. The recognition result may be determined by the data recognition model.

The model updater 1320-5 may cause the data recognition model to be updated based on an evaluation of the recognition result provided by the recognition result provider 1320-4. For example, the model updater 1320-5 may provide the recognition result provided by the recognition result provider 1320-4 to the model learner 1310-4, so that the model learner 1310-4 updates the data recognition model.
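The flow through the components of FIG. 16 can be sketched in Python as a chain of small functions. Everything concrete here is assumed for illustration: the peak normalization used as the "preset format", the fixed-length selection rule, the `toy_model` scoring rule, and the `feedback` list that stands in for results handed back to the model learner.

```python
def preprocess(raw):
    """Scale samples into [-1, 1] (an assumed 'preset format')."""
    peak = max((abs(v) for v in raw), default=0.0)
    return [v / peak for v in raw] if peak else list(raw)


def select(data, keep):
    """Keep only the first `keep` values (an assumed selection criterion)."""
    return data[:keep]


def toy_model(features):
    """Hypothetical stand-in for the data recognition model."""
    if not features:
        raise ValueError("no features to recognize")
    score = sum(abs(v) for v in features) / len(features)
    attribute = "device" if score > 0.5 else "human"
    return {"attribute": attribute, "reliability": max(score, 1.0 - score)}


def recognize(raw, model, keep=4, feedback=None):
    """Acquire -> preprocess -> select -> apply model -> record result for updating."""
    selected = select(preprocess(raw), keep)
    result = model(selected)
    if feedback is not None:
        # The model updater would pass this on so the model can be retrained.
        feedback.append({"input": selected, "result": result})
    return result


feedback = []
result = recognize([0.2, -0.4, 2.0, 1.0, 0.1], toy_model, feedback=feedback)
print(result["attribute"], len(feedback))
```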

Meanwhile, at least one of the data acquirer 1320-1, the preprocessor 1320-2, the recognition data selector 1320-3, the recognition result provider 1320-4, and the model updater 1320-5 in the data recognizer 1320 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data acquirer 1320-1, the preprocessor 1320-2, the recognition data selector 1320-3, the recognition result provider 1320-4, and the model updater 1320-5 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU), and mounted on various electronic devices.

In addition, the data acquirer 1320-1, the preprocessor 1320-2, the recognition data selector 1320-3, the recognition result provider 1320-4, and the model updater 1320-5 may be mounted on a single electronic device, or may be mounted on separate electronic devices, respectively. For example, some of the data acquirer 1320-1, the preprocessor 1320-2, the recognition data selector 1320-3, the recognition result provider 1320-4, and the model updater 1320-5 may be included in the electronic device, and the rest may be included in a server.

In addition, at least one of the data acquirer 1320-1, the preprocessor 1320-2, the recognition data selector 1320-3, the recognition result provider 1320-4, and the model updater 1320-5 may be implemented as a software module. When at least one of the data acquirer 1320-1, the preprocessor 1320-2, the recognition data selector 1320-3, the recognition result provider 1320-4, and the model updater 1320-5 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by the OS, and the rest may be provided by a predetermined application.

FIG. 17 illustrates an example in which the device 1000 and the server 2000 learn and recognize data by interworking with each other, according to an exemplary embodiment.

Referring to FIG. 17, the server 2000 may learn a criterion for determining the attribute of the voice input and its reliability, or for authenticating the user, and the device 1000 may determine the attribute of the voice input and its reliability, or authenticate the user, based on the learning result of the server 2000.

In this case, the model learner 2340 of the server 2000 may perform the function of the data learner 1310 illustrated in FIG. 15. The model learner 2340 of the server 2000 may learn a criterion regarding which data to use for determining voice activation and deactivation conditions and for generating recommended speech text information. In addition, the model learner 2340 of the server 2000 may learn a criterion regarding how to use the data to determine the voice activation and deactivation conditions and to generate the recommended speech text information. The model learner 2340 may acquire data to be used for training and apply the acquired data to the data recognition model, thereby learning a criterion for determining the attribute of the voice input and its reliability, or for authenticating the user.

In addition, the recognition result provider 1320-4 of the device 1000 may determine the attribute of the voice input and its reliability, or authenticate the user, by applying the data selected by the recognition data selector 1320-3 to the data recognition model generated by the server 2000. For example, the recognition result provider 1320-4 may transmit the data selected by the recognition data selector 1320-3 to the server 2000 and request that the server 2000 apply the data to the recognition model to determine the attribute of the voice input and its reliability, or to authenticate the user. In addition, the recognition result provider 1320-4 may receive, from the server 2000, information about the attribute of the voice input and its reliability as determined by the server 2000, or about the authentication of the user.

Alternatively, the recognition result provider 1320-4 of the device 1000 may receive the recognition model generated by the server 2000 from the server 2000, and may use the received recognition model to determine the attribute of the voice input and its reliability, or to authenticate the user. In this case, the recognition result provider 1320-4 of the device 1000 may determine the attribute of the voice input and its reliability, or authenticate the user, by applying the data selected by the recognition data selector 1320-3 to the data recognition model received from the server 2000.

In addition, the device 1000 and the server 2000 may effectively divide and perform the tasks of training the data recognition model and recognizing data, thereby processing data efficiently in order to provide a service that matches the user's intention while effectively protecting the user's privacy.
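The two ways of dividing work between the device 1000 and the server 2000 can be sketched in Python as follows. The sketch uses no real networking and is purely illustrative: the classes `Server` and `Device`, their methods, and `toy_model` are hypothetical, and the model is represented as a plain callable.

```python
def toy_model(selected_data):
    """Hypothetical stand-in for the data recognition model trained by the server."""
    score = sum(selected_data) / len(selected_data)
    return {"attribute": "device" if score > 0.5 else "human"}


class Server:
    def __init__(self, model):
        self._model = model

    def recognize(self, selected_data):
        """Server-side recognition: the device sends data and receives the result."""
        return self._model(selected_data)

    def export_model(self):
        """Hand the trained model to a device for on-device recognition."""
        return self._model


class Device:
    def __init__(self, server):
        self._server = server
        self._local_model = None

    def download_model(self):
        self._local_model = self._server.export_model()

    def recognize(self, selected_data):
        # Prefer the downloaded model; otherwise ask the server.
        if self._local_model is not None:
            return self._local_model(selected_data)
        return self._server.recognize(selected_data)


device = Device(Server(toy_model))
remote_result = device.recognize([0.9, 0.8, 0.7])  # data leaves the device
device.download_model()
local_result = device.recognize([0.9, 0.8, 0.7])   # data stays on the device
print(remote_result == local_result)
```

In the second mode the selected data never leaves the device, which is one way the privacy benefit mentioned above could be realized.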

Some embodiments may be implemented as a software (S/W) program that includes instructions stored in a computer-readable storage medium.

For example, the computer is a device capable of calling the stored instructions from the storage medium and operating according to the disclosed embodiments in accordance with the called instructions, and may include the device according to the disclosed embodiments or an external server connected to the device.

The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means that the storage medium does not include a signal or a current and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily in the storage medium. For example, the non-transitory storage medium may include not only non-transitory readable storage media such as a CD, DVD, hard disk, Blu-ray disc, USB, internal memory, memory card, ROM, or RAM, but also media in which data is stored temporarily, such as registers, caches, and buffers.

In addition, the method according to the disclosed embodiments may be provided as a computer program product.

The computer program product may include an S/W program, a computer-readable storage medium on which the S/W program is stored, or a product traded between a seller and a buyer.

For example, the computer program product may include a product in the form of an S/W program (e.g., a downloadable app) that is distributed electronically through a device manufacturer or an electronic market (e.g., Google Play Store, App Store). For electronic distribution, at least a part of the S/W program may be stored in a storage medium or may be generated temporarily. In this case, the storage medium may be a server of the manufacturer or of the electronic market, or a storage medium of a relay server.

Claims

1. A device for authenticating a voice input provided from a user, the device comprising:

a microphone configured to receive the voice input;

a memory storing one or more instructions; and

a processor configured to execute the one or more instructions,

wherein the processor, by executing the one or more instructions, obtains, from the voice input, signal characteristic data representing signal characteristics of the voice input, and authenticates the voice input by applying the obtained signal characteristic data to a first learning model for determining an attribute of the voice input, and

wherein the first learning model is trained to determine the attribute of the voice input based on a voice spoken by a person and a voice output from a device.

2. The device of claim 1,

wherein the signal characteristic data includes information regarding per-frequency cumulative power of the voice input.

3. The device of claim 2,

wherein the first learning model is trained to determine the attribute of the voice input differently according to the attenuation pattern of the per-frequency cumulative power of the voice spoken by the person and of the voice output from the device.

4. The device of claim 1,

wherein the processor, by executing the one or more instructions,

authenticates the user by applying the voice input to a second learning model for authenticating the user who has spoken the voice input based on a voice input pattern of the user.

5. The device of claim 4,

wherein the processor, by executing the one or more instructions,

determines whether first user authentication is required by applying the obtained signal characteristic data to the first learning model, and selectively applies the voice input to the second learning model based on a result of the determination.

6. The device of claim 4,

wherein the processor, by executing the one or more instructions,

further obtains context information including at least one of surrounding environment information of the device, state information of the device, device usage history information of the user, and schedule information of the user, and

authenticates the user by inputting the context information into the second learning model along with the voice input.

7. The device of claim 4,

wherein the processor, by executing the one or more instructions,

determines whether second user authentication is required by applying the voice input to the second learning model, and selectively performs additional authentication of the user who has spoken the voice input based on a result of the determination.

8. A method of authenticating a voice input provided from a user, the method comprising:

receiving the voice input;

obtaining, from the voice input, signal characteristic data representing signal characteristics of the voice input; and

authenticating the voice input by applying the obtained signal characteristic data to a first learning model for determining an attribute of the voice input,

wherein the first learning model is trained to determine the attribute of the voice input based on a voice spoken by a person and a voice output from a device.

9. The method of claim 8,

wherein the signal characteristic data includes information regarding per-frequency cumulative power of the voice input.

10. The method of claim 9,

wherein the first learning model is trained to determine the attribute of the voice input differently according to the attenuation pattern of the per-frequency cumulative power of the voice spoken by the person and of the voice output from the device.

11. The method of claim 8, further comprising:

authenticating the user by applying the voice input to a second learning model for authenticating the user who has spoken the voice input based on a voice input pattern of the user.

12. The method of claim 11, further comprising:

determining whether first user authentication is required by applying the obtained signal characteristic data to the first learning model; and

selectively applying the voice input to the second learning model based on a result of the determination.

13. The method of claim 11, further comprising:

obtaining context information including at least one of surrounding environment information of a device, state information of the device, device usage history information of the user, and schedule information of the user,

wherein the authenticating of the user comprises inputting the context information into the second learning model along with the voice input.

14. The method of claim 11, further comprising:

determining whether second user authentication is required by applying the voice input to the second learning model; and

selectively performing additional authentication of the user who has spoken the voice input based on a result of the determination.

15. The method of claim 14, further comprising:

obtaining context information including at least one of surrounding environment information of a device, state information of the device, device usage history information of the user, and schedule information of the user,

wherein the additional authentication of the user comprises authenticating the user who has spoken the voice input by applying the obtained context information to a third learning model for authenticating the user through a question and response.