WO2021172641A1

WO2021172641A1 - Device for generating control information on basis of utterance state of user, and control method therefor

Info

Publication number: WO2021172641A1
Application number: PCT/KR2020/003007
Authority: WO
Inventors: 파벨 그르제시아크그르제고르츠
Original assignee: 삼성전자 주식회사
Priority date: 2020-02-27
Filing date: 2020-03-03
Publication date: 2021-09-02
Also published as: KR20210109722A

Abstract

The present disclosure provides: a device for more accurately recognizing a speech input of a user, and performing a control operation by using the speech input of the user, wherein the speech input is recognized by analyzing a signal generated when the user makes an utterance, or by using vibration information to identify the signal generated by the utterance of the user from among audio signals input to the device; and a control method therefor.

Description

A device for generating control information based on a user's speech state and a control method therefor

The present disclosure relates to a device that performs voice recognition based on a user's utterance state and generates a control command according to the voice recognition, and to a control method of the device. More particularly, the present disclosure relates to a device that recognizes a voice command based on the user's actual utterance and generates a control command for controlling the device according to the recognized voice command, or that generates a control command for the device using the recognized voice command together with motion information according to the movement of a part of the user's body.

Due to the development of IT technology, device types, services, contents, and interfaces between devices and users are changing in various ways. Conventionally, as an interface between a device and a person, a contact-type interface by a user's direct touch through a predetermined input means has been mainly used. For example, a user input interface using a keyboard or mouse is used in a PC, and an interface in which a user intuitively touches a screen with a finger is mainly used in a smart phone. Recently, speech recognition as an interface between a device and a person has been attracting attention. Speech recognition is a technology that converts an acoustic speech signal obtained through a sound sensor such as a microphone into words or sentences. As voice recognition technology develops, a user inputs a voice into a device, and it becomes possible to control an operation of the device according to the voice input.

The conventional voice recognition technology is vulnerable to external noise caused by a conversation of an external user other than the user of the device, and thus it may be difficult to accurately recognize the user's voice. In addition, in the conventional voice recognition technology, an operation according to the user's voice input is performed after the user's voice input is completed, and proper feedback is not made during the user's voice input. Accordingly, there is a need for a technology capable of more accurately recognizing a user's voice input and performing a device control operation intended by the user even during the user's voice input.

An embodiment of the present disclosure may provide a device, and a method for controlling the same, that more accurately recognizes a user's voice input and performs a control operation based on the voice input, either by analyzing a signal generated when the user speaks or by using vibration information to identify, among the audio signals input to the device, the signal generated by the user's utterance.

An embodiment of the present disclosure may also provide a device and a control method that, when generating a control command related to a user's voice input, use the user's motion information to generate the control command and perform control based on the generated control command.

According to embodiments of the present disclosure, it is possible to more accurately recognize a user's voice input and perform a device control operation based on the user's movement information.

FIG. 1 is a diagram for explaining a method of controlling a device according to an embodiment of the present disclosure.

FIG. 2 is a block diagram of a device according to an embodiment of the present disclosure.

FIG. 3 is a diagram for explaining a voice recognition process of a device according to an embodiment of the present disclosure.

FIG. 4 is a diagram for explaining user movement information according to an embodiment of the present disclosure.

FIG. 5 is a diagram for explaining a voice recognition process and generation of control information using a user's motion information according to an embodiment of the present disclosure.

FIGS. 6 and 7 are diagrams for explaining a process of determining an attribute value of control information based on motion information according to an embodiment.

FIG. 8 is a diagram for explaining a process of determining an attribute value of control information based on motion information according to another embodiment of the present disclosure.

FIG. 9A is a diagram illustrating a control system including a device for generating control information and an external device to be controlled according to an exemplary embodiment.

FIG. 9B is a diagram illustrating a control system including a device for generating control information and an external device to be controlled according to another exemplary embodiment.

FIG. 10 is a flowchart of a method for a device to provide control information according to an embodiment.

FIG. 11 is a detailed flowchart of a method for a device to provide control information according to an embodiment of the present disclosure.

FIG. 12 is a flowchart illustrating a method of controlling an external device according to an embodiment of the present disclosure.

FIG. 13 is a flowchart illustrating a method of controlling an external device according to another embodiment of the present disclosure.

FIG. 14 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure.

As a technical means for achieving the above technical problem, a device according to an embodiment of the present disclosure includes: a memory in which at least one program is stored; a microphone for receiving an audio signal; a sensor module for acquiring vibration information according to the user's utterance state; and at least one processor configured to generate control information corresponding to the user's voice input identified from the audio signal by executing the at least one program, wherein the at least one program includes instructions for: identifying, based on the vibration information, the user's voice input from the audio signal received through the microphone; and generating control information corresponding to the user's voice input based on the identified user's voice input.

In addition, according to an embodiment of the present disclosure, a method for a device to provide control information for a user's voice input includes: receiving an audio signal; acquiring vibration information according to the user's utterance state; identifying the user's voice input included in the audio signal based on the vibration information; and generating control information corresponding to the user's voice input based on the identified user's voice input.

The terms used in the present disclosure have been selected, as far as possible, from general terms that are currently widely used, in consideration of their functions in the present disclosure; however, these terms may vary depending on the intention of a person skilled in the art, precedent, the emergence of new technology, and the like. In addition, in specific cases, some terms have been arbitrarily selected by the applicant, and in such cases their meaning will be described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on the meaning of the term and the overall contents of the present disclosure, rather than on the simple name of the term.

When a part "includes" a certain element throughout the specification, this means that other elements may be further included, rather than excluded, unless otherwise stated. In addition, terms such as "... unit" and "module" described in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware, as software, or as a combination of hardware and software.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them. However, the present disclosure may be implemented in several different forms and is not limited to the embodiments described herein. And in order to clearly explain the present disclosure in the drawings, parts irrelevant to the description are omitted, and similar reference numerals are attached to similar parts throughout the specification.

Various operations, blocks, steps, and the like in the flowcharts in the present disclosure may be performed in the illustrated order or in a different order. Also, at least some of the steps may be performed concurrently. In addition, in some embodiments, some of the operations, blocks, steps, etc. may be omitted, added, or modified without departing from the scope of the present disclosure.

In addition, in the present disclosure, the voice assistant service is a service that provides a response to a user's voice command by applying automated speech recognition (ASR) processing, natural language understanding (NLU) processing, dialogue management (DM) processing, natural language generation (NLG) processing, and text-to-speech (TTS) processing to an audio signal. In particular, in the present disclosure, the voice assistant service may be a service that recognizes a user's voice command and controls the operation of the device according to the corresponding voice command.

The artificial intelligence model is an artificial intelligence algorithm, and may be a model trained using at least one of machine learning, neural network, genetic, deep learning, and classification algorithms.

The model of the voice assistant service may be an artificial intelligence model in which standards and methods for providing feedback according to a user's voice command in the voice assistant service are learned. The model of the voice assistant service may include, for example, a model for recognizing a user's input voice, a model for interpreting the user's input voice, and a model for generating a control command according to the user's input voice, but is not limited thereto. Each of the models constituting the model of the voice assistant service may be an artificial intelligence model to which an artificial intelligence algorithm is applied.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

Referring to FIG. 1, the device 1000 is in contact with the user's body, and detects a voice input made by the user's utterance as well as the user's movement. FIG. 1 illustrates a smart earphone attached to a user's ear as an example of the device 1000. In the present disclosure, a smart earphone is a device that, in addition to outputting an audio signal, can provide a voice assistant service capable of acquiring a user's voice input and the user's movement information and performing a control operation according to the voice input and the movement information. The device 1000 is not limited to contacting only one part of the user's body; like a pair of smart earphones that are attached to and detached from the user's left and right ears, the device 1000 may represent a set of a plurality of devices that are in contact with a plurality of body parts.

The device 1000 is not limited to the earphone illustrated in FIG. 1, and in the present disclosure may refer to any electronic device that is in contact with the user's body and can obtain the user's voice input and movement information. For example, the device 1000 may be a wearable device such as augmented reality (AR) glasses, a smart watch, a smart lens, a smart bracelet, or smart clothing. Also, the device 1000 may be a mobile device used by a user, such as a smart phone, a smart tablet, a computer, or a notebook computer. Without being limited to the above-described examples, the device 1000 may include various electronic devices capable of detecting a voice input made by a user's utterance and the user's movement information.

The device 1000 generates corresponding control information based on the user's voice input. Also, the device 1000 may generate control information by using the user's movement information in addition to the user's voice input. The control information may be a command for controlling the device 1000 itself. For example, when the device 1000 is a smart earphone, the control information of the device 1000 may include control commands for volume up/down, mute, forward and backward operations for changing the track of music output through the smart earphone, track play, and pause. Also, the control information may be a control command for controlling an external device. The control information for controlling the external device may be information for controlling the movement of the external device or information for controlling an output signal output from the external device. The control command of the device 1000 may vary according to the type and function of the device to be controlled.

The speech recognition process performed by the device 1000 may be divided into an embedded method and a non-embedded method according to the entity that performs the speech recognition. In the embedded method, control information may be generated based on a user's voice input and movement information by a voice assistant program installed by default in the device 1000. The generated control information may be used to control the device 1000 itself, or may be transmitted to an external device connected through a network and used to control the operation of the external device. In the non-embedded method, the device 1000 may transmit the user's voice input and motion information to an external device connected through a network, and the external device may generate control information based on the user's voice input and motion information. The control information generated by the external device may in turn be used to control the device 1000 or another external device. Specifically, in the non-embedded method, the device 1000 transmits the user's voice input and motion information to an external device, the external device generates control information based on the user's voice input and motion information, and the device 1000 may receive the control information generated by the external device and perform a control operation based on the received control information. Alternatively, the device 1000 transmits the user's voice input and motion information for controlling a second external device to a first external device, the first external device generates control information based on the user's voice input and motion information, and the second external device may receive the control information generated by the first external device and perform a control operation based on the received control information.

Referring to FIG. 1, the device 1000 receives a voice input uttered by the user. Specifically, the device 1000 receives an audio signal through an input means such as a microphone, excludes from the received audio signal any non-voice section not caused by the user's utterance, and identifies the voice input actually uttered by the user. The device 1000 may detect vibrations generated in the larynx when the user speaks, and may determine that the portion of the input audio signal received while the vibration is sensed is the user's voice input. Also, in addition to sensing vibration, the device 1000 may analyze an audio signal input through a microphone to identify an utterance section caused by the user's utterance. In an embodiment, the device 1000 may extract a feature vector of the received audio signal using any one of feature vector extraction techniques such as cepstrum, linear predictive coefficient (LPC), mel frequency cepstral coefficient (MFCC), and filter bank energy, and may analyze the feature vector to identify the voice input uttered by the user. The above-described speech feature vector extraction techniques are merely examples, and the feature vector extraction technique used in the present disclosure is not limited to them.
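As a non-limiting illustration, the vibration-based identification described above may be sketched as a simple gate over audio frames. The function name `gate_audio_by_vibration`, the per-frame vibration level, and the threshold value are assumptions introduced only for this sketch; the disclosure does not specify how the vibration is quantified or thresholded.

```python
from typing import List, Sequence


def gate_audio_by_vibration(
    audio_frames: Sequence[Sequence[float]],
    vibration_levels: Sequence[float],
    threshold: float,
) -> List[Sequence[float]]:
    """Keep only the audio frames captured while vibration was sensed.

    Assumption: one vibration level is available per audio frame.
    """
    if len(audio_frames) != len(vibration_levels):
        raise ValueError("one vibration level is expected per audio frame")
    return [
        frame
        for frame, level in zip(audio_frames, vibration_levels)
        if level >= threshold  # vibration present: treat as the user's voice
    ]


# Hypothetical data: frames 1 and 2 coincide with sensed vibration.
frames = [[0.0, 0.1], [0.4, 0.5], [0.3, 0.2], [0.0, 0.0]]
levels = [0.01, 0.80, 0.75, 0.02]
voiced = gate_audio_by_vibration(frames, levels, threshold=0.5)
```

Frames that arrive without accompanying vibration are discarded here, which corresponds to excluding the non-voice section before feature extraction.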

The device 1000 may extract a feature vector attributable to the user's utterance by applying a deep neural network (DNN) model to the feature vector extracted from the audio signal. The features of the user's voice input signal may be expressed as a user feature vector. Specifically, the device 1000 may extract the user feature vector by applying the DNN to the speech feature vector extracted from the input audio signal. The device 1000 may obtain the user feature vector by training the DNN model with the speech feature vector as the input value and a feature value related to the user as the output value. The deep neural network model may include at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), and a generative adversarial network (GAN), but is not limited to the examples listed above. The deep neural network model used by the device 1000 of the present disclosure may include all types of currently known neural network models.

In an embodiment, the device 1000 may include an automatic speech recognition (ASR) model. The ASR model is a speech recognition model that recognizes speech and may output text from a user's speech input. The ASR model may be, for example, an artificial intelligence model including an acoustic model, a pronunciation dictionary, and a language model. Alternatively, the ASR model may be, for example, an end-to-end speech recognition model having a structure including an integrated neural network without separately including an acoustic model, a pronunciation dictionary, and a language model. The end-to-end ASR model uses an integrated neural network to convert speech into text directly, without first recognizing phonemes from the speech and then converting the phonemes into text. The text may include at least one character. Characters refer to symbols used to express and write human language in a visible form. For example, the characters may include Hangul, alphabets, Chinese characters, numbers, diacritics, punctuation marks, and other symbols. Also, for example, the text may include a character string. A character string refers to a sequence of characters. For example, the text may include at least one grapheme. A grapheme is the smallest unit of a writing system that represents a sound, and is composed of at least one letter. For example, in the case of an alphabetic writing system, one letter may be a grapheme, and a character string may mean a sequence of graphemes. For example, text may include morphemes or words. A morpheme is the smallest unit having a meaning, and is composed of at least one grapheme. A word is a basic unit of a language that can be used independently or that exhibits a grammatical function, and consists of at least one morpheme.

The device 1000 may receive the user's voice input from the audio signal and obtain text from the user's voice input using the ASR model. The device 1000 may analyze the meaning of the acquired user's voice input to generate a corresponding control command.

The device 1000 may include a sensor module capable of detecting a user's movement state, and may obtain user movement information. According to an embodiment, the sensor module provided in the device 1000 includes at least one of a gesture sensor, a gyroscope sensor, and an accelerometer sensor. The device 1000 may detect a user's movement, rotation, etc. through the provided sensor, and may generate an electrical signal or data value related to the sensed movement. The device 1000 measures the amount of change in pitch, roll, and yaw with respect to the three axes x, y, and z to obtain tilt information of the device 1000 and the acceleration in each axis direction, and may obtain the user's movement information using the tilt information and acceleration obtained for the three axes. Roll represents a rotational movement about the x-axis, pitch a rotational movement about the y-axis, and yaw a rotational movement about the z-axis. Also, the device 1000 may identify the user's utterance state by detecting changes in pitch, roll, and yaw caused by the movement of the user's jaw.
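By way of example only, the amount of change in roll, pitch, and yaw may be approximated by accumulating gyroscope readings over time. The sketch below assumes angular rates in degrees per second sampled at a fixed interval `dt`; the function name `integrate_gyro` is hypothetical, and the simple rectangular integration ignores sensor drift and coupling between axes, which a practical implementation would need to address.

```python
from typing import Iterable, Tuple


def integrate_gyro(
    rates_dps: Iterable[Tuple[float, float, float]],
    dt: float,
) -> Tuple[float, float, float]:
    """Accumulate roll, pitch, and yaw changes (degrees) from angular rates.

    rates_dps: samples of (x, y, z) angular rate in degrees per second.
    dt: sampling interval in seconds (assumed constant).
    """
    roll = pitch = yaw = 0.0
    for rate_x, rate_y, rate_z in rates_dps:
        roll += rate_x * dt    # rotation about the x-axis
        pitch += rate_y * dt   # rotation about the y-axis
        yaw += rate_z * dt     # rotation about the z-axis
    return roll, pitch, yaw


# Hypothetical readings: 0.5 s of data at 100 Hz while the head tilts up.
samples = [(0.0, 20.0, -5.0)] * 50
roll, pitch, yaw = integrate_gyro(samples, dt=0.01)
```

With these hypothetical readings the pitch change is about +10 degrees and the yaw change about -2.5 degrees.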

Referring to FIG. 2, a device 2000 may include an input unit 2100, a memory 2200, a sensing unit 2300, and a processor 2400. The device 2000 may further include a communication unit 2500 and an output unit 2600. The device 2000 is not limited to the illustrated block diagram, and may include more components or may omit some components. For example, when the device 2000 is a smart earphone, the device 2000 may include an output module such as a speaker for outputting an audio signal.

The input unit 2100 receives an audio signal from outside the device 2000. For example, the input unit 2100 may include a microphone. The external audio signal may include a user's voice input. The input unit 2100 may include other means for inputting data for controlling the device 2000 according to the type of the device 2000. For example, the input unit 2100 may include a key pad, a switch, a touch pad (contact capacitive method, pressure resistance film method, infrared sensing method, surface ultrasonic conduction method, integral tension measurement method, piezo effect method, etc.), a jog wheel, a jog switch, and the like, but is not limited thereto.

The processor 2400 controls the overall operation of the device 2000. In addition, the processor 2400 may be configured to process instructions of a computer program by performing arithmetic, logic, and input/output operations and signal processing. The instructions of the computer program are stored in the memory 2200, and may be provided to the processor 2400 from the memory 2200. In the following embodiments, functions and/or operations performed by the processor 2400 may be implemented by the processor 2400 executing instructions received according to computer program code stored in a recording device such as the memory 2200.

The processor 2400 may be configured as, for example, at least one of a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), and field programmable gate arrays (FPGAs), but is not limited thereto. In an embodiment, when the device 2000 is a mobile device, the processor 2400 may be an application processor (AP) that executes an application.

The processor 2400 may obtain the user's voice input from the audio signal received through the input unit 2100. The processor 2400 may execute an application that performs an operation of the device 2000 or an external device based on the user's voice input, and may receive, through the executed application, a further voice input from the user for additionally controlling the device 2000 or the external device. For example, when a voice input for executing a predetermined voice assistant application such as “S Voice” or “Bixby” is received, the processor 2400 may execute the corresponding voice assistant application and may generate a control command based on the additionally input user's voice input and the user's movement information.

The processor 2400 excludes from the received audio signal any non-voice section not caused by the user's utterance, and identifies the voice input actually uttered by the user. The processor 2400 may analyze an audio signal input through the input unit 2100 to identify an utterance section caused by the user's utterance. The processor 2400 extracts a feature vector of the received audio signal and analyzes the feature vector to identify the voice input uttered by the user. Also, the processor 2400 may extract a feature vector attributable to the user's utterance by applying a deep neural network (DNN) to the feature vector extracted from the audio signal.

Using an acoustic model, a language model, and a pronunciation lexicon, the processor 2400 may compare the feature vector input during the user's utterance section with each model and score the comparison results to obtain a word string for the input speech signal.
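Purely as an illustrative sketch, the scoring step may be pictured as combining per-candidate scores from the acoustic model and the language model and selecting the highest-scoring word string. The candidate strings, the log-probability values, the weighting parameter `lm_weight`, and the function name `best_word_string` are all assumptions made for this example; the pronunciation lexicon is omitted for brevity, and the disclosure does not prescribe a particular scoring formula.

```python
from typing import Dict


def best_word_string(
    acoustic_logp: Dict[str, float],
    language_logp: Dict[str, float],
    lm_weight: float = 1.0,
) -> str:
    """Return the candidate word string with the highest combined log score."""

    def score(candidate: str) -> float:
        # Candidates unknown to the language model receive a very low score.
        lm_score = language_logp.get(candidate, -1e9)
        return acoustic_logp[candidate] + lm_weight * lm_score

    return max(acoustic_logp, key=score)


# Hypothetical scores: the acoustic model slightly prefers "volume cup",
# but the language model strongly prefers "volume up".
acoustic = {"volume up": -4.1, "volume cup": -3.9}
language = {"volume up": -1.2, "volume cup": -6.0}
best = best_word_string(acoustic, language)
```

In this example the combined score selects "volume up", showing how a language model can correct an acoustically plausible but unlikely word string.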

The processor 2400 may generate a control command for controlling the device 2000 or an external device by combining the user's voice signal input through the input unit 2100 with movement information according to the user's movement obtained from the sensing unit 2300. The processor 2400 may determine the type of the control command based on the user's voice signal, and determine the attribute value of the control command based on the user's motion information. For example, when a voice input such as "volume" is received, the processor 2400 identifies the user's intention to control the volume among the control operations of the device 2000 or an external device, and may determine the attribute value of the volume using the user's motion information.

The processor 2400 may obtain information on the user's up, down, left, right, forward, or backward movement through the sensing unit 2300, and may generate a control command according to the user's movement. Specifically, when a user movement in an upward or forward direction is sensed through the sensing unit 2300, the processor 2400 may increase an attribute value related to a control command. Also, when a user movement in a downward or backward direction is sensed through the sensing unit 2300, the processor 2400 may decrease an attribute value related to a control command. For example, when the device 2000 is a device worn on the user's ear, such as a smart earphone, and a voice input of "volume" or "volume up" is recognized from the user's utterance, the processor 2400 may generate a control command for performing a volume-up operation when the user's head moves from bottom to top. Likewise, when a voice input of "volume" or "volume down" is recognized, the processor 2400 may generate a control command for performing a volume-down operation when the user's head moves from top to bottom. The direction in which the attribute value is changed according to the user's motion information is not limited thereto and may be changed.
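The combination of a voice-derived command type and a motion-derived attribute value may be sketched, under stated assumptions, as follows. The `ControlCommand` structure, the dead zone of 5 degrees, and the scale of 0.2 volume steps per degree are hypothetical values chosen only for illustration, and a positive pitch change is assumed to mean an upward head movement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlCommand:
    kind: str    # type of command, determined from the voice input
    delta: int   # signed change of the attribute value, from motion


def make_command(
    voice_text: str,
    pitch_change_deg: float,
    dead_zone_deg: float = 5.0,
    step_per_deg: float = 0.2,
) -> Optional[ControlCommand]:
    """Build a volume command whose magnitude follows the head movement."""
    if "volume" not in voice_text.lower():
        return None  # only the "volume" intent is handled in this sketch
    if abs(pitch_change_deg) < dead_zone_deg:
        return ControlCommand("volume", 0)  # ignore small, unintended motion
    # Upward movement (positive pitch) raises the value, downward lowers it.
    return ControlCommand("volume", round(pitch_change_deg * step_per_deg))


up = make_command("Volume", pitch_change_deg=20.0)
down = make_command("volume", pitch_change_deg=-15.0)
```

Here `up` carries a change of +4 steps and `down` a change of -3 steps; the dead zone is one possible way to avoid reacting to incidental head motion while the user speaks.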

In addition, when the device 2000 or the external device to be controlled is a physically movable device (e.g., a robot vacuum cleaner, a pet robot, or a housekeeping robot), the processor 2400 may generate a control command for controlling the movement of the controlled target according to the user's up, down, left, right, forward, or backward motion information obtained through the sensing unit 2300.

In addition, when the device 2000 or the external device to be controlled includes a display module for visually outputting information, the processor 2400 may generate a control command for controlling the scrolling of an image output through the display module according to the user's up, down, left, right, forward, or backward motion information obtained through the sensing unit 2300.

The sensing unit 2300 includes an acceleration sensor 2310 and a gyroscope sensor 2320, and may acquire user movement information. The sensing unit 2300 may detect a user's motion and generate an electrical signal or data value related to the sensed motion. The sensing unit 2300 measures the amount of change in pitch, roll, and yaw with respect to the three axes x, y, and z to obtain tilt information of the device 2000 and the acceleration in each axis direction, and may obtain the user's movement information using the tilt information and acceleration obtained for the three axes. Also, the sensing unit 2300 may detect a vibration caused by the user's utterance to identify the user's utterance state.

The memory 2200 may store instructions set so that the processor 2400 generates a control command for controlling the device 2000 or an external device based on the user's voice input and motion information. The memory 2200 may include, for example, random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and programmable read-only memory (PROM), but is not limited to the above-described examples.

The device 2000 may communicate with an external device through a predetermined network using the communication unit 2500. The communication unit 2500 may include one or more communication processors supporting wired communication or wireless communication. The network may include a local area network (LAN), a wide area network (WAN), a value added network (VAN), a mobile radio communication network, a satellite network, and combinations thereof. The network is a data communication network in a comprehensive sense that enables the network entities to communicate smoothly with each other, and may include the wired Internet, the wireless Internet, and a mobile wireless communication network. Wireless communication may include, for example, wireless LAN (Wi-Fi), Bluetooth, Bluetooth Low Energy, Zigbee, Wi-Fi Direct (WFD), ultra wideband (UWB), infrared communication (IrDA, Infrared Data Association), and near field communication (NFC), but is not limited thereto.

The output unit 2600 outputs a sound signal or a video signal to the outside. Depending on the type of the device 2000, the output unit 2600 may include a speaker or a receiver that outputs a sound signal to the outside, or a display module that visually provides information to the outside.

FIG. 3 is a diagram for explaining a voice recognition process of the device 2000 according to an embodiment of the present disclosure.

Referring to FIG. 3, the processor 2400 of the device 2000 may detect a vibration signal 310 caused by the user's utterance from the audio signal 330 obtained from the input unit 2100, or may detect the vibration signal 310 caused by the user's utterance from the sensing unit 2300. The processor 2400 identifies a time interval T1 from t1 to t2, a time interval T2 from t3 to t4, and a time interval T3 from t5 to t6, which are the intervals in which the vibration is detected, as the intervals in which the user actually spoke. The processor 2400 may remove, as noise, the portions of the audio signal 330 in which vibration is not detected, and may identify the user voice input 320 in word units using only the audio signal received during the time intervals in which vibration is detected. The external noise signal may be removed by applying the ASR algorithm.

Specifically, assume that the user utters the word "Bixby" during the time interval T1 from t1 to t2, "volume" during the time interval T2 from t3 to t4, and "UP" during the time interval T3 from t5 to t6. The processor 2400 may use vibration information generated while the user speaks in order to identify, from the received audio signal 330, the voice input actually uttered by the user. When the device 2000 is a device attached to the user's ear, such as a smart earphone, the vibration generated when the user speaks may be transmitted through the bone and detected by the input unit 2100 or the sensing unit 2300 of the device 2000 attached to the ear. In addition, the processor 2400 may analyze the vibration signal 310 to identify whether the user is speaking and to identify, within the input audio signal 330, the utterance sections caused by the user's utterance. The processor 2400 may determine that the audio signal received in T1, T2, and T3, which are the sections of the audio signal 330 in which vibration is detected, is the voice input actually uttered by the user. The processor 2400 may determine that the audio signal received in sections other than T1, T2, and T3, in which no vibration is detected, is noise or an audio signal generated by another, external user. In the above-described example, the processor 2400 analyzes the voice input received in the sections T1, T2, and T3, in which vibration caused by the user's utterance is sensed through the vibration signal 310, and may recognize the user voice inputs “Bixby” uttered during the time interval T1 from t1 to t2, “volume” uttered during the time interval T2 from t3 to t4, and “UP” uttered during the time interval T3 from t5 to t6.
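As one possible illustration of this example, the sketch below first derives utterance intervals such as T1, T2, and T3 from a sampled vibration signal and then keeps only the recognized words whose time spans overlap those intervals. The function names, the amplitude threshold, the sample rate, and the word timestamps are hypothetical; in particular, the word "hello" stands for speech from another person that arrives while no vibration is sensed.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Word = Tuple[str, float, float]  # (text, start time, end time) in seconds


def utterance_intervals(
    vibration: Sequence[float], sample_rate: float, threshold: float
) -> List[Interval]:
    """Return (start, end) times where |vibration| is at least the threshold."""
    intervals: List[Interval] = []
    start = None
    for i, value in enumerate(vibration):
        active = abs(value) >= threshold
        if active and start is None:
            start = i  # vibration begins
        elif not active and start is not None:
            intervals.append((start / sample_rate, i / sample_rate))
            start = None  # vibration ends
    if start is not None:  # signal ended while vibration was still present
        intervals.append((start / sample_rate, len(vibration) / sample_rate))
    return intervals


def keep_user_words(words: Sequence[Word], intervals: Sequence[Interval]) -> List[str]:
    """Keep words whose time span overlaps any vibration interval."""
    return [
        text
        for text, w_start, w_end in words
        if any(w_start < end and w_end > begin for begin, end in intervals)
    ]


# Hypothetical vibration samples at 2 Hz: three bursts of vibration.
vibration = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
intervals = utterance_intervals(vibration, sample_rate=2.0, threshold=0.5)
words = [("Bixby", 1.0, 2.0), ("hello", 2.2, 2.8),
         ("volume", 3.0, 4.0), ("up", 5.0, 6.0)]
user_words = keep_user_words(words, intervals)
```

The three detected intervals play the role of T1, T2, and T3, and the word spoken between them is rejected because no vibration accompanies it.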

In addition, the sensing unit 2300 of the device 2000 may acquire the user's motion information 340 after the time t1 at which the user's voice input is determined to have started, and control information may be generated using the user's voice input and the user's motion information.

As described above, when the device 1000 is carried by the user or attached to at least a portion of the user's body, the device 1000 may detect the user's movement, rotation, and the like through a sensor, and may generate an electrical signal or data value for the detected movement. The motion information may be information on roll, pitch, and yaw obtained based on the three axes x, y, and z. Roll indicates a rotational movement about the x-axis, pitch indicates a rotational movement about the y-axis, and yaw indicates a rotational movement about the z-axis. FIG. 4 shows motion information according to the user's motion in the pitch direction among roll, pitch, and yaw.

The device 1000 may acquire user movement information by detecting changes in roll, pitch, and yaw acquired for a predetermined time. For example, if the reference value of the pitch angle when the user is facing the front is 0 degrees (deg), the pitch angle may be set to increase from 0 degrees in the + direction when the user raises his/her head relative to the front and, conversely, to decrease from 0 degrees in the - direction when the user lowers his/her head relative to the front. Similarly, when the user performs a roll motion of tilting the head clockwise or counterclockwise while looking at the front, the clockwise direction, in which the head tilts toward the user's right ear, may be set to the + direction, and the counterclockwise direction, in which the head tilts toward the user's left ear, may be set to the - direction. Likewise, in the case of a yaw motion in which the user rotates the head about the crown, when the user's crown is viewed from above, the direction in which the user turns the head clockwise from the front toward the right ear may be set to the + direction, and the direction in which the user turns the head counterclockwise from the front toward the left ear may be set to the - direction. The + and - directions of the pitch angle, the roll angle, and the yaw angle may be changed. The device 1000 may estimate the user's movement after a predetermined reference time by measuring the changes in the pitch angle, the roll angle, and the yaw angle obtained after that reference time.
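
As a minimal sketch of measuring movement relative to a reference time, the change in each angle can be taken as the difference from the orientation captured at that reference time. The `Orientation` type and the function names are hypothetical, and the sign convention (raising the head gives a positive pitch change) is the one assumed in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orientation:
    """Head orientation in degrees: roll (x-axis), pitch (y-axis), yaw (z-axis)."""
    roll: float
    pitch: float
    yaw: float


def movement_since(reference: Orientation, current: Orientation) -> Orientation:
    """Change in each angle relative to the orientation at the reference time."""
    return Orientation(
        current.roll - reference.roll,
        current.pitch - reference.pitch,
        current.yaw - reference.yaw,
    )


def pitch_direction(delta_pitch: float) -> str:
    """Map a pitch change to a movement direction under the assumed convention."""
    if delta_pitch > 0:
        return "+"  # head raised
    if delta_pitch < 0:
        return "-"  # head lowered
    return "none"


reference = Orientation(roll=1.0, pitch=5.0, yaw=-2.0)
current = Orientation(roll=1.0, pitch=20.0, yaw=-2.0)
delta = movement_since(reference, current)
```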

According to an embodiment, the device 2000 may determine the user's movement direction and movement size from the user's movement information obtained after the start point of the utterance section, and may generate control information based on the determined movement direction and movement size. Specifically, the device 2000 may determine, based on the movement direction and the movement size, the amount of change in a specific attribute value associated with the control command determined from the user's voice input.

The attribute value is a parameter related to the control command, and may be determined according to the type of the device to be controlled and the type of the control command. For example, when the device to be controlled is a speaker or earphone outputting sound and the control command is related to the volume, the attribute value determined based on the user's movement may be the volume level. In addition, when the device to be controlled can be driven by itself, such as a robot cleaner, a drone, or a robot pet, and the control command is related to movement (e.g., "Move"), the attribute value determined based on the user's movement may be a movement direction and a movement speed of the device to be controlled along the x, y, and z axes. The processor 2400 may generate a control command to control the forward, backward, left, and right movement and the movement speed of the drivable device to be controlled according to the user's movement. When the device to be controlled is a smart light bulb and the control command is related to a change in illuminance, such as “change lighting”, the processor 2400 may increase or decrease the brightness of the light bulb based on the user's movement information, similarly to the volume control described above. When the device to be controlled is a display device that outputs an image and the control command is related to movement of the output image, such as “move image” or “scroll image”, the attribute value may indicate the direction and size of the movement of the output image. For example, the processor 2400 may generate a control command to move the output image in a direction consistent with the user's head movement. Also, when the control command is related to the power of the controlled device, the attribute value may be an on/off value of the controlled device.
When the user's voice input "power" is received, the processor 2400 may turn the power of the controlled device on or off based on the user's movement information. Which of the x, y, and z axis directions is set as the on/off direction may be changed.
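
The examples above pair a device type and a control command with the attribute that the user's movement adjusts. A simple lookup table is one way to picture that pairing. The table below is a hypothetical illustration: the keys and labels only paraphrase the examples in the text, and the disclosure does not state that a table is used.

```python
from typing import Optional

# Hypothetical lookup table: (device type, control command) -> adjusted attribute.
ATTRIBUTE_BY_DEVICE_AND_COMMAND = {
    ("speaker", "volume"): "volume level",
    ("earphone", "volume"): "volume level",
    ("robot cleaner", "move"): "movement direction and speed",
    ("drone", "move"): "movement direction and speed",
    ("robot pet", "move"): "movement direction and speed",
    ("smart light bulb", "change lighting"): "brightness",
    ("display", "move image"): "image movement direction and size",
    ("display", "scroll image"): "image movement direction and size",
}


def attribute_for(device_type: str, command: str) -> Optional[str]:
    """Return the attribute the user's movement adjusts, or None if unknown."""
    device = device_type.lower()
    cmd = command.lower()
    if cmd == "power":
        # In the text, a power command toggles on/off regardless of device type.
        return "on/off"
    return ATTRIBUTE_BY_DEVICE_AND_COMMAND.get((device, cmd))
```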

When a predefined keyword such as “Bixby” is identified, the device 2000 may execute a voice assistant application and then determine the type of control command from the user's voice input. Referring to FIG. 5, when the user voice input “Volume” is recognized after the keyword “Bixby”, the device 2000 may determine that the user's intention is to control the volume. Alternatively, the device 2000 may determine the type of the control command by directly identifying a word such as “Volume”, without a keyword such as “Bixby” being input in advance. When vibration caused by the user's utterance is sensed through the sensing unit 2300, the device 2000 may recognize the user's voice input received after the vibration is sensed and perform voice processing.

Meanwhile, the device 2000 may preset a control command corresponding to a quiet voice input specified in advance by the user, and, when that quiet voice input is received, may generate a control command based on the user's movement information without a separate voice recognition process. For example, a control command corresponding to a quiet murmur preset by the user may be registered in advance. When the user's murmur is input through the input unit 2100, the device 2000 determines whether it corresponds to the preset murmur by comparing it with a preset pattern. If it is determined that the preset murmur signal has been received, the separate voice recognition process may be skipped, and an attribute value of the preset control command may be determined based on the user's motion information. Assuming that the control command corresponding to the preset murmur is "volume", the device 2000 receives the user's murmur signal and determines whether it matches the preset signal pattern for volume control. When it is determined that the user's murmur signal is the signal for volume control, the device 2000 may determine volume control as the control command type, and then determine the volume value based on the acquired movement information of the user. As such, when a control command corresponding to a preset quiet voice signal or murmur is registered in advance, the user may control the device 2000 through movement by uttering only the preset quiet voice signal, without speaking the voice command aloud.
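
The disclosure does not specify how the murmur is compared with the preset pattern. The sketch below substitutes cosine similarity between equal-length feature vectors purely as an illustration. The feature representation, the 0.9 threshold, and the names `cosine_similarity` and `match_murmur` are all assumptions.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_murmur(
    signal: List[float],
    presets: Dict[str, List[float]],
    min_similarity: float = 0.9,  # assumed threshold
) -> Optional[str]:
    """Return the preset command whose pattern best matches, or None."""
    best_command, best_score = None, min_similarity
    for command, pattern in presets.items():
        score = cosine_similarity(signal, pattern)
        if score >= best_score:
            best_command, best_score = command, score
    return best_command


# Toy feature vectors standing in for registered murmur patterns.
presets = {"volume": [0.0, 1.0, 0.0, 1.0], "power": [1.0, 0.0, 0.0, 0.0]}
matched = match_murmur([0.0, 0.9, 0.1, 1.0], presets)
```

When a command is matched, speech recognition can be skipped and the movement information alone determines the attribute value.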

As described above, the device 2000 may determine an attribute value related to the user's control command by using the user's movement information input after the user's utterance section. The device 2000 may determine the user's movement direction and movement size from the user's pitch, roll, and yaw movements acquired by the sensing unit 2300 with respect to the x, y, and z axes, and may determine an attribute value related to the control command based on the movement direction and the movement size.

The processor 2400 may determine the user's movement by using the movement direction and movement magnitude information obtained in the xyz axis direction. According to an embodiment, in order to determine the user's movement, the processor 2400 may use the movement direction input in each axis direction or an extreme value of an angle obtained based on the xyz axis.

As shown in FIG. 4, a case in which the user moves his/her head up and down while looking at the front will be described as an example. The pitch angle at the reference time at which the user's motion information starts to be acquired is taken as the reference pitch angle of 0 degrees, and it is assumed that the pitch angle increases from 0 degrees in the + direction when the user raises his/her head and, conversely, decreases from 0 degrees in the - direction when the user lowers his/her head. A change in which the pitch angle increases in the + direction or decreases in the - direction may be referred to as a movement direction. By analyzing the amount of change in the pitch angle, the processor 2400 may determine whether the user has raised or lowered his/her head. In the above example, raising the head may be defined as a pitch movement in the + direction, in which the pitch angle increases, and lowering the head may be defined as a pitch movement in the - direction, in which the pitch angle decreases.

When the user raises his/her head after uttering the control command "volume" or "volume up", the processor 2400 may determine the user's movement from the motion information in which the pitch angle increases, and may increase the volume of the current device when the pitch angle increases after the time of the user's utterance. Likewise, when the user lowers his/her head after uttering a control command such as "volume" or "volume down", the processor 2400 may determine the user's movement from the motion information in which the pitch angle decreases, and may decrease the volume of the current device when the pitch angle decreases after the time of the user's utterance.

Meanwhile, since an actual user's movement continually changes minutely along the x, y, and z axes, the movement information obtained along those axes may repeatedly increase and decrease. When the movement in the pitch, roll, and yaw directions changes over a predetermined time, the user's movement may be analyzed using extreme values.

Referring to FIG. 5, when motion information 510 having two maximum values P1 and P2 is obtained with respect to the pitch angle, the processor 2400 may determine that the user has raised his/her head twice. That is, when the pitch angle increases and then decreases, the processor 2400 may determine that the user intended to raise the head, and that intention appears as a graph in which the pitch angle has two maximum values. When it is determined that the user raises his/her head twice together with the control command "volume" or "volume up", the processor 2400 may increase the volume of the current device by two steps.

Similarly, when the motion information 520 having two minimum values P3 and P4 with respect to the pitch angle is obtained, the processor 2400 may determine that the user performs the action of bowing the head down twice. When it is determined that the user lowers the head twice together with the control command "volume" or "volume down", the processor 2400 may decrease the volume of the current device by two steps.

The processor 2400 may detect the user's movement only when the movement is greater than or equal to a predetermined threshold, in order to prevent the control operation from reacting to minute movements of the user. In the above-described example, the processor 2400 may determine that the user has raised his/her head twice only when the pitch angles θ1 and θ2 of the two maximum values P1 and P2 are greater than a predetermined upper threshold value. If the pitch angle at an extreme value is smaller than the predetermined upper threshold value, the processor 2400 may determine that the user did not raise his/her head in order to perform the volume up operation. Similarly, the processor 2400 may determine that the user has lowered the head twice only when the pitch angles θ3 and θ4 of the two minimum values P3 and P4 are smaller than a predetermined lower threshold value. As the absolute value of the upper or lower threshold value is set smaller, control information is generated that responds more sensitively to the user's movement. To prevent the device from reacting too sensitively to small movements of the user, the absolute value of the upper or lower threshold value may be increased.
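
The extreme-value test with upper and lower thresholds can be sketched as a scan for local maxima and minima in a sampled pitch series. This is an illustrative sketch under assumed conditions (uniformly sampled angles in degrees, a three-sample neighbourhood for detecting extremes). The function name and the sample values are hypothetical and are not taken from FIG. 5.

```python
from typing import List, Tuple


def count_head_gestures(
    pitch: List[float], upper: float, lower: float
) -> Tuple[int, int]:
    """Count head raises and head drops in a sampled pitch-angle series.

    A raise is a local maximum above `upper`; a drop is a local minimum
    below `lower`. Extremes between the two thresholds are ignored.
    """
    raises = 0
    drops = 0
    for prev, cur, nxt in zip(pitch, pitch[1:], pitch[2:]):
        if cur > prev and cur >= nxt and cur > upper:
            raises += 1
        elif cur < prev and cur <= nxt and cur < lower:
            drops += 1
    return raises, drops


# Two clear nods upward (25 and 28 degrees) and one small wobble (4 degrees).
up_series = [0, 10, 25, 10, 2, 12, 28, 9, 0, 4, 0]
# Two clear nods downward (-25 and -30 degrees).
down_series = [0, -25, -5, -30, 0]
```

With an upper threshold of 20 degrees and a lower threshold of -20 degrees, the small wobble in `up_series` does not count as a gesture, which is the filtering effect described above.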

Meanwhile, the device 2000 may determine an attribute value related to a control command based on the size of the movement as well as the direction of the movement. Specifically, the device 2000 may determine the attribute value related to the control command so that it is linearly proportional or inversely proportional to the movement size. For example, the processor 2400 may increase the volume of the current device by one step when motion information with a pitch angle of 30 degrees is obtained along with the control command “volume” or “volume up”, and by two steps when motion information with a pitch angle of 60 degrees is obtained.
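
The proportional mapping in this example (30 degrees for one step, 60 degrees for two) can be sketched as an integer division. The 30-degrees-per-step constant is inferred from those two data points, and the handling of intermediate and negative angles is an assumption. The function name is hypothetical.

```python
def volume_steps_from_pitch(pitch_deg: float, degrees_per_step: float = 30.0) -> int:
    """Number of volume steps for a pitch angle, proportional to its size.

    Positive angles (head raised) give positive steps and negative angles
    give negative steps. Partial steps are discarded.
    """
    magnitude = int(abs(pitch_deg) // degrees_per_step)
    return magnitude if pitch_deg >= 0 else -magnitude
```

For instance, `volume_steps_from_pitch(60)` yields two steps, while an angle of 10 degrees yields none.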

Meanwhile, the motion information used to generate the control information may be motion information acquired during the utterance section in which the user's utterance is identified, or, even when the user's utterance is no longer detected, motion information acquired before a predetermined threshold time elapses from the end of the user's utterance. For example, when the motion information 530 is acquired at time ta, before the predetermined threshold time has elapsed after the user has spoken the voice command "Bixby volume", the processor 2400 may generate control information using the motion information 530. As shown in FIG. 5, the processor 2400 may increase the volume of the current device when the motion information 530 in the pitch direction having the maximum value P5 is obtained within the predetermined threshold time after the voice command “Bixby volume” ends.

Also, the processor 2400 may generate a control command by using motion information obtained before a preset end keyword is input by voice. For example, the processor 2400 may generate a control command using the user's movement information acquired before a preset end keyword such as "OK", "Finished", "Thanks", "Done", "Stop", or "End" is recognized. In the above example, when the user keeps raising his/her head after speaking the voice command "Bixby volume" or "volume", the processor 2400 may increase the volume level of the current device once for each head-raising motion performed from the time of the user's utterance until the predetermined end keyword is recognized.

Also, the processor 2400 may analyze the user's movement information without using an end keyword, and may determine that the control operation is to be stopped when the user's movement returns to the initial position it had at the start of the utterance. In the above example, the processor 2400 may control the volume value according to the movement of the user's head after the user's voice command "Bixby volume" or "volume", and may stop the volume control operation when the user's head returns to its initial position.
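
The three alternative ways of bounding the motion window described above (a threshold time after the utterance ends, an end keyword, and a return to the initial position) can each be sketched as a small predicate. These are illustrative sketches: the tolerance band, the case-insensitive keyword comparison, and all function names are assumptions.

```python
from typing import List

END_KEYWORDS = {"ok", "finished", "thanks", "done", "stop", "end"}


def timed_out(now: float, utterance_end: float, threshold_time: float) -> bool:
    """True once the threshold time has elapsed since the utterance ended."""
    return now - utterance_end >= threshold_time


def is_end_keyword(word: str) -> bool:
    """True if the recognized word is one of the preset end keywords."""
    return word.strip().lower() in END_KEYWORDS


def returned_to_start(angles: List[float], initial: float, tolerance: float) -> bool:
    """True once the angle has left the band around `initial` and come back."""
    has_left = False
    for angle in angles:
        inside = abs(angle - initial) <= tolerance
        if not inside:
            has_left = True
        elif has_left:
            return True
    return False
```

The `has_left` flag prevents the session from ending before the user has moved at all, since the head starts at the initial position.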

As described above, the processor 2400 may determine an attribute value related to a control command by using the motion information detected by the sensing unit 2300 . The processor 2400 may change the attribute value in stages in consideration of the direction and magnitude of the movement, or may change the attribute value to be linearly proportional or inversely proportional to the movement.

Specifically, when the type of the control command is determined according to the user's voice input, the processor 2400 determines whether to increase or decrease the attribute value related to the control command in consideration of the detected movement direction. The processor 2400 may determine the attribute value by using motion information about at least one of the x, y, and z axes. For example, using a pitch movement rotating about the y-axis, the processor 2400 may determine that the attribute value is to be increased when the pitch angle has a + value and decreased when the pitch angle has a - value. As shown in FIG. 6, when a motion having a value greater than a predetermined threshold value Th is detected from time T1 to time T2 (610), the processor 2400 may change the attribute value stepwise during the period from T1 to T2 (620). In addition, as shown in FIG. 7, when a motion having a value greater than the predetermined threshold value Th is detected from time T1 to time T2 (710), the processor 2400 may change the attribute value linearly during the period from T1 to T2 (720). For example, given the control command “volume” or “volume up”, the processor 2400 may increase the volume of the device stepwise or linearly while the pitch angle remains greater than the predetermined threshold.
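
The difference between the stepwise change of FIG. 6 and the linear change of FIG. 7 can be sketched as two functions of the time elapsed since T1 while the motion stays above Th. The step interval, step size, and rate parameters are assumptions introduced for illustration. The figures themselves are not reproduced here.

```python
def stepwise_value(
    start_value: float, elapsed: float, step_interval: float, step_size: float
) -> float:
    """Attribute value that jumps by `step_size` every `step_interval` seconds."""
    return start_value + step_size * int(elapsed // step_interval)


def linear_value(start_value: float, elapsed: float, rate: float) -> float:
    """Attribute value that grows continuously at `rate` units per second."""
    return start_value + rate * elapsed


# Volume starts at 5 and the head has been held up for 2.5 seconds.
stepped = stepwise_value(5, 2.5, step_interval=1.0, step_size=1)
ramped = linear_value(5, 2.5, rate=1.0)
```

Under these assumed parameters the stepwise variant has completed two whole steps, while the linear variant has also accumulated the fractional part.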

The processor 2400 may determine an amount of change in motion information based on a threshold value in order to prevent a control operation from reacting to minute movements of the user. Specifically, the processor 2400 may determine that a movement in the + direction has occurred when motion information on any one of the x, y, and z axes changes from a value smaller than the threshold value Th to a larger value, and that a movement in the - direction has occurred when the motion information changes from a value larger than Th to a smaller value, and the user's movement may be determined by analyzing the number of movements in the + direction and in the - direction.

For example, referring to FIG. 8, the processor 2400 analyzes the motion information 800 obtained from the sensing unit 2300. When a motion changing from a value smaller than the threshold Th1 to a larger value is obtained, it may be determined that a movement in the + direction has occurred, and when a motion changing from a value greater than the threshold Th1 to a smaller value is obtained, it may be determined that a movement in the - direction has occurred. In FIG. 8, when movements in the + and - directions are determined based on the threshold Th1, the processor 2400 may determine three + direction movements, having the maximum values P1, P3, and P4, and two - direction movements. Since the change in motion between the maximum value P1 and the maximum value P2 stays within a range above the threshold Th1, it is not determined that an additional + direction movement has occurred, and the change may be ignored.

In addition, the processor 2400 may set the threshold value for determining movement in the + direction and the threshold value for determining movement in the - direction differently. Assuming that the former is Th1 and the latter is Th2, the processor 2400 may determine that a movement in the + direction has occurred when the motion changes from a value smaller than Th1 to a larger value, and that a movement in the - direction has occurred when the motion changes from a value greater than Th2 to a smaller value. In FIG. 8, when movements in the + and - directions are determined based on the two threshold values Th1 and Th2, the processor 2400 may determine three + direction movements, having the maximum values P1, P3, and P4, and one - direction movement. Since the motion between the maximum value P2 and the maximum value P3 stays within a range above the threshold Th2, it may be determined that no - direction movement occurs between P2 and P3. As such, by determining movements in the + and - directions based on two threshold values, the processor 2400 may prevent the control operation from reacting to minute movements of the user.
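
Threshold-crossing counting with one or two thresholds can be sketched as below. The sample series is invented so that it reproduces the counts described for FIG. 8 (three + and two - movements with a single threshold, three + and one - with two thresholds). It is not data from the figure, and the threshold values 20 and 10 and the function name are assumptions.

```python
from typing import List, Tuple


def count_crossings(
    samples: List[float], th_plus: float, th_minus: float
) -> Tuple[int, int]:
    """Count + and - movements as threshold crossings between samples.

    A + movement is an upward crossing of `th_plus`; a - movement is a
    downward crossing of `th_minus`. Passing the same value for both
    gives the single-threshold case.
    """
    plus = 0
    minus = 0
    for prev, cur in zip(samples, samples[1:]):
        if prev < th_plus <= cur:
            plus += 1
        if prev >= th_minus > cur:
            minus += 1
    return plus, minus


# Peaks at 30 (P1), 32 (P2), 30 (P3), 28 (P4); the dip to 15 between P2 and
# P3 falls below 20 but stays above 10.
samples = [0, 30, 22, 32, 15, 30, 5, 28, 25]
single = count_crossings(samples, 20, 20)
dual = count_crossings(samples, 20, 10)
```

With the lower second threshold, the shallow dip between P2 and P3 no longer registers as a - movement, which is the desensitizing effect described above.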

When the number of movements in the + direction and the - direction is determined, the processor 2400 may determine an attribute value related to control information. For example, when the control information is the volume, the volume level may be increased according to the number of movements in the + direction, or the volume level may be decreased according to the number of movements in the - direction.

Referring to FIG. 9A, the device 910 may generate control information based on the user's voice input and motion information, and may control the external device 930 by transmitting the control information to the external device 930 connected through the network 920. The device 910 may also generate a control command based on the user's voice input and motion information by means of a voice assistant program installed by default, and use the control command to control the device 910 itself.

The device 910 may be a wearable device such as AR (Augmented Reality) glasses, a smart watch, a smart lens, a smart bracelet, or smart clothing, or a mobile device such as a smart phone, a smart tablet, a computer, or a notebook computer, but is not limited thereto.

The external device 930 may be a smart light bulb, a smart pet, a robot cleaner, a display device, a smart phone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a server, a micro server, a GPS (global positioning system) device, an e-book terminal, a digital broadcast terminal, a navigation device, a kiosk, an MP3 player, a digital camera, a home appliance, or another mobile or non-mobile computing device, but is not limited thereto.

The network 920 includes a local area network (LAN), a wide area network (WAN), a value added network (VAN), a mobile radio communication network, a satellite communication network, and combinations thereof. It is a data communication network in a comprehensive sense that enables the network entities shown in FIG. 9A to communicate smoothly with each other, and may include the wired Internet, the wireless Internet, and a mobile wireless communication network.

Referring to FIG. 9B, the device 910 may transmit the user's voice input and motion information to the first external device 940 connected through the network 920, and the first external device 940 may generate control information based on the user's voice input and motion information. The control command generated by the first external device 940 may be used to control the device 910 or a second external device 950, or may be used to control the first external device 940 itself. Specifically, the device 910 transmits the user's voice input and motion information to the first external device 940, the first external device 940 generates control information based on the user's voice input and motion information, and the device 910 may receive the control information generated by the first external device 940 and perform a control operation based on the received control information. In addition, the device 910 may transmit the user's voice input and motion information for controlling the second external device 950 to the first external device 940, the first external device 940 may generate the control information based on the user's voice input and motion information, and the second external device 950 may receive the control information generated by the first external device 940 and perform a control operation based on the received control information.

In operation 1010, the input unit 2100 receives an external audio signal. When a voice input for executing a predetermined voice assistant application is received, the processor 2400 may execute the corresponding voice assistant application and control the input unit 2100 to receive an external audio signal including the user's voice input.

In operation 1020, the input unit 2100 or the sensing unit 2300 acquires vibration information caused by the user's utterance. The processor 2400 may obtain vibration information for identifying the user's utterance section either by analyzing the audio signal input through the input unit 2100 to identify a speech signal pattern produced by the user's utterance, or by detecting, through the sensing unit 2300, the vibration generated by the user's utterance.

In operation 1030, the processor 2400 identifies the user's voice input by the actual user's utterance, excluding the non-voice section not caused by the user's utterance from the received audio signal.

In operation 1040, the processor 2400 may generate control information for controlling the device 2000 or an external device by combining the user's voice signal input through the input unit 2100 with movement information according to the user's movement obtained from the sensing unit 2300. The processor 2400 may determine the type of the control command based on the user's voice signal, and determine the attribute value of the control command based on the user's motion information. In order to determine the attribute value of the control command, the processor 2400 may determine the user's movement direction and movement size from the user's movement information obtained after the start point of the utterance section, and may determine the attribute value of the control information based on the determined movement direction and movement size. Specifically, the processor 2400 may determine the user's movement direction and movement size from the user's pitch, roll, and yaw movements obtained from the sensing unit 2300 with respect to the x, y, and z axes, and may determine an attribute value related to the control command based on the movement direction and the movement size. The processor 2400 may use the generated control command to control the device 2000 itself, or may transmit the generated control command to an external device connected through a network to control the external device.
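
The combination step of operation 1040 can be sketched as follows: the recognized words select the command type, and the counted movements set the attribute change. This is a simplified illustration. The `ControlInfo` type, the set of known commands, and the use of a net movement count as the attribute change are assumptions rather than the disclosed data format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set of command words the device understands.
KNOWN_COMMANDS = {"volume", "power", "move"}


@dataclass(frozen=True)
class ControlInfo:
    command: str  # control command type, taken from the voice input
    delta: int    # attribute change, taken from the motion information


def build_control_info(
    words: List[str],
    plus_moves: int,
    minus_moves: int,
    wake_word: str = "bixby",
) -> Optional[ControlInfo]:
    """Combine recognized words and counted movements into control information."""
    for word in words:
        token = word.lower()
        if token == wake_word:
            continue  # the wake word itself carries no command type
        if token in KNOWN_COMMANDS:
            return ControlInfo(token, plus_moves - minus_moves)
    return None


# "Bixby volume" followed by two head raises.
info = build_control_info(["Bixby", "volume"], plus_moves=2, minus_moves=0)
```

The resulting object could be applied locally or serialized and sent to an external device over the network.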

In operation 1110, the input unit 2100 receives an audio signal. In operation 1120, the input unit 2100 or the sensing unit 2300 acquires vibration information caused by the user's utterance, and the sensing unit 2300, which includes an acceleration sensor 2310 and a gyroscope sensor 2320, may obtain the user's motion information. The sensing unit 2300 may detect the user's motion and generate an electrical signal or data value related to the sensed motion.

In operation 1130, the processor 2400 determines the user's utterance state either by analyzing the audio signal input through the input unit 2100 to identify a speech signal pattern produced by the user's utterance and thereby identify the utterance section, or by detecting the vibration generated by the user's utterance through the sensing unit 2300.

In operation 1140 , the processor 2400 identifies the user's voice input due to the actual user's speech, except for the non-voice section that is not caused by the user's speech from the received audio signal.

In operation 1150, the processor 2400 may generate control information for controlling the device 2000 or an external device by combining the user's voice signal input through the input unit 2100 with movement information according to the user's movement obtained from the sensing unit 2300. The processor 2400 may determine the type of the control command based on the user's voice signal, and determine the attribute value of the control command based on the user's motion information. In order to determine the attribute value of the control command, the processor 2400 may determine the user's movement direction and movement size from the user's movement information obtained after the start point of the utterance section, and may determine the attribute value of the control information based on the determined movement direction and movement size.

In operation S1210 , the device 1200 may receive an audio signal, and in operation S1220 , the device 1200 may obtain vibration information and movement information of the user due to the user's utterance.

In operation S1230, the device 1200 identifies the section of the input audio signal produced by the user's utterance, and identifies the voice input produced by the user's actual utterance. In addition, the device 1200 generates control information for controlling the external device 1250 by combining the user's voice signal with movement information according to the user's movement, and in operation S1240, the device 1200 transmits the generated control information to the external device 1250 connected through a predetermined network. In operation S1260, the processor included in the external device 1250 changes the state of the external device by performing a control operation according to the received control information.

In operation S1310 , the device 1300 receives an audio signal, and in operation S1311 , the device 1300 acquires vibration information and user movement information due to a user's utterance.

In operation S1312 , the device 1300 transmits the acquired audio signal, vibration information, and user movement information to the external device 1350 .

In operation S1320, the external device 1350 identifies the section of the input audio signal produced by the user's utterance, and identifies the voice input produced by the user's actual utterance. In addition, the external device 1350 generates control information by combining the user's voice signal with movement information according to the user's movement.

The control information may be control information for controlling the device 1300. In operation S1330, the external device 1350 transmits the generated control information to the device 1300, and in operation S1331 the device 1300, having received the control information, may change the state of the device 1300 by performing a control operation according to the received control information.

The control information may be control information for controlling the external device 1380. In this case, in operation S1332 the control information generated by the external device 1350 may be transmitted directly to the other external device 1380, or in operation S1333 the device 1300 that has received the control information generated by the external device 1350 may forward the control information to the external device 1380. The processor included in the external device 1380 may perform a control operation according to the received control information.

FIG. 14 is a block diagram illustrating a configuration of an electronic device 2000 according to an embodiment of the present disclosure. The electronic device 2000 illustrated in FIG. 14 may include the same components as the devices described with reference to FIGS. 1 to 13, and the same components may perform all of the operations and functions described with reference to FIGS. 1 to 13. Accordingly, components of the electronic device 2000 that have not been described so far will be described below.

Referring to FIG. 14, the electronic device 2000 may include a user input unit 1100, an output unit 1200, a control unit 1300, a sensing unit 1400, a communication unit 1500, an A/V input unit 1600, and a memory 1700.

The user input unit 1100 is a means by which a user inputs data for controlling the electronic device 2000. For example, the user input unit 1100 may include a key pad, a dome switch, a touch pad (contact capacitive type, pressure resistive film type, infrared sensing type, surface ultrasonic conduction type, integral tension measurement type, piezo effect type, etc.), a jog wheel, a jog switch, and the like, but is not limited thereto. The user input unit 1100 may receive a user input necessary to generate conversation information to be provided to the user.

The output unit 1200 may output an audio signal, a video signal, or a vibration signal, and may include a display unit 1210, a sound output unit 1220, and a vibration motor 1230.

The vibration motor 1230 may output a vibration signal. For example, the vibration motor 1230 may output a vibration signal corresponding to the output of audio data or video data (e.g., a call signal reception sound, a message reception sound, etc.).

The sensing unit 1400 may detect a state of the electronic device 2000 or a state around the electronic device 2000, and transmit the sensed information to the controller 1300.

The sensing unit 1400 may include at least one of a magnetic sensor 1410, an acceleration sensor 1420, a temperature/humidity sensor 1430, an infrared sensor 1440, a gyroscope sensor 1450, a position sensor (e.g., GPS) 1460, a barometric pressure sensor 1470, a proximity sensor 1480, and an illuminance sensor 1490, but is not limited thereto. Since the function of each sensor can be intuitively inferred from its name by a person skilled in the art, a detailed description thereof will be omitted.

The communication unit 1500 may include components for performing communication with other devices. For example, the communication unit 1500 may include a short-range communication unit 1510, a mobile communication unit 1520, and a broadcast receiving unit 1530.

The short-range communication unit 1510 may include a Bluetooth communication unit, a BLE (Bluetooth Low Energy) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an infrared (IrDA, Infrared Data Association) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra wideband (UWB) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The mobile communication unit 1520 transmits and receives a wireless signal to and from at least one of a base station, an external terminal, and a server on a mobile communication network. Here, the wireless signal may include various types of data according to transmission and reception of a voice call signal, a video call signal, or a text/multimedia message.

The broadcast receiving unit 1530 receives a broadcast signal and/or broadcast-related information from the outside through a broadcast channel. The broadcast channel may include a satellite channel and a terrestrial channel. According to an embodiment, the electronic device 2000 may not include the broadcast receiving unit 1530.

Also, the communication unit 1500 may exchange, with the second interactive electronic device 3000, other devices, and the server, information necessary to generate conversation information to be provided to the first user.

The A/V (Audio/Video) input unit 1600 is for inputting an audio signal or a video signal, and may include a camera 1610, a microphone 1620, and the like. The camera 1610 may obtain an image frame such as a still image or a moving picture through an image sensor in a video call mode or a shooting mode. The image captured through the image sensor may be processed through the processor 1300 or a separate image processing unit (not shown).

The image frame processed by the camera 1610 may be stored in the memory 1700 or transmitted to the outside through the communication unit 1500. Two or more cameras 1610 may be provided according to the configuration of the terminal.

The microphone 1620 receives an external sound signal and processes it as electrical voice data. For example, the microphone 1620 may receive an acoustic signal from an external device or a speaker. The microphone 1620 may use various noise removal algorithms for removing noise generated in the process of receiving an external sound signal.

The memory 1700 may store a program for processing and control by the controller 1300, and may store data input to or output from the electronic device 2000.

The memory 1700 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk.

Programs stored in the memory 1700 may be classified into a plurality of modules according to their functions, for example, a UI module 1710, a touch screen module 1720, a notification module 1730, and the like.

The UI module 1710 may provide a specialized UI, GUI, or the like that interworks with the electronic device 2000 for each application. The touch screen module 1720 may detect a user's touch gesture on the touch screen and transmit information about the touch gesture to the controller 1300. The touch screen module 1720 according to some embodiments may recognize and analyze a touch code. The touch screen module 1720 may be configured as separate hardware including a controller.

The notification module 1730 may generate a signal for notifying the occurrence of an event in the electronic device 2000. Examples of events generated in the electronic device 2000 include call signal reception, message reception, key signal input, schedule notification, and the like. The notification module 1730 may output a notification signal in the form of a video signal through the display unit 1210, may output a notification signal in the form of an audio signal through the sound output unit 1220, and may output a notification signal in the form of a vibration signal through the vibration motor 1230.

The electronic device 2000 described in the present disclosure may be implemented as a hardware component, a software component, and/or a combination of a hardware component and a software component. For example, the electronic device 2000 may be implemented using one or more general-purpose or special-purpose computers, such as a processor, an arithmetic logic unit (ALU), application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), microcontrollers, microprocessors, or any other device capable of executing and responding to instructions.

Software may comprise a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively.

The software may be implemented as a computer program including instructions stored in a computer-readable storage medium. The computer-readable recording medium includes, for example, a magnetic storage medium (e.g., read-only memory (ROM), random-access memory (RAM), floppy disk, hard disk, etc.) and an optically readable medium (e.g., CD-ROM, DVD (Digital Versatile Disc), etc.). The computer-readable recording medium may be distributed among computer systems connected through a network, so that computer-readable code can be stored and executed in a distributed manner. The medium may be readable by a computer, stored in a memory, and executed on a processor.

The computer is an apparatus capable of calling a stored instruction from a storage medium and operating according to the disclosed embodiment in accordance with the called instruction, and may include the electronic devices 1000 and 2000 according to the disclosed embodiment.

The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' means that the storage medium does not include a signal and is tangible, and does not distinguish whether data is stored in the storage medium semi-permanently or temporarily.

In addition, the electronic devices 1000 and 2000 or the method according to the disclosed embodiments may be provided as included in a computer program product. Computer program products may be traded between sellers and buyers as commodities.

The computer program product may include a software program and a computer-readable storage medium in which the software program is stored. For example, the computer program product may include a product in the form of a software program (e.g., a downloadable application) distributed electronically through a manufacturer of the electronic device 1000 or 2000 or through an electronic market (e.g., Google Play Store, App Store). For electronic distribution, at least a portion of the software program may be stored in a storage medium or may be temporarily generated. In this case, the storage medium may be a server of the manufacturer, a server of the electronic market, or a storage medium of a relay server that temporarily stores the software program.

In a system consisting of a server and a terminal, the computer program product may include a storage medium of the server or a storage medium of the terminal. Alternatively, when there is a third device (e.g., a smart phone) that is communicatively connected to the server or the terminal, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the software program itself, transmitted from the server to the terminal or the third device, or transmitted from the third device to the terminal.

In this case, one of the server, the terminal, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, two or more of the server, the terminal, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, a server (e.g., a cloud server or an artificial intelligence server) may execute a computer program product stored in the server, and may control a terminal communicatively connected with the server to perform the method according to the disclosed embodiments.

As another example, the third device may execute a computer program product to control the terminal communicatively connected to the third device to perform the method according to the disclosed embodiment.

When the third device executes the computer program product, the third device may download the computer program product from the server and execute the downloaded computer program product. Alternatively, the third device may execute the computer program product provided in a preloaded state to perform the method according to the disclosed embodiments.

In addition, although the embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described above. Various modifications may be made by those of ordinary skill in the technical field to which the present invention pertains without departing from the gist of the present invention as claimed in the claims, and these modifications should not be understood separately from the technical spirit or perspective of the present disclosure.

As described above, although the embodiments have been described with reference to limited embodiments and drawings, various modifications and variations are possible from the above description by those skilled in the art. For example, appropriate results may be obtained even if the described techniques are performed in an order different from the described method, and/or the described components of an electronic device, structure, circuit, etc. are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Claims

1. A device comprising:

a memory in which at least one program is stored;

a microphone for receiving an audio signal;

a sensor module for acquiring vibration information according to a user's utterance state; and

at least one processor configured to generate control information corresponding to the user's voice input identified from the audio signal by executing the at least one program,

wherein the at least one program comprises instructions for executing:

identifying the user's voice input from the audio signal received through the microphone based on the vibration information; and

generating control information corresponding to the user's voice input based on the identified user's voice input.
2. The device of claim 1,

wherein the identifying of the user's voice input from the audio signal received through the microphone comprises:

determining an utterance section of the user based on the vibration information; and

determining an audio signal obtained within the user's utterance section as the user's voice input signal.
3. The device of claim 1,

wherein the sensor module further acquires movement information according to the user's movement, and

wherein the generating of the control information corresponding to the user's voice input comprises generating the control information by combining the user's voice signal and the movement information.
4. The device of claim 3,

wherein the control information includes a control command for controlling the device, determined based on the user's voice signal, and an attribute value related to the control command, and

wherein the attribute value is changed based on the user's movement information.
5. The device of claim 4,

wherein the user's movement information is obtained with respect to three axes, namely a pitch axis, a roll axis, and a yaw axis, and

wherein the attribute value is determined based on a movement direction and a movement magnitude determined with respect to the three axes.
6. The device of claim 3,

wherein the generating of the control information corresponding to the user's voice input comprises:

determining a starting point of an utterance section of the user's voice input based on the vibration information; and

determining, with reference to the starting point of the utterance section, the user's movement direction and movement magnitude from the user's movement information obtained after the starting point, and generating the control information based on the determined movement direction and movement magnitude.
7. The device of claim 1,

further comprising a communication unit for communicating with an external device,

wherein the control information includes a control command for the external device.
8. A method by which a device provides control information for a user's voice input, the method comprising:

receiving an audio signal;

acquiring vibration information according to the user's utterance state;

identifying the user's voice input included in the audio signal based on the vibration information; and

generating control information corresponding to the user's voice input based on the identified user's voice input.
9. The method of claim 8,

wherein the identifying of the user's voice input comprises:

determining an utterance section of the user based on the vibration information; and

determining an audio signal acquired within the user's utterance section as the user's voice input signal.
10. The method of claim 8,

further comprising obtaining movement information according to the user's movement,

wherein the control information is generated by combining the user's voice signal and the movement information.
11. The method of claim 10,

wherein the control information includes a control command for controlling the device, determined based on the user's voice signal, and an attribute value related to the control command, and

wherein the attribute value is changed based on the user's movement information.
12. The method of claim 11,

wherein the user's movement information is obtained with respect to three axes, namely a pitch axis, a roll axis, and a yaw axis, and

wherein the attribute value is determined based on a movement direction and a movement magnitude determined with respect to the three axes.
13. The method of claim 10,

wherein a starting point of an utterance section of the user's voice input is determined based on the vibration information, and

wherein, with reference to the starting point of the utterance section, the user's movement direction and movement magnitude are determined from the user's movement information obtained after the starting point, and the control information is generated based on the determined movement direction and movement magnitude.
14. The method of claim 8,

further comprising transmitting the generated control information corresponding to the user's voice input to an external device.
15. A computer-readable recording medium on which a program for executing the method of claim 8 on a computer is recorded.