WO2021125784A1 - Electronic device and control method therefor - Google Patents

Electronic device and control method therefor Download PDF

Info

Publication number
WO2021125784A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound signal
electronic device
value
threshold
threshold value
Prior art date
Application number
PCT/KR2020/018442
Other languages
French (fr)
Korean (ko)
Inventor
김가을
최찬희
Original Assignee
Samsung Electronics Co., Ltd. (삼성전자(주))
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. (삼성전자(주))
Publication of WO2021125784A1 publication Critical patent/WO2021125784A1/en

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
                    • G06F3/16: Sound input; Sound output
                        • G06F3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00: Speech recognition
                    • G10L15/04: Segmentation; Word boundary detection
                    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063: Training
                    • G10L15/08: Speech classification or search
                        • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
                    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
                    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L2015/225: Feedback of the input speech
                • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation

Definitions

  • the present invention relates to an electronic device and a control method thereof, and more particularly, to an electronic device for processing a voice uttered by a user and a control method thereof.
  • Electronic devices such as artificial intelligence (AI) speakers, mobile devices such as smartphones or tablets, and smart TVs can recognize the voice uttered by the user and perform a function according to the voice recognition.
  • the electronic device may operate to activate the voice recognition function by recognizing that a predetermined start word, that is, a trigger word, is input from the user.
  • The start word recognition may include a process of determining the similarity between the audio signal of the user's voice and the start word. For example, when the similarity between the pattern of the audio signal and the start word is greater than a predetermined criterion, the input voice can be identified as including the start word.
  • In a noisy environment, the threshold value for identifying the speech characteristic of a sound signal is reset in response to the user's speech characteristic, so that the accuracy of start word recognition can be improved.
  • An electronic device includes: a sound receiver; and a processor configured to, when a value indicating a noise characteristic of a sound signal received through the sound receiver is greater than a first threshold value and a value indicating a speech characteristic of the sound signal is greater than a second threshold value, perform a recognition operation regarding the user's utterance based on the sound signal and adjust the second threshold value to increase.
  • the speech characteristic may include a signal-to-noise ratio of the sound signal.
  • the processor may calculate a ratio of the noise to the sound signal for each frame of the sound signal, and determine an average value of the calculated ratio for each frame as the value of the speech characteristic.
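  • The frame-wise average described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the frame length, the dB scale, and the way the noise floor is estimated are assumptions introduced here.

```python
import numpy as np

def speech_characteristic(signal: np.ndarray, noise_floor: float,
                          frame_len: int = 400) -> float:
    """Average per-frame signal-to-noise ratio (in dB) of a sound signal.

    noise_floor is an estimate of the noise power (e.g. taken from a
    section before the utterance); how it is obtained is an assumption,
    not fixed by this sketch.
    """
    n_frames = len(signal) // frame_len
    snrs = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        power = np.mean(frame ** 2)  # mean power of this frame
        snrs.append(10.0 * np.log10(power / noise_floor + 1e-12))
    # the average of the per-frame ratios serves as the value
    # of the speech characteristic
    return float(np.mean(snrs))
```

A louder utterance over the same noise floor yields a larger value, which can then be compared against the second threshold.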
  • the processor may identify whether a predefined start word is included in the sound signal, and identify whether a noise characteristic of the sound signal identified as including the start word is greater than a first threshold value.
  • the processor may identify whether a start word is included in the sound signal based on a similarity between a waveform of the sound signal and a predefined start word pattern.
  • the threshold of similarity may be preset based on a learning algorithm using an acoustic model.
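  • As an illustration only, start-word identification by waveform similarity can be sketched with plain normalized cross-correlation against a stored pattern. The patent presets the similarity threshold via a learning algorithm using an acoustic model; the correlation method, function names, and threshold value below are assumptions.

```python
import numpy as np

def startword_similarity(signal: np.ndarray, pattern: np.ndarray) -> float:
    """Peak normalized cross-correlation (in [-1, 1]) between the
    received waveform and the start-word pattern."""
    n = len(pattern)
    pat = pattern - pattern.mean()
    pat_norm = np.linalg.norm(pat) + 1e-12
    best = -1.0
    for start in range(len(signal) - n + 1):
        win = signal[start:start + n]
        win = win - win.mean()
        sim = float(np.dot(win, pat) / ((np.linalg.norm(win) + 1e-12) * pat_norm))
        best = max(best, sim)
    return best

def contains_startword(signal: np.ndarray, pattern: np.ndarray,
                       third_threshold: float = 0.7) -> bool:
    # third_threshold is a hypothetical value; in the patent it is
    # preset based on acoustic-model training
    return startword_similarity(signal, pattern) >= third_threshold
```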
  • The processor may perform a recognition operation regarding the user's speech based on the sound signal whose similarity is greater than the third threshold value and which satisfies a fourth threshold value.
  • the processor may identify whether the value of the noise characteristic of the sound signal received in a section having a predefined time length before the section including the start word is greater than the first threshold value.
  • the processor may compare the power value of the sound signal received in the section of the predefined time length with the first threshold value.
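  • The comparison above can be sketched as follows; the length of the pre-section and the first threshold value below are hypothetical numbers for illustration only.

```python
import numpy as np

def noise_characteristic(signal: np.ndarray, startword_begin: int,
                         pre_len: int = 8000) -> float:
    """Mean power of the section of predefined length received just
    before the section identified as containing the start word."""
    begin = max(0, startword_begin - pre_len)
    pre = signal[begin:startword_begin].astype(np.float64)
    if pre.size == 0:
        return 0.0
    return float(np.mean(pre ** 2))

def is_noisy(signal: np.ndarray, startword_begin: int,
             first_threshold: float = 0.01) -> bool:
    # first_threshold is a hypothetical value for illustration
    return noise_characteristic(signal, startword_begin) > first_threshold
```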
  • the processor may adjust the second threshold to decrease.
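  • Putting the conditions together, the adaptive behavior of the second threshold might look like the sketch below; the class name, step size, and bounds are assumptions, and whether recognition proceeds in a quiet environment is inferred from context rather than stated here.

```python
class SpeechThresholdAdapter:
    """Tracks the second threshold used to judge the speech characteristic.

    When the environment is noisy (noise value > first threshold) and the
    utterance is strong enough (speech value > second threshold), the
    recognition operation proceeds and the second threshold is raised,
    inducing the user to utter the start word more loudly; in a quiet
    environment the second threshold is lowered again.  Step size and
    bounds are illustrative assumptions.
    """

    def __init__(self, second_threshold: float = 6.0, step: float = 1.0,
                 lo: float = 0.0, hi: float = 20.0) -> None:
        self.second_threshold = second_threshold
        self.step = step
        self.lo, self.hi = lo, hi

    def update(self, noise_value: float, first_threshold: float,
               speech_value: float) -> bool:
        """Return True if a recognition operation should be performed."""
        if noise_value > first_threshold:          # noisy environment
            if speech_value > self.second_threshold:
                # recognize, then require a louder utterance next time
                self.second_threshold = min(
                    self.hi, self.second_threshold + self.step)
                return True
            return False  # utterance too weak relative to the noise
        # quiet environment: relax the requirement again
        self.second_threshold = max(self.lo, self.second_threshold - self.step)
        return True
```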
  • A control method of an electronic device includes: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring a speech characteristic from the sound signal; and, when the value indicating the noise characteristic is greater than a first threshold value and the value indicating the speech characteristic is greater than a second threshold value, performing a recognition operation regarding the user's utterance based on the sound signal and adjusting the second threshold value to increase.
  • the speech characteristic may include a signal-to-noise ratio of the sound signal.
  • the method may further include calculating a ratio of the noise to the sound signal for each frame of the sound signal, and determining an average value of the calculated ratio for each frame as a value of the speech characteristic.
  • The method may further include identifying whether the sound signal includes a predefined start word, and identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than a first threshold value.
  • the step of identifying whether the start word is included may include identifying whether the start word is included in the sound signal based on a similarity between the waveform of the sound signal and a predefined start word pattern.
  • the threshold of similarity may be preset based on a learning algorithm using an acoustic model.
  • The method may further include performing a recognition operation regarding the user's speech based on the sound signal whose similarity is greater than the third threshold value and which satisfies the fourth threshold value.
  • the method may further include the step of identifying whether a value of a noise characteristic of a sound signal received in a section having a predefined time length before the section identified as including the starting word is greater than a first threshold value.
  • the method may further include adjusting the second threshold to be lowered.
  • A recording medium stores a computer program including computer-readable code for performing a control method of an electronic device, the control method including: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring a speech characteristic from the sound signal; and, when the value indicating the noise characteristic is greater than the first threshold value and the value indicating the speech characteristic is greater than the second threshold value, performing a recognition operation regarding the user's utterance based on the sound signal and adjusting the second threshold value to increase.
  • According to the electronic device and the control method of the present invention, by resetting the threshold value for identifying the user's speech characteristic with respect to the sound signal in a noisy environment, the user is induced to utter the start word in a loud voice, and an effect of improving the accuracy of the recognition operation can be expected.
  • According to the electronic device and the control method thereof of the present invention, the occurrence of malfunctions in which the electronic device incorrectly recognizes a sound signal containing ambient noise, rather than an actual utterance of a user, as the start word in a noisy environment is reduced, which has the effect of improving the accuracy of voice recognition.
  • FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
  • FIG. 3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
  • FIG. 4 is a flowchart illustrating a method of controlling an electronic device according to an embodiment of the present invention.
  • FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention.
  • FIG. 6 is a view for explaining the identification of noise characteristics of an electronic device according to an embodiment of the present invention.
  • A 'module' or 'unit' performs at least one function or operation, may be implemented as hardware, software, or a combination of hardware and software, and may be integrated into and implemented as at least one module.
  • At least one of a plurality of elements refers not only to all of the plurality of elements, but also to each one of them, or to any combination thereof excluding the rest.
  • FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
  • The voice recognition system includes an electronic device 10 capable of receiving a sound signal, that is, a sound, as a voice uttered by a user, and a server 20 capable of communicating with the electronic device 10 through a network.
  • the electronic device 10 may receive a voice uttered by a user (hereinafter, also referred to as a user voice), process a sound signal corresponding to the voice, and perform a corresponding operation.
  • As an operation corresponding to the received voice, the electronic device 10 may provide audio content to the user by outputting a sound corresponding to the processing result of the user's voice through the output unit ( 110 of FIG. 2 ). At least one loudspeaker may be provided in the electronic device 10 as the output unit 110 capable of outputting sound, and the number, shape, and installation location of the speakers provided in the electronic device 10 are not limited in the present invention.
  • the electronic device 10 may be provided with a sound receiver ( 120 in FIG. 3 ) capable of receiving a sound signal as a user's voice.
  • the sound receiver 120 may be implemented as at least one microphone, and the number, shape, and installation location of the microphones provided in the electronic device 10 are not limited.
  • The electronic device 10 may be implemented as various devices capable of receiving a sound signal, such as an artificial intelligence speaker 10a (hereinafter also referred to as an AI speaker or a smart speaker), a display device 10b including a television such as a smart TV, and a mobile device 10c such as a smartphone or tablet.
  • the electronic device 10 implemented as the AI speaker 10a may receive a voice from a user and perform various functions, such as listening to music and searching for information, through voice recognition for the received voice.
  • The AI speaker is not a device that simply outputs sound; utilizing the voice recognition function and the cloud, it may be implemented as a device with a built-in virtual assistant/voice assistant that allows interaction with the user and provides services accordingly.
  • an application for the AI speaker function may be installed and driven in the electronic device 10 .
  • The electronic device 10 implemented as the display device 10b processes an image signal provided from an external signal supply source, that is, an image source, according to a preset process and displays it as an image.
  • The display device 10b includes a television (TV) capable of processing a broadcast signal based on at least one of a broadcast signal, broadcast information, or broadcast data provided from a transmission device of a broadcast station, and displaying it as an image.
  • The display device 10b may receive a video signal from, for example, a set-top box; an optical disc playback device such as a Blu-ray or digital versatile disc (DVD) player; a computer (PC) including a desktop or laptop; a console game machine; or a mobile device including a smart pad such as a smartphone or tablet.
  • When the display device 10b is a television, the display device 10b may wirelessly receive a radio frequency (RF) signal, that is, a broadcast signal transmitted from a broadcasting station; for this purpose, an antenna for receiving the broadcast signal and a tuner for tuning the broadcast signal for each channel may be provided.
  • A broadcast signal can be received through terrestrial wave, cable, satellite, or the like, and the signal source is not limited to an external device or a broadcasting station. That is, any device or station capable of transmitting and receiving data may be included among the image sources of the present invention.
  • The standard of the signal received by the display device 10b may be configured in various ways corresponding to the implementation form of the device.
  • For example, corresponding to the implementation form of the interface unit 140 (see FIG. 2 ) to be described later, the display device 10b may receive signals according to standards such as HDMI (High Definition Multimedia Interface), HDMI-CEC (Consumer Electronics Control), display port (DP), DVI (Digital Visual Interface), composite video, component video, super video, Thunderbolt, RGB cable, SCART (Syndicat des Constructeurs d'Appareils Radiorecepteurs et Televiseurs), USB, etc.
  • the display apparatus 10b may receive image content from a server or the like provided for content provision through wired or wireless network communication, and the type of communication is not limited.
  • Corresponding to the implementation form of the interface unit 140 to be described later, the display device 10b may receive a content signal through wireless network communication such as Wi-Fi, Wi-Fi Direct, Bluetooth, Bluetooth low energy, Zigbee, UWB (Ultra-Wideband), or NFC (Near Field Communication).
  • the display apparatus 10b may receive a content signal through wired network communication such as Ethernet.
  • the display apparatus 10b may serve as an AP that allows various peripheral devices such as a smartphone to perform wireless communication.
  • the display apparatus 10b may receive the content provided in the form of a file according to real-time streaming through the wired or wireless network as described above.
  • The display apparatus 10b may process signals to display on the screen a video, a still image, an application, an on-screen display (OSD), and a user interface (UI, hereinafter also referred to as a graphical user interface (GUI)) for controlling various operations, based on signals/data stored in internal/external storage media.
  • the display device 10b may operate as a smart TV or an Internet Protocol TV (IP TV).
  • A smart TV can receive and display broadcast signals in real time, and, having a web browsing function, makes it possible to search and consume various contents through the Internet while displaying real-time broadcasts; for this purpose, it can provide a convenient user environment.
  • Since the smart TV includes an open software platform, it can provide interactive services to users. Accordingly, the smart TV may provide a user with various contents, for example, an application providing a predetermined service, through the open software platform.
  • These applications are applications that can provide various types of services, and include, for example, applications that provide services such as SNS, finance, news, weather, maps, music, movies, games, and e-books.
  • an application for providing a voice recognition function may be installed on the display device 10b.
  • a display capable of displaying an image may be provided in the electronic device 10 .
  • The implementation method of the display is not limited; for example, it may be implemented in various display methods such as liquid crystal, plasma, light-emitting diode, organic light-emitting diode, surface-conduction electron-emitter, carbon nano-tube, and nano-crystal.
  • the electronic device 10 may communicate with various external devices including the server 20 through the interface unit 140 .
  • The electronic device 10 is implemented to be able to communicate with an external device through various types of wired or wireless connections (e.g., Bluetooth, Wi-Fi, or Wi-Fi Direct).
  • the server 20 is provided to perform wired or wireless communication with the electronic device 10 .
  • The server 20 is, for example, implemented in a cloud type, and can store and manage user account information of the electronic device 10 and/or an additional device associated with the electronic device 10 (e.g., a smartphone on which a corresponding application is installed to interwork with an AI speaker).
  • The implementation form of the server 20 is not limited; as an example, it may be implemented as an STT (Speech to Text) server that converts a sound signal related to voice into text, or as a main server related to voice recognition that also performs the function of the STT server.
  • the server 20 may be provided in plurality, such as the STT server and the main server, so that the electronic device 10 may communicate with the plurality of servers.
  • the server 20 may be provided with data for recognizing a voice uttered by a user, that is, a database (DB) in which information is stored.
  • the database may include, for example, a plurality of acoustic models predetermined by modeling signal characteristics of a voice.
  • the database may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
  • the acoustic model and/or the language model may be configured by performing learning in advance.
  • the electronic device 10 can identify and process the received user voice, and output the processing result through sound or image.
  • FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
  • The electronic device 10 includes an output unit 110 , a sound receiving unit 120 , a signal processing unit 161 , an interface unit 140 , a storage unit 150 , and a processor 160 .
  • The configuration of the electronic device 10 according to an embodiment of the present invention shown in FIG. 2 is only an example, and an electronic device according to another embodiment may be implemented with a configuration other than that shown in FIG. 2 . That is, the electronic device of the present invention may be implemented in a form in which a configuration other than that shown in FIG. 2 is added, or in which at least one of the configurations shown in FIG. 2 is excluded.
  • the output unit 110 outputs a sound, that is, a sound.
  • the output unit 110 may include, for example, at least one speaker capable of outputting sound in an audible frequency band of 20 Hz to 20 KHz.
  • the output unit 110 may output a sound corresponding to an audio signal/sound signal of a plurality of channels.
  • the output unit 110 may output a sound according to the processing of the sound signal as a user voice received through the sound receiving unit 120 .
  • the sound receiver 120 may receive a voice uttered by a user, that is, a sound wave.
  • the sound wave input through the sound receiving unit 120 is converted into an electrical signal by the signal converting unit.
  • the signal converter may include an AD converter that converts analog sound waves into digital signals.
  • the signal conversion unit may be included in a signal processing unit 161 to be described later.
  • the sound receiver 120 is implemented to be provided in the electronic device 10 by itself.
  • the sound receiving unit 120 may be implemented as a form provided in a separate device, not a component included in the electronic device 10 .
  • When the electronic device 10 is a display device such as a television, a user voice may be received through a microphone, that is, a sound receiver, installed in a remote control provided as an input device capable of user manipulation, and a sound signal corresponding to the user voice may be transmitted from the remote control to the electronic device 10 .
  • the analog sound wave received through the microphone of the remote control may be converted into a digital signal and transmitted to the electronic device 10 .
  • the input device includes a terminal device, such as a smartphone, on which a remote control application is installed.
  • the interface unit 140 allows the electronic device 10 to transmit or receive signals with various external devices including the server 20 and the terminal device.
  • the interface unit 140 may include a wired interface unit 141 .
  • The wired interface unit 141 may include a connection unit for transmitting/receiving signals/data according to standards such as HDMI, HDMI-CEC, USB, Component, Display Port (DP), DVI, Thunderbolt, and RGB cable. Here, the wired interface unit 141 may include at least one connector, terminal, or port corresponding to each of these standards.
  • the wired interface unit 141 may be implemented in a form including an input port for receiving a signal from an image source, etc., and may further include an output port in some cases to transmit/receive signals in both directions.
  • The wired interface unit 141 may include a connector or port for connecting an antenna capable of receiving a broadcast signal according to a broadcasting standard such as terrestrial/satellite broadcasting, or a cable capable of receiving a broadcast signal according to a cable broadcasting standard, and may include a connector or port according to video and/or audio transmission standards such as an HDMI port, DisplayPort, DVI port, Thunderbolt, composite video, component video, super video, or SCART.
  • the electronic device 10 may have a built-in antenna capable of receiving a broadcast signal.
  • the wired interface unit 141 may include a connector or port according to a universal data transmission standard such as a USB port.
  • the wired interface unit 141 may include a connector or a port to which an optical cable can be connected according to an optical transmission standard.
  • the wired interface unit 141 is connected to an external microphone or an external audio device having a microphone, and may include a connector or a port capable of receiving or inputting an audio signal from the audio device.
  • The wired interface unit 141 is connected to an audio device such as a headset, earphone, or external speaker, and may include a connector or port capable of transmitting or outputting an audio signal to the audio device.
  • the wired interface unit 141 may include a connector or port according to a network transmission standard such as Ethernet.
  • the wired interface unit 141 may be implemented as a LAN card connected to a router or a gateway by wire.
  • The wired interface unit 141 is connected by wire through the connector or port, in a 1:1 or 1:N (N is a natural number) manner, to external devices such as a set-top box, an optical media playback device, an external display device, a speaker, or a server, thereby receiving a video/audio signal from the corresponding external device or transmitting a video/audio signal to the corresponding external device.
  • the wired interface unit 141 may include a connector or a port for separately transmitting video/audio signals.
  • the wired interface unit 141 is embedded in the electronic device 10 , but may be implemented in the form of a dongle or a module to be detachably attached to the connector of the electronic device 10 .
  • the interface unit 140 may include a wireless interface unit 142 .
  • the wireless interface unit 142 may be implemented in various ways corresponding to the implementation form of the electronic device 10 .
  • The wireless interface unit 142 can use wireless communication methods such as RF (radio frequency), Zigbee, Bluetooth, Wi-Fi, UWB (Ultra WideBand), and NFC (Near Field Communication).
  • the wireless interface unit 142 may be implemented as a communication circuitry including a wireless communication module (S/W module, chip, etc.) corresponding to various types of communication protocols.
  • the wireless interface unit 142 includes a wireless LAN unit.
  • the wireless LAN unit may be wirelessly connected to an external device through an access point (AP) under the control of the processor 160 .
  • the wireless LAN unit includes a WiFi module.
  • the wireless interface unit 142 includes a wireless communication module that supports one-to-one direct communication between the electronic device 10 and an external device wirelessly without an access point.
  • the wireless communication module may be implemented to support communication methods such as Wi-Fi Direct, Bluetooth, and Bluetooth low energy.
  • the storage unit 150 may store identification information (eg, MAC address or IP address) on the external device, which is a communication target device.
  • the wireless interface unit 142 is provided to perform wireless communication with an external device by at least one of a wireless LAN unit and a wireless communication module according to performance.
  • The wireless interface unit 142 may further include communication modules using various other communication methods, such as mobile communication such as LTE, EM communication including magnetic fields, and visible light communication.
  • the wireless interface unit 142 may transmit and receive data packets to and from the server by wirelessly communicating with the server on the network.
  • the wireless interface unit 142 may include an IR transmitter and/or an IR receiver capable of transmitting and/or receiving an IR (Infrared) signal according to an infrared communication standard.
  • the wireless interface unit 142 may receive or input a remote control signal from the remote control or other external device through the IR transmitter and/or the IR receiver, or may transmit or output a remote control signal to another external device.
  • the electronic device 10 may transmit/receive a remote control signal to and from the remote control or other external device through the wireless interface unit 142 of another method such as Wi-Fi or Bluetooth.
  • the electronic device 10 may further include a tuner for tuning the received broadcast signal for each channel.
  • the wireless interface unit 142 may transmit predetermined data as information of the user's voice received through the sound receiving unit 120 to an external device, that is, the server 20 .
  • the form/type of the transmitted data is not limited, and for example, an audio signal corresponding to a voice uttered by a user or a voice characteristic extracted from the audio signal may be included.
  • the wireless interface unit 142 may receive data of the processing result of the user's voice from the server 20 .
  • the electronic device 10 outputs a sound corresponding to the voice processing result through the output unit 110 based on the received data.
  • the above-described embodiment is an example, and the user's voice may be processed within the electronic device 10 itself rather than transmitted to the server 20 . That is, in another embodiment, the electronic device 10 may be implemented to perform the role of the STT server.
  • the electronic device 10 may communicate with an input device such as a remote control through the wireless interface unit 142 to receive a sound signal corresponding to the user's voice from the input device.
  • the communication module communicating with the server 20 and the communication module communicating with the remote controller may be different from each other.
  • for example, the electronic device 10 may communicate with the server 20 through an Ethernet modem or Wi-Fi module, and may communicate with the remote controller through a Bluetooth module.
  • the communication module communicating with the server 20 and the communication module communicating with the remote control may be the same.
  • the electronic device 10 may communicate with the server 20 and the remote controller through the Bluetooth module.
  • the storage unit 150 is configured to store various data of the electronic device 10 .
  • the storage unit 150 should retain data even when power supplied to the electronic device 10 is cut off, and may be provided as a writable nonvolatile memory (writable ROM) to reflect changes. That is, the storage unit 150 may be provided as any one of flash memory, EPROM, or EEPROM.
  • the storage unit 150 may further include a volatile memory such as DRAM or SRAM, whose read or write speed for the electronic device 10 is faster than that of the nonvolatile memory.
  • the data stored in the storage 150 includes, for example, an operating system for driving the electronic device 10 , and various software, programs, applications, and additional data executable on the operating system.
  • An application stored and installed in the storage unit 150 of the electronic device 10 may include an AI speaker application that recognizes a user voice received through the sound receiver 120 and performs an operation corresponding to it.
  • the AI speaker application may be activated when an input of a predetermined keyword, that is, a trigger word, is identified through the sound receiver 120 , or when a user operation on a specific button of the electronic device 10 or the like is identified.
  • the activation of the application may include switching the execution state of the application from the background mode to the foreground mode.
  • the storage unit 150 may include a database 151 in which data, that is, information for recognizing a user voice that can be received through the sound receiving unit 120 , is stored.
  • the database 151 may include, for example, a plurality of acoustic models determined in advance by modeling signal characteristics of speech.
  • the database 151 may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
  • the database in which information for recognizing a user's voice is stored may be provided in the server 20, which is an example of an external device accessible by a wired or wireless network through the wireless interface unit 142 as described above.
  • the server 20 may be implemented, for example, in a cloud type.
  • the processor 160 controls all components of the electronic device 10 to operate.
  • the processor 160 executes instructions included in a control program to perform such a control operation.
  • the processor 160 includes at least one general-purpose processor that loads at least a part of the control program from the non-volatile memory, in which the control program is installed, into the volatile memory and executes the loaded control program, and may be implemented as, for example, a CPU (Central Processing Unit) or an application processor (AP).
  • the processor 160 may include a single core, a dual core, a triple core, a quad core, or multiple cores thereof.
  • the processor 160 may include a plurality of processors, for example, a main processor and a sub-processor that operates in a sleep mode (for example, a mode in which only standby power is supplied and the device does not operate as an electronic device receiving a sound signal).
  • the processor, the ROM, and the RAM are interconnected through an internal bus, and the ROM and the RAM are included in the storage unit 150 .
  • a CPU or an application processor, which is an example of implementing the processor 160 , may be implemented in a form included in a main SoC mounted on a PCB embedded in the electronic device 10 .
  • the control program may include program(s) implemented in the form of at least one of a BIOS, a device driver, an operating system, firmware, a platform, and an application program (application).
  • the application program may be pre-installed or stored in the electronic device 10 when the electronic device 10 is manufactured, or may be installed in the electronic device 10 based on data of the application program received from the outside at the time of later use. The data of the application program may be downloaded to the electronic device 10 from, for example, an external server such as an application market. Such an application program and external server are examples of the computer program product of the present invention, but the present invention is not limited thereto.
  • the processor 160 may include a signal processing unit 161 as shown in FIG. 2 .
  • the signal processing unit 161 processes an audio signal, that is, a sound signal.
  • the sound signal processed by the signal processing unit 161 may be output as sound through the output unit 110 to provide audio content to the user.
  • the signal processing unit 161 is a software block of the processor 160 , and may be implemented in a form that performs one function of the processor 160 .
  • the signal processing unit 161 may be implemented as a separate configuration from the CPU or application processor (AP), which is an example of implementing the processor 160 , for example, as a microprocessor such as a digital signal processor (DSP) or as an integrated circuit (IC), or may be implemented by a combination of hardware and software.
  • the processor 160 may include a voice recognition module 162 capable of recognizing a voice signal uttered by a user, as shown in FIG. 2 .
  • FIG. 3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
  • the voice recognition module 162 receives the user's utterance as an input, and may be implemented to initiate an action for voice recognition in response to an input of a predetermined start word (hereinafter, also referred to as a trigger word or a wake-up word (WUW)).
  • the voice recognition module 162 may include a preprocessor 301 , a start word engine 302 , a threshold value determiner 303 , and a voice recognition engine 304 , as shown in FIG. 3 .
  • the preprocessor 301 may receive a voice signal according to the user's utterance from the sound receiver 120 and perform preprocessing for removing ambient noise, that is, noise.
  • the pre-processing may include processes such as digital signal conversion, filtering, framing, and the like, and a meaningful voice signal can be extracted by removing unnecessary ambient noise from the voice signal according to the above processes.
  • the start word engine 302 performs pattern matching by comparing features extracted from the pre-processed speech signal with a predetermined pattern.
  • the start word engine 302 may perform pattern matching using an acoustic model configured by performing pre-learning.
  • the start word engine 302 may identify whether the input utterance includes the start word based on the similarity between the input utterance, that is, the waveform of the voice signal (sound signal) according to the user's speech, and the start word pattern of the acoustic model.
  • the start word engine 302 may identify that the input utterance includes the start word when, as a result of the comparison by pattern matching, the score of the input utterance, that is, the utterance score, is greater than a predetermined start word threshold (WUW Threshold).
  • such a threshold of similarity, that is, the starting word threshold (WUW Threshold), may be preset based on a learning algorithm using the acoustic model.
  • the starting word threshold (WUW Threshold) is defined as a condition for activating the voice recognition function of the electronic device 10 .
  • the starting word threshold (WUW Threshold) is distinguished from the noise threshold and SNR threshold respectively used for comparison with the noise characteristic and noise-to-speech characteristic of a sound signal, which will be described later.
  • the electronic device 10 may be implemented to use two starting word thresholds set to have different values when the user's utterance is made in a noisy environment.
  • a specific example of applying the two starting word thresholds in such a noisy environment will be described in more detail in the embodiment of FIG. 4 to be described later.
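As a rough illustration of how the utterance score relates to the two starting word thresholds, the following Python sketch uses cosine similarity as a stand-in for the acoustic-model pattern matching (the patent's Equation 1 is not reproduced here, so the similarity measure, function names, and feature representation are all assumptions; 0.1 and 0.15 are the example threshold values mentioned later in the text):

```python
import numpy as np

# Example threshold values from the description (explicitly not limiting).
WUW_THRESHOLD1 = 0.10  # first start word threshold (third threshold value)
WUW_THRESHOLD2 = 0.15  # second start word threshold (fourth threshold value)

def utterance_score(signal_features, start_word_pattern):
    """Similarity between an input feature vector and the start word pattern.

    Cosine similarity is an assumed stand-in for the pattern matching of
    Equation 1, which the text references but does not reproduce.
    """
    a = np.asarray(signal_features, dtype=float)
    b = np.asarray(start_word_pattern, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def satisfies_first_activation(score):
    # Applied regardless of whether the utterance is made in a noisy environment.
    return score > WUW_THRESHOLD1

def satisfies_second_activation(score):
    # The stricter threshold, applied only in the noisy-environment branch.
    return score > WUW_THRESHOLD2
```

A score of, say, 0.12 would satisfy the first activation condition but not the second, which is exactly the case where the noise and SNR checks described below come into play.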
  • the threshold value determining unit 303 identifies whether the user's utterance is made in a noisy environment using a predetermined noise threshold.
  • the identification of the noise environment may be made based on a comparison between the power of the sound signal in a specific section, which is a noise characteristic of the sound signal according to the user's utterance, and the noise threshold.
  • the threshold value determining unit 303 identifies whether the ratio of the sound signal to the noise is equal to or greater than a specific level as the speech characteristic of the sound signal according to the user's speech using a predetermined SNR threshold.
  • the threshold value determining unit 303 may change the SNR threshold based on the result of comparing the noise characteristic of the sound signal with the noise threshold as described above, or the result of comparing the speech characteristic of the sound signal with the SNR threshold. Changing the SNR threshold may include, for example, adjusting the value upward or downward. A specific example of changing the SNR threshold will be described in more detail in the embodiment of FIG. 4 to be described later.
  • the voice recognition engine 304 may be implemented to include a voice recognition function for a voice signal received as a user's utterance, that is, a sound signal, so as to perform a recognition operation on the user's utterance.
  • when the user's utterance is made in a noisy environment, the voice recognition engine 304 may activate the voice recognition function when a two-stage activation condition based on the two starting word threshold values is satisfied, and the electronic device 10 may be implemented to perform a recognition operation regarding the user's utterance based on the received sound signal.
  • a specific example in which the voice recognition function is activated according to the activation conditions of these two steps will be described in more detail in the embodiment of FIG. 4 to be described later.
  • the voice recognition function of the voice recognition engine 304 may be performed using one or more voice recognition algorithms.
  • the voice recognition engine 304 extracts a vector representing a voice feature from a voice signal uttered by a user, and compares the extracted vector with an acoustic model of the database 151 or the server 20 to perform voice recognition.
  • the acoustic model is a model according to previously performed learning as an example.
  • the voice recognition module 162 comprising the preprocessor 301 , the starting word engine 302 , the threshold value determining unit 303 , and the voice recognition engine 304 is described as an example implemented as an embedded type in the processor 160 , but the present invention is not limited thereto. Accordingly, the voice recognition module 162 may be implemented as a configuration of the electronic device 10 separate from the CPU, for example, a separate chip such as a microcomputer provided as a dedicated processor for the voice recognition function.
  • each component of the voice recognition module 162 , that is, the preprocessor 301 , the start word engine 302 , the threshold value determination unit 303 , and the voice recognition engine 304 , may be implemented as a software block as an example; in some cases, at least one component may be excluded, or at least one other component may be added.
  • it will be understood that, in order for the electronic device 10 to perform the voice recognition function, operations performed by at least one of the aforementioned preprocessor 301 , the start word engine 302 , the threshold value determiner 303 , and the voice recognition engine 304 are performed by the processor 160 of the electronic device 10 .
  • the processor 160 identifies whether a value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than a noise threshold (hereinafter, also referred to as a first threshold value), identifies whether a value representing the utterance characteristic of the sound signal is greater than the SNR threshold (hereinafter, also referred to as a second threshold value), and, when the value of the noise characteristic is greater than the first threshold value and the value of the utterance characteristic is greater than the second threshold value, may perform a recognition operation on the user's utterance based on the received sound signal and adjust the second threshold value, that is, the SNR threshold, upward.
  • the processor 160 may identify, with respect to a sound signal for which the similarity between the waveform of the sound signal and the predefined start word pattern is greater than a first start word threshold (hereinafter, also referred to as a third threshold value), that is, a sound signal that satisfies the first activation condition, whether the value of the noise characteristic and the value of the utterance characteristic are greater than the first threshold value and the second threshold value, respectively.
  • when it is identified that the value of the noise characteristic of the received sound signal is equal to or less than the first threshold value, the processor 160 performs a recognition operation on the user's utterance based on the received sound signal, and may adjust the second threshold value, that is, the SNR threshold, downward.
  • the processor 160 may perform a recognition operation regarding the user's utterance based on the received sound signal when the similarity between the waveform of the sound signal and the starting word pattern is greater than the second starting word threshold (hereinafter, also referred to as a fourth threshold value), that is, when the second activation condition is satisfied.
  • the operation of the processor 160 may be implemented as a computer program stored in a computer program product (not shown) provided separately from the electronic device 10 .
  • the computer program product includes a memory in which instructions corresponding to the computer program are stored, and a processor.
  • the instruction includes, when executed by the processor 160 , if the value representing the noise characteristic of the sound signal received through the sound receiving unit 120 is greater than the first threshold value and the value representing the speech characteristic of the sound signal is greater than the second threshold value, performing a recognition operation on the user's utterance based on the received sound signal and adjusting the second threshold value upward.
  • the instruction includes, if the value representing the noise characteristic of the received sound signal is equal to or less than the first threshold, performing a recognition operation on the user's utterance based on the received sound signal and lowering the second threshold.
  • the processor 160 of the electronic device 10 may download and execute a computer program stored in a separate computer program product to perform the above-described operation of the instruction.
  • FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention
  • FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention
  • FIG. 6 is a diagram for explaining the identification of noise characteristics of an electronic device according to an embodiment of the present invention.
  • the electronic device 10 may receive a sound signal through the sound receiver 120 ( 401 ).
  • the received sound signal may be a signal according to the user's utterance.
  • the processor 160 may identify whether the sound signal received in step 401 satisfies the first activation condition for the voice recognition function (step 402).
  • for example, the processor 160 performs pattern matching between a sound signal from which ambient noise, that is, noise, has been removed and a predefined start word signal, as shown in FIG. 5 , and may thereby identify whether the first activation condition is satisfied.
  • the processor 160 derives an utterance score as a similarity between the user's speech, that is, the waveform of the sound signal, and the pattern of the start word signal based on the pattern matching as shown in FIG. 5 , and may identify, using Equation 1 below, whether the derived utterance score, that is, the degree of similarity, is greater than the predetermined first start word threshold (WUW Threshold1), that is, the third threshold value.
  • the first starting word threshold (third threshold value) is for identifying whether the sound signal satisfies the first activation condition for the voice recognition function, and is applied regardless of whether the user's utterance is made in a noisy environment.
  • the first starting word threshold may be preset to, for example, 0.1, but this value is presented as an example and is not limited thereto.
  • the processor 160 may determine that the sound signal input in step 401 satisfies the first activation condition when it is identified that the utterance score is greater than the first starting word threshold by Equation (1).
  • the processor 160 may identify whether the value representing the noise characteristic of the sound signal according to the user's utterance is greater than a predetermined noise threshold, that is, the first threshold value ( 403 ).
  • the first threshold value is for identifying whether the user's surroundings are a noisy environment, and may be preset to correspond to a power value of a sound signal when a sufficiently loud noise is present in the surroundings.
  • for example, the processor 160 identifies a section including the start word uttered by the user (hereinafter, also referred to as a start word section), and may identify whether a value indicating the noise characteristic of the signal received in a section of predefined time length before the start word section (hereinafter, also referred to as a noise characteristic confirmation section) is greater than the first threshold value.
  • the sound signal received by the sound receiving unit 120 in a streaming manner may be temporarily stored in units of consecutive frames in a First In First Out (FIFO) queue-type data structure, as shown in FIG. 6 . That is, the streaming sound signal is stored in such a way that, when the next frame is received, the first stored frame is pushed out.
  • the length of the sound signal to be stored may be preset to correspond to the storage space, and for example, it may be implemented such that a signal having a length of 2.5 seconds is stored.
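A bounded deque gives a minimal sketch of this FIFO frame buffer (the 2.5-second stored length is the example from the text; the 0.5-second frame length is a hypothetical choice, since the patent does not state a frame duration):

```python
from collections import deque

FRAME_SEC = 0.5   # hypothetical frame length; not specified in the text
BUFFER_SEC = 2.5  # example stored-signal length from the description
MAX_FRAMES = int(BUFFER_SEC / FRAME_SEC)  # 5 frames

# Once 2.5 s worth of frames are stored, appending the next frame pushes
# the oldest frame out, as in the FIFO queue of FIG. 6.
frame_buffer = deque(maxlen=MAX_FRAMES)
for frame_index in range(8):          # simulate a streaming input
    frame_buffer.append(frame_index)  # index stands in for audio frame data

print(list(frame_buffer))  # only the 5 most recent frames remain
```

The `maxlen` argument does the push-out behavior automatically, which is why a deque is a natural fit for this kind of rolling audio window.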
  • the processor 160 may monitor whether a start word according to the user's utterance is included in each frame of the streaming sound signal received and stored in units of consecutive frames as described above. Based on the monitoring, for example, as described in step 402, when it is detected that the utterance score in a specific signal frame is greater than the first start word threshold, the processor 160 may identify the corresponding signal frame as containing the user's speech, that is, the start word.
  • the processor 160 may identify a section of predetermined time length from the signal frame identified in step 402, for example, a time section of up to about 1 second, as the start word section. In addition, the processor 160 may identify a section of predetermined time length before the identified start word section, for example, a time section of about 1.5 seconds, as the noise characteristic confirmation section.
  • the noise characteristic confirmation section may be defined to correspond to the time obtained by subtracting the time of the start word section from the time of the entire stored sound signal, and in the present invention, the time length corresponding to the start word section and the time length of the noise characteristic confirmation section are not limited to the examples presented.
  • the processor 160 may compare the signal power of the noise characteristic confirmation section with the first threshold value, thereby identifying whether the surrounding environment is sufficiently noisy at the time of utterance, that is, whether the user's utterance is made in a noisy environment.
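A minimal sketch of this noise check might look as follows, assuming mean-square power as the "signal power" measure and a hypothetical first threshold value (neither is fixed by the text; the 1.5-second window follows the example above):

```python
import numpy as np

NOISE_THRESHOLD = 0.01  # first threshold value; hypothetical power level

def is_noisy_environment(samples, sample_rate, start_word_onset, check_sec=1.5):
    """Compare the power of the noise-characteristic confirmation section
    (the window just before the start word section) with the first threshold.

    Mean-square power is an assumption; the patent only says 'signal power'.
    """
    begin = max(0, int((start_word_onset - check_sec) * sample_rate))
    end = int(start_word_onset * sample_rate)
    section = np.asarray(samples[begin:end], dtype=float)
    if section.size == 0:
        return False
    return float(np.mean(section ** 2)) > NOISE_THRESHOLD
```

A loud ambient section before the start word yields `True` (noisy environment, step 404 follows), while near-silence yields `False` (the quiet-environment branch, steps 410-411).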
  • in step 403, if it is identified that the noise characteristic of the sound signal, that is, the signal power, is greater than the first threshold value, the processor 160 may identify whether the utterance characteristic of the sound signal is greater than a predetermined SNR threshold, that is, the second threshold value ( 404 ).
  • the speech characteristic may include a signal to noise ratio (SNR) of a sound signal.
  • the processor 160 calculates a posteriori SNR (SNRpost) corresponding to the ratio of the total sound signal to the noise as the speech characteristic of the sound signal, and may identify, using Equation 2 below, whether the calculated posterior SNR is greater than the predetermined second threshold value, that is, the SNR threshold.
  • the posterior SNR (SNRpost) may be calculated using Equations 3 and 4 below.
  • in Equations 3 and 4, X(p,k) represents the total sound signal including noise, S(p,k) represents the speech signal, and N(p,k) represents the noise signal, respectively.
  • the received input sound signal (voice signal) X may be expressed as the sum of the k-th spectral elements for each frame p of the speech element S and the noise element N, as shown in Equation 3, that is, X(p,k) = S(p,k) + N(p,k).
  • the posterior SNR (SNRpost) may be calculated, as expressed in Equation 4, as the ratio of the total sound signal X(p,k) including the noise to the magnitude of the noise N(p,k) for each frame p.
  • the final posterior SNR for all frames may be calculated as an average value of posterior SNRs for each frame p.
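Since Equations 2 to 4 themselves are not reproduced in this text, the sketch below assumes the conventional power-ratio form of the a posteriori SNR and averages it over frames as the description states; the array shapes and function names are illustrative only:

```python
import numpy as np

def posterior_snr(X, N):
    """Final posterior SNR over all frames.

    X[p, k]: k-th spectral element of the total sound signal per frame p
    N[p, k]: k-th spectral element of the noise estimate per frame p
    Per Equation 3, X(p,k) = S(p,k) + N(p,k). The per-frame ratio below is
    the usual a posteriori SNR (total-signal power over noise power), an
    assumed stand-in for Equation 4; the final value is the mean over the
    frames p, as the text describes.
    """
    X = np.asarray(X, dtype=float)
    N = np.asarray(N, dtype=float)
    per_frame = np.sum(np.abs(X) ** 2, axis=1) / np.sum(np.abs(N) ** 2, axis=1)
    return float(np.mean(per_frame))

def satisfies_snr_condition(snr_post, snr_threshold=4.0):
    # 4.0 is the example initial SNR threshold given in the text (step 404).
    return snr_post > snr_threshold
```

With this convention a louder utterance over the same noise floor yields a larger posterior SNR, which is what makes the "greater than the SNR threshold" test of step 404 meaningful.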
  • the processor 160 determines the final posterior SNR calculated in this way as the speech characteristic of the sound signal in step 404, and may compare the speech characteristic with the second threshold value (SNR Threshold) to identify whether the user's speech is sufficiently loud in the noisy environment.
  • the second threshold value, that is, the SNR threshold, is a predetermined value corresponding to a level at which an input sound signal can be recognized as generated by the user's speech in a noisy environment, and its initial value may be set in advance.
  • the initial SNR threshold may be set to, for example, 4, but is not limited thereto.
  • in step 403, if the electronic device 10 operates in a sufficiently noisy environment (YES in step 403), misrecognition may occur in step 402, in which a sound signal including ambient noise, not the user's actual speech, is regarded as including the starting word.
  • accordingly, when the value of the noise characteristic of the signal is greater than the first threshold value, the utterance characteristic of the sound signal is further compared with the second threshold value (initial SNR threshold) in step 404 .
  • that is, in step 404 it is further determined whether the user's utterance is sufficiently loud in the noisy environment, and based on the result, execution of a trigger for performing a voice recognition operation can be controlled.
  • in step 404, when it is identified that the value of the utterance characteristic of the sound signal is greater than the predetermined second threshold value, for example, the initial SNR threshold, the processor 160 executes a trigger and controls the electronic device 10 to perform a voice recognition operation on the user's utterance based on the received sound signal ( 405 ).
  • for example, if the final posterior SNR calculated in step 404 is 5, which is greater than the initial SNR threshold of 4, the trigger may be executed.
  • that is, if it is identified that the user's speech is sufficiently loud (YES in step 404), the processor 160 immediately executes the trigger, thereby activating the voice recognition function in the electronic device 10 so that an operation can be performed in response to the received sound signal.
  • the processor 160 may adjust the second threshold upward from the predetermined initial SNR threshold ( 406 ).
  • that is, the processor 160 may reset the second threshold value to reflect the noise environment, as a change in the surrounding environment, after executing the trigger in step 405 .
  • for example, the processor 160 may derive a new second threshold value (SNR Threshold) according to Equation 5 below, using the initial SNR threshold (SNRTh_init) and the posterior SNR (SNRpost) calculated in step 404.
  • the second threshold value adjusted upward as described above becomes the value applied in step 404 to the corresponding sound signal when the next sound signal is received.
  • by adjusting the second threshold value (SNR threshold), which is the trigger execution condition, upward in this way, the user can be induced to utter in a correspondingly louder voice in the noisy environment.
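Equation 5 is not reproduced in this text, so the sketch below stands in with a simple blend of the initial threshold and the measured posterior SNR (the `alpha` weight is a pure assumption); it only aims to mirror the described behavior, where a posterior SNR above the initial threshold raises the new threshold (step 406) and one below it lowers the threshold (step 411):

```python
SNR_TH_INIT = 4.0  # example initial SNR threshold from the description

def adjust_snr_threshold(snr_post, snr_th_init=SNR_TH_INIT, alpha=0.5):
    """Derive a new second threshold value (SNR Threshold) from the initial
    threshold and the calculated posterior SNR.

    The linear blend below is an assumed stand-in for Equation 5, which is
    not reproduced in the text; `alpha` is a hypothetical smoothing weight.
    """
    return (1 - alpha) * snr_th_init + alpha * snr_post

# Loud utterance in noise (posterior SNR 5, step 406): threshold rises to 4.5.
# Quiet surroundings (posterior SNR 2, step 411): threshold falls to 3.0.
```

Whatever the exact form of Equation 5, the key property is this direction of movement: the reset threshold tracks the environment measured from the most recent utterance.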
  • when it is identified in step 404 that the value of the utterance characteristic of the sound signal is less than or equal to the predetermined second threshold value, for example, the initial SNR threshold, the processor 160 may further identify whether the corresponding sound signal satisfies the second activation condition ( 407 ).
  • using Equation 6 below, when the utterance score derived as the similarity between the waveform of the sound signal received in step 401 and the start word pattern is greater than a predetermined second start word threshold (WUW Threshold2), that is, the fourth threshold value, the processor 160 may identify that the sound signal satisfies the second activation condition.
  • the second starting word threshold WUW Threshold2 (fourth threshold value) is for identifying whether the sound signal satisfies the second activation condition for the voice recognition function, and is applied when the user's speech is made in a noisy environment (YES in step 403).
  • the second starting word threshold WUW Threshold2 (the fourth threshold) is set to a value greater than the first starting word threshold WUW Threshold1 (third threshold) in step 401 as shown in Equation 7 below.
  • for example, the first starting word threshold value (third threshold value) may be preset to 0.1 and the second starting word threshold value (fourth threshold value) may be preset to 0.15, but these values are presented as examples and are not limited thereto.
  • the processor 160 may determine that the sound signal input in step 401 satisfies the second activation condition when it is identified that the utterance score is greater than the second starting word threshold by Equation (6).
  • if it is determined in step 407 that the sound signal satisfies the second activation condition, the processor 160 executes the trigger and controls the electronic device 10 to perform a voice recognition operation related to the user's utterance based on the received sound signal ( 408 ).
  • if it is determined in step 407 that the sound signal does not satisfy the second activation condition, that is, if the utterance score is identified as being equal to or less than the second starting word threshold by Equation 6, the processor 160 does not execute the trigger, and the electronic device 10 is controlled to keep voice recognition inactive ( 409 ).
  • in other words, in a noisy environment (YES in step 403), even if the similarity between the waveform of the input sound signal and the pattern of the start word signal is identified to be greater than the first starting word threshold, so that the input sound signal meets the first activation condition (YES in step 402), the voice recognition function is activated only when the sound signal also satisfies the second activation condition (YES in step 407); that is, control based on a two-stage activation condition is performed.
  • since, in step 407, the speech recognition function is activated only when the utterance score indicating the similarity according to the pattern matching between the sound signal and the starting word signal is greater than the second starting word threshold, even if a sound signal including ambient noise other than the user's actual speech is erroneously recognized in step 402 as including the starting word, the possibility of an erroneous operation is reduced by step 407 .
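Pulling the steps of FIG. 4 together, the decision flow described above can be sketched as one function (the numeric thresholds are the example values discussed in the text, except the noise threshold, which is hypothetical; the step numbers are returned only to make the branches easy to follow):

```python
def run_trigger_decision(score, noise_power, snr_post, *,
                         wuw_th1=0.10, wuw_th2=0.15,
                         noise_th=0.01, snr_th=4.0):
    """Return (step taken, trigger executed) for one received sound signal."""
    if score <= wuw_th1:          # step 402: first activation condition fails
        return 412, False         # keep voice recognition inactive
    if noise_power <= noise_th:   # step 403: surroundings are not noisy
        return 410, True          # execute trigger (SNR threshold lowered, 411)
    if snr_post > snr_th:         # step 404: utterance loud enough in noise
        return 405, True          # execute trigger (SNR threshold raised, 406)
    if score > wuw_th2:           # step 407: second activation condition
        return 408, True          # execute trigger despite low SNR
    return 409, False             # keep voice recognition inactive
```

For example, a score of 0.12 in a noisy environment with posterior SNR 2 reaches step 409 (no trigger), while the same signal with a score of 0.16 reaches step 408 (trigger executed).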
  • in step 403, when it is identified that the noise characteristic of the sound signal, that is, the signal power, is equal to or less than the first threshold value, the processor 160 executes the trigger and controls the electronic device 10 to perform a voice recognition operation ( 410 ).
  • the processor 160 may adjust the second threshold downward from a predetermined initial SNR threshold ( 411 ).
  • that is, the electronic device 10 determines that the surroundings are not noisy, and the second threshold value may be reset to reflect this.
  • the processor 160 calculates a post SNR (SNR post ) corresponding to the noise ratio to the total sound signal as the speech characteristic of the sound signal as described in step 404, and according to Equation 5 above, A new second threshold (SNR Threshold) may be derived using the calculated post SNR and the initial SNR threshold.
  • In this case, the calculated posterior SNR is derived to be smaller than in the case of step 404, and may be, for example, 2.
  • Accordingly, the new second threshold (SNR Threshold) is 4.16*log_4.
  • In other words, since the second threshold (SNR threshold) serving as the trigger execution condition is lowered to correspond to the case where the environment is not noisy, immediate voice recognition is enabled even when the user utters a quiet sound.
  • Meanwhile, if it is determined in step 402 that the sound signal does not satisfy the first activation condition, that is, if the utterance score indicating the similarity between the sound signal and the start-word signal by Equation 1 is identified as being equal to or less than the first start-word threshold, the processor 160 does not execute the trigger, and the electronic device 10 may be controlled to keep voice recognition inactive (412).
  • In the electronic device 10 as described above, when the signal-to-noise ratio (SNR), which is the utterance characteristic of the sound signal in a noisy environment, is greater than the SNR threshold, that is, when the user's utterance is sufficiently loud compared to the noise, the SNR threshold is adjusted upward; this induces the user to utter the start word loudly in a noisy environment, so an effect of improving operation accuracy can be expected.
  • Also, when the surrounding environment is not noisy, the electronic device 10 adjusts the SNR threshold downward, enabling immediate operation of the electronic device 10 in response to the environmental change in a quiet environment.
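The two-stage control flow summarized in the steps above can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the function name, the dictionary layout of the thresholds, the concrete threshold values, and the simple additive threshold-adjustment step are all assumptions made for the sketch.

```python
def wake_word_trigger(utterance_score, noise_power, snr_post, state):
    """Two-stage trigger decision sketch (steps 402-412 of FIG. 4).

    utterance_score: pattern-matching similarity between sound and start word
    noise_power:     noise characteristic measured before the start word
    snr_post:        posterior SNR (utterance characteristic) of the signal
    state:           dict holding the adjustable thresholds (assumed shape)
    """
    # Step 402: first activation condition (first start-word threshold).
    if utterance_score <= state["word_threshold_1"]:
        return False  # step 412: keep voice recognition inactive

    # Step 403: noise check against the first threshold.
    if noise_power <= state["noise_threshold_1"]:
        # Quiet environment: trigger immediately (step 410) and
        # lower the SNR threshold toward its initial value (step 411).
        state["snr_threshold"] = max(
            state["snr_threshold_init"], state["snr_threshold"] - 1.0)
        return True

    # Noisy environment: second-stage conditions apply.
    # SNR check against the (possibly raised) second threshold.
    if snr_post > state["snr_threshold"]:
        state["snr_threshold"] += 1.0  # adjust upward (assumed step size)
        return True

    # Step 407: stricter second start-word threshold as a fallback.
    return utterance_score > state["word_threshold_2"]
```

A caller would keep the `state` dictionary across utterances, so that the SNR threshold adapts as the environment changes.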

Abstract

The present invention relates to an electronic device and a control method therefor. The electronic device comprises: a sound reception unit; and a processor that, when a noise characteristic acquired from a sound signal received via the sound reception unit is greater than a first threshold value and an utterance characteristic acquired from the sound signal is greater than a second threshold value, performs a recognition operation for a user utterance on the basis of the sound signal and adjusts the second threshold value upward.

Description

Electronic device and control method therefor
The present invention relates to an electronic device and a control method therefor, and more particularly, to an electronic device for processing a voice uttered by a user and a control method therefor.
Electronic devices such as artificial intelligence (AI) speakers, mobile devices such as smartphones or tablets, and smart TVs can recognize a voice uttered by a user and perform a function according to the voice recognition.
The electronic device may operate to activate the voice recognition function by recognizing that a predetermined start word, that is, a trigger word, has been input from the user.
Start-word recognition may include a process of determining the similarity between the audio signal of the user's voice and the start word; for example, when the degree of similarity between the pattern of the audio signal and the start word is equal to or greater than a predetermined criterion, the input voice may be identified as including the start word.
In the start-word recognition process described above, misrecognition may occur due to the influence of the surrounding environment of the electronic device, such as noise, so attempts are being made to improve the accuracy of start-word recognition.
An object of the present invention is to provide an electronic device capable of receiving and processing a user's voice, and a control method therefor, in which the threshold for identifying the utterance characteristic of a sound signal is reset depending on whether the device is in a noisy environment, in correspondence with the user's utterance characteristic, so that the accuracy of start-word recognition is improved.
An electronic device according to an embodiment of the present invention includes: a sound receiver; and a processor that, when a value indicating a noise characteristic of a sound signal received through the sound receiver is greater than a first threshold and a value indicating an utterance characteristic of the sound signal is greater than a second threshold, performs a recognition operation regarding a user utterance based on the sound signal and adjusts the second threshold upward.
The utterance characteristic may include a signal-to-noise ratio of the sound signal.
The processor may calculate, for each frame of the sound signal, the magnitude ratio of noise to the sound signal, and determine the average of the calculated per-frame ratios as the value of the utterance characteristic.
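A minimal sketch of the per-frame computation described above is shown below. The frame length and the noise estimate (here, the quietest frame's energy as a stand-in noise floor) are assumptions for illustration; the disclosure does not fix a particular estimator.

```python
import numpy as np

def utterance_characteristic(signal, frame_len=256):
    """Average per-frame noise-to-signal magnitude ratio (illustrative).

    The noise magnitude per frame is approximated here by the energy of
    the quietest frame; this estimator is an assumption of the sketch.
    """
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    frame_energy = np.mean(frames ** 2, axis=1)   # per-frame signal power
    noise_energy = np.min(frame_energy)           # assumed noise floor
    ratios = noise_energy / np.maximum(frame_energy, 1e-12)
    return float(np.mean(ratios))                 # value of the characteristic
```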
The processor may identify whether the sound signal includes a predefined start word, and identify whether the noise characteristic of the sound signal identified as including the start word is greater than the first threshold.
The processor may identify whether the sound signal includes the start word based on the similarity between the waveform of the sound signal and a predefined start-word pattern.
The similarity threshold may be preset based on a learning algorithm using an acoustic model.
When the value of the utterance characteristic of a sound signal whose similarity is greater than a third threshold is equal to or less than the second threshold, the processor may perform the recognition operation regarding the user utterance based on a sound signal whose similarity satisfies a fourth threshold greater than the third threshold.
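The fallback from the SNR condition to the stricter fourth similarity threshold described above can be sketched as follows; the function name and the concrete threshold values are placeholders, not values from the disclosure.

```python
def recognition_allowed(similarity, snr_value,
                        threshold_3=0.5, threshold_4=0.8,
                        snr_threshold=4.0):
    """Decide whether recognition proceeds (illustrative thresholds).

    If the similarity clears the third threshold but the utterance
    characteristic (SNR) does not clear the SNR threshold, recognition
    is allowed only when the similarity also clears the larger fourth
    threshold.
    """
    if similarity <= threshold_3:
        return False                 # start word not detected at all
    if snr_value > snr_threshold:
        return True                  # normal path: SNR condition met
    return similarity > threshold_4  # fallback: stricter pattern match
```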
The processor may identify whether the value of the noise characteristic of the sound signal received in a section of a predefined time length before the section including the start word is greater than the first threshold.
The processor may compare the power value of the sound signal received in the section of the predefined time length with the first threshold.
When the value of the noise characteristic is equal to or less than the first threshold, the processor may adjust the second threshold downward.
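The pre-start-word noise check and the downward adjustment described in the preceding paragraphs can be sketched as below; the window length, the power measure, and the adjustment step are assumptions of the sketch, not the disclosure's definitions.

```python
import numpy as np

def noise_check(signal, start_idx, state, window=1600):
    """Compare the power of the window preceding the start word with the
    first threshold, and adjust the second (SNR) threshold accordingly.

    signal:    1-D array of samples; start_idx marks where the start word begins
    state:     dict with 'noise_threshold_1', 'snr_threshold', 'snr_threshold_init'
    window:    assumed pre-roll length (e.g. 100 ms at 16 kHz)
    """
    pre = signal[max(0, start_idx - window):start_idx]
    noise_power = float(np.mean(pre ** 2)) if len(pre) else 0.0

    if noise_power <= state["noise_threshold_1"]:
        # Quiet surroundings: relax the SNR threshold toward its initial value.
        state["snr_threshold"] = max(
            state["snr_threshold_init"], state["snr_threshold"] - 1.0)
        return False  # not a noisy environment
    return True       # noisy environment: second-stage conditions apply
```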
Meanwhile, a control method of an electronic device according to an embodiment of the present invention includes: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring an utterance characteristic from the sound signal; and, when a value indicating the noise characteristic is greater than a first threshold and a value indicating the utterance characteristic is greater than a second threshold, performing a recognition operation regarding a user utterance based on the sound signal and adjusting the second threshold upward.
The utterance characteristic may include a signal-to-noise ratio of the sound signal.
The method may further include calculating, for each frame of the sound signal, the magnitude ratio of noise to the sound signal, and determining the average of the calculated per-frame ratios as the value of the utterance characteristic.
The method may further include: identifying whether the sound signal includes a predefined start word; and identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold.
The identifying of whether the start word is included may include identifying whether the sound signal includes the start word based on the similarity between the waveform of the sound signal and a predefined start-word pattern.
The similarity threshold may be preset based on a learning algorithm using an acoustic model.
The method may further include, when the value of the utterance characteristic of a sound signal whose similarity is greater than a third threshold is equal to or less than the second threshold, performing the recognition operation regarding the user utterance based on a sound signal whose similarity satisfies a fourth threshold greater than the third threshold.
The method may further include identifying whether the value of the noise characteristic of the sound signal received in a section of a predefined time length before the section identified as including the start word is greater than the first threshold.
The method may further include, when the value of the noise characteristic is equal to or less than the first threshold, adjusting the second threshold downward.
Meanwhile, in a recording medium storing a computer program including computer-readable code for performing a control method of an electronic device according to an embodiment of the present invention, the control method includes: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring an utterance characteristic from the sound signal; and, when a value indicating the noise characteristic is greater than a first threshold and a value indicating the utterance characteristic is greater than a second threshold, performing a recognition operation regarding a user utterance based on the sound signal and adjusting the second threshold upward.
According to the electronic device and control method of the present invention as described above, by resetting the threshold for identifying the user's utterance characteristic with respect to the sound signal in a noisy environment, the user is induced to utter the start word loudly, and an effect of improving operation accuracy can be expected.
In addition, according to the electronic device and control method of the present invention, the occurrence of malfunctions in which the electronic device erroneously recognizes, as the start word, a sound signal containing ambient noise rather than the user's actual utterance in a noisy environment is reduced, so that the accuracy of start-word recognition is improved.
FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of an electronic device according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining identification of the noise characteristic of an electronic device according to an embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals or symbols refer to components performing substantially the same function, and the size of each component in the drawings may be exaggerated for clarity and convenience of description. However, the technical spirit of the present invention and its core configuration and operation are not limited to the configurations or operations described in the following embodiments. In describing the present invention, when it is determined that a detailed description of a known technology or configuration related to the present invention may unnecessarily obscure the gist of the present invention, the detailed description is omitted.
In the embodiments of the present invention, terms including ordinal numbers such as first and second are used only for the purpose of distinguishing one component from another, and singular expressions include plural expressions unless the context clearly indicates otherwise. In addition, in the embodiments of the present invention, terms such as 'comprise', 'include', and 'have' should be understood as not precluding in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Also, in the embodiments of the present invention, a 'module' or 'unit' performs at least one function or operation, may be implemented as hardware or software or a combination of hardware and software, and may be integrated into at least one module. Further, in the embodiments of the present invention, at least one of a plurality of elements refers not only to all of the plurality of elements, but also to each one of them, or any combination thereof, excluding the rest.
FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
In one embodiment, as shown in FIG. 1, the voice recognition system includes an electronic device 10 capable of receiving a sound signal, that is, a voice uttered by a user as sound, and a server 20 capable of communicating with the electronic device 10 through a network.
The electronic device 10 may receive a voice uttered by a user (hereinafter also referred to as a user voice), process a sound signal corresponding to the voice, and perform a corresponding operation.
In one embodiment, as an operation corresponding to the received voice, the electronic device 10 may provide audio content to the user by outputting a sound corresponding to the processing result of the user voice through the output unit (110 of FIG. 2). The electronic device 10 may be provided with at least one loudspeaker as the output unit 110 capable of outputting sound, and in the present invention the number, shape, and installation location of the speakers provided in the electronic device 10 are not limited.
In one embodiment, the electronic device 10 may be provided with a sound receiver (120 of FIG. 3) capable of receiving a sound signal as a user voice. The sound receiver 120 may be implemented as at least one microphone, and the number, shape, and installation location of the microphones provided in the electronic device 10 are not limited.
The implementation form of the electronic device 10 is not limited; for example, as shown in FIG. 1, it may be implemented as any of various devices capable of receiving a sound signal, such as an artificial intelligence speaker (hereinafter also referred to as an AI speaker or smart speaker) 10a, a display device 10b including a television such as a smart TV, or a mobile device 10c such as a smartphone or tablet.
The electronic device 10 implemented as the AI speaker 10a may receive a voice from a user and perform various functions, such as listening to music and searching for information, through voice recognition of the received voice. By utilizing the voice recognition function and the cloud, the AI speaker is not a device that simply outputs sound, but may be implemented as a device with a built-in virtual assistant/voice assistant capable of interaction with the user, providing the user with various services. In this case, an application for the AI speaker function may be installed and run on the electronic device 10.
The electronic device 10 implemented as the display device 10b processes an image signal provided from an external signal source, that is, an image source, according to a preset process and displays it as an image.
In one embodiment, the display device 10b includes a television (TV) capable of processing a broadcast signal based on at least one of a broadcast signal, broadcast information, or broadcast data provided from the transmission equipment of a broadcasting station, and displaying it as an image.
Since the type of image source providing content is not limited in the present invention, the display device 10b may receive an image signal from, for example, a set-top box; an optical disc playback device such as a Blu-ray or DVD (digital versatile disc) player; a computer (PC) including a desktop or laptop; a console game machine; or a mobile device including a smart pad such as a smartphone or tablet.
When the display device 10b is a television, the display device 10b may wirelessly receive a radio frequency (RF) signal, that is, a broadcast signal, transmitted from a broadcasting station; for this purpose, an antenna for receiving the broadcast signal and a tuner for tuning the broadcast signal for each channel may be provided.
In the display device 10b, the broadcast signal may be received through terrestrial waves, cable, satellite, or the like, and the signal source is not limited to an external device or a broadcasting station. That is, any device or station capable of transmitting and receiving data may be included in the image source of the present invention.
The standard of the signal received by the display device 10b may be configured in various ways corresponding to the implementation form of the device; for example, corresponding to the implementation form of the interface unit (140 of FIG. 2) described later, the display device 10b may receive, by wire as image content, signals conforming to standards such as HDMI (High Definition Multimedia Interface), HDMI-CEC (Consumer Electronics Control), display port (DP), DVI (Digital Visual Interface), composite video, component video, super video, Thunderbolt, RGB cable, SCART (Syndicat des Constructeurs d'Appareils Radiorecepteurs et Televiseurs), and USB.
The display device 10b may also receive image content through wired or wireless network communication from a server or the like provided for content provision, and the type of communication is not limited. For example, corresponding to the implementation form of the interface unit 140 described later, the display device 10b may receive, as image content through wireless network communication, signals conforming to standards such as Wi-Fi, Wi-Fi Direct, Bluetooth, Bluetooth low energy, Zigbee, UWB (Ultra-Wideband), and NFC (Near Field Communication). As another example, the display device 10b may receive a content signal through wired network communication such as Ethernet.
In one embodiment, the display device 10b may serve as an AP that allows various peripheral devices, such as a smartphone, to perform wireless communication.
The display device 10b may receive content provided in the form of a file according to real-time streaming through the wired or wireless network described above.
In addition, the display device 10b may process signals so as to display on the screen moving images, still images, applications, an on-screen display (OSD), and a user interface (UI, hereinafter also referred to as a graphic user interface (GUI)) for controlling various operations, based on signals/data stored in internal/external storage media.
In one embodiment, the display device 10b can operate as a smart TV or an IP TV (Internet Protocol TV). A smart TV is a television that can receive and display broadcast signals in real time and, having a web browsing function, allows various content to be searched and consumed through the Internet simultaneously with the display of real-time broadcast signals, providing a convenient user environment for this purpose. In addition, since the smart TV includes an open software platform, it can provide interactive services to the user. Accordingly, the smart TV can provide the user with various content, for example applications providing predetermined services, through the open software platform. These applications can provide various types of services and include, for example, applications providing services such as SNS, finance, news, weather, maps, music, movies, games, and e-books.
In one embodiment, an application for providing a voice recognition function may be installed on the display device 10b.
When the electronic device 10 is the display device 10b or the mobile device 10c, the electronic device 10 may be provided with a display capable of displaying an image. The implementation method of the display is not limited and may include, for example, various display types such as liquid crystal, plasma, light-emitting diode, organic light-emitting diode, surface-conduction electron-emitter, carbon nano-tube, and nano-crystal.
The electronic device 10 may communicate with various external devices, including the server 20, through the interface unit 140.
In the present invention, since the communication method between the electronic device 10 and an external device is not limited, the electronic device 10 is implemented to communicate with external devices through various types of wired or wireless connection (for example, Bluetooth, Wi-Fi, or Wi-Fi Direct).
The server 20 is provided to perform wired or wireless communication with the electronic device 10. The server 20 may be implemented, for example, in a cloud type, and may store and manage user accounts of the electronic device 10 and/or additional devices associated with the electronic device 10 (for example, a smartphone on which a corresponding application is installed to interwork with an AI speaker).
The implementation form of the server 20 is not limited; for example, it may be implemented as an STT (Speech to Text) server that converts a voice-related sound signal into text, or as a main server for voice recognition that also performs the function of the STT server. In addition, a plurality of servers 20 may be provided, such as an STT server and a main server, so that the electronic device 10 may communicate with the plurality of servers.
In one embodiment, the server 20 may be provided with a database (DB) in which data, that is, information, for recognizing a voice uttered by a user is stored. The database may include, for example, a plurality of acoustic models predetermined by modeling the signal characteristics of voice. In addition, the database may further include a language model predetermined by modeling linguistic order relations, such as of words or syllables corresponding to the recognition-target vocabulary. The acoustic model and/or the language model may be constructed by performing learning in advance.
The electronic device 10 accesses the server 20 through a wired or wireless network and accesses its database, thereby identifying and processing the received user voice and outputting the processing result through sound or an image.
도 2는 본 발명 일 실시예에 따른 전자장치의 구성을 도시한 블록도이다.FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
도 2에 도시된 바와 같이, 본 발명 일 실시예의 전자장치(10)는 출력부(110), 소리수신부(120), 신호처리부(161), 인터페이스부(140), 저장부(150) 및 프로세서(160)를 포함한다.As shown in FIG. 2, the electronic device 10 according to an embodiment of the present invention includes an output unit 110, a sound receiving unit 120, a signal processing unit 161, an interface unit 140, a storage unit 150, and a processor 160.
다만, 도 2에 도시된 본 발명의 일 실시예에 의한 전자장치(10)의 구성은 하나의 예시일 뿐이며, 다른 실시예에 의한 전자장치는 도 2에 도시된 구성 외에 다른 구성으로 구현될 수 있다. 즉, 본 발명의 전자장치는 도 2에 도시된 구성 외 다른 구성이 추가되거나, 혹은 도 2에 도시된 구성 중 적어도 하나가 배제된 형태로 구현될 수도 있다.However, the configuration of the electronic device 10 according to an embodiment of the present invention shown in FIG. 2 is only an example, and an electronic device according to another embodiment may be implemented with a configuration different from that shown in FIG. 2. That is, the electronic device of the present invention may be implemented in a form in which a configuration other than those shown in FIG. 2 is added, or in which at least one of the configurations shown in FIG. 2 is excluded.
출력부(110)는 음향 즉, 사운드를 출력한다. 출력부(110)는 예를 들어, 가청주파수인 20Hz 내지 20KHz 대역의 사운드를 출력 가능한 적어도 하나의 스피커를 포함할 수 있다. 출력부(110)는 복수의 채널의 오디오신호/소리신호에 대응하는 사운드를 출력할 수 있다.The output unit 110 outputs a sound, that is, a sound. The output unit 110 may include, for example, at least one speaker capable of outputting sound in an audible frequency band of 20 Hz to 20 KHz. The output unit 110 may output a sound corresponding to an audio signal/sound signal of a plurality of channels.
일 실시예에서 출력부(110)는 소리수신부(120)를 통해 수신되는 사용자음성으로서 소리 신호의 처리에 따른 사운드를 출력할 수 있다.In an embodiment, the output unit 110 may output a sound according to the processing of the sound signal as a user voice received through the sound receiving unit 120 .
소리수신부(120)는 사용자로부터 발화된 음성 즉, 음파를 수신할 수 있다.The sound receiver 120 may receive a voice uttered by a user, that is, a sound wave.
소리수신부(120)를 통해 입력된 음파는 신호변환부에 의해 전기적인 신호로 변환된다. 일 실시예에서 신호변환부는 아날로그 음파를 디지털 신호로 변환하는 AD 변환부를 포함할 수 있다. 또한, 일 실시예에서 신호변환부는 후술하는 신호처리부(161)에 포함될 수 있다.The sound wave input through the sound receiving unit 120 is converted into an electrical signal by the signal converting unit. In an embodiment, the signal converter may include an AD converter that converts analog sound waves into digital signals. In addition, in an embodiment, the signal conversion unit may be included in a signal processing unit 161 to be described later.
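For illustration only (a hypothetical sketch, not part of the original disclosure), the A/D conversion performed by the signal converter described above can be modeled as sampling and quantizing an analog waveform into signed integer codes; the bit depth and sampling rate below are assumed values:

```python
import math

def quantize(samples, bits=16):
    # Map normalized samples in [-1.0, 1.0] to signed integer codes,
    # a simplified model of the AD converter in the signal converter.
    max_code = 2 ** (bits - 1) - 1
    return [int(round(max(-1.0, min(1.0, s)) * max_code)) for s in samples]

# A 1 kHz tone sampled at 16 kHz (rates assumed for illustration)
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(160)]
codes = quantize(tone)
```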
본 발명 일 실시예에서 소리수신부(120)는 전자장치(10)에 자체적으로 마련되도록 구현된다.In an embodiment of the present invention, the sound receiver 120 is implemented to be provided in the electronic device 10 by itself.
다만, 다른 실시예에서 소리수신부(120)는 전자장치(10)에 포함되는 구성이 아닌 별도의 장치에 마련된 형태로서 구현될 수 있다.However, in another embodiment, the sound receiving unit 120 may be implemented in a form provided in a separate device rather than as a component included in the electronic device 10.
예를 들면, 전자장치(10)가 텔레비전과 같은 디스플레이장치인 경우, 사용자조작이 가능한 입력장치로서 마련되는 리모컨(remote control)에 설치된 마이크 즉, 소리수신부를 통해 사용자음성이 수신되고, 그에 대응하는 소리신호가 리모컨으로부터 전자장치(10)로 전송될 수 있다. 여기서, 리모컨의 마이크를 통해 수신된 아날로그 음파는 디지털 신호로 변환되어 전자장치(10)로 전송될 수 있다.For example, when the electronic device 10 is a display device such as a television, a user voice may be received through a microphone, that is, a sound receiver, installed in a remote control provided as a user-operable input device, and a sound signal corresponding to the voice may be transmitted from the remote control to the electronic device 10. Here, the analog sound wave received through the microphone of the remote control may be converted into a digital signal and transmitted to the electronic device 10.
일 실시예에서, 입력장치는 리모컨 어플리케이션이 설치된 스마트폰과 같은 단말장치를 포함한다.In an embodiment, the input device includes a terminal device, such as a smartphone, on which a remote control application is installed.
인터페이스부(140)는 전자장치(10)가 서버(20), 단말장치 등을 포함한 다양한 외부장치와 신호를 송신 또는 수신하도록 한다.The interface unit 140 allows the electronic device 10 to transmit or receive signals with various external devices including the server 20 and the terminal device.
인터페이스부(140)는 유선 인터페이스부(141)를 포함할 수 있다. 유선 인터페이스부(141)는 HDMI, HDMI-CEC, USB, 컴포넌트(Component), 디스플레이 포트(DP), DVI, 썬더볼트, RGB 케이블 등의 규격에 따른 신호/데이터를 송/수신하는 연결부를 포함할 수 있다. 여기서, 유선 인터페이스부(141)는 이들 각각의 규격에 대응하는 적어도 하나 이상의 커넥터, 단자 또는 포트를 포함할 수 있다.The interface unit 140 may include a wired interface unit 141. The wired interface unit 141 may include a connection unit for transmitting/receiving signals/data according to standards such as HDMI, HDMI-CEC, USB, Component, DisplayPort (DP), DVI, Thunderbolt, and RGB cable. Here, the wired interface unit 141 may include at least one connector, terminal, or port corresponding to each of these standards.
유선 인터페이스부(141)는 영상소스 등으로부터 신호를 입력받는 입력 포트를 포함하는 형태로 구현되며, 경우에 따라 출력 포트를 더 포함하여 양방향으로 신호를 송수신 가능하게 마련될 수 있다.The wired interface unit 141 may be implemented in a form including an input port for receiving a signal from an image source, etc., and may further include an output port in some cases to transmit/receive signals in both directions.
유선 인터페이스부(141)는 지상파/위성방송 등 방송규격에 따른 방송신호를 수신할 수 있는 안테나가 연결되거나, 케이블 방송 규격에 따른 방송신호를 수신할 수 있는 케이블이 연결될 수 있도록, HDMI 포트, DisplayPort, DVI 포트, 썬더볼트, 컴포지트(composite) 비디오, 컴포넌트(component) 비디오, 슈퍼 비디오(super video), SCART 등과 같이, 비디오 및/또는 오디오 전송규격에 따른 커넥터 또는 포트 등을 포함할 수 있다. 다른 예로서, 전자장치(10)는 방송신호를 수신할 수 있는 안테나를 내장할 수도 있다.The wired interface unit 141 may include a connector or port conforming to a video and/or audio transmission standard, such as an HDMI port, DisplayPort, DVI port, Thunderbolt, composite video, component video, super video, or SCART, to which an antenna capable of receiving a broadcast signal according to a broadcasting standard such as terrestrial/satellite broadcasting, or a cable capable of receiving a broadcast signal according to a cable broadcasting standard, can be connected. As another example, the electronic device 10 may have a built-in antenna capable of receiving a broadcast signal.
유선 인터페이스부(141)는 USB 포트 등과 같은 범용 데이터 전송규격에 따른 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 광 전송규격에 따라 광케이블이 연결될 수 있는 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 외부 마이크 또는 마이크를 구비한 외부 오디오기기가 연결되며, 오디오기기로부터 오디오 신호를 수신 또는 입력할 수 있는 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 헤드셋, 이어폰, 외부 스피커 등과 같은 오디오기기가 연결되며, 오디오기기로 오디오 신호를 전송 또는 출력할 수 있는 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 이더넷(Ethernet) 등과 같은 네트워크 전송규격에 따른 커넥터 또는 포트를 포함할 수 있다. 예컨대, 유선 인터페이스부(141)는 라우터 또는 게이트웨이에 유선 접속된 랜카드 등으로 구현될 수 있다.The wired interface unit 141 may include a connector or port according to a universal data transmission standard, such as a USB port. The wired interface unit 141 may include a connector or port to which an optical cable can be connected according to an optical transmission standard. The wired interface unit 141 may include a connector or port to which an external microphone or an external audio device having a microphone is connected, and which can receive or input an audio signal from the audio device. The wired interface unit 141 may include a connector or port to which an audio device such as a headset, earphone, or external speaker is connected, and which can transmit or output an audio signal to the audio device. The wired interface unit 141 may include a connector or port according to a network transmission standard such as Ethernet. For example, the wired interface unit 141 may be implemented as a LAN card connected to a router or gateway by wire.
유선 인터페이스부(141)는 상기 커넥터 또는 포트를 통해 셋탑박스, 광학미디어 재생장치와 같은 외부기기, 또는 외부 디스플레이장치나, 스피커, 서버 등과 1:1 또는 1:N(N은 자연수) 방식으로 유선 접속됨으로써, 해당 외부기기로부터 비디오/오디오 신호를 수신하거나 또는 해당 외부기기에 비디오/오디오 신호를 송신한다. 유선 인터페이스부(141)는, 비디오/오디오 신호를 각각 별개로 전송하는 커넥터 또는 포트를 포함할 수도 있다.The wired interface unit 141 may be connected by wire, through the connector or port, to an external device such as a set-top box or an optical media playback device, or to an external display device, speaker, server, or the like, in a 1:1 or 1:N (N is a natural number) manner, thereby receiving video/audio signals from the external device or transmitting video/audio signals to it. The wired interface unit 141 may include connectors or ports for transmitting video and audio signals separately.
일 실시예에서 유선 인터페이스부(141)는 전자장치(10)에 내장되나, 동글(dongle) 또는 모듈(module) 형태로 구현되어 전자장치(10)의 커넥터에 착탈될 수도 있다.In an embodiment, the wired interface unit 141 is embedded in the electronic device 10 , but may be implemented in the form of a dongle or a module to be detachably attached to the connector of the electronic device 10 .
인터페이스부(140)는 무선 인터페이스부(142)를 포함할 수 있다. 무선 인터페이스부(142)는 전자장치(10)의 구현 형태에 대응하여 다양한 방식으로 구현될 수 있다. 예를 들면, 무선 인터페이스부(142)는 통신방식으로 RF(radio frequency), 지그비(Zigbee), 블루투스(bluetooth), 와이파이(Wi-Fi), UWB(Ultra WideBand) 및 NFC(Near Field Communication) 등 무선통신을 사용할 수 있다.The interface unit 140 may include a wireless interface unit 142. The wireless interface unit 142 may be implemented in various ways corresponding to the implementation form of the electronic device 10. For example, the wireless interface unit 142 may use wireless communication methods such as RF (radio frequency), Zigbee, Bluetooth, Wi-Fi, UWB (Ultra WideBand), and NFC (Near Field Communication).
무선 인터페이스부(142)는 다양한 종류의 통신 프로토콜에 대응하는 무선 통신모듈(S/W module, chip 등)을 포함하는 통신회로(communication circuitry)로서 구현될 수 있다.The wireless interface unit 142 may be implemented as a communication circuitry including a wireless communication module (S/W module, chip, etc.) corresponding to various types of communication protocols.
일 실시예에서 무선 인터페이스부(142)는 무선랜유닛을 포함한다. 무선랜유닛은 프로세서(160)의 제어에 따라 억세스 포인트(access point, AP)를 통해 무선으로 외부장치와 연결될 수 있다. 무선랜유닛은 와이파이 모듈을 포함한다.In an embodiment, the wireless interface unit 142 includes a wireless LAN unit. The wireless LAN unit may be wirelessly connected to an external device through an access point (AP) under the control of the processor 160 . The wireless LAN unit includes a WiFi module.
일 실시예에서 무선 인터페이스부(142)는 억세스 포인트 없이 무선으로 전자장치(10)와 외부장치 사이에 1 대 1 다이렉트 통신을 지원하는 무선통신모듈을 포함한다. 무선통신모듈은 와이파이 다이렉트, 블루투스, 블루투스 저에너지 등의 통신방식을 지원하도록 구현될 수 있다. 전자장치(10)가 외부장치와 다이렉트로 통신을 수행하는 경우, 저장부(150)에는 통신 대상 기기인 외부장치에 대한 식별정보(예를 들어, MAC address 또는 IP address)가 저장될 수 있다.In an embodiment, the wireless interface unit 142 includes a wireless communication module that supports one-to-one direct communication between the electronic device 10 and an external device wirelessly without an access point. The wireless communication module may be implemented to support communication methods such as Wi-Fi Direct, Bluetooth, and Bluetooth low energy. When the electronic device 10 directly communicates with the external device, the storage unit 150 may store identification information (eg, MAC address or IP address) on the external device, which is a communication target device.
본 발명 일 실시예에 따른 전자장치(10)에서, 무선 인터페이스부(142)는 성능에 따라 무선랜유닛과 무선통신모듈 중 적어도 하나에 의해 외부장치와 무선 통신을 수행하도록 마련된다.In the electronic device 10 according to an embodiment of the present invention, the wireless interface unit 142 is provided to perform wireless communication with an external device by at least one of a wireless LAN unit and a wireless communication module according to performance.
다른 실시예에서 무선 인터페이스부(142)는 LTE와 같은 이동통신, 자기장을 포함하는 EM 통신, 가시광통신 등의 다양한 통신방식에 의한 통신모듈을 더 포함할 수 있다.In another embodiment, the wireless interface unit 142 may further include communication modules using various communication methods, such as mobile communication such as LTE, EM communication including a magnetic field, and visible light communication.
무선 인터페이스부(142)는 네트워크 상의 서버와 무선 통신함으로써, 서버와의 사이에 데이터 패킷을 송수신할 수 있다.The wireless interface unit 142 may transmit and receive data packets to and from the server by wirelessly communicating with the server on the network.
무선 인터페이스부(142)는 적외선 통신규격에 따라 IR(Infrared) 신호를 송신 및/또는 수신할 수 있는 IR송신부 및/또는 IR수신부를 포함할 수 있다. 무선 인터페이스부(142)는 IR송신부 및/또는 IR수신부를 통해 리모컨 또는 다른 외부기기로부터 리모컨신호를 수신 또는 입력하거나, 다른 외부기기로 리모컨신호를 전송 또는 출력할 수 있다. 다른 예로서, 전자장치(10)는 와이파이(Wi-Fi), 블루투스(bluetooth) 등 다른 방식의 무선 인터페이스부(142)를 통해 리모컨 또는 다른 외부기기와 리모컨신호를 송수신할 수 있다.The wireless interface unit 142 may include an IR transmitter and/or an IR receiver capable of transmitting and/or receiving an IR (Infrared) signal according to an infrared communication standard. The wireless interface unit 142 may receive or input a remote control signal from the remote control or other external device through the IR transmitter and/or the IR receiver, or may transmit or output a remote control signal to another external device. As another example, the electronic device 10 may transmit/receive a remote control signal to and from the remote control or other external device through the wireless interface unit 142 of another method such as Wi-Fi or Bluetooth.
전자장치(10)는 인터페이스부(140)를 통해 수신하는 비디오/오디오신호가 방송신호인 경우, 수신된 방송신호를 채널 별로 튜닝하는 튜너(tuner)를 더 포함할 수 있다.When the video/audio signal received through the interface unit 140 is a broadcast signal, the electronic device 10 may further include a tuner for tuning the received broadcast signal for each channel.
일 실시예에서 무선 인터페이스부(142)는 소리수신부(120)를 통해 수신된 사용자음성의 정보로서 소정 데이터를 외부장치 즉, 서버(20)로 전송할 수 있다. 여기서, 전송되는 데이터의 형태/종류는 한정되지 않으며, 예를 들면, 사용자로부터 발화된 음성에 대응하는 오디오신호나, 오디오신호로부터 추출된 음성특징 등을 포함할 수 있다.In an embodiment, the wireless interface unit 142 may transmit predetermined data as information of the user's voice received through the sound receiving unit 120 to an external device, that is, the server 20 . Here, the form/type of the transmitted data is not limited, and for example, an audio signal corresponding to a voice uttered by a user or a voice characteristic extracted from the audio signal may be included.
또한, 무선 인터페이스부(142)는 서버(20)로부터 해당 사용자음성의 처리 결과의 데이터를 수신할 수 있다. 전자장치(10)는 수신된 데이터에 기초하여, 음성 처리결과에 대응하는 사운드를 출력부(110)를 통해 출력하게 된다.Also, the wireless interface unit 142 may receive data resulting from the processing of the user voice from the server 20. Based on the received data, the electronic device 10 outputs a sound corresponding to the voice processing result through the output unit 110.
다만, 상기한 실시예는 예시로서, 사용자음성을 서버(20)로 전송하지 않고, 전자장치(10) 내에서 자체적으로 처리할 수도 있다. 즉, 다른 실시예에서 전자장치(10)가 STT 서버의 역할을 수행하도록 구현 가능하다.However, the above-described embodiment is an example, and the user's voice may not be transmitted to the server 20 , but may be processed by itself within the electronic device 10 . That is, in another embodiment, the electronic device 10 can be implemented to perform the role of the STT server.
전자장치(10)는 무선 인터페이스부(142)를 통해 리모컨과 같은 입력장치와 통신을 수행하여, 입력장치로부터 사용자음성에 대응하는 소리 신호를 수신할 수 있다.The electronic device 10 may communicate with an input device such as a remote control through the wireless interface unit 142 to receive a sound signal corresponding to the user's voice from the input device.
일 실시예의 전자장치(10)에서, 서버(20)와 통신하는 통신모듈과 리모컨과 통신하는 통신모듈은 서로 다를 수 있다. 예를 들어, 전자장치(10)는, 서버(20)와 이더넷 모뎀 또는 와이파이 모듈을 통해 통신을 수행하고, 리모컨과 블루투스 모듈을 통해 통신을 수행할 수 있다.In the electronic device 10 of an embodiment, the communication module communicating with the server 20 and the communication module communicating with the remote control may be different from each other. For example, the electronic device 10 may communicate with the server 20 through an Ethernet modem or a Wi-Fi module, and may communicate with the remote control through a Bluetooth module.
다른 실시예의 전자장치(10)에서, 서버(20)와 통신하는 통신모듈과 리모컨과 통신하는 통신모듈은 같을 수 있다. 예를 들어, 전자장치(10)는 블루투스 모듈을 통해 서버(20) 및 리모컨과 통신을 수행할 수 있다.In the electronic device 10 of another embodiment, the communication module communicating with the server 20 and the communication module communicating with the remote control may be the same. For example, the electronic device 10 may communicate with the server 20 and the remote controller through the Bluetooth module.
저장부(150)는 전자장치(10)의 다양한 데이터를 저장하도록 구성된다. 저장부(150)는 전자장치(10)에 공급되는 전원이 차단되더라도 데이터들이 남아있어야 하며, 변동사항을 반영할 수 있도록 쓰기 가능한 비휘발성 메모리(writable ROM)로 구비될 수 있다. 즉, 저장부(150)는 플래쉬 메모리(flash memory), EPROM 또는 EEPROM 중 어느 하나로 구비될 수 있다.The storage unit 150 is configured to store various data of the electronic device 10 . The storage unit 150 should retain data even when power supplied to the electronic device 10 is cut off, and may be provided as a writable nonvolatile memory (writable ROM) to reflect changes. That is, the storage unit 150 may be provided with any one of a flash memory, an EPROM, or an EEPROM.
저장부(150)는 전자장치(10)의 읽기 또는 쓰기 속도가 비휘발성 메모리에 비해 빠른 DRAM 또는 SRAM과 같은 휘발성 메모리(volatile memory)를 더 구비할 수 있다.The storage unit 150 may further include a volatile memory such as DRAM or SRAM, in which the read or write speed of the electronic device 10 is faster than that of the nonvolatile memory.
저장부(150)에 저장되는 데이터는, 예를 들면 전자장치(10)의 구동을 위한 운영체제를 비롯하여, 이 운영체제 상에서 실행 가능한 다양한 소프트웨어, 프로그램, 어플리케이션, 부가데이터 등을 포함한다.The data stored in the storage 150 includes, for example, an operating system for driving the electronic device 10 , and various software, programs, applications, and additional data executable on the operating system.
본 발명 일 실시예에 따른 전자장치(10)에서 저장부(150)에 저장 및 설치되는 어플리케이션은, 소리수신부(120)를 통해 수신되는 사용자음성을 인식하고, 그에 따른 동작을 수행하기 위한 AI 스피커 어플리케이션을 포함할 수 있다.Applications stored and installed in the storage unit 150 of the electronic device 10 according to an embodiment of the present invention may include an AI speaker application for recognizing a user voice received through the sound receiver 120 and performing an operation accordingly.
일 실시예에서, AI 스피커 어플리케이션은, 소리수신부(120)를 통해 미리 정해진 키워드로서 시작어 즉, 트리거 워드(trigger word)의 입력, 전자장치(10)의 특정 버튼에 대한 사용자 조작 등이 식별되면 실행 또는 활성화됨으로써, 사용자로부터 발화된 음성에 대한 음성인식 기능을 수행할 수 있다. 여기서, 어플리케이션의 활성화는 어플리케이션의 실행 상태가 백그라운드 모드(background mode)에서 포그라운드 모드(foreground mode)로 전환하는 것을 포함할 수 있다.In an embodiment, the AI speaker application may be executed or activated when an input of a start word, that is, a trigger word, as a predetermined keyword is identified through the sound receiver 120, or when a user operation on a specific button of the electronic device 10 is identified, thereby performing a voice recognition function on the voice uttered by the user. Here, activation of the application may include switching the execution state of the application from a background mode to a foreground mode.
일 실시예의 전자장치(10)에서, 저장부(150)는, 도 2에 도시된 바와 같이, 소리수신부(120)를 통해 수신될 수 있는 사용자음성을 인식하기 위한 데이터 즉, 정보가 저장된 데이터베이스(151)를 포함할 수 있다.In the electronic device 10 of an embodiment, the storage unit 150 may include, as shown in FIG. 2, a database 151 in which data, that is, information, for recognizing a user voice that can be received through the sound receiver 120 is stored.
데이터베이스(151)는, 예를 들면, 음성의 신호적인 특성을 모델링하여 미리 결정된 복수의 음향모델을 포함할 수 있다. 또한, 데이터베이스(151)는 인식대상 어휘에 해당하는 단어나 음절 등의 언어적인 순서 관계를 모델링하여 미리 결정된 언어모델을 더 포함할 수 있다.The database 151 may include, for example, a plurality of acoustic models determined in advance by modeling signal characteristics of speech. In addition, the database 151 may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
다른 실시예에서, 사용자음성을 인식하기 위한 정보가 저장된 데이터베이스는, 전술한 바와 같이 무선 인터페이스부(142)를 통하여 유선 또는 무선 네트워크에 의해 접속 가능한 외부장치의 일례인 서버(20)에 마련될 수 있다. 서버(20)는, 예를 들면 클라우드 타입으로 구현될 수 있다.In another embodiment, the database storing information for recognizing a user voice may be provided in the server 20, which is an example of an external device accessible by a wired or wireless network through the wireless interface unit 142 as described above. The server 20 may be implemented, for example, as a cloud-type server.
프로세서(160)는 전자장치(10)의 제반 구성들이 동작하기 위한 제어를 수행한다.The processor 160 controls all components of the electronic device 10 to operate.
프로세서(160)는 이러한 제어 동작을 수행할 수 있도록 하는 제어프로그램에 포함된 인스트럭션을 실행한다. 프로세서(160)는 제어프로그램이 설치된 비휘발성의 메모리로부터 제어프로그램의 적어도 일부를 휘발성의 메모리로 로드하고, 로드된 제어프로그램을 실행하는 적어도 하나의 범용 프로세서를 포함하며, 예를 들면 CPU(Central Processing Unit) 또는 응용 프로세서(application processor, AP)로 구현될 수 있다.The processor 160 executes instructions included in a control program to perform such control operations. The processor 160 includes at least one general-purpose processor that loads at least a part of the control program from a nonvolatile memory, in which the control program is installed, into a volatile memory and executes the loaded control program, and may be implemented as, for example, a CPU (Central Processing Unit) or an application processor (AP).
프로세서(160)는 싱글 코어, 듀얼 코어, 트리플 코어, 쿼드 코어 및 그 배수의 코어를 포함할 수 있다. 프로세서(160)는 복수의 프로세서, 예를 들어, 주 프로세서(main processor) 및 슬립 모드(sleep mode, 예를 들어, 대기 전원만 공급되고 소리 신호를 수신하는 전자장치로서 동작하지 않는)에서 동작하는 부 프로세서(sub processor)를 포함할 수 있다. 또한, 프로세서, 롬 및 램은 내부 버스(bus)를 통해 상호 연결되며, 롬과 램은 저장부(150)에 포함된다.The processor 160 may include a single-core, dual-core, triple-core, quad-core, or higher-multiple-core processor. The processor 160 may include a plurality of processors, for example, a main processor and a sub-processor that operates in a sleep mode (in which, for example, only standby power is supplied and the device does not operate as an electronic device receiving sound signals). In addition, the processor, ROM, and RAM are interconnected through an internal bus, and the ROM and RAM are included in the storage unit 150.
본 발명에서 프로세서(160)를 구현하는 일례인 CPU 또는 응용 프로세서는 전자장치(10)에 내장되는 PCB 상에 실장되는 메인 SoC(Main SoC)에 포함되는 형태로서 구현 가능하다. In the present invention, a CPU or an application processor, which is an example of implementing the processor 160 , may be implemented as a form included in a main SoC mounted on a PCB embedded in the electronic device 10 .
제어프로그램은, BIOS, 디바이스드라이버, 운영체계, 펌웨어, 플랫폼 및 응용프로그램(어플리케이션) 중 적어도 하나의 형태로 구현되는 프로그램(들)을 포함할 수 있다. 일 실시예로서, 응용프로그램은, 전자장치(10)의 제조 시에 전자장치(10)에 미리 설치 또는 저장되거나, 혹은 추후 사용 시에 외부로부터 응용프로그램의 데이터를 수신하여 수신된 데이터에 기초하여 전자장치(10)에 설치될 수 있다. 응용 프로그램의 데이터는, 예를 들면, 어플리케이션 마켓과 같은 외부 서버로부터 전자장치(10)로 다운로드될 수도 있다. 이와 같은 응용프로그램, 외부 서버 등은, 본 발명의 컴퓨터프로그램제품의 일례이나, 이에 한정되는 것은 아니다.The control program may include program(s) implemented in the form of at least one of a BIOS, a device driver, an operating system, firmware, a platform, and an application program (application). As an embodiment, the application program is pre-installed or stored in the electronic device 10 when the electronic device 10 is manufactured, or receives data of the application program from the outside when used later, based on the received data. It may be installed in the electronic device 10 . Data of the application program may be downloaded to the electronic device 10 from, for example, an external server such as an application market. Such an application program, an external server, etc. is an example of the computer program product of the present invention, but is not limited thereto.
일 실시예에서 프로세서(160)는, 도 2에 도시된 바와 같이, 신호처리부(161)를 포함할 수 있다.In an embodiment, the processor 160 may include a signal processing unit 161 as shown in FIG. 2 .
신호처리부(161)는 오디오신호 즉, 소리신호를 처리한다. 신호처리부(161)에서 처리된 소리신호는, 출력부(110)를 통해 사운드로서 출력됨으로써 사용자에게 오디오 컨텐트가 제공될 수 있다.The signal processing unit 161 processes an audio signal, that is, a sound signal. The sound signal processed by the signal processing unit 161 may be output as sound through the output unit 110 to provide audio content to the user.
일 실시예에서 신호처리부(161)는 프로세서(160)의 소프트웨어 블록으로서, 프로세서(160)의 일 기능을 수행하는 형태로 구현될 수 있다. In an embodiment, the signal processing unit 161 is a software block of the processor 160 , and may be implemented in a form that performs one function of the processor 160 .
다른 실시예에서, 신호처리부(161)는, 프로세서(160)를 구현하는 예시인 CPU 또는 응용 프로세서(AP)와 구분된 별도의 구성, 예를 들면, 디지털 신호 프로세서(DSP)와 같은 마이크로 프로세서 또는 IC(integrated circuit)로서 구현되거나, 또는 하드웨어와 소프트웨어의 조합에 의해 구현될 수 있다. In another embodiment, the signal processing unit 161 is a separate configuration separated from the CPU or application processor (AP), which is an example of implementing the processor 160, for example, a microprocessor such as a digital signal processor (DSP) or It may be implemented as an integrated circuit (IC), or may be implemented by a combination of hardware and software.
일 실시예에서 프로세서(160)는, 도 2에 도시된 바와 같이, 사용자로부터 발화된 음성신호를 인식할 수 있는 음성인식모듈(162)을 포함할 수 있다.In an embodiment, the processor 160 may include a voice recognition module 162 capable of recognizing a voice signal uttered by a user, as shown in FIG. 2 .
도 3은 본 발명 일 실시예에 따른 전자장치의 음성인식모듈의 구성을 도시한 블록도이다.3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
일 실시예에서 음성인식모듈(162)은 사용자발화를 입력으로 수신하며, 미리 정해진 시작어(이하, 트리거 워드 또는 웨이크 업 워드(wake-up word, WUW)라고도 한다.)의 입력에 응답하여 음성인식을 위한 동작을 개시하도록 구현될 수 있다.In an embodiment, the voice recognition module 162 receives a user utterance as input, and may be implemented to initiate an operation for voice recognition in response to an input of a predetermined start word (hereinafter also referred to as a trigger word or a wake-up word (WUW)).
본 발명 일 실시예의 전자장치(10)에서, 음성인식모듈(162)은, 도 3에 도시된 바와 같이, 전처리부(301), 시작어 엔진(302), 임계값 결정부(303) 및 음성인식 엔진(304)을 포함할 수 있다.In the electronic device 10 according to an embodiment of the present invention, the voice recognition module 162 may include, as shown in FIG. 3, a preprocessor 301, a start word engine 302, a threshold value determiner 303, and a voice recognition engine 304.
전처리부(301)는 사용자발화에 따른 음성신호를 소리수신부(120)로부터 입력받고, 주변 소음 즉, 노이즈를 제거하는 전처리를 수행할 수 있다.The preprocessor 301 may receive a voice signal according to the user's utterance from the sound receiver 120 and perform preprocessing for removing ambient noise, that is, noise.
일 실시예에서 전처리에는 디지털 신호 변환, 필터링, 프레이밍 등의 과정들이 포함될 수 있으며, 상기의 과정들에 따라 음성신호에서 불필요한 주변 소음이 제거됨으로써 유의미한 음성신호가 추출될 수 있다.In an embodiment, the pre-processing may include processes such as digital signal conversion, filtering, framing, and the like, and a meaningful voice signal can be extracted by removing unnecessary ambient noise from the voice signal according to the above processes.
시작어 엔진(302)은 전처리가 수행된 음성신호로부터 추출된 특징(feature)을 미리 정해진 소정 패턴과 비교하는 패턴 매칭을 수행한다.The start word engine 302 performs pattern matching by comparing features extracted from the pre-processed speech signal with a predetermined pattern.
일 실시예에서, 시작어 엔진(302)은, 미리 학습을 수행하여 구성된 음향모델을 이용하여 패턴 매칭을 수행할 수 있다.In an embodiment, the start word engine 302 may perform pattern matching using an acoustic model configured by performing pre-learning.
구체적으로, 시작어 엔진(302)은 입력발화 즉, 사용자 발화에 따른 음성신호(소리신호)의 파형과, 음향모델의 시작어 패턴 간의 유사도에 기초하여, 입력발화가 시작어를 포함하는지 여부를 식별할 수 있다.Specifically, the start word engine 302 determines whether the input speech includes a start word based on the similarity between the input speech, that is, the waveform of the voice signal (sound signal) according to the user's speech, and the start word pattern of the acoustic model. can be identified.
시작어 엔진(302)은, 패턴 매칭에 의한 비교 결과, 입력발화의 점수(score) 즉, 발화 스코어가 미리 정해진 시작어 임계값(WUW Threshold) 보다 큰 경우, 입력발화가 시작어를 포함하는 것으로 식별할 수 있다.The start word engine 302 determines that the input utterance includes the start word when, as a result of the comparison by pattern matching, the score of the input utterance, that is, the utterance score is greater than a predetermined start word threshold (WUW Threshold). can be identified.
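The comparison described above can be sketched as follows; the cosine-similarity score and the threshold value 0.6 are illustrative assumptions, since the actual score and threshold come from the trained acoustic model:

```python
def similarity(features, template):
    # Cosine similarity between an utterance feature vector and the stored
    # start-word (WUW) pattern; stands in for the acoustic-model score.
    dot = sum(f * t for f, t in zip(features, template))
    norm = (sum(f * f for f in features) ** 0.5) * (sum(t * t for t in template) ** 0.5)
    return dot / norm if norm else 0.0

WUW_THRESHOLD = 0.6  # assumed value; the real threshold is set by learning

def contains_trigger_word(features, template, threshold=WUW_THRESHOLD):
    # The input utterance is treated as containing the start word when its
    # utterance score exceeds the WUW threshold.
    return similarity(features, template) > threshold
```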
여기서, 유사도의 임계값, 즉 시작어 임계값(WUW Threshold)는 음향 모델을 이용한 학습 알고리즘에 기반하여 미리 설정될 수 있다.Here, a threshold of similarity, that is, a starting word threshold (WUW Threshold) may be preset based on a learning algorithm using an acoustic model.
본 발명에서 시작어 임계값(WUW Threshold)은 전자장치(10)의 음성인식 기능을 활성화시키기 위한 조건으로서 정의된다. 다시 말해, 시작어 임계값(WUW Threshold)은, 후술하는 소리 신호의 소음 특성 및 소음 대비 발화 특성과의 비교에 각각 사용되는 소음 임계값 및 SNR 임계값과 구분된다.In the present invention, the start word threshold (WUW Threshold) is defined as a condition for activating the voice recognition function of the electronic device 10. In other words, the start word threshold (WUW Threshold) is distinguished from the noise threshold and the SNR threshold, which are used for comparison with the noise characteristic and the speech-to-noise characteristic of the sound signal, respectively, as described later.
본 발명 일 실시예에 따른 전자장치(10)는, 사용자발화가 소음환경에서 이루어진 경우, 서로 다른 값을 가지도록 설정된 2개의 시작어 임계값을 사용하도록 구현될 수 있다. 이러한 소음환경에서 2개의 시작어 임계값을 적용하는 구체적인 예에 관해서는 후술하는 도 4의 실시예에서 보다 상세하게 설명하기로 한다.The electronic device 10 according to an embodiment of the present invention may be implemented to use two start word thresholds set to have different values when the user utterance is made in a noisy environment. A specific example of applying the two start word thresholds in such a noisy environment will be described in more detail in the embodiment of FIG. 4 below.
임계값 결정부(303)는, 미리 정해진 소음 임계값을 이용하여 사용자발화가 소음환경에서 이루어졌는지를 식별한다. 여기서, 소음환경의 식별은, 사용자발화에 따른 소리신호의 소음특성으로서, 특정 구간에서의 전력 및 소음 임계값 간의 비교에 기초하여 이루어질 수 있다.The threshold value determining unit 303 identifies whether the user's utterance is made in a noisy environment using a predetermined noise threshold. Here, the identification of the noise environment, as a noise characteristic of a sound signal according to a user's utterance, may be made based on a comparison between power and a noise threshold in a specific section.
또한, 임계값 결정부(303)는, 미리 정해진 SNR 임계값을 이용하여 사용자발화에 따른 소리 신호의 발화특성으로서, 소음 대비 발화된 소리 신호의 비율이 특정 수준 이상인지 여부를 식별한다.In addition, the threshold value determining unit 303 identifies whether the ratio of the sound signal to the noise is equal to or greater than a specific level as the speech characteristic of the sound signal according to the user's speech using a predetermined SNR threshold.
일 실시예에서 임계값 결정부(303)는, 상기와 같은 소리 신호의 소음특성과 소음 임계값과의 비교 결과 또는 소리 신호의 발화특성과 SNR 임계값과의 비교 결과에 기초하여, SNR 임계값을 변경할 수 있다. SNR 임계값의 변경은, 예를 들어, 그 값을 상향 조정하거나, 또는 하향 조정하는 것을 포함할 수 있다. 이러한 SNR 임계값을 변경하는 구체적인 예에 관해서는 후술하는 도 4의 실시예에서 보다 상세하게 설명하기로 한다.In an embodiment, the threshold value determiner 303 may change the SNR threshold based on the result of comparing the noise characteristic of the sound signal with the noise threshold, or the result of comparing the speech characteristic of the sound signal with the SNR threshold, as described above. Changing the SNR threshold may include, for example, adjusting the value upward or downward. A specific example of changing the SNR threshold will be described in more detail in the embodiment of FIG. 4 below.
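A hypothetical sketch of the threshold determiner's two checks and the SNR-threshold adjustment; the power-based noise measure, the dB formulation, and the adjustment step are assumptions for illustration (the actual policy is described with the FIG. 4 embodiment):

```python
import math

def mean_power(frame):
    # Average signal power over one section of the sound signal.
    return sum(x * x for x in frame) / len(frame)

def is_noisy_environment(noise_frame, noise_threshold):
    # Noise characteristic: power in a (pre-utterance) section
    # compared against the noise threshold.
    return mean_power(noise_frame) > noise_threshold

def snr_db(speech_frame, noise_frame):
    # Speech characteristic: ratio of the uttered signal to noise, in dB.
    return 10 * math.log10(mean_power(speech_frame) / mean_power(noise_frame))

def adjust_snr_threshold(current, raise_it, step=1.0):
    # The determiner may move the SNR threshold up or down based on the
    # comparison results; direction and step size are assumed here.
    return current + step if raise_it else current - step
```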
음성인식 엔진(304)은 사용자발화에 관한 인식 동작을 수행할 수 있도록, 사용자발화로서 수신되는 음성신호 즉, 소리신호에 대한 음성인식기능을 포함하도록 구현될 수 있다.The voice recognition engine 304 may be implemented to include a voice recognition function for the voice signal, that is, the sound signal, received as a user utterance, so as to perform a recognition operation on the user utterance.
본 발명 일 실시예에 따른 전자장치(10)에서, 음성인식 엔진(304)은 사용자발화가 소음환경에서 이루어진 경우, 전술한 2개의 시작어 임계값에 기초한 2단계의 활성화 조건을 만족하면 음성인식 기능이 활성화되어, 전자장치(10)가 수신된 소리 신호에 기초하여 사용자발화에 관한 인식 동작을 수행하도록 구현될 수 있다. 이러한 2단계의 활성화 조건에 따른 음성인식기능의 활성화가 이루어지는 구체적인 예에 관해서는 후술하는 도 4의 실시예에서 보다 상세하게 설명하기로 한다.In the electronic device 10 according to an embodiment of the present invention, when the user utterance is made in a noisy environment and the two-stage activation condition based on the two start word thresholds described above is satisfied, the voice recognition function of the voice recognition engine 304 is activated, so that the electronic device 10 may perform a recognition operation on the user utterance based on the received sound signal. A specific example of activating the voice recognition function according to this two-stage activation condition will be described in more detail in the embodiment of FIG. 4 below.
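One way such a two-stage activation could look in code; the two threshold values, the SNR gate, and the exact conditions below are hypothetical placeholders for the FIG. 4 embodiment:

```python
def two_stage_activation(score, snr, low_thr=0.5, high_thr=0.7, snr_thr=6.0):
    # Stage 1: a confident pattern-match score above the higher start-word
    # threshold activates recognition outright.
    if score > high_thr:
        return True
    # Stage 2: a weaker score above the lower threshold is accepted only
    # when the signal-to-noise ratio is sufficient.
    if score > low_thr and snr >= snr_thr:
        return True
    return False
```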
In one embodiment, the voice recognition function of the voice recognition engine 304 may be performed using one or more voice recognition algorithms. For example, the voice recognition engine 304 may extract a vector representing a voice feature from the voice signal uttered by the user and compare the extracted vector with an acoustic model of the database 151 or the server 20 to perform voice recognition. Here, as an example, the acoustic model is a model obtained through previously performed training.
As described above, the voice recognition module 162, which includes the preprocessor 301, the wake-up word engine 302, the threshold determiner 303, and the voice recognition engine 304, is described as an example as being implemented as an embedded type residing in a CPU provided as the processor 160, but the present invention is not limited thereto. Accordingly, the voice recognition module 162 may be implemented as a component of the electronic device 10 separate from the CPU, for example, as a separate chip such as a microcomputer provided as a dedicated processor for the voice recognition function.
In addition, the preprocessor 301, the wake-up word engine 302, the threshold determiner 303, and the voice recognition engine 304, as the components of the voice recognition module 162, may each be implemented as a software block as an example; in some cases, at least one of the components may be omitted, or at least one other component may be added.
In the following embodiments, the operations performed by at least one of the aforementioned preprocessor 301, wake-up word engine 302, threshold determiner 303, and voice recognition engine 304 so that the electronic device 10 performs the voice recognition function will be understood to be performed by the processor 160 of the electronic device 10.
In one embodiment, the processor 160 identifies whether a value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than a noise threshold (hereinafter also referred to as a first threshold), and identifies whether a value representing the speech characteristic of the sound signal is greater than an SNR threshold (hereinafter also referred to as a second threshold). When the value of the noise characteristic is identified as being greater than the first threshold and the value of the speech characteristic as being greater than the second threshold, the processor 160 performs a recognition operation on the user utterance based on the received sound signal and causes the second threshold, that is, the SNR threshold, to be adjusted upward. Here, for a sound signal whose waveform has a similarity to a predefined wake-up word pattern greater than a first wake-up word threshold (hereinafter also referred to as a third threshold), that is, a sound signal satisfying the first activation condition, the processor 160 may identify whether the value of its noise characteristic and the value of its speech characteristic are greater than the first threshold and the second threshold, respectively.
In addition, when the value of the noise characteristic of the received sound signal is identified as being less than or equal to the first threshold, the processor 160 performs a recognition operation on the user utterance based on the received sound signal and causes the second threshold, that is, the SNR threshold, to be adjusted downward.
In addition, when the value of the speech characteristic of the sound signal is identified as being less than or equal to the second threshold, the processor 160 may perform a recognition operation on the user utterance based on the received sound signal if the similarity between the waveform of the sound signal and the wake-up word pattern is greater than a second wake-up word threshold (hereinafter also referred to as a fourth threshold), which is greater than the first wake-up word threshold, that is, if the second activation condition is satisfied.
As an embodiment, the operations of the processor 160 may be implemented as a computer program stored in a computer program product (not shown) provided separately from the electronic device 10. In this case, the computer program product includes a memory in which instructions corresponding to the computer program are stored, and a processor. When executed by the processor 160, the instructions include: if the value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than the first threshold and the value representing the speech characteristic of the sound signal is greater than the second threshold, performing a recognition operation on the user utterance based on the received sound signal and causing the second threshold to be adjusted upward. The instructions also include: if the value representing the noise characteristic of the received sound signal is less than or equal to the first threshold, performing a recognition operation on the user utterance based on the received sound signal and causing the second threshold to be adjusted downward.
Accordingly, the processor 160 of the electronic device 10 may download and execute the computer program stored in the separate computer program product to perform the operations of the instructions described above.
Hereinafter, embodiments in which the recognition operation on a user utterance is improved in the electronic device of the present invention will be described with reference to the drawings.
FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention, FIG. 5 is a diagram for explaining pattern matching for activating the voice recognition function in an electronic device according to an embodiment of the present invention, and FIG. 6 is a diagram for explaining identification of a noise characteristic by an electronic device according to an embodiment of the present invention.
As shown in FIG. 4, the electronic device 10 may receive a sound signal through the sound receiver 120 (401). Here, the received sound signal may be a signal resulting from a user utterance.
The processor 160 may identify whether the sound signal received in step 401 satisfies the first activation condition for the voice recognition function (402).
In one embodiment, as shown in FIG. 5, the processor 160 may perform pattern matching between the sound signal from which ambient noise has been removed through preprocessing and a predefined wake-up word signal, to identify whether the first activation condition is satisfied.
Specifically, based on the pattern matching as shown in FIG. 5, the processor 160 derives a speech score (Score_speech) as the similarity between the user utterance, that is, the waveform of the sound signal, and the wake-up word signal pattern, and may identify, using Equation 1 below, whether the derived speech score, that is, the similarity, is greater than a predetermined first wake-up word threshold (WUW_Threshold1), that is, the third threshold.
[Equation 1] $\mathrm{Score}_{\mathrm{speech}} > \mathrm{WUW\_Threshold}_1$
Here, the first wake-up word threshold (third threshold) is used to identify whether the sound signal satisfies the first activation condition for the voice recognition function, and is applied regardless of whether the user utterance is made in a noisy environment.
In one embodiment, the first wake-up word threshold may be preset to, for example, 0.1, but this is only an example and the value is not limited thereto.
When the speech score is identified as being greater than the first wake-up word threshold according to Equation 1, the processor 160 may determine that the sound signal input in step 401 satisfies the first activation condition.
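As an illustration only, the Equation 1 check can be sketched in Python. The patent does not specify how the pattern-matching score is computed; normalized cross-correlation is used here purely as a stand-in for the similarity measure, and the function names are assumptions:

```python
import numpy as np

def speech_score(signal: np.ndarray, wuw_pattern: np.ndarray) -> float:
    # Stand-in similarity measure (assumed): normalized cross-correlation
    # between the input waveform and the wake-up word pattern.
    n = min(len(signal), len(wuw_pattern))
    s, w = signal[:n], wuw_pattern[:n]
    denom = np.linalg.norm(s) * np.linalg.norm(w)
    if denom == 0.0:
        return 0.0
    return float(np.dot(s, w) / denom)

def satisfies_first_condition(signal, wuw_pattern, wuw_threshold1=0.1):
    # Equation 1: Score_speech > WUW_Threshold1 (0.1 is the example value
    # given in the description).
    return speech_score(signal, wuw_pattern) > wuw_threshold1
```

A waveform identical to the pattern yields a score near 1 and satisfies the condition; an uncorrelated waveform yields a score near 0 and does not.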
If it is determined in step 402 that the sound signal satisfies the first activation condition, the processor 160 may identify whether the value representing the noise characteristic of the sound signal resulting from the user utterance is greater than a predetermined noise threshold, that is, the first threshold (403). Here, the first threshold is used to identify whether the user's surroundings constitute a noisy environment, and may be preset to correspond to the power value of a sound signal when sufficiently loud noise is present in the surroundings.
Here, for the sound signal received in step 401, the processor 160 may identify a section containing the wake-up word uttered by the user (hereinafter also referred to as the wake-up word section), and identify whether the value representing the noise characteristic received in a section of predefined time length preceding the wake-up word section (hereinafter also referred to as the noise characteristic check section) is greater than the first threshold.
In the electronic device 10 according to an embodiment of the present invention, as shown in FIG. 6, the sound signal received in a streaming manner by the sound receiver 120 may be temporarily stored, in units of consecutive frames, in a first-in-first-out (FIFO) queue-type data structure. That is, when the next frame is received, the streaming sound signal is stored in such a way that the earliest stored frame is pushed out. Here, the length of the stored sound signal may be preset to correspond to the storage space; for example, it may be implemented such that a signal of 2.5 seconds in length is stored.
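A minimal sketch of this FIFO frame queue: only the 2.5-second total length comes from the description; the 16 kHz sampling rate and 10 ms frame size are assumed, illustrative values:

```python
from collections import deque

SAMPLE_RATE = 16000                  # assumed sampling rate
FRAME_LEN = 160                      # 10 ms per frame (assumed)
BUFFER_SECONDS = 2.5                 # stored signal length from the description
MAX_FRAMES = int(BUFFER_SECONDS * SAMPLE_RATE / FRAME_LEN)  # 250 frames

# Bounded deque: when full, appending a new frame automatically discards
# the earliest stored frame (first in, first out).
frame_queue = deque(maxlen=MAX_FRAMES)

def push_frame(frame):
    frame_queue.append(frame)
```

After pushing more frames than the queue holds, only the most recent 2.5 seconds remain.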
In one embodiment, the processor 160 may monitor whether each frame of the streaming sound signal, received and stored in units of consecutive frames as described above, contains the wake-up word uttered by the user. Based on this monitoring, for example, as described for step 402, when the speech score of a particular signal frame is detected as being greater than the first wake-up word threshold, the processor 160 may identify that signal frame as containing the user utterance, that is, the wake-up word.
The processor 160 may identify, as the wake-up word section, a time section of predefined length extending from the signal frame identified in step 402 back to, for example, about 1 second earlier. The processor 160 may then identify a section of predefined time length preceding the identified wake-up word section, for example, a time section of about 1.5 seconds, as the noise characteristic check section.
Here, the noise characteristic check section may be defined to correspond to the time obtained by subtracting the time of the wake-up word section from the time of the entire stored sound signal; in the present invention, the time lengths corresponding to the wake-up word section and the noise characteristic check section are not limited to the examples presented.
As the noise characteristic of the sound signal, the processor 160 may compare the signal power of the noise characteristic check section with the first threshold to identify whether the surrounding environment is sufficiently loud at the time of utterance, in other words, whether the user utterance is made in a noisy environment.
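The split of the stored buffer and the power comparison of step 403 might be sketched as follows; the use of mean squared amplitude as the power measure and the concrete threshold value are assumptions, not values from this description:

```python
import numpy as np

def is_noisy_environment(buffer: np.ndarray, sample_rate: int,
                         wuw_seconds: float = 1.0,
                         noise_threshold: float = 1e-3) -> bool:
    # The last ~1 s of the stored buffer is treated as the wake-up word
    # section; everything before it (~1.5 s of a 2.5 s buffer) is the
    # noise characteristic check section.
    wuw_samples = int(wuw_seconds * sample_rate)
    noise_section = buffer[:-wuw_samples]
    # "Noise characteristic" value: mean power of the noise section
    # (assumed measure), compared against the first threshold.
    noise_power = float(np.mean(noise_section ** 2))
    return noise_power > noise_threshold
```

A silent noise section yields a power of zero and is classified as quiet; a loud noise section exceeds the threshold and is classified as noisy.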
If, in step 403, the noise characteristic of the sound signal, that is, the signal power, is identified as being greater than the first threshold, the processor 160 may identify whether the speech characteristic of the sound signal is greater than a predetermined SNR threshold, that is, the second threshold (404). Here, the speech characteristic may include the signal-to-noise ratio (SNR) of the sound signal.
In one embodiment, as the speech characteristic of the sound signal, the processor 160 computes an a posteriori SNR (SNR_post), corresponding to the power ratio between the entire sound signal and the noise, and may identify, using Equation 2 below, whether the computed a posteriori SNR is greater than the predetermined second threshold, that is, the SNR threshold (SNR_Threshold).
[Equation 2] $\mathrm{SNR}_{\mathrm{post}} > \mathrm{SNR\_Threshold}$
Here, the a posteriori SNR (SNR_post) may be computed using Equations 3 and 4 below.
[Equation 3] $X(p,k) = S(p,k) + N(p,k)$
Here, for the k-th spectral component of frame p, X(p,k) denotes the entire noise-containing sound signal, S(p,k) denotes the speech signal, and N(p,k) denotes the noise signal.
Accordingly, the received input sound signal (voice signal) X may be expressed, as in Equation 3, as the sum of the k-th spectral components of the speech component S and the noise component N for each frame p.
The a posteriori SNR (SNR_post) may be computed, for each frame p, as the power ratio of the entire noise-containing sound signal X(p,k) to the noise N(p,k), according to Equation 4 below.
[Equation 4] $\mathrm{SNR}_{\mathrm{post}}(p) = \dfrac{1}{K}\sum_{k=1}^{K}\dfrac{|X(p,k)|^2}{|N(p,k)|^2}$
Then, the final a posteriori SNR over all frames may be computed as the average of the a posteriori SNR values of the individual frames p.
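Under the assumption that the noise spectrum N(p,k) has been estimated separately (for example from the noise characteristic check section; the description does not state the estimator), the per-frame ratio of Equation 4 and the averaging over frames might be sketched as:

```python
import numpy as np

def posteriori_snr(noisy_spec: np.ndarray, noise_spec: np.ndarray,
                   eps: float = 1e-12) -> float:
    # noisy_spec, noise_spec: complex or real arrays of shape (frames, bins),
    # i.e. X(p,k) and the estimated N(p,k).
    per_bin = (np.abs(noisy_spec) ** 2) / (np.abs(noise_spec) ** 2 + eps)
    per_frame = per_bin.mean(axis=1)    # average over spectral bins k
    return float(per_frame.mean())      # final value: average over frames p
```

With a noisy spectrum whose magnitude is everywhere twice the noise magnitude, the power ratio per bin is 4, so the final a posteriori SNR is 4.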
According to the electronic device 10 of an embodiment, the processor 160 determines the final a posteriori SNR computed in this way as the speech characteristic of the sound signal in step 404, and compares the speech characteristic with the second threshold (SNR_Threshold), thereby identifying whether the user utterance is sufficiently loud in the noisy environment.
Here, the second threshold, that is, the SNR threshold (SNR_Threshold), is a predetermined value corresponding to the level at which the input sound signal can be recognized as having been generated by a user utterance in a noisy environment, and its initial value may be preset; in an embodiment of the present invention, the initial SNR threshold may be set to, for example, 4, but is not limited thereto.
If, as a result of the identification in step 403, the electronic device 10 is operating in a sufficiently loud noise environment (YES in step 403), a case may occur in which the sound signal of step 402, which contains ambient noise rather than an actual user utterance, is erroneously recognized as containing the wake-up word.
Taking this into account, in the electronic device 10 according to an embodiment of the present invention, for a sound signal identified in step 402 as satisfying the first activation condition, whether the environment is noisy is first identified in step 403 by comparing the value of the noise characteristic of the signal with the first threshold; if the environment is noisy, the speech characteristic of the sound signal is further compared with the second threshold (the initial SNR threshold) in step 404.
Accordingly, step 404 further determines whether the user utterance is sufficiently loud in the noisy environment, and based on the result, execution of the trigger for performing the voice recognition operation can be controlled.
When the value of the speech characteristic of the sound signal is identified in step 404 as being greater than the predetermined second threshold, for example, the initial SNR threshold, the processor 160 executes the trigger and controls the electronic device 10 to perform the voice recognition operation on the user utterance based on the received sound signal (405). Here, if the final a posteriori SNR computed in step 404 is, for example, 5, which is greater than the initial SNR threshold of 4, the trigger may be executed.
That is, in a noisy environment (YES in step 403), when the user utterance is identified as being sufficiently loud (YES in step 404), the processor 160 immediately executes the trigger, thereby activating the voice recognition function of the electronic device 10 so that operations can be performed in response to the received sound signal.
Then, the processor 160 may adjust the second threshold upward from the predetermined initial SNR threshold (406).
In other words, in the electronic device 10 according to an embodiment of the present invention, after executing the trigger in step 405, the processor 160 may reset the second threshold to reflect the noisy environment as a change in the surroundings.
In one embodiment, the processor 160 may derive a new second threshold (SNR_Threshold) using the initial SNR threshold (SNR_Th_init) and the a posteriori SNR (SNR_post) computed in step 404, according to Equation 5 below.
[Equation 5] $\mathrm{SNR\_Threshold} = \mathrm{SNR}_{Th\_init} \times \log_{\mathrm{SNR}_{Th\_init}}\left(\mathrm{SNR}_{\mathrm{post}}\right)$
For example, when the initial SNR threshold (SNR_Th_init) is 4 and the a posteriori SNR (SNR_post) computed in step 404 is 5, the new second threshold (SNR_Threshold) is, according to Equation 5, 4 × log_4(5) = 4 × 1.16 = 4.64, which is increased, that is, adjusted upward, to a value greater than 4.
The second threshold adjusted upward in this way becomes the value applied in step 404 to the next sound signal in response to its reception.
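The Equation 5 update can be written directly. The sketch below reproduces both worked examples in this description: 4 → 4.64 when SNR_post = 5, and 4.16 → 2.02 when SNR_post = 2:

```python
import math

def update_snr_threshold(snr_th_init: float, snr_post: float) -> float:
    # Equation 5: the new SNR threshold is the initial threshold scaled by
    # the logarithm, base SNR_Th_init, of the measured a posteriori SNR.
    # When SNR_post exceeds the initial threshold the value moves up
    # (step 406); when it falls below, the value moves down (step 411).
    return snr_th_init * math.log(snr_post, snr_th_init)
```

Note that when SNR_post equals the initial threshold, the logarithm is 1 and the threshold is unchanged.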
In the electronic device 10 of an embodiment of the present invention, as described above, when the surroundings are a noisy environment, the second threshold (SNR threshold), as a trigger execution condition, is correspondingly adjusted upward, which can induce the user to utter more loudly in the noisy environment.
Meanwhile, as shown in FIG. 4, when the value of the speech characteristic of the sound signal is identified in step 404 as being less than or equal to the predetermined second threshold, for example, the initial SNR threshold, the processor 160 may further identify whether that sound signal satisfies the second activation condition (407).
In one embodiment, using Equation 6 below, the processor 160 may identify that the sound signal satisfies the second activation condition when the speech score, derived as the similarity between the waveform of the sound signal of step 401 and the wake-up word pattern, is greater than a predetermined second wake-up word threshold (WUW_Threshold2), that is, the fourth threshold.
[Equation 6] $\mathrm{Score}_{\mathrm{speech}} > \mathrm{WUW\_Threshold}_2$
Here, the second wake-up word threshold (WUW_Threshold2) (fourth threshold) is used to identify whether the sound signal satisfies the second activation condition for the voice recognition function, and may be applied when the user utterance is made in a noisy environment (YES in step 403).
The second wake-up word threshold (WUW_Threshold2) (fourth threshold) may be set to a value greater than the first wake-up word threshold (WUW_Threshold1) (third threshold) described above, as in Equation 7 below.
[Equation 7] $\mathrm{WUW\_Threshold}_2 > \mathrm{WUW\_Threshold}_1$
In one embodiment, for example, the first wake-up word threshold (third threshold) may be preset to 0.1 and the second wake-up word threshold (fourth threshold) to 0.15, but these values are presented only as examples and are not limiting.
When the speech score is identified as being greater than the second wake-up word threshold according to Equation 6, the processor 160 may determine that the sound signal input in step 401 satisfies the second activation condition.
If it is determined in step 407 that the sound signal satisfies the second activation condition, the processor 160 executes the trigger and controls the electronic device 10 to perform the voice recognition operation on the user utterance based on the received sound signal (408).
On the other hand, if it is determined in step 407 that the sound signal does not satisfy the second activation condition, that is, if the speech score is identified as being less than or equal to the second wake-up word threshold according to Equation 6, the processor 160 does not execute the trigger, and the electronic device 10 is controlled to keep the voice recognition function deactivated (409).
Accordingly, in the electronic device 10 according to an embodiment of the present invention, in a noisy environment (YES in step 403), even if the similarity between the waveform of the input sound signal and the pattern of the wake-up word signal is identified as being greater than the first wake-up word threshold so that the input sound signal satisfies the first activation condition (YES in step 402), the processor 160 performs control based on a two-stage activation condition such that the voice recognition function is activated only when the sound signal also satisfies the second activation condition (YES in step 407).
In other words, in a noisy environment, the voice recognition function is activated only when the speech score representing the similarity according to the pattern matching between the sound signal and the wake-up word signal is greater than the second wake-up word threshold; therefore, even if a sound signal containing ambient noise rather than an actual user utterance is erroneously recognized in step 402 as containing the wake-up word, step 407 reduces the possibility of a misrecognized operation.
Meanwhile, if, in step 403, the noise characteristic of the sound signal, that is, the signal power, is identified as being less than or equal to the first threshold, the processor 160 executes the trigger and controls the electronic device 10 to perform the voice recognition operation (410).
Then, the processor 160 may adjust the second threshold downward from the predetermined initial SNR threshold (411).
In other words, in the electronic device 10 according to an embodiment of the present invention, after the processor 160 executes the trigger in step 410, when it is determined according to the identification result of step 403 that the electronic device 10 is operating in an environment that is not loud (NO in step 403), that is, when the surroundings are not a noisy environment, the second threshold may be reset to reflect this.
In one embodiment, the processor 160 computes the a posteriori SNR (SNR_post), corresponding to the power ratio between the entire sound signal and the noise as the speech characteristic of the sound signal described for step 404, and may derive a new second threshold (SNR_Threshold) using the computed a posteriori SNR and the initial SNR threshold according to Equation 5 above. Here, since the surroundings are not a noisy environment, the computed final a posteriori SNR is derived as a smaller value than in the case of step 404; for example, it may be 2.
일례로서, 초기 SNR 임계값(SNR Th _ init)이 4.16 이고, 연산된 사후 SNR (SNR post)이 2인 경우, 새로운 제2임계값(SNR Threshold)은, 수학식 5에 따라 4.16*log_4.16 (2) = 4.16*0.49 = 2.02 로서, 4보다 작은 값을 가지도록 감소 즉, 하향 조정된다.As an example, when the initial SNR threshold (SNR Th _ init ) is 4.16 and the calculated post SNR (SNR post ) is 2, the new second threshold (SNR Threshold) is 4.16*log_4. 16 (2) = 4.16*0.49 = 2.02, that is, it is decreased to have a value less than 4, that is, it is adjusted downward.
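As a sketch, the threshold update used in this worked example can be written as follows. The function name is hypothetical, and the form of Equation 5 (a logarithm whose base is the initial threshold) is inferred from the numbers above rather than reproduced from the original equation:

```python
import math

def adjust_snr_threshold(snr_th_init, snr_post):
    """Derive a new SNR threshold from the initial threshold and the
    measured posterior SNR, per the form inferred from Equation 5:
    SNR_Threshold = SNR_Th_init * log_{SNR_Th_init}(SNR_post)."""
    return snr_th_init * math.log(snr_post, snr_th_init)

# Worked example from the description: quiet environment, posterior SNR of 2.
new_threshold = adjust_snr_threshold(4.16, 2)
print(round(new_threshold, 2))  # → 2.02, i.e., adjusted downward
```

Note that the same formula adjusts the threshold upward when the posterior SNR exceeds the initial threshold (since the logarithm then exceeds 1), which matches the upward adjustment described for noisy environments.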
In the electronic device 10 according to an embodiment of the present invention, as described above, when the surroundings are not a noise environment, the second threshold value (SNR threshold) serving as the trigger execution condition is adjusted downward accordingly, so that immediate voice recognition is possible even when the user utters the start word quietly.
Meanwhile, as shown in FIG. 4, when it is determined in step 402 that the sound signal does not satisfy the first activation condition, that is, when the utterance score indicating the similarity between the sound signal and the start word signal according to Equation 1 is identified as being equal to or less than the first start word threshold value, the processor 160 does not execute the trigger, so the electronic device 10 may be controlled to keep voice recognition deactivated (412).
In the electronic device 10 according to an embodiment of the present invention as described above, even when the sound signal corresponding to a user utterance satisfies the first activation condition, in a noise environment in which the signal-to-noise ratio (SNR), the utterance characteristic of the sound signal, is equal to or less than the SNR threshold value, that is, in which the user utterance is not sufficiently loud relative to the noise, whether the sound signal satisfies the second activation condition is additionally identified, so that a two-stage activation condition is applied to the voice recognition function.
Accordingly, the occurrence of malfunctions can be reduced, such as the case in which the electronic device 10, in a noise environment, erroneously recognizes a sound signal containing ambient noise rather than an actual user utterance as containing the start word.
In addition, in the electronic device 10 according to an embodiment of the present invention, when the signal-to-noise ratio (SNR), the utterance characteristic of the sound signal, is greater than the SNR threshold value in a noise environment, that is, when the user utterance is sufficiently loud relative to the noise, the SNR threshold value is adjusted upward, inducing the user to utter the start word loudly in the noise environment, so that an improvement in operational accuracy can be expected.
In addition, in the electronic device 10 according to an embodiment of the present invention, when the surroundings are not a noise environment, the SNR threshold value is adjusted downward, so that in a quiet environment the electronic device 10 can operate immediately in response to the change in environment.
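The two-stage activation flow summarized above can be sketched as follows. This is a simplified illustration: the function and parameter names are hypothetical, and the threshold-update rule (a logarithm with the initial threshold as its base) is inferred from the worked example in the description:

```python
import math

SNR_TH_INIT = 4.16  # initial SNR threshold, taken from the example above

def should_trigger(utterance_score, start_word_threshold,
                   noise_power, first_threshold,
                   snr_post, snr_threshold):
    """Two-stage activation check (hypothetical names).

    Returns (trigger, new_snr_threshold): whether voice recognition is
    activated, and the SNR threshold adjusted per the inferred Equation 5."""
    # Stage 1: the sound signal must resemble the start word (step 402).
    if utterance_score <= start_word_threshold:
        return False, snr_threshold  # keep voice recognition deactivated (412)
    # Quiet environment (step 403, NO): trigger immediately (410)
    # and adjust the SNR threshold downward (411).
    if noise_power <= first_threshold:
        return True, SNR_TH_INIT * math.log(snr_post, SNR_TH_INIT)
    # Stage 2 (noisy environment): the utterance must be loud enough
    # relative to the noise; if so, trigger and raise the threshold.
    if snr_post > snr_threshold:
        return True, SNR_TH_INIT * math.log(snr_post, SNR_TH_INIT)
    return False, snr_threshold
```

With these assumptions, a quiet environment with posterior SNR 2 triggers and lowers the threshold below 4.16, while a noisy environment with posterior SNR 8 triggers and raises it above 4.16, matching the behavior described above.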
Although the present invention has been described in detail above through preferred embodiments, the present invention is not limited thereto and may be practiced in various ways within the scope of the claims.

Claims (15)

  1. An electronic device comprising:
    a sound receiver; and
    a processor configured to,
    when a value representing a noise characteristic of a sound signal received through the sound receiver is greater than a first threshold value
    and a value representing an utterance characteristic of the sound signal is greater than a second threshold value,
    perform a recognition operation on a user utterance based on the sound signal and adjust the second threshold value to be raised.
  2. The electronic device of claim 1,
    wherein the utterance characteristic comprises a signal-to-noise ratio of the sound signal.
  3. The electronic device of claim 2,
    wherein the processor is configured to calculate, for each frame of the sound signal, a ratio of the magnitude of noise to the sound signal, and to determine an average value of the calculated per-frame ratios as the value of the utterance characteristic.
  4. The electronic device of claim 1,
    wherein the processor is configured to:
    identify whether a predefined start word is included in the sound signal; and
    identify whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold value.
  5. The electronic device of claim 4,
    wherein the processor is configured to identify whether the start word is included in the sound signal based on a similarity between a waveform of the sound signal and a predefined start word pattern.
  6. The electronic device of claim 5,
    wherein a threshold value of the similarity is preset based on a learning algorithm using an acoustic model.
  7. The electronic device of claim 5,
    wherein the processor is configured to, when the value of the utterance characteristic of a sound signal whose similarity is greater than a third threshold value is equal to or less than the second threshold value, perform the recognition operation on the user utterance based on a sound signal whose similarity satisfies a fourth threshold value that is greater than the third threshold value.
  8. The electronic device of claim 5,
    wherein the processor is configured to identify whether the value of the noise characteristic of a sound signal received in a section of a predefined time length preceding the section containing the start word is greater than the first threshold value.
  9. The electronic device of claim 7,
    wherein the processor is configured to compare a power value of the sound signal received in the section of the predefined time length with the first threshold value.
  10. The electronic device of claim 1,
    wherein the processor is configured to adjust the second threshold value to be lowered when the value of the noise characteristic is equal to or less than the first threshold value.
  11. A control method of an electronic device, the method comprising:
    acquiring a noise characteristic from a sound signal received through a sound receiver;
    acquiring an utterance characteristic from the sound signal; and
    when a value representing the noise characteristic is greater than a first threshold value and a value representing the utterance characteristic is greater than a second threshold value, performing a recognition operation on a user utterance based on the sound signal and adjusting the second threshold value to be raised.
  12. The control method of claim 11,
    wherein the utterance characteristic comprises a signal-to-noise ratio of the sound signal.
  13. The control method of claim 12, further comprising:
    calculating, for each frame of the sound signal, a ratio of the magnitude of noise to the sound signal, and determining an average value of the calculated per-frame ratios as the value of the utterance characteristic.
  14. The control method of claim 11, further comprising:
    identifying whether a predefined start word is included in the sound signal; and
    identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold value.
  15. The control method of claim 14,
    wherein the identifying whether the start word is included comprises identifying whether the start word is included in the sound signal based on a similarity between the waveform of the sound signal and a predefined start word pattern.
PCT/KR2020/018442 2019-12-19 2020-12-16 Electronic device and control method therefor WO2021125784A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2019-0170363 2019-12-19
KR1020190170363A KR20210078682A (en) 2019-12-19 2019-12-19 Electronic apparatus and method of controlling the same

Publications (1)

Publication Number Publication Date
WO2021125784A1 true WO2021125784A1 (en) 2021-06-24

Family

ID=76476805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/018442 WO2021125784A1 (en) 2019-12-19 2020-12-16 Electronic device and control method therefor

Country Status (2)

Country Link
KR (1) KR20210078682A (en)
WO (1) WO2021125784A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11948569B2 (en) 2021-07-05 2024-04-02 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method thereof
KR20230006999A (en) * 2021-07-05 2023-01-12 삼성전자주식회사 Electronic apparatus and controlling method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9047857B1 (en) * 2012-12-19 2015-06-02 Rawles Llc Voice commands for transitioning between device states
KR20170035602A (en) * 2015-09-23 2017-03-31 삼성전자주식회사 Voice Recognition Apparatus, Voice Recognition Method of User Device and Computer Readable Recording Medium
US20170256270A1 (en) * 2016-03-02 2017-09-07 Motorola Mobility Llc Voice Recognition Accuracy in High Noise Conditions
KR20180018146A (en) * 2016-08-12 2018-02-21 삼성전자주식회사 Electronic device and method for recognizing voice of speech
KR20190117725A (en) * 2017-03-22 2019-10-16 삼성전자주식회사 Speech signal processing method and apparatus adaptive to noise environment


Also Published As

Publication number Publication date
KR20210078682A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
WO2017052082A1 (en) Voice recognition apparatus, voice recognition method of user device, and non-transitory computer readable recording medium
WO2014003283A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
WO2021025350A1 (en) Electronic device managing plurality of intelligent agents and operation method thereof
WO2018043895A1 (en) Display device and method for controlling display device
WO2021049795A1 (en) Electronic device and operating method thereof
WO2020231230A1 (en) Method and apparatus for performing speech recognition with wake on voice
WO2015194693A1 (en) Video display device and operation method therefor
WO2021125784A1 (en) Electronic device and control method therefor
WO2019013447A1 (en) Remote controller and method for receiving a user's voice thereof
WO2015170832A1 (en) Display device and video call performing method therefor
WO2020184842A1 (en) Electronic device, and method for controlling electronic device
WO2020251122A1 (en) Electronic device for providing content translation service and control method therefor
WO2021002611A1 (en) Electronic apparatus and control method thereof
WO2020091519A1 (en) Electronic apparatus and controlling method thereof
WO2019112181A1 (en) Electronic device for executing application by using phoneme information included in audio data and operation method therefor
WO2020167006A1 (en) Method of providing speech recognition service and electronic device for same
WO2019017665A1 (en) Electronic apparatus for processing user utterance for controlling an external electronic apparatus and controlling method thereof
WO2020091183A1 (en) Electronic device for sharing user-specific voice command and method for controlling same
WO2021137558A1 (en) Electronic device and control method thereof
WO2020050593A1 (en) Electronic device and operation method thereof
WO2020096218A1 (en) Electronic device and operation method thereof
WO2019112332A1 (en) Electronic apparatus and control method thereof
WO2019177377A1 (en) Apparatus for processing user voice input
WO2022131566A1 (en) Electronic device and operation method of electronic device
WO2018021750A1 (en) Electronic device and voice recognition method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20903382

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20903382

Country of ref document: EP

Kind code of ref document: A1