WO2021125784A1 - Electronic device and control method therefor - Google Patents

Electronic device and control method therefor Download PDF

Info

Publication number
WO2021125784A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound signal
electronic device
value
threshold
threshold value
Prior art date
Application number
PCT/KR2020/018442
Other languages
French (fr)
Korean (ko)
Inventor
김가을
최찬희
Original Assignee
Samsung Electronics Co., Ltd. (삼성전자(주))
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. (삼성전자(주))
Publication of WO2021125784A1 publication Critical patent/WO2021125784A1/en

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
                    • G06F3/16: Sound input; Sound output
                        • G06F3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00: Speech recognition
                    • G10L15/04: Segmentation; Word boundary detection
                    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063: Training
                    • G10L15/08: Speech classification or search
                        • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
                    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
                    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L2015/225: Feedback of the input speech
                • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation

Definitions

  • the present invention relates to an electronic device and a control method thereof, and more particularly, to an electronic device for processing a voice uttered by a user and a control method thereof.
  • Electronic devices such as artificial intelligence (AI) speakers, mobile devices such as smartphones or tablets, and smart TVs can recognize the voice uttered by the user and perform a function according to the voice recognition.
  • the electronic device may operate to activate the voice recognition function by recognizing that a predetermined start word, that is, a trigger word, is input from the user.
  • The start word recognition may include a process of determining the similarity between the audio signal of the user's voice and the start word. For example, when the similarity between the pattern of the audio signal and the start word is greater than a predetermined criterion, the input voice can be identified as including the start word.
  • In a noisy environment, the threshold value for identifying the speech characteristic of a sound signal is reset in response to the user's speech characteristic, so that the accuracy of start word recognition can be improved.
  • An electronic device includes: a sound receiver; and a processor configured to, when a value indicating a noise characteristic of a sound signal received through the sound receiver is greater than a first threshold value and a value indicating a speech characteristic of the sound signal is greater than a second threshold value, perform a recognition operation regarding the user's utterance based on the sound signal and adjust the second threshold value to increase.
  • the speech characteristic may include a signal-to-noise ratio of the sound signal.
  • the processor may calculate a ratio of the noise to the sound signal for each frame of the sound signal, and determine an average value of the calculated ratio for each frame as the value of the speech characteristic.
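  • The frame-wise average described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the frame length, the dB scale, and the way the noise floor is estimated are assumptions introduced here.

```python
import numpy as np

def speech_characteristic(signal: np.ndarray, noise_floor: float,
                          frame_len: int = 400) -> float:
    """Average per-frame signal-to-noise ratio (in dB) of a sound signal.

    noise_floor is an estimate of the noise power (e.g. taken from a
    section before the utterance); how it is obtained is an assumption,
    not fixed by this sketch.
    """
    n_frames = len(signal) // frame_len
    snrs = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        power = np.mean(frame ** 2)  # mean power of this frame
        snrs.append(10.0 * np.log10(power / noise_floor + 1e-12))
    # the average of the per-frame ratios serves as the value
    # of the speech characteristic
    return float(np.mean(snrs))
```

A louder utterance over the same noise floor yields a larger value, which can then be compared against the second threshold.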
  • the processor may identify whether a predefined start word is included in the sound signal, and identify whether a noise characteristic of the sound signal identified as including the start word is greater than a first threshold value.
  • the processor may identify whether a start word is included in the sound signal based on a similarity between a waveform of the sound signal and a predefined start word pattern.
  • the threshold of similarity may be preset based on a learning algorithm using an acoustic model.
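  • As an illustration only, start-word identification by waveform similarity can be sketched with plain normalized cross-correlation against a stored pattern. The patent presets the similarity threshold via a learning algorithm using an acoustic model; the correlation method, function names, and threshold value below are assumptions.

```python
import numpy as np

def startword_similarity(signal: np.ndarray, pattern: np.ndarray) -> float:
    """Peak normalized cross-correlation (in [-1, 1]) between the
    received waveform and the start-word pattern."""
    n = len(pattern)
    pat = pattern - pattern.mean()
    pat_norm = np.linalg.norm(pat) + 1e-12
    best = -1.0
    for start in range(len(signal) - n + 1):
        win = signal[start:start + n]
        win = win - win.mean()
        sim = float(np.dot(win, pat) / ((np.linalg.norm(win) + 1e-12) * pat_norm))
        best = max(best, sim)
    return best

def contains_startword(signal: np.ndarray, pattern: np.ndarray,
                       third_threshold: float = 0.7) -> bool:
    # third_threshold is a hypothetical value; in the patent it is
    # preset based on acoustic-model training
    return startword_similarity(signal, pattern) >= third_threshold
```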
  • The processor may perform a recognition operation regarding the user's speech based on the sound signal whose similarity is greater than the third threshold value and which satisfies a fourth threshold value.
  • the processor may identify whether the value of the noise characteristic of the sound signal received in a section having a predefined time length before the section including the start word is greater than the first threshold value.
  • the processor may compare the power value of the sound signal received in the section of the predefined time length with the first threshold value.
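  • The comparison above can be sketched as follows; the length of the pre-section and the first threshold value below are hypothetical numbers for illustration only.

```python
import numpy as np

def noise_characteristic(signal: np.ndarray, startword_begin: int,
                         pre_len: int = 8000) -> float:
    """Mean power of the section of predefined length received just
    before the section identified as containing the start word."""
    begin = max(0, startword_begin - pre_len)
    pre = signal[begin:startword_begin].astype(np.float64)
    if pre.size == 0:
        return 0.0
    return float(np.mean(pre ** 2))

def is_noisy(signal: np.ndarray, startword_begin: int,
             first_threshold: float = 0.01) -> bool:
    # first_threshold is a hypothetical value for illustration
    return noise_characteristic(signal, startword_begin) > first_threshold
```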
  • the processor may adjust the second threshold to decrease.
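  • Putting the conditions together, the adaptive behavior of the second threshold might look like the sketch below; the class name, step size, and bounds are assumptions, and whether recognition proceeds in a quiet environment is inferred from context rather than stated here.

```python
class SpeechThresholdAdapter:
    """Tracks the second threshold used to judge the speech characteristic.

    When the environment is noisy (noise value > first threshold) and the
    utterance is strong enough (speech value > second threshold), the
    recognition operation proceeds and the second threshold is raised,
    inducing the user to utter the start word more loudly; in a quiet
    environment the second threshold is lowered again.  Step size and
    bounds are illustrative assumptions.
    """

    def __init__(self, second_threshold: float = 6.0, step: float = 1.0,
                 lo: float = 0.0, hi: float = 20.0) -> None:
        self.second_threshold = second_threshold
        self.step = step
        self.lo, self.hi = lo, hi

    def update(self, noise_value: float, first_threshold: float,
               speech_value: float) -> bool:
        """Return True if a recognition operation should be performed."""
        if noise_value > first_threshold:          # noisy environment
            if speech_value > self.second_threshold:
                # recognize, then require a louder utterance next time
                self.second_threshold = min(
                    self.hi, self.second_threshold + self.step)
                return True
            return False  # utterance too weak relative to the noise
        # quiet environment: relax the requirement again
        self.second_threshold = max(self.lo, self.second_threshold - self.step)
        return True
```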
  • A control method of an electronic device includes: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring a speech characteristic from the sound signal; and, when the value indicating the noise characteristic is greater than a first threshold value and the value indicating the speech characteristic is greater than a second threshold value, performing a recognition operation regarding the user's utterance based on the sound signal and adjusting the second threshold value to increase.
  • the speech characteristic may include a signal-to-noise ratio of the sound signal.
  • the method may further include calculating a ratio of the noise to the sound signal for each frame of the sound signal, and determining an average value of the calculated ratio for each frame as a value of the speech characteristic.
  • The method may further include identifying whether the sound signal includes a predefined start word, and identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than a first threshold value.
  • the step of identifying whether the start word is included may include identifying whether the start word is included in the sound signal based on a similarity between the waveform of the sound signal and a predefined start word pattern.
  • the threshold of similarity may be preset based on a learning algorithm using an acoustic model.
  • The method may further include performing a recognition operation regarding the user's speech based on the sound signal whose similarity is greater than the third threshold value and which satisfies the fourth threshold value.
  • the method may further include the step of identifying whether a value of a noise characteristic of a sound signal received in a section having a predefined time length before the section identified as including the starting word is greater than a first threshold value.
  • the method may further include adjusting the second threshold to be lowered.
  • A recording medium stores a computer program including computer-readable code for performing a control method of an electronic device, the control method including: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring a speech characteristic from the sound signal; and, when the value indicating the noise characteristic is greater than the first threshold value and the value indicating the speech characteristic is greater than the second threshold value, performing a recognition operation regarding the user's utterance based on the sound signal and adjusting the second threshold value to increase.
  • According to the electronic device and the control method of the present invention, by resetting the threshold value for identifying the user's speech characteristic with respect to the sound signal in a noisy environment, the user is induced to utter the start word in a loud voice, and an effect of improving the accuracy of the recognition operation can be expected.
  • According to the electronic device and the control method thereof of the present invention, the occurrence of malfunctions in which the electronic device incorrectly recognizes a sound signal containing ambient noise, rather than an actual utterance of a user, as the start word in a noisy environment is reduced, which has the effect of improving the accuracy of voice recognition.
  • FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
  • FIG. 3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
  • FIG. 4 is a flowchart illustrating a method of controlling an electronic device according to an embodiment of the present invention.
  • FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention.
  • FIG. 6 is a view for explaining the identification of noise characteristics of an electronic device according to an embodiment of the present invention.
  • A 'module' or 'unit' performs at least one function or operation, may be implemented as hardware, software, or a combination of hardware and software, and may be integrated into and implemented as at least one module.
  • At least one of a plurality of elements refers not only to all of the plurality of elements, but also to each one of them, or to any combination thereof excluding the rest.
  • FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
  • The voice recognition system includes an electronic device 10 capable of receiving a sound signal, that is, a sound, as a voice uttered by a user, and a server 20 capable of communicating with the electronic device 10 through a network.
  • the electronic device 10 may receive a voice uttered by a user (hereinafter, also referred to as a user voice), process a sound signal corresponding to the voice, and perform a corresponding operation.
  • As an operation corresponding to the received voice, the electronic device 10 may provide audio content to the user by outputting a sound corresponding to the processing result of the user's voice through the output unit ( 110 of FIG. 2 ). At least one loudspeaker may be provided in the electronic device 10 as the output unit 110 capable of outputting sound, and the number, shape, and installation location of the speakers provided in the electronic device 10 are not limited in the present invention.
  • the electronic device 10 may be provided with a sound receiver ( 120 in FIG. 3 ) capable of receiving a sound signal as a user's voice.
  • the sound receiver 120 may be implemented as at least one microphone, and the number, shape, and installation location of the microphones provided in the electronic device 10 are not limited.
  • The electronic device 10 may be implemented as various devices capable of receiving a sound signal, such as an artificial intelligence speaker 10a (hereinafter also referred to as an AI speaker or a smart speaker), a display device 10b including a television such as a smart TV, and a mobile device 10c such as a smartphone or tablet.
  • the electronic device 10 implemented as the AI speaker 10a may receive a voice from a user and perform various functions, such as listening to music and searching for information, through voice recognition for the received voice.
  • The AI speaker is not a device that simply outputs sound; utilizing the voice recognition function and the cloud, it may be implemented as a device with a built-in virtual assistant/voice assistant that allows interaction with the user and provides services accordingly.
  • an application for the AI speaker function may be installed and driven in the electronic device 10 .
  • The electronic device 10 implemented as the display device 10b processes an image signal provided from an external signal supply source, that is, an image source, according to a preset process and displays it as an image.
  • The display device 10b includes a television (TV) capable of processing a broadcast signal based on at least one of a broadcast signal, broadcast information, or broadcast data provided from a transmission device of a broadcast station, and displaying it as an image.
  • The display device 10b may receive a video signal from, for example, a set-top box; an optical disc playback device such as a Blu-ray or digital versatile disc (DVD) player; a computer (PC) including a desktop or laptop; a console game machine; or a mobile device including a smart pad such as a smartphone or tablet.
  • When the display device 10b is a television, the display device 10b may wirelessly receive a radio frequency (RF) signal, that is, a broadcast signal transmitted from a broadcasting station; for this purpose, an antenna for receiving the broadcast signal and a tuner for tuning the broadcast signal for each channel may be provided.
  • A broadcast signal can be received through terrestrial wave, cable, satellite, or the like, and the signal source is not limited to an external device or a broadcasting station. That is, any device or station capable of transmitting and receiving data may be included among the image sources of the present invention.
  • The standard of the signal received by the display device 10b may be configured in various ways corresponding to the implementation form of the device.
  • For example, corresponding to the implementation form of the interface unit 140 (see FIG. 2 ) to be described later, the display device 10b may receive signals according to standards such as HDMI (High Definition Multimedia Interface), HDMI-CEC (Consumer Electronics Control), display port (DP), DVI (Digital Visual Interface), composite video, component video, super video, Thunderbolt, RGB cable, SCART (Syndicat des Constructeurs d'Appareils Radiorecepteurs et Televiseurs), USB, etc.
  • the display apparatus 10b may receive image content from a server or the like provided for content provision through wired or wireless network communication, and the type of communication is not limited.
  • Corresponding to the implementation form of the interface unit 140 to be described later, the display device 10b may receive a content signal through wireless network communication such as Wi-Fi, Wi-Fi Direct, Bluetooth, Bluetooth low energy, Zigbee, UWB (Ultra-Wideband), or NFC (Near Field Communication).
  • the display apparatus 10b may receive a content signal through wired network communication such as Ethernet.
  • the display apparatus 10b may serve as an AP that allows various peripheral devices such as a smartphone to perform wireless communication.
  • the display apparatus 10b may receive the content provided in the form of a file according to real-time streaming through the wired or wireless network as described above.
  • The display apparatus 10b may process signals to display on the screen a video, a still image, an application, an on-screen display (OSD), and a user interface (UI, hereinafter also referred to as a graphical user interface (GUI)) for controlling various operations, based on signals/data stored in internal/external storage media.
  • the display device 10b may operate as a smart TV or an Internet Protocol TV (IP TV).
  • A smart TV can receive and display broadcast signals in real time, and, having a web browsing function, makes it possible to search and consume various contents through the Internet while displaying real-time broadcasts; for this purpose, it can provide a convenient user environment.
  • Since the smart TV includes an open software platform, it can provide interactive services to users. Accordingly, the smart TV may provide a user with various contents, for example, an application providing a predetermined service, through the open software platform.
  • These applications are applications that can provide various types of services, and include, for example, applications that provide services such as SNS, finance, news, weather, maps, music, movies, games, and e-books.
  • an application for providing a voice recognition function may be installed on the display device 10b.
  • a display capable of displaying an image may be provided in the electronic device 10 .
  • The implementation method of the display is not limited; for example, it may be implemented in various display methods such as liquid crystal, plasma, light-emitting diode, organic light-emitting diode, surface-conduction electron-emitter, carbon nano-tube, and nano-crystal.
  • the electronic device 10 may communicate with various external devices including the server 20 through the interface unit 140 .
  • The electronic device 10 is implemented to be able to communicate with an external device through various types of wired or wireless connections (e.g., Bluetooth, Wi-Fi, or Wi-Fi Direct).
  • the server 20 is provided to perform wired or wireless communication with the electronic device 10 .
  • The server 20 is, for example, implemented in a cloud type, and can store and manage user account information of the electronic device 10 and/or an additional device associated with the electronic device 10 (e.g., a smartphone on which a corresponding application is installed to interwork with an AI speaker).
  • The implementation form of the server 20 is not limited; as an example, it may be implemented as an STT (Speech to Text) server that converts a sound signal related to voice into text, or as a main server related to voice recognition that also performs the function of the STT server.
  • the server 20 may be provided in plurality, such as the STT server and the main server, so that the electronic device 10 may communicate with the plurality of servers.
  • the server 20 may be provided with data for recognizing a voice uttered by a user, that is, a database (DB) in which information is stored.
  • the database may include, for example, a plurality of acoustic models predetermined by modeling signal characteristics of a voice.
  • the database may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
  • the acoustic model and/or the language model may be configured by performing learning in advance.
  • the electronic device 10 can identify and process the received user voice, and output the processing result through sound or image.
  • FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
  • The electronic device 10 includes an output unit 110 , a sound receiving unit 120 , a signal processing unit 161 , an interface unit 140 , a storage unit 150 , and a processor 160 .
  • The configuration of the electronic device 10 according to an embodiment of the present invention shown in FIG. 2 is only an example, and an electronic device according to another embodiment may be implemented with a configuration other than that shown in FIG. 2 . That is, the electronic device of the present invention may be implemented in a form in which a configuration other than that shown in FIG. 2 is added, or in which at least one of the configurations shown in FIG. 2 is excluded.
  • the output unit 110 outputs a sound, that is, a sound.
  • the output unit 110 may include, for example, at least one speaker capable of outputting sound in an audible frequency band of 20 Hz to 20 KHz.
  • the output unit 110 may output a sound corresponding to an audio signal/sound signal of a plurality of channels.
  • the output unit 110 may output a sound according to the processing of the sound signal as a user voice received through the sound receiving unit 120 .
  • the sound receiver 120 may receive a voice uttered by a user, that is, a sound wave.
  • the sound wave input through the sound receiving unit 120 is converted into an electrical signal by the signal converting unit.
  • the signal converter may include an AD converter that converts analog sound waves into digital signals.
  • the signal conversion unit may be included in a signal processing unit 161 to be described later.
  • the sound receiver 120 is implemented to be provided in the electronic device 10 by itself.
  • the sound receiving unit 120 may be implemented as a form provided in a separate device, not a component included in the electronic device 10 .
  • When the electronic device 10 is a display device such as a television, a user voice may be received through a microphone, that is, a sound receiver, installed in a remote control provided as an input device capable of user manipulation, and a sound signal corresponding to the user voice may be transmitted from the remote control to the electronic device 10 .
  • the analog sound wave received through the microphone of the remote control may be converted into a digital signal and transmitted to the electronic device 10 .
  • the input device includes a terminal device, such as a smartphone, on which a remote control application is installed.
  • the interface unit 140 allows the electronic device 10 to transmit or receive signals with various external devices including the server 20 and the terminal device.
  • the interface unit 140 may include a wired interface unit 141 .
  • The wired interface unit 141 may include a connection unit for transmitting/receiving signals/data according to standards such as HDMI, HDMI-CEC, USB, Component, Display Port (DP), DVI, Thunderbolt, and RGB cable. Here, the wired interface unit 141 may include at least one connector, terminal, or port corresponding to each of these standards.
  • the wired interface unit 141 may be implemented in a form including an input port for receiving a signal from an image source, etc., and may further include an output port in some cases to transmit/receive signals in both directions.
  • The wired interface unit 141 may include a connector or port for connecting an antenna capable of receiving a broadcast signal according to a broadcasting standard such as terrestrial/satellite broadcasting, or a cable capable of receiving a broadcast signal according to a cable broadcasting standard, and may include a connector or port according to video and/or audio transmission standards such as an HDMI port, DisplayPort, DVI port, Thunderbolt, composite video, component video, super video, or SCART.
  • the electronic device 10 may have a built-in antenna capable of receiving a broadcast signal.
  • the wired interface unit 141 may include a connector or port according to a universal data transmission standard such as a USB port.
  • the wired interface unit 141 may include a connector or a port to which an optical cable can be connected according to an optical transmission standard.
  • the wired interface unit 141 is connected to an external microphone or an external audio device having a microphone, and may include a connector or a port capable of receiving or inputting an audio signal from the audio device.
  • The wired interface unit 141 is connected to an audio device such as a headset, earphone, or external speaker, and may include a connector or port capable of transmitting or outputting an audio signal to the audio device.
  • the wired interface unit 141 may include a connector or port according to a network transmission standard such as Ethernet.
  • the wired interface unit 141 may be implemented as a LAN card connected to a router or a gateway by wire.
  • The wired interface unit 141 is connected by wire through the connector or port, in a 1:1 or 1:N (N is a natural number) manner, to external devices such as a set-top box, an optical media playback device, an external display device, a speaker, or a server, thereby receiving a video/audio signal from the corresponding external device or transmitting a video/audio signal to the corresponding external device.
  • the wired interface unit 141 may include a connector or a port for separately transmitting video/audio signals.
  • the wired interface unit 141 is embedded in the electronic device 10 , but may be implemented in the form of a dongle or a module to be detachably attached to the connector of the electronic device 10 .
  • the interface unit 140 may include a wireless interface unit 142 .
  • the wireless interface unit 142 may be implemented in various ways corresponding to the implementation form of the electronic device 10 .
  • The wireless interface unit 142 can use wireless communication methods such as RF (radio frequency), Zigbee, Bluetooth, Wi-Fi, UWB (Ultra WideBand), and NFC (Near Field Communication).
  • the wireless interface unit 142 may be implemented as a communication circuitry including a wireless communication module (S/W module, chip, etc.) corresponding to various types of communication protocols.
  • the wireless interface unit 142 includes a wireless LAN unit.
  • the wireless LAN unit may be wirelessly connected to an external device through an access point (AP) under the control of the processor 160 .
  • the wireless LAN unit includes a WiFi module.
  • the wireless interface unit 142 includes a wireless communication module that supports one-to-one direct communication between the electronic device 10 and an external device wirelessly without an access point.
  • the wireless communication module may be implemented to support communication methods such as Wi-Fi Direct, Bluetooth, and Bluetooth low energy.
  • the storage unit 150 may store identification information (eg, MAC address or IP address) on the external device, which is a communication target device.
  • the wireless interface unit 142 is provided to perform wireless communication with an external device by at least one of a wireless LAN unit and a wireless communication module according to performance.
  • The wireless interface unit 142 may further include communication modules using various other communication methods, such as mobile communication such as LTE, EM communication including magnetic fields, and visible light communication.
  • the wireless interface unit 142 may transmit and receive data packets to and from the server by wirelessly communicating with the server on the network.
  • the wireless interface unit 142 may include an IR transmitter and/or an IR receiver capable of transmitting and/or receiving an IR (Infrared) signal according to an infrared communication standard.
  • the wireless interface unit 142 may receive or input a remote control signal from the remote control or other external device through the IR transmitter and/or the IR receiver, or may transmit or output a remote control signal to another external device.
  • the electronic device 10 may transmit/receive a remote control signal to and from the remote control or other external device through the wireless interface unit 142 of another method such as Wi-Fi or Bluetooth.
  • the electronic device 10 may further include a tuner for tuning the received broadcast signal for each channel.
  • the wireless interface unit 142 may transmit predetermined data as information of the user's voice received through the sound receiving unit 120 to an external device, that is, the server 20 .
  • the form/type of the transmitted data is not limited, and for example, an audio signal corresponding to a voice uttered by a user or a voice characteristic extracted from the audio signal may be included.
  • the wireless interface unit 142 may receive data of the processing result of the user's voice from the server 20 .
  • the electronic device 10 outputs a sound corresponding to the voice processing result through the output unit 110 based on the received data.
  • the above-described embodiment is an example, and the user's voice may be processed within the electronic device 10 itself rather than transmitted to the server 20 . That is, in another embodiment, the electronic device 10 may be implemented to perform the role of the STT server.
  • the electronic device 10 may communicate with an input device such as a remote control through the wireless interface unit 142 to receive a sound signal corresponding to the user's voice from the input device.
  • the communication module communicating with the server 20 and the communication module communicating with the remote controller may be different from each other.
  • for example, the electronic device 10 may communicate with the server 20 through an Ethernet modem or Wi-Fi module, and may communicate with the remote controller through a Bluetooth module.
  • the communication module communicating with the server 20 and the communication module communicating with the remote control may be the same.
  • the electronic device 10 may communicate with the server 20 and the remote controller through the Bluetooth module.
  • the storage unit 150 is configured to store various data of the electronic device 10 .
  • the storage unit 150 should retain data even when power supplied to the electronic device 10 is cut off, and may be provided as a writable nonvolatile memory (writable ROM) to reflect changes. That is, the storage unit 150 may be provided as any one of flash memory, EPROM, or EEPROM.
  • the storage unit 150 may further include a volatile memory such as DRAM or SRAM, whose read or write speed for the electronic device 10 is faster than that of the nonvolatile memory.
  • the data stored in the storage 150 includes, for example, an operating system for driving the electronic device 10 , and various software, programs, applications, and additional data executable on the operating system.
  • An application stored and installed in the storage unit 150 of the electronic device 10 may include an AI speaker application that recognizes a user voice received through the sound receiver 120 and performs an operation corresponding to it.
  • the AI speaker application may be activated when an input of a predetermined keyword, that is, a trigger word, is identified through the sound receiver 120 , or when a user operation on a specific button of the electronic device 10 or the like is identified.
  • the activation of the application may include switching the execution state of the application from the background mode to the foreground mode.
  • the storage unit 150 may include a database 151 in which data, that is, information for recognizing a user voice that can be received through the sound receiving unit 120 , is stored.
  • the database 151 may include, for example, a plurality of acoustic models determined in advance by modeling signal characteristics of speech.
  • the database 151 may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
  • the database in which information for recognizing a user's voice is stored may be provided in the server 20, which is an example of an external device accessible by a wired or wireless network through the wireless interface unit 142 as described above.
  • the server 20 may be implemented, for example, in a cloud type.
  • the processor 160 controls all components of the electronic device 10 to operate.
  • the processor 160 executes instructions included in a control program to perform such a control operation.
  • the processor 160 includes at least one general-purpose processor that loads at least a part of the control program from the non-volatile memory, in which the control program is installed, into the volatile memory and executes the loaded control program, and may be implemented as, for example, a CPU (Central Processing Unit) or an application processor (AP).
  • the processor 160 may include a single core, a dual core, a triple core, a quad core, or multiple cores thereof.
  • the processor 160 may include a plurality of processors, for example, a main processor and a sub-processor that operates in a sleep mode (for example, a mode in which only standby power is supplied and the device does not operate as an electronic device receiving a sound signal).
  • the processor, the ROM, and the RAM are interconnected through an internal bus, and the ROM and the RAM are included in the storage unit 150 .
  • a CPU or an application processor, which is an example of implementing the processor 160 , may be implemented in a form included in a main SoC mounted on a PCB embedded in the electronic device 10 .
  • the control program may include program(s) implemented in the form of at least one of a BIOS, a device driver, an operating system, firmware, a platform, and an application program (application).
  • the application program may be pre-installed or stored in the electronic device 10 when the electronic device 10 is manufactured, or may be installed in the electronic device 10 based on data of the application program received from the outside at the time of later use. The data of the application program may be downloaded to the electronic device 10 from, for example, an external server such as an application market. Such an application program and external server are examples of the computer program product of the present invention, but the present invention is not limited thereto.
  • the processor 160 may include a signal processing unit 161 as shown in FIG. 2 .
  • the signal processing unit 161 processes an audio signal, that is, a sound signal.
  • the sound signal processed by the signal processing unit 161 may be output as sound through the output unit 110 to provide audio content to the user.
  • the signal processing unit 161 is a software block of the processor 160 , and may be implemented in a form that performs one function of the processor 160 .
  • the signal processing unit 161 may be implemented as a separate configuration from the CPU or application processor (AP), which is an example of implementing the processor 160 , for example, as a microprocessor such as a digital signal processor (DSP) or as an integrated circuit (IC), or may be implemented by a combination of hardware and software.
  • the processor 160 may include a voice recognition module 162 capable of recognizing a voice signal uttered by a user, as shown in FIG. 2 .
  • FIG. 3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
  • the voice recognition module 162 receives the user's utterance as an input, and may be implemented to initiate an action for voice recognition in response to an input of a predetermined start word (hereinafter, also referred to as a trigger word or a wake-up word (WUW)).
  • the voice recognition module 162 may include a preprocessor 301 , a start word engine 302 , a threshold value determiner 303 , and a voice recognition engine 304 , as shown in FIG. 3 .
  • the preprocessor 301 may receive a voice signal according to the user's utterance from the sound receiver 120 and perform preprocessing for removing ambient noise, that is, noise.
  • the pre-processing may include processes such as digital signal conversion, filtering, framing, and the like, and a meaningful voice signal can be extracted by removing unnecessary ambient noise from the voice signal according to the above processes.
  • the start word engine 302 performs pattern matching by comparing features extracted from the pre-processed speech signal with a predetermined pattern.
  • the start word engine 302 may perform pattern matching using an acoustic model configured by performing pre-learning.
  • the start word engine 302 may identify whether the input utterance includes the start word based on the similarity between the input utterance, that is, the waveform of the voice signal (sound signal) according to the user's speech, and the start word pattern of the acoustic model.
  • the start word engine 302 may identify that the input utterance includes the start word when, as a result of the comparison by pattern matching, the score of the input utterance, that is, the utterance score, is greater than a predetermined start word threshold (WUW Threshold).
  • such a threshold of similarity, that is, the starting word threshold (WUW Threshold), may be preset based on a learning algorithm using the acoustic model.
  • the starting word threshold (WUW Threshold) is defined as a condition for activating the voice recognition function of the electronic device 10 .
  • the starting word threshold (WUW Threshold) is distinguished from the noise threshold and SNR threshold respectively used for comparison with the noise characteristic and noise-to-speech characteristic of a sound signal, which will be described later.
  • the electronic device 10 may be implemented to use two starting word thresholds set to have different values when the user's utterance is made in a noisy environment.
  • a specific example of applying the two starting word thresholds in such a noisy environment will be described in more detail in the embodiment of FIG. 4 to be described later.
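As a rough illustration of how the utterance score relates to the two starting word thresholds, the following Python sketch uses cosine similarity as a stand-in for the acoustic-model pattern matching (the patent's Equation 1 is not reproduced here, so the similarity measure, function names, and feature representation are all assumptions; 0.1 and 0.15 are the example threshold values mentioned later in the text):

```python
import numpy as np

# Example threshold values from the description (explicitly not limiting).
WUW_THRESHOLD1 = 0.10  # first start word threshold (third threshold value)
WUW_THRESHOLD2 = 0.15  # second start word threshold (fourth threshold value)

def utterance_score(signal_features, start_word_pattern):
    """Similarity between an input feature vector and the start word pattern.

    Cosine similarity is an assumed stand-in for the pattern matching of
    Equation 1, which the text references but does not reproduce.
    """
    a = np.asarray(signal_features, dtype=float)
    b = np.asarray(start_word_pattern, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def satisfies_first_activation(score):
    # Applied regardless of whether the utterance is made in a noisy environment.
    return score > WUW_THRESHOLD1

def satisfies_second_activation(score):
    # The stricter threshold, applied only in the noisy-environment branch.
    return score > WUW_THRESHOLD2
```

A score of, say, 0.12 would satisfy the first activation condition but not the second, which is exactly the case where the noise and SNR checks described below come into play.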
  • the threshold value determining unit 303 identifies whether the user's utterance is made in a noisy environment using a predetermined noise threshold.
  • the identification of the noise environment may be made based on a comparison between the power of the sound signal in a specific section, which is a noise characteristic of the sound signal according to the user's utterance, and the noise threshold.
  • the threshold value determining unit 303 identifies whether the ratio of the sound signal to the noise is equal to or greater than a specific level as the speech characteristic of the sound signal according to the user's speech using a predetermined SNR threshold.
  • the threshold value determining unit 303 may change the SNR threshold based on the result of comparing the noise characteristic of the sound signal with the noise threshold as described above, or the result of comparing the speech characteristic of the sound signal with the SNR threshold. Changing the SNR threshold may include, for example, adjusting the value upward or downward. A specific example of changing the SNR threshold will be described in more detail in the embodiment of FIG. 4 to be described later.
  • the voice recognition engine 304 may be implemented to include a voice recognition function for a voice signal received as a user's utterance, that is, a sound signal, so as to perform a recognition operation on the user's utterance.
  • when the user's utterance is made in a noisy environment, the voice recognition engine 304 may activate the voice recognition function when a two-stage activation condition based on the two starting word threshold values is satisfied, and the electronic device 10 may be implemented to perform a recognition operation regarding the user's utterance based on the received sound signal.
  • a specific example in which the voice recognition function is activated according to the activation conditions of these two steps will be described in more detail in the embodiment of FIG. 4 to be described later.
  • the voice recognition function of the voice recognition engine 304 may be performed using one or more voice recognition algorithms.
  • the voice recognition engine 304 extracts a vector representing a voice feature from a voice signal uttered by a user, and compares the extracted vector with an acoustic model of the database 151 or the server 20 to perform voice recognition.
  • the acoustic model is a model according to previously performed learning as an example.
  • the voice recognition module 162 comprising the preprocessor 301 , the starting word engine 302 , the threshold value determining unit 303 , and the voice recognition engine 304 is described as an example implemented as an embedded type in the processor 160 , but the present invention is not limited thereto. Accordingly, the voice recognition module 162 may be implemented as a configuration of the electronic device 10 separate from the CPU, for example, a separate chip such as a microcomputer provided as a dedicated processor for the voice recognition function.
  • each component of the voice recognition module 162 , that is, the preprocessor 301 , the start word engine 302 , the threshold value determination unit 303 , and the voice recognition engine 304 , may be implemented as a software block as an example; in some cases, at least one component may be excluded, or at least one other component may be added.
  • it will be understood that, in order for the electronic device 10 to perform the voice recognition function, operations performed by at least one of the aforementioned preprocessor 301 , the start word engine 302 , the threshold value determiner 303 , and the voice recognition engine 304 are performed by the processor 160 of the electronic device 10 .
  • the processor 160 identifies whether a value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than a noise threshold (hereinafter, also referred to as a first threshold value), identifies whether a value representing the utterance characteristic of the sound signal is greater than the SNR threshold (hereinafter, also referred to as a second threshold value), and, when the value of the noise characteristic is greater than the first threshold value and the value of the utterance characteristic is greater than the second threshold value, may perform a recognition operation on the user's utterance based on the received sound signal and adjust the second threshold value, that is, the SNR threshold, upward.
  • the processor 160 may identify, with respect to a sound signal for which the similarity between the waveform of the sound signal and the predefined start word pattern is greater than a first start word threshold (hereinafter, also referred to as a third threshold value), that is, a sound signal that satisfies the first activation condition, whether the value of the noise characteristic and the value of the utterance characteristic are greater than the first threshold value and the second threshold value, respectively.
  • when it is identified that the value of the noise characteristic of the received sound signal is equal to or less than the first threshold value, the processor 160 performs a recognition operation on the user's utterance based on the received sound signal, and may adjust the second threshold value, that is, the SNR threshold, downward.
  • the processor 160 may perform a recognition operation regarding the user's utterance based on the received sound signal when the similarity between the waveform of the sound signal and the starting word pattern is greater than the second starting word threshold (hereinafter, also referred to as a fourth threshold value), that is, when the second activation condition is satisfied.
  • the operation of the processor 160 may be implemented as a computer program stored in a computer program product (not shown) provided separately from the electronic device 10 .
  • the computer program product includes a memory in which instructions corresponding to the computer program are stored, and a processor.
  • the instruction includes, when executed by the processor 160 , if the value representing the noise characteristic of the sound signal received through the sound receiving unit 120 is greater than the first threshold value and the value representing the speech characteristic of the sound signal is greater than the second threshold value, performing a recognition operation on the user's utterance based on the received sound signal and adjusting the second threshold value upward.
  • the instruction includes, if the value representing the noise characteristic of the received sound signal is equal to or less than the first threshold, performing a recognition operation on the user's utterance based on the received sound signal and lowering the second threshold.
  • the processor 160 of the electronic device 10 may download and execute a computer program stored in a separate computer program product to perform the above-described operation of the instruction.
  • FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention
  • FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention
  • FIG. 6 is a diagram for explaining the identification of noise characteristics of an electronic device according to an embodiment of the present invention.
  • the electronic device 10 may receive a sound signal through the sound receiver 120 ( 401 ).
  • the received sound signal may be a signal according to the user's utterance.
  • the processor 160 may identify whether the sound signal received in step 401 satisfies the first activation condition for the voice recognition function (step 402).
  • for example, the processor 160 performs pattern matching between a sound signal from which ambient noise, that is, noise, has been removed and a predefined start word signal, as shown in FIG. 5 , and may thereby identify whether the first activation condition is satisfied.
  • the processor 160 derives an utterance score as a similarity between the user's speech, that is, the waveform of the sound signal, and the pattern of the start word signal based on the pattern matching as shown in FIG. 5 , and may identify, using Equation 1 below, whether the derived utterance score, that is, the degree of similarity, is greater than the predetermined first start word threshold (WUW Threshold1), that is, the third threshold value.
  • the first starting word threshold (third threshold value) is for identifying whether the sound signal satisfies the first activation condition for the voice recognition function, and is applied regardless of whether the user's utterance is made in a noisy environment.
  • the first starting word threshold may be preset to, for example, 0.1, but this value is presented as an example and is not limited thereto.
  • the processor 160 may determine that the sound signal input in step 401 satisfies the first activation condition when it is identified that the utterance score is greater than the first starting word threshold by Equation (1).
  • the processor 160 may identify whether the value representing the noise characteristic of the sound signal according to the user's utterance is greater than a predetermined noise threshold, that is, the first threshold value ( 403 ).
  • the first threshold value is for identifying whether the user's surroundings are a noisy environment, and may be preset to correspond to a power value of a sound signal when a sufficiently loud noise is present in the surroundings.
  • for example, the processor 160 identifies a section including the start word uttered by the user (hereinafter, also referred to as a start word section), and may identify whether a value indicating the noise characteristic of the signal received in a section of predefined time length before the start word section (hereinafter, also referred to as a noise characteristic confirmation section) is greater than the first threshold value.
  • the sound signal received by the sound receiving unit 120 in a streaming manner may be temporarily stored in units of consecutive frames in a First In First Out (FIFO) queue-type data structure, as shown in FIG. 6 . That is, the streaming sound signal is stored in such a way that, when the next frame is received, the first stored frame is pushed out.
  • the length of the sound signal to be stored may be preset to correspond to the storage space, and for example, it may be implemented such that a signal having a length of 2.5 seconds is stored.
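A bounded deque gives a minimal sketch of this FIFO frame buffer (the 2.5-second stored length is the example from the text; the 0.5-second frame length is a hypothetical choice, since the patent does not state a frame duration):

```python
from collections import deque

FRAME_SEC = 0.5   # hypothetical frame length; not specified in the text
BUFFER_SEC = 2.5  # example stored-signal length from the description
MAX_FRAMES = int(BUFFER_SEC / FRAME_SEC)  # 5 frames

# Once 2.5 s worth of frames are stored, appending the next frame pushes
# the oldest frame out, as in the FIFO queue of FIG. 6.
frame_buffer = deque(maxlen=MAX_FRAMES)
for frame_index in range(8):          # simulate a streaming input
    frame_buffer.append(frame_index)  # index stands in for audio frame data

print(list(frame_buffer))  # only the 5 most recent frames remain
```

The `maxlen` argument does the push-out behavior automatically, which is why a deque is a natural fit for this kind of rolling audio window.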
  • the processor 160 may monitor whether a start word according to the user's utterance is included in each frame of the streaming sound signal received and stored in units of consecutive frames as described above. Based on the monitoring, for example, as described in step 402, when it is detected that the utterance score in a specific signal frame is greater than the first start word threshold, the processor 160 may identify the corresponding signal frame as containing the user's speech, that is, the start word.
  • the processor 160 may identify a section of predetermined time length from the signal frame identified in step 402, for example, a time section of up to about 1 second, as the start word section. In addition, the processor 160 may identify a section of predetermined time length before the identified start word section, for example, a time section of about 1.5 seconds, as the noise characteristic confirmation section.
  • the noise characteristic confirmation section may be defined to correspond to the time obtained by subtracting the time of the start word section from the time of the entire stored sound signal, and in the present invention, the time length corresponding to the start word section and the time length of the noise characteristic confirmation section are not limited to the examples presented.
  • the processor 160 may compare the signal power of the noise characteristic confirmation section with the first threshold value, thereby identifying whether the surrounding environment is sufficiently noisy at the time of utterance, that is, whether the user's utterance is made in a noisy environment.
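A minimal sketch of this noise check might look as follows, assuming mean-square power as the "signal power" measure and a hypothetical first threshold value (neither is fixed by the text; the 1.5-second window follows the example above):

```python
import numpy as np

NOISE_THRESHOLD = 0.01  # first threshold value; hypothetical power level

def is_noisy_environment(samples, sample_rate, start_word_onset, check_sec=1.5):
    """Compare the power of the noise-characteristic confirmation section
    (the window just before the start word section) with the first threshold.

    Mean-square power is an assumption; the patent only says 'signal power'.
    """
    begin = max(0, int((start_word_onset - check_sec) * sample_rate))
    end = int(start_word_onset * sample_rate)
    section = np.asarray(samples[begin:end], dtype=float)
    if section.size == 0:
        return False
    return float(np.mean(section ** 2)) > NOISE_THRESHOLD
```

A loud ambient section before the start word yields `True` (noisy environment, step 404 follows), while near-silence yields `False` (the quiet-environment branch, steps 410-411).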
  • in step 403, if it is identified that the noise characteristic of the sound signal, that is, the signal power, is greater than the first threshold value, the processor 160 may identify whether the utterance characteristic of the sound signal is greater than a predetermined SNR threshold, that is, the second threshold value ( 404 ).
  • the speech characteristic may include a signal to noise ratio (SNR) of a sound signal.
  • the processor 160 calculates a posteriori SNR (SNRpost) corresponding to the ratio of the total sound signal to the noise as the speech characteristic of the sound signal, and may identify, using Equation 2 below, whether the calculated posterior SNR is greater than the predetermined second threshold value, that is, the SNR threshold.
  • the posterior SNR (SNRpost) may be calculated using Equations 3 and 4 below.
  • in Equations 3 and 4, X(p,k) represents the total sound signal including noise, S(p,k) represents the speech signal, and N(p,k) represents the noise signal, respectively.
  • the received input sound signal (voice signal) X may be expressed as the sum of the k-th spectral elements for each frame p of the speech element S and the noise element N, as shown in Equation 3, that is, X(p,k) = S(p,k) + N(p,k).
  • the posterior SNR (SNRpost) may be calculated, as expressed in Equation 4, as the ratio of the total sound signal X(p,k) including the noise to the magnitude of the noise N(p,k) for each frame p.
  • the final posterior SNR for all frames may be calculated as an average value of posterior SNRs for each frame p.
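Since Equations 2 to 4 themselves are not reproduced in this text, the sketch below assumes the conventional power-ratio form of the a posteriori SNR and averages it over frames as the description states; the array shapes and function names are illustrative only:

```python
import numpy as np

def posterior_snr(X, N):
    """Final posterior SNR over all frames.

    X[p, k]: k-th spectral element of the total sound signal per frame p
    N[p, k]: k-th spectral element of the noise estimate per frame p
    Per Equation 3, X(p,k) = S(p,k) + N(p,k). The per-frame ratio below is
    the usual a posteriori SNR (total-signal power over noise power), an
    assumed stand-in for Equation 4; the final value is the mean over the
    frames p, as the text describes.
    """
    X = np.asarray(X, dtype=float)
    N = np.asarray(N, dtype=float)
    per_frame = np.sum(np.abs(X) ** 2, axis=1) / np.sum(np.abs(N) ** 2, axis=1)
    return float(np.mean(per_frame))

def satisfies_snr_condition(snr_post, snr_threshold=4.0):
    # 4.0 is the example initial SNR threshold given in the text (step 404).
    return snr_post > snr_threshold
```

With this convention a louder utterance over the same noise floor yields a larger posterior SNR, which is what makes the "greater than the SNR threshold" test of step 404 meaningful.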
  • the processor 160 determines the final posterior SNR calculated in this way as the speech characteristic of the sound signal in step 404, and may compare the speech characteristic with the second threshold value (SNR Threshold) to identify whether the user's speech is sufficiently loud in the noisy environment.
  • the second threshold value, that is, the SNR threshold, is a predetermined value corresponding to a level at which an input sound signal can be recognized as generated by the user's speech in a noisy environment, and its initial value may be set in advance.
  • the initial SNR threshold may be set to, for example, 4, but is not limited thereto.
  • in step 403, if the electronic device 10 operates in a sufficiently noisy environment (YES in step 403), misrecognition may occur in step 402, in which a sound signal including ambient noise, not the user's actual speech, is regarded as including the starting word.
  • accordingly, when the value of the noise characteristic of the signal is greater than the first threshold value, the utterance characteristic of the sound signal is further compared with the second threshold value (initial SNR threshold) in step 404 .
  • that is, in step 404 it is further determined whether the user's utterance is sufficiently loud in the noisy environment, and based on the result, execution of a trigger for performing a voice recognition operation can be controlled.
  • in step 404, when it is identified that the value of the utterance characteristic of the sound signal is greater than the predetermined second threshold value, for example, the initial SNR threshold, the processor 160 executes a trigger and controls the electronic device 10 to perform a voice recognition operation on the user's utterance based on the received sound signal ( 405 ).
  • for example, if the final posterior SNR calculated in step 404 is 5, which is greater than the initial SNR threshold of 4, the trigger may be executed.
  • that is, if it is identified that the user's speech is sufficiently loud (YES in step 404), the processor 160 immediately executes the trigger, thereby activating the voice recognition function in the electronic device 10 so that an operation can be performed in response to the received sound signal.
  • the processor 160 may adjust the second threshold upward from the predetermined initial SNR threshold ( 406 ).
  • that is, the processor 160 may reset the second threshold value to reflect the noise environment, as a change in the surrounding environment, after executing the trigger in step 405 .
  • for example, the processor 160 may derive a new second threshold value (SNR Threshold) according to Equation 5 below, using the initial SNR threshold (SNRTh_init) and the posterior SNR (SNRpost) calculated in step 404.
  • the second threshold value adjusted upward as described above becomes the value applied in step 404 to the corresponding sound signal when the next sound signal is received.
  • by adjusting the second threshold value (SNR threshold), which is the trigger execution condition, upward in this way, the user can be induced to utter in a correspondingly louder voice in the noisy environment.
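Equation 5 is not reproduced in this text, so the sketch below stands in with a simple blend of the initial threshold and the measured posterior SNR (the `alpha` weight is a pure assumption); it only aims to mirror the described behavior, where a posterior SNR above the initial threshold raises the new threshold (step 406) and one below it lowers the threshold (step 411):

```python
SNR_TH_INIT = 4.0  # example initial SNR threshold from the description

def adjust_snr_threshold(snr_post, snr_th_init=SNR_TH_INIT, alpha=0.5):
    """Derive a new second threshold value (SNR Threshold) from the initial
    threshold and the calculated posterior SNR.

    The linear blend below is an assumed stand-in for Equation 5, which is
    not reproduced in the text; `alpha` is a hypothetical smoothing weight.
    """
    return (1 - alpha) * snr_th_init + alpha * snr_post

# Loud utterance in noise (posterior SNR 5, step 406): threshold rises to 4.5.
# Quiet surroundings (posterior SNR 2, step 411): threshold falls to 3.0.
```

Whatever the exact form of Equation 5, the key property is this direction of movement: the reset threshold tracks the environment measured from the most recent utterance.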
  • when it is identified in step 404 that the value of the utterance characteristic of the sound signal is less than or equal to the predetermined second threshold value, for example, the initial SNR threshold, the processor 160 may further identify whether the corresponding sound signal satisfies the second activation condition ( 407 ).
  • using Equation 6 below, when the utterance score derived as the similarity between the waveform of the sound signal received in step 401 and the start word pattern is greater than a predetermined second start word threshold (WUW Threshold2), that is, the fourth threshold value, the processor 160 may identify that the sound signal satisfies the second activation condition.
  • the second starting word threshold WUW Threshold2 (fourth threshold value) is for identifying whether the sound signal satisfies the second activation condition for the voice recognition function, and is applied when the user's speech is made in a noisy environment (YES in step 403).
  • the second starting word threshold WUW Threshold2 (the fourth threshold) is set to a value greater than the first starting word threshold WUW Threshold1 (third threshold) in step 401 as shown in Equation 7 below.
  • for example, the first starting word threshold value (third threshold value) may be preset to 0.1 and the second starting word threshold value (fourth threshold value) may be preset to 0.15, but these values are presented as examples and are not limited thereto.
  • the processor 160 may determine that the sound signal input in step 401 satisfies the second activation condition when it is identified that the utterance score is greater than the second starting word threshold by Equation (6).
  • if it is determined in step 407 that the sound signal satisfies the second activation condition, the processor 160 executes the trigger and controls the electronic device 10 to perform a voice recognition operation related to the user's utterance based on the received sound signal ( 408 ).
  • if it is determined in step 407 that the sound signal does not satisfy the second activation condition, that is, if the utterance score is identified as being equal to or less than the second starting word threshold by Equation 6, the processor 160 does not execute the trigger, and the electronic device 10 is controlled to keep voice recognition inactive ( 409 ).
  • in other words, in a noisy environment (YES in step 403), even if the similarity between the waveform of the input sound signal and the pattern of the start word signal is identified to be greater than the first starting word threshold, so that the input sound signal meets the first activation condition (YES in step 402), the voice recognition function is activated only when the sound signal also satisfies the second activation condition (YES in step 407); that is, control based on a two-stage activation condition is performed.
  • since, in step 407, the speech recognition function is activated only when the utterance score indicating the similarity according to the pattern matching between the sound signal and the starting word signal is greater than the second starting word threshold, even if a sound signal including ambient noise other than the user's actual speech is erroneously recognized in step 402 as including the starting word, the possibility of an erroneous operation is reduced by step 407 .
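Pulling the steps of FIG. 4 together, the decision flow described above can be sketched as one function (the numeric thresholds are the example values discussed in the text, except the noise threshold, which is hypothetical; the step numbers are returned only to make the branches easy to follow):

```python
def run_trigger_decision(score, noise_power, snr_post, *,
                         wuw_th1=0.10, wuw_th2=0.15,
                         noise_th=0.01, snr_th=4.0):
    """Return (step taken, trigger executed) for one received sound signal."""
    if score <= wuw_th1:          # step 402: first activation condition fails
        return 412, False         # keep voice recognition inactive
    if noise_power <= noise_th:   # step 403: surroundings are not noisy
        return 410, True          # execute trigger (SNR threshold lowered, 411)
    if snr_post > snr_th:         # step 404: utterance loud enough in noise
        return 405, True          # execute trigger (SNR threshold raised, 406)
    if score > wuw_th2:           # step 407: second activation condition
        return 408, True          # execute trigger despite low SNR
    return 409, False             # keep voice recognition inactive
```

For example, a score of 0.12 in a noisy environment with posterior SNR 2 reaches step 409 (no trigger), while the same signal with a score of 0.16 reaches step 408 (trigger executed).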
  • in step 403, when it is identified that the noise characteristic of the sound signal, that is, the signal power, is equal to or less than the first threshold value, the processor 160 executes the trigger and controls the electronic device 10 to perform a voice recognition operation ( 410 ).
  • the processor 160 may adjust the second threshold downward from a predetermined initial SNR threshold ( 411 ).
  • that is, the electronic device 10 determines that the surroundings are not noisy, and the second threshold value may be reset to reflect this.
  • the processor 160 calculates a post SNR (SNR post ) corresponding to the noise ratio to the total sound signal as the speech characteristic of the sound signal as described in step 404, and according to Equation 5 above, A new second threshold (SNR Threshold) may be derived using the calculated post SNR and the initial SNR threshold.
  • In this case, the calculated posterior SNR is derived to be smaller than in the case of step 404, and may be, for example, 2.
  • Accordingly, the new second threshold (SNR Threshold) is 4.16*log_4.
  • In other words, since the second threshold (SNR threshold) serving as the trigger execution condition is lowered to correspond to the case where the environment is not noisy, immediate voice recognition is enabled even when the user utters a quiet sound.
  • Meanwhile, if it is determined in step 402 that the sound signal does not satisfy the first activation condition, that is, if the utterance score indicating the similarity between the sound signal and the start-word signal by Equation 1 is identified as being equal to or less than the first start-word threshold, the processor 160 does not execute the trigger, and the electronic device 10 may be controlled to keep voice recognition inactive (412).
  • In the electronic device 10 as described above, when the signal-to-noise ratio (SNR), which is the utterance characteristic of the sound signal in a noisy environment, is greater than the SNR threshold, that is, when the user's utterance is sufficiently loud compared to the noise, the SNR threshold is adjusted upward; this induces the user to utter the start word loudly in a noisy environment, so an effect of improving operation accuracy can be expected.
  • Also, when the surrounding environment is not noisy, the electronic device 10 adjusts the SNR threshold downward, enabling immediate operation of the electronic device 10 in response to the environmental change in a quiet environment.
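The two-stage control flow summarized in the steps above can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the function name, the dictionary layout of the thresholds, the concrete threshold values, and the simple additive threshold-adjustment step are all assumptions made for the sketch.

```python
def wake_word_trigger(utterance_score, noise_power, snr_post, state):
    """Two-stage trigger decision sketch (steps 402-412 of FIG. 4).

    utterance_score: pattern-matching similarity between sound and start word
    noise_power:     noise characteristic measured before the start word
    snr_post:        posterior SNR (utterance characteristic) of the signal
    state:           dict holding the adjustable thresholds (assumed shape)
    """
    # Step 402: first activation condition (first start-word threshold).
    if utterance_score <= state["word_threshold_1"]:
        return False  # step 412: keep voice recognition inactive

    # Step 403: noise check against the first threshold.
    if noise_power <= state["noise_threshold_1"]:
        # Quiet environment: trigger immediately (step 410) and
        # lower the SNR threshold toward its initial value (step 411).
        state["snr_threshold"] = max(
            state["snr_threshold_init"], state["snr_threshold"] - 1.0)
        return True

    # Noisy environment: second-stage conditions apply.
    # SNR check against the (possibly raised) second threshold.
    if snr_post > state["snr_threshold"]:
        state["snr_threshold"] += 1.0  # adjust upward (assumed step size)
        return True

    # Step 407: stricter second start-word threshold as a fallback.
    return utterance_score > state["word_threshold_2"]
```

A caller would keep the `state` dictionary across utterances, so that the SNR threshold adapts as the environment changes.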

Abstract

The present invention relates to an electronic device and a control method therefor. The electronic device comprises: a sound reception unit; and a processor that, when a noise characteristic acquired from a sound signal received via the sound reception unit is greater than a first threshold value and an utterance characteristic acquired from the sound signal is greater than a second threshold value, performs a recognition operation for a user utterance on the basis of the sound signal and adjusts the second threshold value upward.

Description

Electronic device and control method therefor
The present invention relates to an electronic device and a control method therefor, and more particularly, to an electronic device for processing a voice uttered by a user and a control method therefor.
Electronic devices such as artificial intelligence (AI) speakers, mobile devices such as smartphones or tablets, and smart TVs can recognize a voice uttered by a user and perform a function according to the voice recognition.
The electronic device may operate to activate the voice recognition function by recognizing that a predetermined start word, that is, a trigger word, has been input from the user.
Start-word recognition may include a process of determining the similarity between the audio signal of the user's voice and the start word; for example, when the degree of similarity between the pattern of the audio signal and the start word is equal to or greater than a predetermined criterion, the input voice may be identified as including the start word.
In the start-word recognition process described above, misrecognition may occur due to the influence of the surrounding environment of the electronic device, such as noise, so attempts are being made to improve the accuracy of start-word recognition.
An object of the present invention is to provide an electronic device capable of receiving and processing a user's voice, and a control method therefor, in which the threshold for identifying the utterance characteristic of a sound signal is reset depending on whether the device is in a noisy environment, in correspondence with the user's utterance characteristic, so that the accuracy of start-word recognition is improved.
An electronic device according to an embodiment of the present invention includes: a sound receiver; and a processor that, when a value indicating a noise characteristic of a sound signal received through the sound receiver is greater than a first threshold and a value indicating an utterance characteristic of the sound signal is greater than a second threshold, performs a recognition operation regarding a user utterance based on the sound signal and adjusts the second threshold upward.
The utterance characteristic may include a signal-to-noise ratio of the sound signal.
The processor may calculate, for each frame of the sound signal, the magnitude ratio of noise to the sound signal, and determine the average of the calculated per-frame ratios as the value of the utterance characteristic.
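A minimal sketch of the per-frame computation described above is shown below. The frame length and the noise estimate (here, the quietest frame's energy as a stand-in noise floor) are assumptions for illustration; the disclosure does not fix a particular estimator.

```python
import numpy as np

def utterance_characteristic(signal, frame_len=256):
    """Average per-frame noise-to-signal magnitude ratio (illustrative).

    The noise magnitude per frame is approximated here by the energy of
    the quietest frame; this estimator is an assumption of the sketch.
    """
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    frame_energy = np.mean(frames ** 2, axis=1)   # per-frame signal power
    noise_energy = np.min(frame_energy)           # assumed noise floor
    ratios = noise_energy / np.maximum(frame_energy, 1e-12)
    return float(np.mean(ratios))                 # value of the characteristic
```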
The processor may identify whether the sound signal includes a predefined start word, and identify whether the noise characteristic of the sound signal identified as including the start word is greater than the first threshold.
The processor may identify whether the sound signal includes the start word based on the similarity between the waveform of the sound signal and a predefined start-word pattern.
The similarity threshold may be preset based on a learning algorithm using an acoustic model.
When the value of the utterance characteristic of a sound signal whose similarity is greater than a third threshold is equal to or less than the second threshold, the processor may perform the recognition operation regarding the user utterance based on a sound signal whose similarity satisfies a fourth threshold greater than the third threshold.
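The fallback from the SNR condition to the stricter fourth similarity threshold described above can be sketched as follows; the function name and the concrete threshold values are placeholders, not values from the disclosure.

```python
def recognition_allowed(similarity, snr_value,
                        threshold_3=0.5, threshold_4=0.8,
                        snr_threshold=4.0):
    """Decide whether recognition proceeds (illustrative thresholds).

    If the similarity clears the third threshold but the utterance
    characteristic (SNR) does not clear the SNR threshold, recognition
    is allowed only when the similarity also clears the larger fourth
    threshold.
    """
    if similarity <= threshold_3:
        return False                 # start word not detected at all
    if snr_value > snr_threshold:
        return True                  # normal path: SNR condition met
    return similarity > threshold_4  # fallback: stricter pattern match
```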
The processor may identify whether the value of the noise characteristic of the sound signal received in a section of a predefined time length before the section including the start word is greater than the first threshold.
The processor may compare the power value of the sound signal received in the section of the predefined time length with the first threshold.
When the value of the noise characteristic is equal to or less than the first threshold, the processor may adjust the second threshold downward.
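The pre-start-word noise check and the downward adjustment described in the preceding paragraphs can be sketched as below; the window length, the power measure, and the adjustment step are assumptions of the sketch, not the disclosure's definitions.

```python
import numpy as np

def noise_check(signal, start_idx, state, window=1600):
    """Compare the power of the window preceding the start word with the
    first threshold, and adjust the second (SNR) threshold accordingly.

    signal:    1-D array of samples; start_idx marks where the start word begins
    state:     dict with 'noise_threshold_1', 'snr_threshold', 'snr_threshold_init'
    window:    assumed pre-roll length (e.g. 100 ms at 16 kHz)
    """
    pre = signal[max(0, start_idx - window):start_idx]
    noise_power = float(np.mean(pre ** 2)) if len(pre) else 0.0

    if noise_power <= state["noise_threshold_1"]:
        # Quiet surroundings: relax the SNR threshold toward its initial value.
        state["snr_threshold"] = max(
            state["snr_threshold_init"], state["snr_threshold"] - 1.0)
        return False  # not a noisy environment
    return True       # noisy environment: second-stage conditions apply
```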
Meanwhile, a control method of an electronic device according to an embodiment of the present invention includes: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring an utterance characteristic from the sound signal; and, when a value indicating the noise characteristic is greater than a first threshold and a value indicating the utterance characteristic is greater than a second threshold, performing a recognition operation regarding a user utterance based on the sound signal and adjusting the second threshold upward.
The utterance characteristic may include a signal-to-noise ratio of the sound signal.
The method may further include calculating, for each frame of the sound signal, the magnitude ratio of noise to the sound signal, and determining the average of the calculated per-frame ratios as the value of the utterance characteristic.
The method may further include: identifying whether the sound signal includes a predefined start word; and identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold.
The identifying of whether the start word is included may include identifying whether the sound signal includes the start word based on the similarity between the waveform of the sound signal and a predefined start-word pattern.
The similarity threshold may be preset based on a learning algorithm using an acoustic model.
The method may further include, when the value of the utterance characteristic of a sound signal whose similarity is greater than a third threshold is equal to or less than the second threshold, performing the recognition operation regarding the user utterance based on a sound signal whose similarity satisfies a fourth threshold greater than the third threshold.
The method may further include identifying whether the value of the noise characteristic of the sound signal received in a section of a predefined time length before the section identified as including the start word is greater than the first threshold.
The method may further include, when the value of the noise characteristic is equal to or less than the first threshold, adjusting the second threshold downward.
Meanwhile, in a recording medium storing a computer program including computer-readable code for performing a control method of an electronic device according to an embodiment of the present invention, the control method includes: acquiring a noise characteristic from a sound signal received through a sound receiver; acquiring an utterance characteristic from the sound signal; and, when a value indicating the noise characteristic is greater than a first threshold and a value indicating the utterance characteristic is greater than a second threshold, performing a recognition operation regarding a user utterance based on the sound signal and adjusting the second threshold upward.
According to the electronic device and control method of the present invention as described above, by resetting the threshold for identifying the user's utterance characteristic with respect to the sound signal in a noisy environment, the user is induced to utter the start word loudly, and an effect of improving operation accuracy can be expected.
In addition, according to the electronic device and control method of the present invention, the occurrence of malfunctions in which the electronic device erroneously recognizes, as the start word, a sound signal containing ambient noise rather than the user's actual utterance in a noisy environment is reduced, so that the accuracy of start-word recognition is improved.
FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of an electronic device according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining identification of the noise characteristic of an electronic device according to an embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals or symbols refer to components performing substantially the same function, and the size of each component in the drawings may be exaggerated for clarity and convenience of description. However, the technical spirit of the present invention and its core configuration and operation are not limited to the configurations or operations described in the following embodiments. In describing the present invention, when it is determined that a detailed description of a known technology or configuration related to the present invention may unnecessarily obscure the gist of the present invention, the detailed description is omitted.
In the embodiments of the present invention, terms including ordinal numbers such as first and second are used only for the purpose of distinguishing one component from another, and singular expressions include plural expressions unless the context clearly indicates otherwise. In addition, in the embodiments of the present invention, terms such as 'comprise', 'include', and 'have' should be understood as not precluding in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Also, in the embodiments of the present invention, a 'module' or 'unit' performs at least one function or operation, may be implemented as hardware or software or a combination of hardware and software, and may be integrated into at least one module. Further, in the embodiments of the present invention, at least one of a plurality of elements refers not only to all of the plurality of elements, but also to each one of them, or any combination thereof, excluding the rest.
FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
In one embodiment, as shown in FIG. 1, the voice recognition system includes an electronic device 10 capable of receiving a sound signal, that is, a voice uttered by a user as sound, and a server 20 capable of communicating with the electronic device 10 through a network.
The electronic device 10 may receive a voice uttered by a user (hereinafter also referred to as a user voice), process a sound signal corresponding to the voice, and perform a corresponding operation.
In one embodiment, as an operation corresponding to the received voice, the electronic device 10 may provide audio content to the user by outputting a sound corresponding to the processing result of the user voice through the output unit (110 of FIG. 2). The electronic device 10 may be provided with at least one loudspeaker as the output unit 110 capable of outputting sound, and in the present invention the number, shape, and installation location of the speakers provided in the electronic device 10 are not limited.
In one embodiment, the electronic device 10 may be provided with a sound receiver (120 of FIG. 3) capable of receiving a sound signal as a user voice. The sound receiver 120 may be implemented as at least one microphone, and the number, shape, and installation location of the microphones provided in the electronic device 10 are not limited.
The implementation form of the electronic device 10 is not limited; for example, as shown in FIG. 1, it may be implemented as any of various devices capable of receiving a sound signal, such as an artificial intelligence speaker (hereinafter also referred to as an AI speaker or smart speaker) 10a, a display device 10b including a television such as a smart TV, or a mobile device 10c such as a smartphone or tablet.
The electronic device 10 implemented as the AI speaker 10a may receive a voice from a user and perform various functions, such as listening to music and searching for information, through voice recognition of the received voice. By utilizing the voice recognition function and the cloud, the AI speaker is not a device that simply outputs sound, but may be implemented as a device with a built-in virtual assistant/voice assistant capable of interaction with the user, providing the user with various services. In this case, an application for the AI speaker function may be installed and run on the electronic device 10.
The electronic device 10 implemented as the display device 10b processes an image signal provided from an external signal source, that is, an image source, according to a preset process and displays it as an image.
In one embodiment, the display device 10b includes a television (TV) capable of processing a broadcast signal based on at least one of a broadcast signal, broadcast information, or broadcast data provided from the transmission equipment of a broadcasting station, and displaying it as an image.
Since the type of image source providing content is not limited in the present invention, the display device 10b may receive an image signal from, for example, a set-top box; an optical disc playback device such as a Blu-ray or DVD (digital versatile disc) player; a computer (PC) including a desktop or laptop; a console game machine; or a mobile device including a smart pad such as a smartphone or tablet.
When the display device 10b is a television, the display device 10b may wirelessly receive a radio frequency (RF) signal, that is, a broadcast signal, transmitted from a broadcasting station; for this purpose, an antenna for receiving the broadcast signal and a tuner for tuning the broadcast signal for each channel may be provided.
In the display device 10b, the broadcast signal may be received through terrestrial waves, cable, satellite, or the like, and the signal source is not limited to an external device or a broadcasting station. That is, any device or station capable of transmitting and receiving data may be included in the image source of the present invention.
The standard of the signal received by the display device 10b may be configured in various ways corresponding to the implementation form of the device; for example, corresponding to the implementation form of the interface unit (140 of FIG. 2) described later, the display device 10b may receive, by wire as image content, signals conforming to standards such as HDMI (High Definition Multimedia Interface), HDMI-CEC (Consumer Electronics Control), display port (DP), DVI (Digital Visual Interface), composite video, component video, super video, Thunderbolt, RGB cable, SCART (Syndicat des Constructeurs d'Appareils Radiorecepteurs et Televiseurs), and USB.
The display device 10b may also receive image content through wired or wireless network communication from a server or the like provided for content provision, and the type of communication is not limited. For example, corresponding to the implementation form of the interface unit 140 described later, the display device 10b may receive, as image content through wireless network communication, signals conforming to standards such as Wi-Fi, Wi-Fi Direct, Bluetooth, Bluetooth low energy, Zigbee, UWB (Ultra-Wideband), and NFC (Near Field Communication). As another example, the display device 10b may receive a content signal through wired network communication such as Ethernet.
In one embodiment, the display device 10b may serve as an AP that allows various peripheral devices, such as a smartphone, to perform wireless communication.
The display device 10b may receive content provided in the form of a file according to real-time streaming through the wired or wireless network described above.
In addition, the display device 10b may process signals so as to display on the screen moving images, still images, applications, an on-screen display (OSD), and a user interface (UI, hereinafter also referred to as a graphic user interface (GUI)) for controlling various operations, based on signals/data stored in internal/external storage media.
In one embodiment, the display device 10b can operate as a smart TV or an IP TV (Internet Protocol TV). A smart TV is a television that can receive and display broadcast signals in real time and, having a web browsing function, allows various content to be searched and consumed through the Internet simultaneously with the display of real-time broadcast signals, providing a convenient user environment for this purpose. In addition, since the smart TV includes an open software platform, it can provide interactive services to the user. Accordingly, the smart TV can provide the user with various content, for example applications providing predetermined services, through the open software platform. These applications can provide various types of services and include, for example, applications providing services such as SNS, finance, news, weather, maps, music, movies, games, and e-books.
In one embodiment, an application for providing a voice recognition function may be installed on the display device 10b.
When the electronic device 10 is the display device 10b or the mobile device 10c, the electronic device 10 may be provided with a display capable of displaying an image. The implementation method of the display is not limited and may include, for example, various display types such as liquid crystal, plasma, light-emitting diode, organic light-emitting diode, surface-conduction electron-emitter, carbon nano-tube, and nano-crystal.
The electronic device 10 may communicate with various external devices, including the server 20, through the interface unit 140.
In the present invention, since the communication method between the electronic device 10 and an external device is not limited, the electronic device 10 is implemented to communicate with external devices through various types of wired or wireless connection (for example, Bluetooth, Wi-Fi, or Wi-Fi Direct).
The server 20 is provided to perform wired or wireless communication with the electronic device 10. The server 20 may be implemented, for example, in a cloud type, and may store and manage user accounts of the electronic device 10 and/or additional devices associated with the electronic device 10 (for example, a smartphone on which a corresponding application is installed to interwork with an AI speaker).
The implementation form of the server 20 is not limited; for example, it may be implemented as an STT (Speech to Text) server that converts a voice-related sound signal into text, or as a main server for voice recognition that also performs the function of the STT server. In addition, a plurality of servers 20 may be provided, such as an STT server and a main server, so that the electronic device 10 may communicate with the plurality of servers.
In one embodiment, the server 20 may be provided with a database (DB) in which data, that is, information, for recognizing a voice uttered by a user is stored. The database may include, for example, a plurality of acoustic models predetermined by modeling the signal characteristics of voice. In addition, the database may further include a language model predetermined by modeling linguistic order relations, such as of words or syllables corresponding to the recognition-target vocabulary. The acoustic model and/or the language model may be constructed by performing learning in advance.
The electronic device 10 accesses the server 20 through a wired or wireless network and accesses its database, thereby identifying and processing the received user voice and outputting the processing result through sound or an image.
도 2는 본 발명 일 실시예에 따른 전자장치의 구성을 도시한 블록도이다.FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
도 2에 도시된 바와 같이, 본 발명 일 실시예의 전자장치(10)는 출력부(110), 소리수신부(120), 신호처리부(161), 인터페이스부(140), 저장부(150) 및 프로세서(160)를 포함한다.As shown in FIG. 2, the electronic device 10 according to an embodiment of the present invention includes an output unit 110, a sound receiving unit 120, a signal processing unit 161, an interface unit 140, a storage unit 150, and a processor 160.
다만, 도 2에 도시된 본 발명의 일 실시예에 의한 전자장치(10)의 구성은 하나의 예시일 뿐이며, 다른 실시예에 의한 전자장치는 도 2에 도시된 구성 외에 다른 구성으로 구현될 수 있다. 즉, 본 발명의 전자장치는 도 2에 도시된 구성 외 다른 구성이 추가되거나, 혹은 도 2에 도시된 구성 중 적어도 하나가 배제된 형태로 구현될 수도 있다.However, the configuration of the electronic device 10 according to an embodiment of the present invention shown in FIG. 2 is only an example, and an electronic device according to another embodiment may be implemented with a configuration different from that shown in FIG. 2. That is, the electronic device of the present invention may be implemented in a form in which a configuration other than those shown in FIG. 2 is added, or in which at least one of the configurations shown in FIG. 2 is excluded.
출력부(110)는 음향 즉, 사운드를 출력한다. 출력부(110)는 예를 들어, 가청주파수인 20Hz 내지 20KHz 대역의 사운드를 출력 가능한 적어도 하나의 스피커를 포함할 수 있다. 출력부(110)는 복수의 채널의 오디오신호/소리신호에 대응하는 사운드를 출력할 수 있다.The output unit 110 outputs a sound, that is, a sound. The output unit 110 may include, for example, at least one speaker capable of outputting sound in an audible frequency band of 20 Hz to 20 KHz. The output unit 110 may output a sound corresponding to an audio signal/sound signal of a plurality of channels.
일 실시예에서 출력부(110)는 소리수신부(120)를 통해 수신되는 사용자음성으로서 소리 신호의 처리에 따른 사운드를 출력할 수 있다.In an embodiment, the output unit 110 may output a sound according to the processing of the sound signal as a user voice received through the sound receiving unit 120 .
소리수신부(120)는 사용자로부터 발화된 음성 즉, 음파를 수신할 수 있다.The sound receiver 120 may receive a voice uttered by a user, that is, a sound wave.
소리수신부(120)를 통해 입력된 음파는 신호변환부에 의해 전기적인 신호로 변환된다. 일 실시예에서 신호변환부는 아날로그 음파를 디지털 신호로 변환하는 AD 변환부를 포함할 수 있다. 또한, 일 실시예에서 신호변환부는 후술하는 신호처리부(161)에 포함될 수 있다.The sound wave input through the sound receiving unit 120 is converted into an electrical signal by the signal converting unit. In an embodiment, the signal converter may include an AD converter that converts analog sound waves into digital signals. In addition, in an embodiment, the signal conversion unit may be included in a signal processing unit 161 to be described later.
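For illustration only (a hypothetical sketch, not part of the original disclosure), the A/D conversion performed by the signal converter described above can be modeled as sampling and quantizing an analog waveform into signed integer codes; the bit depth and sampling rate below are assumed values:

```python
import math

def quantize(samples, bits=16):
    # Map normalized samples in [-1.0, 1.0] to signed integer codes,
    # a simplified model of the AD converter in the signal converter.
    max_code = 2 ** (bits - 1) - 1
    return [int(round(max(-1.0, min(1.0, s)) * max_code)) for s in samples]

# A 1 kHz tone sampled at 16 kHz (rates assumed for illustration)
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(160)]
codes = quantize(tone)
```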
본 발명 일 실시예에서 소리수신부(120)는 전자장치(10)에 자체적으로 마련되도록 구현된다.In an embodiment of the present invention, the sound receiver 120 is implemented to be provided in the electronic device 10 by itself.
다만, 다른 실시예에서 소리수신부(120)는 전자장치(10)에 포함되는 구성이 아닌 별도의 장치에 마련된 형태로서 구현될 수 있다.However, in another embodiment, the sound receiving unit 120 may be implemented in a form provided in a separate device rather than as a component included in the electronic device 10.
예를 들면, 전자장치(10)가 텔레비전과 같은 디스플레이장치인 경우, 사용자조작이 가능한 입력장치로서 마련되는 리모컨(remote control)에 설치된 마이크 즉, 소리수신부를 통해 사용자음성이 수신되고, 그에 대응하는 소리신호가 리모컨으로부터 전자장치(10)로 전송될 수 있다. 여기서, 리모컨의 마이크를 통해 수신된 아날로그 음파는 디지털 신호로 변환되어 전자장치(10)로 전송될 수 있다.For example, when the electronic device 10 is a display device such as a television, a user voice may be received through a microphone, that is, a sound receiver, installed in a remote control provided as a user-operable input device, and a sound signal corresponding to the voice may be transmitted from the remote control to the electronic device 10. Here, the analog sound wave received through the microphone of the remote control may be converted into a digital signal and transmitted to the electronic device 10.
일 실시예에서, 입력장치는 리모컨 어플리케이션이 설치된 스마트폰과 같은 단말장치를 포함한다.In an embodiment, the input device includes a terminal device, such as a smartphone, on which a remote control application is installed.
인터페이스부(140)는 전자장치(10)가 서버(20), 단말장치 등을 포함한 다양한 외부장치와 신호를 송신 또는 수신하도록 한다.The interface unit 140 allows the electronic device 10 to transmit or receive signals with various external devices including the server 20 and the terminal device.
인터페이스부(140)는 유선 인터페이스부(141)를 포함할 수 있다. 유선 인터페이스부(141)는 HDMI, HDMI-CEC, USB, 컴포넌트(Component), 디스플레이 포트(DP), DVI, 썬더볼트, RGB 케이블 등의 규격에 따른 신호/데이터를 송/수신하는 연결부를 포함할 수 있다. 여기서, 유선 인터페이스부(141)는 이들 각각의 규격에 대응하는 적어도 하나 이상의 커넥터, 단자 또는 포트를 포함할 수 있다.The interface unit 140 may include a wired interface unit 141. The wired interface unit 141 may include a connection unit for transmitting/receiving signals/data according to standards such as HDMI, HDMI-CEC, USB, Component, DisplayPort (DP), DVI, Thunderbolt, and RGB cable. Here, the wired interface unit 141 may include at least one connector, terminal, or port corresponding to each of these standards.
유선 인터페이스부(141)는 영상소스 등으로부터 신호를 입력받는 입력 포트를 포함하는 형태로 구현되며, 경우에 따라 출력 포트를 더 포함하여 양방향으로 신호를 송수신 가능하게 마련될 수 있다.The wired interface unit 141 may be implemented in a form including an input port for receiving a signal from an image source, etc., and may further include an output port in some cases to transmit/receive signals in both directions.
유선 인터페이스부(141)는 지상파/위성방송 등 방송규격에 따른 방송신호를 수신할 수 있는 안테나가 연결되거나, 케이블 방송 규격에 따른 방송신호를 수신할 수 있는 케이블이 연결될 수 있도록, HDMI 포트, DisplayPort, DVI 포트, 썬더볼트, 컴포지트(composite) 비디오, 컴포넌트(component) 비디오, 슈퍼 비디오(super video), SCART 등과 같이, 비디오 및/또는 오디오 전송규격에 따른 커넥터 또는 포트 등을 포함할 수 있다. 다른 예로서, 전자장치(10)는 방송신호를 수신할 수 있는 안테나를 내장할 수도 있다.The wired interface unit 141 may include a connector or port conforming to a video and/or audio transmission standard, such as an HDMI port, DisplayPort, DVI port, Thunderbolt, composite video, component video, super video, or SCART, to which an antenna capable of receiving a broadcast signal according to a broadcasting standard such as terrestrial/satellite broadcasting, or a cable capable of receiving a broadcast signal according to a cable broadcasting standard, can be connected. As another example, the electronic device 10 may have a built-in antenna capable of receiving a broadcast signal.
유선 인터페이스부(141)는 USB 포트 등과 같은 범용 데이터 전송규격에 따른 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 광 전송규격에 따라 광케이블이 연결될 수 있는 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 외부 마이크 또는 마이크를 구비한 외부 오디오기기가 연결되며, 오디오기기로부터 오디오 신호를 수신 또는 입력할 수 있는 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 헤드셋, 이어폰, 외부 스피커 등과 같은 오디오기기가 연결되며, 오디오기기로 오디오 신호를 전송 또는 출력할 수 있는 커넥터 또는 포트 등을 포함할 수 있다. 유선 인터페이스부(141)는 이더넷(Ethernet) 등과 같은 네트워크 전송규격에 따른 커넥터 또는 포트를 포함할 수 있다. 예컨대, 유선 인터페이스부(141)는 라우터 또는 게이트웨이에 유선 접속된 랜카드 등으로 구현될 수 있다.The wired interface unit 141 may include a connector or port according to a universal data transmission standard, such as a USB port. The wired interface unit 141 may include a connector or port to which an optical cable can be connected according to an optical transmission standard. The wired interface unit 141 may include a connector or port to which an external microphone or an external audio device having a microphone is connected, and which can receive or input an audio signal from the audio device. The wired interface unit 141 may include a connector or port to which an audio device such as a headset, earphone, or external speaker is connected, and which can transmit or output an audio signal to the audio device. The wired interface unit 141 may include a connector or port according to a network transmission standard such as Ethernet. For example, the wired interface unit 141 may be implemented as a LAN card connected to a router or gateway by wire.
유선 인터페이스부(141)는 상기 커넥터 또는 포트를 통해 셋탑박스, 광학미디어 재생장치와 같은 외부기기, 또는 외부 디스플레이장치나, 스피커, 서버 등과 1:1 또는 1:N(N은 자연수) 방식으로 유선 접속됨으로써, 해당 외부기기로부터 비디오/오디오 신호를 수신하거나 또는 해당 외부기기에 비디오/오디오 신호를 송신한다. 유선 인터페이스부(141)는, 비디오/오디오 신호를 각각 별개로 전송하는 커넥터 또는 포트를 포함할 수도 있다.The wired interface unit 141 may be connected by wire, through the connector or port, to an external device such as a set-top box or an optical media playback device, or to an external display device, speaker, server, or the like, in a 1:1 or 1:N (N is a natural number) manner, thereby receiving video/audio signals from the external device or transmitting video/audio signals to it. The wired interface unit 141 may include connectors or ports for transmitting video and audio signals separately.
일 실시예에서 유선 인터페이스부(141)는 전자장치(10)에 내장되나, 동글(dongle) 또는 모듈(module) 형태로 구현되어 전자장치(10)의 커넥터에 착탈될 수도 있다.In an embodiment, the wired interface unit 141 is embedded in the electronic device 10 , but may be implemented in the form of a dongle or a module to be detachably attached to the connector of the electronic device 10 .
인터페이스부(140)는 무선 인터페이스부(142)를 포함할 수 있다. 무선 인터페이스부(142)는 전자장치(10)의 구현 형태에 대응하여 다양한 방식으로 구현될 수 있다. 예를 들면, 무선 인터페이스부(142)는 통신방식으로 RF(radio frequency), 지그비(Zigbee), 블루투스(bluetooth), 와이파이(Wi-Fi), UWB(Ultra WideBand) 및 NFC(Near Field Communication) 등 무선통신을 사용할 수 있다.The interface unit 140 may include a wireless interface unit 142. The wireless interface unit 142 may be implemented in various ways corresponding to the implementation form of the electronic device 10. For example, the wireless interface unit 142 may use wireless communication methods such as RF (radio frequency), Zigbee, Bluetooth, Wi-Fi, UWB (Ultra WideBand), and NFC (Near Field Communication).
무선 인터페이스부(142)는 다양한 종류의 통신 프로토콜에 대응하는 무선 통신모듈(S/W module, chip 등)을 포함하는 통신회로(communication circuitry)로서 구현될 수 있다.The wireless interface unit 142 may be implemented as a communication circuitry including a wireless communication module (S/W module, chip, etc.) corresponding to various types of communication protocols.
일 실시예에서 무선 인터페이스부(142)는 무선랜유닛을 포함한다. 무선랜유닛은 프로세서(160)의 제어에 따라 억세스 포인트(access point, AP)를 통해 무선으로 외부장치와 연결될 수 있다. 무선랜유닛은 와이파이 모듈을 포함한다.In an embodiment, the wireless interface unit 142 includes a wireless LAN unit. The wireless LAN unit may be wirelessly connected to an external device through an access point (AP) under the control of the processor 160 . The wireless LAN unit includes a WiFi module.
일 실시예에서 무선 인터페이스부(142)는 억세스 포인트 없이 무선으로 전자장치(10)와 외부장치 사이에 1 대 1 다이렉트 통신을 지원하는 무선통신모듈을 포함한다. 무선통신모듈은 와이파이 다이렉트, 블루투스, 블루투스 저에너지 등의 통신방식을 지원하도록 구현될 수 있다. 전자장치(10)가 외부장치와 다이렉트로 통신을 수행하는 경우, 저장부(150)에는 통신 대상 기기인 외부장치에 대한 식별정보(예를 들어, MAC address 또는 IP address)가 저장될 수 있다.In an embodiment, the wireless interface unit 142 includes a wireless communication module that supports one-to-one direct communication between the electronic device 10 and an external device wirelessly without an access point. The wireless communication module may be implemented to support communication methods such as Wi-Fi Direct, Bluetooth, and Bluetooth low energy. When the electronic device 10 directly communicates with the external device, the storage unit 150 may store identification information (eg, MAC address or IP address) on the external device, which is a communication target device.
본 발명 일 실시예에 따른 전자장치(10)에서, 무선 인터페이스부(142)는 성능에 따라 무선랜유닛과 무선통신모듈 중 적어도 하나에 의해 외부장치와 무선 통신을 수행하도록 마련된다.In the electronic device 10 according to an embodiment of the present invention, the wireless interface unit 142 is provided to perform wireless communication with an external device by at least one of a wireless LAN unit and a wireless communication module according to performance.
다른 실시예에서 무선 인터페이스부(142)는 LTE와 같은 이동통신, 자기장을 포함하는 EM 통신, 가시광통신 등의 다양한 통신방식에 의한 통신모듈을 더 포함할 수 있다.In another embodiment, the wireless interface unit 142 may further include communication modules using various communication methods, such as mobile communication such as LTE, EM communication including a magnetic field, and visible light communication.
무선 인터페이스부(142)는 네트워크 상의 서버와 무선 통신함으로써, 서버와의 사이에 데이터 패킷을 송수신할 수 있다.The wireless interface unit 142 may transmit and receive data packets to and from the server by wirelessly communicating with the server on the network.
무선 인터페이스부(142)는 적외선 통신규격에 따라 IR(Infrared) 신호를 송신 및/또는 수신할 수 있는 IR송신부 및/또는 IR수신부를 포함할 수 있다. 무선 인터페이스부(142)는 IR송신부 및/또는 IR수신부를 통해 리모컨 또는 다른 외부기기로부터 리모컨신호를 수신 또는 입력하거나, 다른 외부기기로 리모컨신호를 전송 또는 출력할 수 있다. 다른 예로서, 전자장치(10)는 와이파이(Wi-Fi), 블루투스(bluetooth) 등 다른 방식의 무선 인터페이스부(142)를 통해 리모컨 또는 다른 외부기기와 리모컨신호를 송수신할 수 있다.The wireless interface unit 142 may include an IR transmitter and/or an IR receiver capable of transmitting and/or receiving an IR (Infrared) signal according to an infrared communication standard. The wireless interface unit 142 may receive or input a remote control signal from the remote control or other external device through the IR transmitter and/or the IR receiver, or may transmit or output a remote control signal to another external device. As another example, the electronic device 10 may transmit/receive a remote control signal to and from the remote control or other external device through the wireless interface unit 142 of another method such as Wi-Fi or Bluetooth.
전자장치(10)는 인터페이스부(140)를 통해 수신하는 비디오/오디오신호가 방송신호인 경우, 수신된 방송신호를 채널 별로 튜닝하는 튜너(tuner)를 더 포함할 수 있다.When the video/audio signal received through the interface unit 140 is a broadcast signal, the electronic device 10 may further include a tuner for tuning the received broadcast signal for each channel.
일 실시예에서 무선 인터페이스부(142)는 소리수신부(120)를 통해 수신된 사용자음성의 정보로서 소정 데이터를 외부장치 즉, 서버(20)로 전송할 수 있다. 여기서, 전송되는 데이터의 형태/종류는 한정되지 않으며, 예를 들면, 사용자로부터 발화된 음성에 대응하는 오디오신호나, 오디오신호로부터 추출된 음성특징 등을 포함할 수 있다.In an embodiment, the wireless interface unit 142 may transmit predetermined data as information of the user's voice received through the sound receiving unit 120 to an external device, that is, the server 20 . Here, the form/type of the transmitted data is not limited, and for example, an audio signal corresponding to a voice uttered by a user or a voice characteristic extracted from the audio signal may be included.
또한, 무선 인터페이스부(142)는 서버(20)로부터 해당 사용자음성의 처리 결과의 데이터를 수신할 수 있다. 전자장치(10)는 수신된 데이터에 기초하여, 음성 처리결과에 대응하는 사운드를 출력부(110)를 통해 출력하게 된다.Also, the wireless interface unit 142 may receive data resulting from the processing of the user voice from the server 20. Based on the received data, the electronic device 10 outputs a sound corresponding to the voice processing result through the output unit 110.
다만, 상기한 실시예는 예시로서, 사용자음성을 서버(20)로 전송하지 않고, 전자장치(10) 내에서 자체적으로 처리할 수도 있다. 즉, 다른 실시예에서 전자장치(10)가 STT 서버의 역할을 수행하도록 구현 가능하다.However, the above-described embodiment is an example, and the user's voice may not be transmitted to the server 20 , but may be processed by itself within the electronic device 10 . That is, in another embodiment, the electronic device 10 can be implemented to perform the role of the STT server.
전자장치(10)는 무선 인터페이스부(142)를 통해 리모컨과 같은 입력장치와 통신을 수행하여, 입력장치로부터 사용자음성에 대응하는 소리 신호를 수신할 수 있다.The electronic device 10 may communicate with an input device such as a remote control through the wireless interface unit 142 to receive a sound signal corresponding to the user's voice from the input device.
일 실시예의 전자장치(10)에서, 서버(20)와 통신하는 통신모듈과 리모컨과 통신하는 통신모듈은 서로 다를 수 있다. 예를 들어, 전자장치(10)는, 서버(20)와 이더넷 모뎀 또는 와이파이 모듈을 통해 통신을 수행하고, 리모컨과 블루투스 모듈을 통해 통신을 수행할 수 있다.In the electronic device 10 of an embodiment, the communication module communicating with the server 20 and the communication module communicating with the remote control may be different from each other. For example, the electronic device 10 may communicate with the server 20 through an Ethernet modem or a Wi-Fi module, and may communicate with the remote control through a Bluetooth module.
다른 실시예의 전자장치(10)에서, 서버(20)와 통신하는 통신모듈과 리모컨과 통신하는 통신모듈은 같을 수 있다. 예를 들어, 전자장치(10)는 블루투스 모듈을 통해 서버(20) 및 리모컨과 통신을 수행할 수 있다.In the electronic device 10 of another embodiment, the communication module communicating with the server 20 and the communication module communicating with the remote control may be the same. For example, the electronic device 10 may communicate with the server 20 and the remote controller through the Bluetooth module.
저장부(150)는 전자장치(10)의 다양한 데이터를 저장하도록 구성된다. 저장부(150)는 전자장치(10)에 공급되는 전원이 차단되더라도 데이터들이 남아있어야 하며, 변동사항을 반영할 수 있도록 쓰기 가능한 비휘발성 메모리(writable ROM)로 구비될 수 있다. 즉, 저장부(150)는 플래쉬 메모리(flash memory), EPROM 또는 EEPROM 중 어느 하나로 구비될 수 있다.The storage unit 150 is configured to store various data of the electronic device 10 . The storage unit 150 should retain data even when power supplied to the electronic device 10 is cut off, and may be provided as a writable nonvolatile memory (writable ROM) to reflect changes. That is, the storage unit 150 may be provided with any one of a flash memory, an EPROM, or an EEPROM.
저장부(150)는 전자장치(10)의 읽기 또는 쓰기 속도가 비휘발성 메모리에 비해 빠른 DRAM 또는 SRAM과 같은 휘발성 메모리(volatile memory)를 더 구비할 수 있다.The storage unit 150 may further include a volatile memory such as DRAM or SRAM, in which the read or write speed of the electronic device 10 is faster than that of the nonvolatile memory.
저장부(150)에 저장되는 데이터는, 예를 들면 전자장치(10)의 구동을 위한 운영체제를 비롯하여, 이 운영체제 상에서 실행 가능한 다양한 소프트웨어, 프로그램, 어플리케이션, 부가데이터 등을 포함한다.The data stored in the storage 150 includes, for example, an operating system for driving the electronic device 10 , and various software, programs, applications, and additional data executable on the operating system.
본 발명 일 실시예에 따른 전자장치(10)에서 저장부(150)에 저장 및 설치되는 어플리케이션은, 소리수신부(120)를 통해 수신되는 사용자음성을 인식하고, 그에 따른 동작을 수행하기 위한 AI 스피커 어플리케이션을 포함할 수 있다.Applications stored and installed in the storage unit 150 of the electronic device 10 according to an embodiment of the present invention may include an AI speaker application for recognizing a user voice received through the sound receiver 120 and performing an operation accordingly.
일 실시예에서, AI 스피커 어플리케이션은, 소리수신부(120)를 통해 미리 정해진 키워드로서 시작어 즉, 트리거 워드(trigger word)의 입력, 전자장치(10)의 특정 버튼에 대한 사용자 조작 등이 식별되면 실행 또는 활성화됨으로써, 사용자로부터 발화된 음성에 대한 음성인식 기능을 수행할 수 있다. 여기서, 어플리케이션의 활성화는 어플리케이션의 실행 상태가 백그라운드 모드(background mode)에서 포그라운드 모드(foreground mode)로 전환하는 것을 포함할 수 있다.In an embodiment, the AI speaker application may be executed or activated when an input of a start word, that is, a trigger word, as a predetermined keyword is identified through the sound receiver 120, or when a user operation on a specific button of the electronic device 10 is identified, thereby performing a voice recognition function on the voice uttered by the user. Here, activation of the application may include switching the execution state of the application from a background mode to a foreground mode.
일 실시예의 전자장치(10)에서, 저장부(150)는, 도 2에 도시된 바와 같이, 소리수신부(120)를 통해 수신될 수 있는 사용자음성을 인식하기 위한 데이터 즉, 정보가 저장된 데이터베이스(151)를 포함할 수 있다.In the electronic device 10 of an embodiment, the storage unit 150 may include, as shown in FIG. 2, a database 151 in which data, that is, information, for recognizing a user voice that can be received through the sound receiver 120 is stored.
데이터베이스(151)는, 예를 들면, 음성의 신호적인 특성을 모델링하여 미리 결정된 복수의 음향모델을 포함할 수 있다. 또한, 데이터베이스(151)는 인식대상 어휘에 해당하는 단어나 음절 등의 언어적인 순서 관계를 모델링하여 미리 결정된 언어모델을 더 포함할 수 있다.The database 151 may include, for example, a plurality of acoustic models determined in advance by modeling signal characteristics of speech. In addition, the database 151 may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
다른 실시예에서, 사용자음성을 인식하기 위한 정보가 저장된 데이터베이스는, 전술한 바와 같이 무선 인터페이스부(142)를 통하여 유선 또는 무선 네트워크에 의해 접속 가능한 외부장치의 일례인 서버(20)에 마련될 수 있다. 서버(20)는, 예를 들면 클라우드 타입으로 구현될 수 있다.In another embodiment, the database storing information for recognizing a user voice may be provided in the server 20, which is an example of an external device accessible by a wired or wireless network through the wireless interface unit 142 as described above. The server 20 may be implemented, for example, as a cloud-type server.
프로세서(160)는 전자장치(10)의 제반 구성들이 동작하기 위한 제어를 수행한다.The processor 160 controls all components of the electronic device 10 to operate.
프로세서(160)는 이러한 제어 동작을 수행할 수 있도록 하는 제어프로그램에 포함된 인스트럭션을 실행한다. 프로세서(160)는 제어프로그램이 설치된 비휘발성의 메모리로부터 제어프로그램의 적어도 일부를 휘발성의 메모리로 로드하고, 로드된 제어프로그램을 실행하는 적어도 하나의 범용 프로세서를 포함하며, 예를 들면 CPU(Central Processing Unit) 또는 응용 프로세서(application processor, AP)로 구현될 수 있다.The processor 160 executes instructions included in a control program to perform such control operations. The processor 160 includes at least one general-purpose processor that loads at least a part of the control program from a nonvolatile memory, in which the control program is installed, into a volatile memory and executes the loaded control program, and may be implemented as, for example, a CPU (Central Processing Unit) or an application processor (AP).
프로세서(160)는 싱글 코어, 듀얼 코어, 트리플 코어, 쿼드 코어 및 그 배수의 코어를 포함할 수 있다. 프로세서(160)는 복수의 프로세서, 예를 들어, 주 프로세서(main processor) 및 슬립 모드(sleep mode, 예를 들어, 대기 전원만 공급되고 소리 신호를 수신하는 전자장치로서 동작하지 않는)에서 동작하는 부 프로세서(sub processor)를 포함할 수 있다. 또한, 프로세서, 롬 및 램은 내부 버스(bus)를 통해 상호 연결되며, 롬과 램은 저장부(150)에 포함된다.The processor 160 may include a single-core, dual-core, triple-core, quad-core, or higher-multiple-core processor. The processor 160 may include a plurality of processors, for example, a main processor and a sub-processor that operates in a sleep mode (in which, for example, only standby power is supplied and the device does not operate as an electronic device receiving sound signals). In addition, the processor, ROM, and RAM are interconnected through an internal bus, and the ROM and RAM are included in the storage unit 150.
본 발명에서 프로세서(160)를 구현하는 일례인 CPU 또는 응용 프로세서는 전자장치(10)에 내장되는 PCB 상에 실장되는 메인 SoC(Main SoC)에 포함되는 형태로서 구현 가능하다. In the present invention, a CPU or an application processor, which is an example of implementing the processor 160 , may be implemented as a form included in a main SoC mounted on a PCB embedded in the electronic device 10 .
제어프로그램은, BIOS, 디바이스드라이버, 운영체계, 펌웨어, 플랫폼 및 응용프로그램(어플리케이션) 중 적어도 하나의 형태로 구현되는 프로그램(들)을 포함할 수 있다. 일 실시예로서, 응용프로그램은, 전자장치(10)의 제조 시에 전자장치(10)에 미리 설치 또는 저장되거나, 혹은 추후 사용 시에 외부로부터 응용프로그램의 데이터를 수신하여 수신된 데이터에 기초하여 전자장치(10)에 설치될 수 있다. 응용 프로그램의 데이터는, 예를 들면, 어플리케이션 마켓과 같은 외부 서버로부터 전자장치(10)로 다운로드될 수도 있다. 이와 같은 응용프로그램, 외부 서버 등은, 본 발명의 컴퓨터프로그램제품의 일례이나, 이에 한정되는 것은 아니다.The control program may include program(s) implemented in the form of at least one of a BIOS, a device driver, an operating system, firmware, a platform, and an application program (application). As an embodiment, the application program is pre-installed or stored in the electronic device 10 when the electronic device 10 is manufactured, or receives data of the application program from the outside when used later, based on the received data. It may be installed in the electronic device 10 . Data of the application program may be downloaded to the electronic device 10 from, for example, an external server such as an application market. Such an application program, an external server, etc. is an example of the computer program product of the present invention, but is not limited thereto.
일 실시예에서 프로세서(160)는, 도 2에 도시된 바와 같이, 신호처리부(161)를 포함할 수 있다.In an embodiment, the processor 160 may include a signal processing unit 161 as shown in FIG. 2 .
신호처리부(161)는 오디오신호 즉, 소리신호를 처리한다. 신호처리부(161)에서 처리된 소리신호는, 출력부(110)를 통해 사운드로서 출력됨으로써 사용자에게 오디오 컨텐트가 제공될 수 있다.The signal processing unit 161 processes an audio signal, that is, a sound signal. The sound signal processed by the signal processing unit 161 may be output as sound through the output unit 110 to provide audio content to the user.
일 실시예에서 신호처리부(161)는 프로세서(160)의 소프트웨어 블록으로서, 프로세서(160)의 일 기능을 수행하는 형태로 구현될 수 있다. In an embodiment, the signal processing unit 161 is a software block of the processor 160 , and may be implemented in a form that performs one function of the processor 160 .
다른 실시예에서, 신호처리부(161)는, 프로세서(160)를 구현하는 예시인 CPU 또는 응용 프로세서(AP)와 구분된 별도의 구성, 예를 들면, 디지털 신호 프로세서(DSP)와 같은 마이크로 프로세서 또는 IC(integrated circuit)로서 구현되거나, 또는 하드웨어와 소프트웨어의 조합에 의해 구현될 수 있다. In another embodiment, the signal processing unit 161 is a separate configuration separated from the CPU or application processor (AP), which is an example of implementing the processor 160, for example, a microprocessor such as a digital signal processor (DSP) or It may be implemented as an integrated circuit (IC), or may be implemented by a combination of hardware and software.
일 실시예에서 프로세서(160)는, 도 2에 도시된 바와 같이, 사용자로부터 발화된 음성신호를 인식할 수 있는 음성인식모듈(162)을 포함할 수 있다.In an embodiment, the processor 160 may include a voice recognition module 162 capable of recognizing a voice signal uttered by a user, as shown in FIG. 2 .
도 3은 본 발명 일 실시예에 따른 전자장치의 음성인식모듈의 구성을 도시한 블록도이다.3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
일 실시예에서 음성인식모듈(162)은 사용자발화를 입력으로 수신하며, 미리 정해진 시작어(이하, 트리거 워드 또는 웨이크 업 워드(wake-up word, WUW)라고도 한다.)의 입력에 응답하여 음성인식을 위한 동작을 개시하도록 구현될 수 있다.In an embodiment, the voice recognition module 162 receives a user utterance as input, and may be implemented to initiate an operation for voice recognition in response to an input of a predetermined start word (hereinafter also referred to as a trigger word or a wake-up word (WUW)).
본 발명 일 실시예의 전자장치(10)에서, 음성인식모듈(162)은, 도 3에 도시된 바와 같이, 전처리부(301), 시작어 엔진(302), 임계값 결정부(303) 및 음성인식 엔진(304)을 포함할 수 있다.In the electronic device 10 according to an embodiment of the present invention, the voice recognition module 162 may include, as shown in FIG. 3, a preprocessor 301, a start word engine 302, a threshold value determiner 303, and a voice recognition engine 304.
전처리부(301)는 사용자발화에 따른 음성신호를 소리수신부(120)로부터 입력받고, 주변 소음 즉, 노이즈를 제거하는 전처리를 수행할 수 있다.The preprocessor 301 may receive a voice signal according to the user's utterance from the sound receiver 120 and perform preprocessing for removing ambient noise, that is, noise.
일 실시예에서 전처리에는 디지털 신호 변환, 필터링, 프레이밍 등의 과정들이 포함될 수 있으며, 상기의 과정들에 따라 음성신호에서 불필요한 주변 소음이 제거됨으로써 유의미한 음성신호가 추출될 수 있다.In an embodiment, the pre-processing may include processes such as digital signal conversion, filtering, framing, and the like, and a meaningful voice signal can be extracted by removing unnecessary ambient noise from the voice signal according to the above processes.
시작어 엔진(302)은 전처리가 수행된 음성신호로부터 추출된 특징(feature)을 미리 정해진 소정 패턴과 비교하는 패턴 매칭을 수행한다.The start word engine 302 performs pattern matching by comparing features extracted from the pre-processed speech signal with a predetermined pattern.
일 실시예에서, 시작어 엔진(302)은, 미리 학습을 수행하여 구성된 음향모델을 이용하여 패턴 매칭을 수행할 수 있다.In an embodiment, the start word engine 302 may perform pattern matching using an acoustic model configured by performing pre-learning.
구체적으로, 시작어 엔진(302)은 입력발화 즉, 사용자 발화에 따른 음성신호(소리신호)의 파형과, 음향모델의 시작어 패턴 간의 유사도에 기초하여, 입력발화가 시작어를 포함하는지 여부를 식별할 수 있다.Specifically, the start word engine 302 determines whether the input speech includes a start word based on the similarity between the input speech, that is, the waveform of the voice signal (sound signal) according to the user's speech, and the start word pattern of the acoustic model. can be identified.
시작어 엔진(302)은, 패턴 매칭에 의한 비교 결과, 입력발화의 점수(score) 즉, 발화 스코어가 미리 정해진 시작어 임계값(WUW Threshold) 보다 큰 경우, 입력발화가 시작어를 포함하는 것으로 식별할 수 있다.The start word engine 302 determines that the input utterance includes the start word when, as a result of the comparison by pattern matching, the score of the input utterance, that is, the utterance score is greater than a predetermined start word threshold (WUW Threshold). can be identified.
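The comparison described above can be sketched as follows; the cosine-similarity score and the threshold value 0.6 are illustrative assumptions, since the actual score and threshold come from the trained acoustic model:

```python
def similarity(features, template):
    # Cosine similarity between an utterance feature vector and the stored
    # start-word (WUW) pattern; stands in for the acoustic-model score.
    dot = sum(f * t for f, t in zip(features, template))
    norm = (sum(f * f for f in features) ** 0.5) * (sum(t * t for t in template) ** 0.5)
    return dot / norm if norm else 0.0

WUW_THRESHOLD = 0.6  # assumed value; the real threshold is set by learning

def contains_trigger_word(features, template, threshold=WUW_THRESHOLD):
    # The input utterance is treated as containing the start word when its
    # utterance score exceeds the WUW threshold.
    return similarity(features, template) > threshold
```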
여기서, 유사도의 임계값, 즉 시작어 임계값(WUW Threshold)는 음향 모델을 이용한 학습 알고리즘에 기반하여 미리 설정될 수 있다.Here, a threshold of similarity, that is, a starting word threshold (WUW Threshold) may be preset based on a learning algorithm using an acoustic model.
본 발명에서 시작어 임계값(WUW Threshold)은 전자장치(10)의 음성인식 기능을 활성화시키기 위한 조건으로서 정의된다. 다시 말해, 시작어 임계값(WUW Threshold)은, 후술하는 소리 신호의 소음 특성 및 소음 대비 발화 특성과의 비교에 각각 사용되는 소음 임계값 및 SNR 임계값과 구분된다.In the present invention, the start word threshold (WUW Threshold) is defined as a condition for activating the voice recognition function of the electronic device 10. In other words, the start word threshold (WUW Threshold) is distinguished from the noise threshold and the SNR threshold, which are used for comparison with the noise characteristic and the speech-to-noise characteristic of the sound signal, respectively, as described later.
본 발명 일 실시예에 따른 전자장치(10)는, 사용자발화가 소음환경에서 이루어진 경우, 서로 다른 값을 가지도록 설정된 2개의 시작어 임계값을 사용하도록 구현될 수 있다. 이러한 소음환경에서 2개의 시작어 임계값을 적용하는 구체적인 예에 관해서는 후술하는 도 4의 실시예에서 보다 상세하게 설명하기로 한다.The electronic device 10 according to an embodiment of the present invention may be implemented to use two start word thresholds set to have different values when the user utterance is made in a noisy environment. A specific example of applying the two start word thresholds in such a noisy environment will be described in more detail in the embodiment of FIG. 4 below.
임계값 결정부(303)는, 미리 정해진 소음 임계값을 이용하여 사용자발화가 소음환경에서 이루어졌는지를 식별한다. 여기서, 소음환경의 식별은, 사용자발화에 따른 소리신호의 소음특성으로서, 특정 구간에서의 전력 및 소음 임계값 간의 비교에 기초하여 이루어질 수 있다.The threshold value determining unit 303 identifies whether the user's utterance is made in a noisy environment using a predetermined noise threshold. Here, the identification of the noise environment, as a noise characteristic of a sound signal according to a user's utterance, may be made based on a comparison between power and a noise threshold in a specific section.
또한, 임계값 결정부(303)는, 미리 정해진 SNR 임계값을 이용하여 사용자발화에 따른 소리 신호의 발화특성으로서, 소음 대비 발화된 소리 신호의 비율이 특정 수준 이상인지 여부를 식별한다.In addition, the threshold value determining unit 303 identifies whether the ratio of the sound signal to the noise is equal to or greater than a specific level as the speech characteristic of the sound signal according to the user's speech using a predetermined SNR threshold.
일 실시예에서 임계값 결정부(303)는, 상기와 같은 소리 신호의 소음특성과 소음 임계값과의 비교 결과 또는 소리 신호의 발화특성과 SNR 임계값과의 비교 결과에 기초하여, SNR 임계값을 변경할 수 있다. SNR 임계값의 변경은, 예를 들어, 그 값을 상향 조정하거나, 또는 하향 조정하는 것을 포함할 수 있다. 이러한 SNR 임계값을 변경하는 구체적인 예에 관해서는 후술하는 도 4의 실시예에서 보다 상세하게 설명하기로 한다.In an embodiment, the threshold value determiner 303 may change the SNR threshold based on the result of comparing the noise characteristic of the sound signal with the noise threshold, or the result of comparing the speech characteristic of the sound signal with the SNR threshold, as described above. Changing the SNR threshold may include, for example, adjusting the value upward or downward. A specific example of changing the SNR threshold will be described in more detail in the embodiment of FIG. 4 below.
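A hypothetical sketch of the threshold determiner's two checks and the SNR-threshold adjustment; the power-based noise measure, the dB formulation, and the adjustment step are assumptions for illustration (the actual policy is described with the FIG. 4 embodiment):

```python
import math

def mean_power(frame):
    # Average signal power over one section of the sound signal.
    return sum(x * x for x in frame) / len(frame)

def is_noisy_environment(noise_frame, noise_threshold):
    # Noise characteristic: power in a (pre-utterance) section
    # compared against the noise threshold.
    return mean_power(noise_frame) > noise_threshold

def snr_db(speech_frame, noise_frame):
    # Speech characteristic: ratio of the uttered signal to noise, in dB.
    return 10 * math.log10(mean_power(speech_frame) / mean_power(noise_frame))

def adjust_snr_threshold(current, raise_it, step=1.0):
    # The determiner may move the SNR threshold up or down based on the
    # comparison results; direction and step size are assumed here.
    return current + step if raise_it else current - step
```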
음성인식 엔진(304)은 사용자발화에 관한 인식 동작을 수행할 수 있도록, 사용자발화로서 수신되는 음성신호 즉, 소리신호에 대한 음성인식기능을 포함하도록 구현될 수 있다.The voice recognition engine 304 may be implemented to include a voice recognition function for the voice signal, that is, the sound signal, received as a user utterance, so as to perform a recognition operation on the user utterance.
본 발명 일 실시예에 따른 전자장치(10)에서, 음성인식 엔진(304)은 사용자발화가 소음환경에서 이루어진 경우, 전술한 2개의 시작어 임계값에 기초한 2단계의 활성화 조건을 만족하면 음성인식 기능이 활성화되어, 전자장치(10)가 수신된 소리 신호에 기초하여 사용자발화에 관한 인식 동작을 수행하도록 구현될 수 있다. 이러한 2단계의 활성화 조건에 따른 음성인식기능의 활성화가 이루어지는 구체적인 예에 관해서는 후술하는 도 4의 실시예에서 보다 상세하게 설명하기로 한다.In the electronic device 10 according to an embodiment of the present invention, when the user utterance is made in a noisy environment and the two-stage activation condition based on the two start word thresholds described above is satisfied, the voice recognition function of the voice recognition engine 304 is activated, so that the electronic device 10 may perform a recognition operation on the user utterance based on the received sound signal. A specific example of activating the voice recognition function according to this two-stage activation condition will be described in more detail in the embodiment of FIG. 4 below.
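One way such a two-stage activation could look in code; the two threshold values, the SNR gate, and the exact conditions below are hypothetical placeholders for the FIG. 4 embodiment:

```python
def two_stage_activation(score, snr, low_thr=0.5, high_thr=0.7, snr_thr=6.0):
    # Stage 1: a confident pattern-match score above the higher start-word
    # threshold activates recognition outright.
    if score > high_thr:
        return True
    # Stage 2: a weaker score above the lower threshold is accepted only
    # when the signal-to-noise ratio is sufficient.
    if score > low_thr and snr >= snr_thr:
        return True
    return False
```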
In one embodiment, the voice recognition function of the voice recognition engine 304 may be performed using one or more voice recognition algorithms. For example, the voice recognition engine 304 may extract a vector representing a voice feature from the voice signal uttered by the user and compare the extracted vector with an acoustic model of the database 151 or the server 20 to perform voice recognition. Here, as an example, the acoustic model is a model obtained through previously performed training.
As described above, the voice recognition module 162, which includes the preprocessor 301, the wake-up word engine 302, the threshold determiner 303, and the voice recognition engine 304, is described as an example as being implemented as an embedded type residing in a CPU provided as the processor 160, but the present invention is not limited thereto. Accordingly, the voice recognition module 162 may be implemented as a component of the electronic device 10 separate from the CPU, for example, as a separate chip such as a microcomputer provided as a dedicated processor for the voice recognition function.
In addition, the preprocessor 301, the wake-up word engine 302, the threshold determiner 303, and the voice recognition engine 304, as the components of the voice recognition module 162, may each be implemented as a software block as an example; in some cases, at least one of the components may be omitted, or at least one other component may be added.
In the following embodiments, the operations performed by at least one of the aforementioned preprocessor 301, wake-up word engine 302, threshold determiner 303, and voice recognition engine 304 so that the electronic device 10 performs the voice recognition function will be understood to be performed by the processor 160 of the electronic device 10.
In one embodiment, the processor 160 identifies whether a value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than a noise threshold (hereinafter also referred to as a first threshold), and identifies whether a value representing the speech characteristic of the sound signal is greater than an SNR threshold (hereinafter also referred to as a second threshold). When the value of the noise characteristic is identified as being greater than the first threshold and the value of the speech characteristic as being greater than the second threshold, the processor 160 performs a recognition operation on the user utterance based on the received sound signal and causes the second threshold, that is, the SNR threshold, to be adjusted upward. Here, for a sound signal whose waveform has a similarity to a predefined wake-up word pattern greater than a first wake-up word threshold (hereinafter also referred to as a third threshold), that is, a sound signal satisfying the first activation condition, the processor 160 may identify whether the value of its noise characteristic and the value of its speech characteristic are greater than the first threshold and the second threshold, respectively.
In addition, when the value of the noise characteristic of the received sound signal is identified as being less than or equal to the first threshold, the processor 160 performs a recognition operation on the user utterance based on the received sound signal and causes the second threshold, that is, the SNR threshold, to be adjusted downward.
In addition, when the value of the speech characteristic of the sound signal is identified as being less than or equal to the second threshold, the processor 160 may perform a recognition operation on the user utterance based on the received sound signal if the similarity between the waveform of the sound signal and the wake-up word pattern is greater than a second wake-up word threshold (hereinafter also referred to as a fourth threshold), which is greater than the first wake-up word threshold, that is, if the second activation condition is satisfied.
As an embodiment, the operations of the processor 160 may be implemented as a computer program stored in a computer program product (not shown) provided separately from the electronic device 10. In this case, the computer program product includes a memory in which instructions corresponding to the computer program are stored, and a processor. When executed by the processor 160, the instructions include: if the value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than the first threshold and the value representing the speech characteristic of the sound signal is greater than the second threshold, performing a recognition operation on the user utterance based on the received sound signal and causing the second threshold to be adjusted upward. The instructions also include: if the value representing the noise characteristic of the received sound signal is less than or equal to the first threshold, performing a recognition operation on the user utterance based on the received sound signal and causing the second threshold to be adjusted downward.
Accordingly, the processor 160 of the electronic device 10 may download and execute the computer program stored in the separate computer program product to perform the operations of the instructions described above.
Hereinafter, embodiments in which the recognition operation on a user utterance is improved in the electronic device of the present invention will be described with reference to the drawings.
FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention, FIG. 5 is a diagram for explaining pattern matching for activating the voice recognition function in an electronic device according to an embodiment of the present invention, and FIG. 6 is a diagram for explaining identification of a noise characteristic by an electronic device according to an embodiment of the present invention.
As shown in FIG. 4, the electronic device 10 may receive a sound signal through the sound receiver 120 (401). Here, the received sound signal may be a signal resulting from a user utterance.
The processor 160 may identify whether the sound signal received in step 401 satisfies the first activation condition for the voice recognition function (402).
In one embodiment, as shown in FIG. 5, the processor 160 may perform pattern matching between the sound signal from which ambient noise has been removed through preprocessing and a predefined wake-up word signal, to identify whether the first activation condition is satisfied.
Specifically, based on the pattern matching as shown in FIG. 5, the processor 160 derives a speech score (Score_speech) as the similarity between the user utterance, that is, the waveform of the sound signal, and the wake-up word signal pattern, and may identify, using Equation 1 below, whether the derived speech score, that is, the similarity, is greater than a predetermined first wake-up word threshold (WUW_Threshold1), that is, the third threshold.
[Equation 1] $\mathrm{Score}_{\mathrm{speech}} > \mathrm{WUW\_Threshold}_1$
Here, the first wake-up word threshold (third threshold) is used to identify whether the sound signal satisfies the first activation condition for the voice recognition function, and is applied regardless of whether the user utterance is made in a noisy environment.
In one embodiment, the first wake-up word threshold may be preset to, for example, 0.1, but this is only an example and the value is not limited thereto.
When the speech score is identified as being greater than the first wake-up word threshold according to Equation 1, the processor 160 may determine that the sound signal input in step 401 satisfies the first activation condition.
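As an illustration only, the Equation 1 check can be sketched in Python. The patent does not specify how the pattern-matching score is computed; normalized cross-correlation is used here purely as a stand-in for the similarity measure, and the function names are assumptions:

```python
import numpy as np

def speech_score(signal: np.ndarray, wuw_pattern: np.ndarray) -> float:
    # Stand-in similarity measure (assumed): normalized cross-correlation
    # between the input waveform and the wake-up word pattern.
    n = min(len(signal), len(wuw_pattern))
    s, w = signal[:n], wuw_pattern[:n]
    denom = np.linalg.norm(s) * np.linalg.norm(w)
    if denom == 0.0:
        return 0.0
    return float(np.dot(s, w) / denom)

def satisfies_first_condition(signal, wuw_pattern, wuw_threshold1=0.1):
    # Equation 1: Score_speech > WUW_Threshold1 (0.1 is the example value
    # given in the description).
    return speech_score(signal, wuw_pattern) > wuw_threshold1
```

A waveform identical to the pattern yields a score near 1 and satisfies the condition; an uncorrelated waveform yields a score near 0 and does not.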
If it is determined in step 402 that the sound signal satisfies the first activation condition, the processor 160 may identify whether the value representing the noise characteristic of the sound signal resulting from the user utterance is greater than a predetermined noise threshold, that is, the first threshold (403). Here, the first threshold is used to identify whether the user's surroundings constitute a noisy environment, and may be preset to correspond to the power value of a sound signal when sufficiently loud noise is present in the surroundings.
Here, for the sound signal received in step 401, the processor 160 may identify a section containing the wake-up word uttered by the user (hereinafter also referred to as the wake-up word section), and identify whether the value representing the noise characteristic received in a section of predefined time length preceding the wake-up word section (hereinafter also referred to as the noise characteristic check section) is greater than the first threshold.
In the electronic device 10 according to an embodiment of the present invention, as shown in FIG. 6, the sound signal received in a streaming manner by the sound receiver 120 may be temporarily stored, in units of consecutive frames, in a first-in-first-out (FIFO) queue-type data structure. That is, when the next frame is received, the streaming sound signal is stored in such a way that the earliest stored frame is pushed out. Here, the length of the stored sound signal may be preset to correspond to the storage space; for example, it may be implemented such that a signal of 2.5 seconds in length is stored.
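A minimal sketch of this FIFO frame queue: only the 2.5-second total length comes from the description; the 16 kHz sampling rate and 10 ms frame size are assumed, illustrative values:

```python
from collections import deque

SAMPLE_RATE = 16000                  # assumed sampling rate
FRAME_LEN = 160                      # 10 ms per frame (assumed)
BUFFER_SECONDS = 2.5                 # stored signal length from the description
MAX_FRAMES = int(BUFFER_SECONDS * SAMPLE_RATE / FRAME_LEN)  # 250 frames

# Bounded deque: when full, appending a new frame automatically discards
# the earliest stored frame (first in, first out).
frame_queue = deque(maxlen=MAX_FRAMES)

def push_frame(frame):
    frame_queue.append(frame)
```

After pushing more frames than the queue holds, only the most recent 2.5 seconds remain.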
In one embodiment, the processor 160 may monitor whether each frame of the streaming sound signal, received and stored in units of consecutive frames as described above, contains the wake-up word uttered by the user. Based on this monitoring, for example, as described for step 402, when the speech score of a particular signal frame is detected as being greater than the first wake-up word threshold, the processor 160 may identify that signal frame as containing the user utterance, that is, the wake-up word.
The processor 160 may identify, as the wake-up word section, a time section of predefined length extending from the signal frame identified in step 402 back to, for example, about 1 second earlier. The processor 160 may then identify a section of predefined time length preceding the identified wake-up word section, for example, a time section of about 1.5 seconds, as the noise characteristic check section.
Here, the noise characteristic check section may be defined to correspond to the time obtained by subtracting the time of the wake-up word section from the time of the entire stored sound signal; in the present invention, the time lengths corresponding to the wake-up word section and the noise characteristic check section are not limited to the examples presented.
As the noise characteristic of the sound signal, the processor 160 may compare the signal power of the noise characteristic check section with the first threshold to identify whether the surrounding environment is sufficiently loud at the time of utterance, in other words, whether the user utterance is made in a noisy environment.
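The split of the stored buffer and the power comparison of step 403 might be sketched as follows; the use of mean squared amplitude as the power measure and the concrete threshold value are assumptions, not values from this description:

```python
import numpy as np

def is_noisy_environment(buffer: np.ndarray, sample_rate: int,
                         wuw_seconds: float = 1.0,
                         noise_threshold: float = 1e-3) -> bool:
    # The last ~1 s of the stored buffer is treated as the wake-up word
    # section; everything before it (~1.5 s of a 2.5 s buffer) is the
    # noise characteristic check section.
    wuw_samples = int(wuw_seconds * sample_rate)
    noise_section = buffer[:-wuw_samples]
    # "Noise characteristic" value: mean power of the noise section
    # (assumed measure), compared against the first threshold.
    noise_power = float(np.mean(noise_section ** 2))
    return noise_power > noise_threshold
```

A silent noise section yields a power of zero and is classified as quiet; a loud noise section exceeds the threshold and is classified as noisy.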
If, in step 403, the noise characteristic of the sound signal, that is, the signal power, is identified as being greater than the first threshold, the processor 160 may identify whether the speech characteristic of the sound signal is greater than a predetermined SNR threshold, that is, the second threshold (404). Here, the speech characteristic may include the signal-to-noise ratio (SNR) of the sound signal.
In one embodiment, as the speech characteristic of the sound signal, the processor 160 computes an a posteriori SNR (SNR_post), corresponding to the power ratio between the entire sound signal and the noise, and may identify, using Equation 2 below, whether the computed a posteriori SNR is greater than the predetermined second threshold, that is, the SNR threshold (SNR_Threshold).
[Equation 2] $\mathrm{SNR}_{\mathrm{post}} > \mathrm{SNR\_Threshold}$
Here, the a posteriori SNR (SNR_post) may be computed using Equations 3 and 4 below.
[Equation 3] $X(p,k) = S(p,k) + N(p,k)$
Here, for the k-th spectral component of frame p, X(p,k) denotes the entire noise-containing sound signal, S(p,k) denotes the speech signal, and N(p,k) denotes the noise signal.
Accordingly, the received input sound signal (voice signal) X may be expressed, as in Equation 3, as the sum of the k-th spectral components of the speech component S and the noise component N for each frame p.
The a posteriori SNR (SNR_post) may be computed, for each frame p, as the power ratio of the entire noise-containing sound signal X(p,k) to the noise N(p,k), according to Equation 4 below.
[Equation 4] $\mathrm{SNR}_{\mathrm{post}}(p) = \dfrac{1}{K}\sum_{k=1}^{K}\dfrac{|X(p,k)|^2}{|N(p,k)|^2}$
Then, the final a posteriori SNR over all frames may be computed as the average of the a posteriori SNR values of the individual frames p.
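Under the assumption that the noise spectrum N(p,k) has been estimated separately (for example from the noise characteristic check section; the description does not state the estimator), the per-frame ratio of Equation 4 and the averaging over frames might be sketched as:

```python
import numpy as np

def posteriori_snr(noisy_spec: np.ndarray, noise_spec: np.ndarray,
                   eps: float = 1e-12) -> float:
    # noisy_spec, noise_spec: complex or real arrays of shape (frames, bins),
    # i.e. X(p,k) and the estimated N(p,k).
    per_bin = (np.abs(noisy_spec) ** 2) / (np.abs(noise_spec) ** 2 + eps)
    per_frame = per_bin.mean(axis=1)    # average over spectral bins k
    return float(per_frame.mean())      # final value: average over frames p
```

With a noisy spectrum whose magnitude is everywhere twice the noise magnitude, the power ratio per bin is 4, so the final a posteriori SNR is 4.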
According to the electronic device 10 of an embodiment, the processor 160 determines the final a posteriori SNR computed in this way as the speech characteristic of the sound signal in step 404, and compares the speech characteristic with the second threshold (SNR_Threshold), thereby identifying whether the user utterance is sufficiently loud in the noisy environment.
Here, the second threshold, that is, the SNR threshold (SNR_Threshold), is a predetermined value corresponding to the level at which the input sound signal can be recognized as having been generated by a user utterance in a noisy environment, and its initial value may be preset; in an embodiment of the present invention, the initial SNR threshold may be set to, for example, 4, but is not limited thereto.
If, as a result of the identification in step 403, the electronic device 10 is operating in a sufficiently loud noise environment (YES in step 403), a case may occur in which the sound signal of step 402, which contains ambient noise rather than an actual user utterance, is erroneously recognized as containing the wake-up word.
Taking this into account, in the electronic device 10 according to an embodiment of the present invention, for a sound signal identified in step 402 as satisfying the first activation condition, whether the environment is noisy is first identified in step 403 by comparing the value of the noise characteristic of the signal with the first threshold; if the environment is noisy, the speech characteristic of the sound signal is further compared with the second threshold (the initial SNR threshold) in step 404.
Accordingly, step 404 further determines whether the user utterance is sufficiently loud in the noisy environment, and based on the result, execution of the trigger for performing the voice recognition operation can be controlled.
When the value of the speech characteristic of the sound signal is identified in step 404 as being greater than the predetermined second threshold, for example, the initial SNR threshold, the processor 160 executes the trigger and controls the electronic device 10 to perform the voice recognition operation on the user utterance based on the received sound signal (405). Here, if the final a posteriori SNR computed in step 404 is, for example, 5, which is greater than the initial SNR threshold of 4, the trigger may be executed.
That is, in a noisy environment (YES in step 403), when the user utterance is identified as being sufficiently loud (YES in step 404), the processor 160 immediately executes the trigger, thereby activating the voice recognition function of the electronic device 10 so that operations can be performed in response to the received sound signal.
Then, the processor 160 may adjust the second threshold upward from the predetermined initial SNR threshold (406).
In other words, in the electronic device 10 according to an embodiment of the present invention, after executing the trigger in step 405, the processor 160 may reset the second threshold to reflect the noisy environment as a change in the surroundings.
In one embodiment, the processor 160 may derive a new second threshold (SNR_Threshold) using the initial SNR threshold (SNR_Th_init) and the a posteriori SNR (SNR_post) computed in step 404, according to Equation 5 below.
[Equation 5] $\mathrm{SNR\_Threshold} = \mathrm{SNR}_{Th\_init} \times \log_{\mathrm{SNR}_{Th\_init}}\left(\mathrm{SNR}_{\mathrm{post}}\right)$
For example, when the initial SNR threshold (SNR_Th_init) is 4 and the a posteriori SNR (SNR_post) computed in step 404 is 5, the new second threshold (SNR_Threshold) is, according to Equation 5, 4 × log_4(5) = 4 × 1.16 = 4.64, which is increased, that is, adjusted upward, to a value greater than 4.
The second threshold adjusted upward in this way becomes the value applied in step 404 to the next sound signal in response to its reception.
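The Equation 5 update can be written directly. The sketch below reproduces both worked examples in this description: 4 → 4.64 when SNR_post = 5, and 4.16 → 2.02 when SNR_post = 2:

```python
import math

def update_snr_threshold(snr_th_init: float, snr_post: float) -> float:
    # Equation 5: the new SNR threshold is the initial threshold scaled by
    # the logarithm, base SNR_Th_init, of the measured a posteriori SNR.
    # When SNR_post exceeds the initial threshold the value moves up
    # (step 406); when it falls below, the value moves down (step 411).
    return snr_th_init * math.log(snr_post, snr_th_init)
```

Note that when SNR_post equals the initial threshold, the logarithm is 1 and the threshold is unchanged.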
In the electronic device 10 of an embodiment of the present invention, as described above, when the surroundings are a noisy environment, the second threshold (SNR threshold), as a trigger execution condition, is correspondingly adjusted upward, which can induce the user to utter more loudly in the noisy environment.
Meanwhile, as shown in FIG. 4, when the value of the speech characteristic of the sound signal is identified in step 404 as being less than or equal to the predetermined second threshold, for example, the initial SNR threshold, the processor 160 may further identify whether that sound signal satisfies the second activation condition (407).
In one embodiment, using Equation 6 below, the processor 160 may identify that the sound signal satisfies the second activation condition when the speech score, derived as the similarity between the waveform of the sound signal of step 401 and the wake-up word pattern, is greater than a predetermined second wake-up word threshold (WUW_Threshold2), that is, the fourth threshold.
[Equation 6] $\mathrm{Score}_{\mathrm{speech}} > \mathrm{WUW\_Threshold}_2$
Here, the second wake-up word threshold (WUW_Threshold2) (fourth threshold) is used to identify whether the sound signal satisfies the second activation condition for the voice recognition function, and may be applied when the user utterance is made in a noisy environment (YES in step 403).
The second wake-up word threshold (WUW_Threshold2) (fourth threshold) may be set to a value greater than the first wake-up word threshold (WUW_Threshold1) (third threshold) described above, as in Equation 7 below.
[Equation 7] $\mathrm{WUW\_Threshold}_2 > \mathrm{WUW\_Threshold}_1$
In one embodiment, for example, the first wake-up word threshold (third threshold) may be preset to 0.1 and the second wake-up word threshold (fourth threshold) to 0.15, but these values are presented only as examples and are not limiting.
When the speech score is identified as being greater than the second wake-up word threshold according to Equation 6, the processor 160 may determine that the sound signal input in step 401 satisfies the second activation condition.
If it is determined in step 407 that the sound signal satisfies the second activation condition, the processor 160 executes the trigger and controls the electronic device 10 to perform the voice recognition operation on the user utterance based on the received sound signal (408).
On the other hand, if it is determined in step 407 that the sound signal does not satisfy the second activation condition, that is, if the speech score is identified as being less than or equal to the second wake-up word threshold according to Equation 6, the processor 160 does not execute the trigger, and the electronic device 10 is controlled to keep the voice recognition function deactivated (409).
Accordingly, in the electronic device 10 according to an embodiment of the present invention, in a noisy environment (YES in step 403), even if the similarity between the waveform of the input sound signal and the pattern of the wake-up word signal is identified as being greater than the first wake-up word threshold so that the input sound signal satisfies the first activation condition (YES in step 402), the processor 160 performs control based on a two-stage activation condition such that the voice recognition function is activated only when the sound signal also satisfies the second activation condition (YES in step 407).
In other words, in a noisy environment, the voice recognition function is activated only when the speech score representing the similarity according to the pattern matching between the sound signal and the wake-up word signal is greater than the second wake-up word threshold; therefore, even if a sound signal containing ambient noise rather than an actual user utterance is erroneously recognized in step 402 as containing the wake-up word, step 407 reduces the possibility of a misrecognized operation.
Meanwhile, if, in step 403, the noise characteristic of the sound signal, that is, the signal power, is identified as being less than or equal to the first threshold, the processor 160 executes the trigger and controls the electronic device 10 to perform the voice recognition operation (410).
Then, the processor 160 may adjust the second threshold downward from the predetermined initial SNR threshold (411).
In other words, in the electronic device 10 according to an embodiment of the present invention, after the processor 160 executes the trigger in step 410, when it is determined according to the identification result of step 403 that the electronic device 10 is operating in an environment that is not loud (NO in step 403), that is, when the surroundings are not a noisy environment, the second threshold may be reset to reflect this.
In one embodiment, the processor 160 computes the a posteriori SNR (SNR_post), corresponding to the power ratio between the entire sound signal and the noise as the speech characteristic of the sound signal described for step 404, and may derive a new second threshold (SNR_Threshold) using the computed a posteriori SNR and the initial SNR threshold according to Equation 5 above. Here, since the surroundings are not a noisy environment, the computed final a posteriori SNR is derived as a smaller value than in the case of step 404; for example, it may be 2.
일례로서, 초기 SNR 임계값(SNR Th _ init)이 4.16 이고, 연산된 사후 SNR (SNR post)이 2인 경우, 새로운 제2임계값(SNR Threshold)은, 수학식 5에 따라 4.16*log_4.16 (2) = 4.16*0.49 = 2.02 로서, 4보다 작은 값을 가지도록 감소 즉, 하향 조정된다.As an example, when the initial SNR threshold (SNR Th _ init ) is 4.16 and the calculated post SNR (SNR post ) is 2, the new second threshold (SNR Threshold) is 4.16*log_4. 16 (2) = 4.16*0.49 = 2.02, that is, it is decreased to have a value less than 4, that is, it is adjusted downward.
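As a sketch, the threshold update used in this worked example can be written as follows. The function name is hypothetical, and the form of Equation 5 (a logarithm whose base is the initial threshold) is inferred from the numbers above rather than reproduced from the original equation:

```python
import math

def adjust_snr_threshold(snr_th_init, snr_post):
    """Derive a new SNR threshold from the initial threshold and the
    measured posterior SNR, per the form inferred from Equation 5:
    SNR_Threshold = SNR_Th_init * log_{SNR_Th_init}(SNR_post)."""
    return snr_th_init * math.log(snr_post, snr_th_init)

# Worked example from the description: quiet environment, posterior SNR of 2.
new_threshold = adjust_snr_threshold(4.16, 2)
print(round(new_threshold, 2))  # → 2.02, i.e., adjusted downward
```

Note that the same formula adjusts the threshold upward when the posterior SNR exceeds the initial threshold (since the logarithm then exceeds 1), which matches the upward adjustment described for noisy environments.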
In the electronic device 10 according to an embodiment of the present invention, as described above, when the surroundings are not a noise environment, the second threshold value (SNR threshold) serving as the trigger execution condition is adjusted downward accordingly, so that immediate voice recognition is possible even when the user utters the start word quietly.
Meanwhile, as shown in FIG. 4, when it is determined in step 402 that the sound signal does not satisfy the first activation condition, that is, when the utterance score indicating the similarity between the sound signal and the start word signal according to Equation 1 is identified as being equal to or less than the first start word threshold value, the processor 160 does not execute the trigger, so the electronic device 10 may be controlled to keep voice recognition deactivated (412).
In the electronic device 10 according to an embodiment of the present invention as described above, even when the sound signal corresponding to a user utterance satisfies the first activation condition, in a noise environment in which the signal-to-noise ratio (SNR), the utterance characteristic of the sound signal, is equal to or less than the SNR threshold value, that is, in which the user utterance is not sufficiently loud relative to the noise, whether the sound signal satisfies the second activation condition is additionally identified, so that a two-stage activation condition is applied to the voice recognition function.
Accordingly, the occurrence of malfunctions can be reduced, such as the case in which the electronic device 10, in a noise environment, erroneously recognizes a sound signal containing ambient noise rather than an actual user utterance as containing the start word.
In addition, in the electronic device 10 according to an embodiment of the present invention, when the signal-to-noise ratio (SNR), the utterance characteristic of the sound signal, is greater than the SNR threshold value in a noise environment, that is, when the user utterance is sufficiently loud relative to the noise, the SNR threshold value is adjusted upward, inducing the user to utter the start word loudly in the noise environment, so that an improvement in operational accuracy can be expected.
In addition, in the electronic device 10 according to an embodiment of the present invention, when the surroundings are not a noise environment, the SNR threshold value is adjusted downward, so that in a quiet environment the electronic device 10 can operate immediately in response to the change in environment.
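The two-stage activation flow summarized above can be sketched as follows. This is a simplified illustration: the function and parameter names are hypothetical, and the threshold-update rule (a logarithm with the initial threshold as its base) is inferred from the worked example in the description:

```python
import math

SNR_TH_INIT = 4.16  # initial SNR threshold, taken from the example above

def should_trigger(utterance_score, start_word_threshold,
                   noise_power, first_threshold,
                   snr_post, snr_threshold):
    """Two-stage activation check (hypothetical names).

    Returns (trigger, new_snr_threshold): whether voice recognition is
    activated, and the SNR threshold adjusted per the inferred Equation 5."""
    # Stage 1: the sound signal must resemble the start word (step 402).
    if utterance_score <= start_word_threshold:
        return False, snr_threshold  # keep voice recognition deactivated (412)
    # Quiet environment (step 403, NO): trigger immediately (410)
    # and adjust the SNR threshold downward (411).
    if noise_power <= first_threshold:
        return True, SNR_TH_INIT * math.log(snr_post, SNR_TH_INIT)
    # Stage 2 (noisy environment): the utterance must be loud enough
    # relative to the noise; if so, trigger and raise the threshold.
    if snr_post > snr_threshold:
        return True, SNR_TH_INIT * math.log(snr_post, SNR_TH_INIT)
    return False, snr_threshold
```

With these assumptions, a quiet environment with posterior SNR 2 triggers and lowers the threshold below 4.16, while a noisy environment with posterior SNR 8 triggers and raises it above 4.16, matching the behavior described above.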
Although the present invention has been described in detail above through preferred embodiments, the present invention is not limited thereto and may be practiced in various ways within the scope of the claims.

Claims (15)

  1. An electronic device comprising:
    a sound receiver; and
    a processor configured to,
    when a value representing a noise characteristic of a sound signal received through the sound receiver is greater than a first threshold value
    and a value representing an utterance characteristic of the sound signal is greater than a second threshold value,
    perform a recognition operation on a user utterance based on the sound signal and adjust the second threshold value to be raised.
  2. The electronic device of claim 1,
    wherein the utterance characteristic comprises a signal-to-noise ratio of the sound signal.
  3. The electronic device of claim 2,
    wherein the processor is configured to calculate, for each frame of the sound signal, a ratio of the magnitude of noise to the sound signal, and to determine an average value of the calculated per-frame ratios as the value of the utterance characteristic.
  4. The electronic device of claim 1,
    wherein the processor is configured to:
    identify whether a predefined start word is included in the sound signal; and
    identify whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold value.
  5. The electronic device of claim 4,
    wherein the processor is configured to identify whether the start word is included in the sound signal based on a similarity between a waveform of the sound signal and a predefined start word pattern.
  6. The electronic device of claim 5,
    wherein a threshold value of the similarity is preset based on a learning algorithm using an acoustic model.
  7. The electronic device of claim 5,
    wherein the processor is configured to, when the value of the utterance characteristic of a sound signal whose similarity is greater than a third threshold value is equal to or less than the second threshold value, perform the recognition operation on the user utterance based on a sound signal whose similarity satisfies a fourth threshold value that is greater than the third threshold value.
  8. The electronic device of claim 5,
    wherein the processor is configured to identify whether the value of the noise characteristic of a sound signal received in a section of a predefined time length preceding the section containing the start word is greater than the first threshold value.
  9. The electronic device of claim 7,
    wherein the processor is configured to compare a power value of the sound signal received in the section of the predefined time length with the first threshold value.
  10. The electronic device of claim 1,
    wherein the processor is configured to adjust the second threshold value to be lowered when the value of the noise characteristic is equal to or less than the first threshold value.
  11. A control method of an electronic device, the method comprising:
    acquiring a noise characteristic from a sound signal received through a sound receiver;
    acquiring an utterance characteristic from the sound signal; and
    when a value representing the noise characteristic is greater than a first threshold value and a value representing the utterance characteristic is greater than a second threshold value, performing a recognition operation on a user utterance based on the sound signal and adjusting the second threshold value to be raised.
  12. The control method of claim 11,
    wherein the utterance characteristic comprises a signal-to-noise ratio of the sound signal.
  13. The control method of claim 12, further comprising:
    calculating, for each frame of the sound signal, a ratio of the magnitude of noise to the sound signal, and determining an average value of the calculated per-frame ratios as the value of the utterance characteristic.
  14. The control method of claim 11, further comprising:
    identifying whether a predefined start word is included in the sound signal; and
    identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold value.
  15. The control method of claim 14,
    wherein the identifying whether the start word is included comprises identifying whether the start word is included in the sound signal based on a similarity between the waveform of the sound signal and a predefined start word pattern.
PCT/KR2020/018442 2019-12-19 2020-12-16 Electronic device and control method therefor WO2021125784A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2019-0170363 2019-12-19
KR1020190170363A KR20210078682A (en) 2019-12-19 2019-12-19 Electronic apparatus and method of controlling the same

Publications (1)

Publication Number Publication Date
WO2021125784A1 true WO2021125784A1 (en) 2021-06-24

Family

ID=76476805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/018442 WO2021125784A1 (en) 2019-12-19 2020-12-16 Electronic device and control method therefor

Country Status (2)

Country Link
KR (1) KR20210078682A (en)
WO (1) WO2021125784A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11948569B2 (en) 2021-07-05 2024-04-02 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method thereof
KR20230006999A (en) * 2021-07-05 2023-01-12 삼성전자주식회사 Electronic apparatus and controlling method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9047857B1 (en) * 2012-12-19 2015-06-02 Rawles Llc Voice commands for transitioning between device states
KR20170035602A (en) * 2015-09-23 2017-03-31 삼성전자주식회사 Voice Recognition Apparatus, Voice Recognition Method of User Device and Computer Readable Recording Medium
US20170256270A1 (en) * 2016-03-02 2017-09-07 Motorola Mobility Llc Voice Recognition Accuracy in High Noise Conditions
KR20180018146A (en) * 2016-08-12 2018-02-21 삼성전자주식회사 Electronic device and method for recognizing voice of speech
KR20190117725A (en) * 2017-03-22 2019-10-16 삼성전자주식회사 Speech signal processing method and apparatus adaptive to noise environment


Also Published As

Publication number Publication date
KR20210078682A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
WO2017052082A1 (en) Voice recognition apparatus, voice recognition method of user device, and non-transitory computer readable recording medium
WO2014003283A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
WO2021025350A1 (en) Electronic device managing plurality of intelligent agents and operation method thereof
WO2018043895A1 (en) Display device and method for controlling display device
WO2021049795A1 (en) Electronic device and operating method thereof
WO2020231230A1 (en) Method and apparatus for performing speech recognition with wake on voice
WO2015194693A1 (en) Video display device and operation method therefor
WO2021125784A1 (en) Electronic device and control method therefor
WO2019013447A1 (en) Remote controller and method for receiving a user's voice thereof
WO2015170832A1 (en) Display device and video call performing method therefor
WO2020184842A1 (en) Electronic device, and method for controlling electronic device
WO2020251122A1 (en) Electronic device for providing content translation service and control method therefor
WO2021002611A1 (en) Electronic apparatus and control method thereof
WO2020091519A1 (en) Electronic apparatus and controlling method thereof
WO2019112181A1 (en) Electronic device for executing application by using phoneme information included in audio data and operation method therefor
WO2020167006A1 (en) Method of providing speech recognition service and electronic device for same
WO2019017665A1 (en) Electronic apparatus for processing user utterance for controlling an external electronic apparatus and controlling method thereof
WO2020091183A1 (en) Electronic device for sharing user-specific voice command and method for controlling same
WO2021137558A1 (en) Electronic device and control method thereof
WO2020050593A1 (en) Electronic device and operation method thereof
WO2020096218A1 (en) Electronic device and operation method thereof
WO2019112332A1 (en) Electronic apparatus and control method thereof
WO2019177377A1 (en) Apparatus for processing user voice input
WO2022131566A1 (en) Electronic device and operation method of electronic device
WO2018021750A1 (en) Electronic device and voice recognition method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20903382

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20903382

Country of ref document: EP

Kind code of ref document: A1