WO2021125784A1 - Electronic device and control method thereof - Google Patents
Electronic device and control method thereof
- Publication number
- WO2021125784A1 (PCT/KR2020/018442)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound signal
- electronic device
- value
- threshold
- threshold value
Classifications
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
- G10L15/04—Segmentation; Word boundary detection
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L2015/225—Feedback of the input speech
Definitions
- the present invention relates to an electronic device and a control method thereof, and more particularly, to an electronic device for processing a voice uttered by a user and a control method thereof.
- Electronic devices such as artificial intelligence (AI) speakers, mobile devices such as smart phones or tablets, and smart TVs can recognize a voice uttered by a user and perform a function according to the voice recognition.
- the electronic device may operate to activate the voice recognition function by recognizing that a predetermined start word, that is, a trigger word, is input from the user.
- The start word recognition may include a process of determining the similarity between the audio signal of the user's voice and the start word. For example, when the similarity between the pattern of the audio signal and the start word is greater than a predetermined criterion, the input voice can be identified as including the start word.
- The threshold value for identifying the speech characteristic of a sound signal is reset according to whether the user's utterance is made in a noisy environment, so that the accuracy of start word recognition is improved.
- An electronic device includes: a sound receiver; and a processor that, when the value indicating the noise characteristic of the sound signal received through the sound receiver is greater than the first threshold value and the value indicating the speech characteristic of the sound signal is greater than the second threshold value, performs a recognition operation regarding the user's utterance based on the sound signal and adjusts the second threshold value to increase.
- the speech characteristic may include a signal-to-noise ratio of the sound signal.
- the processor may calculate a ratio of the noise to the sound signal for each frame of the sound signal, and determine an average value of the calculated ratio for each frame as the value of the speech characteristic.
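- As a minimal sketch of the frame-wise averaging described above, the following Python example splits a sound signal into fixed-length frames, computes a signal-to-noise ratio per frame in decibels, and averages the per-frame values. The function names, the frame length, and the use of a separately estimated noise power are assumptions made for illustration and are not taken from the embodiment.

```python
import math
from typing import Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def mean_frame_snr_db(signal: Sequence[float],
                      noise_power: float,
                      frame_len: int = 4) -> float:
    """Average the per-frame SNR (in dB) over all complete frames.

    noise_power is a hypothetical, separately estimated noise floor;
    how the device estimates it is not specified here.
    """
    if noise_power <= 0:
        raise ValueError("noise_power must be positive")
    snrs = []
    # A trailing partial frame, if any, is ignored.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        power = frame_power(signal[start:start + frame_len])
        # Clamp to avoid log10(0) on silent frames.
        snrs.append(10.0 * math.log10(max(power, 1e-12) / noise_power))
    if not snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(snrs) / len(snrs)
```

The returned average could then serve as the value of the speech characteristic that is compared with the second threshold value.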
- the processor may identify whether a predefined start word is included in the sound signal, and identify whether a noise characteristic of the sound signal identified as including the start word is greater than a first threshold value.
- the processor may identify whether a start word is included in the sound signal based on a similarity between a waveform of the sound signal and a predefined start word pattern.
- the threshold of similarity may be preset based on a learning algorithm using an acoustic model.
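- The similarity test described above can be illustrated with a simple stand-in. The sketch below slides a stored start word pattern over the received samples and compares the best cosine similarity with a threshold. Cosine similarity on raw samples is a swapped-in technique chosen for brevity; the embodiment refers to an acoustic model and a learned threshold, and the names and the value 0.8 are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length sample windows."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a silent window matches nothing
    return dot / (norm_a * norm_b)


def contains_start_word(signal: Sequence[float],
                        pattern: Sequence[float],
                        similarity_threshold: float = 0.8) -> bool:
    """True if any window of the signal is similar enough to the pattern.

    similarity_threshold is a hypothetical preset value.
    """
    n = len(pattern)
    best = max(
        (cosine_similarity(signal[i:i + n], pattern)
         for i in range(len(signal) - n + 1)),
        default=0.0,  # signal shorter than the pattern
    )
    return best > similarity_threshold
```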
- The processor may perform a recognition operation regarding the user's utterance based on a sound signal whose similarity satisfies a fourth threshold value greater than the third threshold value.
- the processor may identify whether the value of the noise characteristic of the sound signal received in a section having a predefined time length before the section including the start word is greater than the first threshold value.
- the processor may compare the power value of the sound signal received in the section of the predefined time length with the first threshold value.
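- The noise check over the section preceding the start word can be sketched as follows. The example takes the samples in a window of predefined length immediately before the start word, computes their mean power, and compares it with the first threshold value. The function names, the sample-index interface, and the use of mean squared amplitude as the power value are assumptions for illustration.

```python
from typing import Sequence


def pre_start_word_noise_power(samples: Sequence[float],
                               start_index: int,
                               window_len: int) -> float:
    """Mean power of the window that ends where the start word begins."""
    begin = max(0, start_index - window_len)
    window = samples[begin:start_index]
    if not window:
        return 0.0  # nothing was received before the start word
    return sum(x * x for x in window) / len(window)


def is_noisy(samples: Sequence[float],
             start_index: int,
             window_len: int,
             first_threshold: float) -> bool:
    """True if the pre-start-word power exceeds the first threshold."""
    power = pre_start_word_noise_power(samples, start_index, window_len)
    return power > first_threshold
```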
- the processor may adjust the second threshold to decrease.
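- The threshold adjustment summarized above can be sketched as a small state holder. In a noisy environment (noise value above the first threshold value), recognition proceeds only if the speech value exceeds the second threshold value, and the second threshold value is then raised. The condition under which the second threshold value is lowered is not stated in this passage; the sketch assumes it is lowered when the environment is not noisy, and also assumes recognition proceeds in that case. The class name, step size, and bounds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StartWordGate:
    noise_threshold: float             # "first threshold value"
    speech_threshold: float            # "second threshold value"
    step: float = 1.0                  # hypothetical adjustment size
    min_speech_threshold: float = 0.0  # hypothetical lower bound
    max_speech_threshold: float = 30.0  # hypothetical upper bound

    def update(self, noise_value: float, speech_value: float) -> bool:
        """Return True when the recognition operation should be performed."""
        if noise_value > self.noise_threshold:
            if speech_value > self.speech_threshold:
                # Noisy environment and a sufficiently clear utterance:
                # recognize, then raise the second threshold.
                self.speech_threshold = min(
                    self.speech_threshold + self.step,
                    self.max_speech_threshold,
                )
                return True
            return False
        # Assumption: in a quiet environment the second threshold is
        # lowered and recognition proceeds.
        self.speech_threshold = max(
            self.speech_threshold - self.step,
            self.min_speech_threshold,
        )
        return True
```

For example, with `StartWordGate(noise_threshold=0.5, speech_threshold=10.0, step=2.0)`, a noisy input with a speech value of 12.0 is accepted and raises the second threshold value to 12.0, so a following noisy input with a speech value of 11.0 is rejected.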
- A control method of an electronic device includes the steps of: acquiring a noise characteristic from a sound signal received through a sound receiver; obtaining a speech characteristic from the sound signal; and, when the value indicating the noise characteristic is greater than the first threshold value and the value indicating the speech characteristic is greater than the second threshold value, performing a recognition operation regarding the user's utterance based on the sound signal and adjusting the second threshold value to increase.
- the speech characteristic may include a signal-to-noise ratio of the sound signal.
- the method may further include calculating a ratio of the noise to the sound signal for each frame of the sound signal, and determining an average value of the calculated ratio for each frame as a value of the speech characteristic.
- The method may further include identifying whether the sound signal includes a predefined start word, and identifying whether the value of the noise characteristic of the sound signal identified as including the start word is greater than the first threshold value.
- the step of identifying whether the start word is included may include identifying whether the start word is included in the sound signal based on a similarity between the waveform of the sound signal and a predefined start word pattern.
- the threshold of similarity may be preset based on a learning algorithm using an acoustic model.
- The method may further include performing a recognition operation regarding the user's utterance based on a sound signal whose similarity satisfies a fourth threshold value greater than the third threshold value.
- the method may further include the step of identifying whether a value of a noise characteristic of a sound signal received in a section having a predefined time length before the section identified as including the starting word is greater than a first threshold value.
- the method may further include adjusting the second threshold to be lowered.
- In a recording medium storing a computer program including computer-readable code for performing a control method of an electronic device, the control method includes: acquiring a noise characteristic from a sound signal received through a sound receiver; obtaining a speech characteristic from the sound signal; and, when the value indicating the noise characteristic is greater than the first threshold value and the value indicating the speech characteristic is greater than the second threshold value, performing a recognition operation regarding the user's utterance based on the sound signal and adjusting the second threshold value to increase.
- According to the electronic device and the control method thereof of the present invention, resetting the threshold value for identifying the user's speech characteristic with respect to a sound signal in a noisy environment induces the user to utter the start word in a loud voice, so that an improvement in the accuracy of the operation can be expected.
- According to the electronic device and the control method thereof of the present invention, malfunctions in which the electronic device, in a noisy environment, incorrectly recognizes a sound signal containing ambient noise rather than an actual user utterance as the start word are reduced, which has the effect of improving accuracy.
- FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
- FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
- FIG. 3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
- FIG. 4 is a flowchart illustrating a method of controlling an electronic device according to an embodiment of the present invention.
- FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention.
- FIG. 6 is a view for explaining the identification of noise characteristics of an electronic device according to an embodiment of the present invention.
- A 'module' or 'unit' performs at least one function or operation, may be implemented as hardware, software, or a combination of hardware and software, and may be integrated into at least one module.
- at least one of the plurality of elements refers to all of the plurality of elements as well as each one or a combination thereof excluding the rest of the plurality of elements.
- FIG. 1 illustrates a voice recognition system including an electronic device according to an embodiment of the present invention.
- The voice recognition system may include an electronic device 10 capable of receiving a sound signal as a voice uttered by a user, that is, a sound, and a server 20 capable of communicating with the electronic device 10 through a network.
- The electronic device 10 may receive a voice uttered by a user (hereinafter also referred to as a user voice), process a sound signal corresponding to the voice, and perform a corresponding operation.
- As an operation corresponding to the received voice, the electronic device 10 can provide audio content to the user by outputting a sound corresponding to the processing result of the user's voice through the output unit (110 of FIG. 2).
- At least one loudspeaker may be provided in the electronic device 10 as the output unit 110 capable of outputting sound, and the number, shape, and installation location of the speakers provided in the electronic device 10 are not limited in the present invention.
- the electronic device 10 may be provided with a sound receiver ( 120 in FIG. 3 ) capable of receiving a sound signal as a user's voice.
- the sound receiver 120 may be implemented as at least one microphone, and the number, shape, and installation location of the microphones provided in the electronic device 10 are not limited.
- The electronic device 10 may be implemented as various devices capable of receiving a sound signal, such as an artificial intelligence speaker 10a (hereinafter also referred to as an AI speaker or a smart speaker), a display device 10b including a television such as a smart TV, and a mobile device 10c such as a smart phone or tablet.
- the electronic device 10 implemented as the AI speaker 10a may receive a voice from a user and perform various functions, such as listening to music and searching for information, through voice recognition for the received voice.
- The AI speaker is not a device that simply outputs sound; by utilizing the voice recognition function and the cloud, it is a device with a built-in virtual assistant/voice assistant that allows interaction with the user, and it can be implemented to provide a service to the user.
- an application for the AI speaker function may be installed and driven in the electronic device 10 .
- The electronic device 10 implemented as the display device 10b processes an image signal provided from an external signal supply source, that is, an image source, according to a preset process, and displays it as an image.
- The display device 10b includes a television (TV) capable of processing a broadcast signal based on at least one of a broadcast signal, broadcast information, or broadcast data provided from a transmission device of a broadcast station and displaying the same as an image.
- The display device 10b may receive a video signal from, for example, a set-top box, an optical disc playback device such as a Blu-ray or digital versatile disc (DVD) player, a computer (PC) including a desktop or laptop, a console game machine, or a mobile device including a smart pad such as a smart phone or a tablet.
- When the display device 10b is a television, the display device 10b may wirelessly receive a radio frequency (RF) signal, that is, a broadcast signal transmitted from a broadcasting station, and for this purpose an antenna for receiving the broadcast signal and a tuner for tuning the broadcast signal for each channel may be provided.
- a broadcast signal can be received through a terrestrial wave, cable, satellite, or the like, and the signal source is not limited to an external device or a broadcasting station. That is, any device or station capable of transmitting and receiving data may be included in the image source of the present invention.
- the standard of the signal received from the display device 10b may be configured in various ways corresponding to the implementation form of the device.
- Corresponding to the implementation form of the interface unit 140 (see FIG. 2) to be described later, the display device 10b may receive image content by wire according to standards such as High Definition Multimedia Interface (HDMI), HDMI Consumer Electronics Control (HDMI-CEC), DisplayPort (DP), Digital Visual Interface (DVI), composite video, component video, super video, Thunderbolt, RGB cable, SCART (Syndicat des Constructeurs d'Appareils Radiorecepteurs et Televiseurs), and USB.
- the display apparatus 10b may receive image content from a server or the like provided for content provision through wired or wireless network communication, and the type of communication is not limited.
- Corresponding to the implementation form of the interface unit 140 to be described later, the display device 10b may receive image content through wireless network communication such as Wi-Fi, Wi-Fi Direct, Bluetooth, Bluetooth low energy, Zigbee, Ultra-Wideband (UWB), and Near Field Communication (NFC).
- the display apparatus 10b may receive a content signal through wired network communication such as Ethernet.
- the display apparatus 10b may serve as an AP that allows various peripheral devices such as a smartphone to perform wireless communication.
- the display apparatus 10b may receive the content provided in the form of a file according to real-time streaming through the wired or wireless network as described above.
- The display apparatus 10b may process a signal to display, on the screen, a video, a still image, an application, an on-screen display (OSD), and a user interface (UI, hereinafter also referred to as a graphic user interface (GUI)) for controlling various operations, based on signals/data stored in internal/external storage media.
- the display device 10b may operate as a smart TV or an Internet Protocol TV (IP TV).
- A smart TV is a television that can receive and display broadcast signals in real time and that has a web browsing function, so that various contents can be searched and consumed through the Internet while real-time broadcast signals are displayed, and that can provide a convenient user environment for this purpose.
- Since the smart TV includes an open software platform, it can provide interactive services to users. Accordingly, the smart TV may provide a user with various contents, for example, an application providing a predetermined service through the open software platform.
- These applications are applications that can provide various types of services, and include, for example, applications that provide services such as SNS, finance, news, weather, maps, music, movies, games, and e-books.
- an application for providing a voice recognition function may be installed on the display device 10b.
- a display capable of displaying an image may be provided in the electronic device 10 .
- The implementation method of the display is not limited, and the display may be implemented in various display methods such as liquid crystal, plasma, light-emitting diode, organic light-emitting diode, surface-conduction electron-emitter, carbon nano-tube, and nano-crystal.
- the electronic device 10 may communicate with various external devices including the server 20 through the interface unit 140 .
- The electronic device 10 is implemented to be able to communicate with an external device through various types of wired or wireless connection (e.g., Bluetooth, Wi-Fi, or Wi-Fi Direct).
- the server 20 is provided to perform wired or wireless communication with the electronic device 10 .
- The server 20 is implemented, for example, in a cloud type, and can store and manage user accounts of the electronic device 10 and/or an additional device associated with the electronic device 10 (e.g., a smart phone in which a corresponding application is installed to interwork with an AI speaker).
- The implementation form of the server 20 is not limited; as an example, it may be implemented as an STT (Speech to Text) server that converts a sound signal related to voice into text, or as a main server related to voice recognition that also performs the function of the STT server.
- the server 20 may be provided in plurality, such as the STT server and the main server, so that the electronic device 10 may communicate with the plurality of servers.
- The server 20 may be provided with a database (DB) in which data, that is, information, for recognizing a voice uttered by a user is stored.
- the database may include, for example, a plurality of acoustic models predetermined by modeling signal characteristics of a voice.
- the database may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
- the acoustic model and/or the language model may be configured by performing learning in advance.
- the electronic device 10 can identify and process the received user voice, and output the processing result through sound or image.
- FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present invention.
- The electronic device 10 includes an output unit 110, a sound receiving unit 120, a signal processing unit 161, an interface unit 140, a storage unit 150, and a processor 160.
- The configuration of the electronic device 10 according to an embodiment of the present invention shown in FIG. 2 is only an example, and the electronic device according to another embodiment may be implemented in a configuration other than the configuration shown in FIG. 2. That is, the electronic device of the present invention may be implemented in a form in which a configuration other than the configuration shown in FIG. 2 is added or at least one of the configurations shown in FIG. 2 is excluded.
- The output unit 110 outputs sound.
- The output unit 110 may include, for example, at least one speaker capable of outputting sound in an audible frequency band of 20 Hz to 20 kHz.
- the output unit 110 may output a sound corresponding to an audio signal/sound signal of a plurality of channels.
- the output unit 110 may output a sound according to the processing of the sound signal as a user voice received through the sound receiving unit 120 .
- the sound receiver 120 may receive a voice uttered by a user, that is, a sound wave.
- the sound wave input through the sound receiving unit 120 is converted into an electrical signal by the signal converting unit.
- the signal converter may include an AD converter that converts analog sound waves into digital signals.
- the signal conversion unit may be included in a signal processing unit 161 to be described later.
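- As a rough illustration of the digital side of such a conversion, the sketch below quantizes normalized sample amplitudes to signed 16-bit integers, a common PCM representation. The bit depth, the clipping behavior, and the function name are assumptions; the passage above does not specify how the AD converter represents samples.

```python
from typing import List, Sequence


def quantize_pcm16(samples: Sequence[float]) -> List[int]:
    """Map amplitudes in [-1.0, 1.0] to signed 16-bit integer codes.

    Values outside the range are clipped before quantization.
    """
    codes = []
    for x in samples:
        clipped = max(-1.0, min(1.0, x))
        codes.append(int(round(clipped * 32767)))
    return codes
```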
- the sound receiver 120 is implemented to be provided in the electronic device 10 by itself.
- the sound receiving unit 120 may be implemented as a form provided in a separate device, not a component included in the electronic device 10 .
- For example, when the electronic device 10 is a display device such as a television, a user voice may be received through a microphone, that is, a sound receiver, installed in a remote control provided as an input device that the user can operate, and a sound signal corresponding to the user voice may be transmitted from the remote control to the electronic device 10.
- the analog sound wave received through the microphone of the remote control may be converted into a digital signal and transmitted to the electronic device 10 .
- the input device includes a terminal device, such as a smartphone, on which a remote control application is installed.
- the interface unit 140 allows the electronic device 10 to transmit or receive signals with various external devices including the server 20 and the terminal device.
- the interface unit 140 may include a wired interface unit 141 .
- The wired interface unit 141 can include a connection unit for transmitting/receiving signals/data according to standards such as HDMI, HDMI-CEC, USB, Component, DisplayPort (DP), DVI, Thunderbolt, and RGB cable. Here, the wired interface unit 141 may include at least one connector, terminal, or port corresponding to each of these standards.
- the wired interface unit 141 may be implemented in a form including an input port for receiving a signal from an image source, etc., and may further include an output port in some cases to transmit/receive signals in both directions.
- The wired interface unit 141 may include a connector or port to which an antenna capable of receiving a broadcast signal according to a broadcasting standard such as terrestrial/satellite broadcasting, or a cable capable of receiving a broadcast signal according to a cable broadcasting standard, can be connected, and a connector or port according to a video and/or audio transmission standard, such as an HDMI port, DisplayPort, DVI port, Thunderbolt, composite video, component video, super video, or SCART.
- the electronic device 10 may have a built-in antenna capable of receiving a broadcast signal.
- the wired interface unit 141 may include a connector or port according to a universal data transmission standard such as a USB port.
- the wired interface unit 141 may include a connector or a port to which an optical cable can be connected according to an optical transmission standard.
- the wired interface unit 141 is connected to an external microphone or an external audio device having a microphone, and may include a connector or a port capable of receiving or inputting an audio signal from the audio device.
- The wired interface unit 141 is connected to an audio device such as a headset, earphone, or external speaker, and may include a connector or port capable of transmitting or outputting an audio signal to the audio device.
- the wired interface unit 141 may include a connector or port according to a network transmission standard such as Ethernet.
- the wired interface unit 141 may be implemented as a LAN card connected to a router or a gateway by wire.
- The wired interface unit 141 is connected by wire, through the connector or port, in a 1:1 or 1:N (N is a natural number) manner to an external device such as a set-top box, an optical media playback device, an external display device, a speaker, or a server, thereby receiving a video/audio signal from the corresponding external device or transmitting a video/audio signal to the corresponding external device.
- the wired interface unit 141 may include a connector or a port for separately transmitting video/audio signals.
- the wired interface unit 141 is embedded in the electronic device 10 , but may be implemented in the form of a dongle or a module to be detachably attached to the connector of the electronic device 10 .
- the interface unit 140 may include a wireless interface unit 142 .
- the wireless interface unit 142 may be implemented in various ways corresponding to the implementation form of the electronic device 10 .
- The wireless interface unit 142 may use wireless communication methods such as RF (radio frequency), Zigbee, Bluetooth, Wi-Fi, UWB (Ultra-Wideband), and NFC (Near Field Communication).
- The wireless interface unit 142 may be implemented as communication circuitry including a wireless communication module (S/W module, chip, etc.) corresponding to various types of communication protocols.
- the wireless interface unit 142 includes a wireless LAN unit.
- the wireless LAN unit may be wirelessly connected to an external device through an access point (AP) under the control of the processor 160 .
- the wireless LAN unit includes a WiFi module.
- the wireless interface unit 142 includes a wireless communication module that supports one-to-one direct communication between the electronic device 10 and an external device wirelessly without an access point.
- the wireless communication module may be implemented to support communication methods such as Wi-Fi Direct, Bluetooth, and Bluetooth low energy.
- the storage unit 150 may store identification information (eg, MAC address or IP address) on the external device, which is a communication target device.
- the wireless interface unit 142 is provided to perform wireless communication with an external device by at least one of a wireless LAN unit and a wireless communication module according to performance.
- the wireless interface unit 142 may further include a communication module using various communication methods such as mobile communication (for example, LTE), EM communication including a magnetic field, and visible light communication.
- the wireless interface unit 142 may transmit and receive data packets to and from the server by wirelessly communicating with the server on the network.
- the wireless interface unit 142 may include an IR transmitter and/or an IR receiver capable of transmitting and/or receiving an IR (Infrared) signal according to an infrared communication standard.
- the wireless interface unit 142 may receive or input a remote control signal from the remote control or other external device through the IR transmitter and/or the IR receiver, or may transmit or output a remote control signal to another external device.
- the electronic device 10 may transmit/receive a remote control signal to and from the remote control or other external device through the wireless interface unit 142 of another method such as Wi-Fi or Bluetooth.
- the electronic device 10 may further include a tuner for tuning the received broadcast signal for each channel.
- the wireless interface unit 142 may transmit predetermined data as information of the user's voice received through the sound receiving unit 120 to an external device, that is, the server 20 .
- the form/type of the transmitted data is not limited, and for example, an audio signal corresponding to a voice uttered by a user or a voice characteristic extracted from the audio signal may be included.
- the wireless interface unit 142 may receive data of the processing result of the user's voice from the server 20 .
- the electronic device 10 outputs a sound corresponding to the voice processing result through the output unit 110 based on the received data.
- the above-described embodiment is an example, and the user's voice may not be transmitted to the server 20 , but may be processed by itself within the electronic device 10 . That is, in another embodiment, the electronic device 10 can be implemented to perform the role of the STT server.
- the electronic device 10 may communicate with an input device such as a remote control through the wireless interface unit 142 to receive a sound signal corresponding to the user's voice from the input device.
- the wireless interface unit 142 may receive a sound signal corresponding to the user's voice from the input device.
- the communication module communicating with the server 20 and the communication module communicating with the remote controller may be different from each other.
- the electronic device 10 may communicate with the server 20 through an Ethernet modem or Wi-Fi module, and may communicate with the remote controller through a Bluetooth module.
- the communication module communicating with the server 20 and the communication module communicating with the remote control may be the same.
- the electronic device 10 may communicate with the server 20 and the remote controller through the Bluetooth module.
- the storage unit 150 is configured to store various data of the electronic device 10 .
- the storage unit 150 should retain data even when power supplied to the electronic device 10 is cut off, and may be provided as a writable nonvolatile memory (writable ROM) to reflect changes. That is, the storage unit 150 may be provided with any one of a flash memory, an EPROM, or an EEPROM.
- the storage unit 150 may further include a volatile memory such as DRAM or SRAM, whose read and write speeds are faster than those of the nonvolatile memory.
- the data stored in the storage unit 150 includes, for example, an operating system for driving the electronic device 10 , and various software, programs, applications, and additional data executable on the operating system.
- the applications stored and installed in the storage unit 150 of the electronic device 10 may include an AI speaker application that recognizes a user voice received through the sound receiver 120 and performs an operation accordingly.
- the AI speaker application may be activated when an input of a predetermined keyword, that is, a trigger word, through the sound receiver 120 , a user operation on a specific button of the electronic device 10 , or the like is identified.
- the activation of the application may include switching the execution state of the application from the background mode to the foreground mode.
- the storage unit 150 may include a database 151 in which data, that is, information, for recognizing a user voice that can be received through the sound receiving unit 120 is stored.
- the database 151 may include, for example, a plurality of acoustic models determined in advance by modeling signal characteristics of speech.
- the database 151 may further include a language model determined in advance by modeling a linguistic order relationship such as words or syllables corresponding to the recognition target vocabulary.
- the database in which information for recognizing a user's voice is stored may be provided in the server 20, which is an example of an external device accessible by a wired or wireless network through the wireless interface unit 142 as described above.
- the server 20 may be implemented, for example, in a cloud type.
- the processor 160 controls all components of the electronic device 10 to operate.
- the processor 160 executes instructions included in a control program to perform such a control operation.
- the processor 160 includes at least one general-purpose processor, for example, a CPU (Central Processing Unit) or an application processor (AP), that loads at least a part of the control program from the non-volatile memory in which the control program is installed into the volatile memory and executes the loaded control program.
- the processor 160 may include a single core, a dual core, a triple core, a quad core, or a multiple thereof.
- the processor 160 may include a plurality of processors, for example, a main processor and a sub-processor that operates in a sleep mode (for example, a mode in which only standby power is supplied and the device does not operate as an electronic device receiving a sound signal).
- the processor, the ROM, and the RAM are interconnected through an internal bus, and the ROM and the RAM are included in the storage unit 150 .
- a CPU or an application processor which is an example of implementing the processor 160 , may be implemented as a form included in a main SoC mounted on a PCB embedded in the electronic device 10 .
- the control program may include program(s) implemented in the form of at least one of a BIOS, a device driver, an operating system, firmware, a platform, and an application program (application).
- the application program may be pre-installed or stored in the electronic device 10 when the electronic device 10 is manufactured, or may be installed in the electronic device 10 based on data of the application program received from the outside at a later time of use. Data of the application program may be downloaded to the electronic device 10 from, for example, an external server such as an application market. Such an application program, external server, and the like are examples of the computer program product of the present invention, but the present invention is not limited thereto.
- the processor 160 may include a signal processing unit 161 as shown in FIG. 2 .
- the signal processing unit 161 processes an audio signal, that is, a sound signal.
- the sound signal processed by the signal processing unit 161 may be output as sound through the output unit 110 to provide audio content to the user.
- the signal processing unit 161 is a software block of the processor 160 , and may be implemented in a form that performs one function of the processor 160 .
- the signal processing unit 161 may be a configuration separate from the CPU or application processor (AP) that is an example of implementing the processor 160 ; for example, it may be implemented as a microprocessor such as a digital signal processor (DSP), as an integrated circuit (IC), or by a combination of hardware and software.
- the processor 160 may include a voice recognition module 162 capable of recognizing a voice signal uttered by a user, as shown in FIG. 2 .
- FIG. 3 is a block diagram illustrating a configuration of a voice recognition module of an electronic device according to an embodiment of the present invention.
- the voice recognition module 162 receives the user's utterance as an input and, in response to an input of a predetermined start word (hereinafter, also referred to as a trigger word or a wake-up word (WUW)), may be implemented to initiate an operation for voice recognition.
- the voice recognition module 162 may include a preprocessor 301 , a start word engine 302 , a threshold value determiner 303 , and a voice recognition engine 304 , as shown in FIG. 3 .
- the preprocessor 301 may receive a voice signal according to the user's utterance from the sound receiver 120 and perform preprocessing for removing ambient noise, that is, noise.
- the pre-processing may include processes such as digital signal conversion, filtering, framing, and the like, and a meaningful voice signal can be extracted by removing unnecessary ambient noise from the voice signal according to the above processes.
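- the framing step mentioned above can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames`, the frame length, and the hop size are assumptions, since the text does not specify how frames are formed.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a sample sequence into fixed-length, possibly overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    # The last valid start index is len(samples) - frame_len.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(list(samples[start:start + frame_len]))
    return frames
```

- a hop smaller than the frame length yields overlapping frames, which is a common choice in speech front ends, although the text does not state whether overlap is used here.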
- the start word engine 302 performs pattern matching by comparing features extracted from the pre-processed speech signal with a predetermined pattern.
- the start word engine 302 may perform pattern matching using an acoustic model configured by performing pre-learning.
- the start word engine 302 can identify whether the input utterance includes a start word based on the similarity between the input utterance, that is, the waveform of the voice signal (sound signal) according to the user's utterance, and the start word pattern of the acoustic model.
- the start word engine 302 can identify that the input utterance includes the start word when, as a result of the comparison by pattern matching, the score of the input utterance, that is, the utterance score, is greater than a predetermined start word threshold (WUW Threshold).
- the threshold of similarity, that is, the starting word threshold (WUW Threshold), may be preset based on a learning algorithm using an acoustic model.
- the starting word threshold (WUW Threshold) is defined as a condition for activating the voice recognition function of the electronic device 10 .
- the starting word threshold (WUW Threshold) is distinguished from the noise threshold and SNR threshold respectively used for comparison with the noise characteristic and noise-to-speech characteristic of a sound signal, which will be described later.
- the electronic device 10 may be implemented to use two starting word thresholds set to different values when the user's utterance is made in a noisy environment.
- a specific example of applying the two starting word thresholds in such a noisy environment will be described in more detail in the embodiment of FIG. 4 to be described later.
- the threshold value determining unit 303 identifies whether the user's utterance is made in a noisy environment using a predetermined noise threshold.
- the identification of the noisy environment may be made based on a comparison between the power of the sound signal in a specific section, taken as the noise characteristic of the sound signal according to the user's utterance, and the noise threshold.
- the threshold value determining unit 303 identifies, using a predetermined SNR threshold, whether the ratio of the sound signal to the noise, taken as the speech characteristic of the sound signal according to the user's utterance, is equal to or greater than a specific level.
- the threshold value determining unit 303 can change the SNR threshold based on the result of comparing the noise characteristic of the sound signal with the noise threshold, or the result of comparing the speech characteristic of the sound signal with the SNR threshold, as described above. Changing the SNR threshold may include, for example, adjusting the value upward or downward. A specific example of changing the SNR threshold will be described in more detail in the embodiment of FIG. 4 to be described later.
- the voice recognition engine 304 may be implemented to include a voice recognition function for a voice signal received as a user's utterance, that is, a sound signal, so as to perform a recognition operation on the user's utterance.
- when the user's utterance is made in a noisy environment, the voice recognition function of the voice recognition engine 304 may be activated if the two-stage activation conditions based on the two starting word thresholds are satisfied, and the electronic device 10 may then perform a recognition operation on the user's utterance based on the received sound signal.
- a specific example in which the voice recognition function is activated according to the activation conditions of these two steps will be described in more detail in the embodiment of FIG. 4 to be described later.
- the voice recognition function of the voice recognition engine 304 may be performed using one or more voice recognition algorithms.
- the voice recognition engine 304 extracts a vector representing a voice feature from a voice signal uttered by a user, and compares the extracted vector with an acoustic model of the database 151 or the server 20 to perform voice recognition.
- the acoustic model is, as an example, a model obtained through previously performed learning.
- an example in which the voice recognition module 162 , comprising the preprocessor 301 , the start word engine 302 , the threshold value determining unit 303 , and the voice recognition engine 304 , is implemented as an embedded type has been described, but the present invention is not limited thereto. Accordingly, the voice recognition module 162 may be implemented as a configuration of the electronic device 10 separate from the CPU, for example, a separate chip such as a microcomputer provided as a dedicated processor for the voice recognition function.
- each component of the voice recognition module 162 , namely the preprocessor 301 , the start word engine 302 , the threshold value determining unit 303 , and the voice recognition engine 304 , may be implemented as a software block as an example. In some cases, at least one component may be omitted, or at least one other component may be added.
- it will be understood that, for the electronic device 10 to perform the voice recognition function, operations performed by at least one of the aforementioned preprocessor 301 , start word engine 302 , threshold value determining unit 303 , and voice recognition engine 304 are performed by the processor 160 of the electronic device 10 .
- the processor 160 identifies whether a value representing the noise characteristic of the sound signal received through the sound receiver 120 is greater than a noise threshold (hereinafter, also referred to as a first threshold value), and whether a value representing the speech characteristic of the sound signal is greater than an SNR threshold (hereinafter, also referred to as a second threshold value). When the value of the noise characteristic is greater than the first threshold value and the value of the speech characteristic is greater than the second threshold value, the processor 160 may perform a recognition operation on the user's utterance based on the received sound signal and adjust the second threshold value, that is, the SNR threshold, upward.
- the processor 160 may identify whether the value of the noise characteristic and the value of the speech characteristic are greater than the first threshold value and the second threshold value, respectively, for a sound signal whose waveform has a similarity to the predefined start word pattern greater than a first start word threshold (hereinafter, also referred to as a third threshold value), that is, a sound signal that satisfies the first activation condition.
- when it is identified that the value of the noise characteristic of the received sound signal is equal to or less than the first threshold value, the processor 160 may perform a recognition operation on the user's utterance based on the received sound signal and adjust the second threshold value, that is, the SNR threshold, downward.
- when the similarity between the waveform of the sound signal and the starting word pattern is greater than a second starting word threshold (hereinafter, also referred to as a fourth threshold value), that is, when the second activation condition is satisfied, the processor 160 may perform a recognition operation on the user's utterance based on the received sound signal.
- the operation of the processor 160 may be implemented as a computer program stored in a computer program product (not shown) provided separately from the electronic device 10 .
- the computer program product includes a memory in which instructions corresponding to the computer program are stored, and a processor.
- the instructions, when executed by the processor 160 , include performing a recognition operation on the user's utterance based on the received sound signal and adjusting the second threshold value upward if the value indicating the noise characteristic of the sound signal received through the sound receiving unit 120 is greater than the first threshold value and the value indicating the speech characteristic of the sound signal is greater than the second threshold value.
- the instructions include, if the value representing the noise characteristic of the received sound signal is equal to or less than the first threshold value, performing a recognition operation on the user's utterance based on the received sound signal and adjusting the second threshold value downward.
- the processor 160 of the electronic device 10 may download and execute a computer program stored in a separate computer program product to perform the above-described operation of the instruction.
- FIG. 4 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present invention.
- FIG. 5 is a diagram for explaining pattern matching for activating a voice recognition function in an electronic device according to an embodiment of the present invention.
- FIG. 6 is a view for explaining the identification of noise characteristics of an electronic device according to an embodiment of the present invention.
- the electronic device 10 may receive a sound signal through the sound receiver 120 ( 401 ).
- the received sound signal may be a signal according to the user's utterance.
- the processor 160 may identify whether the sound signal received in step 401 satisfies the first activation condition for the voice recognition function (step 402).
- the processor 160 can identify whether the first activation condition is satisfied by performing pattern matching between the sound signal from which ambient noise has been removed and a predefined start word signal, as shown in FIG. 5 .
- the processor 160 derives an utterance score as the similarity between the user's utterance, that is, the waveform of the sound signal, and the pattern of the start word signal, based on the pattern matching shown in FIG. 5 , and can identify, using Equation 1 below, whether the derived utterance score, that is, the degree of similarity, is greater than the predetermined first start word threshold WUW Threshold1, that is, the third threshold value.
- the first starting word threshold (third threshold value) is for identifying whether the sound signal satisfies the first activation condition for the voice recognition function, and is applied regardless of whether the user's utterance is made in a noisy environment.
- the first starting word threshold may be preset to, for example, 0.1, but this value is only an example and is not limiting.
- the processor 160 may determine that the sound signal input in step 401 satisfies the first activation condition when it is identified that the utterance score is greater than the first starting word threshold by Equation (1).
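- Equation 1 is not reproduced in this text. Based on the surrounding description, the first activation condition presumably takes the following form, where the symbol Score for the utterance score is an assumed name:

```latex
\text{Score} > \text{WUW Threshold}_1
```

- with the example value above, an utterance score of 0.12 would satisfy this condition, since 0.12 > 0.1.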
- the processor 160 can identify whether the value representing the noise characteristic of the sound signal according to the user's utterance is greater than a predetermined noise threshold, that is, the first threshold value (403).
- the first threshold value is for identifying whether the user's surroundings are a noisy environment, and may be preset to correspond to a power value of a sound signal when a sufficiently loud noise is present in the surroundings.
- the processor 160 identifies a section including the start word uttered by the user (hereinafter, also referred to as a start word section), and can identify whether a value indicating the noise characteristic of the signal received in a section of predefined time length before the start word section (hereinafter, also referred to as a noise characteristic confirmation section) is greater than the first threshold value.
- the sound signal received by the sound receiving unit 120 in a streaming manner can be temporarily stored, in units of consecutive frames, in a First In First Out (FIFO) queue type data structure, as shown in FIG. 6 . That is, when the next frame is received, the streaming sound signal is stored in such a way that the first stored frame is pushed out.
- the length of the sound signal to be stored may be preset to correspond to the storage space, and for example, it may be implemented such that a signal having a length of 2.5 seconds is stored.
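- the FIFO storage described above can be sketched with a bounded queue. This is only an illustrative sketch: the class name `FrameBuffer` and the 10 ms frame duration are assumptions, while the 2.5 second length is the example value given in the text.

```python
from collections import deque


class FrameBuffer:
    """Fixed-capacity FIFO of audio frames; the oldest frame is pushed out first."""

    def __init__(self, buffer_seconds: float = 2.5, frame_seconds: float = 0.01):
        # Capacity in frames; round() guards against floating-point error in the division.
        self.capacity = round(buffer_seconds / frame_seconds)
        self._frames = deque(maxlen=self.capacity)

    def push(self, frame):
        # With maxlen set, appending to a full deque discards the oldest (leftmost) item.
        self._frames.append(frame)

    def frames(self):
        # Stored frames, oldest first.
        return list(self._frames)
```

- with these assumed defaults the buffer holds 250 frames, so the frame pushed 251 frames ago is no longer available once a new frame arrives.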
- the processor 160 may monitor whether a start word according to the user's utterance is included in each frame of the streaming sound signal received and stored in units of consecutive frames as described above. Based on the monitoring, when it is detected, for example as described in step 402, that the utterance score in a specific signal frame is greater than the first start word threshold, the processor 160 can identify the corresponding signal frame as containing the user's utterance, that is, the start word.
- the processor 160 may identify a section of predetermined time length ending at the signal frame identified in step 402, for example, a time section of up to about 1 second before that frame, as the start word section. In addition, the processor 160 may identify a section of predetermined time length before the identified start word section, for example, a time section of about 1.5 seconds, as the noise characteristic confirmation section.
- the noise characteristic confirmation section may be defined to correspond to the time obtained by subtracting the time of the start word section from the time of the entire stored sound signal. In the present invention, the time lengths of the start word section and the noise characteristic confirmation section are not limited to the examples presented.
- the processor 160 compares the signal power of the noise characteristic confirmation section with the first threshold value, and can thereby identify whether the surrounding environment is sufficiently noisy at the time of the utterance, that is, whether the user's utterance is made in a noisy environment.
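- the comparison in step 403 can be sketched as follows. The function names, the use of mean squared amplitude as the power measure, and the frame-based split of the buffer are assumptions made for illustration; the text only states that the power of the section before the start word section is compared with the first threshold value.

```python
def mean_power(frames):
    """Average squared amplitude over all samples in the given frames."""
    total = 0.0
    count = 0
    for frame in frames:
        for sample in frame:
            total += sample * sample
            count += 1
    return total / count if count else 0.0


def is_noisy_environment(stored_frames, start_word_frames, noise_threshold):
    """Return True when the section preceding the start word exceeds the noise threshold.

    stored_frames: buffered frames, oldest first, ending at the detection frame.
    start_word_frames: number of most recent frames treated as the start word section.
    """
    # Everything before the start word section is the noise characteristic confirmation section.
    split = max(0, len(stored_frames) - start_word_frames)
    noise_section = stored_frames[:split]
    return mean_power(noise_section) > noise_threshold
```

- for the example lengths in the text (2.5 seconds stored, about 1 second of start word), the noise section would be roughly the oldest 1.5 seconds of the buffer.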
- if it is identified in step 403 that the noise characteristic of the sound signal, that is, the signal power, is greater than the first threshold value, the processor 160 can identify whether the speech characteristic of the sound signal is greater than a predetermined SNR threshold, that is, the second threshold value (404).
- the speech characteristic may include a signal to noise ratio (SNR) of a sound signal.
- the processor 160 calculates a posterior SNR (SNR_post), corresponding to the noise ratio to the total sound signal, as the speech characteristic of the sound signal, and can identify, using Equation 2 below, whether the calculated posterior SNR is greater than a predetermined second threshold value, that is, the SNR threshold.
- the posterior SNR (SNR_post) may be calculated using Equations 3 and 4 below.
- X(p,k) represents the total sound signal including noise
- S(p,k) represents the speech signal
- N(p,k) represents the noise signal, respectively.
- the received input sound signal (voice signal) X may be expressed, for the k-th spectral element of each frame p, as the sum of the speech element S and the noise element N, as shown in Equation (3).
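- Equation 3 is not reproduced in this text. From the definitions of X(p,k), S(p,k), and N(p,k) given above, the additive model it describes can be written as:

```latex
X(p,k) = S(p,k) + N(p,k)
```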
- the posterior SNR (SNR_post) can be calculated for each frame p from the ratio between the magnitude of the noise N(p,k) and the total sound signal X(p,k) including the noise, as expressed in Equation 4 below.
- the final posterior SNR for all frames may be calculated as an average value of posterior SNRs for each frame p.
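- Equation 4 is not reproduced in this text, and the translated wording leaves the direction of the ratio ambiguous. The sketch below therefore assumes the conventional definition of a posterior SNR, namely total signal power divided by noise power, summed over the spectral elements k of each frame and then averaged over frames. The function name, the power-based ratio, and the availability of a noise estimate N(p,k) are all assumptions.

```python
def posterior_snr(total_spectra, noise_spectra, eps=1e-12):
    """Frame-averaged ratio of total signal power to noise power.

    total_spectra[p][k] holds X(p,k); noise_spectra[p][k] holds a noise estimate N(p,k).
    Values may be real or complex spectral coefficients.
    """
    if not total_spectra or len(total_spectra) != len(noise_spectra):
        raise ValueError("need the same non-zero number of frames in both inputs")
    per_frame = []
    for x_frame, n_frame in zip(total_spectra, noise_spectra):
        x_power = sum(abs(x) ** 2 for x in x_frame)
        n_power = sum(abs(n) ** 2 for n in n_frame)
        # eps avoids division by zero for silent noise estimates.
        per_frame.append(x_power / (n_power + eps))
    # Final posterior SNR: average of the per-frame values.
    return sum(per_frame) / len(per_frame)
```

- under this assumed definition a larger value means the utterance stands out more clearly from the background, which is consistent with the text triggering recognition when the value exceeds the SNR threshold.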
- the processor 160 determines the final posterior SNR calculated in this way as the speech characteristic of the sound signal in step 404 and, by comparing the speech characteristic with the second threshold value (SNR Threshold), can identify whether the user's utterance is sufficiently loud in the noisy environment.
- the SNR threshold is a predetermined value corresponding to a level at which an input sound signal can be recognized as generated by a user's utterance in a noisy environment, and its initial value is set in advance.
- the initial SNR threshold may be set to, for example, 4, but is not limited thereto.
- if the electronic device 10 operates in a sufficiently noisy environment (YES in step 403), misrecognition may occur in step 402, in which a sound signal containing ambient noise rather than the user's actual utterance is regarded as including the starting word.
- accordingly, when the value of the noise characteristic of the sound signal is greater than the first threshold value, the speech characteristic of the sound signal is further compared with the second threshold value (initial SNR threshold) in step 404 .
- step 404 thus further determines whether the user's utterance is sufficiently loud in the noisy environment, and execution of the trigger for performing the voice recognition operation can be controlled based on the result.
- when it is identified in step 404 that the value of the speech characteristic of the sound signal is greater than the predetermined second threshold value, for example, the initial SNR threshold, the processor 160 executes a trigger and controls the electronic device 10 to perform a voice recognition operation on the user's utterance based on the received sound signal ( 405 ).
- for example, if the final posterior SNR calculated in step 404 is 5, which is greater than the initial SNR threshold of 4, the trigger may be executed.
- if it is identified that the user's utterance is sufficiently loud (YES in step 404), the processor 160 immediately executes the trigger, thereby activating the voice recognition function in the electronic device 10 so that an operation can be performed in response to the received sound signal.
- the processor 160 may adjust the second threshold upward from the predetermined initial SNR threshold ( 406 ).
- the processor 160 may reset the second threshold value to reflect the noisy environment as a change in the surrounding environment after executing the trigger in step 405 .
- the processor 160 can derive a new second threshold value (SNR Threshold) according to Equation 5 below, using the initial SNR threshold (SNR_Th_init) and the posterior SNR (SNR_post) calculated in step 404.
- the second threshold value adjusted upward as described above is the value applied in step 404 when the next sound signal is received.
- by adjusting the second threshold value (SNR threshold) upward in this way, the trigger execution condition is raised correspondingly in the noisy environment, which can induce the user to utter in a louder voice.
- when it is identified in step 404 that the value of the speech characteristic of the sound signal is less than or equal to the predetermined second threshold value, for example, the initial SNR threshold, the processor 160 may further identify whether the corresponding sound signal satisfies the second activation condition ( 407 ).
- the processor 160 can identify, using Equation 6 below, that the sound signal satisfies the second activation condition when the utterance score, derived in step 402 as the similarity between the waveform of the sound signal and the start word pattern, is greater than a predetermined second start word threshold (WUW Threshold2), that is, the fourth threshold value.
- the second starting word threshold WUW Threshold2 (fourth threshold value) is for identifying whether the sound signal satisfies the second activation condition for the voice recognition function, and is applied when the user's utterance is made in a noisy environment (YES in step 403).
- the second starting word threshold WUW Threshold2 (fourth threshold value) is set to a value greater than the first starting word threshold WUW Threshold1 (third threshold value) of step 402, as shown in Equation 7 below.
- for example, the first starting word threshold (third threshold value) may be preset to 0.1 and the second starting word threshold (fourth threshold value) to 0.15, but these values are presented only as examples and are not limiting.
- the processor 160 may determine that the sound signal input in step 401 satisfies the second activation condition when it is identified that the utterance score is greater than the second starting word threshold by Equation (6).
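- Equations 6 and 7 are not reproduced in this text. Based on the surrounding description, they presumably express the second activation condition and the ordering of the two starting word thresholds, where the symbol Score for the utterance score is an assumed name:

```latex
\text{Score} > \text{WUW Threshold}_2
\qquad\text{and}\qquad
\text{WUW Threshold}_2 > \text{WUW Threshold}_1
```

- with the example values 0.1 and 0.15, an utterance score of 0.12 satisfies the first activation condition but not the second.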
- step 407 If it is determined in step 407 that the sound signal satisfies the second activation condition, the processor 160 executes a trigger to control the electronic device 10 to perform a voice recognition operation related to the user's utterance based on the received sound signal do (408).
- step 407 if it is determined in step 407 that the sound signal does not satisfy the second activation condition, that is, if the utterance score is identified as being equal to or less than the second starting word threshold by Equation 6, since the processor 160 does not execute the trigger, The electronic device 10 is controlled to keep the voice recognition inactive ( 409 ).
- In a noisy environment (YES in step 403), even if the similarity between the waveform of the input sound signal and the pattern of the start word signal is identified as greater than the first start word threshold, so that the input sound signal meets the first activation condition (YES in step 402), the processor 160 activates the voice recognition function only when the sound signal also satisfies the second activation condition (YES in step 407). Control is thus based on a two-stage activation condition.
- Because the voice recognition function is activated in step 407 only when the utterance score, which indicates the similarity obtained by pattern matching between the sound signal and the start word signal, is greater than the second start word threshold, step 407 reduces the possibility of erroneous operation even if a sound signal containing ambient noise other than the user's actual utterance is erroneously recognized in step 402 as including the start word.
- If it is identified in step 403 that the noise characteristic of the sound signal, that is, the signal power, is equal to or less than the first threshold, the processor 160 executes a trigger and controls the electronic device 10 to perform a voice recognition operation (410).
- The processor 160 may adjust the second threshold downward from a predetermined initial SNR threshold (411).
- That is, because the electronic device 10 determines that the surroundings are not noisy, the second threshold may be reset to reflect this.
- As described in step 404, the processor 160 calculates a posterior SNR (SNR post), corresponding to the ratio between the noise and the total sound signal, as the utterance characteristic of the sound signal, and may derive a new second threshold (SNR Threshold) from the calculated posterior SNR and the initial SNR threshold according to Equation 5 above.
- SNR post: posterior SNR
- Here, the final calculated posterior SNR is smaller than in the case of step 404 and may be, for example, 2.
- the new second threshold (SNR Threshold) is 4.16*log_4.
- Because the second threshold (SNR Threshold), which serves as the trigger execution condition, is lowered when the environment is not noisy, the electronic device 10 can perform voice recognition immediately even when the user speaks quietly.
- If it is determined in step 402 that the sound signal does not satisfy the first activation condition, that is, if the utterance score indicating the similarity between the sound signal and the start word signal is identified as equal to or less than the first start word threshold according to Equation 1, the processor 160 does not execute a trigger, and the electronic device 10 may be controlled to keep voice recognition inactive (412).
- When the signal-to-noise ratio (SNR), which is the utterance characteristic of the sound signal, is greater than the SNR threshold in a noisy environment, that is, when the user's utterance is sufficiently loud relative to the noise, the electronic device 10 adjusts the SNR threshold upward. This induces the user to utter the start word loudly in a noisy environment, so an improvement in operating accuracy can be expected.
- SNR: signal-to-noise ratio
- When the surrounding environment is not noisy, the electronic device 10 adjusts the SNR threshold downward, so that it can respond immediately in a quiet environment as the environment changes.
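The two-stage start word gating described in steps 402, 403, and 407 can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the names `Thresholds`, `Decision`, and `decide` are hypothetical, the value used for the first threshold (signal power) is an arbitrary placeholder, and the SNR comparison of step 404 and the threshold update of Equation 5 are deliberately left out because they are not reproduced in this text. Only the example values 0.1 and 0.15 come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    noise_power: float   # first threshold (signal power); value is an assumption
    wuw1: float = 0.1    # first start word threshold (third threshold), example value
    wuw2: float = 0.15   # second start word threshold (fourth threshold), example value


@dataclass(frozen=True)
class Decision:
    trigger: bool  # True if the voice recognition trigger is executed
    step: int      # step number of the outcome as given in the description


def decide(utterance_score: float, signal_power: float, th: Thresholds) -> Decision:
    """Hypothetical sketch of the start word gating only (no SNR handling)."""
    # Step 402: first activation condition on the utterance score.
    if utterance_score <= th.wuw1:
        return Decision(False, 412)  # voice recognition stays inactive
    # Step 403: a quiet environment triggers immediately.
    if signal_power <= th.noise_power:
        return Decision(True, 410)
    # Step 407: a noisy environment also requires the stricter second condition.
    if utterance_score > th.wuw2:
        return Decision(True, 408)
    return Decision(False, 409)


# Usage with an assumed noise-power threshold of 1.0.
th = Thresholds(noise_power=1.0)
print(decide(0.12, 0.5, th))  # quiet: score above 0.1 is enough
print(decide(0.12, 2.0, th))  # noisy: score must also exceed 0.15
```

Keeping the thresholds in one immutable object makes it easy to see that the only difference between the quiet and noisy branches is which start word threshold the same utterance score is compared against.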
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Theoretical Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Telephone Function (AREA)
Abstract
The invention relates to an electronic device and a control method therefor. The electronic device comprises: a sound receiving unit; and a processor which, when a noise characteristic acquired from a sound signal received through the sound receiving unit is greater than a first threshold value and an utterance characteristic acquired from the sound signal is greater than a second threshold value, performs a recognition operation for a user utterance based on the sound signal and then adjusts the second threshold value upward.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020190170363A KR20210078682A (ko) | 2019-12-19 | 2019-12-19 | Electronic device and control method thereof |
KR10-2019-0170363 | 2019-12-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021125784A1 (fr) | 2021-06-24 |
Family
ID=76476805
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2020/018442 WO2021125784A1 (fr) | 2019-12-19 | 2020-12-16 | Electronic device and control method thereof |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20210078682A (fr) |
WO (1) | WO2021125784A1 (fr) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11948569B2 (en) | 2021-07-05 | 2024-04-02 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
KR20230006999A (ko) * | 2021-07-05 | 2023-01-12 | 삼성전자주식회사 | Electronic device and control method thereof |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9047857B1 (en) * | 2012-12-19 | 2015-06-02 | Rawles Llc | Voice commands for transitioning between device states |
KR20170035602A (ko) * | 2015-09-23 | 2017-03-31 | 삼성전자주식회사 | Voice recognition apparatus, voice recognition method, and computer-readable recording medium |
US20170256270A1 (en) * | 2016-03-02 | 2017-09-07 | Motorola Mobility Llc | Voice Recognition Accuracy in High Noise Conditions |
KR20180018146A (ko) * | 2016-08-12 | 2018-02-21 | 삼성전자주식회사 | Display apparatus and method capable of voice recognition |
KR20190117725A (ko) * | 2017-03-22 | 2019-10-16 | 삼성전자주식회사 | Method and apparatus for processing a speech signal adaptively to a noise environment |
- 2019-12-19 KR KR1020190170363A patent/KR20210078682A/ko unknown
- 2020-12-16 WO PCT/KR2020/018442 patent/WO2021125784A1/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
KR20210078682A (ko) | 2021-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2017052082A1 (fr) | Voice recognition apparatus, voice recognition method of user device, and non-transitory computer-readable recording medium | |
WO2021049795A1 (fr) | Electronic device and operating method thereof | |
WO2018043895A1 (fr) | Display device and method for controlling display device | |
WO2021025350A1 (fr) | Electronic device managing a plurality of intelligent agents and operating method thereof | |
WO2015030474A1 (fr) | Electronic device and voice recognition method | |
WO2020231230A1 (fr) | Method and apparatus for performing speech recognition with wake on voice | |
WO2020184842A1 (fr) | Electronic device and control method thereof | |
WO2014003283A1 (fr) | Display device, method for controlling display device, and interactive system | |
WO2015194693A1 (fr) | Video display device and operating method thereof | |
WO2021125784A1 (fr) | Electronic device and control method thereof | |
WO2019013447A1 (fr) | Remote control device and associated method for receiving a user's voice | |
WO2015170832A1 (fr) | Display device and corresponding method for performing a video call | |
WO2020251122A1 (fr) | Electronic device for providing content translation service and associated control method | |
WO2020091183A1 (fr) | Electronic device for sharing user-specific voice command and control method thereof | |
WO2021002611A1 (fr) | Electronic apparatus and control method thereof | |
WO2020091519A1 (fr) | Electronic apparatus and associated control method | |
WO2020167006A1 (fr) | Method for providing voice recognition service and associated electronic device | |
WO2020013666A1 (fr) | Method for processing user voice input and electronic device supporting said method | |
WO2021137558A1 (fr) | Electronic device and control method thereof | |
WO2024063507A1 (fr) | Electronic device and method for processing user utterance of electronic device | |
WO2018021750A1 (fr) | Electronic device and associated voice recognition method | |
WO2020050593A1 (fr) | Electronic device and associated operating method | |
WO2022055107A1 (fr) | Electronic device for voice recognition and control method thereof | |
WO2019112332A1 (fr) | Electronic apparatus and associated control method | |
WO2022186540A1 (fr) | Electronic device and method for processing recording and voice input in electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20903382 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20903382 Country of ref document: EP Kind code of ref document: A1 |