WO2022211257A1 - Electronic device and speech recognition method using same

Info

Publication number
WO2022211257A1
WO2022211257A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
candidate
sub
probability value
wfst
Prior art date
Application number
PCT/KR2022/001821
Other languages
English (en)
Korean (ko)
Inventor
이태우
자보로우스키바르토쉬
강태균
이지현
홍연아
박영진
시코라말신
권민석
이정수
정석영
김한별
방규섭
원미미
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2022211257A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • Various embodiments of the present disclosure relate to an electronic device and a method for performing voice recognition using the same.
  • A voice recognition function may be implemented in various electronic devices having a voice input device (e.g., a microphone). Through the voice recognition function, an electronic device may recognize the voice uttered by a user, understand the utterance, and provide a service according to the utterance intention. For example, based on a voice signal according to a user's utterance, the electronic device may recognize an entity name of information stored in the electronic device (e.g., an application name, a person's name stored in contacts, and/or a song title) and, based on the entity name and the domain information of the voice signal, execute a function according to the user's utterance.
  • The entity names of information stored in the electronic device may differ from user to user, so the recognition result corresponding to the same uttered voice signal may also differ. However, when voice recognition is performed on a user's utterance, universally used words, rather than the entity names of information stored in the electronic device, may be output as the recognition result.
  • The electronic device may generate a small language model based on the sub-word sequences of the entity names of information stored in the electronic device.
  • The electronic device may use the generated language model to provide a plurality of candidate texts that have a high probability of being predicted as the words corresponding to the voice signal.
  • An electronic device according to various embodiments includes a microphone, a memory, and a processor operatively connected to the microphone and the memory. The processor may be configured to: divide each of a plurality of entity names stored in the memory into sub-words; generate a weighted finite state transducer (WFST) model for the sub-words of the plurality of entity names based on the sub-word sequences; receive a voice signal according to a user's utterance through the microphone; generate a plurality of candidate texts related to the voice signal using the WFST model; when at least one of the plurality of entity names is included in the plurality of candidate texts, set the probability value of the candidate text corresponding to the at least one entity name higher than those of the other candidate texts, so that it is predicted as the words corresponding to the voice signal; and determine the candidate text having the high probability value as the target text.
  • A method for performing voice recognition in an electronic device according to various embodiments may include: dividing each of a plurality of entity names stored in a memory of the electronic device into sub-words; generating a weighted finite state transducer (WFST) model for the sub-words of the plurality of entity names based on the sub-word sequences; receiving a voice signal according to a user's utterance through a microphone; generating a plurality of candidate texts related to the voice signal using the WFST model; when at least one of the plurality of entity names is included in the plurality of candidate texts, setting the probability value of the candidate text corresponding to the at least one entity name higher than those of the other candidate texts, so that it is predicted as the words corresponding to the voice signal; and determining the candidate text having the high probability value as the target text.
  • According to various embodiments, upon receiving a voice signal according to a user's utterance, the electronic device may provide a plurality of candidate texts related to the voice signal.
  • According to various embodiments, by providing the plurality of candidate texts using a language model for the entity names of information stored in the electronic device, the electronic device can improve the recognition accuracy for the entity names corresponding to the voice signal according to the user's utterance.
  • FIG. 1 is a block diagram of an electronic device in a network environment, according to various embodiments of the present disclosure.
  • FIG. 2 is a block diagram illustrating an electronic device according to various embodiments of the present disclosure.
  • FIG. 3 is a block diagram illustrating a configuration of the voice recognition module of FIG. 2, according to various embodiments of the present disclosure.
  • FIG. 4 is a flowchart illustrating a method of recognizing a voice according to a user's utterance in an electronic device, according to various embodiments of the present disclosure.
  • FIG. 5 is a diagram for describing a method of recognizing a person's name stored in the contacts of an electronic device according to a user's utterance, according to various embodiments of the present disclosure.
  • FIG. 6 is a diagram for describing a method of recognizing a person's name stored in the contacts of an electronic device according to a user's utterance, according to various embodiments of the present disclosure.
  • FIG. 7 is a diagram for describing a method of recognizing an application name installed in an electronic device according to a user's utterance, according to various embodiments of the present disclosure.
  • FIG. 1 is a block diagram of an electronic device 101 in a network environment 100, according to various embodiments.
  • According to various embodiments, the electronic device 101 may communicate with the electronic device 102 through a first network 198 (e.g., a short-range wireless communication network), or with at least one of the electronic device 104 and the server 108 through a second network 199 (e.g., a long-range wireless communication network). According to an embodiment, the electronic device 101 may communicate with the electronic device 104 through the server 108.
  • According to various embodiments, the electronic device 101 may include a processor 120, a memory 130, an input module 150, a sound output module 155, a display module 160, an audio module 170, a sensor module 176, an interface 177, a connection terminal 178, a haptic module 179, a camera module 180, a power management module 188, a battery 189, a communication module 190, a subscriber identification module 196, or an antenna module 197.
  • In some embodiments, at least one of these components (e.g., the connection terminal 178) may be omitted from the electronic device 101, or one or more other components may be added. In some embodiments, some of these components may be integrated into one component (e.g., the display module 160).
  • The processor 120 may, for example, execute software (e.g., a program 140) to control at least one other component (e.g., a hardware or software component) of the electronic device 101 connected to the processor 120, and may perform various data processing or operations. According to an embodiment, as at least part of the data processing or operations, the processor 120 may store commands or data received from another component (e.g., the sensor module 176 or the communication module 190) in the volatile memory 132, process the commands or data stored in the volatile memory 132, and store the resulting data in the non-volatile memory 134.
  • According to an embodiment, the processor 120 may include a main processor 121 (e.g., a central processing unit or an application processor) or an auxiliary processor 123 (e.g., a graphics processing unit, a neural processing unit (NPU), an image signal processor, a sensor hub processor, or a communication processor) that can operate independently of, or together with, the main processor 121.
  • The auxiliary processor 123 may, for example, control at least some of the functions or states related to at least one of the components of the electronic device 101 (e.g., the display module 160, the sensor module 176, or the communication module 190) on behalf of the main processor 121 while the main processor 121 is in an inactive (e.g., sleep) state, or together with the main processor 121 while the main processor 121 is in an active (e.g., application-executing) state. According to an embodiment, the auxiliary processor 123 (e.g., an image signal processor or a communication processor) may be implemented as part of another functionally related component (e.g., the camera module 180 or the communication module 190).
  • the auxiliary processor 123 may include a hardware structure specialized for processing an artificial intelligence model.
  • Artificial intelligence models may be created through machine learning. Such learning may be performed, for example, in the electronic device 101 itself where the artificial intelligence model is executed, or through a separate server (e.g., the server 108).
  • The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to the above examples.
  • the artificial intelligence model may include a plurality of artificial neural network layers.
  • The artificial neural network may be one of a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of these, but is not limited to the above examples.
  • The artificial intelligence model may include, additionally or alternatively, a software structure in addition to the hardware structure.
  • the memory 130 may store various data used by at least one component (eg, the processor 120 or the sensor module 176 ) of the electronic device 101 .
  • the data may include, for example, input data or output data for software (eg, the program 140 ) and instructions related thereto.
  • the memory 130 may include a volatile memory 132 or a non-volatile memory 134 .
  • the program 140 may be stored as software in the memory 130 , and may include, for example, an operating system 142 , middleware 144 , or an application 146 .
  • the input module 150 may receive a command or data to be used by a component (eg, the processor 120 ) of the electronic device 101 from the outside (eg, a user) of the electronic device 101 .
  • the input module 150 may include, for example, a microphone, a mouse, a keyboard, a key (eg, a button), or a digital pen (eg, a stylus pen).
  • the sound output module 155 may output a sound signal to the outside of the electronic device 101 .
  • the sound output module 155 may include, for example, a speaker or a receiver.
  • the speaker can be used for general purposes such as multimedia playback or recording playback.
  • the receiver can be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from or as part of the speaker.
  • the display module 160 may visually provide information to the outside (eg, a user) of the electronic device 101 .
  • The display module 160 may include, for example, a display, a hologram device, or a projector, as well as a control circuit for controlling the corresponding device.
  • the display module 160 may include a touch sensor configured to sense a touch or a pressure sensor configured to measure the intensity of a force generated by the touch.
  • The audio module 170 may convert a sound into an electric signal or, conversely, convert an electric signal into a sound. According to an embodiment, the audio module 170 may acquire sound through the input module 150, or may output sound through the sound output module 155 or an external electronic device (e.g., the electronic device 102, such as a speaker or headphones) directly or wirelessly connected to the electronic device 101.
  • The sensor module 176 may detect an operating state (e.g., power or temperature) of the electronic device 101 or an external environmental state (e.g., a user state), and may generate an electrical signal or data value corresponding to the detected state.
  • The sensor module 176 may include, for example, a gesture sensor, a gyro sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a grip sensor, a proximity sensor, a color sensor, an infrared (IR) sensor, a biometric sensor, a temperature sensor, a humidity sensor, or an illuminance sensor.
  • the interface 177 may support one or more specified protocols that may be used by the electronic device 101 to directly or wirelessly connect with an external electronic device (eg, the electronic device 102 ).
  • the interface 177 may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, an SD card interface, or an audio interface.
  • the connection terminal 178 may include a connector through which the electronic device 101 can be physically connected to an external electronic device (eg, the electronic device 102 ).
  • the connection terminal 178 may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (eg, a headphone connector).
  • the haptic module 179 may convert an electrical signal into a mechanical stimulus (eg, vibration or movement) or an electrical stimulus that the user can perceive through tactile or kinesthetic sense.
  • the haptic module 179 may include, for example, a motor, a piezoelectric element, or an electrical stimulation device.
  • the camera module 180 may capture still images and moving images. According to an embodiment, the camera module 180 may include one or more lenses, image sensors, image signal processors, or flashes.
  • the power management module 188 may manage power supplied to the electronic device 101 .
  • the power management module 188 may be implemented as, for example, at least a part of a power management integrated circuit (PMIC).
  • the battery 189 may supply power to at least one component of the electronic device 101 .
  • The battery 189 may include, for example, a non-rechargeable primary cell, a rechargeable secondary cell, or a fuel cell.
  • The communication module 190 may support establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device 101 and an external electronic device (e.g., the electronic device 102, the electronic device 104, or the server 108), and communication through the established channel.
  • the communication module 190 may include one or more communication processors that operate independently of the processor 120 (eg, an application processor) and support direct (eg, wired) communication or wireless communication.
  • According to an embodiment, the communication module 190 may include a wireless communication module 192 (e.g., a cellular communication module, a short-range wireless communication module, or a global navigation satellite system (GNSS) communication module) or a wired communication module 194 (e.g., a local area network (LAN) communication module or a power line communication module).
  • Among these communication modules, the corresponding communication module may communicate with the external electronic device 104 through the first network 198 (e.g., a short-range communication network such as Bluetooth, wireless fidelity (WiFi) Direct, or infrared data association (IrDA)) or the second network 199 (e.g., a long-range communication network such as a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network such as a LAN or a WAN).
  • The wireless communication module 192 may identify or authenticate the electronic device 101 within a communication network, such as the first network 198 or the second network 199, using subscriber information (e.g., an International Mobile Subscriber Identity (IMSI)) stored in the subscriber identification module 196.
  • The wireless communication module 192 may support a 5G network following a 4G network, and a next-generation communication technology, for example, New Radio (NR) access technology.
  • The NR access technology may support high-speed transmission of high-capacity data (enhanced mobile broadband, eMBB), minimization of terminal power and access by multiple terminals (massive machine type communications, mMTC), or high reliability and low latency (ultra-reliable and low-latency communications, URLLC).
  • The wireless communication module 192 may support a high-frequency band (e.g., the mmWave band), for example, to achieve a high data rate.
  • The wireless communication module 192 may support various technologies for securing performance in a high-frequency band, for example, beamforming, massive multiple-input and multiple-output (massive MIMO), full-dimensional MIMO (FD-MIMO), an array antenna, analog beamforming, or a large-scale antenna.
  • the wireless communication module 192 may support various requirements defined in the electronic device 101 , an external electronic device (eg, the electronic device 104 ), or a network system (eg, the second network 199 ).
  • According to an embodiment, the wireless communication module 192 may support a peak data rate for realizing eMBB (e.g., 20 Gbps or more), loss coverage for realizing mMTC (e.g., 164 dB or less), or U-plane latency for realizing URLLC (e.g., 0.5 ms or less for each of downlink (DL) and uplink (UL), or 1 ms or less round trip).
  • the antenna module 197 may transmit or receive a signal or power to the outside (eg, an external electronic device).
  • According to an embodiment, the antenna module 197 may include an antenna including a radiator formed of a conductor or a conductive pattern formed on a substrate (e.g., a PCB).
  • According to an embodiment, the antenna module 197 may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for a communication method used in a communication network, such as the first network 198 or the second network 199, may be selected from the plurality of antennas by, for example, the communication module 190. A signal or power may be transmitted or received between the communication module 190 and an external electronic device through the selected at least one antenna.
  • According to some embodiments, other components (e.g., a radio frequency integrated circuit (RFIC)) in addition to the radiator may be additionally formed as part of the antenna module 197.
  • the antenna module 197 may form a mmWave antenna module.
  • According to an embodiment, the mmWave antenna module may include a printed circuit board, an RFIC disposed on or adjacent to a first side (e.g., the bottom side) of the printed circuit board and capable of supporting a designated high-frequency band (e.g., the mmWave band), and a plurality of antennas (e.g., an array antenna) disposed on or adjacent to a second side (e.g., the top side or a side surface) of the printed circuit board and capable of transmitting or receiving signals in the designated high-frequency band.
  • At least some of the above-described components may be connected to each other through a communication method between peripheral devices (e.g., a bus, general purpose input and output (GPIO), serial peripheral interface (SPI), or mobile industry processor interface (MIPI)) and may exchange signals (e.g., commands or data) with each other.
  • the command or data may be transmitted or received between the electronic device 101 and the external electronic device 104 through the server 108 connected to the second network 199 .
  • Each of the external electronic devices 102 or 104 may be the same as or different from the electronic device 101 .
  • all or part of the operations performed by the electronic device 101 may be executed by one or more external electronic devices 102 , 104 , or 108 .
  • For example, when the electronic device 101 needs to perform a function or service, instead of, or in addition to, executing the function or service itself, the electronic device 101 may request one or more external electronic devices to perform at least part of the function or service.
  • One or more external electronic devices that have received the request may execute at least a part of the requested function or service, or an additional function or service related to the request, and transmit a result of the execution to the electronic device 101 .
  • The electronic device 101 may process the result, as is or with additional processing, and provide it as at least part of a response to the request.
  • To this end, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used.
  • the electronic device 101 may provide an ultra-low latency service using, for example, distributed computing or mobile edge computing.
  • the external electronic device 104 may include an Internet of things (IoT) device.
  • the server 108 may be an intelligent server using machine learning and/or neural networks.
  • the external electronic device 104 or the server 108 may be included in the second network 199 .
  • the electronic device 101 may be applied to an intelligent service (eg, smart home, smart city, smart car, or health care) based on 5G communication technology and IoT-related technology.
  • the electronic device may have various types of devices.
  • the electronic device may include, for example, a portable communication device (eg, a smart phone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a home appliance device.
  • Terms such as "first" and "second" may simply be used to distinguish a component from other such components, and do not limit those components in other aspects (e.g., importance or order). When one (e.g., a first) component is referred to as being "coupled" or "connected" to another (e.g., a second) component, with or without the terms "functionally" or "communicatively", it means that the one component can be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.
  • The term "module" used in various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit.
  • A module may be an integrally formed part, or a minimum unit or part thereof, that performs one or more functions.
  • the module may be implemented in the form of an application-specific integrated circuit (ASIC).
  • Various embodiments of this document may be implemented as software including one or more instructions stored in a storage medium (e.g., the internal memory 136 or the external memory 138) readable by a machine (e.g., the electronic device 101). For example, the processor (e.g., the processor 120) of the device (e.g., the electronic device 101) may call at least one of the one or more stored instructions from the storage medium and execute it.
  • the one or more instructions may include code generated by a compiler or code executable by an interpreter.
  • the device-readable storage medium may be provided in the form of a non-transitory storage medium.
  • Here, 'non-transitory' only means that the storage medium is a tangible device and does not contain a signal (e.g., an electromagnetic wave); this term does not distinguish between a case where data is semi-permanently stored in the storage medium and a case where it is temporarily stored.
  • The method according to various embodiments disclosed in this document may be included and provided in a computer program product.
  • Computer program products may be traded between sellers and buyers as commodities.
  • The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) online through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones).
  • In the case of online distribution, at least a portion of the computer program product may be temporarily stored, or temporarily generated, in a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.
  • According to various embodiments, each component (e.g., a module or a program) of the above-described components may include a singular or plural number of entities, and some of the plural entities may be separately disposed in another component.
  • one or more components or operations among the above-described corresponding components may be omitted, or one or more other components or operations may be added.
  • Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into one component. In this case, the integrated component may perform one or more functions of each of the plurality of components identically or similarly to the way they were performed by the corresponding component among the plurality of components prior to the integration.
  • According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
  • FIG. 2 is a block diagram 200 illustrating an electronic device 201 according to various embodiments of the present disclosure.
  • FIG. 3 is a block diagram 300 illustrating the configuration of the voice recognition module 260 of FIG. 2, according to various embodiments of the present disclosure.
  • According to various embodiments, an electronic device 201 (e.g., the electronic device 101 of FIG. 1) may include a communication circuit 210 (e.g., the communication module 190 of FIG. 1), a memory 220 (e.g., the memory 130 of FIG. 1), a touch screen display 230 (e.g., the display module 160 of FIG. 1), an audio processing circuit 240 (e.g., the audio module 170 of FIG. 1), a processor 250 (e.g., the processor 120 of FIG. 1), and/or a voice recognition module 260.
  • According to various embodiments, the communication circuit 210 (e.g., the communication module 190 of FIG. 1) may, under the control of the processor 250, establish a communication connection between the electronic device 201 and at least one external electronic device (e.g., the electronic device 102 or the electronic device 104 of FIG. 1) and/or a server (e.g., the server 108 of FIG. 1).
  • According to various embodiments, the memory 220 (e.g., the memory 130 of FIG. 1) may store a program (e.g., the program 140 of FIG. 1) for processing and control of the processor 250 of the electronic device 201, and an operating system (OS) (e.g., the operating system 142 of FIG. 1).
  • the memory 220 may store various setting information required when the electronic device 201 processes functions related to various embodiments of the present disclosure.
  • According to an embodiment, the memory 220 may store the voice recognition module 260 associated with a function (or operation) of processing an intelligent service (e.g., an artificial intelligence (AI) voice assistant), which may be performed by the processor 250.
  • the memory 220 may include at least one module of the voice recognition module 260 in the form of software.
  • The functions of the voice recognition module 260 may be implemented and stored in the memory 220 in the form of instructions.
  • the touch screen display 230 (eg, the display module 160 of FIG. 1 ) may be integrally configured including the display 231 and the touch panel 233 .
  • According to various embodiments, the touch screen display 230 may display an image under the control of the processor 250, and may be implemented as any one of a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, a micro electro mechanical systems (MEMS) display, an electronic paper display, or a flexible display. However, the present invention is not limited thereto.
  • According to various embodiments, the touch screen display 230 may, under the control of the processor 250, display an execution screen related to the performance of an intelligent service (e.g., an AI voice assistant) and/or various other information.
  • the touch screen display 230 may display a user interface including a response result processed to the user's utterance under the control of the processor 250 .
  • the audio processing circuitry 240 may include a microphone 241 and a speaker 243 (eg, the sound output module 155 of FIG. 1 ).
  • the microphone 241 may receive an acoustic signal according to a user utterance.
  • the microphone 241 may receive a voice signal from an external device (eg, a headset or a microphone) connected by wire through an audio connector configured in the connection terminal 178 of FIG. 1 .
  • the microphone 241 may receive a voice signal from an external electronic device wirelessly connected to the electronic device 201 through a wireless communication circuit (eg, the wireless communication module 192 of FIG. 1 ).
  • the speech signal may include a sequence of words. Each word sequence may include a word, a sub-word that is a sub-unit of a word, a phrase, or a sentence.
  • the speaker 243 may output a result of a response processed to the user's utterance as a sound.
  • the processor 250 may control an operation (or processing) related to processing an intelligent service (eg, an AI voice assistant) in the electronic device 201 .
  • According to various embodiments, the processor 250 may include at least one module of the voice recognition module 260, and may perform a voice recognition operation according to a user's utterance based on the at least one module of the voice recognition module 260.
  • According to various embodiments, the voice recognition module 260 may be included in the processor 250 as a hardware module (e.g., circuitry), and/or may be implemented as software including one or more instructions stored in a storage medium readable by the processor 250 (e.g., the memory 220). For example, operations performed by the processor 250 may be stored in the memory 220 as instructions that, when executed, cause the processor 250 to operate.
  • the present invention is not limited thereto, and the voice recognition module 260 may be implemented in an external electronic device (eg, the server 108 of FIG. 1 ).
  • According to various embodiments, the voice recognition module 260 may include a weighted finite state transducer (WFST) model generation module 310, an automatic speech recognition (ASR) module 320, a priority setting module 330, and/or a target text determination module 340.
  • the weighted finite state transducer (WFST) model generation module 310 may generate a WFST model for a list of entity names including a plurality of named entities stored in the memory 220 .
  • The plurality of entity names stored in the memory 220 may include a person's name stored in contacts, an application name, a song title, and/or a place name.
  • the present invention is not limited thereto, and the plurality of entity names stored in the memory 220 may include a word or a word sequence having a unique meaning.
  • the WFST model may include a language model that outputs a word or a word sequence when a sub-word is input based on a probability that the sub-word is predicted as a word or a word sequence representing an entity name.
  • According to an embodiment, the WFST model may include a finite state transducer. The finite state transducer may perform a state transition for an input sub-word based on a preset rule, and may determine the sub-words that can follow the input sub-word according to the state transition result.
  • the WFST model generation module 310 may divide each of a plurality of named entities stored in the memory 220 into sub words.
  • a sub-word is a basic unit constituting a word, and may mean, for example, a phoneme or a syllable.
  • the WFST model generation module 310 may generate a WFST model for each sub-word of a plurality of entity names based on a sub word sequence.
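  • This step can be sketched as follows. The sketch is a minimal illustration assuming syllable-level sub-words, and it uses a weighted prefix trie as a stand-in for a true WFST (a production system would typically use an FST toolkit such as OpenFst); all function names are illustrative, not identifiers from the patent.

```python
def new_node():
    # A trie node standing in for a WFST state.
    return {"arcs": {}, "final": False}

def split_into_subwords(entity_name):
    # Character-level split for illustration; a real system would use the
    # ASR model's sub-word inventory (phonemes, syllables, or BPE units).
    return list(entity_name)

def build_subword_model(entity_names):
    """Weighted prefix trie over entity-name sub-word sequences, standing in
    for the states, arcs, and weights of the WFST model."""
    root = new_node()
    for name in entity_names:
        node = root
        for sub in split_into_subwords(name):
            node = node["arcs"].setdefault(sub, new_node())
        node["final"] = True  # a complete entity name ends at this state
    return root

def score_prefix(root, subwords, arc_reward=0.1, final_reward=1.0):
    """Biasing score for a sub-word sequence: 0.0 once it leaves the trie,
    growing while it follows (and especially when it completes) a name."""
    node, score = root, 0.0
    for sub in subwords:
        if sub not in node["arcs"]:
            return 0.0
        node = node["arcs"][sub]
        score += arc_reward
    return score + (final_reward if node["final"] else 0.0)

contacts = ["Kanggeumdan", "Kimhanbyul"]
model = build_subword_model(contacts)
print(score_prefix(model, list("Kanggeumdan")))  # on-name sequence scores high
```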
  • According to an embodiment, the electronic device 201 (e.g., the WFST model generation module 310) may store the generated WFST model in the memory 220 in a binary format.
  • As the WFST model is stored in a binary format, data processing based on the WFST model can be fast in the decoding operation performed by the decoder 323 described later, and space in the memory 220 can also be saved.
  • the WFST model for each sub-word of each of the generated entity names may be used when generating candidate texts to be described later.
  • the WFST model generation module 310 may generate a plurality of WFST models.
  • the plurality of WFST models may include a language model specialized in a specific domain.
  • the domain may include a field or category related to a voice signal.
  • a particular domain may include a phone domain, a contact domain, and/or a music domain.
  • the present invention is not limited thereto.
  • the automatic speech recognition (ASR) module 320 may convert a voice signal received from the microphone 241 into text data.
  • the ASR module 320 may include an encoder 321 that extracts a feature value (eg, a feature vector) from a speech signal, and a decoder 323 that outputs a candidate text based on the feature value extracted by the encoder 321.
  • the ASR module 320 may be implemented as a model configured to perform speech recognition on a speech signal in an end-to-end automatic speech recognition (ASR) method.
  • the encoder 321 and the decoder 323 may be implemented as one neural network or may be implemented as separate neural networks.
  • the encoder 321 of the ASR module 320 may extract information indicating the characteristics of the voice from the voice signal received from the microphone 241 .
  • the encoder 321 of the ASR module 320 may extract voice feature information (eg, a voice feature vector) including information indicating a feature of the voice signal.
  • According to an embodiment, the encoder 321 of the ASR module 320 may perform the operation of extracting voice feature information based on recognizing an utterance (e.g., "Hi Bixby") designated to invoke (or drive) an intelligent service (e.g., an AI voice assistant).
  • According to an embodiment, the encoder 321 of the ASR module 320 may perform the operation of extracting voice feature information based on detecting an input designated to invoke the intelligent service, for example, the power key of the input module 150 being pressed twice in succession, or a touch input received from the touch circuit of the display 231.
  • According to an embodiment, the encoder 321 of the ASR module 320 may remove noise from the voice signal, detect a voice section, and extract voice feature information (e.g., a voice feature vector) to be used for voice recognition from the voice section.
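  • As a concrete illustration of this step, the sketch below computes log-mel filterbank features, a common choice of voice feature vector for end-to-end ASR encoders; the patent does not specify the feature type, so the feature choice, the librosa-based silence trimming, and all parameter values are assumptions.

```python
import numpy as np
import librosa  # assumed available; any feature-extraction library would do

def extract_features(signal, sr=16000, n_mels=80):
    """Return a (frames, n_mels) matrix of log-mel features for `signal`."""
    # Crude voice-section detection: trim leading/trailing silence.
    voiced, _ = librosa.effects.trim(signal, top_db=30)
    # 25 ms windows with a 10 ms hop at 16 kHz (n_fft=400, hop_length=160).
    mel = librosa.feature.melspectrogram(
        y=voiced, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T
```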
  • the encoder 321 of the ASR module 320 may perform an operation of extracting speech feature information based on a model learned using pre-prepared training data and/or an artificial intelligence algorithm.
  • the present invention is not limited thereto.
  • the decoder 323 of the ASR module 320 may generate a candidate text based on speech characteristic information extracted from a speech signal.
  • the candidate text generated by the decoder 323 of the ASR module 320 may include one or more words.
  • the decoder 323 of the ASR module 320 may generate a language model for the speech signal using an artificial intelligence model (eg, an end-to-end automatic speech recognition (ASR) model).
  • The language model, as a component of the natural language processing engine, may provide probability values associated with each word (including sub-words), phrase, and/or sentence.
  • the decoder 323 of the ASR module 320 may generate a plurality of candidate texts based on the text representations provided by the above-described language model and probability values of the corresponding text representations.
  • According to an embodiment, the decoder 323 of the ASR module 320 may generate a plurality of candidate texts by decoding the WFST model through a specified decoding method (e.g., shallow fusion) during a beam search.
  • For example, during the beam search, the decoder 323 of the ASR module 320 may linearly combine the probability of the WFST model and the probability of the language model for the voice signal learned by the ASR module 320 using the specified decoding method (e.g., the shallow fusion method), and may generate a plurality of high-probability candidate texts based on the combined probabilities.
  • a plurality of candidate texts having a high probability may mean a plurality of candidate texts having a high probability of corresponding to the target text.
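  • The shallow-fusion combination described above can be sketched as one beam-search expansion step. The callables `asr_step_probs` (next-sub-word distribution from the end-to-end model) and `wfst_score` (biasing score from the WFST model), and the weight `lam`, are assumptions for illustration; the patent only specifies that the two probabilities are linearly combined.

```python
import math

def shallow_fusion_step(beam, asr_step_probs, wfst_score, lam=0.3, beam_size=5):
    """Expand each hypothesis and rescore with shallow fusion:
    fused = log P_asr + lam * score_wfst."""
    expanded = []
    for prefix, score in beam:
        for subword, p in asr_step_probs(prefix):  # [(subword, prob), ...]
            new_prefix = prefix + [subword]
            fused = score + math.log(p) + lam * wfst_score(new_prefix)
            expanded.append((new_prefix, fused))
    expanded.sort(key=lambda hyp: hyp[1], reverse=True)
    return expanded[:beam_size]  # keep only the top hypotheses
```

  • Here `lam` controls how strongly the entity-name model biases the search; with `lam` set to 0, decoding falls back to the plain end-to-end model.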
  • According to various embodiments, the priority setting module 330 may determine whether at least one of the plurality of entity names is included in the plurality of candidate texts. When at least one entity name is included in the plurality of candidate texts, the priority setting module 330 may check the probability of the candidate text corresponding to the at least one entity name. If the probability of the candidate text corresponding to the at least one entity name is confirmed to be lower than the probabilities of the other candidate texts, the priority setting module 330 may set it higher than the probabilities of the other candidate texts, so that the candidate text corresponding to the at least one entity name is predicted as the words corresponding to the voice signal.
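  • A minimal sketch of this priority-setting step, together with the target-text selection described next; the boost margin is an assumption, since the patent only requires the entity candidate's probability to be set higher than the others.

```python
def prioritize_entity_candidates(candidates, entity_names, margin=1e-3):
    """candidates: list of (text, probability) pairs from the decoder."""
    best = max(prob for _, prob in candidates)
    boosted = []
    for text, prob in candidates:
        # Boost a candidate that contains a stored entity name but is not
        # already the top-scoring hypothesis.
        if prob < best and any(name in text for name in entity_names):
            prob = best + margin
        boosted.append((text, prob))
    return boosted

def determine_target_text(candidates):
    # The candidate with the highest (possibly boosted) probability wins.
    return max(candidates, key=lambda c: c[1])[0]

candidates = [("call Kang Geumdan", 0.31), ("call Kang Geum down", 0.42)]
boosted = prioritize_entity_candidates(candidates, ["Kang Geumdan"])
print(determine_target_text(boosted))  # -> "call Kang Geumdan"
```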
  • According to various embodiments, the target text determination module 340 may determine the candidate text whose probability value was set high as the target text.
  • An electronic device 201 according to various embodiments may include a microphone 241, a memory 220, and a processor 250 operatively connected to the microphone 241 and the memory 220. The processor 250 may be configured to: divide each of a plurality of entity names stored in the memory 220 into sub-words; generate a weighted finite state transducer (WFST) model for the sub-words of the plurality of entity names based on the sub-word sequences; receive a voice signal according to a user's utterance through the microphone 241; generate a plurality of candidate texts related to the voice signal using the WFST model; when at least one of the plurality of entity names is included in the plurality of candidate texts, set the probability value of the candidate text corresponding to the at least one entity name higher than those of the other candidate texts, so that it is predicted as the words corresponding to the voice signal; and determine the candidate text having the high probability value as the target text.
  • According to various embodiments, the processor 250 may be configured to check the length of the sub-word sequence and, when the length of the sub-word sequence exceeds a designated length, generate the WFST model for the sub-words of the plurality of entity names based on the designated length of the sub-word sequence and a designated window.
  • the WFST model may include a plurality of WFST models generated for each domain.
  • the domain may include a field or category related to the voice signal.
  • According to various embodiments, the processor 250 may be configured to check domain information of the received voice signal and to generate the plurality of candidate texts related to the voice signal using the WFST model corresponding to the checked domain information among the plurality of WFST models.
  • the processor 250 may be configured to generate a language model for the voice signal using an artificial intelligence model based on receiving the voice signal.
  • According to various embodiments, the processor 250 may be configured to linearly combine the probability of the WFST model and the probability of the language model for the voice signal generated using the artificial intelligence model, and to generate, based on the combination, the plurality of high-probability candidate texts related to the voice signal.
  • According to various embodiments, the processor 250 may be configured to check, with reference to the plurality of entity names stored in the memory 220, whether at least one of the plurality of entity names is included in the plurality of candidate texts, and, when at least one entity name is included in the plurality of candidate texts, to compare the probability value of the candidate text corresponding to the at least one entity name with the probability values of the other candidate texts.
  • According to various embodiments, when, as a result of the comparison, the probability value of the candidate text corresponding to the at least one entity name is lower than the probability values of the other candidate texts, the processor 250 may be configured to set the probability value of the candidate text corresponding to the at least one entity name higher than those of the other candidate texts.
  • According to various embodiments, the electronic device 201 may further include a display 231, and the processor 250 may be configured to execute a function of the electronic device 201 based on the target text and to display a user interface according to the function execution on the display 231.
  • FIG. 4 is a flowchart 400 for explaining a method of recognizing a voice of the electronic device 201 according to a user's utterance, according to various embodiments of the present disclosure.
  • According to various embodiments, the electronic device (e.g., the electronic device 201 of FIG. 2) may divide each of a plurality of entity names stored in a memory (e.g., the memory 220 of FIG. 2) into sub-words.
  • The plurality of entity names stored in the memory 220 may include words or word sequences having unique meanings, such as person names stored in the contact list and application names.
  • The plurality of entity names may include, for example, words or word sequences corresponding to entity names included in a running application.
  • For example, from a music application, entity names including a song name, an artist name, and/or a composer name related to music included in a playlist or currently being played (or music having a playback history) may be acquired and stored in the memory 220.
  • the electronic device 201 may store a plurality of object names obtained by crawling a web page in the memory 220 .
  • a plurality of entity names are stored in the memory 220 , but the present invention is not limited thereto.
  • the plurality of entity names may include a plurality of entity names received through a user input and/or a plurality of entity names received from an external server.
  • a sub-word is a basic unit constituting a word, and may mean, for example, a phoneme or a syllable.
  • According to various embodiments, the electronic device 201 may generate a weighted finite state transducer (WFST) model for the sub-words of the plurality of entity names based on the sub-word sequences. For example, when the length of a sub-word sequence is less than or equal to a specified length, the electronic device 201 may generate a full variation for the sub-words of each of the plurality of entity names, and may generate the WFST model for the sub-words of the plurality of entity names based on the full variation. As another example, when the length of a sub-word sequence exceeds the specified length, the electronic device 201 may generate a partial variation for the sub-words of each of the plurality of entity names based on the specified length of the sub-word sequence and a specified window, and may generate the WFST model for the sub-words of the plurality of entity names based on the partial variation.
  • In the following description, it is assumed that the entity name is a person's name stored in the contact list, and that a WFST model is generated for "Kang Geum-dan" among the person names stored in the contact list.
  • the electronic device 201 may perform an operation of converting (eg, encoding) a person's name “Kang Geum-dan” stored in the contact list into one sub-word sequence.
  • the output value “247 316 742” can be generated through the operation of converting (eg, encoding) the name “Kang Geum-dan” (eg, input value) stored in the contact into one sub-word sequence.
  • The output value indicates the index of each sub-word mapped to the input value (e.g., "Kang", "Geum", "Dan_").
  • For example, the output value "247" may mean that "Kang" is the 247th sub-word, the output value "316" may mean that "Geum" is the 316th sub-word, and the output value "742" may mean that "Dan_" is the 742nd sub-word.
  • Input: "Kanggeumdan" → Output: 247 (Kang) 316 (Geum) 742 (Dan_)
  • According to an embodiment, the electronic device 201 may extend the generated sub-word sequence (e.g., 247 (Kang) 316 (Geum) 742 (Dan_)) to the possible sub-word sequence variations that the decoder (e.g., the decoder 323 of FIG. 3) may generate.
  • For example, the length of the sub-word sequence for the input value "Kanggeumdan" is 3.
  • the specified length of the sub-word sequence and the specified window may be set differently based on the length of the entity name and/or language characteristics (eg, Korean, English).
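  • One reading of the full-variation/partial-variation distinction is sketched below: short sequences are expanded into every contiguous sub-sequence, while long sequences are expanded only over fixed-size sliding windows to keep the model small. The cutoff and window values are assumptions.

```python
def subword_variations(subword_ids, max_len=3, window=3):
    """Expand one sub-word ID sequence into the variations used to build
    the WFST model (interpretation; the patent's exact expansion rule
    is not spelled out)."""
    if len(subword_ids) <= max_len:
        # Full variation: every contiguous sub-sequence.
        return [subword_ids[i:j]
                for i in range(len(subword_ids))
                for j in range(i + 1, len(subword_ids) + 1)]
    # Partial variation: fixed-size sliding windows only.
    return [subword_ids[i:i + window]
            for i in range(len(subword_ids) - window + 1)]

print(subword_variations([247, 316, 742]))       # full: all 6 spans
print(subword_variations([12, 34, 56, 78, 90]))  # partial: 3 windows
```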
  • the WFST model has been described as a WFST model for a person name stored in a contact, but is not limited thereto.
  • the WFST model may include a plurality of WFST models generated for each domain.
  • a domain may include a field or category related to a voice signal.
  • a domain may include a phone domain, a contacts domain, and/or a music domain.
  • the present invention is not limited thereto.
  • the electronic device 201 may receive a voice signal according to a user's utterance through a microphone (eg, the microphone 241 of FIG. 2 ).
  • the speech signal may include a sequence of words.
  • Each word sequence may include a word, a sub-word that is a sub-unit of a word, a phrase, or a sentence.
  • the electronic device 201 may generate a plurality of candidate texts related to the voice signal by using the WFST model.
  • the WFST model may include a plurality of WFST models specialized in a specific domain.
  • The electronic device 201 may check domain information of the received voice signal, and may generate a plurality of candidate texts related to the voice signal using the WFST model corresponding to the checked domain information among the plurality of WFST models.
  • the electronic device 201 may check domain information about the voice signal while receiving the voice signal according to the user's utterance through the microphone 241 .
  • For example, the electronic device 201 may perform partial decoding based on at least part of the voice signal according to the received user utterance. Based on the partial decoding result, the electronic device 201 may first check the domain information of the voice signal, and may then generate a plurality of candidate texts related to the voice signal using the WFST model corresponding to the checked domain information.
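  • A sketch of this domain check, assuming keyword-based domain detection over the partial decoding result; the keyword lists and domain names are illustrative, as the patent only states that domain information is checked and the matching WFST model is used.

```python
def pick_domain_model(partial_text, wfst_models, default="phone"):
    """Select a domain-specific WFST model from a partially decoded prefix."""
    keywords = {
        "contacts": ("call", "text", "message"),
        "music": ("play", "song", "album"),
    }
    for domain, words in keywords.items():
        if any(w in partial_text.lower() for w in words):
            return wfst_models[domain]
    return wfst_models[default]  # fall back to a default domain model
```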
  • According to various embodiments, the electronic device 201 may generate a plurality of candidate texts by decoding the WFST model corresponding to the domain information of the voice signal through a specified decoding method (e.g., shallow fusion) during a beam search. For example, when performing the beam search, the electronic device 201 may linearly combine the probability of the WFST model and the probability of the language model for the voice signal learned by the ASR module 320 using the specified decoding method (e.g., the shallow fusion method), and may generate a plurality of high-probability candidate texts based on the combination.
  • According to various embodiments, in operation 425, when at least one of the plurality of entity names is included in the plurality of candidate texts, the electronic device 201 may set the probability value of the candidate text corresponding to the at least one entity name to be higher than the probability values of the other candidate texts, so that it is predicted as the words corresponding to the voice signal.
  • According to various embodiments, the electronic device 201 may refer to the plurality of entity names stored in the memory 220 to determine whether at least one of the plurality of entity names is included in the plurality of candidate texts.
  • When at least one entity name is included in the plurality of candidate texts, the electronic device 201 may compare the probability value of the candidate text corresponding to the at least one entity name with the probability values of the other candidate texts. When the probability value of the candidate text corresponding to the at least one entity name is less than or equal to the probability values of the other candidate texts, the electronic device 201 may set the probability value of the candidate text corresponding to the at least one entity name to be higher than the probability values of the other candidate texts.
  • the operation of setting the probability value of the candidate text corresponding to the at least one entity name to be higher than the probability value of the other candidate texts may be an operation for selecting the candidate text corresponding to the at least one entity name as the target text.
  • According to various embodiments, when the probability value of the candidate text corresponding to the at least one entity name already exceeds the probability values of the other candidate texts, operation 425 may be omitted.
  • According to various embodiments, in operation 430, the electronic device 201 may determine the candidate text having the high probability value as the target text.
  • According to various embodiments, the electronic device 201 may output, on a display (e.g., the display 231 of FIG. 2), a user interface according to the execution of a function of the electronic device 201 based on the target text.
  • In the above description, the operation of setting the probability value of the candidate text corresponding to the at least one entity name higher than those of the other candidate texts (operation 425) and the operation of determining the candidate text having the high probability value as the target text (operation 430) have been described as two separate operations, but the present invention is not limited thereto. According to various embodiments, operations 425 and 430 may be integrated and performed as one operation.
  • For example, when at least one entity name is found in the plurality of candidate texts, the electronic device 201 may directly select the found candidate text as the target text. In this case, the operation of setting the probability value of the candidate text corresponding to the at least one entity name higher than the probability values of the other candidate texts may be omitted, and the electronic device 201 may determine the candidate text corresponding to the at least one entity name as the target text.
  • Through the above-described operations, the electronic device 201 may more accurately recognize the entity name stored in the memory 220 corresponding to the voice signal according to the user's utterance, and the performance of voice recognition may be improved accordingly.
  • each of a plurality of named entities stored in the memory 220 of the electronic device 201 is divided into sub words. and generating a WFST (weighted finite state transducer) model for each sub-word of the plurality of entity names based on the sub word sequence, and a voice signal according to the user's utterance through the microphone 241 receiving, generating a plurality of candidate texts related to the speech signal using the WFST model, and when at least one of the plurality of object names is included in the plurality of candidate texts, the at least one setting the probability value of the candidate text corresponding to the at least one object name to be higher than that of other candidate texts so that the candidate text corresponding to the object name is predicted as a word corresponding to the speech signal, and the probability value It may include an operation of determining the highly set candidate text as the target text.
  • the generating of the WFST model may include checking the length of the sub-word sequence and, when the length of the sub-word sequence exceeds a specified length, generating the WFST model for the sub-words of each of the plurality of entity names based on the specified length of the sub-word sequence and a specified window.
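  • The specified length and window are not detailed further here; as one possible reading, a minimal Python sketch with illustrative parameter values:

```python
def window_subword_sequence(subwords, max_len, window):
    """If a sub-word sequence is longer than max_len, split it into
    overlapping chunks of at most max_len, advanced by `window` sub-words.
    One hypothetical reading of the specified-length/window rule."""
    if len(subwords) <= max_len:
        return [subwords]
    chunks = []
    for start in range(0, len(subwords) - max_len + window, window):
        chunks.append(subwords[start:start + max_len])
    return chunks


print(window_subword_sequence(["ho", "gaeng", "no", "no", "app"],
                              max_len=3, window=2))
# -> [['ho', 'gaeng', 'no'], ['no', 'no', 'app']]
```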
  • the WFST model may include a plurality of WFST models generated for each domain.
  • the domain may include a field or category related to the voice signal.
  • the generating of the plurality of candidate texts may include checking domain information of the received voice signal and generating the plurality of candidate texts related to the voice signal using a WFST model corresponding to the checked domain information among the plurality of WFST models.
  • the method for performing voice recognition of the electronic device 201 may further include generating a language model for the voice signal using an artificial intelligence model based on receiving the voice signal.
  • the generating of the plurality of candidate texts may include linearly combining a probability of the WFST model and a probability of the language model for the speech signal generated using the artificial intelligence model and, based on this, generating the plurality of candidate texts having a high probability in relation to the speech signal.
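  • A minimal sketch of such a linear combination of scores (commonly called shallow fusion); the interpolation weight and the beam contents below are illustrative assumptions:

```python
def shallow_fusion_score(lm_logprob, wfst_logprob, lam=0.3):
    """Linearly combine the end-to-end language model score with the
    WFST (biasing) model score. lam is a tunable interpolation weight."""
    return lm_logprob + lam * wfst_logprob


# Hypothetical beam of partial hypotheses: (text, LM score, WFST score)
beam = [("Call Junhee Kim", -0.5, -4.0), ("Call Juni Kim", -1.2, -0.1)]
ranked = sorted(beam, key=lambda h: shallow_fusion_score(h[1], h[2]),
                reverse=True)
print(ranked[0][0])  # the hypothesis favored after fusion: "Call Juni Kim"
```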
  • the setting of the probability value of the candidate text corresponding to the at least one entity name to be higher than the probability values of the other candidate texts may include: checking, with reference to the plurality of entity names stored in the memory 220, whether at least one entity name among the plurality of entity names is included in the plurality of candidate texts; and, when at least one entity name among the plurality of entity names is included in the plurality of candidate texts, comparing the probability value of the candidate text corresponding to the at least one entity name with the probability values of the other candidate texts.
  • the setting of the probability value of the candidate text corresponding to the at least one entity name to be higher than the probability values of the other candidate texts may include, when the comparison result indicates that the probability value of the candidate text corresponding to the at least one entity name is lower than the probability value of another candidate text, setting the probability value of the candidate text corresponding to the at least one entity name to be higher than the probability values of the other candidate texts.
  • the method for performing voice recognition of the electronic device 201 may further include executing a function of the electronic device 201 based on the target text and displaying a user interface according to the function execution on the display 231.
  • FIG. 5 is a diagram 500 for explaining a method of recognizing a person's name stored in the contact information of the electronic device 201 according to a user's utterance, according to various embodiments of the present disclosure.
  • the electronic device (e.g., the electronic device 201 of FIG. 2) may generate (510) a WFST model for a contact (e.g., a contact list) and perform (550) voice recognition according to a user's utterance using the generated WFST model.
  • the electronic device 201 may generate (510) a WFST model for a contact list 515 (e.g., a list of entity names) stored in a memory (e.g., the memory 220 of FIG. 2) using a WFST model generation module (e.g., the WFST model generation module 310 of FIG. 3).
  • the WFST model generation module 310 may convert (e.g., encode) a person's name (e.g., a person's name set by the user of the electronic device 201) stored in the contact list 515 into one sub-word sequence.
  • the WFST model generation module 310 may generate (510) a WFST model for the person names stored in the contact list 515, for example, a contact language model 525, using a dynamic language model builder 520.
  • the electronic device 201 may receive the user utterance 555 through a microphone (eg, the microphone 241 of FIG. 2 ).
  • the ASR module 560 (e.g., the ASR module 320 of FIG. 3) may load (540) the generated WFST model, for example, the contact language model 525.
  • the ASR module 560 may identify domain information of the voice signal according to the user utterance 555 and generate a plurality of candidate texts related to the voice signal by using a WFST model corresponding to the identified domain information.
  • For example, the ASR module 560 may load (540) the contact language model 525 and decode the voice signal together with the contact language model 525 (e.g., shallow fusion decoding) to generate n candidate texts 565 (e.g., “Lee Min-sam”, “Call Im In-sam”, “Call Im Im-sam”, and “Call Im Im-sam Im”).
  • in a decoding operation, the electronic device 201 may linearly combine the probability of a word predicted using the contact language model 525 with the probability of the language model learned by the ASR module 560, so that the n candidate texts 565 may include the name of a person stored in the contact list 515.
  • the priority setting module 570 may check whether the name of a person stored in the contact list 515 is included in the n candidate texts 565, for example, “Lee Min-sam”, “Call Im In-sam”, “Call Im Im-sam”, or “Call Im Im-sam Im”.
  • the priority setting module 570 may set priorities for the n candidate texts 565 based on the contact list 515 (545). For example, the priority setting module 570 may set a higher priority for the candidate text that includes the name of a person stored in the contact list 515, for example, “Call Im Im-sam” among the n candidate texts 565.
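  • Schematically, the FIG. 5 flow (510 → 540/565 → 545) could be mocked up as below; every function is a stub named after the module it stands in for, with hard-coded hypothetical outputs rather than a real ASR stack:

```python
from typing import List, Tuple


def generate_contact_lm(contact_list: List[str]) -> set:
    """Stand-in for the WFST model generation module (operation 510)."""
    return set(contact_list)


def decode(audio: bytes, contact_lm: set) -> List[Tuple[str, float]]:
    """Stand-in for shallow-fusion decoding in the ASR module (540/565).
    The audio argument is unused; the n-best list is hard-coded."""
    return [("Lee Min-sam", -0.6), ("Call Im In-sam", -1.5),
            ("Call Im Im-sam", -3.1)]


def prioritize(candidates, contact_lm):
    """Stand-in for the priority setting module (operation 545)."""
    return [(t, -0.000001 if any(n in t for n in contact_lm) else p)
            for t, p in candidates]


contact_lm = generate_contact_lm(["Im Im-sam"])
n_best = prioritize(decode(b"...", contact_lm), contact_lm)
print(max(n_best, key=lambda c: c[1]))  # -> ('Call Im Im-sam', -1e-06)
```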
  • the electronic device 201 may convert at least one person's name stored in the contact list 515 into a sub-word sequence and then store it in the memory 220 in a list format.
  • the electronic device 201 may detect a portion of the n candidate texts that is likely to appear in the contact list 515 and convert the detected entity name into a phoneme sequence.
  • the electronic device 201 may search for a candidate text including an entity name similar to at least one person's name stored in the contact list, based on an edit distance algorithm (e.g., Levenshtein distance), using a search engine (e.g., a Lucene engine query) operation.
  • the electronic device 201 may calculate an edit distance between the entity name included in the candidate text and the at least one person's name stored in the memory 220 in the form of a list.
  • the edit distance may be calculated using a distance predefined for each of the initial consonant, medial vowel, and final consonant of a Korean syllable.
  • the electronic device 201 may determine the entity name of a candidate text whose similarity, computed from the edit distance, exceeds a specified reference value as the person's name stored in the contact list 515.
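  • As an illustration of such a jamo-aware edit distance, a minimal weighted-Levenshtein sketch follows; the cost table, the hand-decomposed jamo sequences (for “Im In-sam” vs. “Im Im-sam”), and the use of uniform insertion/deletion costs are illustrative assumptions:

```python
# Substitution costs between Korean jamo; the values here are illustrative only.
JAMO_COST = {("ㅣ", "ㅢ"): 0.3, ("ㄴ", "ㅁ"): 0.5}


def sub_cost(a, b):
    if a == b:
        return 0.0
    return JAMO_COST.get((a, b), JAMO_COST.get((b, a), 1.0))


def weighted_edit_distance(seq1, seq2):
    """Levenshtein distance over jamo sequences with per-jamo substitution
    costs, as one hypothetical reading of the predefined per-jamo distances."""
    n, m = len(seq1), len(seq2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,      # deletion
                dp[i][j - 1] + 1.0,      # insertion
                dp[i - 1][j - 1] + sub_cost(seq1[i - 1], seq2[j - 1]),
            )
    return dp[n][m]


# Jamo decomposition is assumed to be done elsewhere (e.g., with a jamo
# library); the sequences below are hand-decomposed for illustration.
print(weighted_edit_distance(list("ㅇㅣㅁㅇㅣㄴㅅㅏㅁ"),
                             list("ㅇㅣㅁㅇㅣㅁㅅㅏㅁ")))  # -> 0.5
```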
  • For example, the electronic device 201 may confirm that the name of a person stored in the contact list, for example, “Im Im-sam”, exists in the n candidate texts 565, for example, “Lee Min-sam”, “Call Im In-sam”, “Call Im Im-sam”, or “Call Im Im-sam Im”.
  • In this case, the priority setting module 570 may set the probability of “Call Im Im-sam” to be higher than those of the other candidate texts, for example, “Lee Min-sam”, “Call Im In-sam”, and “Call Im Im-sam Im”.
  • the operation of setting the probability of “Call Im Im-sam” to be higher than those of the other candidate texts may be an operation of increasing the probability that “Im Im-sam” is predicted as a word corresponding to the voice signal.
  • the electronic device 201 may determine the candidate text with the high probability, for example, “Call Im Im-sam”, as the target text and perform a function of the electronic device 201 according to the user's utterance, for example, a call function 580 of making a call to “Im Im-sam”.
  • the electronic device 201 may display a user interface including a processing result for performing a call function on a display (eg, the display 231 of FIG. 2 ).
  • according to various embodiments, the above-described operation of setting the probability of “Call Im Im-sam” to be higher than those of the other candidate texts may be omitted.
  • In this case, based on confirming that the name of a person stored in the contact list, for example, “Im Im-sam”, exists in the n candidate texts 565, the electronic device 201 may determine “Call Im Im-sam” as the target text and perform a function of the electronic device 201 according to the user's utterance, for example, a call function 580 of making a call to “Im Im-sam”.
  • FIG. 6 is a diagram 600 for explaining a method of recognizing a person's name stored in the contact information of the electronic device 201 according to a user's utterance, according to various embodiments of the present disclosure.
  • FIG. 6 is a view for explaining a method of performing voice recognition when “Juni Kim” is stored in a contact list according to various embodiments of the present disclosure.
  • the electronic device may receive a voice signal according to a user utterance, for example, “Call Juni Kim”, through a microphone (e.g., the microphone 241 of FIG. 2).
  • the electronic device 201 may linearly combine, in a specified decoding method (e.g., a shallow fusion method), the probability of a contact language model (e.g., the contact language model 525 of FIG. 5) pre-generated through the WFST model generation module (e.g., the WFST model generation module 310 of FIG. 3) and the probability of the language model learned by the ASR module and, based on this, generate a plurality of candidate texts with a high probability, for example, four candidate texts 610.
  • the four candidate texts 610 may include “Call Junhee Kim”, “Call Juni Kim”, “Call Junhee Kim”, and “Call Juni Kim” (distinct Korean spellings of similar-sounding names).
  • For example, the probabilities for the three “Call Junhee Kim” variants may be “-0.568751”, “-1.504663”, and “-2.433038”, respectively, and the probability for “Call Juni Kim” may be “-3.106113”.
  • the electronic device 201 may check whether the name of a person stored in the contact list exists in the four candidate texts 610, for example, “Call Junhee Kim”, “Call Juni Kim”, “Call Junhee Kim”, or “Call Juni Kim”.
  • For example, based on personal data sync service (PDSS) information, the electronic device 201 may check whether the name of a person stored in the contact list exists in the four candidate texts 610, for example, “Call Junhee Kim”, “Call Juni Kim”, “Call Junhee Kim”, or “Call Juni Kim”.
  • As described above, “Juni Kim” 620 may be stored in the contact list, and based on this, the electronic device 201 may check that “Juni Kim” 620 is included in the four candidate texts 610.
  • In this case, the electronic device 201 may set (630) the probability of “Call Juni Kim” to be higher than the probabilities of the other candidate texts (e.g., the “Call Junhee Kim” variants).
  • For example, the electronic device 201 may set (635) the probability of “Call Juni Kim” from “-3.106113” to “-0.000001”.
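  • Using the illustrative log probabilities of FIG. 6, operations 630 and 635 reduce to the small sketch below; the “(1)/(2)/(3)” suffixes only disambiguate the transliteration variants, and the substring-based matching is a simplified assumption:

```python
candidates = {  # log probabilities of the four candidate texts 610
    "Call Junhee Kim (1)": -0.568751,
    "Call Junhee Kim (2)": -1.504663,
    "Call Junhee Kim (3)": -2.433038,
    "Call Juni Kim": -3.106113,
}
contact_names = ["Juni Kim"]  # e.g., from the PDSS-synced contact list

for text in candidates:
    if any(name in text for name in contact_names):
        candidates[text] = -0.000001  # operation 635: boost the match

target_text = max(candidates, key=candidates.get)  # operation 640
print(target_text)  # -> "Call Juni Kim"
```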
  • the electronic device 201 may determine (640) “Call Juni Kim”, which has the high probability, as the target text and execute a function corresponding to the user utterance, for example, “Call Juni Kim”. For example, the electronic device 201 may perform a call function of making a call to “Juni Kim”.
  • the electronic device 201 may display a user interface including a processing result for performing a call function on a display (eg, the display 231 of FIG. 2 ).
  • according to various embodiments, the above-described operation of setting (635) the probability of “Call Juni Kim” from “-3.106113” to “-0.000001” may be omitted.
  • In this case, when “Juni Kim” 620 is stored in the contact list, the electronic device 201 may determine (640) “Call Juni Kim” as the target text and execute a function corresponding to the user's utterance, for example, “Call Juni Kim”.
  • FIG. 7 is a diagram 700 for explaining a method of recognizing the name of an application installed in the electronic device 201 according to a user's utterance, according to various embodiments of the present disclosure.
  • FIG. 7 is a diagram for explaining a method of recognizing an application name, for example, “hogaengnono” installed in an electronic device (eg, the electronic device 201 of FIG. 2 ).
  • the electronic device 201 may receive a voice signal according to a user's utterance, for example, “Hogaengnono open”, through a microphone (e.g., the microphone 241 of FIG. 2).
  • the electronic device 201 may linearly combine, in a specified decoding method (e.g., a shallow fusion method), the probability of an application language model (not shown) (e.g., a language model for the list of applications installed in the electronic device 201) pre-generated through a WFST model generation module (e.g., the WFST model generation module 310 of FIG. 3) and the probability of the language model learned by the ASR module and, based on this, generate a plurality of candidate texts with a high probability, for example, four candidate texts 710.
  • the four candidate texts 710 may include “Hogaengnono open” and three similar-sounding variants with different Korean spellings.
  • For example, the probabilities for the three variants may be “-0.568751”, “-1.504663”, and “-2.433038”, respectively, and the probability for “Hogaengnono open” may be “-3.106113”.
  • the electronic device 201 may check whether the name of an application installed in the electronic device 201 exists in the four candidate texts 710. For example, based on personal data sync service (PDSS) information, the electronic device 201 may check whether the name of an application installed in the electronic device 201 exists in the four candidate texts 710. As described above, “hogaengnono” 720 may be stored in the application list, and based on this, the electronic device 201 may check that “hogaengnono” 720 is included in the four candidate texts 710.
  • In this case, the electronic device 201 may set (730) the probability of “Hogaengnono open” to be higher than the probabilities of the other candidate texts. For example, the electronic device 201 may set (735) the probability of “Hogaengnono open” from “-3.106113” to “-0.000001”.
  • the electronic device 201 may determine (740) “Hogaengnono open”, which has the high probability, as the target text and execute a function corresponding to the user utterance, for example, “Hogaengnono open”. For example, the electronic device 201 may perform a function of executing the “hogaengnono” application.
  • the electronic device 201 may display a user interface according to the execution of the “Hogaengnono” application on the display (eg, the display 231 of FIG. 2 ).
  • according to various embodiments, the above-described operation of setting (735) the probability of “Hogaengnono open” from “-3.106113” to “-0.000001” may be omitted.
  • In this case, when “hogaengnono” 720 is installed in the electronic device 201, the electronic device 201 may determine (740) “Hogaengnono open” as the target text and execute a function of running the “hogaengnono” application.

Abstract

According to various embodiments disclosed in the present document, an electronic device comprises a microphone, a memory, and a processor operatively connected to the microphone and the memory, wherein the processor may be configured to: divide each of a plurality of named entities stored in the memory into sub-words; generate a weighted finite state transducer (WFST) model for the sub-words of each of the plurality of named entities on the basis of a sub-word sequence; receive, via the microphone, a speech signal corresponding to a user's utterance; generate a plurality of candidate text items related to the speech signal by using the WFST model; when at least one named entity among the plurality of named entities is included in the plurality of candidate text items, set the probability values of the candidate text items corresponding to the at least one named entity to be higher than the probability values of the other candidate text items so that the candidate text items corresponding to the at least one named entity are predicted as words corresponding to the speech signal; and determine, as the target text, the candidate text items whose probability values have been set high. Various other embodiments are also possible, in addition to those disclosed in the present document.
PCT/KR2022/001821 2021-03-29 2022-02-07 Electronic device and speech recognition method using same WO2022211257A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210040515A KR20220135039A (ko) 2021-03-29 2021-03-29 Electronic device and method for performing voice recognition using same
KR10-2021-0040515 2021-03-29

Publications (1)

Publication Number Publication Date
WO2022211257A1 true WO2022211257A1 (fr) 2022-10-06

Family

ID=83459310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/001821 WO2022211257A1 (fr) 2022-02-07 Electronic device and speech recognition method using same

Country Status (2)

Country Link
KR (1) KR20220135039A (fr)
WO (1) WO2022211257A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070045748A (ko) * 2005-10-28 2007-05-02 삼성전자주식회사 Apparatus and method for detecting named entities
US20180254036A1 (en) * 2015-11-06 2018-09-06 Alibaba Group Holding Limited Speech recognition method and apparatus
KR20200097993A (ko) * 2019-02-11 2020-08-20 삼성전자주식회사 Electronic device and control method thereof
KR20190083629A (ko) * 2019-06-24 2019-07-12 엘지전자 주식회사 Speech recognition method and speech recognition device
KR20210001937A (ko) * 2019-06-28 2021-01-06 삼성전자주식회사 Device for recognizing user's voice input and operation method thereof

Also Published As

Publication number Publication date
KR20220135039A (ko) 2022-10-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22781378

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22781378

Country of ref document: EP

Kind code of ref document: A1