CN112639965A - Speech recognition method and device in an environment comprising a plurality of devices - Google Patents

Speech recognition method and device in an environment comprising a plurality of devices

Info

Publication number
CN112639965A
Authority
CN
China
Prior art keywords
speaker
speech recognition
speech
recognition
score
Prior art date
Legal status
Pending
Application number
CN201980055917.2A
Other languages
Chinese (zh)
Inventor
曹根硕
卢在英
邢知远
张东韩
李在原
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd
Priority claimed from PCT/KR2019/013903 (published as WO2020085769A1)
Publication of CN112639965A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L17/12 Score normalisation
    • G10L2015/221 Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An Artificial Intelligence (AI) system that utilizes machine learning algorithms such as deep learning, and applications of the AI system, are provided. A speech recognition method, performed by a speech recognition device to perform speech recognition in a space in which a plurality of speech recognition devices exist, includes: extracting a speaker's speech signal from an input audio signal; obtaining a first speaker recognition score indicating a similarity between the speech signal and a speech signal of a registered speaker; and outputting a speech recognition result for the speech signal based on the first speaker recognition score and a second speaker recognition score obtained from another speech recognition device among the plurality of speech recognition devices.

Description

Speech recognition method and device in an environment comprising a plurality of devices
Technical Field
The present disclosure relates to a speech recognition method and apparatus and, for example, to a method by which one speech recognition apparatus, selected in an environment including a plurality of speech recognition apparatuses, recognizes speech and outputs the result.
Background
As electronic devices that combine various functions have been developed, electronic devices equipped with a voice recognition function have been released to improve operability. A voice recognition function allows a user to control a device simply by speaking, without a separate button operation or contact with a touch module.
With a voice recognition function, for example, a portable terminal such as a smartphone or a home appliance such as a TV or a refrigerator can place a call or compose a text message without the user pressing a separate button, and various functions such as directions, internet search, and alarm setting can be easily configured.
Further, an Artificial Intelligence (AI) system may refer to, for example, a computer system with human-level intelligence. Unlike existing rule-based intelligent systems, an AI system trains itself autonomously, makes decisions, and becomes increasingly intelligent. The more an AI system is used, the more its recognition rate improves and the more accurately it understands user preferences; accordingly, existing rule-based intelligent systems are gradually being replaced by deep-learning-based AI systems.
AI technology includes machine learning (e.g., deep learning) and meta techniques that utilize machine learning.
Machine learning may refer to, for example, algorithmic techniques that autonomously classify and learn features of input data. Meta techniques may refer to, for example, techniques that utilize machine learning algorithms such as deep learning, and may include technical fields such as language understanding, visual understanding, inference/prediction, knowledge representation, and motion control.
AI technology is applied to various fields, as follows. Language understanding may refer to, for example, techniques for recognizing and applying/processing human language and characters, and includes natural language processing, machine translation, dialog systems, question answering, speech recognition/synthesis, and the like. Inference/prediction may refer to, for example, techniques for obtaining information and performing logical inference and prediction, and includes knowledge/probability-based reasoning, optimization prediction, preference-based planning, recommendation, and the like. Knowledge representation may refer to, for example, techniques for automatically converting human experience information into knowledge data, and includes knowledge construction (data generation/classification), knowledge management (data utilization), and the like.
Disclosure of Invention
Technical problem
According to embodiments of the present disclosure, the voice recognition apparatus closest to a user is correctly selected from among a plurality of voice recognition apparatuses in a space where the plurality of voice recognition apparatuses exist, so that the selected voice recognition apparatus can provide a service that meets the user's needs.
Technical scheme
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description.
According to an example embodiment of the present disclosure, a voice recognition method performed by a voice recognition apparatus to perform voice recognition in a space in which a plurality of voice recognition apparatuses exist includes: extracting a speaker's voice signal from an input audio signal; obtaining a first speaker recognition score indicative of a similarity between the speech signal and a speech signal of a registered speaker; and outputting a speech recognition result for the speech signal based on the first speaker recognition score and a second speaker recognition score obtained from another speech recognition device among the plurality of speech recognition devices.
Drawings
The above and other aspects, features and advantages of particular embodiments of the present disclosure will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flowchart of a method of selecting one voice recognition device in a space where a plurality of voice recognition devices exist and performing voice recognition, according to the related art;
FIG. 2A is a diagram illustrating an example speech recognition system according to an embodiment of the present disclosure;
FIG. 2B is a diagram illustrating an example speech recognition system according to an embodiment of the present disclosure;
FIG. 2C is a diagram illustrating an example speech recognition system, according to an embodiment of the present disclosure;
FIG. 3A is a block diagram illustrating an example speech recognition device according to an embodiment of the present disclosure;
FIG. 3B is a block diagram illustrating an example speech recognition device according to an embodiment of the present disclosure;
FIG. 3C is a block diagram illustrating an example speech recognition device, according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram illustrating an example speech recognition method according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating an example processor in accordance with an embodiment of the present disclosure;
FIG. 6 is a flow diagram illustrating an example speech recognition method according to an embodiment of the present disclosure;
FIG. 7 is a flow diagram of an example speech recognition method according to an embodiment of the present disclosure;
FIG. 8 is a diagram illustrating an example in which a speech recognition apparatus outputs a speech recognition result according to an embodiment of the present disclosure;
FIG. 9A is a diagram illustrating an example in which a speech recognition system outputs a speech recognition result according to an embodiment of the present disclosure;
FIG. 9B is a diagram illustrating an example in which a speech recognition system outputs a speech recognition result according to an embodiment of the present disclosure;
FIG. 10A is a diagram illustrating an example in which a speech recognition system outputs a speech recognition result according to an embodiment of the present disclosure; and
FIG. 10B is a diagram illustrating an example in which a speech recognition system outputs a speech recognition result according to an embodiment of the present disclosure.
Best mode
According to an example embodiment of the present disclosure, a voice recognition method performed by a voice recognition apparatus to perform voice recognition in a space in which a plurality of voice recognition apparatuses exist includes: extracting a speaker's voice signal from an input audio signal; obtaining a first speaker recognition score indicative of a similarity between the speech signal and a speech signal of a registered speaker; and outputting a speech recognition result for the speech signal based on the first speaker recognition score and the second speaker recognition score obtained from another speech recognition device of the plurality of speech recognition devices.
According to an example embodiment of the present disclosure, a voice recognition apparatus of a plurality of voice recognition apparatuses located in the same space includes: a receiver comprising a receiving circuit, wherein the receiver is configured to receive an input audio signal; a processor configured to control the speech recognition device to perform the following operations: extracting a speech signal of a speaker from an input audio signal and obtaining a first speaker recognition score indicating a similarity between the speech signal and a speech signal of a registered speaker; and an outputter including an output circuit, wherein the outputter is configured to output a speech recognition result for the speech signal, wherein the processor is further configured to control the outputter to output the speech recognition result for the speech signal based on the first speaker recognition score and the second speaker recognition score obtained from another speech recognition device among the plurality of speech recognition devices.
According to an example embodiment of the present disclosure, a voice recognition method of performing voice recognition performed by an apparatus connected to a plurality of voice recognition devices located in the same space includes: obtaining a first speaker recognition score indicative of a similarity between a speech signal received by the first speech recognition device and a speech signal of a registered speaker; obtaining a second speaker recognition score indicative of a similarity between the speech signal received by the second speech recognition device and the speech signal of the registered speaker; determining a device of the first and second speech recognition devices that is closer to the speaker based on the first and second speaker recognition scores; and outputting a speech recognition result for the first speech signal to the first speech recognition device based on the device closer to the speaker being determined as the first speech recognition device.
According to an example embodiment of the present disclosure, an apparatus connected to a plurality of voice recognition devices located in the same space includes: a communicator, comprising: a communication circuit configured to receive a voice signal from each of a first voice recognition device and a second voice recognition device; and a processor configured to control the apparatus to perform the following operations: obtaining a first speaker recognition score indicative of a similarity between a speech signal received by the first speech recognition device and a speech signal of a registered speaker; obtaining a second speaker recognition score indicative of a similarity between the speech signal received by the second speech recognition device and the speech signal of the registered speaker; and determining a device closer to the speaker of the first and second speech recognition devices based on the first and second speaker recognition scores, wherein the processor is further configured to control the apparatus to output a speech recognition result for the first speech signal to the first speech recognition device based on the device closer to the speaker being determined as the first speech recognition device.
According to an example embodiment of the present disclosure, there is provided a speech recognition system including a plurality of speech recognition devices located in the same space and an apparatus connected to the plurality of speech recognition devices, wherein, among the plurality of speech recognition devices, a first speech recognition device is configured to receive a first speech signal for an utterance of a speaker and transmit the first speech signal to the apparatus, wherein, among the plurality of speech recognition devices, a second speech recognition device is configured to receive a second speech signal for the same utterance of the speaker and transmit the second speech signal to the apparatus, and wherein the apparatus is configured to: obtaining a first speaker recognition score indicative of a similarity between the first speech signal and speech signals of registered speakers; obtaining a second speaker recognition score indicative of a similarity between the second speech signal and the speech signal of the registered speaker; determining a device of the first and second speech recognition devices that is closer to the speaker based on the first and second speaker recognition scores; and based on the device closer to the speaker being determined as the first speech recognition device, outputting a speech recognition result for the first speech signal to the first speech recognition device.
Detailed Description
Hereinafter, various example embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. The present disclosure may, however, be embodied in many different forms and is not limited to the example embodiments of the disclosure described herein. For clarity of description of the present disclosure, portions irrelevant to the description may be omitted, and the same reference numerals in the drawings denote the same elements.
It will be understood that when an area is referred to as being "connected to" another area, it can be directly connected to the other area or electrically connected to the other area through intervening areas. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features or components, but do not preclude the presence or addition of one or more other features or components.
The expression "according to an embodiment" used in the present disclosure does not necessarily indicate the same embodiment of the present disclosure.
Embodiments of the disclosure may be described in terms of functional block components and various processing steps. Some or all of the functional blocks may be implemented by any number of hardware and/or software components configured to perform the specified functions. For example, functional blocks according to the present disclosure may be implemented by one or more microprocessors or by circuit components for predetermined functions. Additionally, for example, functional blocks according to the present disclosure may be implemented using any programming or scripting language. The functional blocks may be implemented in algorithms that are executed on one or more processors. Further, the embodiments described herein may employ any number of related-art techniques for electronic configuration, signal processing and/or control, data processing, and the like. The words "module" and "configuration" are used broadly and are not limited to mechanical or physical embodiments of the disclosure.
Furthermore, the connecting lines or connectors shown in the various figures presented are intended to represent example functional relationships and/or physical or logical combinations between the various elements. It should be noted that many alternative or additional functional relationships, physical connections or logical connections may be present in a practical device.
Throughout the disclosure, the expression "at least one of a, b or c" means only a, only b, only c, both a and b, both a and c, both b and c, all or a variation thereof.
Hereinafter, the present disclosure will be described in more detail with reference to the accompanying drawings.
Since voice recognition technology has recently been installed on various devices, it may be required to select the device closest to the user from among the various devices and have the selected device perform voice recognition. A related-art voice recognition apparatus uses a method of selecting the closest apparatus based on the signal-to-noise ratio (SNR) of a received voice signal.
FIG. 1 is a flowchart illustrating a method, performed by a voice recognition system including a plurality of voice recognition devices, of selecting one voice recognition device and performing voice recognition, according to the related art. The related-art speech recognition system may select a speech recognition device based on the SNR, which represents the ratio of the actual speech to the noise of the surrounding environment.
Specifically, the related-art voice recognition system may receive voice signals from a plurality of voice recognition apparatuses (S110). The speech recognition system may determine the SNR of each speech signal by analyzing the received signal: it may receive an audio signal including a speech signal and noise and determine the energy ratio of the speech signal to the noise. The voice recognition system may then select, from among the voice recognition devices, the device that received the voice signal having the highest SNR (S120); that is, the speech recognition device at which the received speech signal is strongest. The voice recognition system may output a voice recognition result through the selected voice recognition device (S130).
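The related-art flow S110 to S130 described above can be sketched as follows. This is an illustrative sketch, not code from the patent; the function names, device ids, and energy values are all assumptions:

```python
import math

def snr_db(speech_energy: float, noise_energy: float) -> float:
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(speech_energy / noise_energy)

def select_device_by_snr(signals: dict) -> str:
    """S120: pick the device whose received audio has the highest SNR.

    `signals` maps a device id to the (speech_energy, noise_energy) pair
    measured from the audio that device received in S110.
    """
    return max(signals, key=lambda d: snr_db(*signals[d]))

# Hypothetical energy measurements from two devices hearing the same utterance.
devices = {
    "tv": (4.0, 0.5),      # near the speaker: strong speech relative to noise
    "fridge": (1.0, 0.5),  # farther away: weaker speech, same ambient noise
}
closest = select_device_by_snr(devices)  # S130: this device outputs the result
```

In a quiet room the energy ratio tracks distance well, which is exactly the assumption the related-art method relies on.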
In a quiet environment, the SNR decreases as the distance between the speaker and the speech recognition device increases. Therefore, the related-art voice recognition method shown in FIG. 1 can relatively accurately select the device closest to the speaker from among a plurality of voice recognition devices located in a quiet environment. In a noisy everyday environment, however, the related-art method of selecting the device closest to the speaker based on the SNR is limited in that its performance is significantly reduced.
According to an embodiment of the present disclosure, in order to address this limitation of the voice recognition system performing the voice recognition method illustrated in FIG. 1, a voice recognition system that performs voice recognition based on speaker recognition may be provided. FIG. 2A is a diagram illustrating an example speech recognition system according to an embodiment of the present disclosure. FIG. 2B is a diagram illustrating an example speech recognition system according to an embodiment of the present disclosure.
FIG. 2C is a diagram illustrating an example speech recognition system according to an embodiment of the present disclosure.
As shown in FIG. 2A, a speech recognition system according to an embodiment of the present disclosure may include a plurality of speech recognition devices 301a and 301b. The first speech recognition device 301a and the second speech recognition device 301b may be collectively referred to as the speech recognition device 301.
For example, the speech recognition device 301 may be, but is not limited to, a home appliance (e.g., without limitation, a TV, a refrigerator, a washing machine, etc.), a smartphone, a PC, a wearable device, a Personal Digital Assistant (PDA), a media player, a micro server, a Global Positioning System (GPS) device, an electronic book terminal, a digital broadcast terminal, a navigation device, a kiosk, an MP3 player, a digital camera, or another mobile or non-mobile computing device.
The voice recognition apparatus 301 according to an embodiment of the present disclosure may activate a conversation, receive an audio signal including a voice signal uttered by the speaker 10, and perform voice recognition on the voice signal. The speech recognition device 301 may output the speech recognition result.
As shown in fig. 2A, the first voice recognition device 301a and the second voice recognition device 301b may be connected by wire or wirelessly, and may share data.
Each of the first and second voice recognition devices 301a and 301b according to an embodiment of the present disclosure may obtain a speaker recognition score based on a received voice signal. The speaker recognition score may represent the similarity between the received speech signal and the speech signal of a previously registered speaker. The first speech recognition device 301a and the second speech recognition device 301b according to an embodiment of the present disclosure may share the speaker recognition scores that each device obtains.
As the distance between the speaker and a speech recognition device increases, the measured speaker recognition score decreases. Therefore, when a registered speaker speaks in an environment including a plurality of speech recognition devices, the speaker recognition score obtained by a speech recognition device closer to the speaker is higher than the score obtained by a device farther from the speaker. Since the speaker recognition score is computed from the features of the speech signal, this property holds even in a very noisy environment: even when a registered speaker speaks in a very noisy environment, the score obtained by the closer device may still be higher than the score obtained by the farther device. Therefore, in a realistic, very noisy environment, the method according to an embodiment of the present disclosure of selecting the nearest device based on speaker recognition scores may be more accurate than the related-art method of selecting the nearest device based on the SNR.
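One common way to realize such a speaker recognition score is the cosine similarity between a feature vector extracted from the utterance and the registered speaker's enrolled feature vector. The patent does not specify the similarity measure, so the following sketch, including the toy feature vectors, is purely illustrative:

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical speaker feature vectors (real systems use learned embeddings).
registered = [0.9, 0.1, 0.4]       # enrolled (registered) speaker
near_capture = [0.88, 0.12, 0.42]  # clean capture by the closer device
far_capture = [0.5, 0.6, 0.1]      # degraded capture by the farther device

near_score = cosine_similarity(near_capture, registered)
far_score = cosine_similarity(far_capture, registered)
# The closer device's score is higher, so that device would be selected.
```

Because the score compares the *shape* of the speech features against the enrolled speaker rather than raw signal energy, it degrades more gracefully with ambient noise than an SNR comparison does.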
For example, the first speech recognition device 301a may determine which speech recognition device is closer to the speaker 10 based on a first speaker recognition score obtained by the first speech recognition device 301a and a second speaker recognition score obtained by the second speech recognition device 301b. When it is determined that the first speech recognition device 301a is the speech recognition device closest to the speaker 10, the first speech recognition device 301a may output the speech recognition result.
In addition, as shown in FIG. 2B, a voice recognition system according to an embodiment of the present disclosure may include a first voice recognition device 301a, a second voice recognition device 301b, and an apparatus 303. The first speech recognition device 301a, the second speech recognition device 301b, and the apparatus 303 may be connected by wire or wirelessly.
The apparatus 303 may share data, resources, and services with the plurality of voice recognition devices 301a and 301b, or may control the voice recognition device 301, manage files, or monitor the entire network. For example, the apparatus 303 may be a mobile or non-mobile computing device, a device that configures a home network by connecting a plurality of voice recognition apparatuses 300, an edge device that processes data at the edge of the network, or a cloudlet representing a small-scale cloud data center.
The speech recognition device 301 may receive an audio signal including a speech signal uttered by the speaker 10 and send the input audio signal to the apparatus 303. Alternatively, the speech recognition device 301 may send to the apparatus 303 the speech signal detected from the input audio signal, a feature of that speech signal, or a speaker recognition score.
The apparatus 303 may obtain a speaker recognition score based on the signal received from the speech recognition device 301. The apparatus 303 may compare the speech signal of a previously registered speaker with the speech signal received from the speech recognition device 301, thereby obtaining a speaker recognition score indicating the degree of similarity between the two speech signals.
The apparatus 303 may determine the speech recognition device closer to the speaker 10 based on the first speaker recognition score obtained for the first speech recognition device 301a and the second speaker recognition score obtained for the second speech recognition device 301b. When the apparatus 303 determines that the first voice recognition device 301a is the voice recognition device closest to the speaker 10, the apparatus 303 may transmit the voice recognition result to the first voice recognition device 301a, or may control the first voice recognition device 301a to output the voice recognition result. The speech recognition device 301 may then output the speech recognition result.
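A minimal sketch of the arbitration the apparatus 303 might perform, assuming the speaker recognition scores for one utterance have already been collected from (or computed on behalf of) each device. The device ids and the rejection threshold are illustrative assumptions, not details from the patent:

```python
def select_closest_device(scores, threshold=0.5):
    """Return the id of the device with the highest speaker recognition score.

    `scores` maps device ids to the speaker recognition score obtained for
    the same utterance. Returns None when no score clears the threshold,
    e.g. when the utterance does not match any registered speaker.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Hypothetical scores shared for one utterance of a registered speaker.
scores = {"device_301a": 0.91, "device_301b": 0.62}
winner = select_closest_device(scores)
# The apparatus would then send the speech recognition result to `winner` only.
```

The threshold models a natural design choice: if even the best score is low, the utterance likely did not come from a registered speaker and no device should respond.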
Although not shown in FIG. 2B, the apparatus 303 may be connected to an external server to update information used for voice recognition, or to update information regarding how the speaker recognition score changes with the distance from the speaker 10 to the voice recognition device 301. The apparatus 303 may transmit a voice signal to the external server and receive the result of voice recognition performed by the external server. The apparatus 303 may then forward the speech recognition result received from the external server to the speech recognition device 301.
In addition, as shown in FIG. 2C, a voice recognition system according to an embodiment of the present disclosure may include a first voice recognition device 301a, a second voice recognition device 301b, and a voice recognition server 305. The speech recognition device 301 and the speech recognition server 305 may be connected by wire or wirelessly.
The voice recognition server 305 according to an embodiment of the present disclosure may share data with the voice recognition device 301. A speech recognition device 301 according to an embodiment of the present disclosure may activate a conversation and receive an audio signal including a speech signal uttered by a speaker 10. The speech recognition device 301 may send the input audio signal to the speech recognition server 305. The voice recognition device 301 may transmit a voice signal detected from an input audio signal to the voice recognition server 305. The speech recognition device 301 may transmit the features of the speech signal or the speaker recognition score detected from the input audio signal to the speech recognition server 305.
The speech recognition server 305 may obtain a speaker recognition score based on the signal received from the speech recognition device 301. The speech recognition server 305 may compare the speech signal of the previously registered speaker with the speech signal received from the speech recognition device 301, thereby obtaining a speaker recognition score indicating the degree of similarity between the two speech signals.
The speech recognition server 305 may determine which speech recognition device is closer to the speaker 10 based on the first speaker recognition score obtained by the first speech recognition device 301a and the second speaker recognition score obtained by the second speech recognition device 301b. When the speech recognition server 305 determines that the first voice recognition device 301a is the voice recognition device closest to the speaker 10, the voice recognition server 305 may transmit the voice recognition result to the first voice recognition device 301a, or may control the first voice recognition device 301a to output the voice recognition result.
The speech recognition server 305 may perform speech recognition based on the signal received from the speech recognition device 301. For example, the voice recognition server 305 may perform voice recognition on a voice signal detected from an audio signal input from the voice recognition apparatus 301. The speech recognition server 305 may transmit the speech recognition result to the speech recognition device 301. The speech recognition device 301 may output the speech recognition result.
As shown in fig. 2A, 2B, and 2C, in the voice recognition system according to the embodiment of the present disclosure, when a registered speaker utters a voice command, each of a plurality of voice recognition devices may calculate a speaker recognition score with respect to the voice command. The speaker recognition score may differ depending on the distance between the speaker and the speech recognition device, and may therefore be used to select the device closest to the speaker. In the voice recognition system according to the embodiment of the present disclosure, the selected voice recognition device may recognize the voice command of the speaker, perform an operation corresponding to the voice recognition result, and thus provide a service capable of satisfying the demands of the user.
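The score-based selection described above can be sketched as follows. The device names, score values, and function name are hypothetical illustrations and are not taken from the disclosure:

```python
def select_closest_device(scores):
    """Select the device assumed to be closest to the speaker: the one that
    reported the highest speaker recognition score for the shared utterance."""
    if not scores:
        raise ValueError("no speaker recognition scores available")
    return max(scores, key=scores.get)

# Hypothetical scores reported by two devices for one utterance of a
# registered speaker; the nearer device tends to score higher.
reported = {"first_device": 0.81, "second_device": 0.64}
closest = select_closest_device(reported)
```

Ties and near-ties would require an additional policy (e.g., a fixed device priority), which the disclosure leaves open.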
In addition, the speech recognition system according to the embodiment of the present disclosure may be notified of information about the position of the speech recognition apparatus in advance. The speech recognition system may perform adaptive training on the speaker/distance information using at least one of "location information of the speech recognition device" or "distance between the speaker and the speech recognition device estimated based on the speaker recognition score". The speaker/distance information may include previously stored information for changes in speaker recognition scores as a function of distance between the speaker and the speech recognition device. For example, the speaker/distance information may include a base table map, an updated table map, or a data recognition model, which will be described in more detail below with reference to fig. 4 and 5.
In addition, the voice recognition apparatus according to the embodiment of the present disclosure may collect external environment information of the voice recognition apparatus by transmitting a pulse signal, and perform adaptive training on a previously stored registered speaker model and/or speaker/distance information related to a voice signal of a registered speaker based on the external environment information.
In addition, since the position of the speaker changes when the user moves while speaking, the voice recognition system according to the embodiment of the present disclosure may utilize previously stored speaker/distance information so that another voice recognition device may continuously perform voice recognition.
As shown in fig. 2A, 2B, and 2C, a speech recognition system according to an embodiment of the present disclosure may include a plurality of speech recognition devices, and may further include an apparatus and/or a speech recognition server. Hereinafter, for convenience of description, a voice recognition method performed by the "voice recognition apparatus" will be described. However, some or all of the operations of the voice recognition device described below may be performed by an apparatus connecting the voice recognition devices or by the voice recognition server, and may be performed in part by a plurality of voice recognition devices.
Fig. 3A is a block diagram illustrating an example speech recognition device, according to an embodiment of the present disclosure. Fig. 3B is a block diagram illustrating an example speech recognition device, according to an embodiment of the present disclosure.
Fig. 3C is a block diagram illustrating an example speech recognition device, according to an embodiment of the present disclosure.
As shown in fig. 3A, a speech recognition device 301 according to an embodiment of the present disclosure may include a receiver (e.g., including receiver circuitry) 310, a processor (e.g., including processing circuitry) 320, and an outputter (e.g., including output circuitry) 330. However, the speech recognition device 301 may be implemented with more or fewer components than those shown in fig. 3A. For example, as shown in fig. 3B, a speech recognition device 301 according to an embodiment of the present disclosure may further include a communicator (e.g., including communication circuitry) 340 and a memory 350.
Also, fig. 3A, 3B, and 3C illustrate that the voice recognition apparatus 301 includes one processor 320 for convenience, but embodiments of the present disclosure are not limited thereto, and the voice recognition apparatus 301 may include a plurality of processors. When the speech recognition device 301 includes multiple processors, the operations of the processor 320 described below may be performed by the multiple processors individually.
The receiver 310 may include various receiver circuits and receive audio signals. For example, the receiver 310 may directly receive an audio signal by converting external sound into electro-acoustic data using a microphone. The receiver 310 may also receive an audio signal transmitted from an external device. In fig. 3A and 3B, the receiver 310 is included in the voice recognition device 301, but the receiver 310 according to another embodiment of the present disclosure may be included in a separate device and connected to the voice recognition device 301 by wire and/or wirelessly.
The receiver 310 may activate a session for receiving an audio signal based on the control of the processor 320. The session may refer to a period during which the speech recognition device 301 receives an audio signal, from start to end. Activating a session may refer to, for example, the speech recognition device 301 beginning to receive audio signals. The receiver 310 may transmit the input audio signal to the processor 320 while maintaining the session.
In addition, the receiver 310 may receive user input for controlling the speech recognition device 301. The receiver 310 may include various receiver circuits such as, for example and without limitation, a user input device including a touch panel to receive a touch of a user, a button to receive a pressing operation of a user, a wheel to receive a rotating operation of a user, a keyboard, a dome switch, and the like. The receiver 310 may receive the user input received through a separate user input device without directly receiving the user input.
For example, receiver 310 may receive user input for storing a particular speaker as a registered speaker and user input for activating a session.
The processor 320 may include various processing circuits, and extracts a voice signal from an input audio signal input from the receiver 310 and performs voice recognition on the voice signal. In an embodiment of the present disclosure, the processor 320 may extract frequency characteristics of a speech signal from an input audio signal and perform speech recognition using an acoustic model and a language model. The frequency characteristic may refer to, for example, a distribution of frequency components of the acoustic input, wherein the frequency components are extracted by analyzing a frequency spectrum of the acoustic input. Thus, as shown in FIG. 3B, the speech recognition device 301 may also include a memory 350 that stores acoustic models and language models.
In an embodiment of the present disclosure, processor 320 may obtain a speaker recognition score from the speech signal. The speaker recognition score may indicate a similarity between the received speech signal and the speech signal of the registered speaker.
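The disclosure does not specify how this similarity is computed; a common choice for comparing speaker feature vectors is cosine similarity, sketched below under that assumption (the feature vectors shown are hypothetical):

```python
import math

def speaker_recognition_score(features, registered_features):
    """Cosine similarity between a feature vector of the received speech
    signal and the registered speaker's stored feature vector.
    Values near 1.0 indicate a close match."""
    dot = sum(a * b for a, b in zip(features, registered_features))
    norm_a = math.sqrt(sum(a * a for a in features))
    norm_b = math.sqrt(sum(b * b for b in registered_features))
    return dot / (norm_a * norm_b)

# A vector compared against itself scores (approximately) 1.0.
score = speaker_recognition_score([0.2, 0.9, 0.1], [0.2, 0.9, 0.1])
```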
The processor 320 may determine whether the speaker of the speech signal is a registered speaker based on the speaker recognition score obtained from the received speech signal. The processor 320 may determine whether to maintain the session based on the determination result.
For example, the processor 320 may set the session to be maintained for a previously determined session duration when the session is activated, and end after the session duration. When the speaker of the voice signal detected from the input audio signal received when the session is activated is a registered speaker, the processor 320 may reset the session to be activated for a previously determined extension time and to end after the extension time.
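The session timing described above can be sketched as follows. The concrete durations are hypothetical, since the disclosure only says they are "previously determined"; an explicit `now` argument is used so the sketch is deterministic:

```python
import time

SESSION_DURATION = 10.0   # hypothetical initial session length, in seconds
EXTENSION_TIME = 5.0      # hypothetical extension granted to a registered speaker

class Session:
    """Tracks whether the device keeps listening: the session ends after
    SESSION_DURATION unless a registered speaker's speech resets it."""

    def __init__(self, now=None):
        now = time.monotonic() if now is None else now
        self.deadline = now + SESSION_DURATION

    def on_speech(self, speaker_is_registered, now=None):
        now = time.monotonic() if now is None else now
        if speaker_is_registered:
            # Reset the session so it stays active for the extension time.
            self.deadline = now + EXTENSION_TIME

    def is_active(self, now=None):
        now = time.monotonic() if now is None else now
        return now < self.deadline
```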
The processor 320 may determine a speech recognition device of the plurality of speech recognition devices that is closest to the speaker based on the speaker recognition score. When it is determined that the voice recognition device 301 is closest to the speaker, the processor 320 may control the outputter 330 to output the voice recognition result.
For example, processor 320 in accordance with embodiments of the present disclosure may obtain a first speaker recognition score from a speech signal received by receiver 310. The processor 320 may control the outputter 330 to output a voice recognition result for the voice signal received by the receiver 310 based on the second speaker recognition score and the first speaker recognition score obtained from another voice recognition device among the plurality of voice recognition devices.
The processor 320 according to an embodiment of the present disclosure may determine the device closer to the speaker, among the voice recognition device 301 and the other voice recognition device, based on a result of comparing the first speaker recognition score with the second speaker recognition score. When it is determined that the voice recognition device 301 is the device closer to the speaker, the processor 320 may control the outputter 330 to output the voice recognition result.
The processor 320 according to an embodiment of the present disclosure may further determine the device closer to the speaker, among the voice recognition device 301 and the other voice recognition device, considering the location of the voice recognition device 301, the location of the other voice recognition device, and previously stored information on the change in speaker recognition score according to the distance between the speaker and the voice recognition device 301.
The processor 320 according to an embodiment of the present disclosure may determine the device closer to the speaker, among the voice recognition device 301 and the other voice recognition device, in consideration of the speaker/distance information, the first speaker recognition score, and the second speaker recognition score. The speaker/distance information may include previously stored information on the change in speaker recognition score according to the distance between the speaker and the speech recognition device 301. In this case, when the first speaker recognition score is equal to or greater than the threshold value, the processor 320 according to an embodiment of the present disclosure may update the speaker/distance information based on the result of determining which device is closer to the speaker.
The outputter 330 may include various output circuits and output a result of speech recognition performed on the speech signal. The outputter 330 may notify the user of the result of the voice recognition or transmit the result to an external device (e.g., a smartphone, a home appliance, a wearable device, an edge device, a server, etc.). For example, the outputter 330 may include a display capable of outputting a video signal and/or a speaker capable of outputting an audio signal.
The outputter 330 may perform an operation corresponding to the result of the voice recognition. For example, the voice recognition apparatus 301 may determine a function of the voice recognition apparatus 301 corresponding to the result of the voice recognition and output a screen for performing the function through the outputter 330. The voice recognition apparatus 301 may transmit a keyword corresponding to a voice recognition result to an external server, receive information related to the transmitted keyword from the server, and output the information on a screen through the outputter 330.
The communicator 340 of fig. 3B may include various communication circuits and communicate with an external device, apparatus, or server through wired communication or wireless communication. The communicator 340 may receive an audio signal, a voice signal, a feature of the voice signal, a speaker recognition score, a voice recognition result, etc. from an external device. The communicator 340 may transmit the audio signal, the voice signal, the feature of the voice signal, the speaker recognition score, or the voice recognition result to the external device. The communicator 340 according to an embodiment of the present disclosure may include various modules including various communication circuits (such as, for example and without limitation, a short-range communication module, a wired communication module, a mobile communication module, a broadcast receiving module, etc.).
The memory 350 of fig. 3B may store an acoustic model for performing voice recognition, a language model, a registered speaker model for a voice signal of a registered speaker for performing speaker recognition, a voice recognition history, speaker/distance information on a relationship between a distance between a speaker and a voice recognition device and a speaker recognition score, location information of the voice recognition device, and the like.
As shown in fig. 3C, a speech recognition device 301 according to an embodiment of the present disclosure may include a communicator (e.g., including communication circuitry) 340 and a processor (e.g., including processing circuitry) 320. The block diagram in fig. 3C may also be applied to the apparatus 303 and the speech recognition server 305 shown in fig. 2B and 2C. The communicator 340 and the processor 320 of fig. 3C correspond to the communicator 340 and the processor 320 of fig. 3A and 3B, and thus redundant description may not be repeated here.
The voice recognition device 301 according to an embodiment of the present disclosure may receive a voice signal from each of the first voice recognition device and the second voice recognition device through the communicator 340.
The speech recognition device 301 may obtain a first speaker recognition score based on a first speech signal received from the first speech recognition device. The first speaker recognition score may indicate a similarity between the first speech signal and a speech signal of a registered speaker. The speech recognition device 301 may obtain a second speaker recognition score based on a second speech signal received from a second speech recognition device. The second speaker recognition score may indicate a similarity between the second speech signal and the speech signal of the registered speaker.
The voice recognition device 301 according to the embodiment of the present disclosure may obtain the speaker recognition score directly from each of the first voice recognition device and the second voice recognition device through the communicator 340.
The speech recognition device 301 may determine the device of the first speech recognition device and the second speech recognition device that is closer to the speaker based on the first speaker recognition score and the second speaker recognition score.
When the device closer to the speaker is determined to be the first voice recognition device, the voice recognition device 301 may control the communicator 340 to transmit a voice recognition result for the first voice signal to the first voice recognition device.
Hereinafter, an example operation method of the voice recognition device 301 according to an embodiment of the present disclosure will be described. Each operation of the method described below may be performed by the corresponding components of the above-described voice recognition device 301. For convenience of description, only the case where the voice recognition device 301 is the subject of the operations is described, but the following description may also be applied to the case where an apparatus connecting a plurality of voice recognition devices, or a voice recognition server, is the subject of the operations.
FIG. 4 is a flow diagram illustrating an example speech recognition method according to an embodiment of the present disclosure.
In operation S410, the voice recognition device 301 according to an embodiment of the present disclosure may extract a voice signal of a speaker from an input audio signal. The speech recognition device 301 may be located in the same space as other speech recognition devices. A plurality of speech recognition devices being located in the same space may mean, for example, that the plurality of speech recognition devices are located within a range in which a speech signal produced by an utterance of a speaker can be received.
In operation S420, the voice recognition apparatus 301 according to the embodiment may obtain a first speaker recognition score indicating a similarity between a voice signal and a voice signal of a registered speaker.
The registered speaker may be the primary user of the speech recognition device 301. For example, when the voice recognition device 301 is a smartphone, the owner of the smartphone may be a registered speaker, and when the voice recognition device 301 is a home appliance, a family member living in a house in which the home appliance is located may be a registered speaker. The speech recognition device 301 may register a speaker based on user input or store a predetermined speaker as a registered speaker as a default value. The voice recognition device 301 may store one speaker as a registered speaker, and may store a plurality of speakers as registered speakers.
In an embodiment of the present disclosure, the speech recognition device 301 may store speech characteristics of a specific speaker as registered speaker information. For example, the speech recognition device 301 may extract and store registered speaker information from feature vectors extracted from a plurality of speech signals uttered by the specific speaker before the session is activated.
In an embodiment of the present disclosure, the speech recognition device 301 may calculate a speaker recognition score indicating a similarity between previously stored registered speaker information and newly generated speaker information. The speech recognition device 301 may determine whether the speaker of the speech signal is a registered speaker based on the result of comparing the calculated speaker recognition score with a predetermined threshold.
The voice recognition apparatus 301 may obtain a candidate speaker recognition score indicating a degree of similarity between the voice signal of the registered speaker and the voice signal received in operation S410. When there are a plurality of registered speakers, the voice recognition apparatus 301 may obtain a plurality of candidate speaker recognition scores indicating the degree of similarity between the voice signal extracted in operation S410 and the voice signal of each of the plurality of registered speakers. The voice recognition apparatus 301 may obtain a plurality of candidate speaker recognition scores with respect to a plurality of registered speakers by comparing the features of the voice signal received in operation S410 with the features of the voice signals of all the registered speakers.
The speech recognition device 301 may select a first registered speaker corresponding to the first candidate speaker recognition score having the highest value from among the plurality of candidate speaker recognition scores (speaker identification). When the first candidate speaker recognition score is greater than or equal to the threshold, the speech recognition device 301 may determine the first candidate speaker recognition score as the first speaker recognition score. When the first candidate speaker recognition score is less than the threshold, the speech recognition device 301 may end the process without outputting a speech recognition result for the speech signal received in operation S410. That is, the voice recognition device 301 may perform voice recognition only when a registered speaker speaks, i.e., only when the speaker recognition score is equal to or greater than the threshold (speaker authentication).
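Selecting the best-matching registered speaker and applying the threshold (the identification and authentication steps above) can be sketched as follows; the threshold value and speaker names are hypothetical:

```python
SPEAKER_THRESHOLD = 0.7  # hypothetical acceptance threshold

def identify_registered_speaker(candidate_scores, threshold=SPEAKER_THRESHOLD):
    """candidate_scores maps each registered speaker to the similarity
    between that speaker's stored voice features and the received signal.
    Returns (speaker, score) for the best candidate, or None if even the
    best score falls below the threshold (the utterance is then ignored)."""
    if not candidate_scores:
        return None
    best = max(candidate_scores, key=candidate_scores.get)
    best_score = candidate_scores[best]
    if best_score < threshold:
        return None
    return best, best_score
```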
The speech recognition apparatus 301 according to the embodiment of the present disclosure can filter the utterance of another person interrupting the utterance of the speaker by speaker recognition.
In addition, the voice recognition device 301 according to the embodiment of the present disclosure may obtain the second speaker recognition score obtained by a voice recognition device other than the voice recognition device 301 among the plurality of voice recognition devices. The speech recognition device 301 may obtain the second speaker recognition score from at least one of another speech recognition device, an apparatus connected to the speech recognition devices, a server, or an external memory. The second speaker recognition score may be a speaker recognition score obtained for the same utterance of the speaker from which the speech signal in operation S410 was extracted. That is, the second speaker recognition score may indicate, for the same utterance, the similarity between the speech signal received by the other speech recognition device and the speech signal of the registered speaker.
In operation S430, the voice recognition device 301 may output a voice recognition result for the voice signal based on the second speaker recognition score and the first speaker recognition score.
The speech recognition device 301 may determine a device closer to the speaker among the speech recognition device 301 and the other speech recognition device based on a result of comparing the first speaker recognition score with the second speaker recognition score. When the voice recognition device 301 is determined to be a device closer to the speaker, the voice recognition device 301 may output a voice recognition result for the voice signal received in operation S410.
For example, when the first speaker recognition score is greater than the second speaker recognition score, the speech recognition device 301 may determine that the speech recognition device 301 is closer to the speaker than another speech recognition device. When it is determined that the voice recognition device 301 is the device closest to the speaker, the voice recognition device 301 may output a voice recognition result for the voice signal received in operation S410.
In this example, the speech recognition device 301 may also consider not only the speaker recognition score, but also at least one of the location of the speech recognition device 301, the location of another speech recognition device, or speaker/distance information when determining the device closest to the speaker. The speaker/distance information may include previously stored information for changes in speaker recognition scores obtained by the speech recognition device 301 as the distance between the speaker and the speech recognition device 301 changes.
In determining the device closest to the speaker, the speech recognition device 301 may predict the distance between the speaker and the speech recognition device 301 taking into account at least one of the first speaker recognition score, the second speaker recognition score, or the speaker/distance information. The speech recognition device 301 may determine the device of the speech recognition device 301 and the other speech recognition device that is closer to the speaker based on the predicted distance. The voice recognition device 301 can determine the device closer to the speaker among the voice recognition device 301 and the other voice recognition device by comparing the predicted distance between the speaker and the voice recognition device 301 with the predicted distance between the speaker and the other voice recognition device.
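A minimal sketch of this distance prediction, assuming a monotonically decreasing score-to-distance table. The table values below are hypothetical (the actual values of Table 1 are not reproduced in this text), as are the function names:

```python
# Hypothetical score-to-distance table: (distance_m, expected_score) pairs,
# with scores decreasing as distance grows.
SPEAKER_DISTANCE_TABLE = [(0.5, 0.95), (1.0, 0.85), (2.0, 0.70), (4.0, 0.50)]

def predict_distance(score, table=SPEAKER_DISTANCE_TABLE):
    """Linearly interpolate a speaker-to-device distance from a speaker
    recognition score, clamping outside the table's range."""
    if score >= table[0][1]:
        return table[0][0]
    if score <= table[-1][1]:
        return table[-1][0]
    for (d0, s0), (d1, s1) in zip(table, table[1:]):
        if s1 <= score <= s0:
            # Interpolate between the two surrounding table rows.
            t = (s0 - score) / (s0 - s1)
            return d0 + t * (d1 - d0)

def closer_device(score_a, score_b, table=SPEAKER_DISTANCE_TABLE):
    """Compare predicted distances for two devices sharing one table."""
    da, db = predict_distance(score_a, table), predict_distance(score_b, table)
    return "a" if da <= db else "b"
```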
According to embodiments of the present disclosure, a speech recognition system may include a base table that maps speaker recognition scores for an utterance of an unspecified speaker to distances between the speech recognition device and the speaker. Since the distribution of speaker recognition scores over distance may vary from speaker to speaker, the base table may be updated based on the speaker recognition scores of the speaker's actual utterances and the predicted distances.
For example, a base table mapping speaker recognition scores to distances between the speech recognition device 301 and the speaker may include information as shown in Table 1 below.
[Table 1] (table image not reproduced; it matches speaker recognition scores of a registered speaker's utterances to distances between the speech recognition device 301 and the registered speaker)
Table 1 above is an example in which speaker recognition scores for utterances of a registered speaker are matched to distances between the speech recognition device 301 and the registered speaker. The speech recognition device 301 according to the embodiment of the present disclosure may generate an extended table based on Table 1 so that the intervals between the distance values in the table become denser. In addition, the voice recognition device 301 may include a table mapping that reflects speaker recognition scores that vary according to the external environment, based on information about the external environment.
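One way to build such an extended (denser) table is linear interpolation between the base table's rows; the base values and step size below are hypothetical:

```python
# Hypothetical base table: (distance_m, speaker_recognition_score) rows.
BASE_TABLE = [(0.5, 0.95), (1.0, 0.85), (2.0, 0.70), (4.0, 0.50)]

def interpolate_score(distance, table):
    """Score expected at a given distance, linearly interpolated."""
    if distance <= table[0][0]:
        return table[0][1]
    if distance >= table[-1][0]:
        return table[-1][1]
    for (d0, s0), (d1, s1) in zip(table, table[1:]):
        if d0 <= distance <= d1:
            t = (distance - d0) / (d1 - d0)
            return s0 + t * (s1 - s0)

def extend_table(table, step=0.5):
    """Generate a denser table whose distance values advance by `step`."""
    rows, d, end = [], table[0][0], table[-1][0]
    while d <= end + 1e-9:
        rows.append((round(d, 3), interpolate_score(d, table)))
        d += step
    return rows

dense = extend_table(BASE_TABLE)  # rows at 0.5 m, 1.0 m, ..., 4.0 m
```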
In an environment where a plurality of voice recognition devices exist, the devices may share their position information with one another. Each speech recognition device may obtain a speaker recognition score for an utterance input to it, and predict the distance between the speaker and the device based on the location of the device, the speaker recognition score, and the base table mapping. In addition, the voice recognition system according to an embodiment of the present disclosure may further update the table mapping, updated for each speaker, based on information stored in an account related to the speaker information.
The voice recognition device 301 according to an embodiment of the present disclosure may perform adaptive training on speaker/distance information based on at least one of location information of the voice recognition device, an obtained speaker recognition score, a distance predicted based on the speaker recognition score, or information about an external environment.
For example, the voice recognition device 301 may output the pulse signal to the outside of the voice recognition device 301. The voice recognition device 301 may output a pulse signal toward a space in which a plurality of voice recognition devices including the voice recognition device 301 are located. The voice recognition apparatus 301 may obtain information on the external environment of the voice recognition apparatus 301 by analyzing the audio signal received in response to the pulse signal. The information about the external environment may include, for example, but not limited to, time delay of the received signal, noise, and the like. The voice recognition device 301 may update the speaker information previously stored or the speaker/distance information related to the voice signal of the registered speaker based on the information on the external environment.
The voice recognition device 301 according to an embodiment of the present disclosure may recognize information about the space in which the voice recognition device 301 is used, using a pulse signal. For example, the pulse signal transmitted from the voice recognition device 301 may finally be received back by the voice recognition device 301 after reflecting off a wall or an object in the space. Accordingly, the voice recognition device 301 may recognize the echo characteristics of sound in the space by analyzing the audio signal received in response to the pulse signal.
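The round-trip geometry behind such a pulse measurement can be illustrated with a simple calculation; the speed of sound (about 343 m/s at room temperature) is an assumption for illustration, not a value from the disclosure:

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound at about 20 degrees C

def reflector_distance(round_trip_delay_s):
    """Distance to a reflecting wall or object, from the delay between
    emitting a pulse and receiving its echo (the sound travels out and back,
    so the one-way distance is half the round-trip path)."""
    return round_trip_delay_s * SPEED_OF_SOUND_M_S / 2.0

d = reflector_distance(0.02)  # a 20 ms round trip corresponds to about 3.43 m
```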
The speech recognition device 301 according to the embodiment of the present disclosure can adjust the threshold value used in the speaker recognition operation based on the audio signal received in response to the pulse signal. The speech recognition device 301 may update the speaker/distance information based on the adjusted threshold. For example, the voice recognition device 301 may adjust the table values regarding the speaker recognition scores according to the distance between the speaker and the voice recognition device 301 based on the adjusted threshold value. The speech recognition device 301 may change the reference value for determining the speaker recognition score according to the external environment.
For another example, when the first speaker recognition score is greater than or equal to the threshold, the speech recognition device 301 may update the speaker/distance information based on the first speaker recognition score and a predicted distance between the speaker and the speech recognition device 301.
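Such an update could, for instance, blend the new observation into the nearest table row with an exponential moving average; the threshold, learning rate, and table values below are hypothetical:

```python
LEARNING_RATE = 0.1  # hypothetical adaptation rate

def update_table_entry(table, predicted_distance, observed_score,
                       threshold=0.7, lr=LEARNING_RATE):
    """Adaptively update the (distance, score) row nearest the predicted
    distance with an exponential moving average, but only when the observed
    score is at or above the threshold."""
    if observed_score < threshold:
        return table  # below threshold: leave the table unchanged
    i = min(range(len(table)),
            key=lambda k: abs(table[k][0] - predicted_distance))
    d, s = table[i]
    table[i] = (d, (1 - lr) * s + lr * observed_score)
    return table
```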
Speech recognition systems according to embodiments of the present disclosure may include Artificial Intelligence (AI) systems that utilize machine learning algorithms, such as deep learning. For example, a speech recognition system according to embodiments of the present disclosure may use AI to identify a speaker, perform speech recognition, and select a device closest to the speaker.
The AI-related functions according to the present disclosure may be performed using a processor and a memory. The processor may include one processor or a plurality of processors. In this case, the processor or processors may include, for example, but are not limited to, a general purpose processor (such as a CPU), an AP, a Digital Signal Processor (DSP), a graphics-dedicated processor (such as a GPU), a Visual Processing Unit (VPU), an AI-dedicated processor (such as an NPU), and the like. The processor or processors may be controlled to process input data according to predefined operating rules or AI models stored in the memory. When the processor or processors are AI-dedicated processors, the AI-dedicated processors may be designed with a hardware structure specialized for processing a particular AI model.
Predefined operating rules or AI models may be generated through training. Generating through training may refer to, for example, training a basic AI model using a plurality of training data by a learning algorithm, such that a predefined operating rule or AI model set to perform a desired characteristic (or purpose) is generated. Such training may be performed in the device itself that performs AI according to the present disclosure, or may be performed by a separate server and/or system. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
The AI model may include a plurality of neural network layers. Each of the plurality of neural network layers may have a plurality of weight values, and each layer performs a neural network operation between the operation result of the previous layer and its plurality of weights. The plurality of weights of the plurality of neural network layers may be optimized by a training result of the AI model. For example, the plurality of weights may be updated to reduce or minimize a loss value or cost value obtained in the AI model during the training process. The AI model may include a Deep Neural Network (DNN) such as, for example and without limitation, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), a deep Q-network, and the like, without limitation to the above examples.

Fig. 5 is a block diagram illustrating an example processor 320 in accordance with an embodiment of the present disclosure.
Some or all of the blocks shown in fig. 5 may be implemented in hardware and/or software configurations that perform the specified functions. The functions performed by the blocks shown in fig. 5 may be implemented by one or more microprocessors or by a circuit configuration for the functions, and may include executable program elements. For example, some or all of the blocks shown in fig. 5 may be software modules configured in various programming or scripting languages for execution on processor 320.
After the conversation is activated, when the speaker inputs an utterance targeted for speech recognition, a speech preprocessor (e.g., including processing circuitry and/or executable program elements) 510 may extract a speech signal corresponding to the utterance from the input audio signal. The speech preprocessor 510 may send the extracted speech signal to the feature extractor 520.
The feature extractor (e.g., including processing circuitry and/or executable program elements) 520 may extract, from the detected speech signal, speaker recognition feature vectors robust for speaker recognition and speech recognition feature vectors robust for speech recognition.
The speaker recognizer (e.g., including processing circuitry and/or executable program elements) 530 may generate information about the speaker of the speech signal using the speaker recognition feature vectors, the posterior information received in real time from a speech recognition decoder for performing speech recognition, a universal background model, and total variability transformation information obtained through training based on big data. The speaker recognizer 530 may compare the generated speaker information with information 540 of a previously registered speaker and calculate a speaker recognition score indicating a similarity between the speaker information and the registered speaker information 540. In an embodiment of the present disclosure, the information 540 about the speech signal of the registered speaker may be stored in advance.
Speaker recognizer (e.g., comprising processing circuitry and/or an executable program element) 530 may determine whether the speaker of the detected speech signal and the previously registered speaker are the same by comparing the speaker recognition score to a predetermined threshold. The speaker recognizer 530 may transmit the determination result to the device selection calculator 550.
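As a minimal sketch of how such a score could be computed and thresholded — the cosine similarity and the threshold value here are illustrative assumptions, not the disclosure's exact scoring method:

```python
import numpy as np

def speaker_recognition_score(utterance_vec, enrolled_vec):
    """Cosine similarity between the speaker recognition feature vector
    of the current utterance and the enrolled speaker's stored vector."""
    num = float(np.dot(utterance_vec, enrolled_vec))
    den = float(np.linalg.norm(utterance_vec) * np.linalg.norm(enrolled_vec))
    return num / den

def is_registered_speaker(score, threshold=0.7):
    # Attribute the utterance to the enrolled speaker when the score
    # meets or exceeds the (hypothetical) predetermined threshold.
    return score >= threshold
```

In an i-vector style system, `utterance_vec` and `enrolled_vec` would be the vectors produced via the universal background model and total variability transformation mentioned above.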
The device selection calculator (e.g., including processing circuitry and/or executable program elements) 550 may receive speaker recognition scores for a plurality of speech recognition devices and select the speech recognition device closest to the speaker based on the speaker recognition scores. For example, the device selection calculator 550 may select the speech recognition device having the highest speaker recognition score as the speech recognition device closest to the speaker.
The device selection calculator 550 may select the speech recognition device closest to the speaker in consideration of not only the speaker recognition scores but also the speaker/distance information 570. In determining the device closest to the speaker, the device selection calculator 550 may predict the distance between each speech recognition device and the speaker in consideration of the speaker recognition scores obtained from the plurality of speech recognition devices and the speaker/distance information 570. The device selection calculator 550 may select the speech recognition device closest to the speaker based on the predicted distance between each speech recognition device and the speaker.
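A compact sketch of this selection logic — picking the device with the highest score, or, when a score-to-distance predictor is available, the device with the smallest predicted distance. The names and structure are hypothetical:

```python
def select_closest_device(device_scores, predict_distance=None):
    """device_scores: dict mapping a device id to its speaker
    recognition score for the current utterance.

    Without distance information, the highest-scoring device is taken
    as closest to the speaker. With a score-to-distance predictor
    (standing in for the speaker/distance information 570), the device
    with the smallest predicted distance is selected instead."""
    if predict_distance is None:
        return max(device_scores, key=device_scores.get)
    distances = {dev: predict_distance(s) for dev, s in device_scores.items()}
    return min(distances, key=distances.get)
```

For example, `select_closest_device({'tv': 0.9, 'fridge': 0.6})` selects the TV as the device closest to the speaker.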
In addition, the device selection calculator 550 may update the speaker/distance information 570 based on the predicted distance between each speech recognition device and the speaker and on the speaker recognition scores. The speaker/distance information 570 may include a data recognition model for determining the device closest to the speaker.
For example, the device selection calculator 550 may use the data recognition model with the obtained data as an input value when determining the device closest to the speaker. The data recognition model may be pre-constructed based on a base table mapping speaker recognition scores to distances between the speech recognition device and the speaker. In addition, the device selection calculator 550 may train the data recognition model using the result values output by the data recognition model.
For example, the device selection calculator 550 may train the data recognition model based on utterances of actual speakers. Since the distribution of speaker recognition scores according to distance may vary from speaker to speaker, the data recognition model may be trained based on the actually obtained speaker recognition scores and the predicted distances.
For another example, the device selection calculator 550 may train the data recognition model based on at least one of location information of the speech recognition apparatus or information about the external environment.
The device selection calculator 550 may predict the distance between each speech recognition device and the speaker by applying the speaker recognition score, obtained based on the speech signal input to each speech recognition device, to the data recognition model, and may determine the device closest to the speaker.
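One way such a data recognition model could start out is as the pre-constructed base table mapping scores to distances mentioned above, queried by linear interpolation. The table values and function below are illustrative assumptions:

```python
import bisect

# Hypothetical base table: speaker recognition score typically observed
# at each reference distance (in metres); scores fall as distance grows.
BASE_TABLE = [(0.5, 0.95), (1.0, 0.85), (2.0, 0.70), (4.0, 0.50)]

def predict_distance(score):
    """Predict the speaker-to-device distance from a speaker recognition
    score by linear interpolation over the base table."""
    pts = sorted(BASE_TABLE, key=lambda p: p[1])      # ascending by score
    scores = [s for _, s in pts]
    if score <= scores[0]:
        return pts[0][0]                              # farthest distance
    if score >= scores[-1]:
        return pts[-1][0]                             # nearest distance
    i = bisect.bisect_left(scores, score)
    (d0, s0), (d1, s1) = pts[i - 1], pts[i]
    frac = (score - s0) / (s1 - s0)
    return d0 + frac * (d1 - d0)
```

Training the model from actual utterances, as the disclosure describes, would then amount to refitting the table entries to the speaker's observed score-versus-distance behavior.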
The speech recognition result executor (e.g., including processing circuitry and/or an executable program element) 560 may output a speech recognition result when the speech signal is uttered by a registered speaker and the speech recognition device 301 is determined to be the speech recognition device closest to the speaker. The speech recognition result executor 560 may include a speech recognition decoder. The speech recognition decoder may perform speech recognition through an acoustic model and a language model using the speech recognition feature vectors and generate a speech recognition result. The speech recognition decoder may transmit the posterior information extracted through the acoustic model to the speaker recognizer 530 in real time.
Referring to fig. 5, the speaker information 540 and the speaker/distance information 570 may be stored in the processor 320, but embodiments of the present disclosure are not limited thereto. The speaker information 540, the speaker/distance information 570, the acoustic model, the language model, the speech recognition result, the speaker recognition score, etc. may be stored in the memory 350 of the speech recognition device 301, or may be stored in an external device or an external server.
FIG. 6 is a flowchart illustrating an example method of operating a speech recognition system including multiple speech recognition devices in accordance with an embodiment of the present disclosure. In fig. 6, an example in which the speaker is closer to the first speech recognition device 301a in a space including the first speech recognition device 301a and the second speech recognition device 301b is shown, but embodiments of the present disclosure are not limited thereto. A speech recognition system according to an embodiment of the present disclosure may include three or more speech recognition devices, and the method of fig. 6 may be applied to determine the speech recognition device closest to the speaker from among those devices.
When the speaker speaks, the first and second speech recognition devices 301a and 301b may receive a speech signal corresponding to the utterance (S610 and S601). The first speech recognition device 301a may obtain a first speaker recognition score indicating a similarity between the first speech signal received in operation S610 and the speech signal of the registered speaker (S620). The second speech recognition device 301b may obtain a second speaker recognition score indicating a similarity between the second speech signal received in operation S601 and the speech signal of the registered speaker (S602).
The first speech recognition device 301a and the second speech recognition device 301b may share the obtained speaker recognition score (S630).
The first speech recognition device 301a may determine the device closest to the speaker based on the result of comparing the first speaker recognition score with the second speaker recognition score (S640). When the first speech recognition device 301a is determined to be the device closest to the speaker, the first speech recognition device 301a may output a speech recognition result for the first speech signal (S650).
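The S610–S650 flow can be sketched as follows; the class and function names are hypothetical, and the score function stands in for the per-device speaker recognition described above:

```python
class SpeechRecognitionDevice:
    """Minimal stand-in for one device in the fig. 6 flow."""
    def __init__(self, name, score_fn):
        self.name = name
        self.score_fn = score_fn      # maps a speech signal to a score
        self.score = None

    def receive_utterance(self, signal):
        # S610/S601: receive the speech signal; S620/S602: score it.
        self.score = self.score_fn(signal)
        return self.score

def arbitrate(device_a, device_b, signal):
    """S630: the devices share their scores; S640/S650: the device with
    the higher score, taken as closest to the speaker, outputs the
    speech recognition result."""
    score_a = device_a.receive_utterance(signal)
    score_b = device_b.receive_utterance(signal)
    return device_a.name if score_a >= score_b else device_b.name
```

The same arbitration generalizes to three or more devices by taking the maximum over all shared scores.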
FIG. 7 is a flowchart illustrating an example method of operating a speech recognition system including a plurality of speech recognition devices and a device for connecting the plurality of speech recognition devices in accordance with an embodiment of the present disclosure. In fig. 7, a case where the speaker is closer to the first speech recognition device 301a in a space including the first speech recognition device 301a and the second speech recognition device 301b is shown as an example, but embodiments of the present disclosure are not limited thereto. A speech recognition system according to an embodiment of the present disclosure may include three or more speech recognition devices.
When the speaker speaks, the first and second speech recognition devices 301a and 301b may receive a speech signal corresponding to the utterance (S710 and S720). The first speech recognition device 301a may transmit the first speech signal received in S710 to the device 303 (S731). The second speech recognition device 301b may transmit the second speech signal received in S720 to the device 303 (S733).
The device 303 may obtain a first speaker recognition score indicating a degree of similarity between the first speech signal and the speech signal of the registered speaker and a second speaker recognition score indicating a degree of similarity between the second speech signal and the speech signal of the registered speaker (S730).
The device 303 may determine the speech recognition device closest to the speaker based on a result of comparing the first speaker recognition score with the second speaker recognition score (S740). When determining that the first speech recognition device 301a is the device closest to the speaker, the device 303 may transmit the speech recognition result to the first speech recognition device 301a (S750). The first speech recognition device 301a may output the speech recognition result (S760).
Hereinafter, examples in which the speech recognition device 301 outputs a speech recognition result will be described with reference to fig. 8 to fig. 10B. Figs. 8 to 10B show cases where, for example, the speech recognition device 301 is a TV, a refrigerator, a washing machine, or a smartphone equipped with a speech recognition function, and the speech recognition device 301 recognizes a question or a request uttered by the speaker and outputs a response to the question or performs an operation corresponding to the request. However, embodiments of the present disclosure are not limited to the examples shown in figs. 8 to 10B.
In addition, the speech recognition devices shown in figs. 8 to 10B may independently recognize speech and output a result. Alternatively, a speech recognition device shown in figs. 8 to 10B may be connected to an external device, transmit the input speech to the external device, receive a speech recognition result from the external device, and output the speech recognition result. Figs. 8 to 10B show examples in which the speaker 10 is a registered speaker.
Fig. 8 is a diagram illustrating an example in which speech recognition devices 901, 902, and 903 output a speech recognition result according to an embodiment of the present disclosure.
As shown in fig. 8, when the speaker 10 says, "Can you tell me today's weather forecast?", the plurality of speech recognition devices 901, 902, and 903 may calculate and share speaker recognition scores and determine the speech recognition device closest to the speaker 10. In the case of fig. 8, the speech recognition device 901 is located closest to the speaker 10. Therefore, it may be determined that the speaker recognition score of the speech recognition device 901 is the highest. Alternatively, it may be determined, based on the speaker recognition scores, that the predicted distance from the speaker 10 to the speech recognition device 901 is the shortest. The speech recognition device 901 may output a speech recognition result according to the determination based on the speaker recognition scores. As shown in fig. 8, the speech recognition device 901 may recognize the request of the speaker 10 and perform an operation corresponding to that request, namely outputting a screen of a channel displaying the weather forecast.
Fig. 9A is a diagram showing an example in which the speech recognition system outputs a speech recognition result when the speaker 10 moves while speaking. Fig. 9B is a diagram illustrating an example in which a speech recognition system outputs a speech recognition result according to an embodiment of the present disclosure.
As shown in fig. 9A, the speaker 10 may be closest to the speech recognition device 901 when saying "In the refrigerator" at the beginning of the utterance. As shown in fig. 9B, the speaker 10 may continue the utterance with "what is there?" while moving toward the speech recognition device 902.
Multiple speech recognition devices 901, 902, and 903 may calculate and share speaker recognition scores and determine the speech recognition device closest to speaker 10. In the case of fig. 9B, the speech recognition device 902 is located closest to the speaker 10 at the end of the utterance. Thus, speech recognition device 902 may recognize the question of speaker 10 and output "apple and egg" as a response to the question of speaker 10.
As shown in fig. 9A and 9B, when the speaker 10 moves while speaking, the speech recognition system according to the embodiment of the present disclosure may output the speech recognition result through the speech recognition device closest to the speaker 10 at the end of the speaking. However, embodiments of the present disclosure are not limited thereto, and the speech recognition system may output the speech recognition result through the speech recognition device closest to the speaker 10 at the beginning or in the middle of the utterance.
Fig. 10A is a diagram illustrating an example in which a speech recognition system including devices 1001, 1002, and 1003 outputs a speech recognition result according to an embodiment of the present disclosure. Fig. 10B is a diagram illustrating an example in which a speech recognition system including the devices 1001, 1002, and 1003 outputs a speech recognition result according to an embodiment of the present disclosure.
As shown in fig. 10A, when the speaker 10 says "Show me baseball!!", the plurality of speech recognition devices 1001, 1002, and 1003 may calculate and share speaker recognition scores. The plurality of speech recognition devices 1001, 1002, and 1003 may determine the speech recognition device closest to the speaker 10. In the example of fig. 10A, the speech recognition device 1003 is located closest to the speaker 10. Accordingly, it may be determined that the speaker recognition score of the speech recognition device 1003 is the highest. It may also be determined, based on the speaker recognition scores, that the predicted distance from the speaker 10 to the speech recognition device 1003 is the shortest. The speech recognition device 1003 may output a speech recognition result according to the determination based on the speaker recognition scores. As shown in fig. 10A, the speech recognition device 1003 may recognize the request of the speaker 10 and perform an operation corresponding to that request, namely outputting a screen of the baseball relay channel.
As shown in fig. 10B, after moving from the speech recognition device 1003 toward the speech recognition device 1001, the speaker 10 may say "Show me!!". The plurality of speech recognition devices 1001, 1002, and 1003 may calculate and share speaker recognition scores and determine the speech recognition device closest to the speaker 10. In the example of fig. 10B, since the speech recognition device 1001 is located closest to the speaker 10, the speech recognition device 1001 may recognize the request of the speaker 10 and perform an operation corresponding to the request. The plurality of speech recognition devices 1001, 1002, and 1003 may share past operation histories, speech recognition histories, and the like. Accordingly, the speech recognition device 1001 may output a screen of the baseball relay channel with reference to the history of the baseball relay channel output by the speech recognition device 1003 and the utterance "Show me" of the speaker 10.
Therefore, even when the speaker 10 speaks while moving, the speech recognition system according to the embodiment of the present disclosure can accurately select a nearby device and thereby output a speech recognition result corresponding to the user's intention.
Embodiments of the disclosure may be implemented in a software program comprising instructions stored on a computer-readable storage medium.
The computer may include an image transmission apparatus and an image reception apparatus according to an embodiment of the present disclosure, wherein the image transmission apparatus and the image reception apparatus are apparatuses capable of calling a stored instruction from a storage medium and operating according to an embodiment of the present disclosure according to the called instruction.
The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, a "non-transitory" storage medium does not include a signal and is tangible, but does not distinguish whether data is stored semi-permanently or temporarily on the storage medium.
Furthermore, the electronic device or method according to embodiments of the present disclosure may be provided in a computer program product. The computer program product may be used as a product for conducting a transaction between a seller and a buyer.
The computer program product may include a software program and a computer-readable storage medium having the software program stored thereon. For example, the computer program product may include a software program (e.g., a downloadable application) electronically distributed by a manufacturer of the electronic device or through an electronic marketplace (e.g., Google Play Store™ or App Store™). For electronic distribution, at least a part of the program may be stored on a storage medium or may be temporarily generated. In this case, the storage medium may be a storage medium of a server of the manufacturer, a server of the electronic market, or a relay server that temporarily stores the program.
In a system including a server and a terminal (e.g., an image transmitting apparatus or an image receiving apparatus), the computer program product may include a storage medium of the server or a storage medium of the terminal. Alternatively, when there is a third device (e.g., a smartphone) communicating with the server or the terminal, the computer program product may include a storage medium of the third device. The computer program product may comprise the program itself sent from the server to the terminal or the third device or from the third device to the terminal.
In this case, one of the server, the terminal, and the third device may execute the computer program product to perform the method according to the embodiment of the present disclosure. Two or more of the server, the terminal and the third device may execute the computer program product to distribute the method according to embodiments of the present disclosure.
For example, a server (e.g., a cloud server or an AI server, etc.) may execute a computer program product stored in the server to control a terminal in communication with the server to perform a method according to an embodiment of the present disclosure.
For another example, a third device may execute a computer program product to control a terminal in communication with the third device to perform a method according to embodiments of the present disclosure. For example, the third device may remotely control an image transmitting device or an image receiving device to transmit or receive the package image.
When the third device executes the computer program product, the third device may download the computer program product from the server and execute the downloaded computer program product. The third device may execute the provided computer program product provided in a preloaded manner to perform a method according to embodiments of the present disclosure.
While the disclosure has been shown and described with reference to various exemplary embodiments, it is to be understood that the various exemplary embodiments are intended to be illustrative and not restrictive, and that various changes in form and details may be made therein without departing from the true spirit and full scope of the disclosure.

Claims (15)

1. A speech recognition method performed by a speech recognition device for performing speech recognition in a space in which a plurality of speech recognition devices exist,
the voice recognition method comprises the following steps:
extracting a speaker's voice signal from an input audio signal;
obtaining a first speaker recognition score indicative of a similarity between the speech signal and a speech signal of a registered speaker; and
outputting a speech recognition result for the speech signal based on a second speaker recognition score obtained from another speech recognition device of the plurality of speech recognition devices and based on the first speaker recognition score.
2. The speech recognition method of claim 1, further comprising:
a second speaker recognition score is obtained that,
wherein the second speaker recognition score indicates a similarity between the speech signal received by the other speech recognition device and the speech signal of the registered speaker with respect to the utterance of the speaker.
3. The speech recognition method of claim 1, further comprising:
determining a device closer to a speaker from the speech recognition device and the other speech recognition device based on a result of comparing the first speaker recognition score with the second speaker recognition score, wherein the step of outputting the speech recognition result includes:
based on a device closer to a speaker being determined as the speech recognition device, outputting a speech recognition result for the speech signal.
4. The voice recognition method of claim 1, wherein the step of outputting the voice recognition result comprises:
based on the first speaker recognition score being greater than the second speaker recognition score, a speech recognition result for the speech signal is output.
5. The speech recognition method of claim 3, wherein determining that the device is closer to the speaker comprises:
determining a device closer to a speaker based on a location of the speech recognition device, a location of the other speech recognition device, and previously stored information for a change in speaker recognition score based on a distance between the speaker and the speech recognition device.
6. The speech recognition method of claim 1, further comprising:
outputting a pulse signal to the outside of the voice recognition apparatus;
obtaining information about an external environment of the speech recognition device by analyzing an audio signal received in response to an impulse signal; and
the previously stored information about the speech signal of the registered speaker is updated based on the information about the external environment.
7. The speech recognition method of claim 3, wherein the step of determining the device closer to the speaker comprises:
determining a device closer to the speaker based on previously stored speaker/distance information for a change in speaker recognition score based on a distance between the speaker and the speech recognition device, the first speaker recognition score and the second speaker recognition score,
the method further comprises the following steps: the speaker/distance information is updated based on a result of determining a device closer to the speaker based on the first speaker recognition score being equal to or greater than the threshold.
8. The speech recognition method of claim 3, wherein the step of determining the device closer to the speaker comprises:
predicting a distance between a speaker and the speech recognition device based on previously stored speaker/distance information, a first speaker recognition score and a second speaker recognition score for a change in the distance between the speaker and the speech recognition device for a speaker recognition score; and
determining, in the speech recognition device and the other speech recognition device, a device closer to the speaker based on the predicted distance,
the method further comprises the following steps: the speaker/distance information is updated based on the first speaker recognition score and the predicted distance.
9. The speech recognition method of claim 1, wherein obtaining a first speaker recognition score comprises:
obtaining a plurality of candidate speaker recognition scores indicating similarities between the speech signal and speech signals of a plurality of registered speakers;
selecting a first enrolled speaker corresponding to a first candidate speaker identification score having a highest value of the plurality of candidate speaker identification scores; and
based on the first candidate speaker identification score being equal to or greater than the threshold, a first candidate speaker identification score is obtained as the first speaker identification score.
10. A speech recognition device of a plurality of speech recognition devices located in the same space, the speech recognition device comprising:
a receiver configured to receive an input audio signal;
a processor configured to control the speech recognition device to perform the following operations:
extracting a speaker's speech signal from an input audio signal, an
Obtaining a first speaker recognition score indicative of a similarity between the speech signal and a speech signal of a registered speaker; and
an outputter including an output circuit, wherein the outputter is configured to output a speech recognition result for the speech signal,
wherein the processor is further configured to control the outputter to output a speech recognition result for the speech signal based on the second speaker recognition score and the first speaker recognition score obtained from another speech recognition device of the plurality of speech recognition devices.
11. The speech recognition device of claim 10, wherein the processor is further configured to control the speech recognition device to:
determining a device closer to the speaker from the speech recognition device and the other speech recognition device based on a result of comparing the first speaker recognition score with the second speaker recognition score, and
based on a device closer to a speaker being determined as the speech recognition device, outputting a speech recognition result for the speech signal.
12. A speech recognition method of performing speech recognition performed by an apparatus connected to a plurality of speech recognition devices located in the same space, the speech recognition method comprising:
obtaining a first speaker recognition score indicative of a similarity between a speech signal received by the first speech recognition device and a speech signal of a registered speaker;
obtaining a second speaker recognition score indicative of a similarity between the speech signal received by the second speech recognition device and the speech signal of the registered speaker;
determining a device of the first and second speech recognition devices that is closer to the speaker based on the first and second speaker recognition scores; and
based on the device closer to the speaker being determined as the first speech recognition device, a speech recognition result for the first speech signal is output to the first speech recognition device.
13. The speech recognition method of claim 12, wherein the step of determining a device closer to the speaker comprises:
a device closer to the speaker is determined based on the location of the first speech recognition device, the location of the second speech recognition device, and previously stored information for the speaker recognition score based on a change in distance between the speaker and the speech recognition devices.
14. The speech recognition method of claim 12, wherein the step of determining a device closer to the speaker comprises:
determining a device closer to the speaker based on previously stored speaker/distance information for a change in the speaker recognition score based on the distance between the speaker and the speech recognition device, the first speaker recognition score, and the second speaker recognition score; and
the speaker/distance information is updated based on a predicted distance from the speaker to the first speech recognition device, and the first speaker recognition score is updated based on the first speaker recognition score being equal to or greater than a threshold value.
15. A non-transitory computer-readable recording medium on which a program for executing the method according to claim 1 is stored.
CN201980055917.2A 2018-10-24 2019-10-22 Speech recognition method and device in an environment comprising a plurality of devices Pending CN112639965A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
KR20180127696 2018-10-24
KR10-2018-0127696 2018-10-24
KR10-2019-0110772 2019-09-06
KR1020190110772A KR20200047311A (en) 2018-10-24 2019-09-06 Method And Apparatus For Speech Recognition In Multi-device Environment
PCT/KR2019/013903 WO2020085769A1 (en) 2018-10-24 2019-10-22 Speech recognition method and apparatus in environment including plurality of apparatuses

Publications (1)

Publication Number Publication Date
CN112639965A true CN112639965A (en) 2021-04-09

Family

ID=70733911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980055917.2A Pending CN112639965A (en) 2018-10-24 2019-10-22 Speech recognition method and device in an environment comprising a plurality of devices

Country Status (3)

Country Link
EP (1) EP3797414A4 (en)
KR (1) KR20200047311A (en)
CN (1) CN112639965A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022102893A1 (en) 2020-11-11 2022-05-19 삼성전자주식회사 Electronic device, system, and control method thereof
KR20220099831A (en) * 2021-01-07 2022-07-14 삼성전자주식회사 Electronic device and method for processing user utterance in the electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130073293A1 (en) * 2011-09-20 2013-03-21 Lg Electronics Inc. Electronic device and method for controlling the same
US10026399B2 (en) * 2015-09-11 2018-07-17 Amazon Technologies, Inc. Arbitration between voice-enabled devices
WO2018067528A1 (en) * 2016-10-03 2018-04-12 Google Llc Device leadership negotiation among voice interface devices
US10559309B2 (en) * 2016-12-22 2020-02-11 Google Llc Collaborative voice controlled devices

Also Published As

Publication number Publication date
EP3797414A4 (en) 2021-08-25
EP3797414A1 (en) 2021-03-31
KR20200047311A (en) 2020-05-07

Similar Documents

Publication Publication Date Title
US10607597B2 (en) Speech signal recognition system and method
US11687319B2 (en) Speech recognition method and apparatus with activation word based on operating environment of the apparatus
US9443527B1 (en) Speech recognition capability generation and control
US20200135212A1 (en) Speech recognition method and apparatus in environment including plurality of apparatuses
CN110288987B (en) System for processing sound data and method of controlling the same
TWI619114B (en) Method and system of environment-sensitive automatic speech recognition
CN109643549B (en) Speech recognition method and device based on speaker recognition
JP7173758B2 (en) Personalized speech recognition method and user terminal and server for performing the same
KR102655628B1 (en) Method and apparatus for processing voice data of speech
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN112074900B (en) Audio analysis for natural language processing
US11380326B2 (en) Method and apparatus for performing speech recognition with wake on voice (WoV)
EP3533052B1 (en) Speech recognition method and apparatus
KR102531654B1 (en) Method and device for authentication in voice input
CN114762038A (en) Automatic round description in multi-round dialog
US11830501B2 (en) Electronic device and operation method for performing speech recognition
KR20200051462A (en) Electronic apparatus and operating method for the same
CN112639965A (en) Speech recognition method and device in an environment comprising a plurality of devices
KR20200033707A (en) Electronic device, and Method of providing or obtaining data for training thereof
CN111145735B (en) Electronic device and method of operating the same
US10803868B2 (en) Sound output system and voice processing method
US20230126305A1 (en) Method of identifying target device based on reception of utterance and electronic device therefor
US20230127543A1 (en) Method of identifying target device based on utterance and electronic device therefor
KR20200021400A (en) Electronic device and operating method for performing speech recognition
CN116686046A (en) Electronic apparatus and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210409