CN113490979A - Ambient sound output apparatus, ambient sound output system, ambient sound output method, and non-volatile storage medium - Google Patents

Ambient sound output apparatus, ambient sound output system, ambient sound output method, and non-volatile storage medium

Info

Publication number
CN113490979A
CN113490979A
Authority
CN
China
Prior art keywords
speech
component
original
original sound
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202080006649.8A
Other languages
Chinese (zh)
Other versions
CN113490979B (en)
Inventor
诸星利弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Original Assignee
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd, Toshiba Visual Solutions Corp filed Critical Hisense Visual Technology Co Ltd
Publication of CN113490979A publication Critical patent/CN113490979A/en
Application granted granted Critical
Publication of CN113490979B publication Critical patent/CN113490979B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided are an ambient sound output device, system, method, and non-volatile storage medium capable of easily extracting ambient sound using a voice recognition device. The ambient sound output device is provided with a sound wave detection means that receives original sound with at least one microphone and outputs the original sound as an original sound signal, a speech component acquisition means that extracts a speech component from at least one of the original sound signal and a synthesized signal generated from a plurality of the original sound signals, and an out-of-speech component acquisition means that generates and outputs an out-of-speech component from at least the speech component and the original sound signal.

Description

Ambient sound output apparatus, ambient sound output system, ambient sound output method, and non-volatile storage medium
The present application claims priority to the Japanese patent application entitled "Ambient sound output apparatus, system, method and program" filed with the Japan Patent Office on January 17, 2020, Application No. 2020-.
Technical Field
Embodiments of the present application relate to an ambient sound output apparatus, system, method, and non-volatile storage medium.
Background
In home security technology for ensuring the safety of a house, sensors corresponding to the purpose of the security measure are required. For example, intrusion detection requires a plurality of sensors, such as window sensors and door sensors for detecting the opening and closing of windows and doors. Beyond the trouble of installing sensors at assumed intrusion points in the house, along the movement routes of residents, and so on, maintenance costs such as battery replacement for each sensor are incurred after installation.
In recent years, devices using voice recognition technology, such as smart speakers, have become widespread. A smart speaker can, for example, remotely control home appliances in response to a voice uttered by a person as long as the person is within the range the voice reaches, and typically covers a wide area of several square meters around it. When such a voice recognition device is used, simple home security can be introduced at low cost without a plurality of sensors. The purpose of a voice recognition device is to extract the component of the human voice (referred to as the speech component) from an input sound signal and to convert the language uttered by the person into text (words, sentences, and the like) based on that speech component. To this end, voice recognition devices also introduce techniques for suppressing ambient sound components other than speech (referred to as out-of-speech components) so as to eliminate or reduce them as much as possible.
Prior art documents
Patent document
Patent document 1: Japanese Laid-Open Patent Publication No. 2001-188555
Patent document 2: Japanese Laid-Open Patent Publication No. 2000-222000
Disclosure of Invention
The out-of-speech component can be used in home security technology and the like. In particular, home security technology can be introduced easily by extracting out-of-speech components with smart speakers, which are becoming widespread. Voice recognition technology has also recently been introduced into home appliances such as televisions, so home security technology can be introduced easily using such appliances by extracting out-of-speech components. However, it is hard to say that these advantages have been used effectively.
An object to be solved by the present application is to provide an ambient sound output device, system, method, and non-volatile storage medium that can easily extract ambient sound using a voice recognition device.
An ambient sound output device according to an embodiment of the present application includes a sound wave detection unit that receives original sound with at least one microphone and outputs the original sound as an original sound signal, a speech component acquisition unit that extracts a speech component from at least one of the original sound signal and a synthesized signal generated from a plurality of the original sound signals, and an out-of-speech component acquisition unit that generates and outputs an out-of-speech component from at least the speech component and the original sound signal.
Drawings
Fig. 1 is a block diagram showing an example of the configuration of a system according to a first embodiment;
Fig. 2 is a flowchart for the system of the first embodiment to receive the sound wave of the original sound and perform analysis processing of the out-of-speech component;
Fig. 3 is a schematic diagram showing the filters and signals used until a speech signal is extracted from the original sound signal in the system of the first embodiment;
Fig. 4 is a schematic diagram showing the filters and signals used until an out-of-speech component is extracted from the original sound signal in the system of the first embodiment;
Fig. 5 is a diagram showing the filters and signals used until specific components are extracted from the out-of-speech component in the system of the first embodiment;
Fig. 6 is a block diagram showing an example of the configuration of a sound wave detection unit and an original sound signal processing unit for processing sound waves of original sound received by a plurality of microphones in the second embodiment;
Fig. 7 is a diagram illustrating an example of selecting a microphone by adjusting gains in the second embodiment;
Fig. 8 is a flowchart for the system to extract out-of-speech components from the original sound signal in the second embodiment;
Fig. 9 is a diagram illustrating an example of adjusting the directivity of the received sound wave by adjusting gains in the third embodiment;
Fig. 10 is a flowchart for the system of the fourth embodiment to execute the generation processing of the reference data;
Fig. 11 is a diagram illustrating the time to be counted by the system of the fourth embodiment;
Fig. 12 is a diagram showing a configuration example of an electronic device incorporating an out-of-speech component generation unit according to a modification;
Fig. 13 is a diagram showing an example of the configuration of a system in the case where an external smart speaker according to the modification is used.
Description of the reference numerals
1 … out-of-speech component generation unit, 2 … application processing unit, 3 … sound wave detection unit, 4 … original sound signal processing unit, 5 … voice recognition unit, 6 … storage unit, 7 … control unit, 8 … interface unit, 10 … signal component extraction unit, 11 … speech component acquisition unit, 12 … out-of-speech component acquisition unit, 21 … processing unit, 22 … determination unit, 23 … storage unit, 31 … microphone, 100 … television receiving device, 101 … control unit, 200 … smart speaker, 300 … application device.
Detailed Description
Hereinafter, embodiments will be described with reference to the drawings.
(first embodiment)
In the present embodiment, the following example is explained: an out-of-speech component is generated by removing the speech component from the original sound received by the sound wave detection unit, and the out-of-speech component is used in the application system.
Fig. 1 is a block diagram showing an example of the configuration of a system according to the first embodiment.
The out-of-speech component generation unit 1 is a device that acquires the original sound and generates and outputs an out-of-speech component. The original sound denotes a physical sound wave. The out-of-speech component generation unit 1 includes a sound wave detection unit 3, an original sound signal processing unit 4, and a signal component extraction unit 10.
The sound wave detection unit 3 includes microphones and an analog-to-digital conversion unit (hereinafter referred to as the A/D conversion unit), not shown. The sound wave of the original sound is received by the microphones 31A, 31B, and 31C (referred to collectively as the microphone 31 when no particular distinction is required) and converted into an electrical signal (an analog original sound signal). The analog original sound signal is converted into a digital value by the A/D conversion unit, not shown, and output as the original sound signal. Hereinafter, the original sound signal means an original sound signal based on digital values unless particularly distinguished.
The original sound signal processing unit 4 outputs an original sound signal obtained by synthesizing the original sound signals output from the sound wave detection unit 3 (hereinafter referred to as the synthesized original sound signal). The original sound signal processing unit 4 performs gain adjustment and the like on the original sound signals output from the plurality of microphones 31 and synthesizes them. Here, the original sound signal in the expression "original sound signal output from the microphone 31" means an original sound signal based on digital values. In reality, the original sound signal is output from the A/D conversion unit, but this expression is used below to clarify the source of the original sound signal. The original sound signal processing unit 4 may select an effective microphone from the plurality of microphones 311A, 31B, and 31C by gain adjustment or the like, and may apply a microphone array technique such as beamforming. The original sound signal processing unit 4 may operate in the form of, for example, a digital signal processor (DSP), may operate as software on a computer such as a microcomputer, may operate in the form of hardware such as an IC chip, or may operate in a combination of these forms.
The signal component extraction unit 10 generates a speech component (also referred to as a speech signal) and an out-of-speech component from the original sound signal or the synthesized original sound signal input from the original sound signal processing unit 4, and outputs them. The speech component is a component centered in particular on the components of human speech, and may be an estimated value of the speech component. The out-of-speech component denotes the component obtained by removing the speech component from the original sound. Like the original sound signal processing unit 4, the signal component extraction unit 10 may operate in the form of, for example, a digital signal processor (DSP), may operate as software on a computer such as a microcomputer, may operate in the form of hardware such as an IC chip, or may operate in a combination of these forms. The signal component extraction unit 10 includes the speech component acquisition unit 11 and the out-of-speech component acquisition unit 12.
The speech component acquisition unit 11 extracts a speech component from the original sound signal or the synthesized original sound signal output from the original sound signal processing unit 4. The voice component acquisition unit 11 may include a filter for extracting a voice component used in a general voice recognition device such as a smart speaker, a noise reducer for removing a component other than a voice, and the like.
The out-of-speech component acquisition unit 12 extracts and outputs the out-of-speech component using parameters such as the original sound signal or the synthesized original sound signal output from the original sound signal processing unit 4, the speech component output from the speech component acquisition unit 11, and the gain adjustment amount (also simply referred to as the gain) used in the original sound signal processing unit 4. The out-of-speech component acquisition unit 12 may, as necessary, use the filter that the speech component acquisition unit 11 applied when obtaining the speech component. The information of this filter may be input as additional data from the speech component acquisition unit 11 to the out-of-speech component acquisition unit 12.
The application processing unit 2 analyzes the out-of-speech component input from the signal component extraction unit 10, generates information obtained based on the out-of-speech component (hereinafter referred to as out-of-speech information), and outputs the out-of-speech information to the outside. The out-of-speech information may include information relating to ambient sound; for example, when analysis of the out-of-speech component results in detection of the sound of breaking glass, information such as "glass broken" may be output as the out-of-speech information. The application processing unit 2 may detect abnormal sounds based on the out-of-speech component and output, as out-of-speech information, detection results such as whether an abnormal sound was detected and detection contents such as which abnormal sound occurred from what kind of object. The abnormal sound that can be detected by the application processing unit 2 may be any abnormal sound, such as the sound of a window pane breaking, the sound of footsteps when no one should be at home, the sound of a heavy object collapsing, or the sound of a person falling. The application processing unit 2 may operate as software on a computer such as a microcomputer, or may operate as hardware such as an IC chip.
The processing units 21A, 21B, and 21C (referred to as the processing unit 21 unless otherwise specified) extract necessary components (hereinafter referred to as specific components) from the out-of-speech component. Specifically, a component of a certain specific frequency band is extracted as a specific component from the out-of-speech component based on the frequency characteristics of the out-of-speech component and the like. The specific components extracted by the processing units 21A, 21B, and 21C may differ, and may be determined according to, for example, what kind of abnormal sound is to be detected. The specific component to be extracted may be predetermined in each processing unit 21; in that case, for example, information specifying the frequency band of the specific component may be incorporated into the software or hardware that implements the function of the application processing unit 2. Alternatively, the user or the like may set or select the specific component to be extracted via the interface unit 8 or the like. The processing unit 21 may also calculate, from the out-of-speech component, frequency feature data indicating the specific component included in the abnormal sound to be detected. The processing unit 21 outputs the extracted specific component or the frequency feature data to the subsequent stage.
The storage unit 23 stores data of a specific component (hereinafter, referred to as reference data) in accordance with an event to be detected (hereinafter, referred to as a detection event). The detection event is, for example, a sound of breaking a window, a sound of footsteps of a person when not at home, a sound of collapsing of a heavy object, a sound of falling of a person, or the like. Data of a specific component is acquired in advance for each detection event and is used as reference data for each detection event. The reference data of the specific component may be data represented in a frequency region such as a frequency characteristic or data represented in a time region such as a time signal, and may be an out-of-speech component output by the out-of-speech component acquiring unit 12. The reference data may be frequency feature data calculated from the specific component.
For example, when the detection event is "sound of a window breaking", the reference data may be acquired by actually breaking a window near the sound wave detection unit 3 and storing (recording) the "sound of a window breaking" in the storage unit 23. The sound sample data stored for each detection event in the storage unit 6 described later may also be downloaded to the storage unit 23. Sound sample data may also be loaded into the storage unit 23 from a storage medium such as a CD on which sound data is recorded, or from a sample provided by a server on the Internet or the like. Alternatively, the user may record or edit content transmitted by a television broadcast signal, a radio broadcast, or the like to create sound sample data and store it in the storage unit 23. When the name of the detection event is not included in the sound sample data, the sound sample data is given the name of the detection event and used as the reference data.
When a specific component (or its feature data) is input from the processing unit 21, the determination unit 22 acquires the corresponding reference data from the storage unit 23, compares the input specific component (or its feature data) with the acquired reference data, and outputs, when it is determined that the input specific component (or its feature data) and the acquired reference data match, the name of the detection event or the like given to the matched reference data as the detection event information.
The above is an example in which the specific component corresponds to the reference data 1-to-1, but various combinations of the specific component and the reference data are possible. For example, the number of specific components and the number of reference data may be 1-to-M (M is a natural number of 2 or more). For example, when the frequency region of a specific component overlaps the frequency regions of the out-of-speech components of a plurality of reference data, the specific component and the reference data are 1-to-M.
The number of specific components and the number of reference data may also be M-to-1 (M is a natural number of 2 or more). The processing unit 21 outputs a plurality of (for example, M) specific components to the determination unit 22, and the determination unit 22 compares each input specific component with the corresponding reference data and outputs a plurality of (for example, M) pieces of detection event information. The detection event information output by the determination unit 22 may, for example, be sent by the application processing unit 2 via the Internet to a smartphone designated by the user, thereby notifying the user of the occurrence of an event in the form of detection event information. The function of the determination unit 22 may also be realized by using the trigger word detection (wake-up word detection) or specific word detection by artificial intelligence (AI) techniques provided in a smart speaker.
According to the present embodiment, by using the function of a smart speaker which is widely used, a home security system such as abnormality detection can be constructed.
The voice recognition unit 5 recognizes a voice from the voice component output from the voice component acquisition unit 11 by using a voice recognition technique and converts the voice into a text (a character, a language, or the like). The speech recognition unit 5 outputs a language (hereinafter referred to as a recognized word) converted into a text. The output destination may be an application or an apparatus that uses the recognized word, the recognized word may be displayed on a display unit, not shown, or the recognized word may be output in the form of speech from a speaker, not shown, by speech synthesis or the like. The application or device using the recognized word uses the recognized word as a command for control, for example, and the application or device receiving the recognized word from the voice recognition unit 5 executes control or the like based on the recognized word. Since the voice recognition technology is a well-known common technology, a detailed description is omitted.
The storage unit 6 stores sample data of reference data to be stored in the storage unit 23. As described above in the description of the storage unit 23, the downloaded sound sample data, recorded data, or feature values thereof are stored in the storage unit 6 and supplied to the storage unit 23 as necessary. The storage unit 6 may contain reference data of detection events that are not stored in the storage unit 23. For example, when the user selects "detection event" to be detected by the application processing unit 2 from a remote controller, and the control unit 7 receives a selection signal transmitted from the remote controller via the interface unit 8, the control unit 7 may set reference data of "detection event" stored in the storage unit 6 in the storage unit 23.
The control unit 7 controls each function of the present application system. Specifically, the control unit 7 may control each function based on various control signals including the selection signal input from the interface unit 8. The control unit 7 may control each function based on the recognition word output by the voice recognition unit 5. In fig. 1, data exchange (including control) can be performed also between the control unit 7 and a function block not connected to the control unit 7.
The interface unit 8 performs various communications with the outside of the present application system. Specifically, the interface unit 8 includes various wired and wireless communication interfaces such as a remote controller (hereinafter referred to as a remote controller), infrared communication, a mouse, a keyboard, ethernet, HDMI, Wifi, and 5 th generation mobile communication (5G). For example, the user can use a remote controller to perform various settings and controls on the present application system. The interface unit 8 may be provided with a communication interface for accessing the internet, for example, and may download sound sample data and the like from a server on the internet to the storage unit 23. The interface unit 8 may output detection event information to, for example, a smartphone, a PC, or the like of a user connected to the internet. The interface unit 8 may further include an interface that enables the application processing unit 2 to communicate with an external home appliance, and may control the external home appliance by generating and transmitting a command to the external home appliance via infrared communication in the interface unit 8, for example.
The present system may employ a smart speaker that includes the functions of the speech component acquisition unit 11, the sound wave detection unit 3, the voice recognition unit 5, and the like. A smart speaker is installed in a home and used to recognize human voice. The microphones of the smart speaker (corresponding to the microphones included in the sound wave detection unit 3 of the present embodiment) cover a wide area of several square meters around the smart speaker and constantly monitor the surrounding state. The smart speaker has a configuration for extracting human voice and includes a function corresponding to the speech component acquisition unit 11. In addition, since the smart speaker regards noise, environmental sound, and the like as obstacles when detecting human voice, such obstacles are suppressed by techniques such as noise reduction processing or narrowing the sound-collecting direction by beamforming. For the purpose of confirming the user's intention, the smart speaker also enters a dialogue mode after a wake-up word is detected. In this way, the smart speaker is designed to suppress the out-of-speech component, which forms its background, in order to enhance the human voice (speech component). In the present embodiment, however, the smart speaker is used to extract and utilize the out-of-speech component.
Fig. 2 is a flowchart for the system of this embodiment to receive the sound wave of the original sound and perform analysis processing of the out-of-speech component.
The sound wave detection unit 3 is active in the normal state, receives the sound waves of the original sound around it, generates an original sound signal in electrical form, and outputs the generated original sound signal (step S1). Let the original sound signals output from the microphones 31A, 31B, and 31C be S0a, S0b, and S0c, respectively. The original sound signals input from the sound wave detection unit 3 to the original sound signal processing unit 4 are synthesized and input to the speech component acquisition unit 11 and the out-of-speech component acquisition unit 12 as the synthesized original sound signal S0 (step S2). Here, the synthesized original sound signal S0 may be a value obtained by synthesizing the original sound signals from the respective microphones 31 such that S0 = S0a + S0b + S0c. Alternatively, the original sound signals from the microphones 31 may be kept as the separate values (S0a, S0b, S0c) (referred to as the separated original sound signal). When the separated original sound signal needs to be indicated explicitly, the notation < > is used; that is, <S0> = (S0a, S0b, S0c).
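As an illustrative sketch only (the array contents, lengths, and variable names are assumptions for illustration and are not part of the claimed embodiment), the synthesis in step S2 can be pictured as follows:

```python
import numpy as np

# Placeholder original sound signals from microphones 31A, 31B, 31C
# (length and content are assumptions for illustration only).
s0a = np.zeros(16000)
s0b = np.zeros(16000)
s0c = np.zeros(16000)

# Synthesized original sound signal: S0 = S0a + S0b + S0c
s0 = s0a + s0b + s0c

# Separated original sound signal: <S0> = (S0a, S0b, S0c)
s0_separated = (s0a, s0b, s0c)
```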
Fig. 3 is a schematic diagram showing the filters used until a speech signal is extracted from the original sound signal and the frequency characteristics of the respective signals in the system of this embodiment; in (a) to (d), the horizontal axis represents frequency, and the vertical axis may represent, for example, an amplitude value, a power value, or a relative value thereof (a probability density function or the like). Fig. 3 is a schematic diagram for convenience of explanation and is not intended to show the magnitudes of the numerical values strictly. Fig. 3 (a) is a schematic diagram for the case where the original sound signals are synthesized such that the synthesized original sound signal S0 = S0a + S0b + S0c.
Returning to fig. 2, the speech component acquisition unit 11 performs processing for extracting the speech signal Sv from the input synthesized original sound signal S0 (step S3). Specifically, the synthesized original sound signal S0 is passed through a filter for extracting the speech signal (for example, the schematic diagrams of (b) and (c) of fig. 3) to obtain an estimated value Sv of the speech signal (hereinafter also referred to as the estimated speech signal). The estimated speech signal Sv (for example, the schematic diagram of fig. 3 (d)) is input to the voice recognition unit 5, and the voice recognition unit 5 performs voice recognition processing based on the estimated speech signal Sv, recognizes the language contained in the speech signal, converts it into text, and obtains a recognized word (step S4). The voice recognition unit 5 outputs the obtained recognized word to the outside (step S5).
The synthesized original sound signal S0 is input from the original sound signal processing unit 4 to the out-of-speech component acquisition unit 12, and the estimated speech signal Sv and the additional data α are input from the speech component acquisition unit 11 to the out-of-speech component acquisition unit 12. The additional data α may be the value of the filter applied by the speech component acquisition unit 11 to the synthesized original sound signal S0.
Fig. 4 is a schematic diagram showing the filters used until the out-of-speech component is extracted from the original sound signal and the frequency characteristics of the respective signals in the system of this embodiment; in (a) to (c), the horizontal axis represents frequency, and the vertical axis may represent, for example, an amplitude value, a power value, or a relative value thereof (a probability density function or the like). Like fig. 3, fig. 4 is a schematic diagram for convenience of explanation and is not intended to show the magnitudes of the numerical values strictly. Fig. 4 (a) is a schematic diagram for the case where the original sound signals are synthesized such that the synthesized original sound signal S0 = S0a + S0b + S0c. Fig. 4 (b) shows a filter used to restore the level of the out-of-speech component suppressed in step S3 to its original state, for example a filter obtained by inverting the filter of fig. 3 (c). The value of this filter is taken as, for example, the additional data α. An empirical value obtained from experiments, past data, or the like may also be used as the additional data α.
The out-of-speech component acquisition unit 12 acquires the out-of-speech component Sn (for example, the schematic diagram of fig. 4 (c)) from at least the synthesized original sound signal S0 and the estimated speech signal Sv (step S6). Specifically, for example, the out-of-speech component is obtained as Sn(f) = S0(f) - Sv(f), where Sn(f), S0(f), and Sv(f) denote the values of the out-of-speech component Sn, the synthesized original sound signal S0, and the estimated speech signal Sv at frequency f, respectively, and Sn(f) is obtained over the possible range of frequencies f.
The out-of-speech component Sn may also be obtained by taking the additional data α into account together with the synthesized original sound signal S0 and the estimated speech signal Sv. Specifically, for example, the out-of-speech component is obtained as Sn(f) = S0(f) - Sv(f) × α(f), where α(f) denotes the value of the additional data α at frequency f.
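A minimal sketch of the subtraction in step S6, assuming the signals are compared as magnitude spectra obtained with an FFT and that the additional data α is given as a per-frequency array; the patent does not specify the transform or concrete filter values, so these are assumptions:

```python
import numpy as np

def out_of_speech_component(s0, sv, alpha=None):
    """Estimate Sn(f) = S0(f) - Sv(f) x alpha(f) on magnitude spectra.

    s0:    synthesized original sound signal (time-domain samples)
    sv:    estimated speech signal Sv from the speech component
           acquisition unit 11 (same length as s0)
    alpha: additional data (per-frequency correction filter); if None,
           alpha(f) = 1 for all f, i.e. Sn(f) = S0(f) - Sv(f)
    """
    S0 = np.abs(np.fft.rfft(s0))
    Sv = np.abs(np.fft.rfft(sv))
    if alpha is None:
        alpha = np.ones_like(S0)
    # Clip negative values, since a magnitude spectrum cannot be negative
    return np.maximum(S0 - Sv * alpha, 0.0)
```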
In the present embodiment, it is assumed that the user has configured the application processing unit 2 to detect, for example, a detection event A. The reference data corresponding to the detection event A is stored in the storage unit 23A.
The out-of-speech component acquired by the out-of-speech component acquisition unit 12 is input to the application processing unit 2, the processing unit 21A extracts the specific component from the out-of-speech component, and the extracted specific component is output to the determination unit 22A (step S8).
Fig. 5 is a schematic diagram showing filters and signals until a specific component is extracted from an out-of-speech component in the system of this embodiment in a frequency domain, and in (a) to (c) of fig. 5, the horizontal axis may be represented by frequency, and the vertical axis may be represented by, for example, an amplitude value, a power value, or a relative value thereof (a probability density function or the like). Fig. 5 is a schematic diagram for convenience of explanation, and the diagram is not intended to strictly show the magnitude of the numerical values.
The processing unit 21A may include a filter, such as the filter fna in fig. 5 (b), that extracts a low-frequency component while suppressing high-frequency components. Specifically, the filter fna is used when, for example, the sound of a person's footsteps is to be detected as the detection event A. The filter fna may be formed by using human footstep sounds or the like recorded in advance and may be set in the processing unit 21A beforehand. The processing unit 21A in which the filter fna is set may be selected when the user configures detection of the detection event A. Fig. 5 (c) is a schematic diagram of the specific component Sna extracted in step S8. When the user configures detection of a detection event B or a detection event C different from the detection event A, the processing unit 21B or the processing unit 21C, including filters and the like acquired in advance for each event, may be selected.
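For illustration only, and assuming the out-of-speech component is handled as a magnitude spectrum, a low-frequency-emphasizing weighting standing in for fna could be applied as follows (the actual filter fna would be derived from recorded footstep samples and is not disclosed here; the cutoff value is an assumption):

```python
import numpy as np

def extract_specific_component(sn_spectrum, freqs, cutoff_hz=500.0):
    """Apply a low-pass shaped weighting (a stand-in for the filter fna)
    to the out-of-speech spectrum Sn(f) to obtain a specific component
    Sna(f) emphasizing low-frequency sounds such as footsteps.

    sn_spectrum: magnitude spectrum of the out-of-speech component
    freqs:       frequency axis (Hz) matching sn_spectrum
    cutoff_hz:   assumed roll-off frequency (illustrative value)
    """
    fna = 1.0 / (1.0 + (freqs / cutoff_hz) ** 2)
    return sn_spectrum * fna
```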
The determination unit 22A acquires the reference data from the storage unit 23A (step S9).
The reference data is, for example, the specific component filter fna in fig. 5 (b). In this case, the specific component filter fna is stored in the storage unit 23A as reference data in association with the detection event a. The reference data may be formed by using a previously recorded sample of the human footstep sound, a sample of the human footstep sound downloaded from a server on the internet, or the like, in addition to the specific component filter fna, and may be set in the storage unit 23A in advance. When the user performs a setting to detect the detection event a, the determination unit 22A may acquire the reference data from the storage unit 23A. When the user performs the setting of detecting the detection event B or the detection event C, the determination unit 22A may acquire the reference data from the storage units 23B and 23C associated with the detection event B and the detection event C, respectively.
The determination unit 22A compares the input specific component Sna with the reference data (step S10). A specific comparison method is, for example, to obtain the correlation value between the reference data and the specific component Sna and to regard the two as matching when the correlation value is greater than a threshold value. When determining that the reference data matches the specific component, the determination unit 22A outputs, for example, the detection event name assigned to the reference data used for the determination as detection event information (i.e., out-of-speech information) (step S11). Depending on the application, the output out-of-speech information may be sent, for example, to a smartphone designated for notifying the user.
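A hedged sketch of steps S9 to S11; the patent states only that a correlation value is compared with a threshold, so the normalization and the threshold value used here are assumptions:

```python
import numpy as np

def detect_event(specific_component, reference_data, event_name,
                 threshold=0.8):
    """Compare the specific component Sna with the reference data and
    return detection event information when they are judged to match."""
    a = specific_component / (np.linalg.norm(specific_component) + 1e-12)
    b = reference_data / (np.linalg.norm(reference_data) + 1e-12)
    correlation = float(np.dot(a, b))  # normalized correlation value
    if correlation > threshold:
        # e.g. {"event": "sound of a window breaking", "correlation": 0.93}
        return {"event": event_name, "correlation": correlation}
    return None  # no match: no detection event information is output
```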
The above example shows the case where the combination of the specific component and the reference data is 1-to-1. When a plurality of detection events are to be detected, the processing unit 21, the determination unit 22, and the storage unit 23 assigned in advance to each detection event are used, whereby detection event information (out-of-speech information) for each detection event can be obtained from the determination unit 22. For example, when the specific component and the reference data are 1-to-M, steps S9 to S11 are repeated for each reference data for the one specific component acquired in step S8, and the detection event information for each reference data is acquired.
According to the present embodiment described above, it is possible to extract an environmental component (out-of-speech component) in which the human voice component (speech component) is suppressed. By using the extracted out-of-speech component, security applications and application systems that conventionally require dedicated sensors can be realized easily. In particular, in a smart speaker widely used in the market, or in a system that similarly performs actions using voice recognition, by including a mechanism (the out-of-speech component acquisition unit 12) for calculating the difference between the original sound signal or the synthesized original sound signal and the speech signal (estimated speech signal) obtained after noise-suppression signal processing, it is possible to easily extract a noise component signal (out-of-speech component) in which the human voice component is suppressed. Such a noise component signal has conventionally been treated as unwanted sound and therefore has not been used as an output, nor has the out-of-speech component been used as the primary input data for an application. According to the present embodiment, the out-of-speech component can be actively utilized, and an application that uses the acquired out-of-speech component, in particular an application system related to security such as intrusion detection and monitoring of residents, can be constructed easily. Although a plurality of microphones are used in the above embodiment, the following basic configuration is also possible: a sound wave detection unit that receives the sound wave of the original sound with at least one microphone and outputs it as an original sound signal, a speech component acquisition unit that extracts a speech component from at least one of the original sound signal and a synthesized signal generated from a plurality of the original sound signals, and an out-of-speech component acquisition unit that generates and outputs an out-of-speech component from at least the speech component and the original sound signal. Even when a plurality of microphones are used, the original sound signal with the best reception state among the original sound signals received by the microphones can be selected as the processing target.
(second embodiment)
In the present embodiment, the following example is explained: when a technique is applied in which a microphone that receives a sound wave of original sound is selected by performing gain adjustment on an original sound signal output from a sound wave detection unit having a plurality of microphones, an out-of-speech component is generated.
Fig. 6 is a block diagram showing an example of the configuration of a sound wave detection unit and an original sound signal processing unit for processing sound waves of original sound received by a plurality of microphones in the second embodiment.
In the present embodiment, assuming that a smart speaker is used, the sound wave detection unit 310 includes, in addition to the plurality of microphones 311A, 311B, and 311C (referred to collectively as the microphone 311 when no particular distinction is required), input units 312A and 312B (referred to as the input unit 312 when no particular distinction is required) for canceling echo signals. Although the smart speaker has a function of outputting voice (including synthesized voice), it also includes an echo cancellation function to prevent the voice output by the smart speaker from entering its own microphone 311 and becoming noise input to the microphone. To cancel the echo, the voice output by the smart speaker (hereinafter referred to as the echo signal) is input to the input unit 312.
The original acoustic signal processing unit 410 processes the original acoustic signal input from the acoustic wave detection unit 310, and outputs the processed original acoustic signal or synthesized original acoustic signal to the signal component extraction unit 10. In addition, the original acoustic signal processing section 410 obtains an original acoustic signal from which an echo signal is removed by using an echo cancellation function.
The gain adjustment units 411A, 411B, and 411C (which are referred to as gain adjustment units 411, unless otherwise specified) respectively adjust gains including amplitudes and phases of the original acoustic signals input from the microphones 311A, 311B, and 311C.
The gain adjustment units 412A and 412B (which are referred to as gain adjustment units 412, unless otherwise specified) respectively adjust gains including amplitudes and phases of the echo signals input from the signal input units 312A and 312B.
The distribution unit 413 outputs the gain-adjusted original sound signals output from the gain adjustment units 411 and 412 to the speech component acquisition unit 11 and the out-of-speech component acquisition unit 12. The distribution unit 413 may instead output a signal obtained by synthesizing the gain-adjusted original sound signals output from the gain adjustment units 411 and 412 (referred to as the synthesized original sound signal unless otherwise specified) to the speech component acquisition unit 11 and the out-of-speech component acquisition unit 12.
The control unit 414 determines the gains applied to the original sound signals output from the microphones 311. For example, the control unit 414 adjusts the gains of the gain adjustment units 411 and 412 based on a beamforming technique or the like so that the combined directivity of the plurality of microphones 311A, 311B, and 311C is directed toward the sound source. By applying the gains determined by the control unit 414 in the gain adjustment units 411 and 412, the synthesized original sound signal becomes a signal in which the original sound arriving from the direction of the sound source (e.g., the speaker) is enhanced.
Fig. 7 is a diagram illustrating an example in which microphones are selected by adjustment of gain in this embodiment, and is an example in the case where sound waves of original sound are received by three microphones 311A, 311B, and 311C.
Fig. 7 (a) shows an example in which the gains (Ga, Gb, Gc, respectively) of the gain adjustment units 411A, 411B, 411C are 1.0, and the distribution unit 413 operates to directly synthesize the original sound signals output from the three microphones 311A, 311B, 311C. The directivities D-311A, D-311B, D-311C represent the directivities with which the microphones 311A, 311B, 311C receive sound waves, respectively.
Fig. 7 (b) shows an example in which the gains are set to Ga = 0.0, Gb = 1.0, and Gc = 0.0. Since all gains other than Gb are 0.0, only the original sound received by the microphone 311B with the directivity D-311B is effective, and the original sound received by the microphones 311A and 311C is not contained in the synthesized original sound signal. Therefore, the directivities D-311A and D-311C of the microphones 311A and 311C are not shown in (b) of fig. 7. This case can also be regarded as an example in which the microphone 311B is selected from the three microphones 311A, 311B, and 311C and used.
Fig. 7 (c) shows an example in which the gains are set to Ga = 1.0, Gb = 0.0, and Gc = 1.0. Since Gb is 0.0, the original sound received by the microphone 311B is not contained in the synthesized original sound signal. Therefore, only the directivities D-311A and D-311C of the microphones 311A and 311C are shown in fig. 7 (c).
Fig. 8 is a flowchart for the system to extract an out-of-speech component from an original acoustic signal in this embodiment, and the operation of this embodiment will be described using this diagram.
The voice uttered by the user U1 propagates in the form of a sound wave and is received by the microphones 311A, 311B, and 311C (step S21). Owing to the positional relationship between the user U1 and the microphones 311, the voice uttered by the user U1 reaches the microphone 311B with the strongest intensity. Similarly, the sound from the detection target N1 reaches the microphone 311C with the strongest intensity. Therefore, by synthesizing the original sound signals in accordance with the setting shown in (b) of fig. 7, an original sound in which the voice uttered by the user U1 is enhanced can be obtained compared with the case of (a) of fig. 7. On the other hand, by synthesizing the original sound signals in accordance with the setting shown in (c) of fig. 7, an original sound in which the sound emitted by the detection target N1 is enhanced while the voice uttered by the user U1 is suppressed can be obtained compared with the case of (a) of fig. 7. A voice recognition apparatus such as a smart speaker acquires the signal so as to enhance the voice uttered by the user U1, so the present embodiment shows an example in which the original sound is obtained with the configuration of fig. 7 (b). The original sound signal processing unit 410 therefore performs gain adjustment on the original sound signals input from the sound wave detection unit 310 with Ga = 0.0, Gb = 1.0, and Gc = 0.0, respectively, to obtain gain-adjusted original sound signals. The gain-adjusted original sound signals are output to the signal component extraction unit 10 as a synthesized original sound signal or a separated original sound signal (step S22). A specific example is as follows. Let the original sound signals input from the microphones 311A, 311B, and 311C be S01a, S01b, and S01c, respectively, and let their gain-adjusted versions be S01ag, S01bg, and S01cg. That is, S01ag = S01a × Ga, S01bg = S01b × Gb, and S01cg = S01c × Gc, which gives the separated original sound signal <S01g> = (S01ag, S01bg, S01cg). When the synthesized original sound signal S01g is used, S01g = S01ag + S01bg + S01cg.
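A sketch of the gain adjustment in step S22 under the setting of fig. 7 (b); the signal arrays and variable names are assumptions for illustration only:

```python
import numpy as np

# Placeholder original sound signals from microphones 311A, 311B, 311C
s01a = np.zeros(16000)
s01b = np.zeros(16000)
s01c = np.zeros(16000)

Ga, Gb, Gc = 0.0, 1.0, 0.0   # setting of fig. 7 (b): select microphone 311B

# Gain-adjusted original sound signals
s01ag, s01bg, s01cg = s01a * Ga, s01b * Gb, s01c * Gc

# Separated original sound signal <S01g> and synthesized signal S01g
s01g_separated = (s01ag, s01bg, s01cg)
s01g = s01ag + s01bg + s01cg
```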
The speech component acquisition unit 11 extracts the speech component Sv from the synthesized original sound signal S01g and outputs it to the voice recognition unit 5 and the out-of-speech component acquisition unit 12 (step S23). The speech component Sv output to the voice recognition unit 5 is subjected to voice recognition processing (step S24).
In the present embodiment, in order to obtain an original sound that enhances the sound emitted by the detection target N1 (corresponding to the out-of-speech component) rather than the voice uttered by the user U1, the separated original sound signal <S01> received by the microphones 311A, 311B, and 311C is reproduced from the gain-adjusted separated original sound signal <S01g> and the gains Ga, Gb, and Gc (step S25). Specifically, the speech component acquisition unit 11 outputs the speech component Sv to the out-of-speech component acquisition unit 12, and the original sound signal processing unit 410 outputs the gain-adjusted original sound signals S01ag, S01bg, and S01cg together with the gains Ga, Gb, and Gc to the out-of-speech component acquisition unit 12. The out-of-speech component acquisition unit 12 performs an inverse calculation using the gains Ga, Gb, and Gc on the gain-adjusted original sound signals S01ag, S01bg, and S01cg, thereby reproducing the original sound signals S01a, S01b, and S01c received by the microphones 311A, 311B, and 311C (corresponding to the original sound signals received with the directivities D-311A, D-311B, and D-311C in fig. 7 (a), respectively). Let the reproduced versions of S01a, S01b, and S01c be S01ar, S01br, and S01cr, respectively. Specifically, the out-of-speech component acquisition unit 12 can obtain the original sound signals S01ar, S01br, and S01cr received by the microphones 311A, 311B, and 311C by dividing the gain-adjusted original sound signals S01ag, S01bg, and S01cg by the gains Ga, Gb, and Gc, respectively. However, since division by a gain of 0.0 is impossible, the original sound signal processing unit 410 may instead output the original sound signal before gain adjustment, i.e., the synthesized original sound signal S01 or the separated original sound signal <S01>, directly to the out-of-speech component acquisition unit 12.
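A minimal sketch of the inverse calculation in step S25, with the zero-gain fallback described above (function and variable names are assumptions):

```python
def reproduce_original(gain_adjusted, gain, pre_adjustment=None):
    """Reproduce an original sound signal from its gain-adjusted version,
    e.g. S01ar = S01ag / Ga.  When the gain is 0.0 the division is
    impossible, so the signal before gain adjustment supplied by the
    original sound signal processing unit 410 is used instead."""
    if gain == 0.0:
        return pre_adjustment
    return gain_adjusted / gain

# Usage (variable names assumed): s01ar = reproduce_original(s01ag, Ga, s01a)
```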
The out-of-speech component acquisition unit 12 extracts the out-of-speech component Sn from the speech component Sv and the additional data β input from the speech component acquisition unit 11 and from the synthesized original sound signal S01r of the reproduced original sound signals S01ar, S01br, and S01cr (step S26). Specifically, for example, the out-of-speech component is obtained as Sn(f) = S01r(f) - Sv(f) × β(f), where Sn(f), S01r(f), Sv(f), and β(f) denote the values of the out-of-speech component Sn, the synthesized reproduced original sound signal S01r, the estimated speech signal Sv, and the additional data β at frequency f, respectively. Like the additional data α used in the first embodiment, the additional data β may be a filter value obtained by inverting the filter applied by the speech component acquisition unit 11 to extract the speech component, or may be an empirical value obtained from experiments, past data, and the like.
When the original sound signal before gain adjustment, i.e., the synthesized original sound signal S01, is input directly from the original sound signal processing unit 410 to the out-of-speech component acquisition unit 12, the out-of-speech component is obtained, for example, as Sn(f) = S01(f) - Sv(f) × β(f), where S01(f) denotes the value of the synthesized original sound signal S01 at frequency f.
Note that the synthesized original sound signals S01 and S01r used for the calculation in the out-of-speech component acquisition unit 12 may be synthesized using only the microphones other than the microphone into which the user's voice enters most strongly (corresponding to the microphone 311B), that is, only the microphones 311A and 311C, as shown in fig. 7 (c). Specifically, synthesized original sound signals excluding S01b and S01br are used, such that S01 = S01a + S01c or S01r = S01ar + S01cr.
Alternatively, S01 = S01a + S01c or S01r = S01ar + S01cr may be used directly as the out-of-speech component. That is, Sn = S01a + S01c or Sn = S01ar + S01cr.
According to the above steps, it is possible to easily extract an out-of-speech component using a speech recognition apparatus.
(third embodiment)
In the present embodiment, the following example will be described using the functional configuration of fig. 6: when a technique (for example, a beam forming technique) is applied in which the reception directivity of a sound wave can be changed by gain adjustment of an original sound signal output from a sound wave detection unit having a plurality of microphones, an out-of-speech component is generated.
In general, the position where the user exists may vary, and in the second embodiment, an example in which the microphone 311 is selected according to the position of the user is shown. In the present embodiment, an example is shown in which the control unit 414 changes the directivity of the sound wave received by the microphone 311 in accordance with the estimated position of the user while estimating the position of the user U1, for example, by the beam forming technique.
Fig. 9 is a diagram illustrating an example of adjusting the directivity of the received sound wave by the adjustment of the gain in the embodiment, and is an example of a case where the directivity of the microphone 311 is changed by the beam forming technique or the like.
Fig. 9 (a) shows an example in which the gain adjustment units 411A, 411B, and 411C set the gains to Ga = 1.0, Gb = 1.0, and Gc = 1.0, respectively. The directivity D-311 represents the directivity of the received sound wave when the microphones 311A, 311B, and 311C are regarded as one microphone, and D-311-C represents the center point of the directional beam. The user U2 is a user who utters speech, and the detection target N2 is a detection target that emits the out-of-speech component to be detected.
Fig. 9 (B) shows an example of the directivity of the received sound wave in the case where the directional beam B-311 is directed toward the position (estimated position) of the user U2. Fig. 9 (c) shows a directivity a-311 obtained by removing the directional beam B-311 of fig. 9 (B) from the directivity D-311 of fig. 9 (a).
In a typical voice recognition apparatus, it is desirable to perform voice recognition processing on an original sound signal obtained as a directional beam B-311 directed toward the user U2 as shown in fig. 9 (B), but in the present embodiment, it is desirable to obtain an original sound signal with a directivity a-311 avoiding the user U2 as much as possible as shown in fig. 9 (c).
An operation example of the present embodiment will be described with reference to the flowchart of fig. 8. The same portions as those of the second embodiment will not be described.
The voice uttered by the user U2 propagates in the form of a sound wave and is received by the microphones 311A, 311B, and 311C (step S21). The control unit 414 estimates the position of the user U2 from the received voice by the beamforming technique, generates the directional beam B-311 as shown in fig. 9 (b), and obtains the original sound signal using the generated directional beam B-311 (step S22). Since the beamforming technique is a common technique, its description is omitted. Generating the directional beam B-311 directs the directivity of the received sound wave toward the position (or estimated position) of the user U2. Let the gains for the microphones 311 at this time be Ga1, Gb1, and Gc1. That is, the original sound signal processing unit 410 performs gain adjustment with Ga = Ga1, Gb = Gb1, and Gc = Gc1 on the original sound signals input from the sound wave detection unit 310, and outputs the result to the signal component extraction unit 10 (step S22). Let the original sound signals output from the microphones 311A, 311B, and 311C be S02a, S02b, and S02c, respectively, and let their gain-adjusted versions be S02ag, S02bg, and S02cg. That is, S02ag = S02a × Ga, S02bg = S02b × Gb, and S02cg = S02c × Gc, which gives the separated original sound signal <S02g> = (S02ag, S02bg, S02cg). When the synthesized original sound signal S02g is used, S02g = S02ag + S02bg + S02cg.
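The patent states only that the gains (including amplitude and phase) are adjusted based on a beamforming technique. As one illustrative possibility, a frequency-domain delay-and-sum beamformer steered toward the estimated position of the user U2 could look like the following sketch; the steering delays and sampling rate are assumptions:

```python
import numpy as np

def delay_and_sum(signals, delays_s, fs=16000):
    """Steer the microphone array by applying per-microphone phase gains
    (corresponding to Ga1, Gb1, Gc1) and summing, producing the
    gain-adjusted synthesized original sound signal S02g.

    signals:  time-domain original sound signals [S02a, S02b, S02c]
    delays_s: per-microphone steering delays in seconds derived from the
              estimated position of the sound source
    fs:       sampling rate in Hz (assumed value)
    """
    n = len(signals[0])
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    total = np.zeros(len(freqs), dtype=complex)
    for sig, tau in zip(signals, delays_s):
        total += np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * tau)
    return np.fft.irfft(total, n)
```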
The speech component acquisition unit 11 extracts the speech component Sv from the synthesized original sound signal S02g and outputs it to the speech recognition unit 5 and the out-of-speech component acquisition unit 12 (step S23). The speech component Sv output to the speech recognition unit 5 is subjected to speech recognition processing (step S24).
In the present embodiment, in order to obtain an original sound in which the sound emitted by the detection object N2 (corresponding to the out-of-speech component) is enhanced rather than the voice uttered by the user U2, the original sound signals received by the microphones 311A, 311B, and 311C are reproduced from the gain-adjusted separated original sound signal <S02g> and the gains Ga, Gb, and Gc (step S25). Specifically, the speech component acquisition unit 11 outputs the speech component Sv to the out-of-speech component acquisition unit 12, and the original sound signal processing unit 410 outputs the gain-adjusted original sound signals S02ag, S02bg, and S02cg and the gains Ga1, Gb1, and Gc1 to the out-of-speech component acquisition unit 12. The out-of-speech component acquisition unit 12 can reproduce the original sound signals S02a, S02b, and S02c received by the microphones 311A, 311B, and 311C (corresponding to the original sound signals received with the directivity D-311 in fig. 9 (a)) by performing an inverse calculation on the gain-adjusted original sound signals S02ag, S02bg, and S02cg using the gains Ga1, Gb1, and Gc1, respectively. The reproduced original sound signals are denoted S02ar, S02br, and S02cr, respectively. The out-of-speech component acquisition unit 12 then extracts the out-of-speech component Sn from the speech component Sv and the additional data β input from the speech component acquisition unit 11 and the synthesized reproduced original sound signal S02r (the sum of S02ar, S02br, and S02cr) (step S26). Specifically, for example, the out-of-speech component is obtained as Sn(f) = S02r(f) - Sv(f) × β(f).
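A minimal sketch of the reproduction by inverse calculation in steps S25 and S26, assuming the gains applied during beamforming are non-zero and reusing the placeholder names from the previous sketch:

    def reproduce_original(s02ag, s02bg, s02cg, ga1, gb1, gc1):
        # Inverse calculation: undo the gain adjustment (gains assumed non-zero).
        s02ar = s02ag / ga1
        s02br = s02bg / gb1
        s02cr = s02cg / gc1
        # Synthesized reproduced original sound signal S02r, corresponding to
        # reception with the directivity D-311 of fig. 9 (a).
        s02r = s02ar + s02br + s02cr
        return s02ar, s02br, s02cr, s02r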
The out-of-speech component acquisition unit 12 may acquire the synthesized original sound signal S02 from the original sound signal processing unit 410 instead of S02r, and may acquire the out-of-speech component Sn as, for example, Sn(f) = S02(f) - Sv(f) × β(f).
The out-of-speech component acquisition unit 12 may also acquire the out-of-speech component Sn as, for example, Sn(f) = S02(f) - S02g(f) or Sn(f) = S02r(f) - S02g(f). In this case, as shown in fig. 9 (c), the original sound signal received with the directivity A-311, obtained by removing the directional beam B-311 from the directivity D-311, is obtained directly as the out-of-speech component Sn, so Sv(f) and β(f) are not required.
The out-of-speech component acquisition unit 12 may also acquire the out-of-speech component Sn as, for example, Sn(f) = S02(f) - S02g(f) - Sv(f) × β(f) or Sn(f) = S02r(f) - S02g(f) - Sv(f) × β(f). In this case, as shown in fig. 9 (c), the speech component Sv is further removed from the original sound received with the directivity A-311, obtained by removing the directional beam B-311 from the directivity D-311, so an out-of-speech component (ambient sound) of higher purity can be obtained.
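The extraction formulas in the preceding paragraphs can be summarized in the frequency domain as in the sketch below. This is only illustrative: it assumes that S02(f), S02g(f), S02r(f), Sv(f), and β(f) are already available as equal-length spectra (for example via an FFT), a detail the embodiment leaves to the implementation.

    import numpy as np

    def sn_speech_subtraction(s02r_f, sv_f, beta_f):
        # Sn(f) = S02r(f) - Sv(f) x beta(f): subtract the weighted speech component.
        return s02r_f - sv_f * beta_f

    def sn_beam_removal(s02_f, s02g_f):
        # Sn(f) = S02(f) - S02g(f): remove the directional-beam signal, leaving
        # the directivity A-311; Sv(f) and beta(f) are not needed.
        return s02_f - s02g_f

    def sn_higher_purity(s02_f, s02g_f, sv_f, beta_f):
        # Sn(f) = S02(f) - S02g(f) - Sv(f) x beta(f): additionally remove the
        # speech component for a higher-purity ambient sound.
        return s02_f - s02g_f - sv_f * beta_f

    # Example with dummy spectra (placeholder data only).
    s02_f = np.fft.rfft(np.random.randn(1024))
    s02g_f = np.fft.rfft(np.random.randn(1024))
    sn_f = sn_beam_removal(s02_f, s02g_f)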
According to the above procedure, it is possible to easily extract the out-of-speech component using a speech recognition apparatus to which the beamforming technique is applied.
(fourth embodiment)
A method of generating reference data used in an application using an out-of-speech component and an example of using the generated reference data will be described.
For example, the out-of-speech component following a trigger is acquired with the detection of a door sound as the trigger, and reference data is generated from the acquired out-of-speech component. Assuming the reference data was generated under normal conditions (a state in which no abnormality has occurred), then, while the application is in operation, if the out-of-speech component acquired with the door-opening sound as the trigger is determined to differ from the reference data, it is determined that some abnormality has occurred, and an abnormality detection signal is output as out-of-speech information. The detection of a recognized word obtained by speech recognition may also be used as a trigger for acquiring the reference data. In the present embodiment, two kinds of triggers for acquiring the reference data are shown: detection of a recognized word obtained by speech recognition, and detection of out-of-speech information such as a door sound.
Fig. 10 is a flowchart of the reference data generation processing executed by the system of the fourth embodiment. The reference data may be generated, for example, while the application system is operating according to the steps of the flowcharts shown in the first to third embodiments, or it may be generated outside the operation period. When reference data is generated outside the operation period, the user may, for example, deliberately break a piece of glass so that the sound wave detection unit 3 receives the sound wave of the glass-breaking sound, and the received sound wave is recorded in the storage unit 6 as reference data. In the present embodiment, an example is shown in which normal environment data, in which no abnormality occurs, is acquired as reference data during the operation period of the application system, using the functional configuration of fig. 1.
While the application system is operating, the speech recognition processing and the out-of-speech component recognition processing run (steps S301 and S401). These processes can be implemented, for example, based on the flowchart of fig. 2. In the flowchart of fig. 2, the process ends after step S11, but in the present embodiment the processing of steps S1 to S11 is repeated continuously. The flowchart of fig. 10 shows the speech recognition processing and the out-of-speech component analysis processing operating in parallel; the process flow starting from the speech recognition processing is described first, followed by the process flow starting from the out-of-speech component recognition processing.
The control unit 7 monitors whether the speech recognition unit 5 has detected a recognized word (step S302). When the speech recognition unit 5 detects a recognized word, the control unit 7 checks whether a timer unit A, not shown, has been started (step S303). If the timer unit A has been started, the control unit 7 returns to step S302 and continues monitoring whether the speech recognition unit 5 detects a recognized word (yes in step S303). If the timer unit A has not been started, the control unit 7 starts the timer unit A to begin counting the time T11 (no in step S303, proceed to step S304). When the time T11 exceeds the threshold TH_T11, the counting of the time T21 is started (step S306). At the same time, storage (recording) of the out-of-speech component in the storage unit 6 is started (step S307). While the time T21 is equal to or less than the threshold TH_T21, the storage (recording) of the out-of-speech component in the storage unit 6 started in step S307 is continued (no in step S308). When the time T21 exceeds the threshold TH_T21, the storage (recording) of the out-of-speech component in the storage unit 6 is stopped (yes in step S308, proceed to step S309).
Fig. 11 is a diagram for explaining the counted times, and fig. 11 (a) illustrates the times T11 and T21 counted in the process flow starting from the speech recognition processing. The time T11 is the time from when the speech recognition unit 5 detects the recognized word and the timer unit A is started in step S304 until the out-of-speech component begins to be stored (recorded) in the storage unit 6. When the time T11 exceeds the threshold TH_T11, the storage (recording) of the out-of-speech component in the storage unit 6 is started (step S307). The time T21 is the time from the start of storage (recording) of the out-of-speech component in the storage unit 6 until recording is stopped. By setting the thresholds TH_T11 and TH_T21 for the times T11 and T21, respectively, it is possible to control the delay from the detection of the recognized word by the speech recognition unit 5 until recording starts (corresponding to T11) and the recording time (corresponding to T21). Because of the conditional branch in step S303, the data storage processing of step S307 is performed for one trigger at a time; however, a plurality of data storage processes may be performed simultaneously by preparing a plurality of timer units A or the like.
Returning to fig. 10, the control unit 7 monitors whether the application processing unit 2 has detected out-of-speech information (step S402). When the application processing unit 2 detects out-of-speech information, the control unit 7 checks whether a timer unit B, not shown, has been started (step S403). If the timer unit B has been started, the control unit 7 returns to step S402 and continues monitoring whether the application processing unit 2 detects out-of-speech information (yes in step S403). If the timer unit B has not been started, the control unit 7 starts the timer unit B to begin counting the time T12 (no in step S403, proceed to step S404). When the time T12 exceeds the threshold TH_T12, the counting of the time T22 is started (yes in step S405, proceed to step S406). At the same time, storage (recording) of the out-of-speech component in the storage unit 6 is started (step S407). While the time T22 is equal to or less than the threshold TH_T22, the storage (recording) of the out-of-speech component in the storage unit 6 started in step S407 is continued (no in step S408). When the time T22 exceeds the threshold TH_T22, the storage (recording) of the out-of-speech component in the storage unit 6 is stopped (yes in step S408, proceed to step S409).
Fig. 11 (b) illustrates the times T12 and T22 counted in the process flow starting from the out-of-speech component recognition processing. The time T12 is the time from when the application processing unit 2 detects the out-of-speech information and the timer unit B is started in step S404 until the out-of-speech component begins to be stored (recorded) in the storage unit 6. When the time T12 exceeds the threshold TH_T12, the storage (recording) of the out-of-speech component in the storage unit 6 is started (step S407). The time T22 is the time from the start of storage (recording) of the out-of-speech component in the storage unit 6 until recording is stopped. By setting the thresholds TH_T12 and TH_T22 for the times T12 and T22, respectively, it is possible to control the delay from the detection of the out-of-speech information by the application processing unit 2 until recording starts (corresponding to T12) and the recording time (corresponding to T22).
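Both process flows above share the same timing pattern: after the trigger, wait until T11 (or T12) exceeds its threshold, then record the out-of-speech component until T21 (or T22) exceeds its threshold. The following sketch expresses that pattern under the assumption that a callable supplying out-of-speech frames is available; the function and parameter names are hypothetical and not taken from the patent.

    import time

    def record_after_trigger(get_out_of_speech_frame, delay_s, duration_s):
        # Phase 1: count T11 / T12 until the delay threshold (TH_T11 / TH_T12)
        # is exceeded.
        start = time.monotonic()
        while time.monotonic() - start < delay_s:
            time.sleep(0.01)
        # Phase 2: count T21 / T22 while storing (recording) out-of-speech frames,
        # up to the recording threshold (TH_T21 / TH_T22).
        frames = []
        record_start = time.monotonic()
        while time.monotonic() - record_start < duration_s:
            frames.append(get_out_of_speech_frame())
        return frames  # to be written to the storage unit 6 as reference data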
As described above, according to the present embodiment, the reference data can be acquired using, as a trigger, either the detection of a recognized word obtained by speech recognition or the detection of out-of-speech information such as a door sound. The user can assign detection event names such as "normal state after the recognized word is detected" or "normal state after the door-opening sound is detected" to the acquired reference data. While the application system is in use, a list of the detection event names for which reference data has been acquired may be displayed on a display unit, not shown, so that the user can make a selection with a remote controller or the like. For example, when the user selects "normal state after the door-opening sound is detected" from the list of detection event names, the application processing unit 2 may, in the operating state, acquire the out-of-speech component immediately after the trigger, using the door-opening sound as the trigger. The application processing unit 2 may then compare the acquired out-of-speech component with the corresponding reference data and output detection event information (out-of-speech information) such as an abnormality detection signal. The output out-of-speech information can be transmitted to the user's smartphone or the like via the internet. For example, the user may be notified, as out-of-speech information, that "the door of your house has been opened and an abnormality has occurred".
Although the present embodiment shows an example in which the reference data is acquired while the application system is operating, when the original sound is acquired experimentally in a reproduced environment in which no voice (human voice) occurs, for example, the original sound signal or the synthesized original sound signal output from the original sound signal processing unit 4 may be stored in the storage unit 6 as reference data without being processed by the out-of-speech component acquisition unit 12.
In addition, by using reference data acquired with the detection of a recognized word obtained by speech recognition as a trigger, as in the present embodiment, it is possible to confirm, for example, whether a device controlled by speech recognition operates normally. Specifically, consider a case in which an air conditioner is the control target and a start command, which is a power-on command generated by the speech recognition unit 5, is output wirelessly to the air conditioner by an infrared remote controller or the like. When the air conditioner successfully receives the start command, it operates according to the control content of the start command (power on). At the same time, the application processing unit 2 acquires the environmental sound as reference data, triggered by the detection of the recognized word (in this case corresponding to the start command) obtained by speech recognition. The reference data acquired in the present embodiment therefore records, for example, the sound after the air conditioner starts, beginning from the moment the air conditioner that normally received the start command is powered on. If this reference data is used during operation of the application system, then, when the user issues a start command but the air conditioner fails to receive it and performs an operation other than starting, the determination unit 22 of the application processing unit 2 compares the out-of-speech component extracted with the detection of the recognized word (start command) as a trigger against the reference data (for example, step S10 in fig. 2) and determines that they do not match. From this determination result it can be known that the air conditioner failed to receive the start command. The determination result may be transmitted as detection event information to the user's smartphone via the internet, or the recognized word may be displayed on a display unit, not shown, or output as speech from a speaker, not shown, by speech synthesis or the like.
According to the above procedure, by using the reference data acquired in the present embodiment, the user can know whether or not the command issued by voice is normally applied to the control target.
Further, by using reference data acquired with the detection of out-of-speech information (a detection event) as a trigger in the same procedure, it is possible to know whether an output detection event was detected normally or erroneously. For example, when a detection event is output due to erroneous detection, the determination unit 22 of the application processing unit 2 compares the acquired out-of-speech component with the reference data acquired in advance in the present embodiment and, from the result of the comparison, determines that the detection event was detected erroneously. Conversely, if an out-of-speech component matching the reference data acquired in the present embodiment is detected without the detection event being used as a trigger, it can be concluded that the detection event corresponding to that reference data has occurred. For example, the determination unit 22 of the application processing unit 2 constantly monitors the out-of-speech component and determines that detection event A has occurred when the out-of-speech component matches the reference data previously acquired immediately after detection event A was detected.
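The comparisons with reference data performed by the determination unit 22 could be realized in many ways; the patent does not specify a metric. As one hedged sketch, a normalized correlation of magnitude spectra against a threshold could serve as the match test. The function name and threshold below are assumptions for illustration only.

    import numpy as np

    def matches_reference(sn_f, ref_f, threshold=0.8):
        # Compare the acquired out-of-speech spectrum with the stored reference
        # spectrum using a normalized correlation of magnitudes.
        a = np.abs(sn_f)
        b = np.abs(ref_f)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0.0:
            return False
        similarity = float(np.dot(a, b) / denom)
        return similarity >= threshold

A return value of False for a component acquired with the door-opening sound as a trigger would then correspond to outputting an abnormality detection signal as out-of-speech information.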
(modification example)
An example of a system configuration for generating and using the out-of-speech component will be described.
Fig. 12 is a diagram showing an example of the configuration of an electronic device incorporating an out-of-speech component generation unit according to a modification. The electronic device 100 may be a home appliance such as a television receiver, an air conditioner, a refrigerator, or a washing machine, as long as it has a control function based on voice recognition, and it is not limited to home appliances. In the electronic device 100, the control unit 101 controls each function of the electronic device 100 based on the control information generated by the voice recognition unit 5. This modification may be regarded as an example in which the system shown in fig. 1 is built into the electronic device 100. By adding the out-of-speech component acquisition unit 12, the application processing unit 2, the control unit 7, and the like to the functions normally provided in the electronic device 100 and performing the same processing as in the flowchart of fig. 2 or fig. 8, an application system using the electronic device 100, for example a home security system, can be realized.
Next, a configuration example of a system using a smart speaker will be described.
Fig. 13 is a diagram showing an example of the configuration of a system using an external smart speaker according to a modification. The electronic device 100A may be a home appliance such as a television receiver, an air conditioner, a refrigerator, or a washing machine, as long as it has a control function based on voice recognition, and it is not limited to home appliances. The smart speaker 200 is externally connected to the electronic device 100A via an interface not shown, and in the electronic device 100A the control unit 101A controls each function of the electronic device 100A based on the control information generated by the voice recognition unit 5. An application device 300 is also connected to the smart speaker 200. The application device 300 acquires the out-of-speech component from the original sound signal, the speech component, the additional data (α, β, etc.), and the like input from the smart speaker 200, detects a detection event, and outputs detection event information (out-of-speech information) by the same processing as in the flowchart of fig. 2 or fig. 8. When a detection event is detected, the application device 300 may, for example, notify the user's smartphone via the internet from the interface unit 8.
The functions shown in fig. 12 and fig. 13 may be distributed over the internet via the interface unit 8 or the like, or may be implemented on a cloud server, and various combinations of functions and system forms are conceivable. For example, by implementing the application device 300 as a device on the cloud, reference data from a large number of users connected to the network can be accumulated as samples and used, for example, to improve the detection accuracy of detection events.
As described above, according to the present modification, the out-of-speech component can be extracted in various configurations and used in various ways, for example in home security technology.
According to at least one of the above-described embodiments and modifications, it is possible to provide an ambient sound output apparatus, a system, a method, and a non-volatile storage medium that can easily extract ambient sound using a speech recognition device, the non-volatile storage medium storing computer instructions that, when executed by a processor or a computer device, realize the above-described method of easily extracting ambient sound using a speech recognition device.
Several embodiments of the present application have been described, but these embodiments are presented as examples and are not intended to limit the scope of the application. These new embodiments may be implemented in various other forms, and various omissions, substitutions, and changes may be made without departing from the spirit of the present application. These embodiments and their modifications are included in the scope and gist of the application, and are included in the inventions described in the claims and their equivalents. Claims in which a component is expressed in divided form, in which a plurality of components are expressed in combination, or in which both apply are also within the scope of the present application. In addition, a plurality of embodiments may be combined, and examples formed by such combinations are also within the scope of the application.
The drawings are provided for clarity of description, and the width, thickness, shape, and the like of each portion may be shown schematically in comparison with the actual form. In the block diagrams, data and signals may also be exchanged between blocks that are not connected by lines, or, even where lines are drawn, in directions for which no arrows are shown. The functions shown in the block diagrams and the processing shown in the flowcharts and timing charts can be realized by hardware (an IC chip or the like), software (a program or the like), a digital signal processor (DSP), or a combination of hardware and software. The device of the present application also applies when a claim is expressed as control logic, as a program including instructions to be executed by a computer, or as a computer-readable storage medium storing such instructions. The terms and expressions used are not limiting; other expressions are also included in the present application as long as they have substantially the same content and effect.

Claims (12)

  1. An ambient sound output apparatus, wherein,
    the ambient sound output apparatus generates and outputs an out-of-speech component using a speech recognition device.
  2. The ambient sound output device of claim 1,
    the ambient sound output device includes:
    sound wave detection means that receives original sound with at least one microphone and outputs it as original sound signals, respectively;
    a speech component acquisition means of the speech recognition device that extracts a speech component from at least one of the original sound signal and a synthesized original sound signal generated from a plurality of the original sound signals; and
    an out-of-speech component acquisition means that generates and outputs the out-of-speech component from at least the speech component and the original sound signal.
  3. The ambient sound output device of claim 2,
    the speech component acquisition means outputs a frequency characteristic value of a filter used when extracting the speech component from the original sound signal,
    the out-of-speech component acquisition means generates the out-of-speech component from at least the frequency characteristic value of the filter, the speech component, and the original sound signal.
  4. The ambient sound output device of claim 3,
    the ambient sound output device is further provided with an original sound signal processing means,
    the original sound signal processing means generates gain-adjusted original sound signals from the original sound signals and gains respectively assigned to the microphones, generates a first synthesized original sound signal in which the gain-adjusted original sound signals are synthesized, and outputs a value of the gain, the gain-adjusted original sound signal, and the first synthesized original sound signal,
    the speech component acquisition means extracts the speech component from the first synthesized original sound signal,
    the out-of-speech component acquisition means reproduces the original sound signal from the gain-adjusted original sound signal and the value of the gain to obtain a reproduced original sound signal, and generates and outputs the out-of-speech component from at least the speech component and the reproduced original sound signal.
  5. The ambient sound output device of claim 3,
    the ambient sound output device is further provided with an original sound signal processing means,
    the original sound signal processing means generates a gain-adjusted original sound signal from the original sound signal and a gain assigned to the microphone, generates a first synthesized original sound signal in which the gain-adjusted original sound signal is synthesized, and outputs the original sound signal and the first synthesized original sound signal,
    the speech component acquisition means extracts the speech component from the first synthesized original sound signal,
    the out-of-speech component acquisition means generates and outputs the out-of-speech component from at least the speech component and the original sound signal.
  6. The ambient sound output device of claim 1,
    the voice recognition device is a voice recognition mechanism of a television receiving apparatus.
  7. The ambient sound output device of claim 1,
    the voice recognition device is a voice recognition mechanism of the intelligent sound box.
  8. A system, wherein,
    the system analyzes an out-of-speech component obtained from an ambient sound output apparatus that generates and outputs the out-of-speech component using a speech recognition device, and outputs the analysis result.
  9. The system of claim 8, wherein,
    the voice recognition device is any one of a voice recognition mechanism of the television receiving apparatus and a voice recognition mechanism of the smart speaker.
  10. A method, wherein,
    the method uses a speech recognition device to generate and output an out-of-speech component.
  11. The method of claim 10, wherein,
    receiving original sound with at least one microphone and outputting it as original sound signals, respectively,
    outputting the original sound signal and a synthesized original sound signal synthesized from the original sound signals,
    extracting a speech component from the synthesized original sound signal,
    generating and outputting the out-of-speech component from at least the speech component and the original sound signal.
  12. A non-transitory storage medium readable by a computer, the storage medium storing computer instructions for causing a computer including a digital signal processor to perform processes, wherein,
    the computer instructions are for causing the computer to perform the steps of:
    acquiring an original sound signal from at least one microphone;
    outputting the original sound signal and a synthesized original sound signal synthesized from the original sound signal;
    extracting a speech component from the synthesized original sound signal; and
    an out-of-speech component is generated and output from at least the speech component and the original sound signal.
CN202080006649.8A 2020-01-17 2020-12-11 Ambient sound output device, system, method, and non-volatile storage medium Active CN113490979B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2020-006227 2020-01-17
JP2020006227A JP2021113888A (en) 2020-01-17 2020-01-17 Environmental sound output device, system, method and program
PCT/CN2020/135774 WO2021143411A1 (en) 2020-01-17 2020-12-11 Ambient sound output apparatus, system, method, and nonvolatile storage medium

Publications (2)

Publication Number Publication Date
CN113490979A true CN113490979A (en) 2021-10-08
CN113490979B CN113490979B (en) 2024-02-27

Family

ID=76863553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080006649.8A Active CN113490979B (en) 2020-01-17 2020-12-11 Ambient sound output device, system, method, and non-volatile storage medium

Country Status (3)

Country Link
JP (1) JP2021113888A (en)
CN (1) CN113490979B (en)
WO (1) WO2021143411A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008064892A (en) * 2006-09-05 2008-03-21 National Institute Of Advanced Industrial & Technology Voice recognition method and voice recognition device using the same
JP2011118822A (en) * 2009-12-07 2011-06-16 Nec Casio Mobile Communications Ltd Electronic apparatus, speech detecting device, voice recognition operation system, and voice recognition operation method and program
CN102968999A (en) * 2011-11-18 2013-03-13 斯凯普公司 Audio signal processing
CN107742517A (en) * 2017-10-10 2018-02-27 广东中星电子有限公司 A kind of detection method and device to abnormal sound
KR20180090046A (en) * 2017-02-02 2018-08-10 인성 엔프라 주식회사 voice recognition device and lighting device therewith and lighting system therewith

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6054142B2 (en) * 2012-10-31 2016-12-27 株式会社東芝 Signal processing apparatus, method and program
CN104954543A (en) * 2014-03-31 2015-09-30 小米科技有限责任公司 Automatic alarm method and device and mobile terminal
WO2019003716A1 (en) * 2017-06-27 2019-01-03 共栄エンジニアリング株式会社 Sound collecting device, directivity control device, and directivity control method
US20200126549A1 (en) * 2017-07-14 2020-04-23 Daikin Industries, Ltd. Device control system
US10339913B2 (en) * 2017-12-27 2019-07-02 Intel Corporation Context-based cancellation and amplification of acoustical signals in acoustical environments
CN110390942A (en) * 2019-06-28 2019-10-29 平安科技(深圳)有限公司 Mood detection method and its device based on vagitus

Also Published As

Publication number Publication date
WO2021143411A1 (en) 2021-07-22
JP2021113888A (en) 2021-08-05
CN113490979B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
EP3726856B1 (en) A hearing device comprising a keyword detector and an own voice detector
KR102471499B1 (en) Image Processing Apparatus and Driving Method Thereof, and Computer Readable Recording Medium
JP5419361B2 (en) Voice control system and voice control method
EP2587481B1 (en) Controlling an apparatus based on speech
JP6450139B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
CN112397083B (en) Voice processing method and related device
US11488617B2 (en) Method and apparatus for sound processing
KR20200132613A (en) Method and apparatus for speech recognition with wake on voice
KR102374054B1 (en) Method for recognizing voice and apparatus used therefor
JP7498560B2 (en) Systems and methods
CN107465970A (en) Equipment for voice communication
JP2019184809A (en) Voice recognition device and voice recognition method
CN113314121B (en) Soundless voice recognition method, soundless voice recognition device, soundless voice recognition medium, soundless voice recognition earphone and electronic equipment
JP2007034238A (en) On-site operation support system
CN113228710A (en) Sound source separation in hearing devices and related methods
CN113490979B (en) Ambient sound output device, system, method, and non-volatile storage medium
US20110208516A1 (en) Information processing apparatus and operation method thereof
US10276156B2 (en) Control using temporally and/or spectrally compact audio commands
US20100249961A1 (en) Environmental sound reproducing device
KR20210100368A (en) Electronice device and control method thereof
KR20210054246A (en) Electorinc apparatus and control method thereof
EP4149120A1 (en) Method, hearing system, and computer program for improving a listening experience of a user wearing a hearing device, and computer-readable medium
CN115331672B (en) Device control method, device, electronic device and storage medium
US20240223973A1 (en) Hearing device comprising a transmitter
US20220360935A1 (en) Sound field control apparatus and method for the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant