WO2021143411A1 - Environmental sound output device, system, method, and non-volatile storage medium - Google Patents

Environmental sound output device, system, method, and non-volatile storage medium

Info

Publication number
WO2021143411A1
WO2021143411A1
Authority
WO
WIPO (PCT)
Prior art keywords
original
speech
extra
acoustic signal
component
Prior art date
Application number
PCT/CN2020/135774
Other languages
English (en)
French (fr)
Inventor
諸星利弘
Original Assignee
海信视像科技股份有限公司
东芝视频解决方案株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海信视像科技股份有限公司, 东芝视频解决方案株式会社
Priority to CN202080006649.8A (CN113490979B)
Publication of WO2021143411A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the embodiments of the present application relate to an environmental sound output device, system, method, and non-volatile storage medium.
  • In conventional home security applications, a sensor corresponding to the security application is required.
  • For example, multiple sensors such as window sensors and door sensors for detecting the opening and closing of windows and doors are required.
  • In addition to the installation cost, maintenance costs such as replacement of the sensors' batteries are also incurred after installation.
  • Voice recognition technology, such as that used in smart speakers, extracts voice components (components of the human voice) and converts them into text (textualization, linguisticization, etc.) by using technology for suppressing environmental sound components other than speech (extra-speech components).
  • Patent Document 1 Japanese Patent Application Publication No. 2001-188555
  • Patent Document 2 Japanese Patent Application Publication No. 2000-222000
  • The extra-speech component can be used in home security technology and the like. Home security technology can easily be introduced by extracting extra-speech components using smart speakers, which are becoming popular.
  • Voice recognition technology has similarly been introduced in home appliances such as televisions, and home security technology can easily be introduced by extracting extra-speech components using such home appliances.
  • the problem to be solved by this application is to provide an environmental sound output device, system, method, and non-volatile storage medium that can easily extract environmental sound using a voice recognition device.
  • An environmental sound output device includes a sound wave detection mechanism, a voice component acquisition mechanism of a voice recognition device, and an extra-speech component acquisition mechanism.
  • The sound wave detection mechanism uses at least one microphone to receive the sound wave of the original sound and output it as an original sound signal.
  • the voice component acquisition mechanism extracts a voice component from at least one of the original acoustic signal and a synthesized signal generated from a plurality of the original acoustic signals
  • The extra-speech component acquisition mechanism generates and outputs the extra-speech component based on at least the voice component and the original sound signal.
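  • As an illustration only (not part of the claimed configuration), the three mechanisms above can be sketched as a simple processing chain. The sketch below is a minimal example: the sampling rate, the dummy microphone data, and the crude band-pass stand-in for the speech extraction of a real voice recognition device are all assumptions, and the function names are hypothetical.

```python
import numpy as np

FS = 16000  # assumed sampling rate (Hz); not specified in the application

def synthesize(original_signals):
    """Combine the original sound signals from several microphones into one
    synthesized original sound signal (a plain sum; gain adjustment is described later)."""
    return np.sum(original_signals, axis=0)

def acquire_speech_component(synthesized):
    """Stand-in for the speech component acquisition mechanism of a voice recognition
    device: a crude speech-band emphasis, for illustration only."""
    spectrum = np.fft.rfft(synthesized)
    freqs = np.fft.rfftfreq(synthesized.size, d=1.0 / FS)
    speech_band = (freqs >= 100.0) & (freqs <= 4000.0)
    return np.fft.irfft(spectrum * speech_band, n=synthesized.size)

def acquire_extra_speech_component(synthesized, speech_estimate):
    """Extra-speech component acquisition mechanism: generate the extra-speech
    component from the original sound signal and the estimated speech component."""
    return synthesized - speech_estimate

# dummy original sound signals standing in for the sound wave detection mechanism
rng = np.random.default_rng(0)
mics = [rng.standard_normal(FS) for _ in range(3)]
s0 = synthesize(mics)
sv = acquire_speech_component(s0)
sn = acquire_extra_speech_component(s0, sv)
```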
  • FIG. 1 is a block diagram showing a configuration example of the system of the first embodiment
  • FIG. 2 is a flowchart for the system of the first embodiment to receive sound waves of original sound and perform analysis processing of extra-speech components;
  • Fig. 3 is a schematic diagram showing filters and respective signals that the system of the first embodiment extracts from the original acoustic signal to the speech signal;
  • FIG. 4 is a schematic diagram showing filters and various signals that the system of the first embodiment extracts from the original acoustic signal to the extra-speech component;
  • Fig. 5 is a schematic diagram showing filters and respective signals that are extracted from extra-speech components to specific components in the system of the first embodiment
  • FIG. 6 is a block diagram showing a configuration example of a sound wave detection unit and an original sound signal processing unit for processing sound waves of original sound received by a plurality of microphones in the second embodiment;
  • FIG. 7 is a diagram illustrating an example of selecting a microphone through gain adjustment in the second embodiment
  • Fig. 8 is a flowchart for the system to extract extra-speech components from the original acoustic signal in the second embodiment
  • FIG. 9 is a diagram illustrating an example of adjusting the directivity of received sound waves by adjusting the gain in the third embodiment.
  • Fig. 10 is a flowchart for the system of the fourth embodiment to execute reference data generation processing
  • FIG. 11 is a diagram illustrating the time to be counted by the system of the fourth embodiment.
  • FIG. 12 is a diagram showing a configuration example of an electronic device with a built-in extra-voice component generating unit according to a modification
  • FIG. 13 is a diagram showing a configuration example of a system in the case of using an external smart speaker according to this modification.
  • 1: extra-speech component generation unit, 2: application processing unit, 3: sound wave detection unit, 4: original sound signal processing unit, 5: speech recognition unit, 6: storage unit, 7: control unit, 8: interface unit, 10: signal component extraction unit, 11: speech component acquisition unit, 12: extra-speech component acquisition unit, 21: processing unit, 22: determination unit, 23: storage unit, 31: microphone, 100: television receiver, 101: control unit, 200: smart speaker, 300: application device.
  • an extra-speech component obtained by removing a speech component from the original sound received by the sound wave detection unit is generated, and the extra-speech component is used in an application system.
  • FIG. 1 is a block diagram showing a configuration example of the system of the first embodiment.
  • The extra-speech component generation unit 1 is a device that obtains the original sound to generate and output extra-speech components.
  • the original sound means a physical sound wave.
  • The extra-speech component generation unit 1 includes a sound wave detection unit 3, an original sound signal processing unit 4, and a signal component extraction unit 10.
  • the sound wave detection unit 3 includes a microphone and an analog-digital conversion unit (hereinafter referred to as an A/D conversion unit) not shown.
  • the microphones 31A, 31B, and 31C receive the sound waves of the original sound and convert them into electric signals (analog original sound signals).
  • the analog original sound signal is converted to a digital value in an A/D conversion unit not shown, and the original sound signal is output.
  • the original acoustic signal means an original acoustic signal based on a digital value.
  • the original acoustic signal processing section 4 outputs an original acoustic signal obtained by synthesizing the original acoustic signal output by the acoustic wave detecting section 3 (hereinafter referred to as a synthesized original acoustic signal).
  • the original sound signal processing unit 4 performs gain adjustment or the like on the original sound signals output from the plurality of microphones 31 and synthesizes them.
  • the original acoustic signal in the expression "original acoustic signal output from the microphone 31" means an original acoustic signal based on a digital value.
  • the original sound signal is output from the A/D conversion unit, but in the following, for the purpose of clarifying the source of the original sound signal, this expression is sometimes used.
  • the original acoustic signal processing unit 4 may also select an effective microphone from a plurality of microphones 31A, 31B, and 31C through gain adjustment or the like, and apply a microphone array technology such as beamforming.
  • The original acoustic signal processing unit 4 may operate in the form of a digital signal processor (DSP), may operate as software on a computer such as a microcomputer, or may operate in the form of hardware such as an IC chip. It may also operate in a form obtained by combining the above forms.
  • the signal component extraction unit 10 generates a voice component (also referred to as a voice signal) and an extra-speech component based on the original sound signal input from the original sound signal processing unit 4 or a synthesized original sound signal, and outputs them.
  • the voice component is a component centered on the component of the human voice in particular, and it may be an estimated value of the voice component.
  • the extra-speech component refers to the component obtained by removing the speech component from the original sound.
  • Similarly, the signal component extraction unit 10 may operate in the form of a digital signal processor (DSP), may operate as software on a computer such as a microcomputer, or may operate in the form of hardware such as an IC chip. It may also operate in a form obtained by combining the above forms.
  • The signal component extraction unit 10 includes a speech component acquisition unit 11 and an extra-speech component acquisition unit 12.
  • the speech component acquisition unit 11 extracts speech components from the original sound signal output by the original sound signal processing unit 4 or the synthesized original sound signal.
  • the voice component acquisition unit 11 may be provided with a filter for extracting voice components, a noise reducer for removing non-voice components, etc., which are used in ordinary voice recognition devices such as smart speakers.
  • The extra-speech component acquisition unit 12 extracts and outputs the extra-speech component using the original sound signal or synthesized original sound signal output by the original sound signal processing unit 4, the speech component output by the speech component acquisition unit 11, and parameters such as the gain adjustment amount (sometimes also simply referred to as the gain) used in the original sound signal processing unit 4.
  • the extra-speech component acquisition unit 12 may use a filter for the speech component output by the speech component acquisition unit 11 as needed. The information of the filter may be input from the speech component acquisition unit 11 to the extra-speech component acquisition unit 12 as additional data.
  • the application processing unit 2 performs analysis or the like on the extra-speech component input from the signal component extraction unit 10, generates information based on the extra-speech component (hereinafter referred to as extra-speech information), and outputs the extra-speech information to the outside.
  • the extra-speech information may include information related to environmental sound. For example, in the case where the analysis of the extra-speech component detects the sound of glass breaking, the information such as "glass breaking" can be output as the extra-speech information.
  • the application processing unit 2 may detect abnormal sound based on the extra-speech component, and output the detection content such as whether or not an abnormal sound is detected, and what kind of abnormal sound is generated from what kind of object, as extra-speech information.
  • The abnormal sound that can be detected by the application processing unit 2 may be, for example, any abnormal sound such as a window glass breaking, a person's footsteps when no one should be at home, a heavy object falling, or a person falling down.
  • the application processing unit 2 may operate as software on a computer such as a microcomputer, for example, or may operate on hardware such as an IC chip.
  • The processing units 21A, 21B, and 21C (referred to as the processing unit 21 unless otherwise distinguished) extract necessary components (hereinafter referred to as specific components) from the extra-speech component. Specifically, based on the frequency characteristics of the extra-speech component, etc., a component of a certain frequency band is extracted from the extra-speech component as the specific component.
  • the specific components extracted by the processing units 21A, 21B, and 21C may be different, and it may be determined according to, for example, what kind of abnormal sound is detected. In addition, the specific component to be extracted may also be predetermined in each processing unit 21.
  • Information for determining the frequency band of the specific component may be incorporated into the software or hardware that implements the functions of the application processing unit 2.
  • the user or the like can also set or select a specific component to be extracted via, for example, the interface 8 or the like.
  • The processing unit 21 may also calculate, from the extra-speech component, frequency feature amount data indicating a specific component included in the abnormal sound to be detected. The processing unit 21 outputs the extracted specific component or the frequency feature amount data to the subsequent stage.
  • the storage unit 23 stores data of a specific component (hereinafter referred to as reference data) according to an event to be detected (hereinafter referred to as a detection event). Detection events are, for example, the sound of broken windows, the sound of people's footsteps when not at home, the sound of falling heavy objects, and the sound of people falling down. According to the detection event, the data of the specific component is obtained in advance and set as the reference data for each detection event.
  • The reference data of the specific component may be obtained using the extra-speech component output by the extra-speech component acquisition unit 12, and the reference data may be data expressed in the frequency domain, such as a frequency characteristic, or data expressed in the time domain, such as a time signal. The reference data may also be frequency feature amount data calculated from a specific component.
  • For example, the reference data may be obtained by actually breaking a window near the sound wave detection unit 3 and storing (recording) the resulting sound in the storage unit 23.
  • the sound sample data stored for each detection event in the storage unit 6 described later may be downloaded to the storage unit 23. It is also possible to download a sample provided by a storage medium such as a CD with audio sample data or a server on the Internet to the storage unit 23.
  • It is also possible for the user to record and edit content transmitted by television broadcast signals, radio broadcasts, etc. to create sound sample data and store it in the storage unit 23. If the sound sample data has no detection event name, a detection event name is assigned to it so that it can serve as reference data.
  • When a specific component (or its feature data) is input from the processing unit 21, the determination unit 22 acquires the corresponding reference data from the storage unit 23, compares the input specific component (or its feature data) with the acquired reference data, and, when it determines that the two match, outputs the name of the detection event assigned to the matching reference data, etc., as detection event information.
  • the above-mentioned example is an example showing a case where the specific component and the reference data correspond one-to-one, but the combination of the specific component and the reference data may be various combinations.
  • The number of specific components and the number of pieces of reference data may be 1 to M (M is a natural number greater than or equal to 2).
  • Conversely, the number of specific components and the number of pieces of reference data may be M to 1.
  • The processing unit 21 outputs a plurality of (for example, M) specific components to the determination unit 22, and the determination unit 22 compares each input specific component with the corresponding reference data and outputs a plurality of (for example, M) pieces of detection event information.
  • the detection event information output by the determination unit 22 may be output to the user's smartphone set by the application processing unit 2 via the Internet, for example, to notify the user that an event has occurred in the form of detection event information.
  • The function of the determination unit 22 can also be realized by using the trigger word detection (wake word detection) and specific word detection based on artificial intelligence (AI) technology provided in the smart speaker.
  • the voice recognition unit 5 uses voice recognition technology to recognize the voice components output by the voice component acquisition unit 11 and convert them into text (textualization, linguisticization, etc.).
  • the speech recognition unit 5 outputs a textualized language (hereinafter referred to as a recognized word).
  • the output destination may be an application or device that uses the recognized word, or the recognized word may be displayed on a display unit (not shown), or the recognized word may be output as a voice through a speaker (not shown) such as voice synthesis.
  • the application or device using the recognized word uses the recognized word as an instruction for control, and the application or device that receives the recognized word from the voice recognition unit 5 performs control based on the recognized word, for example. Since the speech recognition technology is a well-known common technology, a detailed description is omitted.
  • the storage unit 6 stores sample data of reference data to be stored in the storage unit 23. As described above in the description of the storage unit 23, the downloaded sound sample data, recording data, or their characteristic amounts are stored in the storage unit 6, and are provided to the storage unit 23 as necessary.
  • The storage unit 6 may include reference data of detection events that are not stored in the storage unit 23. For example, when the user selects with the remote control the "detection event" that the application processing unit 2 is to detect, and the control unit 7 receives the selection signal sent from the remote control via the interface unit 8, the control unit 7 sets the reference data of the "detection event" stored in the storage unit 6 in the storage unit 23.
  • the control unit 7 controls each function of the application system. Specifically, the control unit 7 can control each function based on various control signals including a selection signal input from the interface unit 8. In addition, the control unit 7 may control each function based on the recognized words output by the voice recognition unit 5. It should be noted that, in FIG. 1, data exchange (including control) can also be performed between the control unit 7 and the functional blocks not connected to the control unit 7.
  • the interface unit 8 performs various communications with the outside of the application system.
  • The interface unit 8 includes various wired and wireless communication interfaces such as a remote controller (hereinafter referred to as remote control), infrared communication, mouse, keyboard, Ethernet, HDMI, Wi-Fi, and 5th generation mobile communication (5G).
  • the interface unit 8 may include, for example, a communication interface for accessing the Internet, and download sound sample data and the like from a server on the Internet to the storage unit 23.
  • the interface unit 8 may also output detection event information to, for example, a smartphone or PC of a user connected to the Internet.
  • the interface unit 8 may further include an interface that enables the application processing unit 2 to communicate with external home appliances. For example, it can generate and send commands to the external home appliances via infrared communication in the interface unit 8 to control the external home appliances.
  • the present system may also adopt a smart speaker including the functions of the voice component acquisition unit 11, the sound wave detection unit 3, the voice recognition unit 5, and the like.
  • Smart speakers are installed in the home and used to recognize human voices.
  • the microphone of the smart speaker (corresponding to the microphone included in the sound wave detection unit 3 of the present embodiment) covers a wide area of several square meters in the surrounding area and always monitors the surrounding state.
  • the smart speaker has a structure for extracting human voice, and includes a function equivalent to the voice component acquisition unit 11.
  • Smart speakers treat noise such as environmental sounds as obstacles to detecting human voices. Therefore, technologies such as noise reduction processing and narrowing of the sound collection direction based on beamforming are used to suppress these obstacles.
  • the smart speaker is configured to enhance the human voice (speech component) while suppressing the extra-speech component that becomes its background.
  • In contrast, in the present embodiment a smart speaker is used to extract and utilize extra-speech components.
  • FIG. 2 is a flowchart of the system used in this embodiment to receive sound waves of original sound to perform analysis processing of extra-speech components.
  • The sound wave detection unit 3 is always active, receives the sound waves of the original sound around it, generates an electrical original sound signal, and outputs it (step S1).
  • the original acoustic signals output from the microphones 31A, 31B, and 31C are respectively referred to as S0a, S0b, and S0c.
  • The original sound signals input from the sound wave detection unit 3 to the original sound signal processing unit 4 are synthesized and input as a synthesized original sound signal (denoted S0) to the speech component acquisition unit 11 and the extra-speech component acquisition unit 12 (step S2).
  • The synthesized original acoustic signal S0 may instead be a set of values (S0a, S0b, S0c) that keeps the original acoustic signals from the respective microphones 31 separate (referred to as separated original acoustic signals).
  • Fig. 3 shows a schematic diagram of the filter and the frequency characteristics of each signal from the original acoustic signal to the speech signal extracted by the system of this embodiment.
  • In FIG. 3, the horizontal axis represents frequency, and the vertical axis represents, for example, an amplitude value, a power value, etc., or their relative value (probability density function, etc.).
  • Fig. 3 is a schematic diagram for convenience of explanation, and the figure is not intended to strictly show the magnitude of the numerical value.
  • the speech component acquisition unit 11 performs extraction processing of the speech signal Sv with respect to the input synthesized original sound signal S0 (step S3).
  • the synthesized original sound signal S0 is input to the filter for extracting the speech signal (for example, the schematic diagrams of (b) and (c) in Fig. 3) to obtain the estimated value Sv of the speech signal (hereinafter sometimes referred to as estimated speech Signal).
  • The estimated speech signal Sv (for example, the schematic diagram of FIG. 3(d)) is input to the speech recognition unit 5, and the speech recognition unit 5 performs speech recognition processing based on the estimated speech signal Sv, recognizes the language from the speech signal (textualization), and obtains recognized words (step S4).
  • the voice recognition unit 5 outputs the obtained recognized words to the outside (step S5).
  • the synthesized original sound signal S0 is input from the original sound signal processing unit 4 to the extra-speech component acquisition unit 12, and the estimated speech signal Sv and the additional data ⁇ are input from the speech component acquisition unit 11 to the extra-speech component acquisition unit 12.
  • the additional data ⁇ may be the value of a filter applied to the synthesized original sound signal S0 by the speech component acquisition unit 11.
  • FIG. 4 shows a schematic diagram of the filter and the frequency characteristics of each signal from the original acoustic signal to the extra-speech component extracted by the system of this embodiment.
  • In FIG. 4, the horizontal axis represents frequency, and the vertical axis represents, for example, an amplitude value, a power value, etc., or their relative value (probability density function, etc.).
  • FIG. 4 is a schematic diagram for convenience of explanation and is not intended to strictly show the magnitudes of the numerical values.
  • FIG. 4(b) shows a filter used to restore the level of the extra-speech component suppressed in step S3 to its original state; for example, it is a filter obtained by inverting the filter of FIG. 3(c) upside down.
  • The value of this filter may be used, for example, as the additional data α.
  • empirical values obtained based on experiments, past data, etc. may also be used as additional data ⁇ .
  • The extra-speech component Sn is acquired based on at least the synthesized original sound signal S0 and the estimated speech signal Sv (for example, the schematic diagram of FIG. 4(c)), for example as Sn(f) = S0(f) - Sv(f) (step S6).
  • Here, Sn(f), S0(f), and Sv(f) respectively represent the values at frequency f of the extra-speech component Sn, the synthesized original acoustic signal S0, and the estimated speech signal Sv.
  • The extra-speech component Sn may also be obtained in consideration of the additional data α for the synthesized original sound signal S0 and the estimated speech signal Sv, for example as Sn(f) = S0(f) - α(f)·Sv(f).
  • α(f) represents the value of the additional data α at frequency f.
  • the user performs a setting such as detection of the detection event A on the application processing unit 2.
  • the reference data corresponding to the detection event A is stored in the storage unit 23A.
  • the extra-speech component acquired by the extra-speech component acquisition unit 12 is input to the application processing unit 2, and the processing unit 21A extracts a specific component from the extra-speech component, and the extracted specific component is output to the determination unit 22A (step S8).
  • Fig. 5 shows a schematic diagram of the filter and each signal that the system of this embodiment extracts from the extra-speech component to the specific component in the frequency region.
  • In FIG. 5, the horizontal axis represents frequency, and the vertical axis represents, for example, an amplitude value, a power value, etc., or their relative values (probability density function, etc.).
  • Fig. 5 is a schematic diagram for convenience of explanation, and the figure is not intended to strictly show the magnitude of the numerical value.
  • the processing unit 21A may include, for example, a filter that suppresses high-frequency components and extracts low-frequency components, such as the filter fna in (b) of FIG. 5.
  • this filter fna is used when detecting, for example, the sound of human footsteps as the detection event A.
  • the filter fna can be formed by using pre-recorded human footsteps, etc., and is set in the processing unit 21A in advance.
  • the user may select the processing unit 21A in which the filter fna is set when the user has made the setting to detect the detection event A.
  • Fig. 5(c) is a schematic diagram of the characteristic component Sna extracted in step S8.
  • The same applies to the processing units 21B and 21C, which include filters obtained in advance for each detection event.
  • the determination unit 22A acquires the reference data from the storage unit 23A (step S9).
  • the reference data is, for example, the specific component filter fna of FIG. 5(b).
  • the specific component filter fna is stored in the storage unit 23A in association with the detection event A as reference data.
  • The reference data can also be formed by using pre-recorded human footsteps, samples of human footsteps downloaded from a server on the Internet, etc., and set in the storage unit 23A in advance.
  • the determination unit 22A may obtain the reference data from the storage unit 23A.
  • the determination unit 22A may obtain the reference data from the storage units 23B and 23C associated with the detection event B and the detection event C, respectively.
  • the determination unit 22A compares the input specific component Sna with the reference data (step S10).
  • The specific comparison method may be, for example, calculating the correlation value between the reference data and the specific component Sna; when the correlation value is greater than a certain threshold, it is regarded as "the reference data is consistent with the specific component".
  • When they are determined to be consistent, the detection event name of the reference data used for the determination is output as detection event information (extra-speech information) (step S11).
  • The output extra-speech information can, for example, be sent to a smartphone designated for notifying the user, depending on the application.
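  • A minimal sketch of steps S8 to S11 under the following assumptions: the reference data is a time-domain frame of the same length as the specific component, and the cutoff frequency and correlation threshold are illustrative values, not values given in the application.

```python
import numpy as np

FS = 16000          # assumed sampling rate (Hz)
CUTOFF_HZ = 300.0   # hypothetical band edge for a low-frequency specific component (filter fna)
THRESHOLD = 0.6     # hypothetical correlation threshold for "consistent with the reference data"

def extract_specific_component(extra_speech, cutoff_hz=CUTOFF_HZ, fs=FS):
    """Processing unit 21 (sketch): suppress high-frequency components and keep the
    low-frequency band of the extra-speech component, as a stand-in for the filter fna."""
    spectrum = np.fft.rfft(extra_speech)
    freqs = np.fft.rfftfreq(extra_speech.size, d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=extra_speech.size)

def matches_reference(specific, reference, threshold=THRESHOLD):
    """Determination unit 22 (sketch): normalized correlation between the specific
    component and the reference data; a value above the threshold counts as a match."""
    specific = specific - specific.mean()
    reference = reference - reference.mean()
    denom = np.linalg.norm(specific) * np.linalg.norm(reference)
    if denom == 0.0:
        return False
    return float(np.dot(specific, reference)) / denom > threshold

# usage sketch: reference_footsteps would be reference data from the storage unit 23A
# if matches_reference(extract_specific_component(sn), reference_footsteps):
#     output detection event information for "detection event A" (e.g. to a smartphone)
```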
  • The above shows the case where the combination of the specific component and the reference data is one-to-one.
  • By preparing in advance a processing unit 21, a determination unit 22, and a storage unit 23 for each detection event, it is possible to obtain the detection event information (extra-speech information) of each detection event from the respective determination units 22.
  • Alternatively, steps S9 to S11 may be repeated for the specific component acquired in step S8, using each piece of reference data in turn, to obtain detection event information for each piece of reference data.
  • According to the present embodiment described above, it is possible to extract the environmental sound component (extra-speech component) in which the human voice component (speech component) is suppressed.
  • By using the extracted extra-speech components, security applications and application systems that previously required dedicated sensors can be realized simply.
  • By including a mechanism (the extra-speech component acquisition unit 12) that calculates the difference between the original sound signal (or the synthesized original sound signal) and the speech signal estimated by noise-suppression signal processing, a noise component signal in which the human speech component is suppressed can be extracted.
  • In the past, the noise component signal (extra-speech component) in which the human voice component is suppressed was regarded as unnecessary sound, so it was not used as an output, and it was not used as the main input data of an application. According to this embodiment, the extra-speech component can be actively used, and it is possible to easily construct an application that uses the acquired extra-speech component, in particular an application system related to security such as intrusion detection and survival monitoring.
  • In the present embodiment, a plurality of microphones are used, but basically the following configuration is sufficient: a sound wave detection mechanism, a speech component acquisition mechanism of a speech recognition device, and an extra-speech component acquisition mechanism, where the sound wave detection mechanism uses at least one microphone to receive the sound wave of the original sound and output it as an original sound signal, the speech component acquisition mechanism extracts a speech component from at least one of the original sound signal and a synthesized signal generated from a plurality of the original sound signals, and the extra-speech component acquisition mechanism generates and outputs the extra-speech component based on at least the speech component and the original sound signal.
  • When a plurality of microphones are used, the original acoustic signal with the best reception state can be selected as the processing object from the original acoustic signals received by them, and the extra-speech component is generated from it.
  • FIG. 6 is a block diagram showing a configuration example of a sound wave detection unit and an original sound signal processing unit for processing sound waves of original sound received by a plurality of microphones in the second embodiment.
  • The sound wave detection unit 310 includes a plurality of microphones 311A, 311B, and 311C (referred to as the microphones 311 unless otherwise distinguished), and also includes input units 312A and 312B for the echo signal used for echo cancellation (referred to as the input units 312 unless otherwise distinguished).
  • Since the smart speaker includes a function of outputting voice (including synthesized voice), it includes an echo cancellation function for the purpose of preventing the voice output from the smart speaker from entering its own microphones 311 and becoming noise input to the microphones.
  • The voice output from the smart speaker (hereinafter referred to as the echo signal) is input to the input units 312.
  • the original acoustic signal processing unit 410 processes the original acoustic signal input from the acoustic wave detection unit 310 and outputs the processed original acoustic signal or the synthesized original acoustic signal to the signal component extraction unit 10. In addition, the original acoustic signal processing section 410 obtains the original acoustic signal from which the echo signal has been removed by using the echo cancellation function.
  • the gain adjustment units 411A, 411B, and 411C (referred to as the gain adjustment unit 411 unless otherwise distinguished) respectively adjust the gains including amplitude and phase of the original acoustic signals input from the microphones 311A, 311B, and 311C.
  • the gain adjustment units 412A and 412B (referred to as the gain adjustment unit 412 if not particularly distinguished) respectively adjust the gain including amplitude and phase of the echo signals input from the signal input units 312A and 312B.
  • the distribution unit 413 outputs the gain-adjusted original sound signal output by the gain adjustment unit 411 and the gain adjustment unit 412 to the speech component acquisition unit 11 and the extra-speech component acquisition unit 12.
  • The distribution unit 413 may also synthesize the gain-adjusted original sound signals output by the respective gain adjustment units 411 and 412 (the result is referred to as the synthesized original sound signal) and output it to the speech component acquisition unit 11 and the extra-speech component acquisition unit 12.
  • The control unit 414 performs control such as determining the gains used for gain adjustment of the original sound signals output by the microphones 311. For example, the control unit 414 adjusts the gains of the gain adjustment units 411 and 412 so that the integrated directivity of the plurality of microphones 311A, 311B, and 311C faces the direction of the origin of the voice, based on beamforming technology or the like. By applying the gains adjusted by the control unit 414 in the gain adjustment units 411 and 412, the synthesized original sound signal becomes a signal in which the original sound received from the direction of the origin of the voice (the speaker, etc.) has been enhanced.
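  • The gain adjustment and synthesis described above can be illustrated with the short sketch below; the dummy signals and gain values are assumptions, and the phase (delay) control used by a real beamformer is deliberately left out.

```python
import numpy as np

def synthesize_with_gains(original_signals, gains):
    """Original sound signal processing unit 410 (sketch): apply a gain to each
    microphone's original sound signal and sum the results into the synthesized
    original sound signal. Only real-valued amplitude gains are used here."""
    original_signals = np.asarray(original_signals, dtype=float)
    gains = np.asarray(gains, dtype=float)
    adjusted = gains[:, None] * original_signals   # gain-adjusted signals (e.g. S01ag, S01bg, S01cg)
    return adjusted, adjusted.sum(axis=0)          # separated signals and the synthesized signal

# gains of 1.0 for all microphones correspond to FIG. 7(a); a setting such as
# [0.0, 1.0, 0.0] effectively selects microphone 311B only
rng = np.random.default_rng(1)
mics = rng.standard_normal((3, 16000))             # dummy signals from microphones 311A, 311B, 311C
adjusted, s_synth = synthesize_with_gains(mics, [1.0, 1.0, 1.0])
```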
  • FIG. 7 is a diagram illustrating an example of selecting a microphone by adjusting the gain in this embodiment, and is an example in a case where the three microphones, the microphones 311A, 311B, and 311C, receive sound waves of the original sound.
  • FIG. 7(a) is an example in the case where the gains of the gain adjustment units 411A, 411B, and 411C (referred to as Ga, Gb, and Gc, respectively) are set to 1.0, respectively.
  • In this case, the distribution unit 413 directly synthesizes the original sound signals output from the three microphones 311A, 311B, and 311C.
  • the directivity D-311A, D-311B, and D-311C indicate the directivity of the sound waves received by the microphones 311A, 311B, and 311C, respectively.
  • The example in this case can also be regarded as an example in which the microphone 311B is selected and utilized from among the three microphones 311A, 311B, and 311C.
  • FIG. 8 is a flowchart for the system to extract extra-speech components from the original sound signal in this embodiment. This figure is used to explain the operation of this embodiment.
  • the voice uttered by the user U1 propagates in the form of sound waves and is received by the microphones 311A, 311B, and 311C (step S21). According to the positional relationship between the user U1 and the microphone 311, the voice uttered by the user U1 reaches the microphone 311B with the strongest intensity. In addition, the sound from the detection object N1 also reaches the microphone 311C with the strongest intensity. Therefore, by synthesizing the original sound signal in accordance with the setting shown in FIG. 7(b), the original sound can be obtained in a manner of enhancing the voice uttered by the user U1 compared with the case of FIG. 7(a). On the other hand, by synthesizing the original sound signal in accordance with the setting shown in FIG.
  • the original sound signal subjected to the gain adjustment is output to the signal component extraction unit 10 as a synthesized original sound signal or a separated original sound signal (step S22). Specific examples are shown below.
  • the original acoustic signals input from the microphones 311A, 311B, and 311C are respectively referred to as S01a, S01b, and S01c.
  • Let the original acoustic signals obtained by gain adjustment of the original acoustic signals S01a, S01b, and S01c be S01ag, S01bg, and S01cg, respectively.
  • the speech component acquisition unit 11 extracts the speech component Sv from the synthesized original sound signal S01g and outputs the speech component Sv to the speech recognition unit 5 and the extra-speech component acquisition unit 12 (step S23).
  • the voice component Sv output to the voice recognition unit 5 is subjected to voice recognition processing (step S24).
  • Based on the gain-adjusted separated original sound signals (S01ag, S01bg, S01cg) and the gains Ga, Gb, and Gc, the separated original acoustic signals (S01a, S01b, S01c) received by the microphones 311A, 311B, and 311C are reproduced (step S25).
  • The speech component acquisition unit 11 outputs the speech component Sv to the extra-speech component acquisition unit 12, and the original sound signal processing unit 410 outputs the gain-adjusted original sound signals S01ag, S01bg, and S01cg and the gains Ga, Gb, and Gc to the extra-speech component acquisition unit 12.
  • The extra-speech component acquisition unit 12 performs inverse calculations on the gain-adjusted original acoustic signals S01ag, S01bg, and S01cg using the gains Ga, Gb, and Gc, respectively, thereby being able to reproduce the original acoustic signals S01a, S01b, and S01c received by the microphones 311A, 311B, and 311C (corresponding respectively to the original acoustic signals received with the directivities D-311A, D-311B, and D-311C in FIG. 7(a)).
  • For example, the extra-speech component acquisition unit 12 can obtain the original acoustic signals received by the microphones 311A, 311B, and 311C by dividing the gain-adjusted original acoustic signals S01ag, S01bg, and S01cg by the gains Ga, Gb, and Gc, respectively.
  • Alternatively, the original acoustic signal processing unit 410 may directly output the original acoustic signals before gain adjustment, that is, the synthesized original acoustic signal S01 or the separated original acoustic signals, to the extra-speech component acquisition unit 12.
  • The extra-speech component acquisition unit 12 extracts the extra-speech component Sn, for example as Sn(f) = S01r(f) - β(f)·Sv(f) (step S26).
  • Here, Sn(f), S01r(f), Sv(f), and β(f) represent the values at frequency f of the extra-speech component Sn, the synthesized original sound signal S01r, the estimated speech signal Sv, and the additional data β, respectively.
  • Like the additional data α used in the first embodiment, the additional data β may be the value of a filter obtained by inverting, upside down, the filter applied by the speech component acquisition unit 11 to extract the speech component, or it may be an empirical value obtained from experiments, past data, etc.
  • S01(f) represents the value of frequency f of the synthesized original acoustic signal S01.
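  • A minimal sketch of steps S25 and S26 under the assumption of real-valued gains and a constant additional data β; channels whose gain is zero cannot be inverted and are simply left as zero here, which a real implementation would have to handle differently.

```python
import numpy as np

def reproduce_original(adjusted_signals, gains, eps=1e-12):
    """Step S25 (sketch): divide the gain-adjusted signals (S01ag, S01bg, S01cg) by
    the gains (Ga, Gb, Gc) to reproduce the signals the microphones actually received."""
    adjusted_signals = np.asarray(adjusted_signals, dtype=float)
    gains = np.asarray(gains, dtype=float)
    reproduced = np.zeros_like(adjusted_signals)
    usable = np.abs(gains) > eps
    reproduced[usable] = adjusted_signals[usable] / gains[usable, None]
    return reproduced

def extract_extra_speech(reproduced, speech_estimate, beta=1.0):
    """Step S26 (sketch): Sn(f) = S01r(f) - beta(f) * Sv(f), computed in the
    frequency domain with a constant beta purely for illustration."""
    s01r = reproduced.sum(axis=0)                  # synthesized reproduced signal S01r
    s01r_f = np.fft.rfft(s01r)
    sv_f = np.fft.rfft(speech_estimate, n=s01r.size)
    return np.fft.irfft(s01r_f - beta * sv_f, n=s01r.size)
```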
  • In this way, the voice recognition device can be used to easily extract extra-speech components.
  • In the third embodiment, the functional configuration of FIG. 6 is used to explain an example in which extra-speech components are generated by applying a technique (for example, beamforming technology) capable of changing the reception directivity of the sound waves by adjusting the gains of the original sound signals output by the sound wave detection unit having a plurality of microphones.
  • The location of the user can change. Here, an example is shown in which the microphone 311 is selected according to the location of the user.
  • the control unit 414 estimates the position of the user U1 by beamforming technology while changing the directivity of the sound waves received by the microphone 311 according to the estimated position of the user.
  • FIG. 9 is a diagram illustrating an example of adjusting the directivity of received sound waves by adjusting the gain in this embodiment, and is an example in a case where the directivity of the microphone 311 is changed by beamforming technology or the like.
  • the directivity D-311 indicates the directivity of the received sound wave when the microphones 311A, 311B, and 311C are regarded as one microphone, and the center point D-311-C indicates the center point of the directional beam.
  • the user U2 represents the user who uttered the voice
  • the detection object N2 represents the detection object who uttered the extra-speech component to be detected.
  • FIG. 9 shows an example of the directivity of the received sound wave when the directional beam B-311 is directed toward the position (estimated position) of the user U2.
  • FIG. 9(c) shows the directivity A-311 obtained by subtracting the directional beam B-311 of FIG. 9(b) from the directivity D-311 of FIG. 9(a).
  • In a general voice recognition device, it is desirable to perform voice recognition processing on the original acoustic signal obtained with the directional beam B-311 directed toward the user U2 as shown in FIG. 9(b); in this embodiment, however, it is desirable to obtain the original acoustic signal with the directivity A-311 shown in FIG. 9(c), which avoids the user U2 as much as possible.
  • the voice uttered by the user U2 propagates in the form of sound waves and is received by the microphones 311A, 311B, and 311C (step S21).
  • the control unit 414 uses the beamforming technology to estimate the position of the user U2 using the received voice, generates a directional beam B-311 as shown in FIG. 9(b), and uses the generated directional beam B-311 to obtain the original Acoustic signal (step S22). Since the beamforming technology is a normal technology, the description is omitted. As a result of generating the directional beam B-311, the directivity of the received sound wave is directed toward the position (or estimated position) of the user U2. Let the gains for the microphones 311 at this time be Ga1, Gb1, and Gc1.
  • the original acoustic signals output from the microphones 311A, 311B, and 311C are respectively referred to as S02a, S02b, and S02c.
  • the speech component acquisition unit 11 extracts the speech component Sv from the synthesized original sound signal S02g, and outputs the speech component Sv to the speech recognition unit 5 and the extra-speech component acquisition unit 12 (step S23).
  • the voice component Sv output to the voice recognition unit 5 is subjected to voice recognition processing (step S24).
  • Based on the gain-adjusted separated original sound signals and the gains Ga1, Gb1, and Gc1, the original sound signals received by the microphones 311A, 311B, and 311C are reproduced (step S25).
  • The speech component acquisition unit 11 outputs the speech component Sv to the extra-speech component acquisition unit 12, and the original sound signal processing unit 410 outputs the gain-adjusted original sound signals S02ag, S02bg, and S02cg and the gains Ga1, Gb1, and Gc1 to the extra-speech component acquisition unit 12.
  • The extra-speech component acquisition unit 12 performs inverse calculations on the gain-adjusted original acoustic signals S02ag, S02bg, and S02cg using the gains Ga1, Gb1, and Gc1, respectively, thereby being able to reproduce the original acoustic signals S02a, S02b, and S02c received by the microphones 311A, 311B, and 311C (corresponding to the original acoustic signals received with the directivity D-311 in FIG. 9(a)). Let the reproduced original sound signals S02a, S02b, and S02c be S02ar, S02br, and S02cr, respectively.
  • The extra-speech component acquisition unit 12 obtains the extra-speech component Sn based on the speech component Sv and the additional data γ input from the speech component acquisition unit 11 and the reproduced synthesized original sound signal S02r (representing the synthesized value of S02ar, S02br, and S02cr), for example as Sn(f) = S02r(f) - γ(f)·Sv(f).
  • The extra-speech component Sn may also be obtained as Sn(f) = S02r(f) - S02g(f), that is, by removing the speech-directed synthesized original sound signal S02g from S02r; in this case Sv(f) and γ(f) are not required.
  • the directivity A-311 is the directivity obtained by removing the beam directivity B-311 from the directivity D-311.
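  • One possible realization of the directivity A-311 (D-311 minus B-311) is sketched below as an assumption, not as the claimed method: it treats the reproduced signals S02ar, S02br, S02cr and the beamforming gains Ga1, Gb1, Gc1 as plain arrays and ignores the delay terms a real beamformer would apply.

```python
import numpy as np

def extra_speech_by_beam_subtraction(reproduced_signals, beam_gains):
    """Third embodiment (sketch): subtract the speech-directed synthesized signal S02g
    (beam B-311 toward the user) from the reproduced synthesized signal S02r
    (directivity D-311), so that Sv and the additional data are not needed."""
    reproduced_signals = np.asarray(reproduced_signals, dtype=float)
    beam_gains = np.asarray(beam_gains, dtype=float)
    s02r = reproduced_signals.sum(axis=0)                           # directivity D-311
    s02g = (beam_gains[:, None] * reproduced_signals).sum(axis=0)   # beam B-311 toward the user
    return s02r - s02g                                              # directivity A-311
```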
  • In this way, a speech recognition device to which beamforming technology is applied can be used to easily extract extra-speech components.
  • In the fourth embodiment, for example, the detection of a door sound is used as a trigger, the extra-speech component after the trigger is acquired, and reference data is generated from the acquired extra-speech component. If the reference data is generated at ordinary times (a state in which no abnormality has occurred), then, for example, when the application determines that the extra-speech component acquired with the door-opening sound as a trigger differs from the reference data, it is determined that some abnormality has occurred.
  • In that case, an abnormality detection signal is output as extra-speech information.
  • detection of recognized words obtained by voice recognition may also be used as a trigger for obtaining reference data. In this embodiment, for the trigger for obtaining reference data, the following examples are shown: a case where a recognized word obtained through voice recognition is detected; and a case where extra-speech information such as a door sound is detected.
  • Fig. 10 is a flowchart for the system of the fourth embodiment to execute reference data generation processing.
  • the generation of the reference data may be performed during the operation of the application system using the steps of the flowcharts shown in the first to third embodiments, or may be performed outside the operation period.
  • the user may arbitrarily break the glass, and cause the sound wave detection unit 3 to receive the sound wave of the breaking sound of the glass and record it in the storage unit 6 as the reference data.
  • the functional configuration of FIG. 1 is used to show an example in which normal environmental data where no abnormality occurs during the operation period of the application system is acquired as reference data.
  • The application system is put into operation, and the speech recognition processing and the extra-speech component recognition processing are operated (steps S301 and S401).
  • the speech recognition processing and the recognition processing of extra-speech components can be implemented based on the flowchart of FIG. 2, for example. However, in the flowchart of FIG. 2, the processing ends after step S11, but in this embodiment, the processing of steps S1 to S11 is always repeated.
  • the flowchart of FIG. 10 shows a case where the speech recognition process and the speech extra-speech component analysis process operate in parallel.
  • The processing flow starting from the speech recognition processing will be explained first, and then the processing flow starting from the extra-speech component recognition processing will be explained.
  • the control unit 7 monitors whether the speech recognition unit 5 detects a recognized word (step S302).
  • the control unit 7 confirms whether or not the timer unit A (not shown) is activated (step S303).
  • When the control unit 7 confirms that the timer unit A is activated, it returns to step S302 and continues to monitor whether the speech recognition unit 5 has detected a recognized word (YES in step S303).
  • the control unit 7 confirms that the time measurement unit A has not been activated, the time measurement unit A is activated to start the time measurement of the time T11 (NO in step S303, proceed to S304).
  • When the time T11 exceeds the threshold TH_T11, the counting of the time T21 is started (step S306).
  • At the same time, the storage (recording) of the extra-speech component in the storage unit 6 is started (step S307).
  • While the time T21 is less than or equal to the threshold TH_T21, the storage (recording) of the extra-speech component in the storage unit 6 started in step S307 is continued (NO in step S308).
  • When the time T21 exceeds the threshold TH_T21, the storage (recording) of the extra-speech component in the storage unit 6 is stopped (YES in step S308; proceed to S309).
  • FIG. 11 is a diagram for explaining the time for counting
  • FIG. 11(a) is an explanatory diagram of the times T11 and T21 for counting in the processing flow starting from the voice recognition process.
  • The time T11 is the time from when the speech recognition unit 5 detects the recognized word and the timer unit A is started in step S304 until the storage (recording) of the extra-speech component in the storage unit 6 is started.
  • When the time T11 exceeds the threshold TH_T11, the storage (recording) of the extra-speech component in the storage unit 6 is started (step S307).
  • the time T21 is the time from the start of the storage (recording) of the extra-voice component in the storage unit 6 to the stop of the recording.
  • Thresholds TH_T11 and TH_T21 can be set for the times T11 and T21, respectively, so that the time (corresponding to T11) and recording time (corresponding to T21) from the detection of the recognized word by the voice recognition unit 5 to the start of recording can be controlled.
  • Step S303 is a conditional branch that prevents multiple instances of the data storage processing of step S307 from being executed at the same time; however, it is also possible to execute multiple data storage processes simultaneously by preparing multiple timer units A and the like.
  • control unit 7 monitors whether the application processing unit 2 detects extra-voice information (step S402).
  • the control unit 7 confirms whether or not the timer unit B (not shown) is activated (step S403).
  • When the control unit 7 confirms that the timer unit B is activated, it returns to step S402 and continues to monitor whether the application processing unit 2 has detected extra-speech information (YES in step S403).
  • the control unit 7 confirms that the time measurement unit B has not been activated, the time measurement unit B is activated to start the measurement of the time T12 (NO in step S403, proceed to S404).
  • When the time T12 exceeds the threshold TH_T12, the counting of the time T22 is started, and at the same time the storage (recording) of the extra-speech component in the storage unit 6 is started (step S407).
  • While the time T22 is less than or equal to the threshold TH_T22, the storage (recording) of the extra-speech component in the storage unit 6 started in step S407 is continued (NO in step S408).
  • When the time T22 exceeds the threshold TH_T22, the storage (recording) of the extra-speech component in the storage unit 6 is stopped (YES in step S408; proceed to S409).
  • FIG. 11(b) is an explanatory diagram of the time T12 and T22 for timing in the processing flow of the recognition processing of the extra-speech component.
  • the time T12 is the time from when the application processing unit 2 detects the extra-speech information and starts the timer unit B in step S404 until the extra-speech component is stored (recorded) in the storage unit 6.
  • When the time T12 exceeds the threshold TH_T12, the storage (recording) of the extra-speech component in the storage unit 6 is started (step S407).
  • the time T22 is the time from the start of the storage (recording) of the extra-voice component in the storage unit 6 to the stop of the recording.
  • Thresholds TH_T12 and TH_T22 can be set for the times T12 and T22, respectively, so that the time (equivalent to T12) and the recording time (equivalent to T22) from when the application processing unit 2 detects the non-voice information to the start of recording can be controlled.
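  • The timing logic of FIG. 10 and FIG. 11 can be sketched as follows; the threshold values and the two callbacks are assumptions (stand-ins for the extra-speech component source and the storage unit 6), and a real implementation would run one such flow per trigger type in parallel with the recognition processing.

```python
import time

TH_T1 = 2.0   # assumed wait (seconds) between trigger detection and the start of recording
TH_T2 = 10.0  # assumed recording duration (seconds); neither value is given in the application

def record_reference_after_trigger(read_extra_speech_frame, store_frame):
    """Sketch of the FIG. 10 flow for a single trigger: once a trigger (a recognized word,
    or extra-speech information such as a door sound) has been detected, wait until T1
    exceeds TH_T1, then store extra-speech frames until T2 exceeds TH_T2."""
    t0 = time.monotonic()
    while time.monotonic() - t0 <= TH_T1:        # timer unit counting T11 / T12
        time.sleep(0.01)
    t1 = time.monotonic()
    while time.monotonic() - t1 <= TH_T2:        # timer unit counting T21 / T22: record reference data
        store_frame(read_extra_speech_frame())   # hypothetical callbacks (extra-speech source / storage unit 6)
```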
  • According to the present embodiment, it is possible to acquire reference data using, for example, the detection of recognized words obtained by voice recognition or the detection of extra-speech information such as door sounds as a trigger.
  • the user can assign detection event names such as "normal state after the recognition word is detected", "normal state after the door opening sound is detected”, etc., to the acquired reference data.
  • a list of detected event names for which reference data has been acquired may be displayed on a display unit (not shown) so that the user can select it from a remote control or the like.
  • For example, the application processing unit 2 can use the "door opening sound" as a trigger in the operating state to acquire the subsequent extra-speech components.
  • the application processing unit 2 may also compare the acquired extra-speech components with corresponding reference data to output detection event information (extra-speech information) such as an abnormality detection signal.
  • the output extra-voice information can be sent to the user's smart phone or the like via the Internet or the like. For example, the user can be notified that "the door of his house is opened and an abnormality has occurred" as an extra-voice message.
  • It can also be confirmed whether a control target based on voice recognition, for example an air conditioner, is operating normally.
  • A power-on command (that is, a start command) generated by the voice recognition unit 5 is sent to the air conditioner.
  • When the air conditioner successfully receives the start command, it acts according to the control content (power on) of the start command.
  • the application processing unit 2 uses the detection of a recognized word (in this case, equivalent to a start command) obtained by voice recognition as a trigger to obtain ambient sound as reference data. From the time the air conditioner that normally receives the start command is turned on, for example, the sound after the air conditioner starts to start is recorded as the reference data acquired in this embodiment. If the reference data is used in the application system operation, when the user issues a start command but the air conditioner fails to receive the start command and performs an action other than the start command, the determination unit 22 of the application processing unit 2 compares the recognition word (start command The detection of) is used as a trigger and the extracted extra-speech component is compared with the reference data (for example, step S10 in FIG. 2).
  • start command The detection of
  • the air conditioner failed to receive the start command.
  • the result of this determination can be sent to the user’s smartphone via the Internet as detection event information, the recognized word can also be displayed on a display unit not shown, or the recognized word can be voiced through a speaker not shown, such as voice synthesis. Form output.
  • the user can know whether the instruction given by the voice is normally acting on the control object.
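The following sketch shows how such a check might look, assuming the sound captured right after the recognized start command and the reference data are comparable spectra; the notification callable and the threshold are assumptions.

```python
import numpy as np

def verify_start_command(post_command_spectrum, reference_spectrum, notify,
                         corr_threshold=0.8):
    """Sketch: compare the extra-speech component captured after the start command
    with the reference data recorded after a successful start (cf. step S10)."""
    corr = np.corrcoef(np.asarray(post_command_spectrum, float),
                       np.asarray(reference_spectrum, float))[0, 1]
    if corr > corr_threshold:
        notify("start command acted on the air conditioner normally")
    else:
        # the sound after the command does not match the reference data
        notify("the air conditioner may have failed to receive the start command")
```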
Similarly, if reference data acquired with the detection of extra-speech information (a detection event) as a trigger is used, it is possible to know whether an output detection event was detected normally or was a false detection. For example, when a detection event is raised by a false detection, the determination unit 22 of the application processing unit 2 compares the acquired extra-speech component with the reference data acquired in advance by this embodiment, and as a result of the comparison it is determined that the detection event is a false detection. Conversely, even when no detection event is used as a trigger, if an extra-speech component that matches the reference data acquired by this embodiment is detected, it can be known that the detection event corresponding to that reference data has occurred. For example, the determination unit 22 of the application processing unit 2 constantly monitors the extra-speech component, and when it judges that the component matches the reference data acquired in advance immediately after the detection of a detection event A, it judges that the detection event A has occurred. A compact sketch of such always-on monitoring follows.
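A compact sketch of this always-on monitoring, under the assumption that extra-speech spectra arrive as an iterable of frames:

```python
import numpy as np

def monitor_for_event_a(frames, reference_a, corr_threshold=0.8):
    """Sketch: constantly compare extra-speech frames with the reference data
    acquired immediately after detection event A; yield when they match."""
    for spectrum in frames:
        corr = np.corrcoef(np.asarray(spectrum, float), reference_a)[0, 1]
        if corr > corr_threshold:
            yield "detection event A"   # judged to have occurred
```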
FIG. 12 is a diagram showing a configuration example of an electronic device incorporating an extra-speech component generation unit according to a modification. The electronic device 100 only needs to have a control function based on speech recognition; it may be a home appliance such as a television receiver, an air conditioner, a refrigerator, or a washing machine, and is not limited to home appliances. In the electronic device 100, the control unit 101 controls each function of the electronic device 100 based on the control information generated by the speech recognition unit 5. This modification can also be regarded as an example in which the system shown in FIG. 1 is built into the electronic device 100. By adding the extra-speech component acquisition unit 12, the application processing unit 2, the control unit 7, and so on to the functions the electronic device 100 ordinarily has, and performing the same processing as the flowchart of FIG. 2 or FIG. 8, an application system using the electronic device 100, such as a home security system, can be realized.
FIG. 13 is a diagram showing a configuration example of a system using an externally connected smart speaker according to a modification. The electronic device 100A only needs to have a control function based on speech recognition; it may be a home appliance such as a television receiver, an air conditioner, a refrigerator, or a washing machine, and is not limited to home appliances. The smart speaker 200 is externally connected to the electronic device 100A via an interface (not shown), and in the electronic device 100A the control unit 101A controls each function of the electronic device 100A based on the control information generated by the speech recognition unit 5. In addition, an application device 300 is connected to the smart speaker 200. The application device 300 uses the same processing as the flowchart of FIG. 2 or FIG. 8 to acquire the extra-speech component based on the original acoustic signal, the speech component, the additional data (α, β, etc.) and the like input from the smart speaker 200, detects a detection event, and outputs detection event information (extra-speech information). When a detection event is detected, the application device 300 may notify the user's smartphone from the interface unit 8 via the Internet, for example.
Each function shown in FIG. 12 or FIG. 13 may be distributed over the Internet via the interface unit 8 or the like, or may be provided as a function on a cloud server, and various combinations of functions and system forms are conceivable. For example, by making the application device 300 a device on the cloud, reference data from a large number of users connected to the network can be accumulated as samples, which can be used, for example, to improve the detection accuracy of a detection event, as sketched below.
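As a rough sketch of how an application device 300 placed on the cloud might accumulate reference data from many users, the class below simply collects and averages per-event spectra; the storage layout and the use of a mean spectrum are assumptions, not something specified in the patent.

```python
import numpy as np
from collections import defaultdict

class ReferenceAggregator:
    """Sketch: accumulate per-event reference spectra uploaded by many users."""

    def __init__(self):
        self._samples = defaultdict(list)          # event name -> list of spectra

    def add_sample(self, event_name, spectrum):
        self._samples[event_name].append(np.asarray(spectrum, dtype=float))

    def aggregated_reference(self, event_name):
        """Return the mean spectrum over all collected samples for the event."""
        samples = self._samples[event_name]
        return np.mean(samples, axis=0) if samples else None
```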
As described above, according to this modification, the extra-speech component can be extracted in various ways, and the extracted extra-speech component can be used in various ways, for example for home security technology. According to at least one of the embodiments and modifications described above, it is possible to provide an environmental sound output device, system, method, and non-volatile storage medium that can easily extract environmental sound using a speech recognition device, the non-volatile storage medium storing computer instructions that, when executed by a processor or computer device, realize the above-described method of easily extracting environmental sound using a speech recognition device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided are an environmental sound output device, system, method, and non-volatile storage medium capable of easily extracting environmental sound by using a speech recognition device. The environmental sound output device comprises a sound wave detection means, a speech component acquisition means of a speech recognition device, and an extra-speech component acquisition means, wherein the sound wave detection means receives original sound with at least one microphone and outputs it as an original acoustic signal, the speech component acquisition means extracts a speech component from at least one of the original acoustic signal and a synthesized signal generated from a plurality of the original acoustic signals, and the extra-speech component acquisition means generates and outputs the extra-speech component based on at least the speech component and the original acoustic signal.

Description

环境声输出装置、系统、方法及非易失性存储介质
本申请要求在2020年1月17日提交日本专利局、申请号为2020-006227、发明名称为“环境声输出装置、系统、方法及程序”的日本专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施方式涉及环境声输出装置、系统、方法及非易失性存储介质。
背景技术
在用于确保住宅的安全性的家庭安全技术中,需要与安全性的用途对应的传感器。例如,为了进行入侵检测,需要用于检测窗户、门的开闭的窗户传感器、门传感器等多个传感器。另外,除了要假定住宅中的入侵场所、居住者移动的活动路线等来设置传感器等引入的麻烦以外,还会在设置后产生需要进行传感器的蓄电池更换等维护的成本。
近年来,智能音箱(Smart speaker)等利用了语音识别技术的设备正在普及。例如,智能音箱只要是处于语音到达的范围内则能够根据人发出的语音来进行家电等的远程控制,通常覆盖周围几平方米的广阔区域。当使用这样的语音识别设备时,有可能实现以无需多个传感器这样的低成本引入的容易的家庭安全技术(Home security)。语音识别设备的目的在于,从输入的声音信号中特别提取人的声音的成分(称为语音成分),根据语音信号对人发出的语言进行文本化(文字化、语言化等)。在语音识别设备中,在上述目的的基础上,还引入用于抑制语音以外的环境声成分(称为语音外成分)的技术以消除或尽可能减小语音外成分。
在先技术文献
专利文献
专利文献1:日本特开2001-188555号公报
专利文献2:日本特开2000-222000号公报
发明内容
语音外成分能够在家庭安全技术等中利用。特别是通过从正在普及的智能音箱中提取语音外成分而能够容易地引入家庭安全技术。另外,最近在电视等家电中也同样地引入语音识别技术,通过提取语音外成分而能够使用家电来容易地引入家庭安全技术。然而,很难说这些优点已得到有效利用。
本申请要解决的课题在于,提供能够使用语音识别设备来容易地提取环境声的环境声输出装置、系统、方法及非易失性存储介质。
本申请一实施方式的环境声输出装置具备声波检测机构、语音识别设备的语音成分获取机构、以及语音外成分获取机构,所述声波检测机构利用至少一个麦克风来接收原始声,并且作为原始声信号来输出,所述语音成分获取机构从所述原始声信号、以及根据多个所述原始声信号生成的合成信号中的至少一方提取语音成分,所述语音外成分获取机构至少根据所述语音成分和所述原始声信号来生成并且输出所述语音外成分。
附图说明
图1是表示第一实施方式的系统的构成例的框图;
图2是用于第一实施方式的系统接收原始声的声波来执行语音外成分的分析处理的流程图;
图3是表示第一实施方式的系统从原始声信号中提取到语音信号为止的滤波器和各信号的示意图;
图4是表示第一实施方式的系统从原始声信号中提取到语音外成分为止的滤波器和各信号的示意图;
图5是表示第一实施方式的系统从语音外成分中提取到特定成分为止的滤波器和各信号的示意图;
图6是表示在第二实施方式中用于对多个麦克风接收到的原始声的声波进行处理的声波检测部及原始声信号处理部的构成例的框图;
图7是说明在第二实施方式中通过增益的调整来选择麦克风的示例的图;
图8是在第二实施方式中用于系统从原始声信号中提取语音外成分的流程图;
图9是说明在第三实施方式中通过增益的调整来调整接收声波的指向性的示例的图;
图10是用于第四实施方式的系统执行参照数据的生成处理的流程图;
图11是说明第四实施方式的系统要计时的时间的图;
图12是表示变形例的内置有语音外成分生成部的电子设备的构成例的图;
图13是表示本变形例的使用了外置的智能音箱的情况下的系统的构成例的图。
附图标记说明
1…语音外成分生成部、2…应用处理部、3…语音检测部、4…原始声信号处理部、5…语音识别部、6…存储部、7…控制部、8…接口部、10…信号成分提取部、11…语音成分获取部、12…语音外成分获取部、21…处理部、22…判定部、23…存储部、31…麦克风、100…电视接收装置、101…控制部、200…智能音箱、300…应用装置。
具体实施方式
以下,参照附图对实施方式进行说明。
(第一实施方式)
在本实施方式中,说明如下的示例:生成从声波检测部接收到的原始声中去除语音成分而得到的语音外成分,并将语音外成分在应用系统中使用。
图1是表示第一实施方式的系统的构成例的框图。
语音外成分获取部1是获取原始声来生成并输出语音外成分的装置。原始声表示物理上的声波。语音外成分获取部1包括声波检测部3、原始声信号处理部4和信号成分提取部10。
声波检测部3包括麦克风和未图示的模拟数字转换部(以下称为A/D转换部)。在麦克风31A、31B、31C(在特别需要进行区分的情况下记为麦克风31)中接收原始声的声波并将其转换为电信号(模拟原始声信号)。模拟原始声信号在未图示的A/D转换部中被向数字值转换并输出原始声信号。以下,只要没有特别区分,则原始声信号表示基于数字值的原始声信号。
原始声信号处理部4输出通过将声波检测部3所输出的原始声信号合成而得到的原始声信号(以下称为合成原始声信号)。另外,原始声信号处理部4对从多个麦克风31输出的原始声信号实施增益调整等并进行合成。这里,“从麦克风31输出的原始声信号”这样的表述中的原始声信号表示基于数字值的原始声信号。实际上原始声信号是从A/D转换部输出的,但以下出于明确原始声信号的来源的目的,有时也使用这样的表述。原始声信号处理部4也可以通过增益调整等来例如从多个麦克风31A、31B、31C中选择有效的麦克风、适用波束成形(Beamforming)等麦克风阵列技术。原始声信号处理部4例如可以以数字信号处理器(也称为Digital Signal Processor或DSP)等形式进行动作,例如也可以在微型计算机等计算机上作为软件来进行动作,还可以以IC芯片等硬件的形式进行动作。另外,还可以以组合以上形式而得到的形式进行动作。
信号成分提取部10根据从原始声信号处理部4输入的原始声信号或合成原始声信号,生成语音成分(也称为语音信号)和语音外成分并将它们输出。语音成分是特别以人的语音的成分为中心的成分,也可以是语音成分的推定值。语音外成分表示从原始声中去除语音成分而得到的成分。原始声信号处理部4例如可以以数字信号处理器DSP等形式进行动作,例如也可以在微型计算机等计算机上作为软件来进行动作,还可以以IC芯片等硬件的形式进行动作。另外,还可以以组合以上形式而得到的形式进行动作。信号成分提取部10包括语音成分获取部11和语音外成分提取部12。
语音成分获取部11从原始声信号处理部4所输出的原始声信号或合成原始声信号中提取语音成分。语音成分获取部11也可以具备在智能音箱等通常的语音识别设备中使用的用于提取语音成分的滤波器、用于去除语音外成分的降噪器等。
语音外成分获取部12使用原始声信号处理部4所输出的原始声信号或合成原始声信号、语音成分获取部11所输出的语音成分、在原始声信号处理部4中使用的增益调整量(有时也简称为增益)等参数等来提取语音外成分并将其输出。语音外成分获取部12也可以针对语音成分获取部11所输出的语音成分根据需要来利用滤波器。滤波器的信息也可以作为附加数据而从语音成分获取部11输入到语音外成分获取部12中。
应用处理部2对从信号成分提取部10输入的语音外成分实施分析等,生成基于语音外成分得到的信息(以下称为语音外信息)并将语音外信息向外部输出。语音外信息可以包括与环境声相关的信息,例如在语音外成分的分析的结果是检测出玻璃的破碎声的情况下,可以将“玻璃破碎”这样的信息作为语音外信息来输出。应用处理部2也可以基于语音外成分来检测异常声,将有无检测出异常声等检测结果、从何种物体产生了何种异常声等检测内容作为语音外信息来输出。应用处理部2能够检测出的异常声例如可以是窗户玻璃的破碎声、不在家时的人的脚步声、重物的倒塌声、人的摔倒声等任意的异常声。应用处理部2例如可以在微型计算机等计算机上作为软件来进行动作,也可以在IC芯片等硬件中进行动作。
处理部21A、21B、21C(在不特别区分的情况下,称为处理部21)从语音外成分中提取必要的成分(以下称为特定成分)。具体而言,根据语音外成分的频率特性等,从语音外成分中提取某特定频带的成分来作为特定成分。处理部21A、21B、21C分别提取的特定成分可以不同,例如根据检测出何种异常声等情况来确定。另外,要提取的特定成分也可以在各处理部21中预先确定,这种情况下,例如可以向用于处理应用处理部2的功能的软件、硬件中编入用于确定特定成分的频带的信息。另外,用户等也可以经由例如接 口部8等来设定或选择应提取的特定成分。另外,处理部21还可以算出表示想要从语音外成分检测出的异常声中包含的特定成分的频率特征量数据。处理部21将提取出的特定成分或频率特征量数据向后段输出。
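The band extraction performed by the processing units 21A, 21B, 21C described above can be sketched as follows; the FFT-based implementation, the function names, and the example band limits are assumptions made only for illustration, since the text only states that a component of a certain specific frequency band is extracted from the extra-speech component.

```python
import numpy as np

def extract_specific_component(extra_speech, sample_rate, band_hz=(20.0, 300.0)):
    """Sketch: keep only one frequency band (e.g. a low band for footstep-like
    sounds) of an extra-speech time signal; band_hz is an assumed example band."""
    spectrum = np.fft.rfft(np.asarray(extra_speech, dtype=float))
    freqs = np.fft.rfftfreq(len(extra_speech), d=1.0 / sample_rate)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    specific = np.where(in_band, np.abs(spectrum), 0.0)   # zero outside the band
    return freqs, specific
```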
存储部23按照要检测的事件(以下称为检测事件)来存储特定成分的数据(以下称为参照数据)。检测事件例如是窗户的破碎声、不在家时的人的脚步声、重物的倒塌声、人的摔倒声等。按检测事件来预先获取特定成分的数据并设为各检测事件的参照数据。特定成分的参照数据可以利用语音外成分获取部12所输出的语音外成分,该参照数据可以是在频率特性等频率区域中表示的数据,也可以是在时间信号等时间区域中表示的数据。另外,参照数据还可以是根据特定成分算出的频率特征量数据。
例如,在将检测事件设为“窗户的破碎声”的情况下,参照数据可以通过实际上在声波检测部3的附近打破窗户而将“窗户的破碎声”存储于(录音于)存储部23来获取。另外,也可以将在后述的存储部6中按检测事件来存储的声样数据下载到存储部23中。还可以将写入有声样数据的CD等存储介质或互联网上的服务器等提供的样本等下载到存储部23中。另外,还可以是用户对电视的广播信号、无线电广播等传送的内容进行录音、编辑来制成声样数据,并将其存储于存储部23。在声样数据中没有检测事件的名字的情况下,向声样数据赋予检测事件的名字来设为参照数据。
判定部22在从处理部21输入特定成分(或其特征数据)时,从存储部23获取对应的参照数据,对输入的特定成分(或其特征数据)与获取到的参照数据进行比较,在判定为两者一致时,将一致的参照数据上赋予的检测事件的名称等作为检测事件信息来输出。
上述的示例是表示特定成分与参照数据1对1地对应的情况的示例,但特定成分与参照数据的组合可以是各种组合。例如,特定成分的数量与参照数据的数量可以是1对M(M为2以上的自然数)的情况。例如,在某个特定成分的频率区域与多个参照数据的语音外成分的频率区域重叠这样的情况下,特定成分与参照数据成为1对M。
另外,特定成分的数量与参照数据的数量可以是M对1(M为2以上的自然数)的情况。处理部21向判定部22输出多个(例如M个)特定成分,判定部22按输入的特定成分来分别将各特定成分与对应的参照数据进行比较,输出多个(例如M个)检测事件信息。判定部22输出了的检测事件信息例如可以经由互联网向应用处理部2所设定的用户的智能手机输出,而以检测事件信息的形式通知用户发生了事件。需要说明的是,判定部22的功能也可以通过利用基于智能音箱上配备的人工智能技术(AI技术)进行的触发词检测(唤醒词检测)、特定词检测来实现。
根据本实施方式,通过使用广泛普及的智能音箱的功能而能够进行异常检测等家庭安全系统的构建。
语音识别部5针对语音成分获取部11所输出的语音成分使用语音识别技术来识别语音并将其文本化(文字化、语言化等)。语音识别部5输出经过文本化而得到的语言(以下称为识别词)。输出目的地可以是要利用识别词的应用或装置,也可以在未图示的显示部上显示识别词,还可以通过语音合成等从未图示的扬声器将识别词以语音的形式输出。利用识别词的应用或装置例如将识别词用作控制用的指令,从语音识别部5接收到识别词的应用或装置基于识别词来执行控制等。由于语音识别技术是公知的常见技术,因此省略详细的说明。
在存储部6中存储有要让存储部23来存储的参照数据的样本数据。如以上在存储部 23的说明中叙述的那样,在存储部6中存储有下载下来的声样数据、录制数据或它们的特征量等,根据需要来向存储部23提供。也可以在存储部6中包含未存储于存储部23的检测事件的参照数据。例如可以是,在用户从遥控器选择了想要让应用处理部2检测出的“检测事件”的情况下,当控制部7经由接口部8接收到从遥控器发送来的选择信号时,控制部7将存储于存储部6的“检测事件”的参照数据设定于存储部23。
控制部7控制本应用系统的各功能。具体而言,控制部7可以基于从接口部8输入的包括选择信号在内的各种控制信号来控制各功能。另外,控制部7也可以基于语音识别部5所输出的识别词来控制各功能。需要说明的是,在图1中,可以在控制部7和没有与该控制部7连线的功能框之间也进行数据的交换(包括控制)。
接口部8与本应用系统的外部进行各种通信。具体而言,接口部8包括远程控制器(以下称为遥控器)、红外线通信、鼠标、键盘、以太网、HDMI、Wifi、第5代移动通信(5G)等有线、无线的各种通信接口。例如,用户能够使用遥控器来对本应用系统进行各种设定、控制。另外,接口部8例如也可以具备用于接入互联网的通信接口,从互联网上的服务器向存储部23下载声样数据等。另外,接口部8例如还可以向与互联网连接的用户的智能手机、PC等输出检测事件信息。另外,接口部8还可以具备能够实现应用处理部2与外部的家电的通信的接口,例如可以经由接口部8中的红外线通信对外部的家电生成、发送指令等来控制外部的家电。
另外,本系统也可以采用包含语音成分获取部11、声波检测部3、语音识别部5等的功能在内的智能音箱。智能音箱设置在家庭内,用于识别人声。智能音箱的麦克风(与本实施方式的声波检测部3所包括的麦克风相当)覆盖周围几平方米的广阔区域并始终监视着周边的状态。智能音箱成为提取人的声音这样的结构,包含与语音成分获取部11相当的功能。另外,智能音箱出于提取人声这样的目的而将噪音等杂音或环境声等认为是检测人的声音时的障碍,因此通过降噪处理、基于波束成形技术实现的语音收集方向的缩窄等技术来抑制上述障碍。另外,也出于用于确认用户的意图的目的,而在检测出唤醒词之后进行成为对话模式的动作。这样,智能音箱构成为为了增强人的声音(语音成分)而抑制成为其背景的语音外成分。然而,在本实施方式中,使用智能音箱来提取并利用语音外成分。
图2是用于该实施方式的系统接收原始声的声波来执行语音外成分的分析处理的流程图。
语音检测部3常态下启动,接收语音检测部3周边的原始声的声波,生成电形式的原始声信号并将其输出(步骤S1)。将从麦克风31A、31B、31C输出的原始声信号分别设为S0a、S0b、S0c。从语音检测部3向原始声信号处理部4输入的原始声信号被合成而作为合成原始声信号(设为S0)向语音成分获取部11和语音外成分获取部12输入(步骤S2)。这里,合成原始声信号S0可以是以S0=S0a+S0b+S0c的方式将来自各麦克风31的原始声信号合成而得到的值。另外,合成原始声信号S0也可以设为(S0a、S0b、S0c)这样将来自各麦克风31的原始声信号区分开的值(称为分离原始声信号)。特别是在表示分离原始声信号的情况下,标注<>来表示<S0>。即,表示为<S0>=(S0a、S0b、S0c)。
图3示出该实施方式的系统从原始声信号提取到语音信号为止的滤波器和各信号的频率特性的示意图,在(a)~(d)中,横轴可以由频率表示,纵轴可以由例如振幅值、电力值等或它们的相对值(概率密度函数等)表示。图3是用来便于说明的示意图,该图并不意在严格示出数值的大小。图3的(a)是以合成原始声信号S0=S0a+S0b+S0c的 方式将原始声信号合成的情况下的示意图。
返回到图2,语音成分获取部11针对输入的合成原始声信号S0进行语音信号Sv的提取处理(步骤S3)。具体而言,向用于提取语音信号的滤波器(例如图3的(b)、(c)的示意图)输入合成原始声信号S0来获得语音信号的推定值Sv(以下有时也称为推定语音信号)。推定语音信号Sv(例如图3的(d)的示意图)被向语音识别部5输入,语音识别部5基于推定语音信号Sv来进行语音识别处理,从语音信号中识别语言(进行文本化)来获得识别词(步骤S4)。语音识别部5将获得的识别词向外部输出(步骤S5)。
从原始声信号处理部4向语音外成分获取部12输入合成原始声信号S0,从语音成分获取部11向语音外成分获取部12输入推定语音信号Sv和附加数据α。附加数据α可以是语音成分获取部11对合成原始声信号S0适用了的滤波器的值。
图4示出该实施方式的系统从原始声信号提取到语音外成分为止的滤波器和各信号的频率特性的示意图,在(a)~(c)中,横轴可以由频率表示,纵轴可以由例如振幅值、电力值等或它们的相对值(概率密度函数等)等表示。与图3同样,图4是用来便于说明的示意图,该图并不意在严格示出数值的大小。图4的(a)是以合成原始声信号S0=S0a+S0b+S0c的方式将原始声信号合成的情况下的示意图。图4的(b)示出为了将在步骤S3中被抑制了的语音外成分的等级恢复到原始状态而使用的滤波器,例如,示出将图3的(c)的滤波器上下对调了的滤波器。将该滤波器的值例如设为附加数据α。另外,也可以将根据实验、过往数据等获得的经验值作为附加数据α来使用。
在语音外成分获取部12中,例如至少根据合成原始声信号S0和推定语音信号Sv来获取语音外成分Sn(例如图4的(c)的示意图)(步骤S6)。具体而言,例如以Sn(f)=S0(f)-Sv(f)的方式获得语音外成分Sn。其中,Sn(f)、S0(f)、Sv(f)分别表示语音外成分Sn、合成原始声信号S0、推定语音信号Sv的按频率f的值,针对可能的范围内的频率f求解Sn(f)来获得Sn。
另外,也可以针对合成原始声信号S0和推定语音信号Sv考虑附加数据α来获取语音外成分Sn。具体而言,例如,以Sn(f)=S0(f)-Sv(f)×α(f)的方式获得语音外成分Sn。其中,α(f)表示附加数据α的按频率f的值。
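The per-frequency relation described above, Sn(f) = S0(f) − Sv(f) × α(f) (or simply S0(f) − Sv(f) when no additional data α is used), can be written as a short sketch. It assumes S0, Sv and α are magnitude values sampled on a common frequency grid; clipping negative values to zero is an added assumption, not part of the text.

```python
import numpy as np

def extra_speech_component(s0, sv, alpha=None):
    """Sketch of step S6: Sn(f) = S0(f) - Sv(f) * alpha(f).

    s0    -- synthesized original acoustic signal spectrum S0(f)
    sv    -- estimated speech signal spectrum Sv(f)
    alpha -- optional filter values (additional data alpha)
    """
    s0 = np.asarray(s0, dtype=float)
    sv = np.asarray(sv, dtype=float)
    sn = s0 - sv if alpha is None else s0 - sv * np.asarray(alpha, dtype=float)
    return np.maximum(sn, 0.0)   # assumption: keep the result non-negative
```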
在本实施方式中,设计成用户对应用处理部2实施了例如对检测事件A进行检测这样的设定。与检测事件A对应的参照数据存储于存储部23A。
语音外成分获取部12所获取的语音外成分被向应用处理部2输入,由处理部21A从语音外成分中提取特定成分,提取出的特定成分被向判定部22A输出(步骤S8)。
图5在频率区域中示出该实施方式的系统从语音外成分中提取到特定成分为止的滤波器和各信号的示意图,在图5的(a)~(c)中,横轴可以由频率表示,纵轴可以由例如振幅值、电力值等或它们的相对值(概率密度函数等)表示。图5是用来便于说明的示意图,该图并不意在严格示出数值的大小。
处理部21A例如可以包括图5的(b)的滤波器fna这样的抑制高频成分且提取低频成分的滤波器。具体而言,在将例如人的脚步声作为检测事件A来检测的情况下使用该滤波器fna。滤波器fna可以通过使用事先录制好的人的脚步声等来形成且预先设定于处理部21A。可以是在用户进行了对检测事件A进行检测这样的设定时,选择设定有滤波器fna的处理部21A。图5的(c)是在步骤S8中提取出的特性成分Sna的示意图。另外,也可以在用户进行了对与检测事件A不同的检测事件B、检测事件C进行检测这样的设定时, 选择包括分别按事件预先获取了的滤波器等在内的处理部21B、处理部21C。
判定部22A从存储部23A获取参照数据(步骤S9)。
参照数据例如是图5的(b)的特定成分滤波器fna。这种情况下,与检测事件A关联地将特定成分滤波器fna存储于存储部23A来作为参照数据。另外,参照数据除了特定成分滤波器fna以外,还可以通过使用事先录制好的人的脚步声、从互联网上的服务器等下载下来的人的脚步声的样本等来形成并预先设定于存储部23A。可以在用户进行了对检测事件A进行检测这样的设定时,判定部22A从存储部23A获取参照数据。另外,也可以在用户进行了对检测事件B、检测事件C进行检测这样的设定时,判定部22A从分别与检测事件B、检测事件C相关联的存储部23B、23C中获取参照数据。
判定部22A对输入的特定成分Sna与参照数据进行比较(步骤S10)。具体的比较方法例如可以是:求解参照数据与特定成分Sna的相关值,在相关值大于某阈值的情况下视作“参照数据与特定成分一致”。判定部22A在判定为“参照数据与特定成分一致”的情况下,例如将判定所使用的参照数据的检测事件名等作为检测事件信息(语音外信息)来输出(步骤S11)。输出的语音外信息可以根据应用而例如向为了通知用户而指定好的智能手机输出。
在以上的示例中,示出的是特性成分与参照数据的组合为1对1的情况的示例。在检测多个检测事件的情况下,通过使用预先分别被分派给各检测事件的处理部21、判定部22、存储部23,而能够从判定部22获得各检测事件的检测事件信息(语音外信息)。另外,例如在特性成分与参照数据的数量是1对M的情况下,针对在步骤S8中获取到的一个特定成分按参照数据来反复进行步骤S9~S11,从而获得各参照数据的检测事件信息。
根据以上所示的本实施方式,能够取出抑制了人的声音成分(语音成分)的环境成分(语音外成分)。通过使用取出的语音外成分,能够简单地实现以往需要专用传感器的安全性应用、应用系统。尤其是在市面上正在普及的智能音箱或者具有同等的使用语音识别来进行动作的功能的系统中,通过包括用于算出原始声信号或合成原始声信号与杂音抑制信号处理后的语音信号(推定语音信号)之间的差量的机构(语音外成分获取部12)在内,由此能够简单地取出抑制了人的声音成分的杂音成分信号(语音外成分)。由于抑制了人的声音成分的杂音成分信号(语音外成分)以往是不需要的语音,因此不会将其作为输出来使用。或者,不会将语音外成分活用为应用的主输入数据。根据本实施方式,能够积极地利用语音外成分,能够容易地构建对所获取的语音外成分进行利用的应用、尤其是入侵检测、生存监控等与安全性关联的应用系统。需要说明的是,在上述的实施方式中,使用了多个麦克风,但基本上如下构成即可:具备声波检测机构、语音识别设备的语音成分获取机构和语音外成分获取机构,所述声波检测机构利用至少一个麦克风接收原始声的声波并将其作为原始声信号输出,所述语音成分获取机构从所述原始声信号和根据多个所述原始声信号生成的合成信号中的至少一方提取语音成分,所述语音外成分获取机构至少根据所述语音成分和所述原始声信号来生成并输出所述语音外成分。另外,即便使用多个麦克风,也可以从它们接收到的原始声信号中选择接收状态最好的原始声信号来作为处理对象。
(第二实施方式)
在本实施方式中,说明如下的示例:在适用了通过对具有多个麦克风的声波检测部所输出的原始声信号实施增益调整来选择接收原始声的声波的麦克风的技术的情况下,生成语音外成分。
图6是表示在第二实施方式中用于对多个麦克风接收到的原始声的声波进行处理的声波检测部及原始声信号处理部的构成例的框图。
在本实施方式中,假设利用了智能音箱的情况,声波检测部310除了包括多个麦克风311A、311B、311C(在不特别区分的情况下,称为麦克风311)以外,还包括用于消除回声(echo)的回声信号的输入部312A、312B(在不特别区分的情况下,称为输入部312)。虽然智能音箱包含输出语音(包括合成语音)的功能,但出于防止智能音箱输出的语音进入其自身配备的麦克风311而成为向麦克风输入的杂音这样的目的而包含回声消除功能。为了消除回声,将智能音箱输出的语音(以下称为回声信号)向输入部312输入。
原始声信号处理部410对从声波检测部310输入的原始声信号进行处理,并将处理后的原始声信号或合成原始声信号向信号成分提取部10输出。另外,原始声信号处理部410通过使用回声消除功能来获得去除了回声信号的原始声信号。
增益调整部411A、411B、411C(在不特别区分的情况下,称为增益调整部411)分别对从麦克风311A、311B、311C输入的原始声信号进行包括振幅、相位在内的增益的调整。
增益调整部412A、412B(在不特别区分的情况下,称为增益调整部412)分别对从信号输入部312A、312B输入的回声信号进行包括振幅、相位在内的增益的调整。
分发部413将增益调整部411及增益调整部412输出的进行了增益调整的原始声信号向语音成分获取部11、语音外成分获取部12输出。另外,分发部413也可以将各增益调整部411、412输出的进行了增益调整的原始声信号合成而得到的信号(若无特别区分,则称为合成原始声信号)向语音成分获取部11、语音外成分获取部12输出。
控制部414对麦克风311输出的原始声信号进行如下等控制:确定进行增益调整的增益。例如,控制部414基于波束成形技术等,以使多个麦克风311A、311B、311C的综合的指向性朝向语音的发出源的方向的方式调整增益调整部411及增益调整部412的增益。通过将由控制部414调整过的增益适用于增益调整部411及增益调整部412,由此合成原始声信号成为从语音的发出源(发声者等)的方向接收到的原始声得到了增强的信号。
图7是说明在该实施方式中通过增益的调整来选择麦克风的示例的图,是由麦克风311A、311B、311C这三个麦克风接收原始声的声波的情况下的示例。
图7的(a)是将增益调整部411A、411B、411C的增益(分别称为Ga、Gb、Gc)分别设为1.0的情况下的示例,分发部413以将从三个麦克风311A、311B、311C输出的原始声信号直接合成的方式进行动作。指向性D-311A、D-311B、D-311C分别表示麦克风311A、311B、311C接收声波的指向性。
图7的(b)是将增益分别设为Ga=0.0、Gb=1.0、Gc=0.0的情况下的示例。由于Gb以外的增益为0.0,因此如下这样示出:仅麦克风311B以指向性D-311B接收到的原始声是有效的,麦克风311A、麦克风311C接收到的原始声没有被包含在合成原始声信号中。因此,麦克风311A、麦克风311C的指向性D-311A、D-311C没有在图7的(b)中示出。这种情况下的示例也可以视作是从三个麦克风311A、麦克风311B、麦克风311C选择麦克风311B来利用的示例。
图7的(c)是将增益分别设为Ga=1.0、Gb=0.0、Gc=1.0的情况下的示例。由于Gb的增益为0.0,因此如下这样示出;麦克风311B接收到的原始声没有被包含在合成原始声信号中。因此,在图7的(c)中仅示出针对麦克风311A、麦克风311C的指向性D -311A、D-311C。
图8是在该实施方式中用于系统从原始声信号提取语音外成分的流程图,使用该图来说明本实施方式的动作。
用户U1发出的语音以声波的形式传播而被麦克风311A、311B、311C接收(步骤S21)。根据用户U1与麦克风311的位置关系,用户U1发出的语音以最强的强度到达麦克风311B。另外,来自检测对象N1的声响同样以最强的强度到达麦克风311C。因而,通过按照图7的(b)所示的设定来合成原始声信号,由此与图7的(a)的情况相比能够以增强用户U1发出的语音的方式获得原始声。另一方面,通过按照图7的(c)所示的设定来合成原始声信号,由此与图7的(a)的情况相比能够获得抑制用户U1发出的语音并增强检测对象N1发出的声响的原始声。在智能音箱等语音识别设备中,以增强用户U1发出的语音的方式进行获取,因此,在本实施方式中,示出利用图7的(b)的结构获得原始声的示例。因而,原始声信号处理部410对从声波检测部310输入的原始声信号,实施将增益分别设为Ga=0.0、Gb=1.0、Gc=0.0的增益调整处理,获得进行了增益调整的原始声信号。进行了增益调整的原始声信号作为合成原始声信号或者分离原始声信号向信号成分提取部10输出(步骤S22)。具体例如下所示。将从麦克风311A、311B、311C输入的原始声信号分别设为S01a、S01b、S01c。另外,将对原始声信号S01a、S01b、S01c进行了增益调整而得到的原始声信号设为S01ag、S01bg、S01cg。即,设为S01ag=S01a×Ga、S01bg=S01b×Gb、S01cg=S01c×Gc,从而获得分离原始声信号<S01g>=(S01ag、S01bg、S01cg)。另外,在是合成原始声信号S01g的情况下,获得S01g=S01ag+S01bg+S01cg。
语音成分获取部11从合成原始声信号S01g中提取语音成分Sv并将语音成分Sv向语音识别部5和语音外成分获取部12输出(步骤S23)。向语音识别部5输出的语音成分Sv被进行语音识别处理(步骤S24)。
在本实施方式中,为了获得不是增强用户U1发出的语音而是增强检测对象N1发出的声音(相当于语音外成分)的原始声,根据进行了增益调整的分离原始声信号<S01g>和增益Ga、Gb、Gc来重现按麦克风311A、311B、311C接收到的分离原始声信号<S01>(步骤S25)。具体而言,语音成分获取部11将语音成分Sv向语音外成分获取部12输出,原始声信号处理部410将进行了增益调整的原始声信号S01ag、S01bg、S01cg、增益Ga、Gb、Gc向语音外成分获取部12输出。语音外成分获取部12通过对进行了增益调整的原始声信号S01ag、S01bg、S01cg分别使用增益Ga、Gb、Gc来进行反算,由此能够重现按麦克风311A、311B、311C接收到的原始声信号S01a、S01b、S01c(分别相当于以图7的(a)中的D-311A、D-311B、D-311C的指向性接收到的原始声信号)。将重现后的原始声信号S01a、S01b、S01c分别设为S01ar、S01br、S01cr。具体而言,语音外成分获取部12可以通过将进行了增益调整的原始声信号S01ag、S01bg、S01cg分别除以增益Ga、Gb、Gc来获得按麦克风311A、311B、311C接收到的原始声信号S01ar、S01br、S01cr。其中,在增益Ga、Gb、Gc为0.0的情况下,无法进行基于增益的除法运算,因此原始声信号处理部410可以将增益调整前的原始声信号、即合成原始声信号S01或者分离原始声信号<S01>直接向语音外成分获取部12输出。
另外,语音外成分获取部12根据从语音成分获取部11输入的语音成分Sv及附加数据β以及重现后的原始声信号S01ar、S01br、S01cr的合成原始声信号S01r来提取语音外成分Sn(步骤S26)。具体而言,例如,以Sn(f)=S01r(f)-Sv(f)×β(f)的方式获 得语音外成分Sn。其中,Sn(f)、S01r(f)、Sv(f)、β(f)分别表示语音外成分Sn、合成原始声信号S01r、推定语音信号Sv、附加数据β的按频率f的值。附加数据β可以像在第一实施方式中使用的附加数据α那样是通过将语音成分获取部11为了提取语音成分而对原始数据适用的滤波器上下对调来得到的滤波器的值,也可以是根据实验、过往数据等获得的经验值。
需要说明的是,在从原始声信号处理部410直接向语音外成分获取部12输入增益调整前的原始声信号、即合成原始声信号S01的情况下,具体而言,例如以Sn(f)=S01(f)-Sv(f)×β(f)的方式获得语音外成分Sn。其中,S01(f)表示合成原始声信号S01的按频率f的值。
另外,在语音外成分获取部12中用于计算的合成原始声信号S01、S01r也可以设为图7的(c)那样仅使用将用户的语音最强进入的麦克风(相当于麦克风311B)除外了的麦克风、即仅使用麦克风311A和麦克风311C的情况下的合成原始声信号。具体而言,使用以S01=S01a+S01c或S01r=S01ar+S01cr的方式将S01b、S01br除外了的合成原始声信号。
另外,也可以以S01=S01a+S01c或S01r=S01ar+S01cr的形式将S01或S01r直接作为语音外成分来使用。即,设为Sn=S01a+S01c、Sn=S01ar+S01cr。
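A minimal sketch of the second embodiment's gain handling follows: the gain-adjusted signals S01ag, S01bg, S01cg are divided by the gains Ga, Gb, Gc to reproduce the per-microphone signals (step S25), falling back to the signals before gain adjustment when a gain is 0.0, and Sn(f) = S01r(f) − Sv(f) × β(f) is then evaluated (step S26). The array-based representation and the clipping at zero are assumptions for illustration.

```python
import numpy as np

def reproduce_originals(gain_adjusted, gains, pre_gain_signals):
    """Sketch of step S25: reproduce S01ar, S01br, S01cr from S01ag, S01bg, S01cg.
    When a gain is 0.0 the division is impossible, so the signal before gain
    adjustment is used directly."""
    reproduced = []
    for adjusted, gain, before in zip(gain_adjusted, gains, pre_gain_signals):
        if gain == 0.0:
            reproduced.append(np.asarray(before, dtype=float))
        else:
            reproduced.append(np.asarray(adjusted, dtype=float) / gain)
    return reproduced

def extra_speech_from_reproduced(reproduced, sv, beta):
    """Sketch of step S26: Sn(f) = S01r(f) - Sv(f) * beta(f)."""
    s01r = np.sum(reproduced, axis=0)             # synthesized reproduced signal S01r
    sn = s01r - np.asarray(sv, dtype=float) * np.asarray(beta, dtype=float)
    return np.maximum(sn, 0.0)                    # assumption: keep the result non-negative
```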
根据以上的步骤,能够使用语音识别设备来容易地提取语音外成分。
(第三实施方式)
在本实施方式中,使用图6的功能构成来说明如下示例:在适用了通过对具有多个麦克风的声波检测部所输出的原始声信号进行增益调整而能够变更声波的接收指向性的技术(例如波束成形技术)的情况下,生成语音外成分。
通常,用户存在的位置会发生变动,在第二实施方式中,示出的是根据用户的位置来选择麦克风311的示例。在本实施方式中,示出控制部414例如通过波束成形技术一边推测用户U1的位置一边根据推测出的用户的位置来改变麦克风311接收声波的指向性的示例。
图9是说明在该实施方式中通过增益的调整来调整接收声波的指向性的示例的图,是通过波束成形技术等来改变麦克风311的指向性的情况下的示例。
图9的(a)是在增益调整部411A、411B、411C中将增益分别设为Ga=1.0、Gb=1.0、Gc=1.0的情况下的示例。指向性D-311表示将麦克风311A、311B、311C视作一个麦克风的情况下的接收声波的指向性,中心点D-311-C表示指向性波束的中心点。用户U2表示发出语音的用户,检测对象N2表示发出要检测的语音外成分的检测对象。
图9的(b)示出将指向性波束B-311朝向用户U2的位置(推测位置)的情况下的接收声波的指向性的示例。图9的(c)示出从图9的(a)的指向性D-311去除了图9的(b)的指向性波束B-311而得到的指向性A-311。
在通常的语音识别设备中,期望图9的(b)那样针对作为朝向用户U2的指向性波束B-311而获得的原始声信号进行语音识别处理,但在本实施方式中,期望如图9的(c)那样以尽可能避开用户U2的指向性A-311来获得原始声信号。
使用图8的流程图来说明本实施方式的动作例。对与第二实施方式同样的部分省略说明。
用户U2发出的语音以声波的形式传播而被麦克风311A、311B、311C接收(步骤S21)。 控制部414通过波束成形技术,利用接收到的语音来推测用户U2的位置,生成图9的(b)所示那样的指向性波束B-311,利用生成的指向性波束B-311来获得原始声信号(步骤S22)。由于波束成形技术为通常的技术,因此省略说明。生成指向性波束B-311的结果是使得接收声波的指向性朝向用户U2的位置(或者推测位置)。将此时的针对各麦克风311的增益设为Ga1、Gb1、Gc1。即,原始声信号处理部410向信号成分提取部10输出针对从声波检测部310输入的原始声信号进行了增益分别设为Ga=Ga1、Gb=Gb1、Gc=Gc1的增益调整而得到的原始声信号(步骤S22)。将从麦克风311A、311B、311C输出的原始声信号分别设为S02a、S02b、S02c。另外,将对原始声信号S02a、S02b、S02c进行了增益调整而得到的原始声信号分别设为S02ag、S02bg、S02cg。即,设为S02ag=S02a×Ga、S02bg=S02b×Gb、S02cg=S02c×Gc,从而获得分离原始声信号<S02g>=(S02ag、S02bg、S02cg)。另外,在是合成原始声信号S02g的情况下,获得S02g=S02ag+S02bg+S02cg。
语音成分获取部11从合成原始声信号S02g中提取语音成分Sv,并将语音成分Sv向语音识别部5和语音外成分获取部12输出(步骤S23)。向语音识别部5输出的语音成分Sv被进行语音识别处理(步骤S24)。
在本实施方式中,为了获得不是增强用户U2发出的语音而是增强检测对象N2发出的声响(语音外成分相当)的原始声,根据进行了增益调整的分离原始声信号<S02g>和增益Ga、Gb、Gc来重现按麦克风311A、311B、311C接收到的原始声信号(步骤S25)。具体而言,语音成分获取部11将语音成分Sv向语音外成分获取部12输出,原始声信号处理部410将进行了增益调整的原始声信号S02ag、S02bg、S02cg和增益Ga1、Gb1、Gc1向语音外成分获取部12输出。语音外成分获取部12通过对进行了增益调整的原始声信号S02ag、S02bg、S02cg分别使用增益Ga1、Gb1、Gc1来进行反算,由此能够重现按麦克风311A、311B、311C接收到的原始声信号S02a、S02b、S02c(相当于以图9的(a)的D-311的指向性接收到的原始声信号)。将重现后的原始声信号S02a、S02b、S02c分别设为S02ar、S02br、S02cr。进而,语音外成分获取部12根据从语音成分获取部11输入的语音成分Sv及附加数据β以及重现后的原始声信号的合成原始声信号S02r(表示S02ar、S02br、S02cr的合成值)来提取语音外成分Sn(步骤S26)。具体而言,例如以Sn(f)=S02r(f)-Sv(f)×β(f)的方式获得语音外成分Sn。
另外,语音外成分获取部12也可以从原始声信号处理部410获取合成原始声信号S02来代替S02r,例如以Sn(f)=S02(f)-Sv(f)×β(f)的方式获得语音外成分Sn。
另外,语音外成分获取部12也可以以例如Sn(f)=S02(f)-S02g(f)或Sn(f)=S02r(f)-S02g的形式获得语音外成分Sn。这种情况下,如图9的(c)所示,由于利用从指向性D-311中去除了波束指向性B-311而得到的指向性A-311来接收到的原始声信号直接作为语音外成分Sn来获得,因此不需要Sv(f)、β(f)。
另外,语音外成分获取部12还可以以例如Sn(f)=S02(f)-S02g-Sv(f)×β(f)或Sn(f)=S02r(f)-S02g-Sv(f)×β(f)的形式获得语音外成分Sn。这种情况下,如图9的(c)所示,由于从利用指向性A-311接收到的原始声中去除了语音成分Sv,因此能够获得纯度更高的语音外成分(环境声),其中,指向性A-311是通过从指向性D-311中去除波束指向性B-311而得到的指向性。
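The variants described for the third embodiment, such as Sn(f) = S02(f) − S02g(f), optionally followed by removing Sv(f) × β(f) for a purer environmental component, can be sketched as below; array names and the clipping at zero are assumptions for illustration.

```python
import numpy as np

def extra_speech_beam_variant(s02, s02g, sv=None, beta=None):
    """Sketch: subtract the beam-formed spectrum S02g (aimed at the user) from the
    omnidirectional spectrum S02 (or the reproduced S02r), and optionally remove
    the weighted speech estimate Sv * beta as well."""
    sn = np.asarray(s02, dtype=float) - np.asarray(s02g, dtype=float)
    if sv is not None and beta is not None:
        sn = sn - np.asarray(sv, dtype=float) * np.asarray(beta, dtype=float)
    return np.maximum(sn, 0.0)   # assumption: keep the result non-negative
```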
根据以上的步骤,能够使用适用了波束成形技术的语音识别设备来容易地提取语音外成分。
(第四实施方式)
对在使用了语音外成分的应用中使用的参照数据的生成方法和生成的参照数据的使用例进行说明。
例如,以检测门声作为触发来获取触发输出后的语音外成分,由获取到的语音外成分来生成参照数据。若假定参照数据生成时为平时(没有发生异常的状态),则例如在应用运用时认定以开门声为触发来获取到的语音外成分与参照数据不同的情况下,判定发生了某些异常而将异常检测信号作为语音外信息进行输出。另外,作为获取参照数据的触发,也可以使用通过语音识别获得的识别词的检出。在本实施方式中,针对用于获取参照数据的触发,示出如下示例:通过语音识别获得的识别词被检测出的情况;以及检测出门声等语音外信息的情况。
图10是用于第四实施方式的系统执行参照数据的生成处理的流程图。参照数据的生成例如可以在使用第一~第三实施方式所示的流程图的步骤来运用应用系统的期间内实施,也可以在运用期间外实施。在运用期间外实施的情况下,例如可以是,用户任意地打破玻璃,使声波检测部3接收玻璃的破碎声的声波并录制于存储部6来作为参照数据。在本实施方式中,使用图1的功能构成来示出在应用系统的运用期间内将不发生异常的平时的环境数据作为参照数据来获取的示例。
使应用系统处于运用期间,使语音识别处理及语音外成分识别处理进行动作(步骤S301、S401)。语音识别处理及语音外成分识别处理例如可以基于图2的流程图来实施。其中,在图2的流程图中,在步骤S11之后处理结束,但在本实施方式中,使步骤S1~步骤S11的处理始终反复实施。图10的流程图示出语音识别处理和语音外成分分析处理并行地动作的情况,以下,首先说明始于语音识别处理的处理流程,然后说明始于语音外成分识别处理的处理流程。
控制部7监视语音识别部5是否检测出识别词(步骤S302)。控制部7在语音识别部5检测出识别词的情况下,确认未图示的计时部A是否启动(步骤S303)。控制部7在确认到计时部A启动的情况下,返回到步骤302,继续监视语音识别部5是否检测出识别词(在步骤S303中为是)。控制部7在确认到计时部A没有启动的情况下,启动计时部A来开始时间T11的计时(在步骤S303中为否,进入S304)。当时间T11超过阈值TH_T11时,开始时间T21的计时(步骤S306)。另外,同时开始语音外成分向存储部6的存储(录制)(步骤S307)。在时间T21为阈值TH_T21以下的情况下,继续步骤S306中的语音外成分向存储部6的存储(录制)(在步骤S308中为否)。当时间T21超过阈值TH_T21时,停止语音外成分向存储部6的存储(录制)(在步骤S308中为是,进入S309)。
图11是用于说明计时用的时间的图,图11的(a)是始于语音识别处理的处理流程中的计时用的时间T11及T21的说明图。时间T11是从语音识别部5检测出识别词而在步骤S304中启动计时部A开始到将语音外成分向存储部6存储(录制)为止的时间。当时间T11超过阈值TH_T11时,开始语音外成分向存储部6的存储(录制)(步骤S307)。时间T21是从开始语音外成分向存储部6的存储(录制)到停止录制为止的时间。对于时间T11、T21可以分别设定阈值TH_T11、TH_T21,由此能够控制从语音识别部5检测出识别词到开始录制为止的时间(相当于T11)、录制时间(相当于T21)。步骤S303是用于使步骤S307的数据存储处理不同时实施多个的条件分支,但也可以通过准备多个计时部A等来同时执行多个数据存储处理。
返回到图10,控制部7监视应用处理部2是否检测出语音外信息(步骤S402)。控制部7在应用处理部2检测出语音外信息的情况下,确认未图示的计时部B是否启动(步骤S403)。控制部7在确认到计时部B启动的情况下,返回到步骤402,继续监视应用处理部2是否检测出语音外信息(在步骤S403中为是)。控制部7在确认到计时部B没有启动的情况下,启动计时部B来开始时间T12的计时(在步骤S403中为否,进入S404)。当时间T12超过阈值TH_T12时,开始时间T22的计时(在步骤S405中为是,进入步骤S406)。另外,同时开始语音外成分向存储部6的存储(录制)(步骤S407)。在时间T22为阈值TH_T22以下的情况下,继续步骤S407中的语音外成分向存储部6的存储(录制)(在步骤S408中为否)。当时间T22超过阈值TH_T22时,停止语音外成分向存储部6的存储(录制)(在步骤S408中为是,进入S409)。
图11的(b)是始于语音外成分识别处理的处理流程中的计时用的时间T12及T22的说明图。时间T12是从应用处理部2检测出语音外信息而在步骤S404中启动计时部B开始到将语音外成分向存储部6存储(录制)为止的时间。当时间T12超过阈值TH_T12时,开始语音外成分向存储部6的存储(录制)(步骤S407)。时间T22是从开始语音外成分向存储部6的存储(录制)到停止录制为止的时间。对于时间T12、T22可以分别设定阈值TH_T12、TH_T22,由此能够控制从应用处理部2检测出语音外信息到开始录制为止的时间(相当于T12)、录制时间(相当于T22)。
以上,根据本实施方式,能够例如将通过语音识别获得的识别词的检出或者门声等语音外信息的检出作为触发来获取参照数据。用户可以对获取到的参照数据赋予例如“检测出识别词后的平时状态”、“检测出开门声后的平时状态”等检测事件名。也可以在应用系统运用时将获取了参照数据的检测事件名的列表显示于未图示的显示部来供用户能够从遥控器等进行选择。例如,在用户从检测事件名的列表中选择“检测出开门声后的平时状态”作为检测事件时,应用处理部2可以在运用状态下将“开门声”作为触发来获取触发后紧接着的语音外成分。应用处理部2也可以将获取到的语音外成分与对应的参照数据进行比较来输出异常检测信号等检测事件信息(语音外信息)。输出的语音外信息可以经由互联网等向用户的智能手机等发送。例如,可以通知用户“自家的门打开,发生异常”这一情况来作为语音外信息。
需要说明的是,在本实施方式中,示出了在应用系统运用中获取参照数据的示例,但例如也可以是,在重现了不产生语音(人的声音)的环境的基础上试验性地获取原始声的情况下,将原始声信号处理部4输出的原始声信号或合成原始声信号不由语音外成分获取部12进行处理而是作为参照数据存入存储部6。
另外,若是使用在本实施方式中将通过语音识别获得的识别词的检出作为触发来获取到的参照数据,则例如能够确认基于语音识别的控制对象是否正常地动作。具体而言,例如,以下示出将空气调节器(空调)作为控制对象而将语音识别部5生成的接通电源的命令即启动指令通过红外线遥控器(Infrared遥控器)等无线向空调输出的情况。空调在成功接收到启动指令时,按照启动指令的控制内容(接通电源)进行动作。另外,同时在应用处理部2中将通过语音识别获得的识别词(这种情况下相当于启动指令)的检出作为触发来获取环境声作为参照数据。从正常接收到启动指令的空调接通电源开始,录制例如空调开始启动后的声响来作为本实施方式中获取的参照数据。若在应用系统运用时使用该参照数据,则在用户发出启动指令但空调接收启动指令失败而进行了启动指令以外的动作的情 况下,应用处理部2的判定部22对将识别词(启动指令)的检出作为触发而提取出的语音外成分与参照数据进行比较(例如图2的步骤S10),此时判定为语音外成分与参照数据不一致。根据该判定结果可知空调接收启动指令失败。可以将该判定结果作为检测事件信息经由互联网向用户的智能手机发送,也可以将识别词显示于未图示的显示部,还可以通过语音合成等从未图示的扬声器将识别词以语音的形式输出。
根据以上的步骤,通过使用由本实施方式获取到的参照数据,由此用户能够知晓其用语音发出的指令是否对控制对象正常地起作用。
另外,若是使用以同样的步骤将语音外信息(检测事件)的检出作为触发来获取到的参照数据,则能够知晓输出的检测事件是被正常检测到的事件还是被误检测的事件。例如,在因误检测而检测出检测事件的情况下,应用处理部2的判定部22对获取到的语音外成分与预先通过本实施方式获取到的参照数据进行比较,比较的结果是判断为检测事件是误检测的事件。另外,若是在没有将检测事件作为触发的情况下,检测到与通过本实施方式获取到的参照数据一致的语音外成分的情况下,则可知发生了与参照数据对应的检测事件。例如,应用处理部2的判定部22始终监视语音外成分,在判定为与预先在检测事件A的检出后立刻获取到的参照数据一致时,判定为发生了检测事件A。
(变形例)
对生成并利用语音外成分的系统的构成例进行说明。
图12是表示变形例的内置有语音外成分生成部的电子设备的构成例的图。电子设备100只要具备基于语音识别实现的控制功能即可,可以是电视接收装置、空调、冰箱、洗衣机等家电,另外,并不局限于家电。在电子设备100中,控制部101基于语音识别部5生成的控制信息来执行电子设备100的各功能的控制。本变形例也可以视作是将图1所示的系统内置于电子设备100的示例。通过在电子设备100通常具备的功能上追加语音外成分获取部12、应用处理部2、控制部7等,并进行与图2或图8的流程图同样的处理,由此能够实现使用了电子设备100的应用系统、例如家庭安全系统。
接着,对使用了智能音箱的系统的构成例进行说明。
图13是表示变形例的使用了外置的智能音箱的情况下的系统的构成例的图。电子设备100A只要具备基于语音识别实现的控制功能即可,可以是电视接收装置、空调、冰箱、洗衣机等家电,另外,并不局限于家电。在电子设备100A上经由未图示的接口来外置连接智能音箱200,在电子设备100A中,控制部101A基于语音识别部5生成的控制信息来执行电子设备100A的各功能的控制。另外,在智能音箱200上连接有应用装置300。应用装置300利用与图2或图8的流程图同样的处理,基于从智能音箱200输入的原始声信号、语音成分、附加数据(α、β等)等来获取语音外成分,对检测事件进行检测并输出检测事件信息(语音外信息)。应用装置300可以在检测到检测事件的情况下,例如从接口部8经由互联网向用户的智能手机通知。
需要说明的是,图12或图13所示的各功能可以经由接口部8等在互联网上分散,也可以设为是云服务器上的功能,可以考虑各种功能的组合、系统形态。例如,通过将应用装置300设为云上的装置,由此能够将来自网络上连接的大量用户的参照数据作为样本进行累积,例如能够在提高检测事件的检测精度中利用。
以上,根据本变形例,能够以各种方式提取语音外成分,能够将提取出的语音外成分以各种方式利用于例如家庭安全技术。
根据以上所述的至少一个实施方式、变形例,能够提供可使用语音识别设备来容易地提取环境声的环境声输出装置、系统、方法及非易失性存储介质,所述非易失性存储介质保存有计算机指令,所述计算机指令被处理器或计算机设备执行时能够实现上述的使用语音识别设备来容易地提取环境声的方法。
说明了本申请的几个实施方式,但这些实施方式是作为示例来提示的,并不意在限定申请的范围。这些新的实施方式可以以其他各种各样的形态来实施,可以在不脱离申请的主旨的范围内进行各种省略、置换、变更。这些实施方式及其变形包含在申请的范围、主旨内,并且包含在权利要求书所记载的发明及与其等同的范围内。另外,就权利要求中的各构成要素而言,将构成要素分割表达的情况、将多个构成要素合并表达的情况或者组合这两种表达的情况均在本申请的范畴内。另外,可以组合多个实施方式,由该组合构成的实施例也在申请的范畴内。
另外,附图用于使说明更加明确,与实际的形态相比,存在将各部分的宽度、厚度、形状等示意性地表示的情况。在框图中,也存在没有连线的框之间或者在即便连线但没有示出箭头的方向上进行数据、信号的交换的情况。框图所示的各功能、流程图、时序图所示的处理可以通过硬件(IC芯片等)、软件(程序等)、数字信号处理用运算芯片(Digital Signal Processor、DSP)或者这些硬件与软件的组合来实现。另外,在将权利要求表述为控制逻辑的情况下、将权利要求表述为包含使计算机执行的指令的程序的情况下以及将权利要求表述为记载有上述指令的可供计算机读取的存储介质的情况下,均可以适用本申请的装置。另外,使用的名称、用语也不受限定,其他的表述只要具有实质上相同的内容、相同的主旨,则也包含在本申请中。

Claims (12)

  1. An environmental sound output device, wherein
    the environmental sound output device uses a speech recognition device to generate and output an extra-speech component.
  2. The environmental sound output device according to claim 1, wherein
    the environmental sound output device comprises:
    a sound wave detection means that receives original sound with at least one microphone and outputs it as respective original acoustic signals;
    a speech component acquisition means of the speech recognition device that extracts a speech component from at least one of the original acoustic signals and a synthesized original acoustic signal generated from a plurality of the original acoustic signals; and
    an extra-speech component acquisition means that generates and outputs the extra-speech component based on at least the speech component and the original acoustic signals.
  3. The environmental sound output device according to claim 2, wherein
    the speech component acquisition means outputs a frequency characteristic value, the frequency characteristic value being that of a filter used when extracting the speech component from the original acoustic signals, and
    the extra-speech component acquisition means generates the extra-speech component based on at least the frequency characteristic value of the filter, the speech component, and the original acoustic signals.
  4. The environmental sound output device according to claim 3, wherein
    the environmental sound output device further comprises an original acoustic signal processing means,
    the original acoustic signal processing means generates gain-adjusted original acoustic signals from the original acoustic signals and gains respectively assigned to the microphones, generates a first synthesized original acoustic signal obtained by synthesizing the gain-adjusted original acoustic signals, and outputs the values of the gains, the gain-adjusted original acoustic signals, and the first synthesized original acoustic signal,
    the speech component acquisition means extracts the speech component from the first synthesized original acoustic signal, and
    the extra-speech component acquisition means reproduces the original acoustic signals from the gain-adjusted original acoustic signals and the values of the gains to obtain reproduced original acoustic signals, and generates and outputs an extra-speech component based on at least the speech component and the reproduced original acoustic signals.
  5. The environmental sound output device according to claim 3, wherein
    the environmental sound output device further comprises an original acoustic signal processing means,
    the original acoustic signal processing means generates gain-adjusted original acoustic signals from the original acoustic signals and gains assigned to the microphones, generates a first synthesized original acoustic signal obtained by synthesizing the gain-adjusted original acoustic signals, and outputs the original acoustic signals and the first synthesized original acoustic signal,
    the speech component acquisition means extracts the speech component from the first synthesized original acoustic signal, and
    the extra-speech component acquisition means generates and outputs an extra-speech component based on at least the speech component and the original acoustic signals.
  6. The environmental sound output device according to claim 1, wherein
    the speech recognition device is a speech recognition means of a television receiving apparatus.
  7. The environmental sound output device according to claim 1, wherein
    the speech recognition device is a speech recognition means of a smart speaker.
  8. A system, wherein
    the system uses an extra-speech component obtained from an environmental sound output device, analyzes the extra-speech component, and outputs an analysis result, the environmental sound output device using a speech recognition device to generate and output the extra-speech component.
  9. The system according to claim 8, wherein
    the speech recognition device is either a speech recognition means of a television receiving apparatus or a speech recognition means of a smart speaker.
  10. A method, wherein
    the method uses a speech recognition device to generate and output an extra-speech component.
  11. The method according to claim 10, wherein
    original sound is received with at least one microphone and output as respective original acoustic signals,
    a synthesized original acoustic signal obtained by synthesizing the original acoustic signals is output together with the original acoustic signals,
    a speech component is extracted from the synthesized original acoustic signal, and
    the extra-speech component is generated and output based on at least the speech component and the original acoustic signals.
  12. A computer-readable non-volatile storage medium storing computer instructions for causing a computer including a digital signal processor to execute processing, wherein
    the computer instructions cause the computer to execute the steps of:
    acquiring original acoustic signals from at least one microphone;
    outputting a synthesized original acoustic signal obtained by synthesizing the original acoustic signals together with the original acoustic signals;
    extracting a speech component from the synthesized original acoustic signal; and
    generating and outputting an extra-speech component based on at least the speech component and the original acoustic signals.
PCT/CN2020/135774 2020-01-17 2020-12-11 环境声输出装置、系统、方法及非易失性存储介质 WO2021143411A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202080006649.8A CN113490979B (zh) 2020-01-17 2020-12-11 环境声输出装置、系统、方法及非易失性存储介质

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-006227 2020-01-17
JP2020006227A JP2021113888A (ja) 2020-01-17 2020-01-17 環境音出力装置、システム、方法およびプログラム

Publications (1)

Publication Number Publication Date
WO2021143411A1 true WO2021143411A1 (zh) 2021-07-22

Family

ID=76863553

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135774 WO2021143411A1 (zh) 2020-01-17 2020-12-11 环境声输出装置、系统、方法及非易失性存储介质

Country Status (3)

Country Link
JP (1) JP2021113888A (zh)
CN (1) CN113490979B (zh)
WO (1) WO2021143411A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968999A (zh) * 2011-11-18 2013-03-13 斯凯普公司 处理音频信号
CN104954543A (zh) * 2014-03-31 2015-09-30 小米科技有限责任公司 自动报警方法、装置及移动终端
CN107742517A (zh) * 2017-10-10 2018-02-27 广东中星电子有限公司 一种对异常声音的检测方法及装置
US20190035381A1 (en) * 2017-12-27 2019-01-31 Intel Corporation Context-based cancellation and amplification of acoustical signals in acoustical environments
CN110390942A (zh) * 2019-06-28 2019-10-29 平安科技(深圳)有限公司 基于婴儿哭声的情绪检测方法及其装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008064892A (ja) * 2006-09-05 2008-03-21 National Institute Of Advanced Industrial & Technology 音声認識方法およびそれを用いた音声認識装置
JP2011118822A (ja) * 2009-12-07 2011-06-16 Nec Casio Mobile Communications Ltd 電子機器、発話検出装置、音声認識操作システム、音声認識操作方法及びプログラム
JP6054142B2 (ja) * 2012-10-31 2016-12-27 株式会社東芝 信号処理装置、方法およびプログラム
KR101889465B1 (ko) * 2017-02-02 2018-08-17 인성 엔프라 주식회사 음성인식장치와, 음성인식장치가 구비된 조명등기구와, 이를 이용한 조명시스템
JP7152786B2 (ja) * 2017-06-27 2022-10-13 シーイヤー株式会社 集音装置、指向性制御装置及び指向性制御方法
EP3653943B1 (en) * 2017-07-14 2024-04-10 Daikin Industries, Ltd. Machinery control system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968999A (zh) * 2011-11-18 2013-03-13 斯凯普公司 处理音频信号
CN104954543A (zh) * 2014-03-31 2015-09-30 小米科技有限责任公司 自动报警方法、装置及移动终端
CN107742517A (zh) * 2017-10-10 2018-02-27 广东中星电子有限公司 一种对异常声音的检测方法及装置
US20190035381A1 (en) * 2017-12-27 2019-01-31 Intel Corporation Context-based cancellation and amplification of acoustical signals in acoustical environments
CN110390942A (zh) * 2019-06-28 2019-10-29 平安科技(深圳)有限公司 基于婴儿哭声的情绪检测方法及其装置

Also Published As

Publication number Publication date
CN113490979A (zh) 2021-10-08
CN113490979B (zh) 2024-02-27
JP2021113888A (ja) 2021-08-05

Similar Documents

Publication Publication Date Title
CN111836178B (zh) 包括关键词检测器及自我话音检测器和/或发射器的听力装置
US11380326B2 (en) Method and apparatus for performing speech recognition with wake on voice (WoV)
US9940928B2 (en) Method and apparatus for using hearing assistance device as voice controller
EP2439961B1 (en) Hearing aid, hearing assistance system, walking detection method, and hearing assistance method
US9269367B2 (en) Processing audio signals during a communication event
KR20180004950A (ko) 영상처리장치, 영상처리장치의 구동방법 및 컴퓨터 판독가능 기록매체
JP2008191662A (ja) 音声制御システムおよび音声制御方法
KR20080006622A (ko) 마이크로폰 신호 중 풍잡음의 검출 및 억제 장치 및 방법
KR20210019985A (ko) 음성인식 오디오 시스템 및 방법
JP2018530778A (ja) 協調的なオーディオ処理
CN111415686A (zh) 针对高度不稳定的噪声源的自适应空间vad和时间-频率掩码估计
JP2023159381A (ja) 音声認識オーディオシステムおよび方法
US20230319190A1 (en) Acoustic echo cancellation control for distributed audio devices
CN107465970A (zh) 用于语音通信的设备
JP2018533051A (ja) 協調的なオーディオ処理
JP2004500750A (ja) 補聴器調整方法及びこの方法を適用する補聴器
US20220335937A1 (en) Acoustic zoning with distributed microphones
JP2019184809A (ja) 音声認識装置、音声認識方法
JP2007034238A (ja) 現場作業支援システム
WO2021143411A1 (zh) 环境声输出装置、系统、方法及非易失性存储介质
US8635064B2 (en) Information processing apparatus and operation method thereof
KR102372327B1 (ko) 음성 인식 방법 및 이에 사용되는 장치
JP2016206646A (ja) 音声再生方法、音声対話装置及び音声対話プログラム
CN116249952A (zh) 使用动态分类器的用户语音活动检测
KR102495028B1 (ko) 휘파람소리 인식 기능이 구비된 사운드장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20913807

Country of ref document: EP

Kind code of ref document: A1