WO2022176085A1 - In-vehicle voice separation device and voice separation method - Google Patents

In-vehicle voice separation device and voice separation method Download PDF

Info

Publication number
WO2022176085A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
seat
vehicle
voice
score
Prior art date
Application number
PCT/JP2021/006024
Other languages
French (fr)
Japanese (ja)
Inventor
真 宗平
尚嘉 竹裏
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority to PCT/JP2021/006024 (published as WO2022176085A1)
Publication of WO2022176085A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • The technology disclosed herein relates to an in-vehicle audio separation device.
  • A device that is mounted in a vehicle and is capable of responding to voice instructions is known.
  • Well-known examples of such in-vehicle devices include Amazon Echo Auto.
  • Conventional in-vehicle equipment, represented by Amazon Echo Auto, uses the wake word "Alexa" before the instruction voice to prevent voice processing from becoming a heavy load.
  • Since this wake word is not necessary in conversation between humans, it is desirable to realize an in-vehicle device that does not use wake words.
  • In-vehicle equipment is also required to pick out only a specific voice even in a noisy vehicle cabin.
  • A technique using a plurality of microphones is known to address this problem.
  • For example, Cerence's Passenger Interference Cancellation (hereinafter referred to as "PIC") and Transtron's array microphone are known to be applied to vehicle-mounted equipment.
  • An example is described in an article about Transtron array microphones <URL: https://www.transtron.com/products/array.html>.
  • As a way of using a plurality of microphones, arranging a microphone at each seat is conceivable.
  • Although this method has good voice separation performance, it requires wiring cost and labor to arrange a microphone at each seat, and is therefore used only in some luxury cars. Another way of using multiple microphones is to place them near the center information display on the dashboard. Although this method does not require wiring cost and labor, its voice separation performance is inferior when there is little difference in the angle at which voices arrive, such as between the driver's seat and the rear seat.
  • Array microphones, in which a plurality of microphones are aligned, are also known.
  • A sound source separation method using neural networks is disclosed that has high separation performance even for sound sources that do not have sparsity in the time-frequency domain.
  • Patent Document 1 discloses extracting sound from a direction θ using a trained deep neural network (hereinafter referred to as "DNN") that has learned, as teacher data, observation signals from a sound source whose direction θ is known. In order to improve voice separation performance without placing a microphone in each seat, it is conceivable to combine the DNN technology disclosed in Patent Document 1 with an array microphone placed on the dashboard.
  • Combining the DNN technology with an in-vehicle array microphone, however, means executing high-load processing once per passenger in order to extract the voice of each passenger.
  • A high processing load may lead to a delay in response to voices, hindrance of other functions being executed by the vehicle-mounted device, and the like.
  • An object of the disclosed technology is to solve the above problems and to provide a voice separation device for an in-vehicle device that does not use a wake word, does not require a microphone at each seat, and does not cause a high processing load.
  • An in-vehicle speech separation device according to the disclosed technology includes: a speech level calculation unit that calculates a speech level based on the acquired in-vehicle speech; a dialogue request score calculation unit that calculates a dialogue request score for each seat based on the acquired information necessary for dialogue request detection; and a voice input right determination unit that determines, based on the speech level and the dialogue request score, to which of the seats in the vehicle the voice input right should be granted.
  • Since the in-vehicle audio separation device has the above configuration, an in-vehicle device is realized that does not use a wake word, does not require microphones to be placed at each seat, and does not impose a high processing load.
  • FIG. 1 is a block diagram showing a functional configuration of an in-vehicle audio separation device according to Embodiment 1.
  • FIG. 2 is a flow chart showing processing of the in-vehicle audio separation device according to the first embodiment.
  • FIG. 3 is a block diagram showing a functional configuration of an in-vehicle audio separation device according to Embodiment 2.
  • FIG. 4 is a block diagram showing a functional configuration of an in-vehicle audio separation device according to Embodiment 3.
  • FIG. 5 is a flow chart showing processing of the in-vehicle audio separation device according to the third embodiment.
  • FIG. 6 is an image diagram of a dialogue request score calculated based on information from the surrounding road condition acquisition unit.
  • FIG. 1 is a block diagram showing the functional configuration of an in-vehicle audio separation device 100 according to Embodiment 1.
  • The in-vehicle speech separation device 100 includes a dialogue request score calculation unit 20, an utterance level calculation unit 40, a dialogue request presence/absence determination unit 50, a voice input right determination unit 60, and a voice separation unit 70.
  • The speech separation device 100 has two input systems, which are connected to the information acquisition device 10 and the speech acquisition device 30, respectively.
  • The speech separation device 100 has at least one output system and is connected to the speech recognition device 80.
  • The voice separation device 100 may have a second output system and may be connected to the notification device 90.
  • The information acquisition device 10 is a camera, drive recorder, driver monitor, or other device capable of acquiring information used to detect dialogue requests.
  • The voice acquisition device 30 is a device that includes a microphone and acquires voice inside the vehicle.
  • The dialogue request score calculation unit 20 of the speech separation device 100 calculates the score of the dialogue request based on the information from the information acquisition device 10.
  • For example, the dialogue request score calculation unit 20 analyzes the line of sight of each passenger based on the moving image captured by the camera, and calculates a dialogue request score for each seat in the vehicle.
  • The speech level calculation unit 40 of the speech separation device 100 calculates the speech level inside the vehicle based on the information from the speech acquisition device 30. Specifically, the speech level calculation unit 40 removes noise other than speech from the speech information received from the speech acquisition device 30 and calculates the level of the remaining speech. The speech information received by the speech level calculation unit 40 has not yet been separated for each passenger at this point. Therefore, the in-vehicle utterance level calculated by the utterance level calculation unit 40 (hereinafter simply referred to as the "utterance level") is not per passenger but for the entire vehicle interior.
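The patent does not specify how the utterance level is computed. A minimal sketch, assuming a simple amplitude gate as the noise-removal step and an RMS level in decibels (both the gate value and the dB scale are illustrative assumptions, not the patented method), could look like this:

```python
import math

def speech_level_db(samples, noise_floor=0.01):
    """Estimate a cabin-wide utterance level in dB from raw audio samples.

    Samples below the (assumed) noise floor are treated as non-speech and
    dropped, a crude stand-in for the noise removal performed by the
    utterance level calculation unit 40.
    """
    voiced = [s for s in samples if abs(s) >= noise_floor]
    if not voiced:
        return float("-inf")  # no speech-like content detected
    rms = math.sqrt(sum(s * s for s in voiced) / len(voiced))
    return 20.0 * math.log10(rms)
```

Note that this level describes the whole cabin; separation per passenger only happens later, in the voice separation unit 70.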
  • The dialogue request presence/absence determination unit 50 of the speech separation device 100 determines whether there is a dialogue request based on the dialogue request score calculated by the dialogue request score calculation unit 20 and the speech level calculated by the speech level calculation unit 40. For example, when the speech level is sufficiently high and the dialogue request score is not small, the dialogue request presence/absence determination unit 50 determines that "there is a dialogue request". For example, the dialogue request presence/absence determination unit 50 may determine that "there is a dialogue request" even when the speech level is somewhat low, provided the dialogue request score is sufficiently large. If no condition for "there is a dialogue request" is satisfied, the dialogue request presence/absence determination unit 50 determines that "there is no dialogue request". The processing flow of the dialogue request presence/absence determination unit 50 will be clarified later.
  • The voice input right determination unit 60 of the speech separation device 100 determines the occupant to whom the right to perform voice input to the vehicle-mounted device (hereinafter referred to as the "voice input right") is granted. Information on the occupant to whom the voice input right has been granted is sent to the voice separation unit 70.
  • When the dialogue request score exceeds a threshold for multiple seats in the vehicle, it may be determined that the driver's seat is always given priority if the driver's seat is among them. For example, the dialogue request score threshold may be lowered only for the driver's seat; in this case also, the driver's seat has priority over the other seats. For example, it may be determined that when the dialogue request scores for multiple seats exceed the threshold, the right is granted on a first-come, first-served basis; this rule may or may not include the driver's seat. For example, when the dialogue request scores for multiple seats exceed the threshold, priority may be given to the seat with the higher score. In this case, if the score of the driver's seat is raised in advance, the driver's seat is in effect prioritized over the other seats.
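One way to express these priority rules is sketched below. The threshold value and seat names are invented for illustration; the patent leaves the concrete rule open and lists several alternatives:

```python
def grant_voice_input_right(scores, threshold=50.0, driver_seat="driver"):
    """Pick the seat to receive the voice input right.

    scores maps seat name -> dialogue request score. Among seats whose
    score meets the threshold, the driver's seat always wins if present
    (one of the priority rules described above); otherwise the seat
    with the highest score wins.
    """
    over = {seat: s for seat, s in scores.items() if s >= threshold}
    if not over:
        return None                 # no seat has a dialogue request
    if driver_seat in over:
        return driver_seat          # driver's seat takes priority
    return max(over, key=over.get)  # otherwise: highest score
```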
  • The voice input right may be set to expire after a predetermined fixed time so that the seat holding it is not switched frequently. For example, the voice input right may be switched if the score of a seat other than the holding seat continues to exceed the score of the holding seat for a certain period of time. For example, the condition for switching the voice input right may be that the other seat's score exceeds the score of the holding seat plus a margin.
  • The switching of the voice input right based on this rule can be clarified by a numerical example with a margin of, for example, 5 [pt].
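The numerical example itself is not reproduced here, but the margin rule can be sketched as follows. The margin of 5 pt comes from the text; the three-period holding condition is an invented concretization of "continues to exceed ... for a certain period of time":

```python
class VoiceInputRight:
    """Track which seat holds the voice input right.

    The right switches only when another seat's score exceeds the
    holder's score plus a margin for `hold_periods` consecutive
    evaluations, so the holding seat is not switched frequently.
    """

    def __init__(self, holder, margin=5.0, hold_periods=3):
        self.holder = holder
        self.margin = margin
        self.hold_periods = hold_periods
        self._streaks = {}  # seat -> consecutive periods above holder+margin

    def update(self, scores):
        """Feed one evaluation period of seat -> score; return the holder."""
        holder_score = scores.get(self.holder, 0.0)
        for seat, score in scores.items():
            if seat == self.holder:
                continue
            if score > holder_score + self.margin:
                self._streaks[seat] = self._streaks.get(seat, 0) + 1
            else:
                self._streaks[seat] = 0
            if self._streaks[seat] >= self.hold_periods:
                self.holder = seat  # challenger takes the right
                self._streaks = {}
                break
        return self.holder
```

With a margin of 5 pt, a rear seat scoring 60 against a holder scoring 50 (60 > 50 + 5) takes the right only after the condition has held for three consecutive periods.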
  • The voice separation unit 70 of the voice separation device 100 separates only the voice of the seat to which the voice input right has been granted from the voice data sent from the voice acquisition device 30.
  • The sound separation device 100 according to the technology disclosed herein does not perform sound separation for all seats, but only for the specific seat to which the voice input right has been granted. Therefore, the load required for sound processing can be kept low.
  • The speech that has undergone separation processing by the speech separation unit 70 is output to the speech recognition device 80.
  • The speech recognition device 80 is, specifically, a device including an on-vehicle device such as a car navigation system.
  • The audio separation device 100 may have a second output system so that the information on the seat to which the voice input right has been granted can be displayed externally.
  • The external display means may be the external notification device 90, or may be an LED display, a display screen, or a speaker that the audio separation device 100 itself has.
  • In short, the voice separation device 100 may have a configuration capable of externally indicating, in some way, the information on the seat to which the voice input right has been granted.
  • FIG. 2 is a flow chart showing an example of processing of the in-vehicle audio separation device 100 according to the first embodiment.
  • The processing of the speech separation device 100 includes: a step of acquiring information necessary for detecting a dialogue request (ST1); a step of analyzing the acquired information (ST2); a step of acquiring voice inside the vehicle (ST3); a step of calculating the speech level in the vehicle from the acquired voice (ST4); a step of determining whether the speech level is equal to or higher than a first threshold (ST5); and a step of determining whether the score of the dialogue request is equal to or higher than a first threshold (ST6), among others.
  • The step (ST1) of acquiring the information necessary for detecting the dialogue request is a process performed by the information acquisition device 10.
  • The step (ST3) of acquiring the voice inside the vehicle is a process performed by the voice acquisition device 30.
  • These steps (ST1, ST3) are included in the flow chart of FIG. 2 to clarify the operation of the system as a whole.
  • The dialogue request score calculation unit 20 of the speech separation device 100 executes the step (ST2) of analyzing the information acquired via the information acquisition device 10.
  • The dialogue request score calculation unit 20 calculates a dialogue request score for each seat based on the analysis result.
  • The speech level calculation unit 40 of the speech separation device 100 executes the step (ST4) of calculating the speech level in the vehicle from the speech acquired via the speech acquisition device 30.
  • The step (ST2) processed by the dialogue request score calculation unit 20 and the step (ST4) processed by the speech level calculation unit 40 may be executed in either order.
  • Step (ST2) and step (ST4) may also be processed concurrently.
  • The dialogue request presence/absence determination unit 50 of the speech separation device 100 determines whether or not the utterance level obtained in step (ST4) by the utterance level calculation unit 40 is above a certain level. Specifically, the dialogue request presence/absence determination unit 50 compares the utterance level with a predetermined utterance level first threshold. That is, the step (ST5) of determining whether or not the speech level is equal to or higher than the first threshold is executed here.
  • When the speech level is determined in step (ST5) of FIG. 2 to be equal to or higher than the first threshold, the dialogue request presence/absence determination unit 50 determines whether there is a dialogue request. Specifically, the dialogue request presence/absence determination unit 50 compares the dialogue request score for each seat calculated by the dialogue request score calculation unit 20 with a predetermined dialogue request score first threshold. A dialogue request score equal to or greater than the first threshold is interpreted as "there is a dialogue request". Conversely, even if the utterance level is above a certain level, a dialogue request score less than the first threshold means the behavior or situation is not one of trying to talk to the device, and it is interpreted as "no dialogue request". That is, the step (ST6) of determining whether or not the score of the dialogue request is equal to or greater than the first threshold is executed here.
  • When the speech level is less than the first threshold, the dialogue request presence/absence determination unit 50 determines whether the conversation is being conducted in a low voice. Specifically, the dialogue request presence/absence determination unit 50 compares the utterance level with a predetermined utterance level second threshold. That is, the step (ST7) of determining whether the speech level is equal to or higher than the second threshold is executed here. As for the magnitude relationship between the first and second thresholds of the speech level, the second threshold is smaller than the first threshold.
  • Next, the dialogue request presence/absence determination unit 50 compares the dialogue request score for each seat calculated by the dialogue request score calculation unit 20 with a predetermined dialogue request score second threshold. That is, the step (ST8) of determining whether or not the score of the dialogue request is equal to or greater than the second threshold is executed here.
  • As for the magnitude relationship between the first and second thresholds of the dialogue request score, the second threshold is larger than the first threshold. A dialogue request score equal to or greater than the second threshold is interpreted as "there is a dialogue request".
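The two-stage check of steps ST5 to ST8 can be sketched as a single function. The threshold values below are invented; the text fixes only their ordering (the second level threshold below the first, the second score threshold above the first):

```python
def has_dialogue_request(speech_level, score,
                         level_t1=60.0, level_t2=40.0,
                         score_t1=50.0, score_t2=70.0):
    """Mirror steps ST5-ST8: at a high speech level an ordinary dialogue
    request score suffices; at a lower (but still audible) level, such
    as a hushed conversation, a larger score is required."""
    if speech_level >= level_t1:  # ST5: first utterance level threshold
        return score >= score_t1  # ST6: first score threshold
    if speech_level >= level_t2:  # ST7: second (lower) level threshold
        return score >= score_t2  # ST8: second (higher) score threshold
    return False                  # too quiet: no dialogue request
```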
  • Note that both the speech level and the dialogue request score have thresholds; in situations where confusion is likely, the former is referred to as the "utterance level threshold" and the latter as the "dialogue request score threshold".
  • The flow described above combines a plurality of IF statements to decide, from the respective values of the speech level and the dialogue request score, to which seat the voice input right should be granted.
  • Alternatively, the in-vehicle speech separation device 100 may define the seats to which the voice input right should be granted in an (N+1)-dimensional state space consisting of the speech level and N dialogue request scores (where N is the number of seats).
  • The state space may have an additional state dimension so that it can handle the case where the voice input right has already been granted. In that case, steps (ST5) to (ST8) in FIG. 2 are replaced by a determination in this state space.
  • The in-vehicle audio separation device 100 does not limit the number of thresholds used in the determination flow to the above four. The number of thresholds may be determined appropriately according to the required specifications.
  • When there is a seat determined to have a "dialogue request", the voice input right determination unit 60 determines to which seat the voice input right is granted. That is, the step (ST9) of determining the occupant to whom the voice input right is given is executed here. As described above, the voice input right determination unit 60 is preset with a rule as to which seat is prioritized when there are a plurality of seats determined to have a "dialogue request".
  • When step (ST9) in FIG. 2 is executed and the voice input right is given to a seat, the information on the seat to which the voice input right has been given is sent to the voice separation unit 70.
  • The voice separation unit 70 separates only the voice of the seat to which the voice input right has been granted from the voice data sent from the voice acquisition device 30. In other words, the step (ST10) of separating the voice of the seat with the voice input right is executed.
  • Since the speech separation apparatus 100 according to Embodiment 1 has the above configuration, speech separation can be realized that does not use a wake word, does not require a microphone at each seat, and does not impose a high processing load.
  • Embodiment 2 shows a configuration example in which the information acquisition device 10 and the dialogue request score calculation unit 20 of the configuration shown in Embodiment 1 are further specified.
  • In Embodiment 2, the same reference numerals as those used in Embodiment 1 are used unless otherwise specified, and duplicate descriptions are omitted as appropriate.
  • FIG. 3 is a block diagram showing the functional configuration of the in-vehicle audio separation device 100 according to Embodiment 2.
  • The information acquisition device 10 includes a vehicle-mounted device state acquisition unit 11, a surrounding road condition acquisition unit 12, and an occupant image acquisition unit 13. Also, as shown in FIG. 3, the dialogue request score calculation unit 20 includes a state-by-state dialogue request degree calculation unit 21, a road condition-specific request degree calculation unit 22, an occupant position detection unit 23, a line-of-sight detection unit 24, a face orientation detection unit 25, a posture detection unit 26, a mouth opening detection unit 27, and a result integration unit 28.
  • The information from the vehicle-mounted device state acquisition unit 11 of the information acquisition device 10 is sent to the state-by-state dialogue request degree calculation unit 21 of the dialogue request score calculation unit 20.
  • The vehicle-mounted device state acquisition unit 11 is connected to the vehicle-mounted device and acquires, for example, whether or not the vehicle-mounted device is in a particular operating state, and the status of other onboard units.
  • Based on the acquired state, the state-by-state dialogue request degree calculation unit 21 calculates, for each seat, the possibility that the occupant of that seat will talk to the vehicle-mounted device.
  • The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat according to the calculated possibility.
  • The information from the surrounding road condition acquisition unit 12 of the information acquisition device 10 is sent to the road condition-specific request degree calculation unit 22 of the dialogue request score calculation unit 20.
  • The surrounding road condition acquisition unit 12 specifically acquires information on the surrounding road conditions from the map data of the navigation system. More specifically, road conditions refer to, for example, the number of intersections within a certain distance of the vehicle's destination, and the presence or absence of intersections where five or more roads intersect, such as five-way intersections (hereinafter referred to as "multi-way intersections").
  • Based on the acquired road conditions, the road condition-specific request degree calculation unit 22 calculates, for each seat, the possibility that the occupant of that seat will talk to the vehicle-mounted device. For example, the possibility may be calculated based on the premise that the driver is likely to ask the navigation system which road to take before a multi-way intersection.
  • The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat according to the calculated possibility.
  • FIG. 6 is an image diagram of the dialogue request score calculated based on the information from the surrounding road condition acquisition unit 12. As shown in FIG. 6, the dialogue request score can be calculated by scoring in advance, according to the type of situation obtained from the surrounding road condition acquisition unit 12, the likelihood that the occupant of each seat will talk to the vehicle-mounted device.
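A lookup table in the spirit of FIG. 6 might be coded as below; the situation names and score values are entirely hypothetical, since the actual figure is not reproduced here:

```python
# Hypothetical per-situation, per-seat dialogue request scores (FIG. 6 style).
ROAD_SITUATION_SCORES = {
    "approaching_multiway_intersection": {"driver": 80, "passenger": 20},
    "many_intersections_near_destination": {"driver": 60, "passenger": 30},
    "straight_highway": {"driver": 10, "passenger": 10},
}

def road_condition_scores(situation):
    """Look up the per-seat dialogue request scores for a road situation;
    an unknown situation contributes no score."""
    return ROAD_SITUATION_SCORES.get(situation, {})
```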
  • The information from the occupant image acquisition unit 13 of the information acquisition device 10 is sent to the occupant position detection unit 23, the line-of-sight detection unit 24, the face orientation detection unit 25, the posture detection unit 26, and the mouth opening detection unit 27.
  • The occupant image acquisition unit 13 is, specifically, a driver monitor, a camera that captures an image of the occupants, or the like.
  • The occupant position detection unit 23 determines, based on the sent information, whether or not there is an occupant in each seat. Calculation of the dialogue request score is omitted for seats with no occupant.
  • The line-of-sight detection unit 24 detects the line of sight of the passenger in each seat based on the sent information.
  • The detected line-of-sight information is sent to the result integration unit 28.
  • The detected line-of-sight information is used from the viewpoint of whether there is an on-vehicle device ahead of the line of sight, or whether there is another passenger ahead of the line of sight. For example, it may be based on the premise that if there is an on-vehicle device ahead of the line of sight, the owner of the line of sight is likely to be talking to the on-vehicle device. For example, it may be based on the premise that if there is another passenger ahead of the line of sight, there is a high possibility that the owner of the line of sight is talking to that other passenger.
  • The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the line-of-sight information.
  • The face orientation detection unit 25 detects the orientation of the face of the passenger in each seat based on the sent information.
  • The purpose of detecting the orientation of the face is the same as the purpose of detecting the line of sight described above.
  • The detected face orientation information is sent to the result integration unit 28.
  • The detected face orientation information is used especially from the viewpoint of whether or not there is an on-vehicle device in front of the face. For example, it may be based on the premise that if there is a vehicle-mounted device ahead of the person's face, there is a high possibility that the person is talking to the vehicle-mounted device.
  • The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the face orientation information.
  • The posture detection unit 26 detects the posture of the passenger in each seat based on the sent information.
  • The purpose of detecting the posture is the same as the purpose of detecting the line of sight described above.
  • Information on the detected posture is sent to the result integration unit 28.
  • Information on the detected posture is used from the viewpoint of whether it is a posture of talking to the vehicle-mounted device.
  • The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the posture information.
  • The mouth opening detection unit 27 detects, for each seat where a passenger is present, the degree to which the passenger's mouth is open.
  • The detected degree of mouth opening for each passenger is sent to the result integration unit 28.
  • The dialogue request score calculation unit 20 may calculate the dialogue request score based on the premise that if the mouth is open, the person is likely to be speaking.
  • In this way, the dialogue request score calculation unit 20 may calculate the dialogue request score for each seat based on multifaceted information. That is, the dialogue request score calculation unit 20 may include the result integration unit 28 shown in FIG. 3, make a comprehensive judgment from the multifaceted information, and calculate the dialogue request score for each seat.
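The patent does not prescribe an integration formula. A weighted sum over the cues (gaze, face orientation, posture, mouth opening, device state, road condition) is one simple possibility, sketched here with the weighting scheme as an assumption:

```python
def integrate_scores(per_cue_scores, weights=None):
    """Combine per-cue, per-seat scores into one dialogue request score
    per seat, as the result integration unit 28 might do.

    per_cue_scores: dict of cue name -> (dict of seat -> score).
    weights: optional dict of cue name -> weight (default 1.0 per cue).
    """
    weights = weights or {}
    totals = {}
    for cue, seat_scores in per_cue_scores.items():
        w = weights.get(cue, 1.0)
        for seat, score in seat_scores.items():
            totals[seat] = totals.get(seat, 0.0) + w * score
    return totals
```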
  • Since the speech separation device 100 employs the configuration shown in Embodiment 2, speech separation can be realized that does not use wake words, does not require microphones to be placed at each seat, and does not impose a high processing load.
  • Embodiment 3 shows a configuration example in which the audio separation unit 70 of the configurations shown in Embodiments 1 and 2 is further specified.
  • In the third embodiment, the same reference numerals as those used in the previous embodiments are used unless otherwise specified, and duplicate descriptions are omitted as appropriate.
  • FIG. 4 is a block diagram showing the functional configuration of the in-vehicle audio separation device 100 according to Embodiment 3.
  • The speech separation device 100 includes an area-by-area utterance level calculation unit 45 and an area-by-area dialogue request presence/absence determination unit 55.
  • The speech separation unit 70 of the speech separation device 100 includes a beamforming unit 71 and a deep layer speech separation unit 72.
  • The speech separation apparatus 100 according to Embodiment 3 realizes speech separation by combining beamforming and deep speech separation.
  • The audio separation apparatus 100 according to the third embodiment performs area-based audio separation by beamforming, and reviews the voice input right for each area.
  • The voice separation unit 70 receives, from the voice input right determination unit 60, information on the seat to which the voice input right has been granted. If there is such a seat, the beamforming unit 71 of the voice separation unit 70 separates the voice data sent from the voice acquisition device 30 into the voice arriving from the direction of that seat as viewed from the array microphone (hereinafter referred to as "target voice") and voices from other directions (hereinafter referred to as "non-target voice"). For the non-target voice, areas may be further defined based on the direction θ of the sound source, and the voice may be further separated for each area. An example definition of areas is a driver row area, a center row area, and a passenger row area.
  • Array microphones, which are generally used for voice separation in cars, are placed in the center of the dashboard inside the car.
  • Sound source separation is relatively easy when the directions θ of the sound sources viewed from the array microphone are different, such as between the driver's seat and the front passenger's seat.
  • When the direction θ of the sound source seen from the array microphone does not change much, such as between the right front seat and the right rear seat, it is difficult to separate the sound sources.
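Why the angle matters can be seen from delay-and-sum beamforming, the simplest beamformer: steering the array toward a direction θ applies per-microphone delays that depend on sin θ, so two seats at nearly the same θ receive nearly the same steering and cannot be told apart. A sketch, with the microphone spacing assumed for illustration:

```python
import math

MIC_SPACING_M = 0.05    # assumed 5 cm between adjacent microphones
SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def steering_delays(num_mics, theta_deg):
    """Per-microphone delays (seconds) steering a uniform linear array
    toward direction theta (0 deg = broadside). Delay-and-sum
    beamforming sums the channels after applying these delays, which
    reinforces sound arriving from direction theta."""
    d = MIC_SPACING_M * math.sin(math.radians(theta_deg))
    return [i * d / SPEED_OF_SOUND for i in range(num_mics)]
```

Seats at clearly different angles (driver's seat vs. front passenger's seat) yield clearly different delay patterns, while a right front seat and a right rear seat at almost the same θ yield almost identical patterns, which is why beamforming alone separates them poorly.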
  • The area-by-area utterance level calculation unit 45 determines the utterance level for each of the target voice and non-target voice separated by the beamforming unit 71. When areas are defined and voices are further separated for each area, the area-by-area utterance level calculation unit 45 determines the utterance level for each area. The determined area-by-area utterance level information is sent to the area-by-area dialogue request presence/absence determination unit 55.
  • The area-by-area dialogue request presence/absence determination unit 55 re-determines, for each area, whether "there is a dialogue request" or "there is no dialogue request" based on the sent area-by-area utterance level information. For example, the area-by-area dialogue request presence/absence determination unit 55 re-determines that "there is no dialogue request" for a seat belonging to an area with a low utterance level. Further, if the utterance level is low in the area of the seat to which the voice input right has been granted, the area-by-area dialogue request presence/absence determination unit 55 may notify the voice input right determination unit 60 to review the allocation of the voice input right.
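This per-area review can be sketched as follows; the third threshold value and the area name are invented for illustration:

```python
def review_voice_input_right(area_levels, granted_area, level_t3=35.0):
    """Re-check the grant for each area: if the utterance level of the
    area holding the voice input right falls below the third threshold,
    the seat is re-judged as "no dialogue request" and the grant is
    revoked so the voice input right determination unit can reassign it.
    """
    if area_levels.get(granted_area, float("-inf")) < level_t3:
        return None          # revoke (corresponds to step ST34)
    return granted_area      # keep; proceed to deep separation (ST35)
```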
  • The deep layer speech separation unit 72 performs speech separation on the target voice using a mathematical model trained in advance by deep learning so as to separate speech by voice quality.
  • The mathematical model of the deep layer speech separation unit 72 may be a DNN as disclosed in Patent Document 1.
  • FIG. 5 is a flowchart showing processing of the in-vehicle audio separation device 100 according to the third embodiment.
  • ST1 to ST9 in the flowchart of FIG. 5 are the same as the corresponding blocks in the flowchart of FIG. 2. ST31 to ST35 in the flowchart of FIG. 5 are steps newly added in the third embodiment.
  • The processing of the speech separation apparatus 100 further includes: a step of separating the target voice from the voice data by beamforming (ST31); a step of calculating the utterance level of the area of the target voice (ST32); a step of determining whether the utterance level is equal to or higher than a third threshold (ST33); a step of regarding the seat belonging to the area of the target voice as having "no dialogue request" (ST34); and a step of further separating the target voice by deep learning (ST35).
  • The beamforming unit 71 performs the step of separating the target sound from the sound data by beamforming (ST31).
  • The step of calculating the utterance level of the area of the target voice (ST32) is executed by the area-by-area utterance level calculation unit 45.
  • The step of determining whether or not the utterance level is equal to or higher than the third threshold (ST33) is executed by the area-by-area dialogue request presence/absence determination unit 55.
  • The magnitude relationship among the third utterance level threshold and the first and second utterance level thresholds is not uniquely fixed; each threshold may be determined individually in consideration of the different audio processing paths.
  • If the determination result of step (ST33) is NO, a step (ST34) is executed in which the seat belonging to the area of the target voice is regarded as having "no dialogue request". After that, the processing of the speech separation device 100 returns to the step (ST5) of determining whether or not the utterance level is equal to or higher than the first threshold.
  • If the determination result of step (ST33) in FIG. 5 is YES, the process moves to the deep layer speech separation unit 72, and a step (ST35) of further separating the target speech by deep learning is executed.
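As a non-limiting illustration, the gating of steps ST31 to ST35 described above can be sketched as follows. The RMS level measure, the threshold value, and the stubbed deep separation model are assumptions for illustration only, not details from this publication:

```python
import math

def rms_level(samples):
    """Utterance level of a signal, simplified here to RMS amplitude (ST32)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def deep_separate(samples):
    """Placeholder for the trained deep separation model of unit 72 (ST35)."""
    return samples

def separate_target_area(target, third_threshold=0.1):
    """ST33-ST35 gate: run the costly deep separation only when the
    beamformed target area's utterance level reaches the third threshold;
    otherwise return None, i.e. the area is treated as having
    "no dialogue request" (ST34)."""
    if rms_level(target) < third_threshold:   # ST33: NO
        return None                           # ST34
    return deep_separate(target)              # ST35
```

Because the expensive model runs only for the one area that passes the gate, the per-frame processing load stays bounded regardless of the number of occupants.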
  • Because the speech separation device 100 employs the configuration shown in Embodiment 3, it can realize sound separation that does not use a wake word, does not require microphones to be placed in each seat, and does not impose a high processing load.
  • The disclosed technology can be applied to in-vehicle devices, such as car navigation systems, that can respond to voice instructions, and thus has industrial applicability.

Abstract

An in-vehicle voice separation device according to the present invention includes: a speech level calculation unit (40) that calculates a speech level on the basis of an acquired voice within a vehicle; a dialogue request score calculation unit (20) that calculates a dialogue request score for each seat on the basis of information required to detect an acquired dialogue request; and a voice input right determination unit (60) that uses the speech level and the dialogue request score as a basis to determine whether to grant a voice input right to any of the seats within the vehicle.

Description

In-vehicle voice separation device and voice separation method
The technology disclosed herein relates to an in-vehicle voice separation device.
An in-vehicle device capable of responding to voice instructions (hereinafter called a "vehicle-mounted device") is known. A well-known example of such a vehicle-mounted device is Amazon Echo Auto. Conventional vehicle-mounted devices typified by Amazon Echo Auto prevent voice processing from becoming a heavy load by requiring the wake word "Alexa" before an instruction voice. However, since this wake word is unnecessary in natural conversation between humans, a vehicle-mounted device that does not use a wake word is desired.
In-vehicle equipment is also required to focus on only a specific voice even in a noisy vehicle. Techniques using multiple microphones are known for addressing this problem. For example, Cerence's Passenger Interference Cancellation (hereinafter "PIC") and Transtron's microphones are known to be applied to vehicle-mounted equipment.
An example of an article on Cerence's PIC:
<URL: https://response.jp/article/2016/03/01/270717.html>
An example of an article on Transtron's array microphones:
<URL: https://www.transtron.com/products/array.html>
One way to use multiple microphones is to place a microphone at each seat. Although this method gives good voice separation performance, it requires wiring cost and labor to install a microphone at each seat, and is therefore adopted only in some luxury cars.
Another way to use multiple microphones is to place them near the center information display on the dashboard. Although this method requires no wiring cost or labor, its voice separation performance is inferior when there is little difference in the angle from which voices arrive, such as between the driver's seat and the rear seat behind it.
Array microphones in which a plurality of microphones are aligned are also known. For separating the mixed signal of multiple sound sources observed by an array microphone, a sound source separation method has been disclosed that uses a neural network and achieves high separation performance even for sound sources for which sparsity in the time-frequency domain does not hold (for example, Patent Document 1). More specifically, Patent Document 1 discloses using a trained deep neural network (hereinafter "DNN") that learns, as teacher data, observation signals from a sound source whose direction θ is known, and selectively extracting sound arriving from the direction θ.
In order to improve voice separation performance without placing a microphone at each seat, it is conceivable to combine an array microphone placed on the dashboard with the DNN technique disclosed in Patent Document 1.
Patent Document 1: JP 2020-38315 A
Combining the DNN technique with an in-vehicle array microphone means performing high-load processing for each occupant in order to extract each occupant's voice individually. A high processing load may lead to delayed responses to voice input and interference with other functions being executed by the vehicle-mounted device.
An object of the disclosed technology is to solve the above problems and to provide a voice separation device for a vehicle-mounted device that does not use a wake word, does not require a microphone at each seat, and does not impose a high processing load.
An in-vehicle voice separation device according to the present disclosure includes: an utterance level calculation unit that calculates an utterance level based on acquired in-vehicle voice; a dialogue request score calculation unit that calculates a dialogue request score for each seat based on acquired information necessary for dialogue request detection; and a voice input right determination unit that determines, based on the utterance level and the dialogue request score, to which of the seats in the vehicle a voice input right is to be granted.
Because the in-vehicle voice separation device according to the disclosed technology has the above configuration, it realizes a vehicle-mounted device that does not use a wake word, does not require microphones at each seat, and does not impose a high processing load.
FIG. 1 is a block diagram showing the functional configuration of the in-vehicle voice separation device according to Embodiment 1.
FIG. 2 is a flowchart showing the processing of the in-vehicle voice separation device according to Embodiment 1.
FIG. 3 is a block diagram showing the functional configuration of the in-vehicle voice separation device according to Embodiment 2.
FIG. 4 is a block diagram showing the functional configuration of the in-vehicle voice separation device according to Embodiment 3.
FIG. 5 is a flowchart showing the processing of the in-vehicle voice separation device according to Embodiment 3.
FIG. 6 is a conceptual diagram of a dialogue request score calculated based on information from the surrounding road condition acquisition unit.
The in-vehicle voice separation device 100 according to the disclosed technology is clarified by the following description of each embodiment with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the functional configuration of the in-vehicle voice separation device 100 according to Embodiment 1. As shown in FIG. 1, the in-vehicle voice separation device 100 according to the disclosed technology includes a dialogue request score calculation unit 20, an utterance level calculation unit 40, a dialogue request presence/absence determination unit 50, a voice input right determination unit 60, and a voice separation unit 70.
The voice separation device 100 has two input systems, which are connected to the information acquisition device 10 and the voice acquisition device 30, respectively. The voice separation device 100 has at least one output system and is connected to the speech recognition device 80. The voice separation device 100 may have a second output system and may be connected to the notification device 90.
The information acquisition device 10 is a camera, a drive recorder, a driver monitor, or another device capable of detecting a dialogue request.
The voice acquisition device 30 is a device that includes a microphone and acquires voice inside the vehicle.
The dialogue request score calculation unit 20 of the voice separation device 100 calculates a dialogue request score based on the information from the information acquisition device 10. As a specific example, when the information acquisition device 10 is a camera, the dialogue request score calculation unit 20 analyzes the occupants' lines of sight based on the moving images captured by the camera. The dialogue request score is calculated for each seat in the vehicle.
The utterance level calculation unit 40 of the voice separation device 100 calculates the level of utterances in the vehicle based on the information from the voice acquisition device 30. Specifically, the utterance level calculation unit 40 removes non-speech noise from the voice information received from the voice acquisition device 30 and calculates the voice level of the utterances. At this point, the voice information received by the utterance level calculation unit 40 has not yet been separated by occupant. Therefore, the in-vehicle utterance level calculated by the utterance level calculation unit 40 (hereinafter simply the "utterance level") is the level for the vehicle interior as a whole, not for each occupant.
The dialogue request presence/absence determination unit 50 of the voice separation device 100 determines whether there is a dialogue request based on the dialogue request score calculated by the dialogue request score calculation unit 20 and the utterance level calculated by the utterance level calculation unit 40. For example, when the utterance level is sufficiently high and the dialogue request score is not small, the dialogue request presence/absence determination unit 50 determines that "there is a dialogue request". It may also determine that "there is a dialogue request" when the utterance level is somewhat low but the dialogue request score is sufficiently large. When the conditions for "there is a dialogue request" are not satisfied, it may determine that "there is no dialogue request". The processing flow of the dialogue request presence/absence determination unit 50 is clarified later.
When it is determined that "there is a dialogue request", the voice input right determination unit 60 of the voice separation device 100 determines, based on the dialogue request score and the utterance level, the occupant to whom the right to perform voice input to the vehicle-mounted device (hereinafter the "voice input right") is granted. Information on the occupant to whom the voice input right has been granted is sent to the voice separation unit 70.
Several variations are conceivable for the rule deciding which occupant is granted the voice input right.
For example, when the dialogue request scores of multiple seats in the vehicle exceed the threshold, it may be determined that the driver's seat always takes priority whenever it is among them.
For example, the dialogue request score threshold may be lowered only for the driver's seat. In this case, too, the driver's seat takes priority over the other seats.
For example, when the dialogue request scores of multiple seats in the vehicle exceed the threshold, it may be determined that priority is given to the seat whose score exceeded the threshold first. This rule may or may not include the driver's seat.
For example, when the dialogue request scores of multiple seats in the vehicle exceed the threshold, it may be determined that the seat with the higher dialogue request score is given priority. In this case, if the driver's seat score is raised in advance, the driver's seat is effectively prioritized over the other seats.
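As a non-limiting illustration, two of the variations above (driver-first priority combined with highest-score priority) can be sketched as follows; the seat names, scores, and threshold are assumed values, not values from this publication:

```python
def choose_seat(scores, threshold=30.0):
    """Pick the seat to be granted the voice input right.

    Combines two of the variations described above: the driver's seat
    always wins when it qualifies; otherwise the highest dialogue
    request score wins.
    """
    # Keep only seats whose dialogue request score reaches the threshold.
    candidates = {seat: s for seat, s in scores.items() if s >= threshold}
    if not candidates:
        return None  # no seat has a dialogue request
    # The driver's seat takes priority whenever it is among the candidates.
    if "driver" in candidates:
        return "driver"
    # Otherwise prefer the seat with the highest dialogue request score.
    return max(candidates, key=candidates.get)
```

Raising the driver's score in advance, as in the last variation, would achieve a similar driver-first effect without the explicit branch.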
If there is already a seat to which the voice input right has been granted (hereafter referred to as a "held seat"), several variations are conceivable for the validity period of the voice input right.
For example, the expiration date of the voice input right may be set to a predetermined fixed time so that the held seat is not frequently switched.
For example, if the score of the seat other than the held seat continues to exceed the score of the held seat for a certain period of time, the voice input right may be switched.
For example, the condition for switching the voice input right may be that another seat's score exceeds the held seat's score plus a margin. Switching under this rule is illustrated by the following numerical example with a margin of 5 [pt]. Initially, suppose the held seat's score is 30 [pt] and another seat's score is 28 [pt]. In the next state, suppose the held seat's score changes to 32 [pt] and the other seat's score to 35 [pt]. In this case the voice input right is not switched, because 35 [pt] does not exceed 32 + 5 [pt]. In the state after that, suppose the held seat's score changes to 28 [pt] and the other seat's score to 34 [pt]. In this case the voice input right is switched, because 34 [pt] exceeds 28 + 5 [pt].
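As a non-limiting illustration, the margin rule in the numerical example above can be sketched as follows; only the margin of 5 [pt] is taken from the example:

```python
def should_switch(held_score, other_score, margin=5.0):
    """Switch the voice input right only when another seat's dialogue
    request score strictly exceeds the held seat's score plus a margin.
    The margin prevents the held seat from changing on small score
    fluctuations."""
    return other_score > held_score + margin

# Numerical example from the text:
#   held 32 pt vs other 35 pt -> 35 does not exceed 32 + 5: no switch
#   held 28 pt vs other 34 pt -> 34 exceeds 28 + 5: switch
```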
The voice separation unit 70 of the voice separation device 100 separates only the voice of the seat to which the voice input right is granted from the voice data sent from the voice acquisition device 30 .
In this way, the voice separation device 100 according to the disclosed technology does not perform voice separation for all seats but only for the specific seat to which the voice input right has been granted; therefore, the load required for voice processing does not become high.
The voice separated by the voice separation unit 70 is output to the speech recognition device 80. Specifically, the speech recognition device 80 is a device that includes a vehicle-mounted device such as a car navigation system.
The voice separation device 100 may have a second output system so that the information on the seat to which the voice input right has been granted can be displayed externally. The means of external display may be the external notification device 90, or an LED indicator, display screen, or speaker of the voice separation device 100 itself. In this way, the voice separation device 100 may be configured so that the information on the seat granted the voice input right can be presented externally in some manner.
FIG. 2 is a flowchart showing an example of the processing of the in-vehicle voice separation device 100 according to Embodiment 1. As shown in FIG. 2, the processing of the voice separation device 100 includes: a step of acquiring information necessary for dialogue request detection (ST1); a step of analyzing the acquired information (ST2); a step of acquiring the voice inside the vehicle (ST3); a step of calculating the in-vehicle utterance level from the acquired voice (ST4); a step of determining whether the utterance level is equal to or higher than a first threshold (ST5); a step of determining whether the dialogue request score is equal to or higher than a first threshold (ST6); a step of determining whether the utterance level is equal to or higher than a second threshold (ST7); a step of determining whether the dialogue request score is equal to or higher than a second threshold (ST8); a step of determining the occupant to whom the voice input right is granted (ST9); and a step of separating the voice of the seat holding the voice input right (ST10).
Strictly speaking, the step of acquiring the information necessary for dialogue request detection (ST1) is performed by the information acquisition device 10, and the step of acquiring the voice inside the vehicle (ST3) is performed by the voice acquisition device 30. However, these steps (ST1, ST3) are included in the flowchart of FIG. 2 to clarify the operation of the system as a whole.
The dialogue request score calculation unit 20 of the voice separation device 100 executes the step of analyzing the information acquired via the information acquisition device 10 (ST2). Based on the analysis result, the dialogue request score calculation unit 20 calculates a dialogue request score for each seat.
The utterance level calculation unit 40 of the voice separation device 100 executes the step of calculating the in-vehicle utterance level from the voice acquired via the voice acquisition device 30 (ST4). The step processed by the dialogue request score calculation unit 20 (ST2) and the step processed by the utterance level calculation unit 40 (ST4) may be executed in either order. In a preferred embodiment, steps (ST2) and (ST4) are processed in parallel.
The dialogue request presence/absence determination unit 50 of the voice separation device 100 determines whether the utterance level obtained in step (ST4) by the utterance level calculation unit 40 is at or above a certain level in the first place. Specifically, the dialogue request presence/absence determination unit 50 compares the utterance level with a predetermined first utterance level threshold. That is, the step of determining whether the utterance level is equal to or higher than the first threshold (ST5) is executed here.
If the utterance level is equal to or higher than the first threshold, that is, if the determination result in step (ST5) of FIG. 2 is YES, the dialogue request presence/absence determination unit 50 determines whether this voice arose from an action or situation of someone trying to talk. Specifically, the dialogue request presence/absence determination unit 50 compares the per-seat dialogue request score calculated by the dialogue request score calculation unit 20 with a predetermined first dialogue request score threshold. A dialogue request score equal to or higher than the first threshold is interpreted as "there is a dialogue request". Conversely, even if the utterance level is at or above a certain level, when the dialogue request score is below the first threshold and the action or situation is not one of trying to talk, it is interpreted as "there is no dialogue request". That is, the step of determining whether the dialogue request score is equal to or higher than the first threshold (ST6) is executed here.
If the utterance level is below the first threshold, that is, if the determination result in step (ST5) of FIG. 2 is NO, the dialogue request presence/absence determination unit 50 determines whether a conversation is being held in a low voice. Specifically, the dialogue request presence/absence determination unit 50 compares the utterance level with a predetermined second utterance level threshold. That is, the step of determining whether the utterance level is equal to or higher than the second threshold (ST7) is executed here.
As for the magnitude relationship between the first and second utterance level thresholds, the second threshold is smaller than the first.
If the utterance level is equal to or higher than the second threshold, that is, if the determination result in step (ST7) of FIG. 2 is YES, the dialogue request presence/absence determination unit 50 determines whether this quiet voice arose from an action or situation of someone trying to talk. Specifically, the dialogue request presence/absence determination unit 50 compares the per-seat dialogue request score calculated by the dialogue request score calculation unit 20 with a predetermined second dialogue request score threshold. That is, the step of determining whether the dialogue request score is equal to or higher than the second threshold (ST8) is executed here.
As for the magnitude relationship between the first and second dialogue request score thresholds, the second threshold is larger than the first. That is, a dialogue request score equal to or higher than the second threshold is reliably interpreted as "there is a dialogue request".
To distinguish the utterance level thresholds from the dialogue request score thresholds, in contexts where confusion may arise the former are called "utterance level thresholds" and the latter "dialogue request score thresholds".
Steps (ST5) through (ST8) in FIG. 2 show a decision flow that combines multiple IF statements to determine, from the utterance level and the dialogue request scores, to which seat the voice input right should be granted. The in-vehicle voice separation device 100 according to the disclosed technology is not limited to this combination of IF statements in its decision flow. For example, the decision method may define the seats to which the voice input right should be granted in an (N+1)-dimensional state space consisting of the utterance level and N dialogue request scores (where N is the number of seats). The state space may also have additional state dimensions so that it can handle the case where a voice input right has already been granted.
Steps (ST5) through (ST8) in FIG. 2 use two utterance level thresholds and two dialogue request score thresholds. The in-vehicle voice separation device 100 according to the disclosed technology does not limit the number of thresholds used in the decision flow to these four. The number of thresholds may be determined appropriately according to the required specifications.
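As a non-limiting illustration, the combination of IF statements in steps ST5 to ST8 can be sketched as follows. The numeric threshold values are assumptions; the text only requires that the second utterance level threshold be smaller than the first and that the second dialogue request score threshold be larger than the first:

```python
def has_dialogue_request(utterance_level, score,
                         level_t1=0.6, level_t2=0.3,
                         score_t1=30.0, score_t2=50.0):
    """Sketch of the ST5-ST8 decision flow: a loud utterance (at or
    above the first level threshold) needs only the first score
    threshold, while a quiet one (at or above the second, lower level
    threshold) needs the higher second score threshold."""
    if utterance_level >= level_t1:   # ST5: YES
        return score >= score_t1      # ST6
    if utterance_level >= level_t2:   # ST7: YES
        return score >= score_t2      # ST8
    return False                      # too quiet: no dialogue request
```

The same mapping could equally be defined over the (N+1)-dimensional state space mentioned above instead of nested IF statements.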
When the decision flow from step (ST5) to step (ST8) in FIG. 2 determines that "there is a dialogue request", the voice input right determination unit 60 determines to which seat the voice input right is granted. That is, the step of determining the occupant to whom the voice input right is given (ST9) is executed here.
As described above, the voice input right determination unit 60 is preconfigured with rules, such as which seat to prioritize when multiple seats are determined to have a dialogue request.
When step (ST9) in FIG. 2 is executed and the voice input right is granted to a seat, information on the seat to which the voice input right has been granted is sent to the voice separation unit 70. The voice separation unit 70 separates, from the voice data sent from the voice acquisition device 30, only the voice of that seat. That is, the step of separating the voice of the seat holding the voice input right (ST10) is executed here.
As described above, because the voice separation device 100 according to Embodiment 1 has the above configuration, it can realize voice separation that does not use a wake word, does not require a microphone at each seat, and does not impose a high processing load.
Embodiment 2.
Embodiment 2 shows a configuration example in which the information acquisition device 10 and the dialogue request score calculation unit 20 of the configuration shown in Embodiment 1 are made more concrete.
In Embodiment 2, the same reference numerals as in Embodiment 1 are used unless otherwise specified, and duplicate descriptions are omitted as appropriate.
FIG. 3 is a block diagram showing the functional configuration of the in-vehicle voice separation device 100 according to Embodiment 2. As shown in FIG. 3, the information acquisition device 10 includes a vehicle-mounted device state acquisition unit 11, a surrounding road condition acquisition unit 12, and an occupant image acquisition unit 13.
Also, as shown in FIG. 3, the dialogue request score calculation unit 20 of the voice separation device 100 includes a state-specific dialogue request degree calculation unit 21, a road-condition-specific request degree calculation unit 22, an occupant position detection unit 23, a line-of-sight detection unit 24, a face orientation detection unit 25, a posture detection unit 26, a mouth opening degree detection unit 27, and a result integration unit 28.
As shown in FIG. 3 , the information from the vehicle-mounted device state acquisition unit 11 of the information acquisition device 10 is sent to the state-by-state dialogue request degree calculation unit 21 of the dialogue request score calculation unit 20 .
Specifically, the vehicle-mounted device status acquisition unit 11 is connected to the vehicle-mounted device. or not, and the status of other onboard units.
Based on the sent information, the status-specific dialogue request level calculation unit 21 calculates the possibility that the occupant of that seat will talk to the vehicle-mounted device for each seat. The dialogue request score calculator 20 may calculate the dialogue request score for each seat according to the calculated possibility.
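As an illustration only, the per-seat likelihood lookup that the state-specific dialogue request degree calculation unit 21 might perform can be sketched as follows. The state names, seat labels, and score values are assumptions made for this example, not values taken from this disclosure.

```python
# Illustrative sketch of the state-specific dialogue request degree
# calculation (unit 21). State names, seat labels, and scores are
# assumptions for this example only.

SEATS = ["driver", "front_passenger", "rear_left", "rear_right"]

# Pre-assigned likelihood (0.0-1.0) that the occupant of each seat will
# speak to the vehicle-mounted device, for each device state.
STATE_SCORES = {
    "phone_ringing":  {"driver": 0.8, "front_passenger": 0.3, "rear_left": 0.1, "rear_right": 0.1},
    "route_guidance": {"driver": 0.6, "front_passenger": 0.4, "rear_left": 0.1, "rear_right": 0.1},
    "music_playing":  {"driver": 0.3, "front_passenger": 0.3, "rear_left": 0.3, "rear_right": 0.3},
}

def dialogue_request_degree_by_state(active_states):
    """Return, per seat, the highest likelihood over all active device states."""
    degrees = {seat: 0.0 for seat in SEATS}
    for state in active_states:
        for seat, value in STATE_SCORES.get(state, {}).items():
            degrees[seat] = max(degrees[seat], value)
    return degrees
```

For instance, while a phone call is ringing, the driver's seat would receive the highest degree in this sketch, making it the most likely candidate for the voice input right.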
 As shown in FIG. 3, information from the surrounding road condition acquisition unit 12 of the information acquisition device 10 is sent to the road-condition-specific request degree calculation unit 22 of the dialogue request score calculation unit 20.
 Specifically, the surrounding road condition acquisition unit 12 acquires information on surrounding road conditions from the map data of the navigation system. More specifically, road conditions include the number of intersections within a certain distance ahead of the vehicle and the presence or absence of intersections where five or more roads meet, such as a five-way junction (hereinafter called a "multi-way junction").
 Based on the sent information, the road-condition-specific request degree calculation unit 22 calculates, for each seat, the likelihood that the occupant of that seat will speak to the vehicle-mounted device. For example, the likelihood may be calculated on the premise that, just before a multi-way junction, the driver is likely to ask the navigation system which road to take. The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat according to this calculated likelihood.
 FIG. 6 is a conceptual diagram of dialogue request scores calculated based on information from the surrounding road condition acquisition unit 12.
 As FIG. 6 shows, one conceivable method of calculating the dialogue request score is to score in advance, for each type of situation obtainable from the surrounding road condition acquisition unit 12, the likelihood that the occupant of each seat will speak to the vehicle-mounted device.
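A minimal sketch of the kind of pre-assigned score table that FIG. 6 suggests might look like the following. The situation names and score values are invented for illustration; the actual table in the disclosure is not reproduced here.

```python
# Illustrative FIG. 6-style score table: for each type of road situation,
# a dialogue request score is pre-assigned per seat. Names and values are
# assumptions for this example.

ROAD_SITUATION_SCORES = {
    "multi_way_junction_ahead": {"driver": 0.9, "front_passenger": 0.2, "rear": 0.1},
    "many_intersections_ahead": {"driver": 0.6, "front_passenger": 0.2, "rear": 0.1},
    "straight_road":            {"driver": 0.1, "front_passenger": 0.1, "rear": 0.1},
}

DEFAULT_SCORES = {"driver": 0.0, "front_passenger": 0.0, "rear": 0.0}

def dialogue_request_score_by_road(situation):
    """Look up the pre-assigned per-seat scores for one road situation."""
    return ROAD_SITUATION_SCORES.get(situation, DEFAULT_SCORES)
```

A table lookup like this keeps the per-frame cost of the score calculation negligible, which fits the stated goal of avoiding a high processing load.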
 As shown in FIG. 3, information from the occupant image acquisition unit 13 of the information acquisition device 10 is sent to the occupant position detection unit 23, the gaze detection unit 24, the face orientation detection unit 25, the posture detection unit 26, and the mouth opening degree detection unit 27 of the dialogue request score calculation unit 20.
 The occupant image acquisition unit 13 is, specifically, a driver monitor, a camera that captures images of the occupants, or the like.
 Based on the sent information, the occupant position detection unit 23 determines whether each seat is occupied. Calculation of the dialogue request score is omitted for unoccupied seats.
 Based on the sent information, the gaze detection unit 24 detects the gaze of the occupant for each occupied seat. The detected gaze information is sent to the result integration unit 28. It is used in particular to determine whether the vehicle-mounted device, or another occupant, lies ahead of the gaze. For example, it may be assumed that if the vehicle-mounted device lies ahead of an occupant's gaze, that occupant is likely to be speaking to the vehicle-mounted device, and that if another occupant lies ahead of the gaze, the occupant is likely to be speaking to that other occupant. The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the gaze information.
 Based on the sent information, the face orientation detection unit 25 detects the orientation of the occupant's face for each occupied seat. The purpose of detecting face orientation is the same as that of detecting the gaze described above. The detected face orientation information is sent to the result integration unit 28 and is used in particular to determine whether the vehicle-mounted device lies in the direction the face is turned. For example, it may be assumed that if the vehicle-mounted device lies in the direction an occupant's face is turned, that occupant is likely to be speaking to the vehicle-mounted device. The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the face orientation information.
 Based on the sent information, the posture detection unit 26 detects the posture of the occupant for each occupied seat. The purpose of detecting posture is the same as that of detecting the gaze described above. The detected posture information is sent to the result integration unit 28 and is used in particular to determine whether the posture is one of speaking to the vehicle-mounted device. The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the posture information.
 Based on the sent information, the mouth opening degree detection unit 27 detects, for each occupied seat, the degree to which the occupant's mouth is open. The detected opening degree for each occupant is sent to the result integration unit 28. The dialogue request score calculation unit 20 may calculate the dialogue request score on the assumption that a person whose mouth is open is likely to be speaking.
 The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat based on multifaceted information. That is, the dialogue request score calculation unit 20 may include the result integration unit 28 shown in FIG. 3 and calculate the dialogue request score for each seat by making a comprehensive judgment from the multifaceted information.
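One hedged way to sketch the comprehensive judgment of the result integration unit 28 is a weighted combination of the per-seat cues (gaze, face orientation, posture, mouth opening). The cue names and weights below are assumptions for this example, not values from this disclosure.

```python
# Illustrative sketch of the result integration unit 28: combine the
# per-seat cue scores from the individual detectors into one dialogue
# request score per seat. Cue names and weights are assumptions.

CUE_WEIGHTS = {"gaze": 0.4, "face": 0.2, "posture": 0.1, "mouth": 0.3}

def integrate_results(per_seat_cues, weights=CUE_WEIGHTS):
    """per_seat_cues: {seat: {cue_name: score in [0, 1]}} -> {seat: score}."""
    return {
        seat: sum(weights.get(cue, 0.0) * value for cue, value in cues.items())
        for seat, cues in per_seat_cues.items()
    }
```

A simple weighted sum is just one possible integration rule; the disclosure leaves the combination method open, so a learned model or rule-based priority could equally serve.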
 As described above, by adopting the configuration shown in Embodiment 2, the speech separation device 100 according to the disclosed technique can achieve speech separation without using a wake word, without requiring a microphone at each seat, and without imposing a high processing load.
Embodiment 3.
 Embodiment 3 shows a configuration example in which the speech separation unit 70 of the configuration examples shown in Embodiments 1 and 2 is made concrete.
 In Embodiment 3, the same reference numerals as those used in the preceding embodiments are used unless otherwise specified. Duplicate descriptions are omitted as appropriate.
 FIG. 4 is a block diagram showing the functional configuration of the in-vehicle speech separation device 100 according to Embodiment 3. As shown in FIG. 4, the speech separation device 100 includes an area-specific speech level calculation unit 45 and an area-specific dialogue request presence/absence determination unit 55.
 As also shown in FIG. 4, the speech separation unit 70 of the speech separation device 100 includes a beamforming unit 71 and a deep speech separation unit 72. The speech separation device 100 according to Embodiment 3 realizes speech separation by beamforming and deep speech separation. Specifically, the speech separation device 100 according to Embodiment 3 performs area-by-area speech separation by beamforming and reviews the voice input right for each area.
 As shown in FIG. 4, the speech separation unit 70 receives, from the voice input right determination unit 60, information about the seat to which the voice input right has been granted.
 When a seat holding the voice input right exists, the beamforming unit 71 of the speech separation unit 70 separates the voice data sent from the voice acquisition device 30 into voice arriving from the direction θ of that seat as seen from the array microphone (hereinafter called "target voice") and voice arriving from directions other than θ (hereinafter called "non-target voice"). For the non-target voice, areas may further be defined based on the direction θ of the sound source, and the voice may be further separated for each area. One example of the area definition is a driver-side row area, a center row area, and a passenger-side row area.
 An array microphone generally used for in-vehicle speech separation is assumed to be placed at the center of the dashboard. With beamforming applied to an array microphone, sound source separation is relatively easy when the directions θ of the sound sources as seen from the array microphone differ, as between the driver's seat and the front passenger seat. However, when the directions θ hardly differ, as between the right front seat and the right rear seat, sound source separation is difficult.
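For reference, the direction-dependent separation that the beamforming unit 71 relies on can be illustrated with a plain delay-and-sum beamformer. This is a generic textbook sketch, not the implementation of this disclosure; the sampling rate, microphone positions, and sign convention for the steering delay are assumptions.

```python
import math

def delay_and_sum(channels, mic_positions, theta, fs, c=343.0):
    """Steer a linear array toward azimuth theta (radians) with an
    integer-sample delay-and-sum beamformer.

    channels: one list of samples per microphone (all the same length).
    mic_positions: microphone x-coordinates in metres along the array axis.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, x in zip(channels, mic_positions):
        # Far-field plane-wave model: compensate each mic's arrival delay
        # for a source in direction theta.
        delay = int(round(x * math.sin(theta) / c * fs))
        for i in range(n):
            j = i - delay
            if 0 <= j < n:
                out[i] += samples[j]
    return [v / len(channels) for v in out]
```

The sketch also makes the limitation in the paragraph above concrete: two seats at nearly the same azimuth θ produce nearly identical compensation delays, so the beamformer cannot tell them apart, which is why the deep-learning stage is needed downstream.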
 The area-specific speech level calculation unit 45 determines the speech level for each of the target voice and the non-target voice separated by the beamforming unit 71. When areas are defined and the voice is further separated for each area, the area-specific speech level calculation unit 45 determines the speech level for each area. The determined area-by-area speech level information is sent to the area-specific dialogue request presence/absence determination unit 55.
 Based on the sent area-by-area speech level information, the area-specific dialogue request presence/absence determination unit 55 re-judges, for each area, whether "there is a dialogue request" or "there is no dialogue request". For example, the area-specific dialogue request presence/absence determination unit 55 re-judges seats belonging to an area with a low speech level as having "no dialogue request". Furthermore, when the speech level of the area containing the seat to which the voice input right has been granted is low, the area-specific dialogue request presence/absence determination unit 55 may notify the voice input right determination unit 60 to review the allocation of the voice input right.
 The deep speech separation unit 72 performs speech separation on the target voice using a mathematical model trained in advance by deep learning to separate speech by voice quality. Specifically, the mathematical model of the deep speech separation unit 72 may be a DNN such as the one disclosed in Patent Document 1.
 FIG. 5 is a flowchart showing the processing of the in-vehicle speech separation device 100 according to Embodiment 3. ST1 to ST9 in the flowchart of FIG. 5 are the same as the corresponding blocks in the flowchart of FIG. 2. ST31 to ST35 in the flowchart of FIG. 5 are steps newly added in Embodiment 3.
 As shown in FIG. 5, the processing of the speech separation device 100 according to Embodiment 3 includes a step of separating the target voice from the voice data by beamforming (ST31), a step of calculating the speech level of the area of the target voice (ST32), a step of determining whether the speech level is equal to or higher than a third threshold (ST33), a step of regarding the seats belonging to the area of the target voice as having "no dialogue request" (ST34), and a step of further separating the target voice by deep learning (ST35).
 The step of separating the target voice from the voice data by beamforming (ST31) is executed by the beamforming unit 71.
 The step of calculating the speech level of the area of the target voice (ST32) is executed by the area-specific speech level calculation unit 45.
 The step of determining whether the speech level is equal to or higher than the third threshold (ST33) is executed by the area-specific dialogue request presence/absence determination unit 55. The magnitude relationship between the third speech level threshold and the first and second speech level thresholds is not fixed; the thresholds may be determined individually, taking into account that they belong to different audio processing paths.
 When the speech level is below the third threshold, that is, when the judgment result of step (ST33) in FIG. 5 is NO, the seats belonging to the area of the target voice are regarded as having "no dialogue request". In other words, the step (ST34) of regarding the seats belonging to the area of the target voice as having "no dialogue request" is executed here. The processing of the speech separation device 100 then returns to the step (ST5) of determining whether the speech level is equal to or higher than the first threshold.
 When the speech level is equal to or higher than the third threshold, that is, when the judgment result of step (ST33) in FIG. 5 is YES, one of the seats belonging to the area of the target voice is considered to have "a dialogue request". Processing then moves to the deep speech separation unit 72, and the step (ST35) of further separating the target voice by deep learning is executed.
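The ST33 branch described above can be sketched as a small control-flow function. The threshold value and the stand-in for the deep separation stage are illustrative assumptions, not values or code from this disclosure.

```python
# Illustrative sketch of the ST33 branch: ST34 when the area's speech
# level is below the third threshold, ST35 otherwise. The threshold
# value and the deep-separation stand-in are assumptions.

THIRD_THRESHOLD = 0.5  # assumed value of the third speech level threshold

def deep_separate(target_voice):
    # Stand-in for the DNN-based separation performed by unit 72 (ST35).
    return target_voice

def judge_target_area(target_voice, area_speech_level):
    """ST33: compare the area's speech level against the third threshold.
    Below it -> ST34 ("no dialogue request", control returns toward ST5);
    otherwise -> ST35 (deep separation of the target voice)."""
    if area_speech_level < THIRD_THRESHOLD:
        return ("no_dialogue_request", None)
    return ("dialogue_request", deep_separate(target_voice))
```

Gating the expensive DNN stage behind this cheap level check is what keeps the overall processing load low: the deep separation runs only when an area actually shows speech activity.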
 As described above, by adopting the configuration shown in Embodiment 3, the speech separation device 100 according to the disclosed technique can achieve speech separation without using a wake word, without requiring a microphone at each seat, and without imposing a high processing load.
 The disclosed technique can be applied to vehicle-mounted devices such as car navigation systems that can respond to voice instructions, and therefore has industrial applicability.
 10 information acquisition device, 11 vehicle-mounted device state acquisition unit, 12 surrounding road condition acquisition unit, 13 occupant image acquisition unit, 20 dialogue request score calculation unit, 21 state-specific dialogue request degree calculation unit, 22 road-condition-specific request degree calculation unit, 23 occupant position detection unit, 24 gaze detection unit, 25 face orientation detection unit, 26 posture detection unit, 27 mouth opening degree detection unit, 28 result integration unit, 30 voice acquisition device, 40 speech level calculation unit, 45 area-specific speech level calculation unit, 50 dialogue request presence/absence determination unit, 55 area-specific dialogue request presence/absence determination unit, 60 voice input right determination unit, 70 speech separation unit, 71 beamforming unit, 72 deep speech separation unit, 80 speech recognition device, 90 notification device, 100 speech separation device.

Claims (8)

  1.  An in-vehicle speech separation device comprising:
     a speech level calculation unit that calculates a speech level based on acquired voice inside a vehicle;
     a dialogue request score calculation unit that calculates a dialogue request score for each seat based on acquired information necessary for dialogue request detection; and
     a voice input right determination unit that determines, based on the speech level and the dialogue request score, to which of the seats in the vehicle a voice input right is to be granted.
  2.  A speech separation method comprising:
     a step of calculating a speech level based on acquired voice inside a vehicle;
     a step of calculating a dialogue request score for each seat based on acquired information necessary for dialogue request detection; and
     a step of determining, based on the speech level and the dialogue request score, to which of the seats in the vehicle a voice input right is to be granted.
  3.  The in-vehicle speech separation device according to claim 1, wherein the conditions under which the voice input right is granted differ between the driver's seat and the seats other than the driver's seat.
  4.  The speech separation method according to claim 2, wherein the conditions under which the voice input right is granted differ between the driver's seat and the seats other than the driver's seat.
  5.  The in-vehicle speech separation device according to claim 1, wherein the information necessary for dialogue request detection includes surrounding road conditions, and the dialogue request score calculation unit includes a road-condition-specific request degree calculation unit.
  6.  The speech separation method according to claim 2, wherein the information necessary for dialogue request detection includes surrounding road conditions.
  7.  The in-vehicle speech separation device according to claim 1, further comprising an area-specific speech level calculation unit that determines the magnitude of the speech level for an area that includes the seat to which the voice input right has been granted.
  8.  The speech separation method according to claim 2, further comprising a step of determining the magnitude of the speech level for an area that includes the seat to which the voice input right has been granted.
PCT/JP2021/006024 2021-02-18 2021-02-18 In-vehicle voice separation device and voice separation method WO2022176085A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/006024 WO2022176085A1 (en) 2021-02-18 2021-02-18 In-vehicle voice separation device and voice separation method

Publications (1)

Publication Number Publication Date
WO2022176085A1 true WO2022176085A1 (en) 2022-08-25

Family

ID=82930335

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/006024 WO2022176085A1 (en) 2021-02-18 2021-02-18 In-vehicle voice separation device and voice separation method

Country Status (1)

Country Link
WO (1) WO2022176085A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006025106A1 (en) * 2004-09-01 2006-03-09 Hitachi, Ltd. Voice recognition system, voice recognizing method and its program
JP2016061888A (en) * 2014-09-17 2016-04-25 株式会社デンソー Speech recognition device, speech recognition subject section setting method, and speech recognition section setting program
WO2017145373A1 (en) * 2016-02-26 2017-08-31 三菱電機株式会社 Speech recognition device
JP2020134566A (en) * 2019-02-13 2020-08-31 パナソニックIpマネジメント株式会社 Voice processing system, voice processing device and voice processing method



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21926523

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21926523

Country of ref document: EP

Kind code of ref document: A1