WO2022176085A1

WO2022176085A1 - In-vehicle voice separation device and voice separation method

Info

Publication number: WO2022176085A1
Application number: PCT/JP2021/006024
Authority: WO
Inventors: 真宗平; 尚嘉竹裏
Original assignee: 三菱電機株式会社
Priority date: 2021-02-18
Filing date: 2021-02-18
Publication date: 2022-08-25

Abstract

An in-vehicle voice separation device according to the present invention includes: a speech level calculation unit (40) that calculates a speech level on the basis of an acquired voice within a vehicle; a dialogue request score calculation unit (20) that calculates a dialogue request score for each seat on the basis of information required to detect an acquired dialogue request; and a voice input right determination unit (60) that uses the speech level and the dialogue request score as a basis to determine whether to grant a voice input right to any of the seats within the vehicle.

Description

In-vehicle audio separation device and audio separation method

The technology disclosed herein relates to an in-vehicle audio separation device.

A device (hereinafter referred to as "vehicle device") that is an in-vehicle device and is capable of responding to voice instructions is known. Well-known in-vehicle devices include, for example, Amazon Echo Auto. Conventional in-vehicle equipment, represented by Amazon Echo Auto, uses the wake word "Alexa" before the instruction voice to prevent voice processing from becoming a heavy load. However, since this wake word is not necessary for conversation between humans, it is desired to realize an on-vehicle device that does not use wake words.

In-vehicle equipment is also required to close up only a specific voice even in a noisy vehicle. A technique using a plurality of microphones is known for this problem. For example, Cerence's Passenger Interference Cancellation (hereinafter referred to as "PIC") and Transtron's microphone are known to be applied to vehicle-mounted equipment.
An example from Cerence's PIC article:
<URL: https://response. jp/article/2016/03/01/270717. html>
An example from an article about Transtron array microphones:
<URL: https://www. transtron. com/products/array. html>
As an example of using multiple microphones, a method of arranging a microphone at each seat is conceivable. Although this method has good voice separation performance, it has the property of requiring wiring costs and labor for arranging microphones to each seat. This method is only used in some luxury cars.
Another use for multiple microphones is to place them near the center information display on the dashboard. Although this method does not require wiring cost and labor, it has the property that the voice separation performance is inferior when there is no difference in the angle at which the voice is input, such as between the driver's seat and the rear seat.

Array microphones in which a plurality of microphones are aligned are also known. For sound source separation technology of mixed signals of multiple sound sources observed by array microphones, a sound source separation method is disclosed that has high separation performance even for sound sources that do not have sparsity in the time-frequency domain using neural networks. (For example, Patent Document 1). More specifically, Patent Document 1 uses a trained deep neural network (hereinafter referred to as "DNN") to learn observation signals from a sound source with a known direction θ as teacher data, and select sounds from the θ direction. It discloses that it extracts
In order to improve voice separation performance without placing a microphone in each seat, it is conceivable to combine the DNN technology disclosed in Patent Document 1 with an array microphone placed on the dashboard.

JP 2020-38315 A

Combining DNN technology with the in-vehicle array microphone means implementing high-load processing for the number of people in order to extract the voice of each passenger. A high processing load may lead to a delay in response to voices, a hindrance to other functions being executed by the vehicle-mounted device, and the like.

An object of the disclosed technology is to solve the above problems, to provide a voice separation device for an in-vehicle device that does not use a wake word, does not require a microphone to be placed in each seat, and does not cause a high processing load. and

An in-vehicle speech separation device according to the present disclosure includes a speech level calculation unit that calculates a speech level based on the acquired in-vehicle speech, and a dialogue request score for each seat based on the acquired information necessary for dialogue request detection. and a voice input authority determination unit that determines to which of the seats in the vehicle the voice input authority should be granted based on the utterance level and the dialogue request score. include.

Since the in-vehicle audio separation device according to the disclosed technology has the above configuration, an in-vehicle device that does not use a wake word, does not require microphones to be placed in each seat, and does not impose a high processing load is realized.

FIG. 1 is a block diagram showing a functional configuration of an in-vehicle audio separation device according to Embodiment 1. As shown in FIG. FIG. 2 is a flow chart showing processing of the in-vehicle audio separation device according to the first embodiment. FIG. 3 is a block diagram showing a functional configuration of an in-vehicle audio separation device according to Embodiment 2. As shown in FIG. FIG. 4 is a block diagram showing a functional configuration of an in-vehicle audio separation device according to Embodiment 3. As shown in FIG. FIG. 5 is a flow chart showing processing of the in-vehicle audio separation device according to the third embodiment. FIG. 6 is an image diagram of a dialogue request score calculated based on information from the surrounding road condition acquisition unit.

The in-vehicle audio separation device 100 according to the technology disclosed herein will be clarified by the following description along the drawings for each embodiment.

Embodiment 1.
FIG. 1 is a block diagram showing the functional configuration of an in-vehicle audio separation device 100 according to Embodiment 1. As shown in FIG. As shown in FIG. 1, the in-vehicle speech separation device 100 according to the technology disclosed herein includes a dialogue request score calculation unit 20, an utterance level calculation unit 40, a dialogue request presence/absence determination unit 50, a voice input authority determination unit 60, a voice and a separation unit 70 .
The speech separation device 100 has two input systems, which are connected to the information acquisition device 10 and the speech acquisition device 30, respectively. The speech separation device 100 has at least one output system and is connected to the speech recognition device 80 . The voice separation device 100 may have a second output system and may be connected to the notification device 90 .

The information acquisition device 10 is a camera, drive recorder, driver monitor, or other device capable of detecting interaction requests.
The voice acquisition device 30 is a device that includes a microphone and acquires voice inside the vehicle.

The dialog request score calculator 20 of the speech separation device 100 calculates the score of the dialog request based on the information from the information acquisition device 10 . In a specific example, when the information acquisition device 10 is a camera, the interaction request score calculator 20 analyzes the line of sight of the passenger based on the moving image captured by the camera. An interaction request score is calculated for each seat in the vehicle.

The speech level calculation unit 40 of the speech separation device 100 calculates the speech level inside the vehicle based on the information from the speech acquisition device 30 . Specifically, the speech level calculator 40 removes noise other than speech from the speech information received from the speech acquisition device 30, and calculates the speech level of the speech. The speech information received by the speech level calculator 40 is not yet separated for each passenger here. Therefore, the in-vehicle utterance level calculated by the utterance level calculation unit 40 (hereinafter simply referred to as "utterance level") is not for each passenger but for the entire vehicle interior.

The dialogue request presence/absence determination unit 50 of the speech separation device 100 determines whether there is a dialogue request based on the dialogue request score calculated by the dialogue request score calculation unit 20 and the speech level calculated by the speech level calculation unit 40 . For example, when the speech level is sufficiently high and the score of the dialogue request is not small, the dialogue request presence/absence determination unit 50 determines that "there is a dialogue request". For example, the dialogue request presence/absence determination unit 50 may determine that "there is a dialogue request" even when the speech level is somewhat low but the dialogue request is sufficiently large. If the condition of "there is a demand for dialogue" is not satisfied, the dialogue request presence/absence determination unit 50 may determine that "there is no demand for dialogue". The processing flow of the dialog request presence/absence determination unit 50 will be clarified later.

When it is determined that "there is a dialogue request", the speech input right determination unit 60 of the speech separation device 100 grants the right to perform voice input to the vehicle-mounted device (hereinafter referred to as "voice determine the occupants to whom input rights are granted. Information on the passenger to whom the voice input right has been granted is sent to the voice separation section 70 .

Several variations are conceivable for determining which occupants are given voice input rights.
For example, if the interaction request score exceeds a threshold for multiple seats in a vehicle, it may be determined that the driver's seat is always given priority if the driver's seat is included.
For example, the interaction request score threshold may be lowered only for the driver's seat. In this case also, the driver's seat has priority over the other seats.
For example, it may be determined that if the interaction request scores for multiple seats in the car exceed a threshold, first come first served. This decision may or may not include the driver's seat.
For example, if the interaction request scores for multiple seats in the vehicle exceed a threshold, it may be determined that priority is given to the interaction request with the higher score. In this case, if the score of the driver's seat is raised in advance, there is an effect that the driver's seat is prioritized over the other seats.

If there is already a seat to which the voice input right has been granted (hereafter referred to as a "held seat"), several variations are conceivable for the validity period of the voice input right.
For example, the expiration date of the voice input right may be set to a predetermined fixed time so that the held seat is not frequently switched.
For example, if the score of the seat other than the held seat continues to exceed the score of the held seat for a certain period of time, the voice input right may be switched.
For example, it may be determined that the condition for switching audio input rights is to exceed the score of the retained seat plus a margin. The switching of the voice input authority based on this rule is clarified by the following numerical example with a margin of 5 [pt], for example. Assume that the score of the held seat was initially 30 [pt] and the score of the other seats was 28 [pt]. Assume that the score of the held seat changes to 32 [pt] and the score of the other seats changes to 35 [pt] in the following state. In this case, since 35 [pt] does not exceed 32+5 [pt], the voice input right is not switched. Furthermore, suppose that the score of the held seat changes to 28 [pt] and the score of the other seats changes to 34 [pt] in the following state. In this case, since 34 [pt] exceeds 28+5 [pt], the voice input right is switched.

The voice separation unit 70 of the voice separation device 100 separates only the voice of the seat to which the voice input right is granted from the voice data sent from the voice acquisition device 30 .
As described above, the sound separation device 100 according to the technology disclosed herein does not perform sound separation for all seats, but performs sound separation only for specific seats to which the sound input right has been granted. Therefore, the load required for sound processing is high. I can get away with it.

The speech that has undergone separation processing by the speech separation unit 70 is output to the speech recognition device 80 . The speech recognition device 80 is specifically a device including an on-vehicle device such as a car navigation system.

The audio separation device 100 may have a second output system so that the information of the seat to which the audio input right is granted can be displayed outside. The external display means may be the external notification device 90, or may be an LED display, a display screen, or a speaker that the audio separation device 100 has. In this manner, the voice separation device 100 may have a configuration capable of externally displaying the information of the seat to which the voice input right has been granted in some way.

FIG. 2 is a flow chart showing an example of processing of the in-vehicle audio separation device 100 according to the first embodiment. As shown in FIG. 2, the processing of the speech separation device 100 includes a step of acquiring information necessary for detecting a dialogue request (ST1), a step of analyzing the acquired information (ST2), and a step of acquiring voice inside the vehicle (ST3). ), a step of calculating the speech level in the vehicle from the acquired voice (ST4), a step of determining whether the speech level is equal to or higher than a first threshold (ST5), and a step of determining whether the score of the dialogue request is equal to or higher than the first threshold. step (ST6) of determining whether or not; step (ST7) of determining whether or not the speech level is equal to or greater than the second threshold; and step (ST8) of determining whether or not the dialogue request score is equal to or greater than the second threshold. , a step (ST9) of determining the occupant to whom the voice input right is given, and a step (ST10) of separating the voice of the seat having the voice input right.
Strictly speaking, the step (ST1) of acquiring the information necessary for detecting the dialogue request is a process performed by the information acquisition device 10, and the step (ST3) of acquiring the voice inside the vehicle is a process performed by the voice acquisition device 30. be. However, these steps (ST1, ST3) are included in the flow chart of FIG. 2 to clarify the operation of the system as a whole.

The dialog request score calculator 20 of the speech separation device 100 executes the step of analyzing the information acquired via the information acquisition device 10 (ST2). The dialog request score calculator 20 calculates a dialog request score for each seat based on the analysis result.

The speech level calculation unit 40 of the speech separation device 100 executes a step (ST4) of calculating the speech level in the vehicle from the speech acquired via the speech acquisition device 30. The step (ST2) processed by the dialogue request score calculation unit 20 and the step (ST4) processed by the speech level calculation unit 40 do not matter before or after the execution. In a preferred embodiment, step (ST2) and step (ST4) are processed concurrently.

The dialogue request presence/absence determination unit 50 of the speech separation device 100 determines whether or not the utterance level obtained in step (ST4) of the utterance level calculation unit 40 is above a certain level. Specifically, the dialogue request presence/absence determination unit 50 compares a predetermined speech level first threshold with the speech level. That is, the step (ST5) of determining whether or not the speech level is equal to or higher than the first threshold is executed here.

If the speech level is equal to or higher than the first threshold, that is, if the determination result in step (ST5) in FIG. determine whether Specifically, the dialogue request presence/absence determination unit 50 compares the dialogue request score for each seat calculated by the dialogue request score calculation unit 20 with a predetermined dialogue request score first threshold. That is, a dialogue request score equal to or greater than the first threshold is interpreted as "there is a dialogue request". Conversely, even if the utterance level is above a certain level, if the dialogue request score is less than the first threshold and the action or situation is not one of trying to talk, it is interpreted as "no dialogue request". That is, a step (ST6) of determining whether or not the score of the interaction request is equal to or greater than the first threshold is executed here.

If the utterance level is less than the first threshold, that is, if the determination result in step (ST5) of FIG. 2 is NO, the dialogue request presence/absence determination unit 50 determines whether the conversation is being conducted in a low voice. Specifically, the dialogue request presence/absence determination unit 50 compares a predetermined speech level second threshold with the speech level. That is, the step (ST7) of determining whether the speech level is equal to or higher than the second threshold is executed here.
As for the magnitude relationship between the first threshold and the second threshold of the speech level, the second threshold is smaller than the first threshold.

If the speech level is equal to or higher than the second threshold, that is, if the determination result in step (ST7) in FIG. to judge what Specifically, the dialog request presence/absence determination unit 50 compares the dialog request score for each seat calculated by the dialog request score calculation unit 20 with a predetermined dialog request score second threshold. That is, the step (ST8) of determining whether or not the score of the interaction request is equal to or greater than the second threshold is executed here.
As for the magnitude relationship between the first threshold and the second threshold of the dialogue request score, the second threshold is larger than the first threshold. In other words, a dialog request score equal to or greater than the second threshold is interpreted as "there is a dialog request".
In addition, in order to distinguish between the utterance level threshold and the dialogue request score threshold, the former is referred to as the "utterance level threshold" and the latter is referred to as the "dialogue request score threshold" in situations where confusion is likely to occur.

From step (ST5) to step (ST8) in FIG. 2, a plurality of IF sentences are combined to determine to which seat the voice input privilege should be given from the respective values of the speech level and dialogue request score. It shows the decision flow for The in-vehicle speech separation device 100 according to the technology disclosed herein is not limited to the combination of multiple IF sentences in the determination flow. For example, the decision method may define the seats to which voice input rights should be granted in an N+1 dimensional state space consisting of speech levels and N interaction request scores (where N is the number of seats). In addition, for example, the state space may have an additional state dimension so that it can handle the case where the voice input right has already been granted.
Also, from step (ST5) to step (ST8) in FIG. 2, two utterance level thresholds and two dialogue request score thresholds are used. The in-vehicle audio separation device 100 according to the technology disclosed herein does not limit the number of thresholds used in the determination flow to the above four. The number of thresholds may be determined appropriately according to the required specifications.

If it is determined that "there is a request for dialogue" in the determination flow from steps (ST5) to (ST8) in FIG. 2, the voice input authority determining section 60 determines to which seat the voice input authority is granted. That is, the step (ST9) of determining the passenger to whom the voice input right is given is executed here.
As described above, the voice input authority determining unit 60 is preset with a determination as to which seat is prioritized when there are a plurality of seats determined to have a "dialogue request".

When step (ST9) in FIG. 2 is implemented and the voice input right is given to the seat, the information of the seat to which the voice input right has been given is sent to the voice separation section 70 . The voice separation unit 70 separates only the voice of the seat to which the voice input right is granted from the voice data sent from the voice acquisition device 30 . In other words, the step (ST10) of separating the voice of the seat with voice input right is executed.

As described above, since the speech separation apparatus 100 according to Embodiment 1 has the above configuration, it does not use a wake word, does not require a microphone to be placed in each seat, and performs speech separation that does not impose a high processing load. realizable.

Embodiment 2.
Embodiment 2 shows a configuration example in which the information acquisition device 10 and the interaction request score calculation unit 20 are specified in the configuration example shown in Embodiment 1. FIG.
In Embodiment 2, the same reference numerals as those used in Embodiment 1 are used unless otherwise specified. Duplicate descriptions will be omitted as appropriate.

FIG. 3 is a block diagram showing the functional configuration of the in-vehicle audio separation device 100 according to Embodiment 2. As shown in FIG. As shown in FIG. 3 , the information acquisition device 10 includes a vehicle-mounted device state acquisition unit 11 , a surrounding road condition acquisition unit 12 , and an occupant image acquisition unit 13 .
Also, as shown in FIG. , face orientation detection unit 25 , posture detection unit 26 , mouth opening detection unit 27 , and result integration unit 28 .

As shown in FIG. 3 , the information from the vehicle-mounted device state acquisition unit 11 of the information acquisition device 10 is sent to the state-by-state dialogue request degree calculation unit 21 of the dialogue request score calculation unit 20 .
Specifically, the vehicle-mounted device status acquisition unit 11 is connected to the vehicle-mounted device. or not, and the status of other onboard units.
Based on the sent information, the status-specific dialogue request level calculation unit 21 calculates the possibility that the occupant of that seat will talk to the vehicle-mounted device for each seat. The dialogue request score calculator 20 may calculate the dialogue request score for each seat according to the calculated possibility.

As shown in FIG. 3 , the information from the surrounding road condition acquisition unit 12 of the information acquisition device 10 is sent to the road condition-specific request degree calculation unit 22 of the dialogue request score calculation unit 20 .
The surrounding road condition acquisition unit 12 specifically acquires information on the surrounding road conditions from the map data of the navigation system. More specifically, road conditions refer to the number of intersections within a certain distance of the vehicle's destination, and the presence or absence of intersections where five or more roads intersect, such as five-way intersections (hereafter referred to as "multi-way intersections"). , etc.
Based on the sent information, the road condition-specific request level calculation unit 22 calculates the possibility that the occupant of that seat will talk to the vehicle-mounted device for each seat. For example, the likelihood may be calculated based on the premise that the driver is likely to ask the navigation system which road to take before a multi-junction. The dialogue request score calculator 20 may calculate the dialogue request score for each seat according to the calculated possibility.

FIG. 6 is an image diagram of the dialogue request score calculated based on the information from the surrounding road condition acquisition unit 12. As shown in FIG.
As shown in FIG. 6, the dialogue request score can be calculated by scoring the likelihood that the occupant of each seat will talk to the vehicle-mounted device according to the type of situation obtained from the peripheral road condition acquisition unit 12 in advance. be done.

As shown in FIG. 3 , the information from the occupant image acquisition unit 13 of the information acquisition device 10 includes the occupant position detection unit 23 , line-of-sight detection unit 24 , face direction detection unit 25 , posture detection unit 26 , and and to the aperture detection unit 27 .
The occupant image acquisition unit 13 is specifically a driver monitor, a camera that captures an image of the occupant, and the like.

The occupant position detection unit 23 determines whether or not there is an occupant for each seat based on the sent information. Calculation of the interaction request score is omitted for seats with no occupants.

The line-of-sight detection unit 24 detects the line-of-sight of each passenger on each seat based on the sent information. The detected line-of-sight information is sent to the result integration unit 28 . The detected line-of-sight information is used from the viewpoint of whether there is an on-vehicle device ahead of the line of sight, or whether there is another passenger ahead of the line of sight. For example, it may be based on the premise that if there is an on-board device ahead of the line of sight, the owner of the line of sight is likely to be talking to the on-board device. For example, it may be based on the premise that if there is another passenger ahead of the line of sight, there is a high possibility that the owner of the line of sight is talking to the other passenger. The dialogue request score calculator 20 may calculate the dialogue request score for each seat from the line-of-sight information.

The face direction detection unit 25 detects the direction of the face of each passenger on each seat based on the information sent. The purpose of detecting the orientation of the face is the same as the purpose of detecting the line of sight described above. The detected face orientation information is sent to the result integration unit 28 . The detected face direction information is used especially from the viewpoint of whether or not there is an on-vehicle device in front of the face. For example, it may be based on the premise that if there is a vehicle-mounted device ahead of the person's face, there is a high possibility that the person is talking to the vehicle-mounted device. The dialogue request score calculator 20 may calculate the dialogue request score for each seat from the face orientation information.

The posture detection unit 26 detects the posture of each passenger in each seat based on the information sent. The purpose of detecting the posture is the same as the purpose of detecting the line of sight described above. Information on the detected posture is sent to the result integration unit 28 . Information on the detected posture is used from the viewpoint of whether the posture is a posture of talking to the vehicle-mounted device. The dialogue request score calculation unit 20 may calculate the dialogue request score for each seat from the posture information.

Based on the sent information, the degree-of-openness detection unit 27 detects the degree of opening, which is the degree of opening of the mouth of the passenger, for each seat where the passenger is present. The detected degree of opening for each passenger is sent to the result integrating section 28 . The dialog request score calculator 20 may calculate the dialog request score based on the premise that if the mouth is open, the person is likely to be speaking.

The dialog request score calculation unit 20 may calculate the dialog request score for each seat based on multifaceted information. That is, the dialogue request score calculation unit 20 may include the result integration unit 28 shown in FIG. 3, make comprehensive judgments from multifaceted information, and calculate the dialogue request score for each seat.

As described above, the speech separation device 100 according to the technique of the present disclosure employs the configuration shown in Embodiment 2, does not use wake words, does not require microphones to be placed in each seat, and has a high processing load. It is possible to realize sound separation that does not need to be done.

Embodiment 3.
Embodiment 3 shows a configuration example in which the audio separation section 70 is specified in the configuration examples shown in Embodiments 1 and 2. FIG.
In the third embodiment, the same reference numerals as those used in the previous embodiments are used unless otherwise specified. Duplicate descriptions will be omitted as appropriate.

FIG. 4 is a block diagram showing the functional configuration of the in-vehicle audio separation device 100 according to Embodiment 3. As shown in FIG. As shown in FIG. 4 , the speech separation device 100 includes an area-based speech level calculation unit 45 and an area-based dialogue request presence/absence determination unit 55 .
Further, as shown in FIG. 4 , the speech separation unit 70 of the speech separation device 100 includes a beamforming unit 71 and a deep layer speech separation unit 72 . The speech separation apparatus 100 according to Embodiment 3 realizes speech separation by beamforming and deep speech separation. Specifically, the audio separation apparatus 100 according to the third embodiment performs area-based audio separation by beamforming, and reviews audio input rights for each area.

As shown in FIG. 4 , the voice separation unit 70 receives information about the seat to which the voice input right is granted from the voice input right determination unit 60 .
If there is a seat to which the voice input right has been granted, the beamforming unit 71 of the voice separation unit 70 transmits the voice data sent from the voice acquisition device 30 to the seat to which the voice input right has been granted as viewed from the array microphone. (hereinafter referred to as "target voice") and voices from other directions (hereinafter referred to as "untargeted voice"). For the non-target sound, areas may be further defined based on the direction θ from the sound source, and the sound may be further separated for each area. An example definition of areas is a driver row area, a center row area, and a passenger row area.

It is assumed that array microphones, which are generally used for voice separation in cars, are placed in the center of the dashboard inside the car. In the technique of applying beamforming to array microphones, sound source separation is relatively easy if the directions θ of the sound sources viewed from the array microphones are different, such as between the driver's seat and the front passenger's seat. However, when the direction θ of the sound source seen from the array microphone does not change much, such as between the right front seat and the right rear seat, it is difficult to separate the sound sources.

The area-by-area utterance level calculation unit 45 determines the utterance level for each of the target voice and non-target voice separated by the beamforming unit 71 . When areas are defined and voices are further separated for each area, the area-by-area speech level calculator 45 determines the speech level for each area. The determined area-by-area utterance level information is sent to area-by-area dialogue request presence/absence determination unit 55 .

The area-by-area dialogue request presence/absence determination unit 55 re-determines whether "there is a dialogue request" or "there is no dialogue request" for each area based on the transmitted area-by-area utterance level information. For example, the area-by-area dialogue request presence/absence determination unit 55 re-determines that "there is no dialogue request" for a seat belonging to an area with a low utterance level. Further, if the speech level is low in the seat area to which the voice input right is granted, the area-by-area dialogue request presence/absence determination unit 55 may notify the voice input right determination unit 60 to review the allocation of the voice input right.

The deep layer speech separation unit 72 performs speech separation for the target speech using a mathematical model that has undergone deep learning in advance so as to separate the speech from the voice quality. Specifically, the mathematical model of the deep layer speech separation unit 72 may be a DNN as disclosed in Patent Document 1.

FIG. 5 is a flowchart showing processing of the in-vehicle audio separation device 100 according to the third embodiment. ST1 to ST9 in the flowchart of FIG. 5 are the same as the blocks in the flowchart of FIG. ST31 to ST35 in the flowchart of FIG. 5 are steps newly added in the third embodiment.

As shown in FIG. 5, the processing of the speech separation apparatus 100 according to the third embodiment includes a step of separating the target speech from the speech data by beamforming (ST31) and a step of calculating the utterance level of the area of the target speech (ST32). Step (ST33) of determining whether the speech level is equal to or higher than the third threshold; and a separating step (ST35).

The beamforming unit 71 performs the step of separating the target sound from the sound data by beamforming (ST31).

The step of calculating the utterance level of the area of the target voice (ST32) is executed by the area-by-area utterance level calculation unit 45.

The step (ST33) of determining whether or not the speech level is equal to or higher than the third threshold is executed by the area-specific dialogue request presence/absence determination unit 55. The magnitude relationship between the third threshold of speech level and the first threshold of speech level and the second threshold of speech level is not unique, and may be determined individually in consideration of different audio processing paths.

If the speech level is less than the third threshold, that is, if the judgment result in step (ST33) in FIG. In other words, a step (ST34) is executed in which the seat belonging to the area of the target voice is regarded as having "no dialogue request". After that, the processing of the speech separation device 100 returns to the step (ST5) of determining whether or not the speech level is equal to or higher than the first threshold.

If the speech level is equal to or higher than the third threshold, that is, if the determination result in step (ST33) in FIG. Here, the process moves to the deep layer sound separation unit 72, and a step (ST35) of further separating the target sound by deep layer learning is executed.

As described above, the speech separation device 100 according to the technique of the present disclosure employs the configuration shown in Embodiment 3, does not use a wake word, does not require microphones to be placed in each seat, and has a high processing load. It is possible to realize sound separation that does not need to be done.

The disclosed technology can be applied to in-vehicle devices such as car navigation systems that can respond to voice instructions, and has industrial applicability.

10 information acquisition device, 11 vehicle-mounted device status acquisition unit, 12 surrounding road condition acquisition unit, 13 occupant video acquisition unit, 20 dialog request score calculation unit, 21 state-specific dialog request degree calculation unit, 22 road condition-specific request degree calculation unit, 23 Passenger position detection unit, 24 Gaze detection unit, 25 Face direction detection unit, 26 Posture detection unit, 27 Opening degree detection unit, 28 Result integration unit, 30 Voice acquisition device, 40 Speech level calculation unit, 45 Speech level calculation for each area 50 Dialogue request presence/absence determination unit 55 Area-specific dialogue request presence/absence determination unit 60 Voice input authority determination unit 70 Voice separation unit 71 Beam forming unit 72 Deep voice separation unit 80 Voice recognition device 90 Notification device 100 audio separator.

Claims

an utterance level calculation unit that calculates an utterance level based on the acquired voice inside the vehicle;
an interaction request score calculation unit that calculates an interaction request score for each seat based on the acquired information necessary for interaction request detection;
a voice input authority determination unit that determines to which of the seats in the vehicle the voice input authority should be granted based on the speech level and the dialogue request score;
In-vehicle audio separation equipment including.
a step of calculating a speech level based on the obtained voice inside the vehicle;
calculating an interaction request score for each seat based on the acquired information necessary for interaction request detection;
determining which of the seats in the vehicle is to be given the voice input privilege based on the speech level and the interaction request score;
Speech separation methods, including
　In-vehicle audio separation device according to claim 1, wherein conditions for the audio input right are different between the driver's seat and the seats other than the driver's seat.
The voice separation method according to claim 2, wherein the voice input right has different conditions for the driver's seat and for the seats other than the driver's seat.
　　The in-vehicle speech separation device according to claim 1, wherein the information necessary for detecting the interaction request includes surrounding road conditions, and the interaction request score calculation unit includes a request degree calculation unit for each road condition.
The speech separation method according to claim 2, wherein the information necessary for detecting the interaction request includes surrounding road conditions.
　The in-vehicle speech separation device according to claim 1, further comprising an area-specific speech level calculation unit that determines the magnitude of the speech level for an area in which the seat to which the speech input right is granted is included.
　The voice separation method according to claim 2, further comprising the step of determining the magnitude of the speech level for an area in which the seat to which the voice input right is granted is included.