WO2022038724A1 - Voice dialogue device and dialogue target determination method implemented in a voice dialogue device - Google Patents


Info

Publication number
WO2022038724A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialogue
voice
response
utterance
unit
Prior art date
Application number
PCT/JP2020/031359
Other languages
English (en)
Japanese (ja)
Inventor
Masanobu Osawa (政信 大澤)
Naoya Baba (直哉 馬場)
Yuki Furumoto (友紀 古本)
Original Assignee
Mitsubishi Electric Corporation (三菱電機株式会社)
Priority date
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation (三菱電機株式会社)
Priority to PCT/JP2020/031359
Publication of WO2022038724A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 — Constructional details of speech recognition systems

Definitions

  • The present disclosure relates to a voice dialogue device and a dialogue target determination method in the voice dialogue device.
  • Patent Document 1 discloses a voice recognition device that determines whether or not an utterance by a speaker is an utterance directed to a dialogue device.
  • The voice recognition device disclosed in Patent Document 1 determines whether or not an utterance by the speaker is directed to the dialogue device based on voice signal characteristics of the speaker, such as a change in pitch frequency, the speed of the utterance, or the volume. Specifically, for example, the voice recognition device disclosed in Patent Document 1 determines whether the change in the pitch frequency of the speaker's utterance is within a predetermined range, and if the change is within that range, it determines that the utterance by the speaker is directed to the dialogue device.
  • In addition, depending on whether or not a person other than the speaker responds to the utterance within a predetermined time, the device determines whether the utterance by the speaker is directed to the person other than the speaker or to the dialogue device.
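As a concrete illustration of the prior-art pitch criterion described above, the check can be sketched as follows. The range bounds, function name, and input format are assumptions made for illustration only; they are not taken from Patent Document 1.

```python
# Sketch of the prior-art judgment: the utterance is considered directed to
# the dialogue device when the change in pitch frequency stays within a
# predetermined range. The range below is an assumed value.

PITCH_CHANGE_RANGE = (0.0, 20.0)  # Hz; assumed "predetermined range"

def is_utterance_to_device(pitch_values: list) -> bool:
    """Judge from pitch-frequency change whether the utterance targets the device."""
    change = max(pitch_values) - min(pitch_values)
    return PITCH_CHANGE_RANGE[0] <= change <= PITCH_CHANGE_RANGE[1]

assert is_utterance_to_device([120.0, 125.0, 130.0])      # change = 10 Hz, in range
assert not is_utterance_to_device([120.0, 160.0, 200.0])  # change = 80 Hz, out of range
```

As the following paragraphs note, a flat (or inflected) delivery makes such a purely signal-based criterion unreliable, which motivates the present disclosure.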
  • When determining whether a dialogue request utterance by a speaker is a dialogue request utterance directed to the voice dialogue device or to a person other than the speaker, a device that judges from the speaker's voice signal characteristics, as the voice recognition device disclosed in Patent Document 1 does, may make an erroneous judgment. For example, when the speaker speaks without intonation, the voice dialogue device may respond even though the dialogue request utterance was directed to a person other than the speaker.
  • Conversely, when the speaker speaks with intonation, a dialogue request utterance directed to the voice dialogue device may be judged to be directed to a person other than the speaker, and the device may wait for a response from that person.
  • This disclosure was made in order to solve the above-mentioned problems, and an object of the present disclosure is to provide a voice dialogue device capable of reducing erroneous determination, as compared with the conventional determination technique, of whether a dialogue request utterance by a speaker is a dialogue request utterance directed to the voice dialogue device or to a person other than the speaker.
  • The voice dialogue device according to the present disclosure includes: a voice acquisition unit that acquires spoken voice; a speaker identification unit that identifies a speaker based on the spoken voice acquired by the voice acquisition unit; a voice recognition unit that performs voice recognition on the spoken voice acquired by the voice acquisition unit; a dialogue request detection unit that detects a dialogue request utterance by a user based on information about the speaker identified by the speaker identification unit and the voice recognition result produced by the voice recognition unit; a response sign detection unit that, when the dialogue request detection unit detects a dialogue request utterance, detects a response sign by another user based on occupant status information indicating the status of that other user; a response detection unit that, when the response sign detection unit detects a response sign by the other user, determines whether or not an utterance by the other user is detected within a response determination time after the response sign is detected, based on the information about the speaker identified by the speaker identification unit and the voice recognition result produced by the voice recognition unit; and a dialogue target determination unit that determines, based on the detection result of whether or not the response sign detection unit has detected a response sign and the determination result of whether or not the response detection unit has detected an utterance by the other user, whether the dialogue request utterance detected by the dialogue request detection unit is directed to the voice dialogue device or to the other user.
  • According to the voice dialogue device of the present disclosure, it is possible to reduce erroneous determination, as compared with the conventional determination technique, of whether a dialogue request utterance by a speaker is a dialogue request utterance directed to the voice dialogue device or to a person other than the speaker.
  • FIGS. 4A and 4B are diagrams showing an example of the hardware configuration of the voice dialogue device according to the first embodiment.
  • FIG. 1 is a diagram showing a configuration example of the voice dialogue device 1 according to the first embodiment.
  • The voice dialogue device 1 is mounted on a vehicle.
  • The user of the voice dialogue device 1 is a vehicle occupant.
  • When there is a dialogue request utterance by a certain occupant (hereinafter referred to as the "dialogue request occupant") among the occupants present in the vehicle, the voice dialogue device 1 determines whether the dialogue request utterance is directed to the voice dialogue device 1 or to the other occupants.
  • A dialogue request utterance means an utterance made by the speaker in anticipation of a response from another party.
  • In other words, a dialogue request utterance means an utterance that another party needs to respond to.
  • A dialogue request utterance is, for example, an utterance such as "Hey" or "I wonder if there is a supermarket around here."
  • Here, the other party includes the other occupants and the voice dialogue device 1.
  • The voice dialogue device 1 determines whether the dialogue request utterance by the dialogue request occupant is directed to the voice dialogue device 1 or to the other occupants based on the detection result of whether or not a sign that another occupant will respond (hereinafter referred to as a "response sign") is detected, and on the determination result of whether or not an utterance by another occupant is detected within a predetermined time (hereinafter referred to as the "response determination time").
  • The details of how the voice dialogue device 1 determines whether the dialogue request utterance is directed to the voice dialogue device 1 or to the other occupants will be described later.
  • In the following, the determination performed by the voice dialogue device 1 of whether the dialogue request utterance is directed to the voice dialogue device 1 or to the other occupants is also referred to as the "dialogue target determination".
  • When the voice dialogue device 1 determines that the dialogue request utterance is a dialogue request utterance directed to the voice dialogue device 1, it generates a response to the dialogue request utterance and returns it.
  • The voice dialogue device 1 is mounted on, for example, an in-vehicle car navigation device installed on the dashboard of the vehicle.
  • The voice dialogue device 1 is connected to the microphone 2, the image pickup device 3, and the output device 4.
  • The microphone 2 collects utterance voices of the occupants sitting in the vehicle.
  • The microphone 2 is, for example, an array microphone including a plurality of omnidirectional microphones.
  • The microphone 2 is an array microphone, and the array microphone is installed on the upper part of the rearview mirror. Note that this is only an example, and the array microphone may be installed in a place other than the upper part of the rearview mirror. For example, the array microphone may be located in the center of the dashboard.
  • It suffices that the array microphone is capable of collecting spoken voices of the occupants sitting in the vehicle.
  • The array microphone outputs the collected utterance voice to the voice dialogue device 1.
  • The image pickup device 3 is installed in the vehicle and captures at least the face of each occupant seated in the vehicle.
  • The image pickup device 3 is installed, for example, on the dashboard or ceiling of the vehicle. It should be noted that this is only an example, and the image pickup device 3 may be installed at any place from which it can image at least the face of each occupant seated in the vehicle.
  • The image pickup device 3 outputs the captured image (hereinafter referred to as the "in-vehicle captured image") to the voice dialogue device 1.
  • The output device 4 is, for example, a speaker or a display device installed in the vehicle.
  • The output device 4 is mounted on, for example, the in-vehicle car navigation device installed on the dashboard of the vehicle.
  • The output device 4 outputs the response information output from the voice dialogue device 1.
  • The speaker outputs the response information by voice.
  • The display device displays the response information. The details of the response information output from the voice dialogue device 1 will be described later.
  • The voice dialogue device 1 includes a voice acquisition unit 11, a speaker identification unit 12, a voice recognition unit 13, a dialogue request detection unit 14, a state information acquisition unit 15, a response sign detection unit 16, a dialogue target determination unit 17, a response generation unit 18, and a response output unit 19.
  • The dialogue target determination unit 17 includes a response detection unit 171.
  • The voice acquisition unit 11 acquires the utterance voice collected by the array microphone.
  • The voice acquisition unit 11 outputs the acquired utterance voice to the speaker identification unit 12.
  • The speaker identification unit 12 identifies the speaker based on the utterance voice acquired by the voice acquisition unit 11. Specifically, the speaker identification unit 12 identifies the speaker together with the position of the speaker. For example, the speaker identification unit 12 analyzes the characteristics of the sound, such as by frequency analysis, with respect to the spoken voice acquired by the voice acquisition unit 11. The speaker identification unit 12 may analyze the spoken voice by using a known voice analysis technique. Then, the speaker identification unit 12 identifies the speaker based on the analysis result for the spoken voice.
  • Specifically, the speaker identification unit 12 identifies the direction of the sound source based on the analysis result of the spoken voice.
  • The speaker identification unit 12 may identify the direction of the sound source from the utterance voice acquired from the array microphone by using a known technique.
  • When the speaker identification unit 12 identifies the direction of the sound source, it identifies the occupant present in the identified direction as the speaker.
  • As described above, the array microphone is installed on the upper part of the rearview mirror. Therefore, for example, when the direction of the sound source is on the front right side of a straight line that passes through the center of the array microphone and is parallel to the direction of travel of the vehicle, the speaker identification unit 12 identifies the first occupant, in other words, the driver sitting in the driver's seat, as the speaker. On the other hand, when the direction of the sound source is on the front left side of that straight line, the speaker identification unit 12 identifies the second occupant, in other words, the occupant seated in the passenger seat, as the speaker. In the first embodiment, it is assumed that the vehicle is a right-hand-drive vehicle. Further, in the first embodiment, "parallel" is not limited to strictly "parallel", but also includes "substantially parallel".
  • The speaker identification unit 12 outputs information indicating the identified speaker (hereinafter referred to as "speaker information") to the voice recognition unit 13.
  • The speaker information is, for example, a seat ID associated with each seat.
  • The seat ID is set in advance.
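The left/right decision described above can be sketched as follows. This is a minimal illustration assuming a right-hand-drive vehicle; the signed-angle convention and seat IDs are assumptions for illustration, and a real system would estimate the sound-source direction with a known array-microphone technique.

```python
# Hypothetical sketch of speaker identification from the sound-source
# direction relative to a straight line through the center of the array
# microphone, parallel to the vehicle's direction of travel.
# Convention (assumed): positive angle = front right, negative = front left.

def identify_speaker(source_angle_deg: float) -> str:
    """Map an estimated sound-source direction to a seat ID (assumed IDs)."""
    if source_angle_deg > 0:
        return "seat_driver"      # first occupant: driver's seat (right side)
    return "seat_passenger"       # second occupant: passenger seat (left side)

assert identify_speaker(30.0) == "seat_driver"
assert identify_speaker(-15.0) == "seat_passenger"
```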
  • The voice recognition unit 13 performs voice recognition on the utterance voice acquired by the voice acquisition unit 11 and recognizes the utterance content.
  • The voice recognition unit 13 may recognize the utterance content by using an existing voice recognition technology.
  • The voice recognition unit 13 may acquire the utterance voice acquired by the voice acquisition unit 11 via the speaker identification unit 12.
  • The voice recognition unit 13 performs voice recognition on the spoken voice acquired by the voice acquisition unit 11 without receiving an instruction from the user to start voice recognition, such as pressing a button displayed on the display device.
  • The voice recognition result of the utterance content by the voice recognition unit 13 includes a character string indicating the utterance content.
  • The voice recognition unit 13 outputs information in which the voice recognition result of the utterance content and the information indicating the speaker are associated with each other, as voice-related information, to the dialogue request detection unit 14 and the dialogue target determination unit 17.
  • The information indicating the speaker is the information about the speaker identified by the speaker identification unit 12.
  • The dialogue request detection unit 14 detects a dialogue request utterance by the dialogue request occupant based on the voice-related information output from the voice recognition unit 13, in other words, based on the information about the speaker identified by the speaker identification unit 12 and the voice recognition result produced by the voice recognition unit 13. Specifically, the dialogue request detection unit 14 detects the dialogue request utterance by, for example, determining whether or not the utterance content based on the voice recognition result matches a preset utterance (hereinafter referred to as a "dialogue request determination utterance").
  • An utterance presumed to be a dialogue request utterance is set in advance as the dialogue request determination utterance.
  • The dialogue request determination utterance is, for example, "Hey" or "I wonder if there is a supermarket in this area".
  • When the utterance content based on the voice recognition result matches the dialogue request determination utterance, the dialogue request detection unit 14 determines that a dialogue request utterance has been detected. Then, the dialogue request detection unit 14 identifies which occupant made the dialogue request utterance based on the information indicating the speaker associated with the voice recognition result. That is, the dialogue request detection unit 14 identifies which occupant is the dialogue request occupant.
  • Alternatively, the dialogue request detection unit 14 may estimate the intention of the utterance from the utterance content based on the voice recognition result, and determine whether or not the estimated intention matches a preset utterance intention (hereinafter referred to as a "dialogue request determination intention").
  • The dialogue request detection unit 14 may estimate the intention of the utterance by using a known intention estimation technique.
  • An intention presumed to be the intention of a dialogue request utterance is set in advance as the dialogue request determination intention.
  • Examples of an intention presumed to be the intention of a dialogue request utterance include the intention of a facility search, such as a restaurant search or a gas station search.
  • When the estimated intention of the utterance matches the dialogue request determination intention, the dialogue request detection unit 14 determines that a dialogue request utterance has been detected. Then, the dialogue request detection unit 14 identifies which occupant made the dialogue request utterance based on the information indicating the speaker associated with the voice recognition result. That is, the dialogue request detection unit 14 identifies which occupant is the dialogue request occupant.
  • When the dialogue request detection unit 14 detects a dialogue request utterance, it outputs information regarding the detected dialogue request utterance (hereinafter referred to as "dialogue request utterance information") to the response sign detection unit 16.
  • The dialogue request utterance information is information in which the dialogue request utterance and the speaker information of the dialogue request occupant are associated with each other.
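The two detection strategies described above (matching against preset dialogue request determination utterances, and matching an estimated intention against preset dialogue request determination intentions) can be sketched as follows. The example sets, the `estimate_intent` stand-in, and all names are illustrative assumptions; a real system would use a known intention estimation technique.

```python
# Illustrative sketch of dialogue request detection. The preset utterances
# and intents below mirror the examples in the text but are assumptions.

DIALOGUE_REQUEST_UTTERANCES = {"hey", "i wonder if there is a supermarket in this area"}
DIALOGUE_REQUEST_INTENTS = {"restaurant_search", "gas_station_search"}

def estimate_intent(text: str) -> str:
    # Placeholder intent estimator; a real system would use a trained model.
    if "restaurant" in text:
        return "restaurant_search"
    if "gas station" in text:
        return "gas_station_search"
    return "unknown"

def detect_dialogue_request(utterance: str, speaker_id: str):
    """Return (dialogue request utterance, dialogue request occupant) or None."""
    text = utterance.strip().lower()
    if text in DIALOGUE_REQUEST_UTTERANCES:
        return (utterance, speaker_id)          # exact-match strategy
    if estimate_intent(text) in DIALOGUE_REQUEST_INTENTS:
        return (utterance, speaker_id)          # intent-match strategy
    return None

assert detect_dialogue_request("Hey", "seat_driver") == ("Hey", "seat_driver")
assert detect_dialogue_request("Is there a restaurant nearby", "seat_driver") is not None
assert detect_dialogue_request("Nice weather today", "seat_driver") is None
```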
  • The state information acquisition unit 15 acquires information on the state inside the vehicle (hereinafter referred to as "in-vehicle state information").
  • The state information acquisition unit 15 acquires the in-vehicle captured image as the in-vehicle state information from the image pickup device 3. Then, the state information acquisition unit 15 detects the state of each occupant based on the acquired in-vehicle state information and acquires information indicating the state of the occupant (hereinafter referred to as "occupant status information").
  • The state information acquisition unit 15 detects the state of the occupant by, for example, performing known image recognition processing on the in-vehicle captured image.
  • The state of the occupant includes the direction of the occupant's line of sight, the direction of the occupant's face, the occupant's facial expression, the occupant's emotion, the occupant's posture, the occupant's gesture, the degree of the occupant's mouth opening, and the like.
  • When the state information acquisition unit 15 detects the state of the occupant, it also detects the position of the occupant.
  • Here, the state information acquisition unit 15 acquires the in-vehicle captured image as the in-vehicle state information from the image pickup device 3, but this is only an example.
  • The state information acquisition unit 15 may acquire the voice recognition result of the utterance content as the in-vehicle state information from the voice recognition unit 13.
  • In this case, the state information acquisition unit 15 detects the state of the occupant based on the voice recognition result of the utterance content.
  • For example, the state information acquisition unit 15 detects the emotion that the occupant is surprised as the occupant's state.
  • The state information acquisition unit 15 may also acquire distance information as in-vehicle state information from a distance sensor (not shown). In this case, the state information acquisition unit 15 detects the state of the occupant based on the distance information.
  • For example, the state information acquisition unit 15 detects a forward-leaning posture as the occupant's state based on the distance information.
  • The state information acquisition unit 15 may acquire two or more of the in-vehicle captured image, the voice recognition result of the utterance content, and the distance information as the in-vehicle state information, and may detect the occupant's state by combining the methods described above.
  • The state information acquisition unit 15 outputs information about the occupant's line of sight, information about the direction of the occupant's face, information about the degree of the occupant's mouth opening, information about the occupant's facial expression, information about the occupant's emotion, information about the occupant's posture, or information about the occupant's gesture, and the like, in association with information that can identify the occupant, to the response sign detection unit 16 as occupant status information.
  • The occupant-identifiable information includes at least information indicating the position of the occupant.
  • The occupant status information may include at least one of the above-mentioned types of information.
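As a minimal sketch of how the occupant status information described above might be structured, the fields below follow the list in the text; the field names, types, and the 0-10 mouth-opening scale are assumptions for illustration.

```python
# Illustrative container for occupant status information. Every field except
# seat_id is optional because the text says the status information may
# include at least one of the listed types of information.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OccupantStatus:
    seat_id: str                             # information that can identify the occupant
    gaze_direction: Optional[str] = None     # direction of the line of sight
    face_direction: Optional[str] = None     # direction of the face
    mouth_opening: Optional[float] = None    # degree of mouth opening, e.g. 0-10
    facial_expression: Optional[str] = None
    emotion: Optional[str] = None
    posture: Optional[str] = None
    gesture: Optional[str] = None

status = OccupantStatus(seat_id="seat_passenger", mouth_opening=7.0, emotion="surprise")
assert status.seat_id == "seat_passenger"
assert status.gaze_direction is None
```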
  • When the dialogue request detection unit 14 detects a dialogue request utterance by the dialogue request occupant, the response sign detection unit 16 detects a response sign by another occupant based on the occupant status information indicating the status of that other occupant, from among the occupant status information indicating the status of each occupant output from the state information acquisition unit 15.
  • The response sign detection unit 16 may detect a response sign by the other occupant by using at least one of the information about the other occupant's line of sight, the information about the direction of the other occupant's face, the information about the degree of the other occupant's mouth opening, the information about the other occupant's facial expression, the information about the other occupant's emotion, the information about the other occupant's posture, or the information about the other occupant's gesture, which are included in the occupant status information. Since the response sign detection unit 16 can identify the dialogue request occupant based on the dialogue request utterance information output from the dialogue request detection unit 14, it can also identify the other occupants based on the dialogue request utterance information.
  • The response sign detected by the response sign detection unit 16 is, specifically, a state of the other occupant, a change in the state of the other occupant, or an action of the other occupant that suggests that the other occupant may respond to the dialogue request utterance.
  • Based on the occupant status information acquired from the state information acquisition unit 15, the response sign detection unit 16 determines whether the other occupant is in a preset state (hereinafter referred to as the "sign detection state"), whether there is a change in the state of the other occupant, or whether the other occupant has performed a preset action (hereinafter referred to as a "sign detection action").
  • When the response sign detection unit 16 determines that the other occupant is in the sign detection state, that the other occupant's state has changed, or that the other occupant has performed the sign detection action, the response sign detection unit 16 determines that a response sign by the other occupant has been detected.
  • The sign detection state means, for example, a state in which the degree of mouth opening is larger than a predetermined threshold value (hereinafter referred to as the "opening determination threshold value").
  • The sign detection state may also be, for example, a state expressing a predetermined emotion (hereinafter referred to as the "sign detection emotion"). The sign detection emotion is, for example, surprise.
  • A change in the occupant's state means, for example, a change in the degree of mouth opening, or a change in emotion or facial expression. For example, when the degree of mouth opening, defined in the range of 0 to 10, changes, or when the occupant's emotion or facial expression changes from an emotion or facial expression expressing "calmness" to one expressing "surprise", the state of the occupant is said to have changed.
  • The sign detection action means, for example, turning the line of sight, the direction of the face, or the posture toward the dialogue request occupant.
  • The sign detection action may also be, for example, performing a gesture such as leaning forward, nodding, or clapping the hands.
  • The response sign detection unit 16 may store the occupant status information acquired from the state information acquisition unit 15 in association with the acquisition date and time of that occupant status information, and may detect that there is a change in the state of the other occupant, or that the other occupant has performed the sign detection action, by referring to the past occupant status information.
  • The response sign detection unit 16 outputs the detection result of whether or not a response sign by the other occupant has been detected to the dialogue target determination unit 17.
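The response sign check described above can be sketched as follows: a sign is detected when the other occupant is in a sign detection state (e.g. mouth opening above the opening determination threshold, or a "surprise" emotion), when the state has changed relative to stored past occupant status information, or when a sign detection action is observed. The threshold value, field names, and action set are illustrative assumptions.

```python
# Hedged sketch of response sign detection. Occupant status is passed as a
# dict for brevity; all keys and values below are assumed for illustration.

OPENING_THRESHOLD = 5.0                       # assumed "opening determination threshold"
SIGN_EMOTIONS = {"surprise"}                  # assumed sign detection emotions
SIGN_ACTIONS = {"lean_forward", "nod", "clap"}  # assumed sign detection actions

def detect_response_sign(current: dict, previous) -> bool:
    # Sign detection state: mouth opening above threshold or a preset emotion.
    if current.get("mouth_opening", 0.0) > OPENING_THRESHOLD:
        return True
    if current.get("emotion") in SIGN_EMOTIONS:
        return True
    # Sign detection action: a gesture, or turning toward the requesting occupant.
    if current.get("gesture") in SIGN_ACTIONS or current.get("facing_requester"):
        return True
    # Change of state relative to the stored past occupant status information.
    if previous is not None and current.get("emotion") != previous.get("emotion"):
        return True
    return False

assert detect_response_sign({"mouth_opening": 7.0}, None)
assert not detect_response_sign({"mouth_opening": 2.0, "emotion": "calm"},
                                {"emotion": "calm"})
```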
  • The dialogue target determination unit 17 determines whether the dialogue request utterance detected by the dialogue request detection unit 14 is directed to the voice dialogue device 1 or to the other occupants. In other words, the dialogue target determination unit 17 performs the dialogue target determination.
  • Specifically, the dialogue target determination unit 17 determines whether or not the response sign detection unit 16 has output a detection result indicating that a response sign by the other occupant has been detected.
  • When the response sign detection unit 16 does not output a detection result indicating that a response sign by the other occupant has been detected, in other words, when it outputs a detection result indicating that no response sign by the other occupant has been detected, the dialogue target determination unit 17 determines that the dialogue request utterance is directed to the voice dialogue device 1.
  • When the response sign detection unit 16 outputs a detection result indicating that a response sign by the other occupant has been detected, the response detection unit 171 of the dialogue target determination unit 17 determines whether or not an utterance by the other occupant is detected within the response determination time after the response sign is detected. Specifically, the response detection unit 171 makes this determination based on the voice-related information output from the voice recognition unit 13, in other words, based on the information about the speaker identified by the speaker identification unit 12 and the voice recognition result produced by the voice recognition unit 13.
  • When the response sign detection unit 16 outputs the detection result indicating that the response sign has been detected, it attaches information on the detection time.
  • The response detection unit 171 may specify the time when the response sign was detected based on the information output from the response sign detection unit 16.
  • Alternatively, the response detection unit 171 may regard the time when the dialogue target determination unit 17 acquires, from the response sign detection unit 16, the information indicating that a response sign by the other occupant has been detected as the time when the response sign was detected.
  • When the response sign detection unit 16 outputs a detection result indicating that a response sign by the other occupant has been detected, and the response detection unit 171 detects an utterance by the other occupant within the response determination time after the response sign is detected, the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance directed to the other occupant. Even when the response sign detection unit 16 outputs a detection result indicating that a response sign by the other occupant has been detected, if the response detection unit 171 does not detect an utterance by the other occupant within the response determination time after the response sign is detected, the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance directed to the voice dialogue device 1.
  • When the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance directed to the voice dialogue device 1, it outputs information indicating that a response to the dialogue request utterance is required (hereinafter referred to as "response required information") to the response generation unit 18. The dialogue target determination unit 17 outputs the voice recognition result of the dialogue request utterance to the response generation unit 18 in association with the response required information.
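The dialogue target determination described above reduces to three cases: no response sign, the utterance is for the voice dialogue device; a response sign plus an utterance by the other occupant within the response determination time, it is for the other occupant; a response sign but no such utterance, it is for the voice dialogue device. A minimal sketch, with an assumed response determination time value and assumed names:

```python
# Sketch of the dialogue target determination logic. Times are seconds on a
# common clock; the 3.0 s response determination time is an assumed value.

RESPONSE_DETERMINATION_TIME = 3.0  # assumed value

def determine_dialogue_target(response_sign_detected: bool,
                              other_utterance_time,
                              sign_detected_time: float) -> str:
    # No response sign: the dialogue request utterance targets the device.
    if not response_sign_detected:
        return "voice_dialogue_device"
    # Response sign and an utterance by the other occupant within the
    # response determination time: it targets the other occupant.
    if (other_utterance_time is not None and
            other_utterance_time - sign_detected_time <= RESPONSE_DETERMINATION_TIME):
        return "other_occupant"
    # Response sign but no utterance in time: it targets the device after all.
    return "voice_dialogue_device"

assert determine_dialogue_target(False, None, 0.0) == "voice_dialogue_device"
assert determine_dialogue_target(True, 1.5, 0.0) == "other_occupant"
assert determine_dialogue_target(True, None, 0.0) == "voice_dialogue_device"
```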
  • When the response required information is output from the dialogue target determination unit 17, the response generation unit 18 generates response information regarding the response content to the dialogue request utterance. The response generation unit 18 outputs the generated response information to the response output unit 19.
  • The response output unit 19 outputs the response information to the output device 4.
  • The response output unit 19 may display the response information on the display device, or may output the response information by voice from the speaker.
  • FIG. 2 is a flowchart for explaining the operation of the voice dialogue device 1 according to the first embodiment.
  • the driver who is the first occupant makes a dialogue request utterance
  • the voice dialogue device 1 determines whether the dialogue request utterance is for the voice dialogue device 1 or the second occupant. It shall be determined whether it is for an occupant seated in a passenger seat. That is, in the following operation explanation, the dialogue request utterance occupant is the first occupant, and the other occupants are the second occupants.
  • the voice acquisition unit 11 acquires the utterance voice collected by the array microphone (step ST201).
  • the voice acquisition unit 11 outputs the acquired utterance voice to the speaker identification unit 12.
  • the speaker identification unit 12 identifies the speaker based on the utterance voice acquired by the voice acquisition unit 11 in step ST201 (step ST202).
  • the speaker identification unit 12 outputs the speaker information to the voice recognition unit 13.
  • the voice recognition unit 13 performs voice recognition on the utterance voice acquired by the voice acquisition unit 11 in step ST201, and recognizes the utterance content (step ST203).
  • the voice recognition unit 13 outputs information in which the voice recognition result of the utterance content and the information indicating the speaker are associated with each other as voice-related information to the dialogue request detection unit 14 and the dialogue target determination unit 17.
  • Based on the voice-related information output from the voice recognition unit 13 in step ST203, in other words, based on the information about the speaker specified by the speaker identification unit 12 and the voice recognition result produced by the voice recognition unit 13, the dialogue request detection unit 14 determines whether or not a dialogue request utterance by the dialogue request occupant has been detected. Here, the dialogue request detection unit 14 determines whether or not a dialogue request utterance by the first occupant has been detected (step ST204).
  • When the dialogue request detection unit 14 does not detect a dialogue request utterance by the first occupant ("NO" in step ST204), the operation of the voice dialogue device 1 returns to step ST201.
  • When the dialogue request detection unit 14 detects a dialogue request utterance by the first occupant ("YES" in step ST204), it outputs the dialogue request utterance information to the response sign detection unit 16. Then, the operation of the voice dialogue device 1 proceeds to step ST205.
  • The response sign detection unit 16 detects a response sign by another occupant based on the occupant state information indicating the state of the other occupant, out of the occupant state information for each occupant output from the state information acquisition unit 15. Here, the response sign detection unit 16 detects a response sign by the second occupant (step ST205). It is assumed that the state information acquisition unit 15 has acquired the occupant state information from the vehicle interior state information.
  • the response sign detection unit 16 outputs the detection result of whether or not the response sign of the other occupant, here, the second occupant, has been detected to the dialogue target determination unit 17.
  • The dialogue target determination unit 17 determines whether or not the response sign detection unit 16 has output, in step ST205, a detection result indicating that a response sign of the second occupant was detected (step ST206). When the dialogue target determination unit 17 determines in step ST206 that the response sign detection unit 16 has not output such a detection result, in other words, has output a detection result indicating that no response sign of the second occupant was detected ("NO" in step ST206), the operation of the voice dialogue device 1 proceeds to step ST208.
  • When the dialogue target determination unit 17 determines in step ST206 that the response sign detection unit 16 has output a detection result indicating that a response sign of the second occupant was detected ("YES" in step ST206), the response detection unit 171 of the dialogue target determination unit 17 determines whether or not an utterance by the second occupant is detected within the response determination time after the response sign was detected (step ST207).
  • When the response detection unit 171 determines in step ST207 that an utterance by the second occupant was detected within the response determination time after the response sign was detected ("YES" in step ST207), the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the second occupant (step ST209). Then, the voice dialogue device 1 ends the processing.
  • When the response detection unit 171 determines in step ST207 that no utterance by the second occupant was detected within the response determination time after the response sign was detected ("NO" in step ST207), the operation of the voice dialogue device 1 proceeds to step ST208.
  • When the dialogue target determination unit 17 determines in step ST206 that the response sign detection unit 16 has not output a detection result indicating that a response sign of the second occupant was detected ("NO" in step ST206), or when the response detection unit 171 determines in step ST207 that no utterance by the second occupant was detected within the response determination time after the response sign was detected ("NO" in step ST207), the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the voice dialogue device 1 (step ST208). The dialogue target determination unit 17 then outputs the response required information to the response generation unit 18, and the operation of the voice dialogue device 1 proceeds to step ST210.
  • the response generation unit 18 generates response information regarding the content of the response to the dialogue request utterance.
  • the response generation unit 18 outputs the generated response information to the response output unit 19.
  • the response output unit 19 outputs the response information to the output device 4 (step ST210).
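The decision flow of steps ST204 through ST210 can be sketched as follows. This is an illustrative Python reconstruction, not part of the patent disclosure; the function and constant names are hypothetical.

```python
VOICE_DIALOGUE_DEVICE = "voice dialogue device 1"
SECOND_OCCUPANT = "second occupant"

def determine_dialogue_target(response_sign_detected: bool,
                              utterance_within_time: bool) -> str:
    """Decide the target of a dialogue request utterance once one has
    been detected (step ST204 "YES").

    response_sign_detected -- result of step ST206
    utterance_within_time  -- result of step ST207 (utterance by the
                              second occupant within the response
                              determination time)
    """
    if response_sign_detected and utterance_within_time:
        # ST206 "YES" then ST207 "YES": request addressed to the occupant (ST209)
        return SECOND_OCCUPANT
    # ST206 "NO", or ST207 "NO": request addressed to the device (ST208)
    return VOICE_DIALOGUE_DEVICE
```

Only when both the response sign and a following utterance are observed is the request judged to be addressed to the second occupant; in every other case the device generates and outputs a response (steps ST208 and ST210).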
  • As described above, when a dialogue request utterance is detected, the voice dialogue device 1 detects a response sign by another occupant based on the occupant state information indicating the state of the other occupant. When the voice dialogue device 1 detects a response sign by the other occupant, it performs the dialogue target determination based on whether or not, from the voice-related information, an utterance by the other occupant is detected within the response determination time after the response sign was detected.
  • If the voice dialogue device 1 first tried to determine the dialogue target from the voice signal characteristics of the speaker, as in the above-mentioned conventional technique, it could misjudge whether the dialogue request utterance is directed at the voice dialogue device 1 or at another occupant. In contrast, the voice dialogue device 1 according to the first embodiment first determines the dialogue target depending on whether or not a response sign by another occupant is detected for the dialogue request utterance. As a result, the voice dialogue device 1 can perform the dialogue target determination with fewer erroneous determinations than when the determination is made from the voice signal characteristics of the speaker as in the prior art.
  • When the voice dialogue device 1 according to the first embodiment detects a response sign by another occupant, it performs the dialogue target determination based on whether or not an utterance by the other occupant is detected within the response determination time. Because the device first narrows down the determination by whether or not a response sign by the other occupant is detected, the subsequent determination based on whether or not an utterance is detected within the response determination time can further reduce erroneous determinations.
  • If the voice dialogue device 1 were to determine that a dialogue request utterance is directed at the voice dialogue device 1 even though it is actually directed at another occupant, the device would return a response to the dialogue request utterance. This means that an unexpected response would be made to the dialogue request speaker. Conversely, if the device were to determine that the utterance is directed at another occupant even though it is actually directed at the device, it would wait through the response determination time for an utterance by the other occupant, and the response to the dialogue request occupant would be delayed.
  • Since the voice dialogue device 1 first determines the dialogue target based on whether or not a response sign by another occupant is detected in response to the dialogue request utterance, it can perform the dialogue target determination with fewer erroneous determinations than the prior art. As a result, it can achieve both a reduction in the response delay for the dialogue request speaker and a reduction in responses from the voice dialogue device that the dialogue request speaker does not expect.
  • In the first embodiment described above, when the response sign detection unit 16 detects a response sign by another occupant, the dialogue target is determined based on whether or not an utterance by the other occupant is detected within the response determination time after the response sign was detected. However, this is only an example. When the response sign detection unit 16 detects a response sign by another occupant, the dialogue target determination unit 17 may instead perform the dialogue target determination depending on whether or not the other occupant's line of sight or face direction is directed toward the target device.
  • The target device is a device toward which an occupant is presumed to direct his or her line of sight or face when expecting a response from the voice dialogue device 1. The target device is, for example, the voice dialogue device 1 itself or a navigation device on which the voice dialogue device 1 is mounted. The target device may also be, for example, a speaker or a display device. Which device is set as the target device can be determined as appropriate.
  • The dialogue target determination unit 17 may acquire the occupant state information from the state information acquisition unit 15 and determine, based on that information, whether or not the other occupant's line of sight or face is directed toward the target device. Alternatively, the state information acquisition unit 15 or the response sign detection unit 16 may determine whether or not the other occupant's line of sight or face direction is directed toward the target device, and the dialogue target determination unit 17 may acquire the determination result.
  • In this case, instead of determining whether or not an utterance by the other occupant is detected within the response determination time after the response sign is detected, the dialogue target determination unit 17 determines whether or not the other occupant's line of sight or face is directed toward the target device. When, after the response sign is detected, the other occupant's line of sight or face is directed toward the target device, the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the voice dialogue device 1. When, after the response sign is detected, the other occupant's line of sight or face is not directed toward the target device, the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the other occupant. In this case, the voice dialogue device 1 may be configured without the response detection unit 171.
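The gaze-only modification described above can be sketched in the same illustrative style; the names are hypothetical and this is only an illustration of the described rule, not the patent's implementation.

```python
def determine_target_by_gaze(gaze_toward_target_device: bool) -> str:
    """Gaze-only rule: after a response sign by the other occupant is
    detected, the dialogue target is judged from the other occupant's
    line of sight or face direction alone, without waiting for the
    response determination time."""
    if gaze_toward_target_device:
        return "voice dialogue device 1"  # request addressed to the device
    return "other occupant"               # request addressed to the occupant
```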
  • The dialogue target determination unit 17 may also determine whether or not the other occupant's line of sight or face is directed toward the target device in addition to determining whether or not an utterance by the other occupant is detected within the response determination time after the response sign is detected. Specifically, in this case, when the other occupant's line of sight or face turns toward the target device after the response sign detection unit 16 detects the response sign by the other occupant and before the response determination time elapses, the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the voice dialogue device 1. If the other occupant's line of sight or face does not turn toward the target device after the response sign is detected and before the response determination time elapses, the dialogue target determination unit 17 determines the dialogue target depending on whether or not an utterance by the other occupant is detected within the response determination time.
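The combined modification, in which the gaze check is applied first and the utterance check serves as a fallback, can be sketched as follows (illustrative Python with hypothetical names, not part of the patent disclosure):

```python
def determine_target_combined(gaze_within_time: bool,
                              utterance_within_time: bool) -> str:
    """After a response sign is detected: gaze or face toward the
    target device before the response determination time elapses
    selects the device immediately; otherwise the presence or absence
    of an utterance within the response determination time decides."""
    if gaze_within_time:
        return "voice dialogue device 1"
    if utterance_within_time:
        return "other occupant"
    return "voice dialogue device 1"
```

Note that the fallback branch reproduces the original two-branch rule, so the gaze check only ever shortens the decision, never changes the fallback outcome.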
  • FIG. 3 is a flowchart for explaining the operation of the voice dialogue device 1 in the first embodiment when, after the response sign detection unit 16 detects a response sign of another occupant, the dialogue target determination unit 17 performs the dialogue target determination based on whether or not the other occupant's line of sight or face is directed toward the target device.
  • The operation of the voice dialogue device 1 shown in the flowchart of FIG. 3 differs from the operation shown in the flowchart of FIG. 2 in that step ST301 is performed instead of step ST207 of FIG. 2. That is, instead of determining whether or not an utterance by the other occupant is detected within the response determination time after the response sign was detected (see step ST207), the dialogue target determination unit 17 determines whether or not the other occupant's line of sight or face is directed toward the target device. Operations identical to those described in FIG. 2 are given the same step numbers, and duplicate explanations are omitted.
  • When the dialogue target determination unit 17 determines in step ST206 that the response sign detection unit 16 has output a detection result indicating that a response sign of the second occupant was detected ("YES" in step ST206), the dialogue target determination unit 17 determines whether or not the second occupant's line of sight or face is directed toward the target device (step ST301).
  • When it is determined in step ST301 that the second occupant's line of sight or face is not directed toward the target device ("NO" in step ST301), the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the second occupant (step ST209). Then, the voice dialogue device 1 ends the processing.
  • When it is determined in step ST301 that the second occupant's line of sight or face direction is directed toward the target device ("YES" in step ST301), the dialogue target determination unit 17 determines that the dialogue request utterance is a dialogue request utterance to the voice dialogue device 1 (step ST208). The dialogue target determination unit 17 outputs the response required information to the response generation unit 18. Then, the operation of the voice dialogue device 1 proceeds to step ST210.
  • When the dialogue target determination unit 17 determines the dialogue target based on both whether or not an utterance by the other occupant is detected within the response determination time after the response sign is detected and whether or not the other occupant's line of sight or face is directed toward the target device, the voice dialogue device 1 operates so that the case of "NO" in step ST301 of the flowchart of FIG. 3 is followed by step ST207 of FIG. 2.
  • As described above, the voice dialogue device 1 may determine the dialogue target depending on whether or not the other occupant's line of sight or face is directed toward the target device after a response sign by the other occupant is detected. In that case, compared with a configuration in which the other occupant's line of sight or face direction is not judged after the response sign is detected, the voice dialogue device 1 can determine the target of the dialogue request utterance without waiting through the response determination time, so the response delay to the dialogue request occupant can be reduced.
  • FIGS. 4A and 4B are diagrams showing an example of the hardware configuration of the voice dialogue device 1 according to the first embodiment.
  • The functions of the voice acquisition unit 11, the speaker identification unit 12, the voice recognition unit 13, the dialogue request detection unit 14, the state information acquisition unit 15, the response sign detection unit 16, the dialogue target determination unit 17, the response generation unit 18, and the response output unit 19 are realized by a processing circuit 401. That is, the voice dialogue device 1 includes the processing circuit 401 for controlling the determination, when a dialogue request utterance is made by a speaker, of whether the dialogue request utterance is a dialogue request utterance to the voice dialogue device 1 or a dialogue request utterance to a person other than the speaker.
  • the processing circuit 401 may be dedicated hardware as shown in FIG. 4A, or may be a CPU (Central Processing Unit) 405 that executes a program stored in the memory 406 as shown in FIG. 4B.
  • The processing circuit 401 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these.
  • When the processing circuit 401 is the CPU 405, the functions of the voice acquisition unit 11, the speaker identification unit 12, the voice recognition unit 13, the dialogue request detection unit 14, the state information acquisition unit 15, the response sign detection unit 16, the dialogue target determination unit 17, the response generation unit 18, and the response output unit 19 are realized by software, firmware, or a combination of software and firmware. That is, the functions of these units are realized by a processing circuit 401 such as the CPU 405 or a system LSI (Large-Scale Integration) that executes a program stored in an HDD (Hard Disk Drive) 402, a memory 406, or the like. It can also be said that the programs stored in the HDD 402, the memory 406, and the like cause a computer to execute the procedures or methods of the voice acquisition unit 11, the speaker identification unit 12, the voice recognition unit 13, the dialogue request detection unit 14, the state information acquisition unit 15, the response sign detection unit 16, the dialogue target determination unit 17, the response generation unit 18, and the response output unit 19.
  • The memory 406 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read Only Memory), or a magnetic disk, a flexible disk, an optical disk, a compact disc, a mini disc, a DVD (Digital Versatile Disc), or the like.
  • Some of the functions of the voice acquisition unit 11, the speaker identification unit 12, the voice recognition unit 13, the dialogue request detection unit 14, the state information acquisition unit 15, the response sign detection unit 16, the dialogue target determination unit 17, the response generation unit 18, and the response output unit 19 may be realized by dedicated hardware, and some may be realized by software or firmware. For example, the functions of the voice acquisition unit 11, the state information acquisition unit 15, and the response output unit 19 can be realized by the processing circuit 401 as dedicated hardware, while the functions of the speaker identification unit 12, the voice recognition unit 13, the dialogue request detection unit 14, the response sign detection unit 16, the dialogue target determination unit 17, and the response generation unit 18 can be realized by the processing circuit 401 reading and executing the program stored in the memory 406.
  • The voice dialogue device 1 also includes an input interface device 403 and an output interface device 404 that perform wired communication or wireless communication with devices such as the microphone 2, the image pickup device 3, and the output device 4.
  • the microphone 2 is an array microphone, and one array microphone is installed in the vehicle, but this is only an example.
  • the microphone 2 may be, for example, a directional microphone installed in each seat so as to collect spoken voice in each seat.
  • the speaker identification unit 12 identifies the speaker based on, for example, the position in the vehicle where the directional microphone in which the utterance voice is collected is installed. Specifically, the speaker identification unit 12 identifies, for example, an occupant seated in a seat in which a directional microphone in which utterance voices are collected is installed as a speaker.
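With per-seat directional microphones, the speaker identification performed by unit 12 reduces to a lookup from the microphone that collected the utterance voice to the seat it covers. The seat layout below is a hypothetical example for illustration only, not taken from the patent.

```python
# Hypothetical mapping from directional-microphone channel to the seat
# that the microphone is installed to cover.
MIC_CHANNEL_TO_SEAT = {
    0: "driver seat",
    1: "passenger seat",
    2: "rear left seat",
    3: "rear right seat",
}

def identify_speaker(active_mic_channel: int) -> str:
    """Identify as the speaker the occupant seated at the position of
    the directional microphone in which the utterance voice was
    collected."""
    return MIC_CHANNEL_TO_SEAT[active_mic_channel]
```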
  • In the above first embodiment, the voice dialogue device 1 is an in-vehicle device mounted on a vehicle, and the voice acquisition unit 11, the speaker identification unit 12, the voice recognition unit 13, the dialogue request detection unit 14, the state information acquisition unit 15, the response sign detection unit 16, the dialogue target determination unit 17, the response generation unit 18, and the response output unit 19 are provided in the voice dialogue device 1. Alternatively, some of these units may be mounted on the in-vehicle device of the vehicle and the others may be provided on a server connected to the in-vehicle device via a network, so that the in-vehicle device and the server constitute a voice dialogue system.
  • When there are a plurality of other occupants, the response sign detection unit 16 may detect a response sign by any of the plurality of other occupants. Further, the response detection unit 171 may be configured to detect an utterance by the other occupant whose response sign was detected.
  • the voice dialogue device 1 is mounted on the vehicle, and the user of the voice dialogue device 1 is the occupant of the vehicle, but this is only an example.
  • the voice dialogue device 1 may be installed in, for example, a living room, and the user of the voice dialogue device 1 may be a resident of the living room.
  • As described above, the voice dialogue device 1 according to the first embodiment is configured to include: the voice acquisition unit 11 that acquires the spoken voice; the speaker identification unit 12 that identifies the speaker based on the spoken voice acquired by the voice acquisition unit 11; the voice recognition unit 13 that performs voice recognition on the spoken voice acquired by the voice acquisition unit 11; the dialogue request detection unit 14 that detects a dialogue request utterance by the dialogue request user (dialogue request occupant) based on the information about the speaker specified by the speaker identification unit 12 and the voice recognition result produced by the voice recognition unit 13; the response sign detection unit 16 that, when the dialogue request detection unit 14 detects the dialogue request utterance, detects a response sign by another user (other occupant) based on the occupant state information indicating the state of the other user; the response detection unit 171 that, when the response sign detection unit 16 detects a response sign by the other user, determines, based on the information about the speaker specified by the speaker identification unit 12 and the voice recognition result produced by the voice recognition unit 13, whether or not an utterance by the other user is detected within the response determination time after the response sign is detected; and the dialogue target determination unit 17 that determines, based on the detection result of whether or not the response sign detection unit 16 has detected a response sign and the determination result of whether or not the response detection unit 171 has detected an utterance by the other user, whether the dialogue request utterance detected by the dialogue request detection unit 14 is directed at the voice dialogue device 1 or at the other user.
  • Therefore, in determining whether a dialogue request utterance by a speaker is a dialogue request utterance to the voice dialogue device or a dialogue request utterance to a person other than the speaker, the voice dialogue device 1 can reduce erroneous determinations compared with the conventional determination technique.
  • Since the voice dialogue device according to the present disclosure can perform the dialogue target determination with fewer erroneous determinations than the conventional technique that determines the dialogue target based on the voice signal characteristics of the speaker, it is suitable for application to a voice dialogue device that performs dialogue target determination.
  • 1 voice dialogue device, 2 microphone, 3 image pickup device, 4 output device, 11 voice acquisition unit, 12 speaker identification unit, 13 voice recognition unit, 14 dialogue request detection unit, 15 state information acquisition unit, 16 response sign detection unit, 17 dialogue target determination unit, 171 response detection unit, 18 response generation unit, 19 response output unit, 401 processing circuit, 402 HDD, 403 input interface device, 404 output interface device, 405 CPU, 406 memory.

Abstract

The invention relates to a voice interaction device comprising: a voice acquisition unit (11) that acquires spoken voice; a speaker identification unit (12) that identifies the speaker; a voice recognition unit (13) that performs voice recognition; an interaction request detection unit (14) that detects an interaction request utterance on the basis of the voice recognition result and information about the speaker; a response sign detection unit (16) that, when the interaction request detection unit (14) detects the interaction request utterance, detects a sign that another user will respond; a response detection unit (171) that, when the response sign detection unit (16) detects the response sign, determines, on the basis of the voice recognition result and the information about the speaker, whether an utterance by the other user has been detected within a response determination time; and an interaction target determination unit (17) that determines, on the basis of the detection result as to whether the response sign detection unit (16) detected the response sign and the determination result as to whether the response detection unit (171) detected an utterance by the other user, whether the interaction request utterance is directed at the voice interaction device (1) or at the other user.
PCT/JP2020/031359 2020-08-20 2020-08-20 Dispositif d'interaction vocale et procédé de détermination de cible d'interaction mis en œuvre dans un dispositif d'interaction vocale WO2022038724A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/031359 WO2022038724A1 (fr) 2020-08-20 2020-08-20 Dispositif d'interaction vocale et procédé de détermination de cible d'interaction mis en œuvre dans un dispositif d'interaction vocale

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/031359 WO2022038724A1 (fr) 2020-08-20 2020-08-20 Dispositif d'interaction vocale et procédé de détermination de cible d'interaction mis en œuvre dans un dispositif d'interaction vocale

Publications (1)

Publication Number Publication Date
WO2022038724A1 true WO2022038724A1 (fr) 2022-02-24

Family

ID=80323469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/031359 WO2022038724A1 (fr) 2020-08-20 2020-08-20 Dispositif d'interaction vocale et procédé de détermination de cible d'interaction mis en œuvre dans un dispositif d'interaction vocale

Country Status (1)

Country Link
WO (1) WO2022038724A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020080110A * 2018-11-14 2020-05-28 Honda Motor Co., Ltd. Control device, agent device, and program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020080110A * 2018-11-14 2020-05-28 Honda Motor Co., Ltd. Control device, agent device, and program

Similar Documents

Publication Publication Date Title
US11437020B2 (en) Techniques for spatially selective wake-up word recognition and related systems and methods
CN106796786B (zh) 语音识别系统
JP3910898B2 (ja) 指向性設定装置、指向性設定方法及び指向性設定プログラム
JP6977004B2 (ja) 車載装置、発声を処理する方法およびプログラム
JP4859982B2 (ja) 音声認識装置
JP5154363B2 (ja) 車室内音声対話装置
US11176948B2 (en) Agent device, agent presentation method, and storage medium
JP2017090611A (ja) 音声認識制御システム
CN112397065A (zh) 语音交互方法、装置、计算机可读存储介质及电子设备
JP2002091466A (ja) 音声認識装置
JP2017090612A (ja) 音声認識制御システム
JP6459330B2 (ja) 音声認識装置、音声認識方法、及び音声認識プログラム
KR20130046759A (ko) 차량에서 운전자 명령 인지장치 및 방법
JP6847324B2 (ja) 音声認識装置、音声認識システム、及び音声認識方法
JP2001013994A (ja) 複数搭乗者機器用音声制御装置、複数搭乗者機器用音声制御方法及び車両
WO2022038724A1 (fr) Dispositif d'interaction vocale et procédé de détermination de cible d'interaction mis en œuvre dans un dispositif d'interaction vocale
JP2008250236A (ja) 音声認識装置および音声認識方法
JP2018144534A (ja) 運転支援システムおよび運転支援方法並びに運転支援プログラム
CN109243457B (zh) 基于语音的控制方法、装置、设备及存储介质
JP2001296891A (ja) 音声認識方法および装置
WO2022176038A1 (fr) Dispositif de reconnaissance vocale et procédé de reconnaissance vocale
WO2022137534A1 (fr) Dispositif et procédé de reconnaissance vocale embarquée
JP2019197964A (ja) マイク制御装置
JP7407665B2 (ja) 音声出力制御装置および音声出力制御プログラム
WO2020240789A1 (fr) Dispositif de commande de l'interaction de la parole et procédé de commande de l'interaction de la parole

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20950289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20950289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP