WO2019150708A1 - Information processing device, information processing system, information processing method, and program - Google Patents

Information processing device, information processing system, information processing method, and program

Info

Publication number: WO2019150708A1
Authority: WO (WIPO / PCT)
Application number: PCT/JP2018/042409
Other languages: French (fr), Japanese (ja)
Prior art keywords: speaker, information processing, timing, nod, unit
Inventors: 富士夫 荒井, 祐介 工藤, 秀憲 青木, 元 濱田, 佐藤 直之, 邦在 鳥居
Original Assignee: ソニー株式会社 (Sony Corporation)
Application filed by ソニー株式会社 (Sony Corporation)
Publication of WO2019150708A1

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis or synthesis; Speech recognition; Speech or voice processing; Speech or audio coding or decoding
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 17/00: Speaker identification or verification

Definitions

  • The present disclosure relates to an information processing apparatus, an information processing system, an information processing method, and a program. More specifically, it relates to an information processing apparatus, an information processing system, an information processing method, and a program that perform processing and respond based on the speech recognition result of a user utterance.
  • Devices that perform such voice recognition include mobile devices such as smartphones, as well as smart speakers, agent devices, signage devices, and the like. In configurations using smart speakers, agent devices, signage devices, and so on, there are often many people around the device.
  • In such situations, the voice recognition device is required to identify the speaker (the speaking user) addressing the device and to provide a service corresponding to that speaker.
  • Patent Document 1: Japanese Patent Application Laid-Open No. 11-24691
  • Patent Document 2: Japanese Patent Application Laid-Open No. 2008-146054
  • For example, there is a method of identifying a speaker by analyzing the sound quality (frequency characteristics / voiceprint) of the voice input to the device and using the analysis result.
  • Patent Document 3 (Japanese Patent Laid-Open No. 08-235358) discloses a method of identifying a speaker by detecting mouth movement in an image captured by a camera or the like. However, this method cannot identify the speaker when more than one person is moving their mouth or when the speaker's mouth cannot be photographed.
  • Patent literature: JP-A-11-24691, JP-A-2008-146054, Japanese Patent Laid-Open No. 08-235358
  • The present disclosure has been made in view of the above-described problems, for example, and provides an information processing apparatus, an information processing system, an information processing method, and a program capable of identifying a speaker with high accuracy under various circumstances by using both sound and images.
  • The first aspect of the present disclosure is an information processing apparatus having a speaker identification unit that executes speaker identification processing,
  • wherein the speaker identification unit executes at least one of (a) speaker identification processing based on the result of comparing a speech recognition result for the uttered speech with a lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, and (b) speaker identification processing based on the analysis result of a motion of the speaker or a motion of a listener listening to the speaker's speech.
  • The second aspect of the present disclosure is an information processing system having an information processing terminal and a server,
  • wherein the information processing terminal has a voice input unit, an image input unit, and a communication unit that transmits the voice acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server,
  • and the server has a voice recognition unit that executes voice recognition processing on the speaker's voice received from the information processing terminal,
  • and an image recognition unit that analyzes the camera-captured image received from the information processing terminal.
  • The voice recognition unit generates a character string indicating the utterance content of the speaker and generates nod timing estimation data of the speaker and the listener based on the character string,
  • the image recognition unit generates nod timing actual data that records the nod timing of the nodding person included in the camera-captured image,
  • and the information processing system executes speaker identification processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  • The third aspect of the present disclosure is an information processing method executed in an information processing apparatus,
  • wherein the information processing apparatus has a speaker identification unit that executes speaker identification processing,
  • and the speaker identification unit executes at least one of (a) speaker identification processing based on the result of comparing a speech recognition result for the uttered speech with a lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, and
  • (b) speaker identification processing based on the analysis result of a motion of the speaker or a motion of a listener listening to the speaker's speech.
  • The fourth aspect of the present disclosure is an information processing method executed in an information processing system having an information processing terminal and a server, in which the information processing terminal transmits the voice acquired through its voice input unit and the camera-captured image acquired through its image input unit to the server; the server executes voice recognition processing on the speaker's voice received from the information processing terminal, generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data of the speaker and the listener based on the character string; the server also analyzes the camera-captured image received from the information processing terminal and generates nod timing actual data that records the nod timing of the nodding person included in the camera-captured image; and at least one of the information processing terminal and the server executes speaker identification processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  • The fifth aspect of the present disclosure is a program for executing information processing in an information processing apparatus,
  • wherein the information processing apparatus has a speaker identification unit that executes speaker identification processing,
  • and the program causes the speaker identification unit to execute at least one of
  • (a) speaker identification processing based on the result of comparing a speech recognition result for the uttered speech with a lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, and
  • (b) speaker identification processing based on the analysis result of a motion of the speaker or a motion of a listener listening to the speaker's speech.
  • The program of the present disclosure is, for example, a program that can be provided in a computer-readable format, via a storage medium or a communication medium, to an information processing apparatus or a computer system capable of executing various program codes.
  • By providing the program in a computer-readable format, processing corresponding to the program is realized on the information processing apparatus or the computer system.
  • Note that in this specification a "system" is a logical set of a plurality of devices, and the devices of each configuration are not necessarily in the same casing.
  • According to this configuration, the speaker identification unit executes (a) speaker identification processing that compares the speech recognition result for the uttered voice with the lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, or (b) speaker identification processing based on the analysis result of the motion of the speaker or the listener. For example, it compares the nod timing estimation data of the speaker or listener, estimated from the utterance character string obtained as the speech recognition result, with the nod timing actual data of the nodding person included in the camera-captured image.
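  • The following is a minimal Python sketch of the two identification routes (a) and (b) just described. The function names, the 0.8 threshold, and the 0.5-second tolerance are illustrative assumptions, not values taken from the patent text.

    # Sketch of the two speaker identification routes (a) and (b).
    from difflib import SequenceMatcher
    from typing import List, Optional

    def identify_speaker_route(asr_text: str,
                               lip_read_text: Optional[str],
                               estimated_nod_times: List[float],
                               observed_nod_times: Optional[List[float]],
                               threshold: float = 0.8) -> str:
        # Route (a): compare the speech recognition result with the lip reading result.
        if lip_read_text is not None:
            if SequenceMatcher(None, asr_text, lip_read_text).ratio() >= threshold:
                return "speaker: the person whose lips were read"
        # Route (b): compare nod timing estimated from the utterance text with the
        # nod timing actually observed in the camera image (0.5 s tolerance assumed).
        if observed_nod_times and estimated_nod_times:
            hits = sum(1 for t in estimated_nod_times
                       if any(abs(t - o) <= 0.5 for o in observed_nod_times))
            if hits / len(estimated_nod_times) >= threshold:
                return "speaker: the nodding person observed in the image"
        return "speaker not identified"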
  • The drawings include diagrams illustrating a configuration example and a usage example of the information processing apparatus and the processing performed by the information processing apparatus of the present disclosure.
  • They also include a diagram illustrating a configuration example of the information processing apparatus.
  • Further included are flowcharts illustrating the sequence of processing executed by the information processing apparatus and the sequence of the speech recognition processing executed by the information processing apparatus, and a diagram illustrating the nod timing analysis processing executed by the information processing apparatus.
  • Further included are a flowchart illustrating the sequence of the image recognition processing executed by the information processing apparatus, a diagram illustrating the several patterns of information required in the speaker identification processing of the present disclosure, sequence diagrams illustrating the sequences for executing the speaker identification processing using a server on the cloud side, and a diagram illustrating a configuration example of the information processing system.
  • Finally, a diagram illustrating a hardware configuration example of the information processing apparatus is included.
  • FIG. 1 is a diagram illustrating a processing example of the information processing apparatus 10, which recognizes and responds to a user utterance made by the speaker 1.
  • The information processing apparatus 10 executes processing based on the speech recognition result of the user utterance.
  • For example, the information processing apparatus 10 outputs the following system response.
  • System response: "Tomorrow in Osaka, the afternoon weather is fine, but there may be a shower in the evening."
  • The information processing apparatus 10 executes speech synthesis processing (TTS: Text To Speech) to generate and output the system response.
  • The information processing apparatus 10 generates and outputs the response using knowledge data acquired from a storage unit in the apparatus or knowledge data acquired via a network.
  • An information processing apparatus 10 illustrated in FIG. 1 includes a camera 11, a microphone 12, a display unit 13, and a speaker 14, and has a configuration capable of audio input / output and image input / output.
  • The information processing apparatus 10 illustrated in FIG. 1 is called, for example, a smart speaker or an agent device.
  • However, the information processing apparatus 10 according to the present disclosure is not limited to the agent device 10a and may take various device forms, such as a smartphone 10b, a PC 10c, or a signage device installed in a public place.
  • The information processing apparatus 10 recognizes the utterance of the speaker 1 and responds to it, and also controls external devices 30, such as the television and the air conditioner shown in FIG. 2, according to the user utterance. For example, when the user utterance is a request such as "change the TV channel to 1" or "set the air conditioner temperature to 20 degrees", the information processing apparatus 10 outputs a control signal (Wi-Fi, infrared light, etc.) to the external device 30 based on the voice recognition result of the user utterance and executes the control corresponding to the utterance.
  • The information processing apparatus 10 is also connected to the server 20 via a network and can acquire from the server 20 the information necessary for generating a response to the user utterance. The apparatus may also be configured so that the server performs the speech recognition processing and the semantic analysis processing.
  • An information processing device that responds to a user utterance, such as an agent device or a signage device, is required to identify the speaker addressing it and to provide a service corresponding to that speaker.
  • In the processing of the present disclosure, voice data and image data are used to identify the speaker.
  • When the information processing apparatus 10 can photograph the mouth of the speaker 31, the speaker's voice is input via the microphone 12 and the speaker's mouth is photographed by the camera 11.
  • Lip reading processing based on the movement of the mouth of the speaker 31 is then performed to analyze the utterance content.
  • The utterance content analysis result obtained by the lip reading processing is compared with the voice recognition result of the uttered voice data, and the speaker is identified based on the degree of coincidence.
  • By such processing, the information processing apparatus 10 can identify the speaker with high accuracy under various situations.
  • FIG. 4 is a diagram illustrating a configuration example of the information processing apparatus 10 that recognizes a user utterance and responds.
  • The information processing apparatus 10 includes a voice input unit 101, a voice recognition unit 102, an utterance meaning analysis unit 103, an image input unit 105, an image recognition unit 106, a speaker identification unit 111, a speaker analysis unit 112, a speaker data storage unit 113, a storage unit (user database, etc.) 114, a response generation unit 121, a system utterance speech synthesis unit 122, a display image generation unit 123, an output (speech, image) control unit 124, a speech output unit 125, and an image output unit 126. Note that all of these components can be configured inside a single information processing apparatus 10, but a part of the configuration and functions may be provided in another information processing apparatus or an external server.
  • The user's uttered voice and ambient sounds are input to the voice input unit 101, such as a microphone.
  • The voice input unit (microphone) 101 passes the voice data, including the input user uttered voice, to the voice recognition unit 102.
  • The voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function and converts the voice data into a character string (text data) composed of a plurality of words.
  • The text data generated in the voice recognition unit 102 is input to the utterance meaning analysis unit 103.
  • The voice recognition unit 102 also generates, based on the speech recognition result, estimation data of the nod timing of the speaker and the listener. This processing will be described in detail later.
  • The utterance meaning analysis unit 103 selects and outputs user intention candidates included in the text.
  • The utterance meaning analysis unit 103 has a natural language understanding function such as NLU (Natural Language Understanding) and, from the text data, estimates the intention (intent) of the user utterance and the entities (meaningful elements) included in the utterance, thereby performing the utterance meaning analysis processing.
  • Based on this analysis, the information processing apparatus 10 can perform accurate processing on the user utterance.
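  • As an illustrative assumption (not taken from the patent text), the intent and entity structure produced by the utterance meaning analysis unit 103 for an utterance such as "What will the weather be like tomorrow afternoon in Osaka?" might look as follows.

    # Hypothetical intent / entity analysis result for a weather question.
    analysis_result = {
        "intent": "CheckWeather",        # user intention (intent)
        "entities": {                    # meaningful elements (entities)
            "date": "tomorrow",
            "time_of_day": "afternoon",
            "location": "Osaka",
        },
    }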
  • The response generation unit 121 generates a response to the user based on the utterance meaning analysis result estimated by the utterance meaning analysis unit 103.
  • The response is composed of at least one of sound and image.
  • The voice information generated by the system utterance voice synthesis unit 122 through speech synthesis processing (TTS: Text To Speech) is output via the voice output unit 125, such as a speaker.
  • The display image information generated by the display image generation unit 123 is output via the image output unit 126, such as a display. The output of sound and image is controlled by the output (sound, image) control unit 124.
  • The image output unit 126 is, for example, a display such as an LCD or organic EL display, or a projector that performs projection display.
  • The information processing apparatus 10 can also output and display images on externally connected devices such as televisions, smartphones, PCs, tablets, AR (Augmented Reality) devices, VR (Virtual Reality) devices, and other home appliances.
  • The image input unit 105 is a camera and inputs images of the speaker and the listener.
  • The image recognition unit 106 receives the captured image from the image input unit 105 and performs image analysis. For example, it performs lip reading processing by analyzing the movement of the speaker's lips and generates a character string corresponding to the utterance content of the speaker from the result. Furthermore, the nod timing of the speaker or listener is acquired from the image. Details of these processes will be described later.
  • The speaker identification unit 111 executes the speaker identification processing.
  • The speaker identification unit 111 receives the following data: (1) the speech recognition result (character string) for the speaker's utterance generated by the speech recognition unit 102, together with the nod timing estimation information of the speaker and the listener; and (2) the estimated utterance content (character string) based on the lip reading result of the speaker acquired by the image recognition unit 106, together with the nod timing information of the speaker or listener.
  • The speaker identification unit 111 receives each of the above data and executes the speaker identification processing using at least one of these pieces of information. A specific example of the speaker identification processing will be described later.
  • The speaker analysis unit 112 analyzes the external characteristics (age, sex, body shape, etc.) of the speaker identified by the speaker identification unit 111, searches for the speaker in the user database registered in advance in the storage unit 114, and performs identification processing.
  • The speaker data storage unit 113 executes processing for storing the speaker's information and the utterance content in the storage unit (user database, etc.) 114.
  • In the storage unit (user database, etc.) 114, in addition to a user database consisting of speaker information such as age, gender, and body type for each user, the utterance content of each speaker is recorded in association with each user (speaker).
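  • A minimal sketch, assuming a simple in-memory structure, of one record in such a user database is shown below; the field names are illustrative and not taken from the patent text.

    # Hypothetical record in the user database held in the storage unit 114.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SpeakerRecord:
        user_id: str
        age: Optional[int] = None        # estimated age
        gender: Optional[str] = None     # estimated gender
        body_type: Optional[str] = None  # estimated body shape
        utterances: List[str] = field(default_factory=list)  # accumulated utterance contents

        def add_utterance(self, text: str) -> None:
            self.utterances.append(text)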
  • The processing shown in the flow of FIG. 5 and subsequent flows can be executed in accordance with a program stored in the storage unit of the information processing apparatus 10, for example as program execution processing by a processor such as a CPU having a program execution function.
  • Step S101: First, in step S101, the speech recognition unit 102 performs speech recognition processing on the voice input from the voice input unit 101.
  • The voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function and converts the voice data into a character string (text data) composed of a plurality of words.
  • Step S102: Next, the speech recognition unit 102 acquires a character string that is the speech recognition result of the speaker's utterance.
  • The speech recognition unit 102 generates a character string of the utterance content acquired as the speech recognition result and outputs it to the speaker identification unit 111.
  • Step S103: In step S103, the speech recognition unit 102 generates estimation data of the nod timing of the speaker and the listener based on the character string that is the speech recognition result of the speaker. Details of this processing will be described later.
  • Step S104: Next, the image recognition unit 106, which receives the captured image from the image input unit 105, analyzes the image and searches for a person estimated to be the speaker. Specifically, it searches for a person whose lips are moving.
  • Step S105: Step S105 determines whether the search for the person estimated to be the speaker (the person moving their lips) based on the image in step S104 has succeeded. If the search for the estimated speaker succeeds, the process proceeds to step S111. If the search for the estimated speaker fails, the process proceeds to step S121.
  • Step S111: When the search for the person estimated to be the speaker (the person moving their lips) based on the image succeeds, in step S111 the image recognition unit 106 analyzes the lip movement of the estimated speaker and executes lip reading recognition processing. The image recognition unit 106 generates a character string of the utterance content based on the lip reading recognition processing and outputs the character string to the speaker identification unit 111.
  • Steps S112 to S113: The processing in step S112 is executed by the speaker identification unit 111.
  • The speaker identification unit 111 receives two character strings: (1) the character string of the utterance content generated by the voice recognition unit 102 based on voice recognition, and (2) the character string of the utterance content generated by the image recognition unit 106 based on the lip reading recognition processing.
  • The speaker identification unit 111 compares these two character strings.
  • (Step S113 = Yes) If the match rate between the speech recognition result character string and the lip reading recognition result character string is equal to or higher than a predetermined threshold, the process proceeds to step S131, and the estimated speaker (the person moving their lips) detected based on the image in step S104 is identified as the speaker.
  • (Step S113 = No) If the match rate is below the threshold, the estimated speaker (the person moving their lips) detected based on the image in step S104 is determined not to be the speaker, and the process returns to step S104 to search for an estimated speaker again.
  • In that case, the estimated speaker detected based on the image in step S104 is excluded from the search targets, and only other persons are searched. A sketch of this character string comparison is shown below.
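  • The following is a minimal Python sketch of the comparison in steps S112 to S113, assuming a simple similarity ratio; the patent does not specify a particular matching algorithm, and the 0.8 threshold is an illustrative assumption.

    # Compare the ASR character string with the lip reading character string.
    from difflib import SequenceMatcher

    MATCH_THRESHOLD = 0.8  # illustrative threshold

    def string_match_rate(asr_text: str, lip_text: str) -> float:
        """Return a 0.0 to 1.0 match rate between the two character strings."""
        return SequenceMatcher(None, asr_text, lip_text).ratio()

    def is_speaker(asr_text: str, lip_text: str) -> bool:
        # Step S113: Yes if the match rate is at or above the threshold.
        return string_match_rate(asr_text, lip_text) >= MATCH_THRESHOLD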
  • Step S121: The processing in step S121 is executed by the image recognition unit 106.
  • The image recognition unit 106, which receives the captured image from the image input unit 105, analyzes the image and searches for a person who is nodding. When a nodding person is found, the image recognition unit 106 generates nod time-series data, that is, a "nod timing time table", and outputs it to the speaker identification unit 111.
  • The "nod timing time table" will be described in detail later.
  • Steps S122 to S123: The processing of step S122 is executed by the speaker identification unit 111.
  • The speaker identification unit 111 compares the following two data: (1) the nod timing estimation data of the speaker and the listener estimated by the speech recognition unit 102 in step S103 based on the character string that is the speech recognition result of the speaker; and (2) the "nod timing time table" generated by the image recognition unit 106 in step S121 using the captured image from the image input unit 105.
  • (Step S123 = (a) coincides with the speaker's estimated nod timing) If the match rate between the time stamps recorded in the speaker's nod timing estimation data, estimated by the speech recognition unit 102 from the speech recognition result character string, and the time stamps recorded in the "nod timing time table" generated by the image recognition unit 106 from the captured image is equal to or higher than a predetermined threshold, the process proceeds to step S131, and the "nodding person" detected based on the image in step S121 is identified as the speaker.
  • (Step S123 = (b) coincides with the listener's estimated nod timing) If the match rate between the time stamps recorded in the listener's nod timing estimation data, estimated by the speech recognition unit 102 from the speech recognition result character string, and the time stamps recorded in the "nod timing time table" generated by the image recognition unit 106 from the captured image is equal to or higher than the predetermined threshold, the process proceeds to step S124.
  • (Step S123 = coincides with neither) If the "nod timing time table" matches neither the speaker's nor the listener's nod timing estimation data, the nodding person detected based on the image in step S121 is determined to be neither the speaker nor a listener, and the process returns to step S104 to search for an estimated speaker again.
  • In that case, the nodding person detected based on the image in step S121 is excluded from the search targets, and a search targeting only other persons is executed.
  • Step S124: Step S124 is executed when the listener's nod timing estimation data, estimated by the speech recognition unit 102 in step S123 from the character string that is the speech recognition result of the speaker, substantially matches the "nod timing time table" generated by the image recognition unit 106 using the captured image from the image input unit 105.
  • In step S124, the speaker identification unit 111 determines that the nodding person detected from the image is a listener, acquires information on the listener's line of sight and head orientation from the image recognition unit 106, and in step S131 identifies the person located in that direction as the speaker.
  • Step S131: Step S131 is the speaker identification processing executed by the speaker identification unit 111.
  • The speaker identification unit 111 executes one of the following three types of speaker identification processing.
  • (1) Step S113 = Yes: The match rate between the speech recognition result character string and the lip reading recognition result character string is equal to or higher than the predetermined threshold. In this case, the estimated speaker (the person moving their lips) detected based on the image in step S104 is identified as the speaker.
  • (2) Step S123 = (a) coincides with the speaker's estimated nod timing: The speaker's nod timing estimation data, estimated by the speech recognition unit 102 from the character string that is the speech recognition result, substantially matches the "nod timing time table" generated by the image recognition unit 106 using the captured image from the image input unit 105. In this case, the "nodding person" detected based on the image in step S121 is identified as the speaker.
  • (3) Step S123 = (b) coincides with the listener's estimated nod timing: The listener's nod timing estimation data substantially matches the "nod timing time table" generated using the captured image from the image input unit 105.
  • In this case, the nodding person detected from the image is determined to be a listener, information on the listener's line of sight and head orientation is acquired from the image recognition unit 106, and the person ahead of that line of sight and head direction is identified as the speaker. A sketch of this gaze-based selection is shown below.
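  • The following Python sketch illustrates, under assumed 2D positions and an assumed gaze direction vector supplied by the image recognition unit 106, how the person ahead of the listener's line of sight could be selected; none of these names or data formats are specified in the patent text.

    # Pick the candidate closest to the listener's gaze / head direction.
    import math
    from typing import Dict, Optional, Tuple

    def person_in_gaze_direction(listener_pos: Tuple[float, float],
                                 gaze_dir: Tuple[float, float],
                                 candidates: Dict[str, Tuple[float, float]]) -> Optional[str]:
        gx, gy = gaze_dir
        gnorm = math.hypot(gx, gy) or 1.0
        best_id, best_cos = None, -1.0
        for person_id, (px, py) in candidates.items():
            vx, vy = px - listener_pos[0], py - listener_pos[1]
            vnorm = math.hypot(vx, vy) or 1.0
            cos = (gx * vx + gy * vy) / (gnorm * vnorm)  # angular closeness to the gaze
            if cos > best_cos:
                best_id, best_cos = person_id, cos
        return best_id  # identified as the speaker in step S131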
  • Step S132: Finally, in step S132, the speaker identification unit 111 analyzes the characteristics of the speaker after the speaker identification processing. The analysis result for the speaker and the utterance content are then stored in the storage unit (user database, etc.) 114 via the speaker data storage unit 113.
  • The data stored in the storage unit (user database, etc.) 114 include, for example, the external features (age, gender, body type, etc.) of the speaker analyzed from the image acquired by the image input unit 105 and the utterance content generated by the voice recognition unit 102.
  • As described above, the speech recognition unit 102 performs speech recognition processing on the voice input from the voice input unit 101 to generate a character string indicating the utterance content, and generates nod timing estimation data for the speaker and the listener based on the character string that is the voice recognition result.
  • The flow shown in FIG. 6 is a flowchart showing the procedure of these processes. The processing of each step will be described sequentially.
  • Step S201: First, in step S201, the voice recognition unit 102 receives the speaker's voice from the voice input unit 101.
  • Step S202: Next, the voice recognition unit 102 converts the speaker's voice input from the voice input unit 101 into a character string (text data) composed of a plurality of words.
  • By this processing, the utterance character string 251 shown in FIG. 6 is generated. This utterance character string 251 is input to the speaker identification unit 111.
  • Step S203: In step S203, the speech recognition unit 102 generates estimation data of the speaker's nod timing based on the character string that is the speech recognition result of the speaker. Through this processing, the speaker nod timing estimation time table 252 shown in FIG. 6 is generated. The speaker nod timing estimation time table 252 is input to the speaker identification unit 111.
  • Step S204: Next, the voice recognition unit 102 generates estimation data of the listener's nod timing based on the character string that is the voice recognition result of the speaker. This processing generates the listener nod timing estimation time table 253 shown in FIG. 6. The listener nod timing estimation time table 253 is input to the speaker identification unit 111.
  • FIG. 7 shows a character string generated by the speech recognition unit 102 based on the uttered speech.
  • Character string: "Hello, it's good weather today. By the way, did you watch that movie? It's popular now, and every movie theater seems to be full."
  • The voice recognition unit 102 estimates the nod timing of the speaker and the listener for the utterance corresponding to this character string.
  • Specifically, the voice recognition unit 102 estimates the nod timing of the speaker and the listener from the timing of the punctuation marks at which the speech pauses.
  • In general, the speaker often "nods" along with his or her own speech; for example, a nod accompanying one's own utterance tends to occur at the position immediately before a punctuation mark in the utterance sentence. The listener, on the other hand, often "nods" in response to the speaker's utterance, and such nods frequently occur immediately after a punctuation mark in the utterance.
  • Using this characteristic, the voice recognition unit 102 sets the speaker's estimated nod timing immediately before each punctuation mark and the listener's estimated nod timing immediately after each punctuation mark.
  • By this estimation processing, the speech recognition unit 102 generates the speaker nod timing estimation time table 252, in which the nod timing is set immediately before each punctuation mark, and the listener nod timing estimation time table 253, in which the nod timing is set immediately after each punctuation mark.
  • Each nod timing estimation time table is a table in which the times estimated to be nod timings are sequentially recorded as "time stamps".
  • The voice recognition unit 102 obtains the time information from the clock of the information processing apparatus 10 or via a network and generates a time table in which the times estimated to be nod timings are recorded as "time stamps".
  • The speaker nod timing estimation time table 252 is thus a table in which the time information (time stamps) estimated to be nod timings is recorded in time series. A sketch of this punctuation-based estimation is shown below.
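  • The following Python sketch illustrates the punctuation-based estimation in steps S203 to S204, under the assumption that a rough utterance time is available for each character of the recognized string; this alignment data and the function names are assumptions for illustration only.

    # Estimate speaker / listener nod time stamps from punctuation positions.
    from typing import List, Tuple

    PUNCTUATION = set("、。,.!?")

    def estimate_nod_timetables(text: str,
                                char_times: List[float]) -> Tuple[List[float], List[float]]:
        """char_times[i] is the (estimated) utterance time of text[i]."""
        speaker_table: List[float] = []   # table 252: just before punctuation
        listener_table: List[float] = []  # table 253: just after punctuation
        for i, ch in enumerate(text):
            if ch in PUNCTUATION:
                if i > 0:
                    speaker_table.append(char_times[i - 1])
                if i + 1 < len(text):
                    listener_table.append(char_times[i + 1])
        return speaker_table, listener_table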
  • FIG. 9 is a flowchart explaining the specific processing sequence of the image recognition processing. The processing of each step will be described sequentially.
  • Step S301: In step S301, the image recognition unit 106 receives the captured image from the image input unit 105.
  • Step S302: Next, the image recognition unit 106 analyzes the image and searches for a person estimated to be the speaker. Specifically, it searches for a person whose lips are moving.
  • Step S303: Step S303 determines whether the search for the person estimated to be the speaker (the person moving their lips) based on the image in step S302 has succeeded. If the search for the estimated speaker succeeds, the process proceeds to step S304. If the search for the estimated speaker fails, the process proceeds to step S311.
  • Step S304: If the search for the person estimated to be the speaker (the person moving their lips) based on the image succeeds, in step S304 the image recognition unit 106 analyzes the lip movement of the estimated speaker and executes lip reading recognition processing. The image recognition unit 106 generates an utterance character string 351, which is a character string indicating the utterance content, based on the lip reading recognition processing. The utterance character string 351 generated based on the lip reading recognition processing is input to the speaker identification unit 111.
  • Step S311: In step S311, the image recognition unit 106 analyzes the captured image from the image input unit 105 and searches for a person who is nodding. If a nodding person is found, the image recognition unit 106 generates nod time-series data, that is, the "nod timing time table 352", and outputs it to the speaker identification unit 111.
  • The nod timing time table 352 is a table in which the actual nod timing of the nodding person in the image is recorded. That is, it is nod timing actual data.
  • The table configuration is the same as that described above.
  • The "nod timing time table 352" generated by the image recognition unit 106 is a table in which the nod timing observed in the image is recorded, that is, the actual times at which the nods were performed.
  • In contrast, the nod timing estimation time tables 252 and 253 generated by the speech recognition unit 102 are tables in which the estimated times at which nods are expected to be performed, derived from the character string indicating the utterance content, are recorded. This is the processing described above. A sketch of how the two kinds of tables can be compared is shown below.
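  • The following Python sketch shows one way the estimated time stamps (tables 252 and 253) could be compared with the observed time stamps (table 352); the 0.5-second tolerance window is an illustrative assumption, since the patent only requires a degree of coincidence between the two.

    # Fraction of estimated nod time stamps matched by an observed nod.
    from typing import List

    def nod_match_rate(estimated: List[float],
                       observed: List[float],
                       tolerance_s: float = 0.5) -> float:
        if not estimated:
            return 0.0
        matched = sum(
            1 for t in estimated
            if any(abs(t - o) <= tolerance_s for o in observed)
        )
        return matched / len(estimated)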
  • FIG. 10 shows a table summarizing the minimum necessary elements for speaker identification.
  • Pattern 1 is a pattern in which the speaker is identified by lip reading recognition of the speaker. In pattern 1, lip reading recognition of the speaker is required, nod recognition is unnecessary for both the speaker and the listener, and recognition of the listener's gaze is also unnecessary.
  • Pattern 2 is a pattern in which the speaker is identified by recognition of the speaker's nods. In pattern 2, lip reading recognition of the speaker is unnecessary, nod recognition of the speaker is required while nod recognition of the listener is not, and recognition of the listener's gaze is unnecessary.
  • Pattern 3 is a pattern in which the speaker is identified by the listener's nod recognition and line-of-sight information. In pattern 3, lip reading recognition of the speaker is unnecessary, nod recognition of the speaker is not required while nod recognition of the listener is required, and recognition of the listener's gaze is necessary.
  • For any of patterns 1 to 3 in FIG. 10, the accuracy of speaker identification can be improved by additionally using the items marked as not required. For example, accuracy can be improved by using not only the speaker's nod timing but also the listener's nod timing and the listener's gaze and head direction information.
  • For example, the utterance rhythm may be acquired from the speech, and if the rhythm of a subject's head or body movement obtained from the image matches that rhythm, the subject is identified as the speaker.
  • Alternatively, a character string is acquired from the speech, and the vibration pattern of the skin of the head, which depends on the vowels uttered, is acquired from the image; the speaker is identified when the vowels of the character string and the skin vibration pattern match.
  • Speech break information may also be acquired from the voice and blink timing from the image, and the speaker is identified by matching them.
  • The volume, pitch, and speed of the voice may also be compared with the body movement obtained from the image:
  • the volume of the voice is compared with the magnitude of the movement, the pitch of the voice with the raising and lowering of the face, and the speed of the voice with the speed of the body movement, and a person who shows a similar tendency is identified as the speaker.
  • Furthermore, a person who expresses the same emotion as the emotion predicted from the utterance may be detected by image recognition; that person is identified as the listener, and the person ahead of that person's line of sight or head direction is identified as the speaker.
  • Note that a part of the processing of the above-described embodiment may be executed using a device (server) on the cloud side.
  • For example, some or all of the voice recognition processing and nod timing generation processing by the voice recognition unit 102, the lip reading recognition processing and nod detection processing by the image recognition unit 106, the verification of the character string recognition results, the nod timing comparison processing, and so on may be executed in a device (server) on the cloud side.
  • FIG. 11 shows a processing sequence in which the processing for identifying the speaker by voice recognition and lip reading recognition is executed through processing involving communication between the information processing apparatus 10 and the server 20.
  • FIG. 12 shows a processing sequence in which the processing for identifying the speaker using nod timing is executed through processing involving communication between the information processing apparatus 10 and the server 20.
  • Step S401: First, in step S401, the information processing apparatus 10 transmits a voice recognition request to the server 20 together with the acquired voice.
  • Step S402: Next, in step S402, the server 20 transmits the voice recognition result to the information processing apparatus 10.
  • The API used for the voice recognition processing in steps S401 to S402 is, for example, the following API.
  • GET /api/v1/recognition/audio: This API starts voice recognition and acquires the recognition result (text character string).
  • The data required for this API request is the audio data (audio stream data) input by the information processing apparatus 10 via the voice input unit 101.
  • The API response, which is the response from the server 20, includes the voice recognition result character string (speech (string)), which is the result data of the voice recognition processing executed by the voice recognition processing unit on the server 20 side. A sketch of such a request is shown below.
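  • The following Python sketch illustrates what a client call to this endpoint might look like. The endpoint path follows the text above, but the host name, payload format, response field access, and use of the "requests" library are assumptions for illustration only.

    # Hypothetical client call for steps S401 to S402.
    import requests

    SERVER = "https://example-server"  # hypothetical server address

    def request_speech_recognition(audio_stream: bytes) -> str:
        resp = requests.get(
            f"{SERVER}/api/v1/recognition/audio",
            data=audio_stream,  # audio stream data from the voice input unit 101
            headers={"Content-Type": "application/octet-stream"},
        )
        resp.raise_for_status()
        return resp.json()["speech"]  # voice recognition result character string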
  • Step S403: Next, in step S403, the information processing apparatus 10 transmits a lip reading recognition request to the server 20 together with the acquired image.
  • Step S404: Next, in step S404, the server 20 transmits the lip reading recognition result to the information processing apparatus 10.
  • The API used for the lip reading recognition processing in steps S403 to S404 is, for example, the following API.
  • GET /api/v1/recognition/lip: This API starts lip reading recognition and acquires the recognition result (text character string) and a speaker identifier (ID).
  • The API response, which is the response from the server 20, includes the lip reading recognition character string (lip (string)), which is the result data of the lip reading recognition processing executed by the lip reading recognition processing unit on the server 20 side, and the speaker ID (speaker-id (string)).
  • Note that the server 20 holds a user database in which user IDs and user feature information are recorded in association with each other.
  • The server 20 identifies the user based on the image data received from the information processing apparatus 10 and provides the user identifier (ID) to the information processing apparatus 10. For an unregistered user, however, a result indicating that the user is unknown is returned.
  • Step S405: Next, in step S405, the information processing apparatus 10 transmits to the server 20 a request to compare the voice recognition result (character string) with the lip reading recognition result (character string).
  • Step S406: Next, in step S406, the server 20 transmits the comparison result (match rate) between the voice recognition result (character string) and the lip reading recognition result (character string) to the information processing apparatus 10.
  • The API used for the comparison processing between the voice recognition result (character string) and the lip reading recognition result (character string) in steps S405 to S406 is, for example, the following API.
  • GET /api/v1/recognition/audio/comparison: This API compares the character string of the voice recognition result with the character string of the lip reading recognition result and acquires the match result.
  • The data required for this API request are the data that the information processing apparatus 10 has received from the server 20:
  • the voice recognition result character string (speech (string)) and the lip reading recognition result character string (lip (string)).
  • The API response, which is the response from the server 20, includes the character string match rate (concordance-rate (integer)), which is the result data of the character string comparison processing executed by the character string comparison unit on the server 20 side.
  • Step S407: Next, in step S407, the speaker identification unit 111 of the information processing apparatus 10 executes speaker identification processing based on the character string match rate (concordance-rate (integer)) received from the server 20. That is, when the match rate between the voice recognition result character string (speech (string)) and the lip reading recognition result character string (lip (string)) is equal to or higher than a predetermined threshold, the estimated speaker (the person moving their lips) detected based on the image is identified as the speaker.
  • Step S408: Next, in step S408, the information processing apparatus 10 transmits a speaker information acquisition request to the server 20.
  • Step S409: Next, in step S409, the server 20 transmits the speaker information to the information processing apparatus 10.
  • The API used for the speaker information acquisition processing in steps S408 to S409 is, for example, the following API.
  • The data required for this API request are the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105 and the speaker ID (speaker-id (string)) received from the server 20.
  • The API response, which is the response from the server 20, includes the user information corresponding to the speaker ID acquired from the user database on the server 20 side. Specifically, it consists of the following information about the speaker: gender (sex (string)), age (age (integer)), height (integer), and physical characteristics (string). These are data registered in the user database on the server 20 side.
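  • As an illustrative assumption (the patent lists only the field meanings, not a concrete payload), the speaker information returned in step S409 might look like the following Python dictionary.

    # Hypothetical speaker information payload for step S409.
    speaker_info = {
        "speaker-id": "speaker-0001",  # hypothetical ID
        "sex": "female",               # gender (sex (string))
        "age": 34,                     # age (age (integer))
        "height": 160,                 # height (integer), in cm (assumed unit)
        "physical": "slim",            # physical characteristics (string)
    }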
  • Step S410: Next, in step S410, the information processing apparatus 10 transmits a speaker information registration request to the server 20.
  • Step S411: Next, in step S411, the server 20 transmits a speaker information registration completion notification to the information processing apparatus 10.
  • The API used for the speaker information registration processing in steps S410 to S411 is, for example, the following API: POST /api/v1/recognition/speakers/{speaker-id}
  • The data required for the API request are the speaker ID (speaker-id (string)) received by the information processing apparatus 10 from the server 20, the voice recognition result character string (speech (string)) received from the server 20,
  • and the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
  • The API response, which is the response from the server 20, includes a speaker registration ID (registered-speaker-id (string)).
  • By storing the following information about the speaker in the database, the speaker can be identified at the time of the next utterance content analysis.
  • A Face photograph
  • Voice quality In addition, by storing the following data, it is possible to perform recognition according to the person's eyelid at the time of recognition, and to improve recognition accuracy.
  • C Mouth movement characteristics
  • Utterance sound characteristics e
  • Utterance contents trends of utterance contents
  • the server 20 registers the speaker information (a) to (i) in the data heart.
  • Step S421: First, in step S421, the information processing apparatus 10 transmits a nod timing estimation data generation request to the server 20 together with the acquired voice.
  • Step S422: Next, in step S422, the server 20 transmits voice-based nod timing estimation time tables to the information processing apparatus 10.
  • The API used for the nod timing estimation time table generation processing in steps S421 to S422 is, for example, the following API.
  • GET /api/v1/recognition/audio/nodes: This API analyzes the character string obtained from the voice recognition result and acquires a nod timing estimation time table (a time stamp list of the times at which nods are estimated) and the ID of the nodding person.
  • The data required for this API request is the audio data (audio stream data) input by the information processing apparatus 10 via the voice input unit 101.
  • The API response, which is the response from the server 20, includes the voice-based nod timing estimation time tables acquired by the voice recognition processing unit on the server 20 side by analyzing the voice recognition result (character string).
  • The voice-based nod timing estimation time tables are the following two tables: the speaker nod timing estimation time table (speaker-nods (string-array)) and the listener nod timing estimation time table (listener-nodes (string-array)). These are generated in the voice recognition unit of the server 20 in accordance with the processing described above; that is, the expected nod timings are determined from the voice recognition result.
  • Step S423: Next, in step S423, the information processing apparatus 10 transmits a nod timing data generation request to the server 20 together with the acquired image.
  • Step S424: Next, in step S424, the server 20 transmits an image-based nod timing time table to the information processing apparatus 10.
  • The API used for the nod timing time table generation processing in steps S423 to S424 is, for example, the following API.
  • GET /api/v1/recognition/nodes: This API analyzes the image and acquires a nod timing time table (a time stamp list of the times at which nods were performed) and the ID of the nodding person.
  • The data required for the API request is the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
  • The API response, which is the response from the server 20, includes the image-based nod timing time table (nod-timings (string-array)) acquired by the image recognition processing unit on the server 20 side by analyzing the image, and the ID of the nodding person.
  • As described above, the server 20 holds the user database in which user IDs and user feature information are recorded in association with each other, identifies the user based on the image data received from the information processing apparatus 10, and provides the user identifier (ID) to the information processing apparatus 10. For an unregistered user, however, a result indicating that the user is unknown is returned.
  • Step S425: Next, in step S425, the information processing apparatus 10 transmits to the server 20 a request to compare the two voice-based nod timing estimation time tables received in step S422, namely (A1) the voice-based speaker nod timing estimation time table and (A2) the voice-based listener nod timing estimation time table, with (V1) the image-based nod timing time table received in step S424.
  • Step S426: Next, in step S426, the server 20 transmits to the information processing apparatus 10 the comparison results (match rates) between the two voice-based nod timing estimation time tables and the image-based nod timing time table.
  • The API used for the nod timing table comparison processing in steps S425 to S426 is, for example, the following API.
  • GET /api/v1/recognition/nodes/comparison: This API compares the nod timings of the two voice-based nod timing estimation time tables with the nod timings of the image-based nod timing time table and acquires the match results.
  • The data required for this API request are the following data that the information processing apparatus 10 has received from the server 20:
  • (A1) the voice-based speaker nod timing estimation time table (speaker-nod (string-array)) received in step S422,
  • (A2) the voice-based listener nod timing estimation time table (listener-nod (string-array)), that is, the two nod timing estimation time tables, and
  • (V1) the image-based nod timing time table (nod-timing (string-array)) received in step S424.
  • The API response, which is the response from the server 20, includes the result data of the nod timing time table comparison processing executed by the speaker identification unit on the server 20 side:
  • the speaker nod match rate (speaker-concordance-rate (integer)) and
  • the listener nod match rate (listener-concordance-rate (integer)).
  • The speaker-concordance-rate is the match rate (%) obtained by comparing the nod timings of (A1) the voice-based speaker nod timing estimation time table (speaker-nod (string-array)) and (V1) the image-based nod timing time table (nod-timing (string-array)).
  • The listener-concordance-rate is the match rate (%) obtained by comparing the nod timings of (A2) the voice-based listener nod timing estimation time table (listener-nod (string-array)) and (V1) the image-based nod timing time table (nod-timing (string-array)).
  • Step S427: Next, in step S427, the speaker identification unit 111 of the information processing apparatus 10 executes speaker identification processing based on the speaker nod match rate (speaker-concordance-rate (integer)) and the listener nod match rate (listener-concordance-rate (integer)) received from the server 20.
  • If the speaker nod match rate is equal to or higher than a predetermined threshold, (V1) the nodding person detected from the image for which the image-based nod timing time table (nod-timing (string-array)) was generated is determined to be the speaker.
  • If the listener nod match rate is equal to or higher than the predetermined threshold, (V1) the nodding person detected from the image for which the image-based nod timing time table (nod-timing (string-array)) was generated is determined to be a listener. In this case, the processing of steps S428 to S430 is further executed to identify the speaker.
  • Note that if the speaker is identified in step S427, the processing of steps S428 to S430 can be omitted. That is, when the speaker-concordance-rate (integer) is equal to or greater than the predetermined threshold and the speaker is identified in step S427, steps S428 to S430 may be omitted.
  • Steps S428 to S430: The processing of steps S428 to S430 is executed when the listener nod match rate (%) (listener-concordance-rate (integer)) is equal to or greater than the predetermined threshold in step S427, that is, when (V1) the nodding person detected from the image for which the image-based nod timing time table (nod-timing (string-array)) was generated is determined to be a listener.
  • First, in step S428, the information processing apparatus 10 transmits a listener gaze point detection request to the server. This is a request to detect the person (speaker) ahead of the line of sight of the nodding person (listener) detected from the image.
  • Step S429: Next, in step S429, the server 20 transmits to the information processing apparatus 10 the detection information of the person (speaker) ahead of the line of sight of the nodding person (listener) detected from the image. This detection is performed using the image data received from the information processing apparatus 10.
  • Step S430: Next, in step S430, the speaker identification unit 111 of the information processing apparatus 10 determines, based on the detection information received from the server 20 about the person (speaker) ahead of the line of sight of the nodding person (listener) detected from the image, that the person ahead of the line of sight is the speaker. A sketch of this decision logic is shown below.
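  • The following Python sketch summarizes the decision made in steps S427 to S430, assuming the two match rates returned in step S426 and a single illustrative threshold of 80 percent; the function and variable names are not taken from the patent text.

    # Decision logic for steps S427 to S430.
    from typing import Optional

    THRESHOLD = 80  # percent; illustrative

    def decide_speaker(speaker_rate: int,
                       listener_rate: int,
                       nod_performer_id: str,
                       gazed_person_id: Optional[str]) -> Optional[str]:
        if speaker_rate >= THRESHOLD:
            # The nodding person matches the speaker's estimated nod timing.
            return nod_performer_id
        if listener_rate >= THRESHOLD and gazed_person_id is not None:
            # The nodding person is a listener; the speaker is the person the
            # listener is looking at (steps S428 to S430).
            return gazed_person_id
        return None  # neither; search for an estimated speaker again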
  • Step S431: The following processing in steps S431 to S434 is the same as the processing in steps S408 to S411 described above with reference to FIG. 11.
  • In step S431, the information processing apparatus 10 transmits a speaker information acquisition request to the server 20.
  • Step S432: Next, in step S432, the server 20 transmits the speaker information to the information processing apparatus 10.
  • The response from the server 20 includes the user information corresponding to the speaker ID acquired from the user database on the server 20 side. Specifically, it consists of the following information about the speaker: gender (sex (string)), age (age (integer)), height (integer), and physical characteristics (string). These are data registered in the user database on the server 20 side.
  • Step S433: Next, in step S433, the information processing apparatus 10 transmits a speaker information registration request to the server 20.
  • Step S434: Next, in step S434, the server 20 transmits a speaker information registration completion notification to the information processing apparatus 10.
  • The registration information stored as the speaker information includes the following data:
  • (a) face photograph, (b) voice quality, (c) mouth movement characteristics, (d) utterance characteristics, (e) utterance contents (trends of utterance contents),
  • (f) gender, (g) age, (h) height, and (i) physical characteristics.
  • The server 20 registers the speaker information (a) to (i) in the database.
  • the processing described with reference to FIGS. 11 and 12 includes voice recognition processing, nod timing generation processing, lip reading recognition processing, nod detection processing, character string recognition result verification, nod timing verification processing, and the like. It is a processing sequence when it is set as the structure performed in the apparatus (server) by the side of a cloud.
  • agent device First, a usage example as an agent device will be described. If there are multiple people in front of the agent device, and one of them speaks to the agent device, the agent device captures the lip movement of the speaker with a camera and uses lip reading to convert the utterance into a string. Convert. At the same time, the voice input using a microphone is converted into a character string using voice recognition. By comparing these character strings, it can be determined whether or not the person who captured the movement of the lips is a speaker.
  • Furthermore, the agent device can infer the speaker's characteristics such as gender, age, and physical build, and can give more appropriate answers. For example, if the speaker is recognized as a child, it can answer in an easy-to-understand tone tailored to the child, or it can recommend a restaurant serving generous portions to a young, well-built man.
  • the following processing is also possible by storing the utterance feature information such as the utterance content of the speaker and the manner of speaking in the database together with the face photograph.
  • By inputting this utterance feature information to the speech recognition or lip-reading recognition engine, recognition tailored to the speaker can be performed, and the recognition accuracy can be improved.
  • Speaker identification is also effective in preventing the agent device from malfunctioning in response to its own utterances or to speech from a TV or radio.
  • By specifying the speaker according to the processing of the present disclosure, the following processing can also be performed.
  • The age and gender can be estimated from the appearance of the speaker, and a display tailored to the speaker can be presented.
  • In vending machines, it is possible to recommend products tailored to the speaker.
  • It is also possible to record both the utterance content and the appearance information of a person who showed interest in the displayed content or a purchased product, or who expressed an opinion, which is useful for data collection.
  • In a shop or the like, by using a security camera together with the above-described speaker specifying process, it is possible to record a customer's utterances about products in the store together with the characteristics of the speaker.
  • In a nursery school, a child's utterance contents can be easily reported to parents, and the burden on nursery school staff can be reduced.
  • In addition, parents can check the utterance contents of their child at any time. At school, by counting the number of utterances of each student, it can be determined whether a lesson is one in which only some students speak or one in which many students speak.
  • the speaker specifying process of the present disclosure can be used in various fields.
  • According to the processing of the present disclosure, a speaker can be specified from among an unspecified number of persons even if the speaker is not registered in advance, and the speaker can be specified even when the movement of the speaker's lips cannot be photographed. By specifying the speaker, the following effects can be obtained.
  • (a) Acquisition of external feature information such as the speaker's age and gender makes it possible to provide a service tailored to the speaker.
  • (b) It is possible to improve speech recognition and lip-reading recognition accuracy by recording the utterance contents and habits of the speaker and using them to train the recognition functions.
  • (c) By providing a service only to a specific speaker, it becomes possible to give a specific person administrator authority.
  • (d) By keeping the characteristics and content of utterances as log data, the data can be used for market research, entertainment purposes, and the discovery of conspiracies.
  • FIG. 13 illustrates an example of a system configuration for executing the processing of the present disclosure.
  • Information processing system configuration example 1 (FIG. 13A) is a configuration in which almost all the functions of the information processing apparatus shown in FIG. 4 are implemented in a single apparatus, the information processing apparatus 410, which is a user terminal such as a smartphone or PC owned by the user, or an agent device having voice input/output and image input/output functions.
  • The information processing apparatus 410 corresponding to the user terminal communicates with the application execution server 420 only when an external application is used, for example, when generating a response sentence.
  • The application execution server 420 is, for example, a weather information providing server, a traffic information providing server, a medical information providing server, a tourist information providing server, or the like, and is configured by a server group that can provide information for generating a response to a user utterance.
  • FIG. 13B shows information processing system configuration example 2, in which some of the functions of the information processing apparatus shown in FIG. 4 are implemented in the information processing apparatus 410, an information processing terminal such as a smartphone, PC, or agent device owned by the user, and the remaining functions are executed by a data processing server 460 that can communicate with the information processing terminal.
  • For example, a configuration is possible in which only the audio input unit 101, the image input unit 105, the audio output unit 125, and the image output unit 126 of the apparatus shown in FIG. 4 are provided on the information processing apparatus 410 side (the information processing terminal side), and all other functions are provided on the server side.
  • For example, the audio input/output units and the image input/output units are arranged in an information processing terminal such as a user terminal, and the data processing server performs the speaker specifying process based on the voice and images received from the user terminal.
  • Alternatively, the server may generate the information necessary for the speaker identification process and provide it to the information processing terminal, which then performs the speaker identification process itself. Various such configurations are possible.
  • As one example, the information processing terminal has: a voice input unit; an image input unit; and a communication unit that transmits the sound acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server.
  • The server has: a speech recognition unit that performs speech recognition processing of the speech of the speaker received from the information processing terminal; and an image recognition unit that performs analysis of the camera-captured image received from the information processing terminal.
  • The voice recognition unit generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data for the speaker and the listener based on the character string.
  • The image recognition unit generates nod timing actual data that records the nod timing of the nod performer included in the camera-captured image.
  • A speaker identification process is then executed based on the degree of coincidence between the nod timing estimation data and the actual nod timing data.
  • Such an information processing system can be configured.
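The sketch below illustrates this terminal/server division of labor with both roles reduced to plain functions in one process; the data formats, function names, and matching tolerance are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# --- server side -----------------------------------------------------------
def server_identify_speaker(
    asr: Callable[[bytes], Tuple[str, List[float], List[float]]],
    analyze_image: Callable[[bytes], Dict[str, List[float]]],
    audio: bytes,
    image: bytes,
) -> str:
    """Run ASR and image analysis, then pick the person whose observed nod
    timings best match the estimated speaker nod timings."""
    _text, est_speaker_nods, _est_listener_nods = asr(audio)
    actual = analyze_image(image)  # person_id -> observed nod times (seconds)

    def score(times: List[float]) -> int:
        return sum(any(abs(t - e) < 0.3 for e in est_speaker_nods) for t in times)

    return max(actual, key=lambda pid: score(actual[pid]))

# --- terminal side (stub sensors, for illustration only) -------------------
fake_asr = lambda audio: ("tell me the weather", [1.2, 3.4], [1.4, 3.6])
fake_analyze = lambda image: {"person_A": [1.25, 3.45], "person_B": [0.2, 2.0]}
print(server_identify_speaker(fake_asr, fake_analyze, b"", b""))  # -> person_A
```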
  • The division of functions between the information processing terminal side, such as the user terminal, and the server side can be configured in various different ways, and a configuration in which a single function is executed on both sides is also possible.
  • FIG. 14 shows an example of the hardware configuration of the information processing apparatus described above with reference to FIG. 4, and is also an example of the hardware configuration of an information processing apparatus constituting the data processing server 460 described with reference to FIG. 13.
  • a CPU (Central Processing Unit) 501 functions as a control unit or a data processing unit that executes various processes according to a program stored in a ROM (Read Only Memory) 502 or a storage unit 508. For example, processing according to the sequence described in the above-described embodiment is executed.
  • a RAM (Random Access Memory) 503 stores programs executed by the CPU 501 and data.
  • the CPU 501, ROM 502, and RAM 503 are connected to each other by a bus 504.
  • the CPU 501 is connected to an input / output interface 505 via a bus 504.
  • An input unit 506 including various switches, a keyboard, a mouse, a microphone, and a sensor, and an output unit 507 including a display and a speaker are connected to the input / output interface 505.
  • the CPU 501 executes various processes in response to a command input from the input unit 506 and outputs a processing result to the output unit 507, for example.
  • the storage unit 508 connected to the input / output interface 505 includes, for example, a hard disk and stores programs executed by the CPU 501 and various data.
  • a communication unit 509 functions as a transmission / reception unit for Wi-Fi communication, Bluetooth (BT) communication, and other data communication via a network such as the Internet or a local area network, and communicates with an external device.
  • the drive 510 connected to the input / output interface 505 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory such as a memory card, and executes data recording or reading.
  • the technology disclosed in this specification can take the following configurations.
  • An information processing apparatus including a speaker specifying unit that executes at least one of: (a) a speaker identification process based on a comparison result between a speech recognition result for an uttered speech and a lip-reading recognition result obtained by analyzing the utterance from the lip movement of the speaker; and (b) a speaker identification process based on an analysis result of the speaker's motion or of the motion of a listener listening to the speaker's speech.
  • The information processing apparatus according to (1) or (2), in which, when the lip movement of the speaker cannot be detected from the camera-captured image, the speaker specifying unit executes (b) the speaker identification process based on the analysis result of the speaker's motion or of the motion of the listener listening to the speaker's speech.
  • The information processing apparatus according to any one of (1) to (3), in which the speaker specifying unit executes a comparison process between nod timing estimation data of the speaker or listener, estimated based on the utterance character string obtained from the speech recognition result for the uttered speech, and actual nod timing data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is a speaker or a listener.
  • The information processing apparatus according to (4), in which the speaker specifying unit determines that the nod performer is a speaker if the degree of coincidence between the speaker nod timing estimation data estimated based on the utterance character string obtained from the speech recognition result and the actual nod timing data is high, and determines that the nod performer is a listener if the degree of coincidence between the listener nod timing estimation data estimated based on the utterance character string and the actual nod timing data is high.
  • The information processing apparatus according to (5), in which, when it is determined that the nod performer is a listener, the speaker specifying unit determines that a person in the line of sight of the nod performer is the speaker.
  • The information processing apparatus according to any one of (1) to (6), further including a speech recognition unit that executes lip-reading recognition processing for analyzing the utterance from the lip movement of the speaker and speech recognition processing for the uttered speech, in which the speaker specifying unit receives the speech recognition result and the lip-reading recognition result generated by the speech recognition processing unit and executes the speaker specifying process.
  • The information processing apparatus according to (7), in which the voice recognition unit further generates nod timing estimation data of the speaker and the listener based on the utterance character string obtained from the speech recognition result for the uttered speech, and the speaker specifying unit executes a comparison process between the nod timing estimation data of the speaker and the listener generated by the voice recognition processing unit and the actual nod timing data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is a speaker or a listener.
  • The information processing apparatus according to any one of (1) to (8), further including an image recognition unit that analyzes, from the camera-captured image, the motion of the speaker or the motion of the listener who is listening to the speech of the speaker, in which the speaker specifying unit receives the analysis information generated by the image recognition processing unit and executes the speaker specifying process.
  • The information processing apparatus according to (9), in which the image recognition unit generates a time table recording the nod timing of the nod performer included in the camera-captured image, and the speaker specifying unit executes the speaker specifying process using the nod timing time table generated by the image recognition processing unit.
  • The information processing apparatus according to any one of (1) to (10), in which the information processing apparatus includes a speech recognition unit that performs speech recognition processing on the uttered speech and an image recognition unit that performs analysis processing of a captured image of at least one of the speaker and the listener; the voice recognition unit estimates the speaker's nod timing and the listener's nod timing based on the character string obtained by executing the speech recognition processing for the uttered speech, and generates a speaker nod timing estimation time table and a listener nod timing estimation time table recording the estimated nod timing data; the image recognition unit generates a nod timing time table recording the nod timing of the nod performer included in the camera-captured image; and the speaker specifying unit determines that the nod performer is a speaker when the degree of coincidence between the speaker nod timing estimation time table and the nod timing time table is high, and determines that the nod performer is a listener when the degree of coincidence between the listener nod timing estimation time table and the nod timing time table is high.
  • An information processing system having an information processing terminal and a server, in which the information processing terminal has a voice input unit, an image input unit, and a communication unit that transmits the audio acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server;
  • the server has a speech recognition unit that performs speech recognition processing of the speech of the speaker received from the information processing terminal, and an image recognition unit that performs analysis of the camera-captured image received from the information processing terminal;
  • the voice recognition unit generates a character string indicating the utterance content of the speaker and generates nod timing estimation data of the speaker and the listener based on the character string;
  • the image recognition unit generates nod timing actual data that records the nod timing of the nod performer included in the camera-captured image; and
  • the information processing system executes speaker identification processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  • An information processing method executed in an information processing apparatus having a speaker identification unit that executes speaker identification processing, in which the speaker specifying unit executes at least one of: (a) a speaker identification process based on a comparison result between a speech recognition result for an uttered speech and a lip-reading recognition result obtained by analyzing the utterance from the lip movement of the speaker; and (b) a speaker identification process based on an analysis result of the speaker's motion or of the motion of a listener listening to the speaker's speech.
  • An information processing method executed in an information processing system having an information processing terminal and a server, in which the server executes voice recognition processing of the speaker's voice from the voice received from the information processing terminal, generates a character string indicating the utterance content of the speaker, generates nod timing estimation data of the speaker and the listener based on the character string, performs analysis of the camera-captured image received from the information processing terminal, and generates nod timing actual data that records the nod timing of the nod performer included in the camera-captured image, and speaker identification processing is executed based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  • A program for causing an information processing apparatus having a speaker identification unit to execute information processing, the program causing the speaker specifying unit to execute at least one of:
  • (a) a speaker identification process based on a comparison result between a speech recognition result for an uttered speech and a lip-reading recognition result obtained by analyzing the utterance from the lip movement of the speaker; and
  • (b) a speaker identification process based on an analysis result of the speaker's motion or of the motion of a listener listening to the speaker's speech.
  • the series of processes described in the specification can be executed by hardware, software, or a combined configuration of both.
  • For example, the program recording the processing sequence can be installed in a memory of a computer incorporated in dedicated hardware and executed, or the program can be installed and executed on a general-purpose computer capable of executing various kinds of processing.
  • the program can be recorded in advance on a recording medium.
  • the program can be received via a network such as a LAN (Local Area Network) or the Internet and installed on a recording medium such as a built-in hard disk.
  • the various processes described in the specification are not only executed in time series according to the description, but may be executed in parallel or individually according to the processing capability of the apparatus that executes the processes or as necessary.
  • the system is a logical set configuration of a plurality of devices, and the devices of each configuration are not limited to being in the same casing.
  • As described above, according to the configuration of an embodiment of the present disclosure, the speaker specifying unit that executes the speaker specifying process performs (a) a speaker identification process based on a comparison result between the speech recognition result for the uttered voice and the lip-reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, or (b) a speaker identification process based on an analysis result of the motion of the speaker or the listener. For example, a process of comparing the nod timing estimation data of a speaker or listener, estimated based on an utterance character string obtained as a speech recognition result, with the actual nod timing data of the nod performer included in the camera-captured image is executed, and based on the degree of coincidence, it is determined whether the nod performer included in the camera-captured image is a speaker or a listener. With this configuration, an apparatus and a method for executing speaker identification processing with high accuracy under various circumstances are realized.

Abstract

The present invention provides a device and a method for accurately identifying a speaker under various conditions. A speaker identification unit for identifying a speaker performs (a) speaker identification based on the comparison between the results of voice recognition for a spoken voice and the results of lipreading recognition for analyzing an utterance from the lip movement of a speaker, or (b) speaker identification based on the results of analysis for the action of the speaker or a listener. For example, the speaker identification unit: compares data estimating the times a speaker or listener nods with real nod timing data for a person nodding in an image captured by a camera, the estimated nod timing being estimated on the basis of an utterance text string obtained as voice recognition results; and determines whether the person nodding in the image captured by the camera is a speaker or a listener on the basis of the degree of coincidence.

Description

Information processing apparatus, information processing system, information processing method, and program
The present disclosure relates to an information processing apparatus, an information processing system, an information processing method, and a program. More specifically, the present disclosure relates to an information processing apparatus, an information processing system, an information processing method, and a program that perform processing and responses based on a speech recognition result of a user utterance.
Recently, the use of speech recognition systems that perform speech recognition of user utterances and perform various processes and responses based on recognition results is increasing.
In this speech recognition system, a user utterance input via a microphone is recognized and understood, and processing corresponding to the user utterance is performed.
For example, when the user utters “tell about tomorrow's weather”, the weather information is acquired from the weather information providing server, a system response based on the acquired information is generated, and the generated response is output from the speaker. Specifically, for example,
System utterance = “Tomorrow's weather is sunny. However, there may be a thunderstorm in the evening.”
Such a system utterance is output.
Devices that perform such voice recognition include mobile devices such as smartphones, smart speakers, agent devices, signage devices, and the like.
In a configuration using smart speakers, agent devices, signage devices, etc., there are many cases where there are many people around these devices.
The voice recognition device is required to specify a speaker (speaking user) for the device and provide a service corresponding to the speaker.
For example, Patent Document 1 (Japanese Patent Application Laid-Open No. 11-24691) and Patent Document 2 (Japanese Patent Application Laid-Open No. 2008-146054) are conventional techniques that disclose a method for identifying a speaker.
These disclose a method of identifying a speaker based on an analysis result by analyzing sound quality (frequency identification / voiceprint) of a voice input to a device.
However, in these methods, in the case of a speaker with a similar sound quality, there is a high possibility of erroneous detection, and it is necessary to register the sound quality in advance.
Patent Document 3 (Japanese Patent Laid-Open No. 08-235358) discloses a method for identifying a speaker by detecting mouth movement from an image taken by a camera or the like.
However, this method makes it impossible to specify the speaker when there are a plurality of people moving their mouths or when the mouth of the speaker cannot be photographed.
Patent Document 1: JP 11-24691 A
Patent Document 2: JP 2008-146054 A
Patent Document 3: JP 08-235358 A
The present disclosure has been made in view of the above problems, for example, and aims to provide an information processing apparatus, an information processing system, an information processing method, and a program capable of specifying a speaker with high accuracy under various circumstances by using both sound and images.
The first aspect of the present disclosure is:
It has a speaker identification unit that executes speaker identification processing,
The speaker specifying unit
(A) A speaker identification process based on a comparison result between a speech recognition result for an uttered speech and a lip reading recognition result for analyzing the utterance from the lip movement of the speaker, or
(B) A speaker identification process based on an analysis result of a speaker's motion or a listener's motion listening to the speaker's speech,
The first aspect resides in an information processing apparatus in which the speaker specifying unit executes at least one of the speaker identification processes (a) and (b) above.
Furthermore, the second aspect of the present disclosure is:
An information processing system having an information processing terminal and a server,
The information processing terminal
A voice input unit;
An image input unit;
A communication unit that transmits the audio acquired through the audio input unit and the camera-captured image acquired through the image input unit to the server;
The server
A speech recognition unit that performs speech recognition processing of the speech of the speaker received from the information processing terminal;
An image recognition unit that performs analysis of the camera-captured image received from the information processing terminal;
The voice recognition unit
Generating a character string indicating the utterance content of the speaker, and generating nodding timing estimation data of the speaker and the listener based on the character string;
The image recognition unit
Generating nod timing actual data that records the nod timing of the nod performer included in the camera-captured image;
In at least one of the information processing terminal and the server,
The present invention is in an information processing system that performs speaker identification processing based on the degree of coincidence between the nodding timing estimation data and the nodding timing actual data.
Furthermore, the third aspect of the present disclosure is:
An information processing method executed in an information processing apparatus,
The information processing apparatus includes:
It has a speaker identification unit that executes speaker identification processing,
The speaker specifying unit
(A) A speaker identification process based on a comparison result between a speech recognition result for an uttered speech and a lip reading recognition result for analyzing the utterance from the lip movement of the speaker, or
(B) A speaker identification process based on an analysis result of a speaker's motion or a listener's motion listening to the speaker's speech,
The third aspect resides in an information processing method in which the speaker specifying unit executes at least one of the speaker identification processes (a) and (b) above.
Furthermore, the fourth aspect of the present disclosure is:
An information processing method executed in an information processing system having an information processing terminal and a server,
The information processing terminal
Transmitting the voice acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server;
The server
From the voice received from the information processing terminal, the server executes voice recognition processing of the speaker's voice, generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data of the speaker and the listener based on the character string;
Performing analysis of the camera-captured image received from the information processing terminal;
Generating nod timing actual data that records the nod timing of the nod performer included in the camera-captured image;
In at least one of the information processing terminal and the server,
The present invention is an information processing method for executing a speaker identification process based on the degree of coincidence between the nod timing estimation data and the actual nod timing data.
Furthermore, the fifth aspect of the present disclosure is:
A program for executing information processing in an information processing apparatus;
The information processing apparatus includes:
It has a speaker identification unit that executes speaker identification processing,
The program is stored in the speaker specifying unit.
(A) A speaker identification process based on a comparison result between a speech recognition result for an uttered speech and a lip reading recognition result for analyzing the utterance from the lip movement of the speaker, or
(B) A speaker identification process based on an analysis result of a speaker's motion or a listener's motion listening to the speaker's speech,
The fifth aspect resides in a program that causes the speaker specifying unit to execute at least one of the speaker identification processes (a) and (b) above.
Note that the program of the present disclosure is a program that can be provided by, for example, a storage medium or a communication medium provided in a computer-readable format to an information processing apparatus or a computer system that can execute various program codes. By providing such a program in a computer-readable format, processing corresponding to the program is realized on the information processing apparatus or the computer system.
Further objects, features, and advantages of the present disclosure will become apparent from a more detailed description based on embodiments of the present disclosure described below and the accompanying drawings. In this specification, the system is a logical set configuration of a plurality of devices, and is not limited to one in which the devices of each configuration are in the same casing.
According to the configuration of an embodiment of the present disclosure, an apparatus and a method for executing a speaker identification process with high accuracy under various circumstances are realized.
Specifically, for example, the speaker specifying unit that executes the speaker specifying process (a) compares the speech recognition result for the uttered voice with the lip reading recognition result for analyzing the utterance from the lip movement of the speaker. Or (b) the speaker identification process based on the analysis result of the operation of the speaker or the listener. For example, a process of comparing the nodding timing estimation data of a speaker or listener estimated based on an utterance character string obtained as a speech recognition result and the nodding timing actual data of the nodding performer included in the camera-captured image is executed. Based on the degree of coincidence, it is determined whether the nod performer included in the camera-captured image is a speaker or a listener.
With this configuration, an apparatus and a method for executing a speaker identification process with high accuracy under various circumstances are realized.
Note that the effects described in the present specification are merely examples and are not limited, and may have additional effects.
FIG. 1 is a diagram for describing a specific processing example of an information processing apparatus that responds to a user utterance.
FIG. 2 is a diagram for describing a configuration example and a usage example of the information processing apparatus.
FIG. 3 is a diagram for describing processing executed by the information processing apparatus of the present disclosure.
FIG. 4 is a diagram for describing a configuration example of the information processing apparatus.
FIG. 5 is a flowchart for describing a sequence of processing executed by the information processing apparatus.
FIG. 6 is a flowchart for describing a sequence of speech recognition processing executed by the information processing apparatus.
FIG. 7 is a diagram for describing nod timing analysis processing executed by the information processing apparatus.
FIG. 8 is a diagram for describing an example of a table recording nod timings generated by the information processing apparatus.
FIG. 9 is a flowchart for describing a sequence of image recognition processing executed by the information processing apparatus.
FIG. 10 is a diagram for describing a plurality of patterns of information required in the speaker specifying processing of the present disclosure.
FIG. 11 is a sequence diagram for describing a sequence in a case where the speaker specifying processing is executed using a server on the cloud side.
FIG. 12 is a sequence diagram for describing a sequence in a case where the speaker specifying processing is executed using a server on the cloud side.
FIG. 13 is a diagram for describing a configuration example of an information processing system.
FIG. 14 is a diagram for describing a hardware configuration example of the information processing apparatus.
The details of the information processing apparatus, the information processing system, the information processing method, and the program of the present disclosure will be described below with reference to the drawings. The description will be made according to the following items.
1. Outline of processing executed by the information processing apparatus
2. Outline of operation executed by the information processing apparatus of the present disclosure
3. Configuration example of the information processing apparatus
4. Sequence of the speaker identification processing
5. Other speaker identification methods
6. Configuration using a cloud-side apparatus (server)
7. Specific usage examples of the information processing apparatus of the present disclosure
8. Configuration examples of the information processing apparatus and the information processing system
9. Hardware configuration example of the information processing apparatus
10. Summary of the configuration of the present disclosure
[1. Outline of processing executed by information processing apparatus]
First, an overview of processing executed by the information processing apparatus according to the present disclosure will be described with reference to FIG. 1 and subsequent figures.
FIG. 1 is a diagram illustrating a processing example of an information processing apparatus 10 that recognizes and responds to a user utterance made by a speaker 1.
The information processing apparatus 10 is a user utterance of the speaker 1, for example,
User utterance = "Tell me the weather in the afternoon tomorrow in Osaka"
The voice recognition process of this user utterance is executed.
Furthermore, the information processing apparatus 10 executes processing based on the speech recognition result of the user utterance.
In the example shown in FIG. 1, data for responding to the user utterance "Tell me the weather in the afternoon tomorrow in Osaka" is acquired, a response is generated based on the acquired data, and the generated response is output via the speaker 14.
In the example illustrated in FIG. 1, the information processing apparatus 10 performs the following system response.
System response = “Tomorrow in Osaka, the afternoon weather is fine, but there may be a shower in the evening.”
The information processing apparatus 10 executes speech synthesis processing (TTS: Text To Speech) to generate and output the system response.
The information processing apparatus 10 generates and outputs a response using knowledge data acquired from a storage unit in the apparatus or knowledge data acquired via a network.
An information processing apparatus 10 illustrated in FIG. 1 includes a camera 11, a microphone 12, a display unit 13, and a speaker 14, and has a configuration capable of audio input / output and image input / output.
The information processing apparatus 10 illustrated in FIG. 1 is called, for example, a smart speaker or an agent device.
As illustrated in FIG. 2, the information processing apparatus 10 according to the present disclosure is not limited to the agent device 10a and can take various device forms such as a smartphone 10b, a PC 10c, or a signage device installed in a public place.
The information processing apparatus 10 recognizes the utterance of the speaker 1 and makes a response based on the user's utterance, and also executes control of the external device 30 such as a television and an air conditioner shown in FIG. 2 according to the user's utterance.
For example, when the user utterance is a request such as "change the TV channel to 1" or "set the air conditioner temperature to 20 degrees," the information processing apparatus 10 outputs a control signal (Wi-Fi, infrared light, etc.) to the external device 30 based on the voice recognition result of the user utterance and executes control according to the user utterance.
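As a hedged illustration of such device control, the sketch below maps a recognized intent to an external-device command; the intent names, parameters, and send_command() helper are assumed for the example and are not part of this disclosure.

```python
# Illustrative only: dispatch a recognized intent to an external-device command.
def control_external_device(intent: str, params: dict, send_command) -> None:
    if intent == "SetTvChannel":
        send_command(device="tv", command={"channel": params["channel"]})
    elif intent == "SetAirconTemperature":
        send_command(device="aircon", command={"target_temp_c": params["temperature"]})
    else:
        raise ValueError(f"unsupported intent: {intent}")

# Example: "change the TV channel to 1"
control_external_device("SetTvChannel", {"channel": 1},
                        send_command=lambda device, command: print(device, command))
```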
The information processing apparatus 10 is connected to the server 20 via a network and can acquire, from the server 20, information necessary for generating a response to the user utterance. The speech recognition processing and the semantic analysis processing may also be configured to be performed by the server.
[2. Outline of operation executed by information processing apparatus of present disclosure]
Next, an overview of operations performed by the information processing apparatus according to the present disclosure will be described with reference to FIG.
An information processing device such as an agent device or a signage device that responds to a user utterance is required to specify a speaker for the information processing device and provide a service corresponding to the speaker.
In the information processing apparatus of the present disclosure, voice data and image data are used to identify a speaker.
For example, as shown in FIG. 3(A), in a case where the information processing apparatus 10 can photograph the mouth of the speaker 31, the speech of the speaker is input via the microphone 12, lip-reading processing based on the mouth movement of the speaker 31 photographed by the camera 11 is performed, and the utterance content is analyzed.
Furthermore, the utterance content analysis result using the lip reading process is compared with the voice recognition result of the utterance voice data, and the speaker is specified based on the matching degree.
On the other hand, as shown in FIG. 3(B), when the information processing apparatus 10 cannot capture the movement of the mouth of the speaker 31, reactions such as nods and gestures of the speaker 31, or back-channel responses and nods of the listener 32, are analyzed from the camera-captured image, and the speaker is identified based on the analysis result.
The information processing apparatus 10 of the present disclosure performs such processing, for example, to enable highly accurate speaker identification under various situations.
[3. Configuration example of information processing apparatus]
Next, a specific configuration example of the information processing apparatus will be described with reference to FIG.
FIG. 4 is a diagram illustrating a configuration example of the information processing apparatus 10 that recognizes a user utterance and responds.
As illustrated in FIG. 4, the information processing apparatus 10 includes a voice input unit 101, a voice recognition unit 102, an utterance meaning analysis unit 103, an image input unit 105, an image recognition unit 106, a speaker identification unit 111, and a speaker analysis unit 112. , Speaker data storage unit 113, storage unit (user database, etc.) 114, response generation unit 121, system utterance speech synthesis unit 122, display image generation unit 123, output (speech, image) control unit 124, speech output unit 125, An image output unit 126 is included.
Note that all of these components can be configured inside one information processing apparatus 10, but a part of the configuration and functions may be included in another information processing apparatus or an external server.
The user's uttered voice and ambient sounds are input to the voice input unit 101 such as a microphone.
The voice input unit (microphone) 101 inputs voice data including the input user uttered voice to the voice recognition unit 102.
The voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function, and converts voice data into a character string (text data) composed of a plurality of words.
The text data generated in the speech recognition unit 102 is input to the utterance meaning analysis unit 103.
Note that the speech recognition unit 102 also generates a nod timing of a speaker or listener estimated based on the speech based on the speech recognition result.
This process will be described in detail later.
The utterance meaning analysis unit 103 selects and outputs user intention candidates included in the text.
The utterance meaning analysis unit 103 has a natural language understanding function such as NLU (Natural Language Understanding), and estimates, from the text data, the intention (intent) of the user utterance and entity information (entities), which are the meaningful (significant) elements included in the utterance, thereby analyzing the meaning of the utterance.
If the intention (intent) and the entity information (entities) can be accurately estimated and acquired from the user utterance, the information processing apparatus 10 can perform accurate processing for the user utterance.
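As a small illustration of the intent/entity structure produced by such an NLU step, the sketch below encodes the example utterance from FIG. 1; the intent and entity labels are invented for the example and are not defined in this disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class UtteranceMeaning:
    """Assumed container for an NLU result: one intent plus its entities."""
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)

# "Tell me the weather in the afternoon tomorrow in Osaka"
meaning = UtteranceMeaning(
    intent="CheckWeather",
    entities={"place": "Osaka", "date": "tomorrow", "time_of_day": "afternoon"},
)
print(meaning)
```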
The response generation unit 121 generates a response to the user based on the utterance meaning analysis result of the user utterance estimated by the utterance meaning analysis unit 103. The response is constituted by at least one of sound and image.
When the response voice is output, the voice information generated by the system utterance voice synthesis unit 122 through the voice synthesis process (TTS: Text To Speech) is output via the voice output unit 125 such as a speaker.
When outputting the response image, the display image information generated in the display image composition unit 123 is output via the image output unit 126 such as a display.
Note that the output control of sound and image is controlled by the output (sound and image) control unit 124.
The image output unit 126 includes, for example, a display such as an LCD or an organic EL display, or a projector that performs projection display.
The information processing apparatus 10 can also output and display images on externally connected devices such as televisions, smartphones, PCs, tablets, AR (Augmented Reality) devices, VR (Virtual Reality) devices, and other home appliances.
The image input unit 105 is a camera and inputs an image of a speaker or a listener.
The image recognition unit 106 inputs a photographed image from the image input unit 105 and performs image analysis. For example, lip reading processing is performed by analyzing the movement of the lips of the speaker, and a character string corresponding to the utterance content of the speaker is generated from the result.
Furthermore, the nod timing of the speaker or listener is acquired from the image.
Details of these processes will be described later.
The speaker specifying unit 111 executes speaker specifying processing.
The speaker specifying unit 111 inputs the following data.
(1) Speech recognition result (character string) related to the utterance of the speaker generated by the speech recognition unit 102, nodding timing estimation information of the speaker or listener,
(2) Estimated utterance content (character string) based on the lip reading result of the speaker acquired by the image recognition unit 106, nodding timing information of the speaker or listener,
The speaker specifying unit 111 inputs each of the above data, and executes speaker specifying processing using at least one of these pieces of information.
A specific example of the speaker specifying process will be described later.
The speaker analysis unit 112 analyzes the external characteristics (age, sex, body shape, etc.) of the speaker specified by the speaker specifying unit 111, searches the user database, which includes speakers registered in advance in the storage unit 114, and executes speaker search and identification processing.
The speaker data storage unit 113 executes processing for storing the information of the speaker and the content of the speech in the storage unit (user database or the like) 114.
The storage unit (user database, etc.) 114 stores a user database containing speaker information such as age, gender, and body type for each user, and also records the utterance contents of each speaker in association with each user (speaker).
[4. About the sequence of speaker identification processing]
Next, a sequence of speaker identification processing executed by the information processing apparatus 10 according to the present disclosure will be described with reference to a flowchart illustrated in FIG.
Note that the processing shown in the flow of FIG. 5 and subsequent flows can be executed in accordance with a program stored in the storage unit of the information processing apparatus 10, for example, as program execution processing by a processor such as a CPU having a program execution function.
Hereinafter, the process of each step of the flow shown in FIG. 5 will be described.
(Step S101)
First, in step S101, the speech recognition unit 102 performs speech recognition processing on the speech input from the speech input unit 101.
The voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function, and converts voice data into a character string (text data) composed of a plurality of words.
(Step S102)
In step S102, the speech recognition unit 102 acquires a character string that is a speech recognition result of the speaker.
The voice recognition unit 102 generates a character string of the utterance content acquired as a voice recognition result and outputs it to the speaker specifying unit 111.
(Step S103)
In step S103, the voice recognition unit 102 generates estimation data of the nod timing of the speaker and listener based on the character string that is the speech recognition result of the speaker.
Details of this processing will be described later.
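Although the actual estimation method is described later in this disclosure, the sketch below shows, purely as an assumed illustration, one simple way nod timings could be estimated from a recognized utterance string by placing a listener nod at each phrase boundary; the speaking-rate constant is also an assumption.

```python
import re

SECONDS_PER_CHAR = 0.15  # assumed average speaking rate

def estimate_listener_nod_timings(utterance: str):
    """Return estimated nod times (seconds from utterance start) at phrase ends."""
    timings, elapsed = [], 0.0
    for phrase in re.split(r"[,.、。]", utterance):
        if not phrase.strip():
            continue
        elapsed += len(phrase) * SECONDS_PER_CHAR
        timings.append(round(elapsed, 2))
    return timings

print(estimate_listener_nod_timings("Tomorrow's weather is sunny, but there may be a thunderstorm."))
```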
(Step S104)
In step S104, the image recognition unit 106 that has input the captured image of the image input unit 105 analyzes the image and searches for a person who is estimated to be a speaker. Specifically, a person whose lips are moving is searched.
(Step S105)
Step S105 is a step of determining whether the search in step S104 for a person estimated to be the speaker (a person moving their lips) based on the image has succeeded. If the search for the estimated speaker has succeeded, the process proceeds to step S111.
If the search for the estimated speaker fails, the process proceeds to step S121.
(Step S111)
When the search for a person estimated to be the speaker based on the image (a person moving their lips) has succeeded, in step S111 the image recognition unit 106 analyzes the lip movement of the estimated speaker (the person moving their lips) and executes lip-reading recognition processing.
The image recognition unit 106 generates a character string of the utterance content based on the lip reading recognition process, and outputs the character string to the speaker specifying unit 111.
(Steps S112 to S113)
The process in step S112 is a process executed by the speaker specifying unit 112.
The speaker specifying unit 112
(1) A character string of utterance content generated by the voice recognition unit 102 based on voice recognition,
(2) a character string of the utterance content generated by the image recognition unit 106 based on the lip recognition processing;
Input these two character strings.
The speaker specifying unit 112 compares these two character strings.
If these two character strings substantially match (step S113 = Yes), the process proceeds to step S131, and the estimated speaker (the person moving his or her lips) detected on the basis of the image in step S104 is identified as the speaker.
Specifically, when the match rate between the speech recognition result character string and the lip-reading recognition result character string is equal to or higher than a predefined threshold, the estimated speaker detected on the basis of the image (the person moving his or her lips) is identified as the speaker.
On the other hand, if the two character strings do not match (step S113 = No), that is, if the match rate is below the predefined threshold, the estimated speaker (the person moving his or her lips) detected on the basis of the image in step S104 is judged not to be the speaker, the process returns to step S104, and the search for an estimated speaker is performed again.
In this renewed search, the estimated speaker detected on the basis of the image in step S104 is excluded from the search targets, and only the other persons are searched.
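The following is a minimal illustrative sketch (in Python) of the comparison performed in steps S112 to S113. The disclosure only requires that the match rate between the two character strings be compared against a predefined threshold; the similarity metric and the threshold value used below are assumptions for illustration.

```python
# Sketch of the string comparison in steps S112-S113.
# The matching metric and the threshold value are assumptions; the
# disclosure only requires a match rate above a predefined threshold.
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.8  # hypothetical value


def match_rate(asr_text: str, lip_text: str) -> float:
    """Return a 0.0-1.0 similarity between the ASR result and the lip-reading result."""
    return SequenceMatcher(None, asr_text, lip_text).ratio()


def is_same_utterance(asr_text: str, lip_text: str) -> bool:
    """True if the person whose lips were read is judged to be the speaker."""
    return match_rate(asr_text, lip_text) >= MATCH_THRESHOLD


if __name__ == "__main__":
    asr = "by the way did you watch that movie"
    lip = "by the way did you watch that m movie"
    print(match_rate(asr, lip), is_same_utterance(asr, lip))
```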
(Step S121)
On the other hand, if it is determined that the search in step S104 for a person estimated to be the speaker (a person moving his or her lips) has not succeeded (step S105 = No), the process proceeds to step S121.
The processing of step S121 is executed by the image recognition unit 106.
The image recognition unit 106, which receives the captured image from the image input unit 105, analyzes the image and searches for a person who is nodding.
If a nodding person is found, the image recognition unit 106 generates time-series data of the nods, that is, a "nod timing time table", and outputs it to the speaker identification unit 111.
The "nod timing time table" will be described in detail later.
(Steps S122 to S123)
The processing of step S122 is executed by the speaker identification unit 111.
The speaker identification unit 111 compares the following two data:
(1) the nod timing estimation data of the speaker and the listener estimated in step S103 by the speech recognition unit 102 on the basis of the character string that is the speech recognition result of the speaker's utterance, and
(2) the "nod timing time table" generated in step S121 by the image recognition unit 106 using the image captured by the image input unit 105.
The speaker identification unit 111 compares these two sets of nod timings.
If the speaker nod timing estimation data estimated by the speech recognition unit 102 from the speech recognition result character string and the "nod timing time table" generated by the image recognition unit 106 from the captured image substantially match (step S123 = (a) matches the estimated nod timing of the speaker), the process proceeds to step S131, and the nodding person detected on the basis of the image in step S121 is identified as the speaker.
Specifically, when the match rate between the time stamps in the speaker nod timing estimation data and the time stamps recorded in the "nod timing time table" is equal to or higher than a predefined threshold, the nodding person detected on the basis of the image is identified as the speaker.
If, instead, the listener nod timing estimation data estimated by the speech recognition unit 102 from the speech recognition result character string and the "nod timing time table" generated by the image recognition unit 106 substantially match (step S123 = (b) matches the estimated nod timing of the listener), the process proceeds to step S124.
Specifically, when the match rate between the time stamps in the listener nod timing estimation data and the time stamps recorded in the "nod timing time table" is equal to or higher than the predefined threshold, the process proceeds to step S124.
If neither the speaker nod timing estimation data nor the listener nod timing estimation data matches the "nod timing time table" (step S123 = No), that is, if both match rates are below the predefined threshold, the nodding person detected on the basis of the image in step S121 is judged to be neither the speaker nor a listener, the process returns to step S104, and the search for an estimated speaker is performed again.
In this renewed search, the nodding person detected on the basis of the image in step S121 is excluded from the search targets, and only the other persons are searched.
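The following is an illustrative sketch of the nod timing comparison in steps S122 to S123, under the assumption that nod timings are handled as lists of timestamps (seconds from a common reference). The tolerance window and the threshold are assumed values; the disclosure only requires a match rate compared against a predefined threshold.

```python
# Sketch of the nod-timing comparison in steps S122-S123. Timestamps are
# seconds from the start of the utterance; the +/-0.5 s tolerance and the
# 60 % threshold are assumptions, not values from the disclosure.
TOLERANCE_SEC = 0.5
NOD_MATCH_THRESHOLD = 0.6


def nod_match_rate(estimated: list[float], observed: list[float]) -> float:
    """Fraction of estimated nod timings that have an observed nod nearby."""
    if not estimated:
        return 0.0
    hits = sum(
        any(abs(t_est - t_obs) <= TOLERANCE_SEC for t_obs in observed)
        for t_est in estimated
    )
    return hits / len(estimated)


def classify_nodder(speaker_est: list[float],
                    listener_est: list[float],
                    observed: list[float]) -> str | None:
    """Return 'speaker', 'listener', or None for the person seen nodding."""
    if nod_match_rate(speaker_est, observed) >= NOD_MATCH_THRESHOLD:
        return "speaker"
    if nod_match_rate(listener_est, observed) >= NOD_MATCH_THRESHOLD:
        return "listener"
    return None
```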
(Step S124)
Step S124 is executed when it is determined in step S123 that the listener nod timing estimation data estimated by the speech recognition unit 102 from the speech recognition result character string and the "nod timing time table" generated by the image recognition unit 106 from the image captured by the image input unit 105 substantially match.
In step S124, the speaker identification unit 111 judges the nodding person detected from the image to be a listener, acquires information on the line of sight and head orientation of this listener from the image recognition unit 106, and in step S131 identifies the person located in the direction of the listener's gaze or head orientation as the speaker.
(Step S131)
Step S131 is the speaker identification processing executed by the speaker identification unit 111.
The speaker identification unit 111 executes one of the following three types of speaker identification processing.
(Process 1) Step S113 = Yes
When the character string of the utterance content generated by the speech recognition unit 102 on the basis of speech recognition and the character string of the utterance content generated by the image recognition unit 106 on the basis of lip-reading recognition substantially match (step S113 = Yes), the estimated speaker (the person moving his or her lips) detected on the basis of the image in step S104 is identified as the speaker.
(Process 2) Step S123 = (a) matches the estimated nod timing of the speaker
When the speaker nod timing estimation data estimated by the speech recognition unit 102 from the speech recognition result character string and the "nod timing time table" generated by the image recognition unit 106 from the image captured by the image input unit 105 substantially match, the nodding person detected on the basis of the image in step S121 is identified as the speaker.
(Process 3) Step S123 = (b) matches the estimated nod timing of the listener
When the listener nod timing estimation data estimated by the speech recognition unit 102 from the speech recognition result character string and the "nod timing time table" generated by the image recognition unit 106 from the image captured by the image input unit 105 substantially match,
the nodding person detected from the image is judged to be a listener, information on the line of sight and head orientation of this listener is acquired from the image recognition unit 106, and the person located in the direction of the listener's gaze or head orientation is identified as the speaker.
(Step S132)
Finally, after the speaker identification processing, the speaker identification unit 111 analyzes the characteristics of the speaker. The analysis result and the utterance content are then stored in the storage unit (user database or the like) 114 via the speaker data storage unit 113.
The data stored in the storage unit (user database or the like) 114 includes, for example, external features of the speaker (age, gender, body type, and the like) analyzed from the image acquired by the image input unit 105 and the character string of the utterance content generated by the speech recognition unit 102.
Next, the detailed sequence of steps S101 to S103 in the flowchart of FIG. 5, that is, the processing executed by the speech recognition unit 102, will be described with reference to the flowchart shown in FIG. 6.
In steps S101 to S103, the speech recognition unit 102 performs speech recognition processing on the speech input from the speech input unit 101, generates a character string representing the utterance content, and further generates estimation data of the nod timings of the speaker and the listener on the basis of the character string that is the speech recognition result of the speaker's utterance.
The flow shown in FIG. 6 is a flowchart showing the execution procedure of these processes.
The processing of each step will be described in order.
(Step S201)
First, in step S201, the speech recognition unit 102 receives the speaker's speech from the speech input unit 101.
(Step S202)
Next, in step S202, the speech recognition unit 102 converts the speaker's speech input from the speech input unit 101 into a character string (text data) composed of a plurality of words.
This processing generates the utterance character string 251 shown in FIG. 6.
The utterance character string 251 is input to the speaker identification unit 111.
(Step S203)
In step S203, the speech recognition unit 102 generates estimation data of the speaker's nod timing on the basis of the character string that is the speech recognition result of the speaker's utterance.
This processing generates the speaker nod timing estimation time table 252 shown in FIG. 6.
The speaker nod timing estimation time table 252 is input to the speaker identification unit 111.
(Step S204)
In step S204, the speech recognition unit 102 generates estimation data of the listener's nod timing on the basis of the character string that is the speech recognition result of the speaker's utterance.
This processing generates the listener nod timing estimation time table 253 shown in FIG. 6.
The listener nod timing estimation time table 253 is input to the speaker identification unit 111.
A specific example of the generation of the speaker nod timing estimation time table 252 and the listener nod timing estimation time table 253 will be described with reference to FIG. 7.
FIG. 7 shows a character string generated by the speech recognition unit 102 from the uttered speech.
Character string = "Hello, the weather is nice again today. By the way, have you seen that movie? It is very popular now, and every movie theater seems to be full."
The speech recognition unit 102 estimates the nod timings of the speaker and the listener from this character string.
Specifically, the speech recognition unit 102 estimates the nod timings of the speaker and the listener from the positions of the punctuation marks at which the speech pauses.
The speaker nods along with his or her own utterance, for example immediately before a punctuation mark in the spoken sentence.
The listener, on the other hand, usually nods as a back-channel response to the speaker's utterance, and tends to nod immediately after a punctuation mark in the spoken sentence.
Using this characteristic, the speech recognition unit 102 sets the estimated nod timing of the speaker immediately before each punctuation mark and the estimated nod timing of the listener immediately after each punctuation mark.
Through this estimation processing, the speech recognition unit 102 generates
the speaker nod timing estimation time table 252, in which the nod timings are set immediately before the punctuation marks, and
the listener nod timing estimation time table 253, in which the nod timings are set immediately after the punctuation marks.
Each timing estimation time table is a table in which the times at which a nod is estimated to occur are recorded sequentially as "time stamps".
The speech recognition unit 102 acquires the time information from the clock of the information processing apparatus 10 or via a network, and generates the time table in which the times at which a nod is estimated to occur are recorded as time stamps.
A specific example of the speaker nod timing estimation time table 252 is shown in FIG. 8.
As shown in FIG. 8, the speaker nod timing estimation time table 252 is a table in which the time information (time stamps) at which nods are estimated to occur is recorded in time series.
Next, the detailed sequence of the processing executed by the image recognition unit 106 will be described with reference to the flowchart shown in FIG. 9.
The processing executed by the image recognition unit 106 corresponds to steps S104 to S105, S111, and S121 of the flowchart described with reference to FIG. 5.
The flowchart shown in FIG. 9 explains the specific processing sequence of these processes.
The processing of each step will be described in order.
(Step S301)
In step S301, the image recognition unit 106 receives the captured image from the image input unit 105.
(Step S302)
Next, in step S302, the image recognition unit 106 analyzes the image and searches for a person estimated to be the speaker. Specifically, it searches for a person whose lips are moving.
(Step S303)
Step S303 determines whether the search in step S302 for a person estimated to be the speaker (a person moving his or her lips) has succeeded. If the estimated speaker has been found, the process proceeds to step S304.
If the search for the estimated speaker has failed, the process proceeds to step S311.
(Step S304)
If the image-based search for a person estimated to be the speaker (a person moving his or her lips) has succeeded, in step S304 the image recognition unit 106 analyzes the lip movement of the estimated speaker and executes lip-reading recognition processing.
On the basis of the lip-reading recognition processing, the image recognition unit 106 generates an utterance character string 351, which is a character string representing the utterance content.
The utterance character string 351 generated on the basis of the lip-reading recognition processing is input to the speaker identification unit 111.
(Step S311)
On the other hand, if it is determined that the search in step S302 for a person estimated to be the speaker (a person moving his or her lips) has not succeeded (step S303 = No), the process proceeds to step S311.
In step S311, the image recognition unit 106 analyzes the image captured by the image input unit 105 and searches for a person who is nodding.
If a nodding person is found, the image recognition unit 106 generates time-series data of the nods, that is, the "nod timing time table 352", and outputs it to the speaker identification unit 111.
The nod timing time table 352 is a table that records the actual nod timings of the nodding person in the image, that is, measured nod timing data.
The table configuration is the same as the configuration described above with reference to FIG. 8.
However, the "nod timing time table 352" generated by the image recognition unit 106 records the nod timings observed in the image, that is, the actual times at which the nods were performed.
In contrast, the nod timing estimation time tables 252 and 253 generated by the speech recognition unit 102 according to the flow described with reference to FIG. 6 record the estimated times at which nods are expected to occur, derived from the character string representing the utterance content.
This is the processing described above with reference to FIG. 7.
FIG. 10 shows a table summarizing the minimum elements required for speaker identification.
There are three different patterns of combinations of the minimum elements required for speaker identification.
Pattern 1 identifies the speaker by lip-reading recognition of the speaker.
In pattern 1,
lip-reading recognition of the speaker is required,
nod recognition is not required for either the speaker or the listener, and
gaze recognition of the listener is not required.
Pattern 2 identifies the speaker by recognizing the speaker's nods.
In pattern 2,
lip-reading recognition of the speaker is not required,
nod recognition of the speaker is required while nod recognition of the listener is not, and
gaze recognition of the listener is not required.
Pattern 3 identifies the speaker by recognizing the listener's nods and the listener's gaze information.
In pattern 3,
lip-reading recognition of the speaker is not required,
nod recognition of the speaker is not required while nod recognition of the listener is, and
gaze recognition of the listener is required.
In each of patterns 1 to 3 in FIG. 10, the accuracy of speaker identification can be further improved by also taking into account the items marked as not required. For example, using not only the speaker's nod timing but also the listener's nod timing together with the listener's gaze and head orientation information makes it possible to improve the accuracy of speaker identification.
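As a rough illustration of the FIG. 10 summary, the choice among patterns 1 to 3 can be expressed as a simple selection based on which recognitions are available; the function below is only a schematic restatement of the table, and the input flags are assumptions.

```python
# Schematic restatement of FIG. 10: which identification pattern can run,
# given which recognitions succeeded. Inputs and return values are illustrative.
from typing import Optional


def select_identification_pattern(lip_reading_ok: bool,
                                  speaker_nod_ok: bool,
                                  listener_nod_ok: bool,
                                  listener_gaze_ok: bool) -> Optional[int]:
    """Return 1, 2, or 3 for the usable pattern of FIG. 10, or None if none applies."""
    if lip_reading_ok:
        return 1  # pattern 1: lip-reading text vs. ASR text
    if speaker_nod_ok:
        return 2  # pattern 2: observed speaker nods vs. estimated speaker nod timing
    if listener_nod_ok and listener_gaze_ok:
        return 3  # pattern 3: observed listener nods plus listener gaze direction
    return None
```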
[5. Other speaker identification methods]
In the embodiment described above, the speaker is identified either by comparing the utterance character string (text) obtained from the speech recognition result with the utterance character string (text) obtained from the lip-reading recognition result, or by comparing the nod timings estimated from the speech recognition result with the nod timings obtained from the image recognition result. The speaker can also be identified by using the following combinations of audio and image data.
(1) The rhythm of the utterance is acquired from the audio, and if it matches the rhythm of the head or body movement of a person acquired from the image, that person is identified as the speaker.
(2) A character string is acquired from the audio, and the head-skin vibration pattern that depends on the vowels being uttered is acquired from the image. The speaker is identified by whether the vowels in the character string match the head-skin vibration pattern.
(3) Using the property that a speaker tends to blink at breaks in his or her speech, break information is acquired from the audio and blink timings are acquired from the image, and the speaker is identified by whether they match.
(4) Changes in volume, pitch, and speed are compared with the body movements obtained from the image. The loudness of the voice is compared with the magnitude of the movement, the pitch of the voice with the raising and lowering of the face, and the speed of the voice with the speed of the body movement, and a person showing similar tendencies is identified as the speaker.
(5) The utterance content is recognized from the uttered speech, and the change in the listener's emotion that the content is likely to cause (laughter, joy, anger, sadness, and so on) is predicted. A person expressing an emotion matching the prediction is found by image recognition and identified as a listener, and the person located in the direction of that listener's gaze or head orientation is identified as the speaker.
[6. Configuration using a cloud-side device (server)]
In the embodiment described above, all of the processing is executed inside the information processing apparatus 10 shown in FIG. 4.
However, at least part of the processing of the embodiment described above may be executed using a cloud-side device (server).
For example, some or all of the speech recognition processing by the speech recognition unit 102, the nod timing generation processing, the lip-reading recognition processing and nod detection processing by the image recognition unit 106, the collation of the character string recognition results, and the collation of the nod timings may be executed in a cloud-side device (server).
Processing sequences in the case where a cloud-side device (server) is used will be described with reference to FIGS. 11 and 12.
FIG. 11 is a processing sequence in which the processing of identifying the speaker by speech recognition and lip-reading recognition is executed through communication between the information processing apparatus 10 and the server 20.
FIG. 12 is a processing sequence in which the processing of identifying the speaker using nod timings is executed through communication between the information processing apparatus 10 and the server 20.
First, with reference to the sequence diagram shown in FIG. 11, the processing sequence in the case where the processing of identifying the speaker by speech recognition and lip-reading recognition is executed through communication between the information processing apparatus 10 and the server 20 will be described.
This processing sequence can be advanced, for example, by the information processing apparatus 10 issuing API (Application Programming Interface) requests to the server 20 and the server 20 returning API responses to the information processing apparatus 10.
Specific API examples are included in the description below.
(Step S401)
First, in step S401, the information processing apparatus 10 transmits a speech recognition request to the server 20 together with the acquired audio.
(Step S402)
Next, in step S402, the server 20 transmits the speech recognition result to the information processing apparatus 10.
The API used for the speech recognition processing in steps S401 to S402 is, for example, the following API:
GET /api/v1/recognition/audio
This API starts speech recognition and acquires the recognition result (a text character string).
The data required for this API request is the audio data (audio stream data) input by the information processing apparatus 10 via the speech input unit 101.
The API response returned by the server 20 contains the speech recognition result character string (speech (string)), which is the result of the speech recognition processing executed by the speech recognition processing unit on the server 20 side.
(Step S403)
Next, in step S403, the information processing apparatus 10 transmits a lip-reading recognition request to the server 20 together with the acquired image.
(Step S404)
Next, in step S404, the server 20 transmits the lip-reading recognition result to the information processing apparatus 10.
The API used for the lip-reading recognition processing in steps S403 to S404 is, for example, the following API:
GET /api/v1/recognition/lip
This API starts lip-reading recognition and acquires the recognition result (a text character string) and a speaker identifier (ID).
The data required for this API request is the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
The API response returned by the server 20 contains the lip-reading recognition result character string (lip (string)), which is the result of the lip-reading recognition processing executed by the lip-reading recognition processing unit on the server 20 side, and a speaker ID (speaker-id (string)).
Note that the server 20 holds a user database in which user IDs are recorded in association with user feature information, identifies the user on the basis of the image data received from the information processing apparatus 10, and provides the user identifier (ID) to the information processing apparatus 10.
For an unregistered user, however, the server returns a result indicating that the user is unknown.
(Step S405)
Next, in step S405, the information processing apparatus 10 transmits to the server 20 a request to compare the speech recognition result (character string) with the lip-reading recognition result (character string).
(Step S406)
Next, in step S406, the server 20 transmits the comparison result (match rate) between the speech recognition result (character string) and the lip-reading recognition result (character string) to the information processing apparatus 10.
The API used for the comparison of the speech recognition result (character string) with the lip-reading recognition result (character string) in steps S405 to S406 is, for example, the following API:
GET /api/v1/recognition/audio/comparison
This API compares the character string of the speech recognition result with the character string of the lip-reading recognition result and acquires the result indicating whether they match.
The data required for this API request are the following data received by the information processing apparatus 10 from the server 20:
the speech recognition result character string (speech (string)), and
the lip-reading recognition result character string (lip (string)).
The API response returned by the server 20 contains the character string match rate (concordance-rate (integer)), which is the result of the character string comparison processing executed by the character string comparison unit on the server 20 side.
(Step S407)
Next, in step S407, the speaker identification unit 111 of the information processing apparatus 10 executes speaker identification processing on the basis of the character string match rate (concordance-rate (integer)) received from the server 20.
That is, when the match rate between
the speech recognition result character string (speech (string)) and
the lip-reading recognition result character string (lip (string))
is equal to or higher than a predefined threshold, the estimated speaker (the person moving his or her lips) detected on the basis of the image is identified as the speaker.
(Step S408)
Next, in step S408, the information processing apparatus 10 transmits a speaker information acquisition request to the server 20.
(Step S409)
Next, in step S409, the server 20 transmits the speaker information to the information processing apparatus 10.
The API used for the speaker information acquisition processing in steps S408 to S409 is, for example, the following API:
GET /api/v1/recognition/speakers/{speaker-id}
The data required for this API request are the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105 and the speaker ID (speaker-id (string)) received from the server 20.
The API response returned by the server 20 contains the user information corresponding to the speaker ID, acquired from the user database on the server 20 side.
Specifically, it consists of the following information about the speaker:
sex (sex (string)),
age (age (integer)),
height (height (integer)),
physical characteristics (physical status (string)).
These are data already registered in the user database on the server 20 side.
(Step S410)
Next, in step S410, the information processing apparatus 10 transmits a speaker information registration request to the server 20.
(Step S411)
Next, in step S411, the server 20 transmits a speaker information registration completion notification to the information processing apparatus 10.
The API used for the speaker information registration processing in steps S410 to S411 is, for example, the following API:
POST /api/v1/recognition/speakers/{speaker-id}
The data required for this API request are the speaker ID (speaker-id (string)) received by the information processing apparatus 10 from the server 20, the speech recognition result character string (speech (string)) received from the server 20, and the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
The API response returned by the server 20 contains the registered speaker ID (registered-speaker-id (string)).
When registering the speaker data, storing the following information about the speaker in the database makes it possible to identify the speaker in the next utterance content analysis:
(a) face photograph,
(b) voice quality.
In addition, storing the following data makes it possible to adapt the recognition to that person's habits and thereby improve recognition accuracy:
(c) features of the mouth movement,
(d) features of the speech sound,
(e) utterance content (tendencies of the utterance content).
Furthermore, storing the following data makes it possible to omit re-analysis of the speaker data, or to obtain the speaker data even when only a face photograph can be captured:
(f) gender,
(g) age,
(h) height,
(i) physical characteristics.
For example, the server 20 registers the speaker information (a) to (i) in the database.
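As an illustrative sketch, a speaker record covering items (a) to (i) could be represented as follows. Field names follow the API parameter names where the document gives them (sex, age, height, physical status); the remaining names are assumptions.

```python
# Sketch of a speaker record covering items (a)-(i). Types and names of the
# feature fields are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpeakerRecord:
    speaker_id: str                                   # registered-speaker-id
    face_photo: Optional[bytes] = None                # (a) face photograph
    voice_quality: Optional[bytes] = None             # (b) voice quality / voiceprint features
    mouth_movement_features: Optional[bytes] = None   # (c) features of the mouth movement
    speech_sound_features: Optional[bytes] = None     # (d) features of the speech sound
    utterance_history: list[str] = field(default_factory=list)  # (e) utterance content
    sex: Optional[str] = None                         # (f)
    age: Optional[int] = None                         # (g)
    height: Optional[int] = None                      # (h), e.g. in cm
    physical_status: Optional[str] = None             # (i)
```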
Next, with reference to the sequence diagram shown in FIG. 12, the processing sequence in the case where the processing of identifying the speaker using nod timings is executed through communication between the information processing apparatus 10 and the server 20 will be described.
This processing sequence can also be advanced, for example, by the information processing apparatus 10 issuing API requests to the server 20 and the server 20 returning API responses to the information processing apparatus 10.
Specific API examples are included in the description below.
(Step S421)
First, in step S421, the information processing apparatus 10 transmits a nod timing estimation data generation request to the server 20 together with the acquired audio.
(Step S422)
Next, in step S422, the server 20 transmits the audio-based nod timing estimation time tables to the information processing apparatus 10.
The API used for the nod timing estimation time table generation processing in steps S421 to S422 is, for example, the following API:
GET /api/v1/recognition/audio/nods
This API analyzes the character string obtained from the speech recognition result and acquires the nod timing estimation time tables (lists of time stamps at which nods are estimated to occur) and the ID of the nodding person.
The data required for this API request is the audio data (audio stream data) input by the information processing apparatus 10 via the speech input unit 101.
The API response returned by the server 20 contains the audio-based nod timing estimation time tables obtained by the speech recognition processing unit on the server 20 side by analyzing the speech recognition result (character string).
The audio-based nod timing estimation time tables include the following two types of tables:
the speaker nod timing estimation time table (speaker-nods (string-array)), and
the listener nod timing estimation time table (listener-nods (string-array)).
These are generated by the speech recognition unit of the server 20 according to the processing described above with reference to FIG. 7; that is, the most likely nod timings are derived from the speech recognition result.
(Step S423)
Next, in step S423, the information processing apparatus 10 transmits a nod timing data generation request to the server 20 together with the acquired image.
(Step S424)
Next, in step S424, the server 20 transmits the image-based nod timing time table to the information processing apparatus 10.
The API used for the nod timing time table generation processing in steps S423 to S424 is, for example, the following API:
GET /api/v1/recognition/nods
This API analyzes the image and acquires the nod timing time table (a list of time stamps at which nods were performed) and the ID of the nodding person.
The data required for this API request is the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
The API response returned by the server 20 contains the image-based nod timing time table (nod-timings (string-array)) obtained by the image recognition processing unit on the server 20 side by analyzing the image, and the ID of the nodding person (nodding-person-id (string)).
As described above, the server 20 holds a user database in which user IDs are recorded in association with user feature information, identifies the user on the basis of the image data received from the information processing apparatus 10, and provides the user identifier (ID) to the information processing apparatus 10.
For an unregistered user, however, the server returns a result indicating that the user is unknown.
(Step S425)
Next, in step S425, the information processing apparatus 10 transmits to the server 20 a request to compare each of the two nod timing estimation time tables received in step S422, namely
(a1) the audio-based speaker nod timing estimation time table and
(a2) the audio-based listener nod timing estimation time table,
with
(v1) the image-based nod timing time table received in step S424.
(Step S426)
Next, in step S426, the server 20 transmits to the information processing apparatus 10 the comparison results (match rates) between the two audio-based nod timing estimation time tables and the image-based nod timing time table.
The API used for the nod timing time table comparison processing in steps S425 to S426 is, for example, the following API:
GET /api/v1/recognition/nods/comparison
This API compares the nod timings of the two audio-based nod timing estimation time tables with those of the image-based nod timing time table and acquires the results indicating whether they match.
The data required for this API request are the following data received by the information processing apparatus 10 from the server 20:
(a1) the audio-based speaker nod timing estimation time table (speaker-nods (string-array)) received in step S422,
(a2) the audio-based listener nod timing estimation time table (listener-nods (string-array)) received in step S422, and
(v1) the image-based nod timing time table (nod-timings (string-array)) received in step S424.
The API response returned by the server 20 contains the following result data of the nod timing time table comparison processing executed by the speaker identification unit on the server 20 side:
the speaker nod match rate (%) (speaker-concordance-rate (integer)), and
the listener nod match rate (%) (listener-concordance-rate (integer)).
The speaker nod match rate (%) (speaker-concordance-rate (integer)) is the match rate (%) obtained by comparing the nod timings of
(a1) the audio-based speaker nod timing estimation time table (speaker-nods (string-array)) and
(v1) the image-based nod timing time table (nod-timings (string-array)).
Similarly, the listener nod match rate (%) (listener-concordance-rate (integer)) is the match rate (%) obtained by comparing the nod timings of
(a2) the audio-based listener nod timing estimation time table (listener-nods (string-array)) and
(v1) the image-based nod timing time table (nod-timings (string-array)).
(Step S427)
Next, in step S427, the speaker identification unit 111 of the information processing apparatus 10 executes speaker identification processing on the basis of
the speaker nod match rate (%) (speaker-concordance-rate (integer)) and
the listener nod match rate (%) (listener-concordance-rate (integer))
received from the server 20.
Specifically, when the speaker nod match rate (%) (speaker-concordance-rate (integer)) is equal to or higher than a predefined threshold, the nodding person detected in the image from which (v1) the image-based nod timing time table (nod-timings (string-array)) was generated is judged to be the speaker.
On the other hand, when the listener nod match rate (%) (listener-concordance-rate (integer)) is equal to or higher than the predefined threshold, the nodding person detected in the image from which (v1) the image-based nod timing time table (nod-timings (string-array)) was generated is judged to be a listener.
In this case, the processing of steps S428 to S430 is further executed to identify the speaker.
If the speaker has been identified in step S427, the processing of steps S428 to S430 can be omitted.
That is, when the speaker nod match rate (%) (speaker-concordance-rate (integer)) is equal to or higher than the predefined threshold and the speaker has been identified in step S427, the processing of steps S428 to S430 may be skipped and the processing of step S431 may be started.
(Steps S428 to S430)
The processing of steps S428 to S430 is executed when the listener nod match rate (%) (listener-concordance-rate (integer)) is equal to or higher than the predefined threshold in step S427, that is, when the nodding person detected in the image from which (v1) the image-based nod timing time table (nod-timings (string-array)) was generated has been judged to be a listener.
In this case, in step S428, the information processing apparatus 10 outputs a listener gaze target detection request to the server.
This is a request to detect the person (the speaker) located ahead of the line of sight of the nodding person (the listener) detected from the image.
In step S429, the server 20 transmits to the information processing apparatus 10 the detection information of the person (the speaker) located ahead of the line of sight of the nodding person (the listener) detected from the image. This is executed using the image data received from the information processing apparatus 10.
In step S430, the speaker identification unit 111 of the information processing apparatus 10 judges the person located ahead of that line of sight to be the speaker, on the basis of the detection information received from the server 20.
(Step S431)
The following processing of steps S431 to S434 is the same as the processing of steps S408 to S411 described above with reference to FIG. 11.
In step S431, the information processing apparatus 10 transmits a speaker information acquisition request to the server 20.
(Step S432)
Next, in step S432, the server 20 transmits the speaker information to the information processing apparatus 10.
The response from the server 20 contains the user information corresponding to the speaker ID, acquired from the user database on the server 20 side.
Specifically, it consists of the following information about the speaker:
sex (sex (string)),
age (age (integer)),
height (height (integer)),
physical characteristics (physical status (string)).
These are data already registered in the user database on the server 20 side.
(Step S433)
Next, in step S433, the information processing apparatus 10 transmits a speaker information registration request to the server 20.
(Step S434)
Next, in step S434, the server 20 transmits a speaker information registration completion notification to the information processing apparatus 10.
As described above with reference to FIG. 11, the registration information serving as the speaker information preferably includes the following data:
(a) face photograph,
(b) voice quality,
(c) features of the mouth movement,
(d) features of the speech sound,
(e) utterance content (tendencies of the utterance content),
(f) gender,
(g) age,
(h) height,
(i) physical characteristics.
For example, the server 20 registers the speaker information (a) to (i) in the database.
The processing described with reference to FIGS. 11 and 12 is the processing sequence in the case where the speech recognition processing, the nod timing generation processing, the lip-reading recognition processing, the nod detection processing, the collation of the character string recognition results, the collation of the nod timings, and so on are executed in a cloud-side device (server).
Various configurations are possible as to which of the processes included in the sequence are executed on the information processing apparatus 10 side and which are executed on the cloud (server 20) side.
[7. Specific usage examples of the information processing apparatus of the present disclosure]
Next, a plurality of specific usage examples of the information processing apparatus of the present disclosure that executes the above-described speaker specifying process will be described.
The following usage examples will be described.
(1) Usage example as an agent device
(2) Usage example as digital signage and vending machines
(3) Usage example as an in-vehicle device
(4) Usage examples in conference rooms and schools
(5) Usage example in entertainment
(6) Usage example in security cameras and surveillance cameras
(1) Usage example as an agent device
First, a usage example as an agent device will be described.
When there are multiple people in front of the agent device and one of them speaks to it, the agent device captures the speaker's lip movement with a camera and converts the utterance content into a character string using lip reading.
At the same time, the voice input through a microphone is converted into a character string using speech recognition. By comparing these two character strings, it can be determined whether the person whose lip movement was captured is the speaker.
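The comparison of the two character strings can be illustrated with a small sketch that scores their similarity and accepts the lip-movement candidate as the speaker when the score exceeds a threshold. The use of a character-level similarity ratio and the 0.7 threshold are assumptions for illustration, not values given in this disclosure.

```python
from difflib import SequenceMatcher

def is_speaker(asr_text: str, lip_text: str, threshold: float = 0.7) -> bool:
    """Return True if the lip reading result is close enough to the speech recognition result."""
    similarity = SequenceMatcher(None, asr_text, lip_text).ratio()
    return similarity >= threshold

# The person whose lips produced "turn on the light" matches the recognized audio.
print(is_speaker("turn on the lights", "turn on the light"))    # True
print(is_speaker("turn on the lights", "what time is it now"))  # False
```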
If the speaker can be identified in the camera-captured image, the agent device can infer information such as sex, age, and physical characteristics as features of the speaker, and can give a more appropriate answer to the speaker's request.
For example, if the speaker is recognized as a child, the device can answer in an easy-to-understand tone suited to a child, or it can recommend a restaurant with generous portions to a young, well-built man.
In addition, by registering personal information in a database together with a face photograph in advance, the registered information about the speaker can be referenced from the speaker's face photograph, and more detailed data about the speaker, such as hobbies and preferences, can be retrieved to give an answer that better matches the speaker's expectations.
The following processing also becomes possible by storing utterance feature information, such as the speaker's utterance content and speaking habits, in the database together with the face photograph.
After the speaker is identified during an utterance, the utterance feature information is input to the speech recognition or lip reading recognition engine, so that recognition tailored to the speaker can be performed and recognition accuracy can be improved.
Also, when there are a plurality of agent devices in the home, speaker identification is effective in preventing an agent device from malfunctioning in response to utterances from other agent devices or from TV or radio audio.
(2) Usage example as digital signage and vending machines
Next, a usage example as digital signage and vending machines will be described.
In front of digital signage and vending machines in town there are often multiple people, and they are often an unspecified large number of people for whom advance registration is not possible.
In such a situation, identifying the speaker according to the processing of the present disclosure enables the following.
Digital signage can judge age and sex from the speaker's appearance and display content tailored to the speaker.
In the case of a vending machine, products tailored to the speaker can be recommended.
Furthermore, by identifying the speaker, the utterance content and appearance information of a person who showed interest in, or commented on, the displayed content or a purchased product can be recorded together, which is useful for collecting market research data. For the same purpose, in a shop or the like, by combining a security camera or similar equipment with the above-described speaker specifying process, customers' utterances about products in the store can be recorded together with the speakers' characteristics.
(3) Usage example as an in-vehicle device
Next, a usage example as an in-vehicle device will be described.
In a voice-operated device such as a car navigation system, performing the speaker specifying process described above enables authority management that limits who can operate the device by voice, so that the device does not react to voices other than the driver's, such as a child's utterance.
In the case of a device that controls air conditioning by voice, identifying the speaker makes it possible to adjust the air conditioning as requested only in the area around the speaker.
(4) Usage examples in conference rooms and schools
Next, usage examples in conference rooms and schools will be described.
In a setting such as a meeting where multiple people speak, identifying the speaker using a camera that captures the entire room and a microphone makes it possible to automatically create minutes that combine the speaker information with the utterance content.
For example, in a nursery school, if minutes combining speaker information and utterance content can be created, the children's utterances can easily be reported to their parents, reducing the burden on nursery staff. Alternatively, by placing the data in the cloud, parents can check their child's utterances at any time. At school, counting the number of utterances of each student can be used to judge whether only some students are speaking in a class or many students are speaking.
(5) Usage example in entertainment
Next, a usage example in entertainment will be described.
In a large venue with an unspecified number of people, such as a party or a sports event, displaying the topic being discussed together with the speaker on a large monitor makes it possible to find people having an interesting conversation or to heighten the sense of presence when watching sports.
Furthermore, if the speaking performer can be identified during TV shooting, the performer's utterance content can be converted to text and displayed in real time like a speech balloon.
(6) Usage example in security cameras and surveillance cameras
Next, a usage example in security cameras and surveillance cameras will be described.
Using images captured by security cameras or surveillance cameras in town or in stores, together with audio input from their surroundings, it becomes possible to identify people who are discussing criminal activity or uttering specific words related to crime. This can help prevent incidents before they occur and help identify perpetrators.
As described above, the speaker specifying process of the present disclosure can be used in various fields.
In the processing of the present disclosure, the speaker is identified from among an unspecified number of persons even if the speaker has not been registered in advance. The speaker can also be identified when the movement of the speaker's lips cannot be captured. Identifying the speaker provides the following effects.
(a) Acquiring external feature information such as the speaker's age and sex makes it possible to provide a service tailored to the speaker.
(b) Recording the speaker's utterance content and habits and using them to train the recognition functions makes it possible to improve speech recognition and lip reading recognition accuracy.
(c) Providing a service only to a specific speaker makes it possible to give a specific person administrator authority.
(d) Utilizing the speaker's characteristics and utterance content as log data makes it possible to use the data for market research, for entertainment purposes, and for detecting conspiracy.
[8. Configuration examples of the information processing apparatus and the information processing system]
As described above with reference to FIGS. 11 and 12, the processing functions of the components of the information processing apparatus 10 shown in FIG. 4 can all be configured within a single apparatus, for example an agent device owned by the user or a device such as a smartphone or PC, but a configuration in which some of them are executed in a server or the like is also possible.
FIG. 13 shows examples of system configurations for executing the processing of the present disclosure.
(1) Information processing system configuration example 1 in FIG. 13 is an example in which almost all the functions of the information processing apparatus shown in FIG. 4 are configured within a single apparatus, namely the information processing apparatus 410, a user terminal such as a smartphone or PC owned by the user or an agent device having voice input/output and image input/output functions.
The information processing apparatus 410 corresponding to the user terminal communicates with the application execution server 420 only when an external application is used, for example when generating a response sentence.
The application execution server 420 is, for example, a weather information providing server, a traffic information providing server, a medical information providing server, a tourist information providing server, or the like, and consists of a group of servers capable of providing information for generating responses to user utterances.
On the other hand, (2) information processing system configuration example 2 in FIG. 13 is an example of a system in which some of the functions of the information processing apparatus shown in FIG. 4 are configured within the information processing apparatus 410, an information processing terminal such as a smartphone, PC, or agent device owned by the user, and the remainder are executed in a data processing server 460 capable of communicating with the information processing apparatus.
For example, a configuration is possible in which only the voice input unit 101, the image input unit 105, the voice output unit 125, and the image output unit 126 of the apparatus shown in FIG. 4 are provided on the information processing apparatus 410 (information processing terminal) side, and all other functions are executed on the server side.
Specifically, for example, the following system configuration can be constructed.
A voice input/output unit and an image input/output unit are provided in an information processing terminal such as a user terminal.
The data processing server then performs the speaker specifying process based on the voice and images received from the user terminal.
Alternatively, as in the sequence described above with reference to FIGS. 11 and 12, the server may generate the information necessary for the speaker specifying process and provide it to the information processing apparatus, with the final speaker specifying process executed on the information processing apparatus side.
Various configurations such as these are possible.
As a specific processing mode, for example, the following is possible.
The information processing terminal has:
a voice input unit;
an image input unit; and
a communication unit that transmits the voice acquired via the voice input unit and the camera-captured image acquired via the image input unit to the server.
The server has:
a voice recognition unit that executes voice recognition processing on the speaker's voice received from the information processing terminal; and
an image recognition unit that analyzes the camera-captured image received from the information processing terminal.
The voice recognition unit generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data for the speaker and the listener based on the character string.
The image recognition unit generates nod timing actual data recording the nod timing of the nod performer included in the camera-captured image.
At least one of the information processing terminal and the server executes the speaker specifying process based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
Such an information processing system can be configured.
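A minimal sketch of the coincidence check between the nod timing estimation data and the nod timing actual data is shown below. The tolerance window, the scoring rule, and the timestamp representation are assumptions made for illustration; the disclosure itself only requires that the timing data be compared and the hypothesis with the higher degree of coincidence (speaker-style or listener-style nodding) be selected.

```python
from typing import List

def timing_match_score(estimated: List[float], actual: List[float],
                       tolerance: float = 0.3) -> float:
    """Fraction of actual nods that fall within `tolerance` seconds of an estimated nod."""
    if not actual:
        return 0.0
    hits = sum(1 for t in actual
               if any(abs(t - e) <= tolerance for e in estimated))
    return hits / len(actual)

def classify_nod_performer(speaker_est: List[float], listener_est: List[float],
                           actual: List[float]) -> str:
    """Decide whether the nod performer's timing matches the speaker or the listener estimate."""
    speaker_score = timing_match_score(speaker_est, actual)
    listener_score = timing_match_score(listener_est, actual)
    return "speaker" if speaker_score >= listener_score else "listener"

# The actual nods line up with the listener-style timing (e.g. at phrase boundaries).
print(classify_nod_performer([1.2, 3.4], [2.0, 4.1, 6.3], [2.1, 4.0, 6.2]))  # -> "listener"
```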
Note that the division of functions between the information processing terminal side, such as a user terminal, and the server side can be set in various different ways, and a configuration in which a single function is executed on both sides is also possible.
[9. Hardware configuration example of the information processing apparatus]
Next, a hardware configuration example of the information processing apparatus will be described with reference to FIG. 14.
The hardware described with reference to FIG. 14 is an example of the hardware configuration of the information processing apparatus described above with reference to FIG. 4, and is also an example of the hardware configuration of the information processing apparatus constituting the data processing server 460 described with reference to FIG. 13.
A CPU (Central Processing Unit) 501 functions as a control unit and a data processing unit that execute various processes according to programs stored in a ROM (Read Only Memory) 502 or a storage unit 508. For example, it executes processing according to the sequences described in the above embodiments. A RAM (Random Access Memory) 503 stores the programs executed by the CPU 501, data, and the like. The CPU 501, the ROM 502, and the RAM 503 are connected to one another by a bus 504.
The CPU 501 is connected to an input/output interface 505 via the bus 504. Connected to the input/output interface 505 are an input unit 506 consisting of various switches, a keyboard, a mouse, a microphone, sensors, and the like, and an output unit 507 consisting of a display, a speaker, and the like. The CPU 501 executes various processes in response to commands input from the input unit 506 and outputs the processing results to, for example, the output unit 507.
The storage unit 508 connected to the input/output interface 505 consists of, for example, a hard disk, and stores the programs executed by the CPU 501 and various data. The communication unit 509 functions as a transmitting/receiving unit for Wi-Fi communication, Bluetooth (registered trademark) (BT) communication, and other data communication via networks such as the Internet and local area networks, and communicates with external devices.
The drive 510 connected to the input/output interface 505 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory such as a memory card, and records or reads data.
[10. Summary of the configuration of the present disclosure]
The embodiments of the present disclosure have been described in detail above with reference to specific examples. However, it is obvious that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present disclosure. That is, the present invention has been disclosed in the form of examples and should not be interpreted restrictively. In order to determine the gist of the present disclosure, the claims should be taken into consideration.
The technology disclosed in this specification can take the following configurations.
(1) An information processing apparatus having a speaker specifying unit that executes speaker specifying processing,
wherein the speaker specifying unit executes at least one of:
(a) speaker specifying processing based on a result of comparing a speech recognition result for an uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or
(b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance.
(2) The information processing apparatus according to (1), wherein the speaker specifying unit executes
(a) speaker specifying processing based on a result of comparing a speech recognition result for the uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips,
when the movement of the speaker's lips can be detected from the camera-captured image.
(3) The information processing apparatus according to (1) or (2), wherein the speaker specifying unit executes
(b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance,
when the movement of the speaker's lips cannot be detected from the camera-captured image.
(4) The information processing apparatus according to any one of (1) to (3), wherein the speaker specifying unit executes a comparison process between nod timing estimation data of the speaker or the listener, estimated based on the utterance character string obtained from the speech recognition result for the uttered voice, and nod timing actual data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is the speaker or the listener.
(5) The information processing apparatus according to (4), wherein the speaker specifying unit determines that the nod performer is the speaker when the degree of coincidence between the speaker nod timing estimation data, estimated based on the utterance character string obtained from the speech recognition result, and the nod timing actual data is high, and determines that the nod performer is the listener when the degree of coincidence between the listener nod timing estimation data, estimated based on the utterance character string obtained from the speech recognition result, and the nod timing actual data is high.
(6) The information processing apparatus according to (5), wherein, when it determines that the nod performer is the listener, the speaker specifying unit determines that a person in the gaze direction of the nod performer is the speaker.
(7) The information processing apparatus according to any one of (1) to (6), further having a voice recognition unit that executes voice recognition processing on the uttered voice and lip reading recognition processing that analyzes the utterance from the movement of the speaker's lips,
wherein the speaker specifying unit receives the voice recognition result and the lip reading recognition result generated by the voice recognition unit and executes the speaker specifying processing.
(8) The information processing apparatus according to (7), wherein the voice recognition unit further generates nod timing estimation data of the speaker and the listener based on the utterance character string obtained from the speech recognition result for the uttered voice, and
the speaker specifying unit executes a comparison process between the nod timing estimation data of the speaker and the listener generated by the voice recognition unit and the nod timing actual data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is the speaker or the listener.
(9) The information processing apparatus according to any one of (1) to (8), further having an image recognition unit that analyzes, from the camera-captured image, the motion of the speaker or the motion of a listener listening to the speaker's utterance,
wherein the speaker specifying unit receives the analysis information generated by the image recognition unit and executes the speaker specifying processing.
(10) The information processing apparatus according to (9), wherein the image recognition unit generates a timetable recording the nod timing of the nod performer included in the camera-captured image, and
the speaker specifying unit executes the speaker specifying processing using the nod timing timetable generated by the image recognition unit.
(11) The information processing apparatus according to any one of (1) to (10), further having:
a voice recognition unit that executes voice recognition processing on the uttered voice; and
an image recognition unit that analyzes a captured image of at least one of the speaker and the listener,
wherein the voice recognition unit estimates the speaker's nod timing and the listener's nod timing based on the character string obtained by executing the voice recognition processing on the uttered voice, and generates a speaker nod timing estimation timetable and a listener nod timing estimation timetable in which the estimated nod timing data are recorded,
the image recognition unit generates a nod timing timetable in which the nod timing of the nod performer included in the camera-captured image is recorded, and
the speaker specifying unit determines that the nod performer is the speaker when the degree of coincidence between the speaker nod timing estimation timetable and the nod timing timetable is high, and determines that the nod performer is the listener when the degree of coincidence between the listener nod timing estimation timetable and the nod timing timetable is high.
(12) An information processing system having an information processing terminal and a server,
wherein the information processing terminal has:
a voice input unit;
an image input unit; and
a communication unit that transmits the voice acquired via the voice input unit and the camera-captured image acquired via the image input unit to the server,
the server has:
a voice recognition unit that executes voice recognition processing on the speaker's voice received from the information processing terminal; and
an image recognition unit that analyzes the camera-captured image received from the information processing terminal,
the voice recognition unit generates a character string indicating the utterance content of the speaker and generates nod timing estimation data of the speaker and the listener based on the character string,
the image recognition unit generates nod timing actual data recording the nod timing of the nod performer included in the camera-captured image, and
at least one of the information processing terminal and the server executes speaker specifying processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
(13) An information processing method executed in an information processing apparatus,
the information processing apparatus having a speaker specifying unit that executes speaker specifying processing,
wherein the speaker specifying unit executes at least one of:
(a) speaker specifying processing based on a result of comparing a speech recognition result for an uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or
(b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance.
(14) An information processing method executed in an information processing system having an information processing terminal and a server,
wherein the information processing terminal transmits the voice acquired via a voice input unit and the camera-captured image acquired via an image input unit to the server,
the server executes voice recognition processing on the speaker's voice received from the information processing terminal, generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data of the speaker and the listener based on the character string,
the server analyzes the camera-captured image received from the information processing terminal and generates nod timing actual data recording the nod timing of the nod performer included in the camera-captured image, and
at least one of the information processing terminal and the server executes speaker specifying processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
(15) A program that causes an information processing apparatus to execute information processing,
the information processing apparatus having a speaker specifying unit that executes speaker specifying processing,
wherein the program causes the speaker specifying unit to execute at least one of:
(a) speaker specifying processing based on a result of comparing a speech recognition result for an uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or
(b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance.
The series of processes described in the specification can be executed by hardware, by software, or by a combined configuration of both. When executing the processing by software, a program recording the processing sequence can be installed in the memory of a computer built into dedicated hardware and executed, or the program can be installed and executed on a general-purpose computer capable of executing various kinds of processing. For example, the program can be recorded in advance on a recording medium. Besides being installed on a computer from the recording medium, the program can be received via a network such as a LAN (Local Area Network) or the Internet and installed on a recording medium such as a built-in hard disk.
The various processes described in the specification are not necessarily executed in time series according to the description; they may be executed in parallel or individually according to the processing capability of the apparatus executing them or as necessary. In this specification, a system is a logical collective configuration of a plurality of apparatuses, and the apparatuses of each configuration are not limited to being in the same housing.
As described above, according to the configuration of an embodiment of the present disclosure, an apparatus and a method that execute highly accurate speaker specifying processing under various circumstances are realized.
Specifically, for example, the speaker specifying unit that executes the speaker specifying processing executes (a) speaker specifying processing based on a result of comparing a speech recognition result for the uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or (b) speaker specifying processing based on an analysis result of the motion of the speaker or the listener. For example, a comparison process is executed between the nod timing estimation data of the speaker or the listener, estimated based on the utterance character string obtained as the speech recognition result, and the nod timing actual data of the nod performer included in the camera-captured image, and based on the degree of coincidence it is determined whether the nod performer included in the camera-captured image is the speaker or the listener.
With this configuration, an apparatus and a method that execute highly accurate speaker specifying processing under various circumstances are realized.
DESCRIPTION OF SYMBOLS
10 Information processing apparatus
11 Camera
12 Microphone
13 Display unit
14 Speaker
20 Server
30 External device
101 Voice input unit
102 Voice recognition unit
103 Utterance meaning analysis unit
105 Image input unit
106 Image recognition unit
111 Speaker specifying unit
112 Speaker analysis unit
113 Speaker data storage unit
114 Storage unit (user database, etc.)
121 Response generation unit
122 System utterance voice synthesis unit
123 Display image generation unit
124 Output (voice, image) control unit
125 Voice output unit
126 Image output unit
410 Information processing apparatus
420 Application execution server
460 Data processing server
501 CPU
502 ROM
503 RAM
504 Bus
505 Input/output interface
506 Input unit
507 Output unit
508 Storage unit
509 Communication unit
510 Drive
511 Removable media

Claims (15)

  1.  An information processing apparatus comprising a speaker specifying unit that executes speaker specifying processing,
     wherein the speaker specifying unit executes at least one of:
     (a) speaker specifying processing based on a result of comparing a speech recognition result for an uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or
     (b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance.
  2.  The information processing apparatus according to claim 1, wherein the speaker specifying unit executes
     (a) speaker specifying processing based on a result of comparing a speech recognition result for the uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips,
     when the movement of the speaker's lips can be detected from the camera-captured image.
  3.  The information processing apparatus according to claim 1, wherein the speaker specifying unit executes
     (b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance,
     when the movement of the speaker's lips cannot be detected from the camera-captured image.
  4.  The information processing apparatus according to claim 1, wherein the speaker specifying unit
     executes a comparison process between nod timing estimation data of the speaker or the listener, estimated based on the utterance character string obtained from the speech recognition result for the uttered voice, and nod timing actual data of the nod performer included in the camera-captured image, and
     determines whether the nod performer included in the camera-captured image is the speaker or the listener.
  5.  The information processing apparatus according to claim 4, wherein the speaker specifying unit
     determines that the nod performer is the speaker when the degree of coincidence between the speaker nod timing estimation data, estimated based on the utterance character string obtained from the speech recognition result, and the nod timing actual data is high, and
     determines that the nod performer is the listener when the degree of coincidence between the listener nod timing estimation data, estimated based on the utterance character string obtained from the speech recognition result, and the nod timing actual data is high.
  6.  The information processing apparatus according to claim 5, wherein, when it determines that the nod performer is the listener, the speaker specifying unit determines that a person in the gaze direction of the nod performer is the speaker.
  7.  The information processing apparatus according to claim 1, further comprising a voice recognition unit that executes voice recognition processing on the uttered voice and lip reading recognition processing that analyzes the utterance from the movement of the speaker's lips,
     wherein the speaker specifying unit receives the voice recognition result and the lip reading recognition result generated by the voice recognition unit and executes the speaker specifying processing.
  8.  The information processing apparatus according to claim 7, wherein the voice recognition unit further generates nod timing estimation data of the speaker and the listener based on the utterance character string obtained from the speech recognition result for the uttered voice, and
     the speaker specifying unit executes a comparison process between the nod timing estimation data of the speaker and the listener generated by the voice recognition unit and the nod timing actual data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is the speaker or the listener.
  9.  The information processing apparatus according to claim 1, further comprising an image recognition unit that analyzes, from the camera-captured image, the motion of the speaker or the motion of a listener listening to the speaker's utterance,
     wherein the speaker specifying unit receives the analysis information generated by the image recognition unit and executes the speaker specifying processing.
  10.  The information processing apparatus according to claim 9, wherein the image recognition unit generates a timetable recording the nod timing of the nod performer included in the camera-captured image, and
     the speaker specifying unit executes the speaker specifying processing using the nod timing timetable generated by the image recognition unit.
  11.  The information processing apparatus according to claim 1, further comprising:
     a voice recognition unit that executes voice recognition processing on the uttered voice; and
     an image recognition unit that analyzes a captured image of at least one of the speaker and the listener,
     wherein the voice recognition unit estimates the speaker's nod timing and the listener's nod timing based on the character string obtained by executing the voice recognition processing on the uttered voice, and generates a speaker nod timing estimation timetable and a listener nod timing estimation timetable in which the estimated nod timing data are recorded,
     the image recognition unit generates a nod timing timetable in which the nod timing of the nod performer included in the camera-captured image is recorded, and
     the speaker specifying unit determines that the nod performer is the speaker when the degree of coincidence between the speaker nod timing estimation timetable and the nod timing timetable is high, and determines that the nod performer is the listener when the degree of coincidence between the listener nod timing estimation timetable and the nod timing timetable is high.
  12.  An information processing system comprising an information processing terminal and a server,
     wherein the information processing terminal comprises:
     a voice input unit;
     an image input unit; and
     a communication unit that transmits the voice acquired via the voice input unit and the camera-captured image acquired via the image input unit to the server,
     the server comprises:
     a voice recognition unit that executes voice recognition processing on the speaker's voice received from the information processing terminal; and
     an image recognition unit that analyzes the camera-captured image received from the information processing terminal,
     the voice recognition unit generates a character string indicating the utterance content of the speaker and generates nod timing estimation data of the speaker and the listener based on the character string,
     the image recognition unit generates nod timing actual data recording the nod timing of the nod performer included in the camera-captured image, and
     at least one of the information processing terminal and the server executes speaker specifying processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  13.  An information processing method executed in an information processing apparatus,
     the information processing apparatus comprising a speaker specifying unit that executes speaker specifying processing,
     wherein the speaker specifying unit executes at least one of:
     (a) speaker specifying processing based on a result of comparing a speech recognition result for an uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or
     (b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance.
  14.  An information processing method executed in an information processing system comprising an information processing terminal and a server,
     wherein the information processing terminal transmits the voice acquired via a voice input unit and the camera-captured image acquired via an image input unit to the server,
     the server executes voice recognition processing on the speaker's voice received from the information processing terminal, generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data of the speaker and the listener based on the character string,
     the server analyzes the camera-captured image received from the information processing terminal and generates nod timing actual data recording the nod timing of the nod performer included in the camera-captured image, and
     at least one of the information processing terminal and the server executes speaker specifying processing based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  15.  A program that causes an information processing apparatus to execute information processing,
     the information processing apparatus comprising a speaker specifying unit that executes speaker specifying processing,
     wherein the program causes the speaker specifying unit to execute at least one of:
     (a) speaker specifying processing based on a result of comparing a speech recognition result for an uttered voice with a lip reading recognition result that analyzes the utterance from the movement of the speaker's lips, or
     (b) speaker specifying processing based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's utterance.
PCT/JP2018/042409 2018-02-01 2018-11-16 Information processing device, information processing system, information processing method, and program WO2019150708A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018016131 2018-02-01
JP2018-016131 2018-07-24

Publications (1)

Publication Number Publication Date
WO2019150708A1 true WO2019150708A1 (en) 2019-08-08

Family

ID=67478706

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/042409 WO2019150708A1 (en) 2018-02-01 2018-11-16 Information processing device, information processing system, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2019150708A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007094104A (en) * 2005-09-29 2007-04-12 Sony Corp Information processing apparatus, method, and program
JP2011186351A (en) * 2010-03-11 2011-09-22 Sony Corp Information processor, information processing method, and program
JP2012059121A (en) * 2010-09-10 2012-03-22 Softbank Mobile Corp Eyeglass-type display device
JP2017116747A (en) * 2015-12-24 2017-06-29 日本電信電話株式会社 Voice processing system, voice processing device, and voice processing program
WO2017168936A1 (en) * 2016-03-31 2017-10-05 ソニー株式会社 Information processing device, information processing method, and program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091824A (en) * 2019-11-30 2020-05-01 华为技术有限公司 Voice matching method and related equipment
WO2021104110A1 (en) * 2019-11-30 2021-06-03 华为技术有限公司 Voice matching method and related device
CN111091824B (en) * 2019-11-30 2022-10-04 华为技术有限公司 Voice matching method and related equipment
CN111816182A (en) * 2020-07-27 2020-10-23 上海又为智能科技有限公司 Hearing-aid voice recognition method and device and hearing-aid equipment

Similar Documents

Publication Publication Date Title
US11875820B1 (en) Context driven device arbitration
US11600291B1 (en) Device selection from audio data
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
US11138977B1 (en) Determining device groups
US10621991B2 (en) Joint neural network for speaker recognition
CN105940407B (en) System and method for assessing the intensity of audio password
CN111344780A (en) Context-based device arbitration
US11687526B1 (en) Identifying user content
Minotto et al. Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM
KR102356623B1 (en) Virtual assistant electronic device and control method thereof
US10699706B1 (en) Systems and methods for device communications
JP2019533181A (en) Interpretation device and method (DEVICE AND METHOD OF TRANSLATING A LANGUAGE)
US20190341053A1 (en) Multi-modal speech attribution among n speakers
CN107945806B (en) User identification method and device based on sound characteristics
KR20200090355A (en) Multi-Channel-Network broadcasting System with translating speech on moving picture and Method thererof
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
JP2013257418A (en) Information processing device, information processing method, and program
US20190066676A1 (en) Information processing apparatus
WO2019181218A1 (en) Information processing device, information processing system, information processing method, and program
JPWO2018043137A1 (en) INFORMATION PROCESSING APPARATUS AND INFORMATION PROCESSING METHOD
KR20210042520A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
JP2021076715A (en) Voice acquisition device, voice recognition system, information processing method, and information processing program
US11430429B2 (en) Information processing apparatus and information processing method
US11161038B2 (en) Systems and devices for controlling network applications
US20210271358A1 (en) Information processing apparatus for executing in parallel plurality of pieces of processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18903250

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18903250

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP