WO2018061173A1

WO2018061173A1 - Tv conference system, tv conference method, and program

Info

Publication number: WO2018061173A1
Application number: PCT/JP2016/078992
Authority: WO
Inventors: 俊二菅谷
Original assignee: 株式会社オプティム
Priority date: 2016-09-30
Filing date: 2016-09-30
Publication date: 2018-04-05

Abstract

[Problem] To display the look of a participant and who spoke what in an easily understandable manner in a TV conference system in which participants perform a TV conference, even when there is a plurality of participants at one place. [Solution] A TV conference system in which participants perform a TV conference, wherein the TV conference system is provided with an image analysis module 111 for analyzing an image of the TV conference in which participants are seen, and a face detection module 112 for detecting a portion of a participant that includes the face as a face portion, the detected face portions of a plurality of participants being displayed in list form by a face list display module 151.

Description

TV conference system, TV conference method, and program

In the TV conference system in which a participant performs a TV conference, even when there are a plurality of participants at one site, it is possible to appropriately display the faces of all the members at the connection destination and to read the facial expressions of the participants. The present invention relates to a TV conference system, a TV conference method, and a program that make it easy to understand who made what.

In a TV conference system in which a plurality of participants conduct a TV conference, the image quality of the image displayed during the remote conference is set so that at least a portion that is likely to be noticed by the participant has an appropriate image quality. A video conference apparatus that can be adjusted according to the above is disclosed (Patent Document 1).

JP 2012-138823 A

However, in the apparatus of Patent Document 1, only appropriate brightness / white balance adjustment is performed on either the speaker's face area or the display area of the information display object, and the number of participants is large. Even if the above-mentioned adjustment is performed, the display of the face area is too small, and it may be difficult to read each expression. In particular, this situation is likely to occur when participants sit in a large conference room with a gap between them. Furthermore, if the background of the place where the meeting is held is complicated, it becomes more difficult to read the facial expressions of each participant at once. In addition, when there are many participants, there is also a problem that it is difficult to understand who made what statement.

In the present invention, in view of these problems, in a video conference system in which a participant performs a video conference, even when there are a plurality of participants at one site, the faces of all the members are appropriately displayed at the connection destination, and the facial expressions of the participants are displayed. It is an object of the present invention to provide a TV conference system, a TV conference method, and a program that can easily read who said what.

The present invention provides the following solutions.

The invention according to the first feature is a TV conference system in which a participant performs a TV conference,
Image analysis means for image analysis of an image of the TV conference in which the participant is shown;
From the result of the image analysis, face detection means for detecting a part including the face of the participant as a face part;
Face list display means for displaying a list of images of the detected face parts of a plurality of participants;
A video conference system is provided.

According to the first aspect of the invention, in a TV conference system in which a participant performs a TV conference, an image analysis unit that performs image analysis of an image of the TV conference in which the participant is reflected; Face detection means for detecting a part including the face of the participant as a face part, and face list display means for displaying a list of images of the detected face parts of the plurality of participants.

The invention according to the first feature is a category of the TV conference system, but the TV conference method and the TV conference program exhibit the same operations and effects.

The invention according to the second feature is a TV conference system which is the invention according to the first feature,
The face list display means arranges the detected face part in the center of each display area, and arranges the display area in which the face part is arranged in the center to make a list display. Provide a system.

According to the invention relating to the second feature, in the video conference system according to the invention relating to the first feature, the face list display means arranges the detected face part in the center of each display area, and A list display is made by arranging display areas arranged in the center.

The invention according to the third feature is a TV conference system which is the invention according to the first feature or the second feature,
The face list display means provides a TV conference system in which, when displaying a list of images of the face portion, a background portion other than the detected face is replaced and displayed.

According to the invention relating to the third feature, in the video conference system according to the invention relating to the first feature or the second feature, the face list display means detects when displaying a list of images of the face part. The background part other than the selected face is replaced and displayed.

An invention according to a fourth feature is a TV conference system according to any one of the first feature to the third feature,
The face list display means provides a TV conference system, wherein after the start of list display, when the detected face portion satisfies a predetermined condition, the face list display means replaces and displays the image with another image.

According to the fourth feature of the present invention, in the TV conference system according to any one of the first feature to the third feature, the face list display means detects the detected face after starting the list display. When the face portion satisfies a predetermined condition, it is replaced with another image and displayed.

The invention according to the fifth feature is a TV conference system according to any one of the first feature to the fourth feature,
The face detection means further detects whether or not each participant of the detected face portion speaks,
The face list display means further provides a TV conference system that changes the attention level of a participant who detects that he / she is speaking when displaying a list of images of the face portion.

According to the fifth feature of the present invention, in the video conference system according to any one of the first feature to the fourth feature, the face detection means further includes a participant of the detected face part, The face list display means further changes the degree of attention of the participant who detects the speech when the face images are displayed as a list.

An invention according to a sixth feature is a TV conference system according to any one of the first feature to the fifth feature,
A speech detection means for detecting the speech of the participant;
Speaker determination means for determining a speaker of the detected speech;
A speech history display means for displaying a speech history of a participant selected from the list display;
A video conference system is provided.

According to the sixth aspect of the invention, in the TV conference system according to any one of the first to fifth aspects, the speech detection means for detecting the speech of the participant, and the detected A speaker determination unit that determines a speaker of a speech, and a speech history display unit that displays a speech history of a participant selected from the list display.

The invention according to a seventh aspect is a TV conference system in which a participant performs a TV conference,
Image analysis means for image analysis of an image of the TV conference in which the participant is shown;
From the result of the image analysis, a part including the face of the participant is detected as a face part, and further, a face detection unit for detecting whether or not each participant of the detected face part is speaking,
A list of face images of a plurality of detected participants is displayed, and a face list display for changing the attention level of the participant who has detected the speech when displaying the images of the face portions. Means,
A speech detection means for detecting the speech of the participant;
Speaker determination means for determining a speaker of the detected speech;
A statement history display means for displaying the participant's statement history;
A video conference system is provided.

According to the seventh aspect of the invention, in a TV conference system in which a participant performs a TV conference, the image analysis means for performing image analysis on an image of the TV conference in which the participant is reflected, and the result of the image analysis, Face detection means for detecting a part including a participant's face as a face part, and detecting whether each participant of the detected face part is speaking, and detected face parts of a plurality of participants A list of images, and a face list display means for changing the attention level of the participant who has detected the speech when the face image is displayed as a list, and detecting the speech of the participant A speech detection means for determining the speaker of the detected speech, and a speech history display means for displaying the speech history of the participant.

The invention according to an eighth feature is a video conference method in which a participant conducts a video conference,
Image analysis of an image of a TV conference in which the participant is shown;
Detecting a part including the face of the participant as a face part from the result of the image analysis;
Displaying a list of images of face portions of a plurality of detected participants;
A video conference method is provided.

The invention according to the ninth feature provides a computer system in which a participant conducts a TV conference.
Image analysis of an image of a video conference in which the participant is shown;
Detecting a part including the face of the participant as a face part from the result of the image analysis;
Displaying a list of images of face portions of a plurality of detected participants;
Provide a program to execute.

According to the present invention, in a TV conference system in which a participant performs a TV conference, even when there are a plurality of participants at one site, the faces of all the members can be appropriately displayed at the connection destination and the facial expressions of the participants can be read. It is possible to provide a TV conference system, a TV conference method, and a TV conference program that are easy to understand who made what kind of remarks.

FIG. 1 is a schematic diagram of a preferred embodiment of the present invention. FIG. 2 is a diagram illustrating the relationship between the functional blocks of the TV conference apparatus 100 and the functions. FIG. 3 is a flowchart of face list display processing in the TV conference apparatus 100. FIG. 4 is a diagram showing a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the transmitting-side TV conference device 100a and face list display processing is performed by the receiving-side TV conference device 100b. FIG. 5 is a flowchart when image analysis processing and face detection processing are performed by the transmitting-side TV conference device 100a, and face list display processing is performed by the receiving-side TV conference device 100b. FIG. 6 is a diagram illustrating a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the computer 200 and face list display processing is performed by the TV conference device 100b on the receiving side. FIG. 7 is a flowchart when the image analysis process and the face detection process are performed by the computer 200 and the face list display process is performed by the TV conference device 100b on the receiving side. FIG. 8 is a diagram illustrating the relationship between the function blocks and the functions when the face list display process and the speech history display process are performed in the TV conference apparatus 100. FIG. 9 is a flowchart of face list display processing and speech history display processing in the TV conference apparatus 100. FIG. 10 is a diagram illustrating an example of a display of a general TV conference. FIG. 11 is a diagram illustrating an example of the display of the face list display process. FIG. 12 is a diagram illustrating an example of a display in which the background is replaced in the face list display process. FIG. 13 is a diagram illustrating an example of a display in which a face portion satisfying a predetermined condition is replaced in the face list display process. FIG. 14 is a diagram illustrating an example of a display for changing the attention level of a participant who has detected that he / she is speaking in the face list display process. FIG. 15 is a diagram illustrating an example of the display of the face list display process and the speech history display process.

Hereinafter, the best mode for carrying out the present invention will be described with reference to the drawings. This is merely an example, and the technical scope of the present invention is not limited to this.

[Outline of video conference system]
FIG. 1 is a schematic diagram of a preferred embodiment of the present invention. The outline of the present invention will be described with reference to FIG.

As shown in FIG. 2, the TV conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The control unit 110 implements the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130. Further, the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.

The TV conference apparatus 100 may have the above-described configurations as the entire apparatus, and may be in the form of an internal device or an external device. For example, the TV conference device 100 includes a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic product such as a netbook terminal, a slate terminal, an electronic book terminal, and a portable music player, a smart glass, and a head-mounted display. Wearable terminals such as, and other items. The smartphone illustrated as the TV conference apparatus 100a in FIG. 2 and the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.

1, first, the image analysis module 111 of the TV conference device 100 receives a captured image from the connected TV conference device 100 via the communication unit 120 (step S01). The captured image here is a precise image having an amount of information necessary for image analysis, and the number of pixels and the image quality can be designated. Audio data is also received along with the captured image. The received captured image and audio data are saved in the storage unit 130.

Next, the image analysis module 111 of the TV conference apparatus 100 performs image analysis of the received captured image (step S02). The image analysis here is an analysis of the positions and number of participants in the conference. In addition, gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like. In the example of FIG. 1, the image analysis module 111 specifies that there are four participants and their positions by image analysis.

Next, the face detection module 112 of the TV conference device 100 detects a part including each participant's face as a face part based on the image analysis result of step S02 (step S03). The face detection here is for specifying the participant's head position. For example, even when the participant is not facing the camera and parts such as eyes and mouth cannot be found, the temporal region Or the back of the head may be detected as a face. In the example of FIG. 1, the face detection module 112 detects four faces of a participant 1001, a participant 1002, a participant 1003, and a participant 1004.

In order to perform these image analysis and face detection, a human may perform teacher learning, or may use machine learning or deep learning (deep learning). In addition, since a large amount of data is required for learning, the learned image analysis module 111 and face detection module 112 may be acquired from the outside via the communication unit 120. However, with respect to image analysis and face detection methods, this patent is not limited, and existing techniques can be used.

Finally, the face list display module 151 of the TV conference apparatus 100 displays a list of detected images of the face portions of the plurality of participants on the output unit 150 (step S04). Specifically, as the face list display method, the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area. Depending on the number of participants, the display area with the face portion arranged in the center is arranged as a face list display. The face portion to be displayed here is not limited to the head, but may be from the head to the chest. In the example of FIG. 1, the participant 1001 is placed in the area 101, the participant 1002 is placed in the area 102, the participant 1003 is placed in the area 103, and the participant 1004 is placed in the area 104, and the face list is displayed. Further, as shown in FIG. 1, a captured image representing the entire state may be displayed in the empty area 105 where the face portion is not displayed.

FIG. 10 is a diagram showing an example of a general video conference display. Since the captured image is displayed as it is on the output unit 150, the overall atmosphere is transmitted, but the display of the face portion is small, so the facial expressions of the participant 1001, the participant 1002, the participant 1003, and the participant 1004, respectively. Is difficult to read.

FIG. 11 is a diagram showing an example of the display of the face list display process of the present invention. Since the participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 is displayed in the area 1102, the participant 1003 is displayed in the area 1103, and the participant 1004 is displayed in the area 1104, the face of each participant is displayed. It can be seen that the display is large and facial expressions are easy to read. Here, in the area 1105, “TV conference system 2016/9/9 15:07:19 << Connecting to XX office >> Call start: 2016/9/9 14:05:33 Destination participant: 4 In this example, the “name”, the date and time, the connection destination, the call start time, and the number of destination participants are displayed. In this way, information may be displayed in the empty area, or a captured image representing the entire state may be displayed. Further, when the area is divided, the entire output unit 150 may be divided into a face list display area without creating an empty area. At that time, the size of each participant's area may be different.

As described above, according to the present invention, in a TV conference system in which a participant conducts a TV conference, even when there are a plurality of participants at one site, the faces of all the members are appropriately displayed at the connection destination. It is possible to provide a TV conference system, a TV conference method, and a TV conference program that can read facial expressions.

[Description of each function]
FIG. 2 is a diagram illustrating the relationship between the functional blocks of the TV conference apparatus 100 and the functions. The video conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The control unit 110 implements the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130. Further, the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. The communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference apparatus 100a and the TV conference apparatus 100b.

The TV conference apparatus 100 includes, as the camera unit 10, an imaging device such as a lens, an imaging device, various buttons, and a flash, and captures images as captured images such as moving images and still images. An image obtained by imaging is a precise image having an amount of information necessary for image analysis, and the number of pixels and the image quality can be designated. Further, it is assumed that the camera unit 10 includes a microphone for acquiring audio data combined with moving image capturing or can use the microphone function of the input unit 140.

The control unit 110 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like. The control unit 110 implements the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130.

As the communication unit 120, a device for enabling communication with other devices, for example, a WiFi (Wireless Fidelity) compatible device compliant with IEEE 802.11 or an IMT-2000 standard such as a third generation or fourth generation mobile communication system. Compliant wireless device etc. It may be a wired LAN connection.

Further, the storage unit 130 includes a data storage unit such as a hard disk or a semiconductor memory, and stores data necessary for processing of captured images, image analysis results, face detection results, and the like.

Suppose that the input unit 140 has functions necessary for using the TV conference system. As an example for realizing the input, a liquid crystal display that realizes a touch panel function, a keyboard, a mouse, a pen tablet, a hardware button on the apparatus, a microphone for performing voice recognition, and the like can be provided. The function of the present invention is not particularly limited by the input method.

Suppose that the output unit 150 has functions necessary for using the TV conference system. The output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. As an example for realizing the output, forms such as a liquid crystal display, a PC display, a projection on a projector, and an audio output can be considered. The function of the present invention is not particularly limited by the output method.

[Face list display processing]
FIG. 3 is a flowchart of face list display processing in the TV conference apparatus 100. Processing executed by each module described above will be described in accordance with this processing. Here, an example of a flowchart in the case where the image analysis process, the face detection process, and the face list display process are performed in the TV conference device 100a on the captured image reception side is shown as an example.

First, the control unit 110 of the TV conference device 100a notifies the connection destination TV conference device 100b of the start of the TV conference via the communication unit 120 (step S301).

Upon receiving this notification, the control unit 110 of the TV conference apparatus 100b starts imaging with the camera unit 10 (step S302). Here, it is assumed that the captured image is a precise image having an amount of information necessary for image analysis in the TV conference apparatus 100a, and the number of pixels and the image quality can be designated. In addition, audio data is acquired together with moving image capturing.

Next, the control unit 110 of the TV conference device 100b transmits the captured image to the TV conference device 100a via the communication unit 120 (step S303). When the captured image is a moving image, audio data is also transmitted.

The image analysis module 111 of the TV conference apparatus 100a receives a captured image from the TV conference apparatus 100b via the communication unit 120 (step S304). Audio data is also received along with the captured image. The received captured image and audio data are saved in the storage unit 130.

The image analysis module 111 of the TV conference apparatus 100a performs image analysis of the received captured image (step S305). The image analysis here is an analysis of the positions and number of participants in the conference. In addition, gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.

Next, the face detection module 112 of the TV conference device 100a detects a part including each participant's face as a face part based on the image analysis result of step S305 (step S306). The face detection here is for specifying the participant's head position. For example, even when the participant is not facing the camera and parts such as eyes and mouth cannot be found, the temporal region Or the back of the head may be detected as a face.

Based on the results of the image analysis and face detection, the face list display module 151 of the TV conference device 100a displays a list of detected face part images of the plurality of participants on the output unit 150 (step S307). Specifically, as the face list display method, the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area. The face portion to be displayed here is not limited to the head, but may be from the head to the chest. Depending on the number of participants, the display area with the face portion arranged in the center is arranged as a face list display. The face portion to be displayed here is not limited to the head, but may be from the head to the chest. Here, when the face part is placed, if the face size varies from participant to participant simply by cutting out the whole image as it is, even if the size is automatically adjusted so that it will be the same size Good.

FIG. 11 is a diagram showing an example of the display of the face list display process of the present invention. Since the participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 is displayed in the area 1102, the participant 1003 is displayed in the area 1103, and the participant 1004 is displayed in the area 1104, the face display of each participant is displayed. It can be seen that it is large and easy to read facial expressions. Here, in the area 1105, “TV conference system 2016/9/9 15:07:19 << Connecting to XX office >> Call start: 2016/9/9 14:05:33 Destination participant: 4 In this example, the “name”, the date and time, the connection destination, the call start time, and the number of destination participants are displayed. In this way, information may be displayed in the empty area, or a captured image representing the entire state may be displayed. Further, when the area is divided, the entire output unit 150 may be divided into a face list display area without creating an empty area. At that time, the size of each participant's area may be different.

After displaying the face list, the control unit 110 of the TV conference device 100a confirms whether or not to end the TV conference (step S308). It is assumed that the user can designate the end of the video conference via the input unit 140. If the video conference is to be terminated, the process proceeds to the next step S309. If the video conference is not to be terminated, the process returns to step S304 to continue the processing.

When ending the TV conference, the control unit 110 of the TV conference device 100a notifies the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S309).

The TV conference device 100b confirms whether or not to end the TV conference (step S310). If the TV conference apparatus 100b does not end, the process returns to step S302 to continue the process, and if it ends, the TV conference ends.

Here, for simplification of the flowchart, only the processing in the case of displaying a captured image on the TV conference device 100b on the TV conference device 100a is described, but in a normal TV conference, the captured image on the TV conference device 100a is described. Is displayed on the TV conference apparatus 100b in parallel. In addition, the notification of the TV conference start and the TV start end is described only in the flow performed from the TV conference apparatus 100a to the TV conference apparatus 100b, but there is no problem even if it is performed from the TV conference apparatus 100b to the TV conference apparatus 100a.

[Background replacement function]
FIG. 12 is a diagram illustrating an example of a display in which the background is replaced in the face list display process. Compared to FIG. 11, it can be seen that the backgrounds of the participants in the

areas

1201, 1202, 1203, and 1204 of the output unit 150 are replaced with uniform backgrounds. The background replacement process may be performed when the face list display module 151 performs the face list display process of step S307 in the flowchart of FIG. In this way, by replacing the background portion and displaying it, extra information is eliminated and the facial expressions of each participant are easier to read. Here, the same background is used for all of the participant 1001, the participant 1002, the participant 1003, and the participant 1004, but the background may be changed depending on the participant. In addition, when a TV conference is held at three or more locations, the background of the participants may be changed for each location, and which participants are in the same space may be displayed in an easy-to-understand manner. Here, a captured image representing the overall state is displayed in the area 1205. However, as shown in FIG. 11, the date and time, the connection destination, the call start time, the number of participants of the other party, and the like may be displayed. Further, information such as “background: being replaced” may be displayed somewhere on the output unit 150 as shown in the display 1206. Whether or not to use the background replacement function may be set by the user at a desired timing, or the setting may be saved.

[Face replacement function]
FIG. 13 is a diagram illustrating an example of a display in which a face portion satisfying a predetermined condition is replaced in the face list display process. The face part replacement process may be performed when the face list display module 151 performs the face list display process of step S307 in the flowchart of FIG. The face list display module 151 determines a predetermined condition, and replaces the face portion when the condition is satisfied. The predetermined condition here may be, for example, a case where there are no participants in the captured image, that is, a case where the participant goes out of the shooting range of the camera. Comparing FIG. 13 with FIG. 12, it can be seen that an image 1307 is displayed in the area 1302 instead of the participant 1002 because the participant 1002 disappears. Here, since an example in which a captured image representing the entire state is displayed in the area 1305 is illustrated, it can be confirmed that the participant 1002 is absent. Since the other three participants are present, the participant 1001 is displayed in the area 1301, the participant 1003 is displayed in the area 1303, and the participant 1004 is displayed in the area 1304 as in FIG. Here, the image 1307 may be a still image of the participant 1002, a favorite illustration, an avatar of the participant 1002, or the like, and can be set by the participant according to his / her preference. The area 1305 may display date and time, connection destination, call start time, number of participants at the other end, and the like. Further, information such as “Away” may be displayed somewhere in the area 1302 as in the display 1306.

[Attention change function]
FIG. 14 is a diagram illustrating an example of a display for changing the attention level of a participant who has detected that he / she is speaking in the face list display process. In the attention level changing process, when the face detection module 112 performs the face detection process of step S306 in the flowchart of FIG. 3, it is detected whether or not each participant is speaking, and the face list display module 151 When performing the face list display process in step S307 in the flowchart of FIG. 3, the attention level of the participant who has detected that he / she is speaking is changed. Here, the attention level change is performed in order to make the speaker stand out and to attract attention. Specifically, as can be seen by comparing FIG. 13 and FIG. 14, it is conceivable to change the background color as shown in a region 1404. Since the other three participants are not speaking, the participant 1001 is displayed in the area 1401, the participant 1002 is displayed in the area 1402, and the participant 1003 is displayed in the area 1403 as in FIG. As another example of the attention level change, information “speaking” may be displayed somewhere in the area 1404 like a display 1406. Alternatively, the area 1404 where the speaking participant 1004 is displayed may be placed in a conspicuous part such as the center of the output unit 150 (not shown). Further, when the entire display is performed in the area 1405, the position of the speaker may be indicated by a mark or the like as shown in the display 1407 in conjunction with the attention level changing process described above. Here, an example in which the participant who speaks is only the participant 1004 is illustrated, but when a plurality of participants speak, a process of changing the degree of attention is performed on a plurality of people. Also good.

[Image analysis processing and face detection processing on the transmission side, face list display processing on the reception side]
FIG. 4 is a diagram showing a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the transmitting-side TV conference device 100a and face list display processing is performed by the receiving-side TV conference device 100b. The video conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The control unit 110 implements the image analysis module 113 in cooperation with the camera unit 10 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130. Further, the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. The communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference apparatus 100a and the TV conference apparatus 100b.

FIG. 5 is a flowchart when the image analysis process and the face detection process are performed by the transmitting-side TV conference apparatus 100a, and the face list display process is performed by the receiving-side TV conference apparatus 100b. Processing executed by each module described above will be described in accordance with this processing.

First, the control unit 110 of the TV conference device 100a notifies the connection destination TV conference device 100b of the start of the TV conference via the communication unit 120 (step S501).

Next, the control unit 110 of the TV conference device 100a starts imaging with the camera unit 10 (step S502). Here, it is assumed that the captured image is a precise image having an amount of information necessary for image analysis in the TV conference apparatus 100a, and the number of pixels and the image quality can be designated. In addition, audio data is acquired together with moving image capturing. The captured image and audio data are stored in the storage unit 130.

The image analysis module 113 of the TV conference apparatus 100a performs image analysis of the captured image (step S503). The image analysis here is an analysis of the positions and number of participants in the conference. In addition, gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.

Next, the face detection module 112 of the TV conference device 100a detects a part including each participant's face as a face part based on the image analysis result of step S503 (step S504). The face detection here is for specifying the participant's head position. For example, even when the participant is not facing the camera and parts such as eyes and mouth cannot be found, the temporal region Or the back of the head may be detected as a face.

In order to perform these image analysis and face detection, a human may perform teacher learning, or may use machine learning or deep learning (deep learning). Further, since a large amount of data is required for learning, the learned image analysis module 113 and the face detection module 112 may be acquired from the outside via the communication unit 120. However, with respect to image analysis and face detection methods, this patent is not limited, and existing techniques can be used.

Next, the control unit 110 of the TV conference device 100a transmits the analysis image to the TV conference device 100b via the communication unit 120 (step S505). When the captured image is a moving image, audio data is also transmitted. The analysis image transmitted here may be only the data determined to be necessary for the face list display as a result of the image analysis and face detection, or the captured image itself that was captured by the TV conference device 100a and used for the image analysis, or You may send the captured image which changed the resolution.

The TV conference device 100b receives an analysis image from the TV conference device 100a via the communication unit 120 (step S506). Audio data is also received along with the analysis image. The received analysis image and audio data are saved in the storage unit 130.

Based on the received analysis image and audio data, the face list display module 151 of the TV conference device 100b displays a list of detected face part images of the plurality of participants on the output unit 150 (step S507). Specifically, as the face list display method, the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area. Depending on the number of participants, the display area with the face portion arranged in the center is arranged as a face list display. The face portion to be displayed here is not limited to the head, but may be from the head to the chest. When a face part is placed, if the face size varies depending on the participant, the size may be automatically adjusted so that the face size is the same if the face image is simply cut out from the entire image. .

The control unit 110 of the TV conference device 100a confirms whether or not to end the TV conference (step S508). It is assumed that the user can designate the end of the video conference via the input unit 140. If the video conference is to be ended, the process proceeds to the next step S509. If the video conference is not to be ended, the process returns to step S502 and the process is continued.

When ending the TV conference, the control unit 110 of the TV conference device 100a notifies the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S509).

Here, for simplification of the flowchart, only the processing in the case of displaying a captured image on the TV conference apparatus 100a on the TV conference apparatus 100b is described, but in a normal TV conference, the captured image on the TV conference apparatus 100b is described. Is displayed on the TV conference apparatus 100a in parallel. In addition, the notification of the TV conference start and the TV start end is described only in the flow performed from the TV conference apparatus 100a to the TV conference apparatus 100b, but there is no problem even if it is performed from the TV conference apparatus 100b to the TV conference apparatus 100a.

2 and 3 is configured to perform image analysis processing, face detection processing, and face list display processing on the reception side, image analysis processing and face detection processing on the transmission side in FIGS. 4 and 5, and face list display processing on the reception side. When comparing with the configuration to be performed, in FIG. 4 and FIG. 5, it is only necessary to transmit the analysis image, so that the effect of suppressing the communication data amount can be expected particularly when the background has been replaced.

[Image analysis processing and face detection processing by computer 200, and face list display processing at reception side]
FIG. 6 is a diagram illustrating a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the computer 200 and face list display processing is performed by the TV conference device 100b on the receiving side. The video conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. The computer 200 includes a control unit 210, a communication unit 220, and a storage unit 230. The control unit 210 implements the image analysis module 211 in cooperation with the communication unit 220 and the storage unit 230. In addition, the control unit 210 implements the face detection module 212 in cooperation with the storage unit 230. The communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference apparatus 100a and the computer 200 and between the TV conference apparatus 100b and the computer 200.

The TV conference apparatus 100 may have the above-described configurations as the entire apparatus, and may be in the form of an internal device or an external device. For example, the TV conference device 100 includes a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic product such as a netbook terminal, a slate terminal, an electronic book terminal, and a portable music player, a smart glass, and a head-mounted display. Wearable terminals such as, and other items. The smartphone illustrated as the TV conference apparatus 100a in FIG. 6 and the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.

The control unit 110 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like.

The computer 200 may be a general computer having the functions described below. Although not described here, an input unit and an output unit may be provided as necessary.

The computer 200 includes a CPU, RAM, ROM and the like as the control unit 210. The control unit 210 implements the image analysis module 211 in cooperation with the communication unit 220 and the storage unit 230. In addition, the control unit 210 implements the face detection module 212 in cooperation with the storage unit 230.

A device for enabling communication with other devices as the communication unit 220, for example, a WiFi compatible device compliant with IEEE802.11 or a wireless device compliant with the IMT-2000 standard such as a third generation or fourth generation mobile communication system Etc. It may be a wired LAN connection.

The storage unit 230 includes a data storage unit using a hard disk or a semiconductor memory. The storage unit 230 holds data such as acquired captured images, image analysis results, and face detection results.

FIG. 7 is a flowchart when the image analysis process and the face detection process are performed by the computer 200 and the face list display process is performed by the TV conference device 100b on the receiving side. Processing executed by each module described above will be described in accordance with this processing.

First, the control unit 110 of the TV conference apparatus 100a notifies the start of the TV conference to the computer 200 and the connected TV conference apparatus 100b via the communication unit 120 (step S701). When the TV conference device 100a and the TV conference device 100b do not communicate directly, the computer 200 that has received the notification of the start of the TV conference from the TV conference device 100a notifies the TV conference device 100b of the start of the TV conference. Do.

Next, the control unit 110 of the TV conference device 100a starts imaging with the camera unit 10 (step S702). Here, it is assumed that the captured image is a precise image having an amount of information necessary for image analysis by the computer 200, and the number of pixels and the image quality can be designated. In addition, audio data is acquired together with moving image capturing. The captured image and audio data are stored in the storage unit 130.

Next, the control unit 110 of the TV conference device 100a transmits the captured image to the computer 200 via the communication unit 120 (step S703). When the captured image is a moving image, audio data is also transmitted.

The image analysis module 211 of the computer 200 receives a captured image from the TV conference device 100a via the communication unit 220 (step S704). Audio data is also received along with the captured image. The received captured image and audio data are stored in the storage unit 230.

The image analysis module 211 of the computer 200 performs image analysis of the received captured image (step S705). The image analysis here is an analysis of the positions and number of participants in the conference. In addition, gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.

Next, the face detection module 212 of the computer 200 detects a part including the face of each participant as a face part based on the image analysis result of step S705 (step S706). The face detection here is for specifying the participant's head position. For example, even when the participant is not facing the camera and parts such as eyes and mouth cannot be found, the temporal region Or the back of the head may be detected as a face.

In order to perform these image analysis and face detection, a human may perform teacher learning, or may use machine learning or deep learning (deep learning). The learned image analysis module 211 and face detection module 212 may be acquired from the outside. However, with respect to image analysis and face detection methods, this patent is not limited, and existing techniques can be used.

Next, the control unit 210 of the computer 200 transmits the analysis image to the TV conference device 100b via the communication unit 220 (step S707). When the captured image is a moving image, audio data is also transmitted. The analysis image to be transmitted here may be only data determined to be necessary for the face list display as a result of image analysis and face detection, or is taken by the TV conference apparatus 100a and used for image analysis by the computer 200. You may send the captured image itself or the captured image which changed resolution.

The TV conference apparatus 100b receives an analysis image from the computer 200 via the communication unit 120 (step S708). Audio data is also received along with the analysis image. The received analysis image and audio data are saved in the storage unit 130.

Based on the received analysis image and audio data, the face list display module 151 of the TV conference apparatus 100b displays a list of detected face part images on the output unit 150 (step S709). Specifically, as the face list display method, the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area. Depending on the number of participants, the display area with the face portion arranged in the center is arranged as a face list display. The face portion to be displayed here is not limited to the head, but may be from the head to the chest. When a face part is placed, if the face size varies depending on the participant, the size may be automatically adjusted so that the face size is the same if the face image is simply cut out from the entire image. .

The control unit 110 of the TV conference device 100a confirms whether or not to end the TV conference (step S710). It is assumed that the user can designate the end of the video conference via the input unit 140. If the TV conference is to be ended, the process proceeds to the next step S711. If the TV conference is not to be ended, the process returns to step S702 to continue the processing.

When ending the TV conference, the control unit 110 of the TV conference device 100a notifies the computer 200 and the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S711). When the TV conference apparatus 100a and the TV conference apparatus 100b do not directly communicate with each other, the computer 200 that receives the TV conference end notification from the TV conference apparatus 100a notifies the TV conference apparatus 100b of the TV conference end notification. Do.

Here, for simplification of the flowchart, only the processing in the case of displaying a captured image on the TV conference apparatus 100a on the TV conference apparatus 100b is described, but in a normal TV conference, the captured image on the TV conference apparatus 100b is described. Is displayed on the TV conference apparatus 100a in parallel. In addition, the notification of the TV conference start and the TV start end is described only in the flow performed from the TV conference apparatus 100a to the TV conference apparatus 100b, but there is no problem even if it is performed from the TV conference apparatus 100b to the TV conference apparatus 100a. Also in these cases, when the TV conference apparatus 100a and the TV conference apparatus 100b are configured not to perform direct communication, the notification is performed via the computer 200, respectively.

When comparing the configuration in which the video conference apparatus 100 performs image analysis processing, face detection processing, and face list display processing with the configuration in which image analysis processing, face detection processing, and face list display processing are performed by the computer 200, the latter An advantage is that the analysis module 211 and the face detection module 212 can be easily updated. Further, since a large amount of data is required for machine learning and deep learning, the computer 200 has an advantage that a large-capacity storage is easily provided in the storage unit 230 in this respect.

[Speaking history display function]
FIG. 8 is a diagram illustrating the relationship between the function blocks and the functions when the face list display process and the speech history display process are performed in the TV conference apparatus 100. In addition to the configuration of FIG. 2, the control unit 110 implements a speech detection module 114 and a speaker determination module 115 in cooperation with the communication unit 120 and the storage unit 130. In addition, the output unit 150 implements the message history display module 152 in cooperation with the control unit 110 and the storage unit 130.

The TV conference apparatus 100 may have the above-described configurations as the entire apparatus, and may be in the form of an internal device or an external device. For example, the TV conference device 100 includes a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic product such as a netbook terminal, a slate terminal, an electronic book terminal, and a portable music player, a smart glass, and a head-mounted display. Wearable terminals such as, and other items. The smartphone illustrated as the TV conference apparatus 100a in FIG. 8, the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.

FIG. 9 is a flowchart of the face list display process and the speech history display process in the TV conference apparatus 100. Processing executed by each module described above will be described in accordance with this processing. Here, an example in which a series of processing from image analysis processing to speech history display processing is performed by the video conference apparatus 100 on the captured image receiving side is shown. However, as in the case of the face list display process, the image analysis process, the face detection process, the speech detection process, and the speaker determination process are performed by the TV conference apparatus 100 or the computer 200 on the captured image transmission side, as described above. The TV conference device 100 on the receiving side may be configured to perform only face list display processing and speech history display processing.

First, the image analysis module 111 of the TV conference device 100 on the captured image receiving side receives a captured image via the communication unit 120 (step S901). Audio data is also received along with the captured image. The received captured image and audio data are saved in the storage unit 130. Here, for simplification of the flowchart, the TV conference start notification is not described, but it is assumed that the TV conference start notification is performed before step S901.

Next, the image analysis module 111 performs image analysis of the received captured image (step S902). The image analysis here is an analysis of the positions and number of participants in the conference. In addition, gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.

Next, the face detection module 112 detects a part including each participant's face as a face part based on the image analysis result of step S902 (step S903). The face detection here is for specifying the participant's head position. For example, even when the participant is not facing the camera and parts such as eyes and mouth cannot be found, the temporal region Or the back of the head may be detected as a face.

Next, the speech detection module 114 detects the speech of each participant based on the received audio data (step S904). In this speech detection, the content of received voice data is analyzed and converted into text by voice recognition. If multiple people are speaking at the same time and it is difficult to isolate the speech, use the image analysis result in step S902, the face detection result in step S903, etc. Improvements may be made. Regarding the speech recognition method, this patent is not limited, and it is assumed that the existing technology can be used.

Next, the speaker determination module 115 determines a speaker based on the image analysis result in step S902, the face detection result in step S903, the speech detection result in step S904, and the like (step S905). The speaker determination here specifies which participant is speaking using the captured image, the mouth movement based on the analysis image, the voice height, the input direction, and the like, and the content of the speech detected in step S904. It is to be tied. The results of these processes are stored in the storage unit 130 as data as to which participant made what and when.

In order to perform such speech detection and speaker determination, a human may perform teacher learning, or may use machine learning or deep learning (deep learning). In addition, since a large amount of data is required for learning, the learned speech detection module 114 and the speaker determination module 115 may be acquired from the outside via the communication unit 120. However, regarding the method of speech detection and speaker determination, this patent is not limited and existing technology can be used.

Based on the results of image analysis and face detection, the face list display module 151 displays a list of detected face part images of a plurality of participants on the output unit 150 (step S906). Specifically, as the face list display method, the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area. Depending on the number of participants, the display area with the face portion arranged in the center is arranged as a face list display. The face portion to be displayed here is not limited to the head, but may be from the head to the chest. When a face part is placed, if the face size varies depending on the participant, the size may be automatically adjusted so that the face size is the same if the face image is simply cut out from the entire image. .

After the face list is displayed, the speech history display module 152 confirms whether or not to display the speech history (step S907). It is assumed that the user can specify the message history display via the input unit 140. If the message history is displayed, the process proceeds to the next step S908. If the message history is not displayed, the process is terminated.

When displaying the speech history, the speech history display module 152 causes the user to select a participant who displays the speech history via the input unit 140 (step S908). Here, the number of participants to be selected is not limited, and one, a plurality of people, or all of the participants may be selected. In addition, it is possible to select a setting that always displays the speech history of all members without selecting the speaker.

Finally, the speech history display module 152 displays the speech history of the selected participant on the output unit 150 (step S909). Here, for simplification of the flowchart, the TV conference end notification is not described. However, when the TV conference ends, it is assumed that the TV conference end notification is sent to the destination TV conference apparatus 100.

FIG. 15 is a diagram showing an example of the display of the face list display process and the speech history display process. Since the participant 1001 is displayed in the area 1501 of the output unit 150, the participant 1002 is displayed in the area 1502, the participant 1003 is displayed in the area 1503, and the participant 1004 is displayed in the area 1504, the face display of each participant is displayed. It can be seen that it is large and easy to read facial expressions. Also, here, participant 1001 is displayed as “Participant A” 1506, participant 1002 is displayed as “Participant B” 1507, participant 1003 is displayed as “Participant C” 1508, and participant 1004 Is displayed on the output unit 150 as a “participant D”. When “participant C” is selected in the participant selection in step S908, the region 1503 or display 1508 of the participant C may be selected with the pointer 1510. The fact that “participant C” is selected may be displayed on the output unit 150 in an easily understandable manner. Here, an example in which the content of the speech of “participant C” is displayed in the area 1505 by the speech history display in step S909 is illustrated. If there are too many utterance histories that can be displayed, a scroll bar 1511 or the like may be provided in the utterance history display area 1505 so that the past utterances can be displayed. When there are a plurality of participants who display the speech history, the participant who made the speech is also displayed together with the content of the speech. In addition, when the speech history is displayed, it is easier to understand if the time of speech is also displayed.

As described above, according to the present invention, in a TV conference system in which a participant conducts a TV conference, even when there are a plurality of participants at one site, the faces of all the members are appropriately displayed at the connection destination. It is possible to provide a TV conference system, a TV conference method, and a TV conference program that can read facial expressions and easily understand who made what.

The means and functions described above are realized by a computer (including a CPU, an information processing apparatus, and various terminals) reading and executing a predetermined program. The program may be, for example, in a form (SaaS: Software as a Service) provided from a computer via a network, or a flexible disk, CD (CD-ROM, etc.), DVD (DVD-ROM, DVD). -RAM, etc.) and a computer-readable recording medium such as a compact memory. In this case, the computer reads the program from the recording medium, transfers it to the internal storage device or the external storage device, stores it, and executes it. The program may be recorded in advance in a storage device (recording medium) such as a magnetic disk, an optical disk, or a magneto-optical disk, and provided from the storage device to a computer via a communication line.

As mentioned above, although embodiment of this invention was described, this invention is not limited to these embodiment mentioned above. The effects described in the embodiments of the present invention are only the most preferable effects resulting from the present invention, and the effects of the present invention are limited to those described in the embodiments of the present invention. is not.

100 TV conference device, 200 computer, 300 communication network

Claims

A TV conference system in which participants conduct a TV conference,
Image analysis means for image analysis of an image of the TV conference in which the participant is shown;
From the result of the image analysis, face detection means for detecting a part including the face of the participant as a face part;
Face list display means for displaying a list of images of the detected face parts of a plurality of participants;
A video conference system comprising:
The face list display means arranges the detected face part in the center of each display area, and arranges the display area in which the face part is arranged in the center to make a list display. 1. The video conference system according to 1.
3. The video conference system according to claim 1, wherein the face list display means replaces and displays a background part other than the detected face when displaying a list of images of the face part.
4. The face list display means, when starting the list display, when the detected face portion satisfies a predetermined condition, the face list display means displays the image by replacing it with another image. The video conference system according to claim 1.
The face detection means further detects whether or not each participant of the detected face portion speaks,
The face list display means further changes the attention level of the participant who detects that he / she is speaking when displaying a list of images of the face portion. The video conference system according to any one of the above.
A speech detection means for detecting the speech of the participant;
Speaker determination means for determining a speaker of the detected speech;
A speech history display means for displaying a speech history of a participant selected from the list display;
The video conference system according to any one of claims 1 to 5, further comprising:
A TV conference system in which participants conduct a TV conference,
Image analysis means for image analysis of an image of the TV conference in which the participant is shown;
From the result of the image analysis, a part including the face of the participant is detected as a face part, and further, a face detection unit for detecting whether or not each participant of the detected face part is speaking,
A list of face images of a plurality of detected participants is displayed, and a face list display for changing the attention level of the participant who has detected the speech when displaying the images of the face portions. Means,
A speech detection means for detecting the speech of the participant;
Speaker determination means for determining a speaker of the detected speech;
A statement history display means for displaying the participant's statement history;
A video conference system comprising:
A TV conference method in which participants conduct a TV conference,
Image analysis of an image of a TV conference in which the participant is shown;
Detecting a part including the face of the participant as a face part from the result of the image analysis;
Displaying a list of images of face portions of a plurality of detected participants;
A video conferencing method comprising:
A computer system in which participants conduct a TV conference,
Image analysis of an image of a video conference in which the participant is shown;
Detecting a part including the face of the participant as a face part from the result of the image analysis;
Displaying a list of images of face portions of a plurality of detected participants;
A program for running