WO2018061173A1 - TV conference system, TV conference method, and program - Google Patents

TV conference system, TV conference method, and program

Info

Publication number
WO2018061173A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
participant
conference
image
image analysis
Prior art date
Application number
PCT/JP2016/078992
Other languages
French (fr)
Japanese (ja)
Inventor
俊二 菅谷 (Shunji Sugaya)
Original Assignee
株式会社オプティム (OPTiM Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社オプティム (OPTiM Corporation)
Priority to PCT/JP2016/078992
Publication of WO2018061173A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems

Definitions

  • The present invention relates to a TV conference system, a TV conference method, and a program that make it easy to understand who made what remarks.
  • A video conference apparatus that can be adjusted according to such conditions is disclosed (Patent Document 1).
  • the present invention provides the following solutions.
  • The invention according to the first feature is a TV conference system in which participants conduct a TV conference, comprising: image analysis means for performing image analysis on an image of the TV conference in which the participants appear; face detection means for detecting, from the result of the image analysis, a part including the face of each participant as a face part; and face list display means for displaying a list of the detected face part images of the plurality of participants.
  • Although the invention according to the first feature falls in the category of a TV conference system, a TV conference method and a TV conference program exhibit the same operations and effects.
  • The invention according to the second feature is the TV conference system according to the first feature, wherein the face list display means arranges each detected face part at the center of its own display area and arranges those display areas to form the list display.
  • The invention according to the third feature is the TV conference system according to the first or second feature, wherein the face list display means, when displaying the list of face part images, replaces the background portion other than the detected faces before display.
  • An invention according to a fourth feature is the TV conference system according to any one of the first to third features, wherein the face list display means, after the list display has started, replaces a detected face part with another image when that face part satisfies a predetermined condition.
  • The invention according to the fifth feature is the TV conference system according to any one of the first to fourth features, wherein the face detection means further detects whether each participant whose face part has been detected is speaking, and the face list display means, when displaying the list of face part images, changes the attention level of a participant who is detected to be speaking.
  • An invention according to a sixth feature is the TV conference system according to any one of the first to fifth features, further comprising: speech detection means for detecting the speech of the participants; speaker determination means for determining the speaker of the detected speech; and speech history display means for displaying the speech history of a participant selected from the list display.
  • The invention according to a seventh feature is a TV conference system in which participants conduct a TV conference, comprising: image analysis means for performing image analysis on an image of the TV conference in which the participants appear; face detection means for detecting, from the result of the image analysis, a part including the face of each participant as a face part, and for further detecting whether each such participant is speaking; face list display means for displaying a list of the detected face part images of the plurality of participants and for changing the attention level of a participant who is detected to be speaking; speech detection means for detecting the speech of the participants; speaker determination means for determining the speaker of the detected speech; and speech history display means for displaying the speech history of the participants.
  • The invention according to an eighth feature is a TV conference method in which participants conduct a TV conference, comprising the steps of: performing image analysis on an image of the TV conference in which the participants appear; detecting, from the result of the image analysis, a part including the face of each participant as a face part; and displaying a list of the detected face part images of the plurality of participants.
  • The invention according to the ninth feature provides a program that causes a computer system in which participants conduct a TV conference to execute the above image analysis, face detection, and face list display steps.
  • According to the present invention, even when there are a plurality of participants at one site, the faces of all members can be appropriately displayed at the connection destination and the facial expressions of the participants can be read. It is therefore possible to provide a TV conference system, a TV conference method, and a TV conference program that make it easy to understand who made what remarks.
  • FIG. 1 is a schematic diagram of a preferred embodiment of the present invention.
  • FIG. 2 is a diagram illustrating the relationship between the functional blocks of the TV conference apparatus 100 and the functions.
  • FIG. 3 is a flowchart of face list display processing in the TV conference apparatus 100.
  • FIG. 4 is a diagram showing a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the transmitting-side TV conference device 100a and face list display processing is performed by the receiving-side TV conference device 100b.
  • FIG. 5 is a flowchart when image analysis processing and face detection processing are performed by the transmitting-side TV conference device 100a, and face list display processing is performed by the receiving-side TV conference device 100b.
  • FIG. 6 is a diagram illustrating a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the computer 200 and face list display processing is performed by the TV conference device 100b on the receiving side.
  • FIG. 7 is a flowchart when the image analysis process and the face detection process are performed by the computer 200 and the face list display process is performed by the TV conference device 100b on the receiving side.
  • FIG. 8 is a diagram illustrating the relationship between the function blocks and the functions when the face list display process and the speech history display process are performed in the TV conference apparatus 100.
  • FIG. 9 is a flowchart of face list display processing and speech history display processing in the TV conference apparatus 100.
  • FIG. 10 is a diagram illustrating an example of a display of a general TV conference.
  • FIG. 11 is a diagram illustrating an example of the display of the face list display process.
  • FIG. 12 is a diagram illustrating an example of a display in which the background is replaced in the face list display process.
  • FIG. 13 is a diagram illustrating an example of a display in which a face portion satisfying a predetermined condition is replaced in the face list display process.
  • FIG. 14 is a diagram illustrating an example of a display that changes the attention level of a participant who is detected to be speaking in the face list display process.
  • FIG. 15 is a diagram illustrating an example of the display of the face list display process and the speech history display process.
  • FIG. 1 is a schematic diagram of a preferred embodiment of the present invention. The outline of the present invention will be described with reference to FIG.
  • the TV conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150.
  • the control unit 110 implements the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130. Further, the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
  • The TV conference apparatus 100 only needs to include the above-described components as a whole; each component may be provided as an internal device or as an external device.
  • The TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, or a wearable terminal such as smart glasses or a head-mounted display.
  • the smartphone illustrated as the TV conference apparatus 100a in FIG. 2 and the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.
  • the image analysis module 111 of the TV conference device 100 receives a captured image from the connected TV conference device 100 via the communication unit 120 (step S01).
  • the captured image here is a precise image having an amount of information necessary for image analysis, and the number of pixels and the image quality can be designated. Audio data is also received along with the captured image. The received captured image and audio data are saved in the storage unit 130.
  • the image analysis module 111 of the TV conference apparatus 100 performs image analysis of the received captured image (step S02).
  • the image analysis here is an analysis of the positions and number of participants in the conference.
  • gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.
  • the image analysis module 111 specifies that there are four participants and their positions by image analysis.
  • the face detection module 112 of the TV conference device 100 detects a part including each participant's face as a face part based on the image analysis result of step S02 (step S03).
  • The face detection here is for specifying the participant's head position. For example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the temporal region or the back of the head may be detected as a face.
  • the face detection module 112 detects four faces of a participant 1001, a participant 1002, a participant 1003, and a participant 1004.
  • For the image analysis and face detection, a human may provide supervised training, or machine learning or deep learning may be used.
  • The trained image analysis module 111 and face detection module 112 may also be acquired from the outside via the communication unit 120.
  • The present invention is not limited to these techniques; existing techniques can be used.
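  • By way of a minimal sketch only: the image analysis (step S02) and face detection (step S03) could be realized with an existing detector such as OpenCV's bundled Haar cascade. The cascade shown detects frontal faces only, whereas the text also allows detecting the temporal region or the back of the head; the parameter values are illustrative assumptions, not values from the patent.

```python
import cv2

# Existing technique standing in for the image analysis module 111 and
# face detection module 112: OpenCV's bundled frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_parts(frame):
    """Return bounding boxes (x, y, w, h) of parts containing faces;
    the number of boxes gives the number of participants (steps S02-S03)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # scaleFactor / minNeighbors are typical defaults, not patent values.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```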
  • the face list display module 151 of the TV conference apparatus 100 displays a list of detected images of the face portions of the plurality of participants on the output unit 150 (step S04).
  • the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area.
  • the display area with the face portion arranged in the center is arranged as a face list display.
  • The face portion to be displayed here is not limited to the head; it may extend from the head to the chest.
  • In the example of FIG. 1, the participant 1001 is placed in the area 101, the participant 1002 in the area 102, the participant 1003 in the area 103, and the participant 1004 in the area 104 to form the face list display. Further, as shown in FIG. 1, a captured image representing the entire scene may be displayed in the empty area 105 where no face portion is displayed.
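  • The list display described above (divide the output into at least as many areas as detected participants, center each face part in its area) might look like the following sketch; the grid shape, output size, and head-to-chest margin are assumptions for illustration.

```python
import math
import cv2
import numpy as np

def face_list_display(frame, boxes, out_w=1280, out_h=720, margin=0.5):
    """Divide the output into a grid of at least len(boxes) cells and
    place each detected face part at the center of its own cell."""
    canvas = np.zeros((out_h, out_w, 3), dtype=np.uint8)
    if len(boxes) == 0:
        return canvas
    cols = math.ceil(math.sqrt(len(boxes)))
    rows = math.ceil(len(boxes) / cols)
    cell_w, cell_h = out_w // cols, out_h // rows
    for i, (x, y, w, h) in enumerate(boxes):
        # Widen the crop: the displayed part may run from head to chest.
        mx, my = int(w * margin), int(h * margin)
        crop = frame[max(0, y - my):y + h + 2 * my,
                     max(0, x - mx):x + w + mx]
        r, c = divmod(i, cols)
        canvas[r * cell_h:(r + 1) * cell_h,
               c * cell_w:(c + 1) * cell_w] = cv2.resize(crop, (cell_w, cell_h))
    return canvas
```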
  • FIG. 10 is a diagram showing an example of a general video conference display. Since the captured image is displayed as-is on the output unit 150, the overall atmosphere is conveyed, but each face portion appears small, so the facial expressions of the participant 1001, the participant 1002, the participant 1003, and the participant 1004 are difficult to read.
  • FIG. 11 is a diagram showing an example of the display of the face list display process of the present invention. The participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 in the area 1102, the participant 1003 in the area 1103, and the participant 1004 in the area 1104, so each participant's face is displayed large and facial expressions are easy to read.
  • In this example, the area 1105 displays "TV conference system 2016/9/9 15:07:19 <<Connecting to XX office>> Call start: 2016/9/9 14:05:33 Destination participants: 4", that is, the system name, the current date and time, the connection destination, the call start time, and the number of participants at the destination.
  • information may be displayed in the empty area, or a captured image representing the entire state may be displayed. Further, when the area is divided, the entire output unit 150 may be divided into a face list display area without creating an empty area. At that time, the size of each participant's area may be different.
  • As described above, in a TV conference system in which participants conduct a TV conference, even when there are a plurality of participants at one site, the faces of all members are appropriately displayed at the connection destination and the facial expressions of the participants can be read. It is thus possible to provide such a TV conference system, TV conference method, and TV conference program.
  • FIG. 2 is a diagram illustrating the relationship between the functional blocks of the TV conference apparatus 100 and the functions.
  • the video conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150.
  • the control unit 110 implements the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130. Further, the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
  • the communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference apparatus 100a and the TV conference apparatus 100b.
  • The TV conference apparatus 100 only needs to include the above-described components as a whole; each component may be provided as an internal device or as an external device.
  • The TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, or a wearable terminal such as smart glasses or a head-mounted display.
  • the smartphone illustrated as the TV conference apparatus 100a in FIG. 2 and the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.
  • The TV conference apparatus 100 includes, as the camera unit 10, an imaging device having a lens, an image sensor, various buttons, and a flash, and captures moving images and still images as captured images.
  • An image obtained by imaging is a precise image having an amount of information necessary for image analysis, and the number of pixels and the image quality can be designated.
  • The camera unit 10 includes a microphone for acquiring audio data together with moving image capture, or it can use the microphone function of the input unit 140.
  • the control unit 110 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like.
  • the control unit 110 implements the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130.
  • The communication unit 120 is a device that enables communication with other devices, for example, a WiFi (Wireless Fidelity) device compliant with IEEE 802.11 or a wireless device compliant with the IMT-2000 standard such as a third- or fourth-generation mobile communication system. A wired LAN connection may also be used.
  • the storage unit 130 includes a data storage unit such as a hard disk or a semiconductor memory, and stores data necessary for processing of captured images, image analysis results, face detection results, and the like.
  • the input unit 140 has functions necessary for using the TV conference system.
  • a liquid crystal display that realizes a touch panel function, a keyboard, a mouse, a pen tablet, a hardware button on the apparatus, a microphone for performing voice recognition, and the like can be provided.
  • the function of the present invention is not particularly limited by the input method.
  • the output unit 150 has functions necessary for using the TV conference system.
  • the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
  • forms such as a liquid crystal display, a PC display, a projection on a projector, and an audio output can be considered.
  • the function of the present invention is not particularly limited by the output method.
  • FIG. 3 is a flowchart of face list display processing in the TV conference apparatus 100. Processing executed by each module described above will be described in accordance with this processing.
  • Here, an example is shown in which the image analysis process, the face detection process, and the face list display process are all performed by the TV conference device 100a on the captured-image receiving side.
  • control unit 110 of the TV conference device 100a notifies the connection destination TV conference device 100b of the start of the TV conference via the communication unit 120 (step S301).
  • the control unit 110 of the TV conference apparatus 100b starts imaging with the camera unit 10 (step S302).
  • the captured image is a precise image having an amount of information necessary for image analysis in the TV conference apparatus 100a, and the number of pixels and the image quality can be designated.
  • audio data is acquired together with moving image capturing.
  • control unit 110 of the TV conference device 100b transmits the captured image to the TV conference device 100a via the communication unit 120 (step S303).
  • When the captured image is a moving image, audio data is also transmitted.
  • the image analysis module 111 of the TV conference apparatus 100a receives a captured image from the TV conference apparatus 100b via the communication unit 120 (step S304). Audio data is also received along with the captured image. The received captured image and audio data are saved in the storage unit 130.
  • the image analysis module 111 of the TV conference apparatus 100a performs image analysis of the received captured image (step S305).
  • the image analysis is an analysis of the positions and number of participants in the conference.
  • gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.
  • the face detection module 112 of the TV conference device 100a detects a part including each participant's face as a face part based on the image analysis result of step S305 (step S306).
  • The face detection is for specifying the participant's head position. For example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the temporal region or the back of the head may be detected as a face.
  • For the image analysis and face detection, a human may provide supervised training, or machine learning or deep learning may be used.
  • The trained image analysis module 111 and face detection module 112 may also be acquired from the outside via the communication unit 120.
  • The present invention is not limited to these techniques; existing techniques can be used.
  • the face list display module 151 of the TV conference device 100a displays a list of detected face part images of the plurality of participants on the output unit 150 (step S307).
  • the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area.
  • the display area with the face portion arranged in the center is arranged as a face list display.
  • the face portion to be displayed here is not limited to the head, but may be from the head to the chest.
  • FIG. 10 is a diagram showing an example of a general video conference display. Since the captured image is displayed as-is on the output unit 150, the overall atmosphere is conveyed, but each face portion appears small, so the facial expressions of the participant 1001, the participant 1002, the participant 1003, and the participant 1004 are difficult to read.
  • FIG. 11 is a diagram showing an example of the display of the face list display process of the present invention. The participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 in the area 1102, the participant 1003 in the area 1103, and the participant 1004 in the area 1104, so each participant's face is displayed large and facial expressions are easy to read.
  • In this example, the area 1105 displays "TV conference system 2016/9/9 15:07:19 <<Connecting to XX office>> Call start: 2016/9/9 14:05:33 Destination participants: 4", that is, the system name, the current date and time, the connection destination, the call start time, and the number of participants at the destination.
  • information may be displayed in the empty area, or a captured image representing the entire state may be displayed. Further, when the area is divided, the entire output unit 150 may be divided into a face list display area without creating an empty area. At that time, the size of each participant's area may be different.
  • the control unit 110 of the TV conference device 100a confirms whether or not to end the TV conference (step S308). It is assumed that the user can designate the end of the video conference via the input unit 140. If the video conference is to be terminated, the process proceeds to the next step S309. If the video conference is not to be terminated, the process returns to step S304 to continue the processing.
  • control unit 110 of the TV conference device 100a notifies the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S309).
  • the TV conference device 100b confirms whether or not to end the TV conference (step S310). If the TV conference apparatus 100b does not end, the process returns to step S302 to continue the process, and if it ends, the TV conference ends.
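  • Steps S304 to S309 amount to a simple receive-analyze-display loop on the TV conference device 100a. The sketch below shows that flow, reusing the sketches above; connection, show, and user_requested_end are hypothetical stand-ins for the communication unit 120, the output unit 150, and the input unit 140.

```python
def conference_loop(connection, show, user_requested_end):
    """Receiving-side loop over steps S304-S309 (hypothetical interfaces)."""
    while True:
        frame, audio = connection.receive_frame()   # S304: image + audio
        boxes = detect_face_parts(frame)            # S305-S306: analyze, detect
        show(face_list_display(frame, boxes))       # S307: face list display
        if user_requested_end():                    # S308: end check
            break
    connection.notify_end()                         # S309: notify device 100b
```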
  • As described above, in a TV conference system in which participants conduct a TV conference, even when there are a plurality of participants at one site, the faces of all members are appropriately displayed at the connection destination and the facial expressions of the participants can be read. It is thus possible to provide such a TV conference system, TV conference method, and TV conference program.
  • FIG. 12 is a diagram illustrating an example of a display in which the background is replaced in the face list display process. Compared to FIG. 11, it can be seen that the backgrounds of the participants in the areas 1201, 1202, 1203, and 1204 of the output unit 150 are replaced with uniform backgrounds.
  • the background replacement process may be performed when the face list display module 151 performs the face list display process of step S307 in the flowchart of FIG. In this way, by replacing the background portion and displaying it, extra information is eliminated and the facial expressions of each participant are easier to read.
  • the same background is used for all of the participant 1001, the participant 1002, the participant 1003, and the participant 1004, but the background may be changed depending on the participant.
  • the background of the participants may be changed for each location, and which participants are in the same space may be displayed in an easy-to-understand manner.
  • a captured image representing the overall state is displayed in the area 1205.
  • the date and time, the connection destination, the call start time, the number of participants of the other party, and the like may be displayed.
  • information such as “background: being replaced” may be displayed somewhere on the output unit 150 as shown in the display 1206. Whether or not to use the background replacement function may be set by the user at a desired timing, or the setting may be saved.
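  • A crude sketch of the background replacement: keep only the pixels inside the detected face box of each displayed crop and paint the rest with a uniform color. A production system might instead use person segmentation; the color value is an assumption.

```python
import numpy as np

def replace_background(crop, face_box, color=(32, 32, 32)):
    """Replace everything outside the face bounding box with a uniform
    background so that extra information around the face is removed."""
    x, y, w, h = face_box              # face position within the crop
    out = np.zeros_like(crop)
    out[:, :] = color                  # uniform replacement background
    out[y:y + h, x:x + w] = crop[y:y + h, x:x + w]
    return out
```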
  • FIG. 13 is a diagram illustrating an example of a display in which a face portion satisfying a predetermined condition is replaced in the face list display process.
  • the face part replacement process may be performed when the face list display module 151 performs the face list display process of step S307 in the flowchart of FIG.
  • the face list display module 151 determines a predetermined condition, and replaces the face portion when the condition is satisfied.
  • The predetermined condition here may be, for example, that a participant is no longer in the captured image, that is, that the participant has moved out of the shooting range of the camera. Comparing FIG. 13 with FIG. 12, it can be seen that an image 1307 is displayed in the area 1302 instead of the participant 1002, because the participant 1002 has left the frame.
  • In FIG. 13, the participant 1002 is away. Since the other three participants are present, the participant 1001 is displayed in the area 1301, the participant 1003 in the area 1303, and the participant 1004 in the area 1304, as in FIG. 12.
  • the image 1307 may be a still image of the participant 1002, a favorite illustration, an avatar of the participant 1002, or the like, and can be set by the participant according to his / her preference.
  • the area 1305 may display date and time, connection destination, call start time, number of participants at the other end, and the like. Further, information such as “Away” may be displayed somewhere in the area 1302 as in the display 1306.
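  • The predetermined-condition replacement could be as simple as the sketch below: once a participant's face has gone undetected for some time, show the substitute image (still photo, illustration, or avatar) in that participant's area. The grace period is an assumption, not a value from the patent.

```python
def tile_for(face_crop, substitute, frames_missing, grace=30):
    """Return what to show in a participant's display area: the live face
    part, or, once the face has been absent for `grace` frames, the
    substitute image chosen by the participant (e.g. an avatar)."""
    if face_crop is None and frames_missing > grace:
        return substitute
    return face_crop
```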
  • FIG. 14 is a diagram illustrating an example of a display that changes the attention level of a participant who is detected to be speaking in the face list display process.
  • When the face detection module 112 performs the face detection process of step S306 in the flowchart of FIG. 3, it detects whether each participant is speaking, and the face list display module 151 changes the attention level of any participant who is detected to be speaking.
  • The attention level is changed in order to make the speaker stand out and attract attention. Specifically, as can be seen by comparing FIG. 13 and FIG. 14, the participant 1001 is displayed in the area 1401, the participant 1002 in the area 1402, and the participant 1003 in the area 1403, as in FIG. 13, while the attention level of the speaking participant 1004 displayed in the area 1404 is changed.
  • information “speaking” may be displayed somewhere in the area 1404 like a display 1406.
  • the area 1404 where the speaking participant 1004 is displayed may be placed in a conspicuous part such as the center of the output unit 150 (not shown).
  • the position of the speaker may be indicated by a mark or the like as shown in the display 1407 in conjunction with the attention level changing process described above.
  • Here, an example is illustrated in which only the participant 1004 is speaking, but when a plurality of participants speak, the attention level change may be applied to all of them.
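  • One way to change a speaking participant's attention level, sketched below, is to draw a highlight border and a "speaking" label on that participant's tile (cf. the display 1406 and the mark 1407); the colors and line thickness are illustrative assumptions.

```python
import cv2

def emphasize_speaker(tile, is_speaking):
    """Highlight a participant's tile while they are detected speaking."""
    if is_speaking:
        h, w = tile.shape[:2]
        cv2.rectangle(tile, (0, 0), (w - 1, h - 1), (0, 255, 255), 8)
        cv2.putText(tile, "speaking", (10, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 255), 2)
    return tile
```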
  • FIG. 4 is a diagram showing a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the transmitting-side TV conference device 100a and face list display processing is performed by the receiving-side TV conference device 100b.
  • the video conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150.
  • the control unit 110 implements the image analysis module 113 in cooperation with the camera unit 10 and the storage unit 130. Further, the control unit 110 implements the face detection module 112 in cooperation with the storage unit 130. Further, the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
  • the communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference apparatus 100a and the TV conference apparatus 100b.
  • FIG. 5 is a flowchart when the image analysis process and the face detection process are performed by the transmitting-side TV conference apparatus 100a, and the face list display process is performed by the receiving-side TV conference apparatus 100b. Processing executed by each module described above will be described in accordance with this processing.
  • control unit 110 of the TV conference device 100a notifies the connection destination TV conference device 100b of the start of the TV conference via the communication unit 120 (step S501).
  • the control unit 110 of the TV conference device 100a starts imaging with the camera unit 10 (step S502).
  • the captured image is a precise image having an amount of information necessary for image analysis in the TV conference apparatus 100a, and the number of pixels and the image quality can be designated.
  • audio data is acquired together with moving image capturing.
  • the captured image and audio data are stored in the storage unit 130.
  • the image analysis module 113 of the TV conference apparatus 100a performs image analysis of the captured image (step S503).
  • the image analysis is an analysis of the positions and number of participants in the conference.
  • gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.
  • the face detection module 112 of the TV conference device 100a detects a part including each participant's face as a face part based on the image analysis result of step S503 (step S504).
  • The face detection here is for specifying the participant's head position. For example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the temporal region or the back of the head may be detected as a face.
  • For the image analysis and face detection, a human may provide supervised training, or machine learning or deep learning may be used. Further, since a large amount of data is required for learning, the trained image analysis module 113 and face detection module 112 may be acquired from the outside via the communication unit 120.
  • The present invention is not limited to these techniques; existing techniques can be used.
  • the control unit 110 of the TV conference device 100a transmits the analysis image to the TV conference device 100b via the communication unit 120 (step S505).
  • When the captured image is a moving image, audio data is also transmitted.
  • The analysis image transmitted here may be only the data determined to be necessary for the face list display as a result of the image analysis and face detection, the captured image itself that was captured by the TV conference device 100a and used for the image analysis, or a captured image with changed resolution.
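  • Sending "only the data determined to be necessary for the face list display" could look like the sketch below: JPEG-compressed face crops plus their boxes. The wire format (JSON with hex-encoded JPEG) is purely an assumption for illustration.

```python
import json
import cv2

def build_analysis_payload(frame, boxes, quality=80):
    """Package only the detected face parts (step S505), which can be far
    smaller than the full captured image."""
    faces = []
    for (x, y, w, h) in boxes:
        ok, jpg = cv2.imencode(".jpg", frame[y:y + h, x:x + w],
                               [int(cv2.IMWRITE_JPEG_QUALITY), quality])
        if ok:
            faces.append({"box": [int(x), int(y), int(w), int(h)],
                          "jpeg": jpg.tobytes().hex()})
    return json.dumps(faces).encode("utf-8")
```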
  • the TV conference device 100b receives an analysis image from the TV conference device 100a via the communication unit 120 (step S506). Audio data is also received along with the analysis image. The received analysis image and audio data are saved in the storage unit 130.
  • the face list display module 151 of the TV conference device 100b displays a list of detected face part images of the plurality of participants on the output unit 150 (step S507).
  • the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area.
  • the display area with the face portion arranged in the center is arranged as a face list display.
  • the face portion to be displayed here is not limited to the head, but may be from the head to the chest.
  • FIG. 10 is a diagram showing an example of a general video conference display. Since the captured image is displayed as-is on the output unit 150, the overall atmosphere is conveyed, but each face portion appears small, so the facial expressions of the participant 1001, the participant 1002, the participant 1003, and the participant 1004 are difficult to read.
  • FIG. 11 is a diagram showing an example of the display of the face list display process of the present invention. The participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 in the area 1102, the participant 1003 in the area 1103, and the participant 1004 in the area 1104, so each participant's face is displayed large and facial expressions are easy to read.
  • In this example, the area 1105 displays "TV conference system 2016/9/9 15:07:19 <<Connecting to XX office>> Call start: 2016/9/9 14:05:33 Destination participants: 4", that is, the system name, the current date and time, the connection destination, the call start time, and the number of participants at the destination.
  • information may be displayed in the empty area, or a captured image representing the entire state may be displayed. Further, when the area is divided, the entire output unit 150 may be divided into a face list display area without creating an empty area. At that time, the size of each participant's area may be different.
  • the control unit 110 of the TV conference device 100a confirms whether or not to end the TV conference (step S508). It is assumed that the user can designate the end of the video conference via the input unit 140. If the video conference is to be ended, the process proceeds to the next step S509. If the video conference is not to be ended, the process returns to step S502 and the process is continued.
  • control unit 110 of the TV conference device 100a notifies the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S509).
  • While FIGS. 2 and 3 perform the image analysis processing, face detection processing, and face list display processing on the receiving side, FIGS. 4 and 5 perform the image analysis processing and face detection processing on the transmitting side and the face list display processing on the receiving side.
  • In the configuration of FIGS. 4 and 5, only the analysis image needs to be transmitted, so a reduction in the amount of communication data can be expected, particularly when the background has been replaced.
  • As described above, in a TV conference system in which participants conduct a TV conference, even when there are a plurality of participants at one site, the faces of all members are appropriately displayed at the connection destination and the facial expressions of the participants can be read. It is thus possible to provide such a TV conference system, TV conference method, and TV conference program.
  • FIG. 6 is a diagram illustrating a relationship between functional blocks and functions when image analysis processing and face detection processing are performed by the computer 200 and face list display processing is performed by the TV conference device 100b on the receiving side.
  • the video conference apparatus 100 includes a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150.
  • the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
  • the computer 200 includes a control unit 210, a communication unit 220, and a storage unit 230.
  • the control unit 210 implements the image analysis module 211 in cooperation with the communication unit 220 and the storage unit 230.
  • the control unit 210 implements the face detection module 212 in cooperation with the storage unit 230.
  • the communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference apparatus 100a and the computer 200 and between the TV conference apparatus 100b and the computer 200.
  • The TV conference apparatus 100 only needs to include the above-described components as a whole; each component may be provided as an internal device or as an external device.
  • The TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, or a wearable terminal such as smart glasses or a head-mounted display.
  • the smartphone illustrated as the TV conference apparatus 100a in FIG. 6 and the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.
  • The TV conference apparatus 100 includes, as the camera unit 10, an imaging device having a lens, an image sensor, various buttons, and a flash, and captures moving images and still images as captured images.
  • An image obtained by imaging is a precise image having an amount of information necessary for image analysis, and the number of pixels and the image quality can be designated.
  • The camera unit 10 includes a microphone for acquiring audio data together with moving image capture, or it can use the microphone function of the input unit 140.
  • the control unit 110 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like.
  • The communication unit 120 is a device that enables communication with other devices, for example, a WiFi (Wireless Fidelity) device compliant with IEEE 802.11 or a wireless device compliant with the IMT-2000 standard such as a third- or fourth-generation mobile communication system. A wired LAN connection may also be used.
  • the storage unit 130 includes a data storage unit such as a hard disk or a semiconductor memory, and stores data necessary for processing of captured images, image analysis results, face detection results, and the like.
  • the input unit 140 has functions necessary for using the TV conference system.
  • a liquid crystal display that realizes a touch panel function, a keyboard, a mouse, a pen tablet, a hardware button on the apparatus, a microphone for performing voice recognition, and the like can be provided.
  • the function of the present invention is not particularly limited by the input method.
  • the output unit 150 has functions necessary for using the TV conference system.
  • the output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
  • forms such as a liquid crystal display, a PC display, a projection on a projector, and an audio output can be considered.
  • the function of the present invention is not particularly limited by the output method.
  • the computer 200 may be a general computer having the functions described below. Although not described here, an input unit and an output unit may be provided as necessary.
  • the computer 200 includes a CPU, RAM, ROM and the like as the control unit 210.
  • the control unit 210 implements the image analysis module 211 in cooperation with the communication unit 220 and the storage unit 230.
  • the control unit 210 implements the face detection module 212 in cooperation with the storage unit 230.
  • The communication unit 220 is a device that enables communication with other devices, for example, a WiFi device compliant with IEEE 802.11 or a wireless device compliant with the IMT-2000 standard such as a third- or fourth-generation mobile communication system. A wired LAN connection may also be used.
  • the storage unit 230 includes a data storage unit using a hard disk or a semiconductor memory.
  • the storage unit 230 holds data such as acquired captured images, image analysis results, and face detection results.
  • FIG. 7 is a flowchart when the image analysis process and the face detection process are performed by the computer 200 and the face list display process is performed by the TV conference device 100b on the receiving side. Processing executed by each module described above will be described in accordance with this processing.
  • control unit 110 of the TV conference apparatus 100a notifies the start of the TV conference to the computer 200 and the connected TV conference apparatus 100b via the communication unit 120 (step S701).
  • The computer 200 that has received the notification of the start of the TV conference from the TV conference device 100a notifies the TV conference device 100b of the start of the TV conference.
  • the control unit 110 of the TV conference device 100a starts imaging with the camera unit 10 (step S702).
  • the captured image is a precise image having an amount of information necessary for image analysis by the computer 200, and the number of pixels and the image quality can be designated.
  • audio data is acquired together with moving image capturing.
  • the captured image and audio data are stored in the storage unit 130.
  • control unit 110 of the TV conference device 100a transmits the captured image to the computer 200 via the communication unit 120 (step S703).
  • When the captured image is a moving image, audio data is also transmitted.
  • the image analysis module 211 of the computer 200 receives a captured image from the TV conference device 100a via the communication unit 220 (step S704). Audio data is also received along with the captured image. The received captured image and audio data are stored in the storage unit 230.
  • the image analysis module 211 of the computer 200 performs image analysis of the received captured image (step S705).
  • the image analysis here is an analysis of the positions and number of participants in the conference.
  • gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.
  • the face detection module 212 of the computer 200 detects a part including the face of each participant as a face part based on the image analysis result of step S705 (step S706).
  • The face detection here is for specifying the participant's head position. For example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the temporal region or the back of the head may be detected as a face.
  • For the image analysis and face detection, a human may provide supervised training, or machine learning or deep learning may be used.
  • The trained image analysis module 211 and face detection module 212 may also be acquired from the outside.
  • The present invention is not limited to these techniques; existing techniques can be used.
  • the control unit 210 of the computer 200 transmits the analysis image to the TV conference device 100b via the communication unit 220 (step S707).
  • When the captured image is a moving image, audio data is also transmitted.
  • The analysis image transmitted here may be only the data determined to be necessary for the face list display as a result of the image analysis and face detection, the captured image itself that was captured by the TV conference apparatus 100a and used for the image analysis by the computer 200, or a captured image with changed resolution.
  • the TV conference apparatus 100b receives an analysis image from the computer 200 via the communication unit 120 (step S708). Audio data is also received along with the analysis image. The received analysis image and audio data are saved in the storage unit 130.
  • the face list display module 151 of the TV conference apparatus 100b displays a list of detected face part images on the output unit 150 (step S709).
  • the output unit 150 is divided into areas equal to or more than the number of detected participants, and the detected face portion is arranged and displayed at the center of each divided display area.
  • the display area with the face portion arranged in the center is arranged as a face list display.
  • the face portion to be displayed here is not limited to the head, but may be from the head to the chest.
  • FIG. 10 is a diagram showing an example of a general video conference display. Since the captured image is displayed as-is on the output unit 150, the overall atmosphere is conveyed, but each face portion appears small, so the facial expressions of the participant 1001, the participant 1002, the participant 1003, and the participant 1004 are difficult to read.
  • FIG. 11 is a diagram showing an example of the display of the face list display process of the present invention. The participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 in the area 1102, the participant 1003 in the area 1103, and the participant 1004 in the area 1104, so each participant's face is displayed large and facial expressions are easy to read.
  • In this example, the area 1105 displays "TV conference system 2016/9/9 15:07:19 <<Connecting to XX office>> Call start: 2016/9/9 14:05:33 Destination participants: 4", that is, the system name, the current date and time, the connection destination, the call start time, and the number of participants at the destination.
  • information may be displayed in the empty area, or a captured image representing the entire state may be displayed. Further, when the area is divided, the entire output unit 150 may be divided into a face list display area without creating an empty area. At that time, the size of each participant's area may be different.
  • the control unit 110 of the TV conference device 100a confirms whether or not to end the TV conference (step S710). It is assumed that the user can designate the end of the video conference via the input unit 140. If the TV conference is to be ended, the process proceeds to the next step S711. If the TV conference is not to be ended, the process returns to step S702 to continue the processing.
  • the control unit 110 of the TV conference device 100a notifies the computer 200 and the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S711).
  • The computer 200 that receives the TV conference end notification from the TV conference apparatus 100a forwards that notification to the TV conference apparatus 100b.
  • An advantage of the latter configuration is that the image analysis module 211 and the face detection module 212 can be easily updated. Further, since a large amount of data is required for machine learning and deep learning, the computer 200 also has the advantage that a large-capacity storage can easily be provided as the storage unit 230.
  • As described above, in a TV conference system in which participants conduct a TV conference, even when there are a plurality of participants at one site, the faces of all members are appropriately displayed at the connection destination and the facial expressions of the participants can be read. It is thus possible to provide such a TV conference system, TV conference method, and TV conference program.
  • FIG. 8 is a diagram illustrating the relationship between the function blocks and the functions when the face list display process and the speech history display process are performed in the TV conference apparatus 100.
  • the control unit 110 implements a speech detection module 114 and a speaker determination module 115 in cooperation with the communication unit 120 and the storage unit 130.
  • the output unit 150 implements the message history display module 152 in cooperation with the control unit 110 and the storage unit 130.
  • The TV conference apparatus 100 only needs to include the above-described components as a whole; each component may be provided as an internal device or as an external device.
  • The TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, a personal computer, an electronic appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, or a wearable terminal such as smart glasses or a head-mounted display.
  • the smartphone illustrated as the TV conference apparatus 100a in FIG. 8, the personal computer, the display, and the WEB camera illustrated as the TV conference apparatus 100b are merely examples.
  • FIG. 9 is a flowchart of the face list display process and the speech history display process in the TV conference apparatus 100. Processing executed by each module described above will be described in accordance with this processing.
  • an example in which a series of processing from image analysis processing to speech history display processing is performed by the video conference apparatus 100 on the captured image receiving side is shown.
  • As described above, the image analysis process, the face detection process, the speech detection process, and the speaker determination process may instead be performed by the TV conference apparatus 100 on the captured-image transmitting side or by the computer 200, with the TV conference device 100 on the receiving side performing only the face list display process and the speech history display process.
  • the image analysis module 111 of the TV conference device 100 on the captured image receiving side receives a captured image via the communication unit 120 (step S901). Audio data is also received along with the captured image. The received captured image and audio data are saved in the storage unit 130.
  • the TV conference start notification is not described, but it is assumed that the TV conference start notification is performed before step S901.
  • the image analysis module 111 performs image analysis of the received captured image (step S902).
  • the image analysis here is an analysis of the positions and number of participants in the conference.
  • gender and age may be analyzed, or analysis may be performed to identify individual participants using an employee database or the like.
  • the face detection module 112 detects a part including each participant's face as a face part based on the image analysis result of step S902 (step S903).
  • The face detection here is for specifying the participant's head position. For example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the temporal region or the back of the head may be detected as a face.
  • For the image analysis and face detection, a human may provide supervised training, or machine learning or deep learning may be used.
  • The trained image analysis module 111 and face detection module 112 may also be acquired from the outside via the communication unit 120.
  • The present invention is not limited to these techniques; existing techniques can be used.
  • the speech detection module 114 detects the speech of each participant based on the received audio data (step S904).
  • Here, the content of the received voice data is analyzed and converted into text by voice recognition. If multiple people are speaking at the same time and it is difficult to isolate each speech, accuracy may be improved by also using the image analysis result of step S902, the face detection result of step S903, and the like.
  • The present invention does not limit the voice recognition technique; existing techniques can be used.
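  • A sketch of the speech detection (step S904), with recognize as a hypothetical stand-in for whatever existing voice-recognition engine is used: each received audio chunk is converted into a timestamped text utterance.

```python
from dataclasses import dataclass
import datetime

@dataclass
class Utterance:
    when: datetime.datetime
    text: str

def detect_speech(audio_chunk, recognize):
    """Convert received voice data to text (step S904); `recognize` is a
    placeholder for an existing speech-recognition engine."""
    text = recognize(audio_chunk)
    return Utterance(datetime.datetime.now(), text) if text else None
```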
Next, the speaker determination module 115 determines the speaker based on the image analysis result of step S902, the face detection result of step S903, the speech detection result of step S904, and so on (step S905). The speaker determination here identifies which participant is speaking, using the captured image, mouth movement from the analyzed image, the pitch of the voice, the direction of the audio input, and the like, and ties that participant to the speech content detected in step S904. The results of these processes are stored in the storage unit 130 as records of which participant said what, and when, as sketched below.
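As a toy version of the mouth-movement cue, the sketch below attributes an utterance to the face box whose pixels change most between two frames during the utterance, and stores the result as a who/what/when record. Both the heuristic and the record layout are assumptions for illustration, not the patent's prescribed method.

```python
import cv2
import numpy as np
from dataclasses import dataclass

@dataclass
class Utterance:
    """One stored record of which participant said what, and when (step S905)."""
    speaker: int      # index into the detected face-box list
    text: str         # transcript produced in step S904
    start_sec: float  # when the utterance began

def pick_speaker(prev_frame, cur_frame, face_boxes):
    """Return the index of the face box with the largest inter-frame change,
    a rough stand-in for the mouth-movement cue."""
    prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev, cur)
    scores = [float(np.mean(diff[y:y + h, x:x + w])) for (x, y, w, h) in face_boxes]
    return int(np.argmax(scores))
```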
For this speech detection and speaker determination as well, the training may be supervised by a human, or machine learning or deep learning may be used. A trained speech detection module 114 and speaker determination module 115 may be acquired from the outside via the communication unit 120. These methods do not limit this patent; existing technology can be used.
Next, the face list display module 151 displays a list of the detected face part images of the plurality of participants on the output unit 150 (step S906). Specifically, the output unit 150 is divided into at least as many areas as the number of detected participants, the detected face part is placed at the center of each divided display area, and the display areas with the centered face parts are arranged side by side to form the face list display. The face part displayed here is not limited to the head; it may extend from the head to the chest. A sketch of this layout follows.
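The division into per-participant cells might look like the following minimal sketch; the canvas size is an assumed value, and each crop is scaled with its aspect ratio preserved so that the face part sits at the center of its cell, as the step describes.

```python
import math
import cv2
import numpy as np

def render_face_list(frame, face_boxes, canvas_w=1280, canvas_h=720):
    """Paste each detected face part at the center of its own grid cell."""
    n = max(len(face_boxes), 1)
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    cell_w, cell_h = canvas_w // cols, canvas_h // rows
    canvas = np.zeros((canvas_h, canvas_w, 3), dtype=np.uint8)
    for i, (x, y, w, h) in enumerate(face_boxes):
        crop = frame[y:y + h, x:x + w]
        scale = min(cell_w / w, cell_h / h)      # fit inside the cell
        crop = cv2.resize(crop, (max(1, int(w * scale)), max(1, int(h * scale))))
        r, c = divmod(i, cols)
        top = r * cell_h + (cell_h - crop.shape[0]) // 2
        left = c * cell_w + (cell_w - crop.shape[1]) // 2
        canvas[top:top + crop.shape[0], left:left + crop.shape[1]] = crop
    return canvas
```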
Next, the speech history display module 152 checks whether the speech history is to be displayed (step S907). The user can request the speech history display via the input unit 140. If the speech history is to be displayed, the process proceeds to step S908; otherwise, the process ends. When displaying the speech history, the speech history display module 152 has the user select, via the input unit 140, the participants whose speech history is to be displayed (step S908). The number of participants selected is not limited; one, several, or all of the participants may be selected.
Finally, the speech history display module 152 displays the speech history of the selected participants on the output unit 150 (step S909), for example as sketched below.
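Reusing the assumed Utterance records from the earlier sketch, the selection of step S908 and the display of step S909, including the per-utterance speaker and time that the text recommends showing, could be sketched as:

```python
def format_speech_history(utterances, selected_speakers):
    """Return display lines for the participants selected via the input unit.

    utterances: list of Utterance records (see the earlier sketch);
    selected_speakers: a set of participant indices; its size is not limited.
    """
    lines = []
    for u in sorted(utterances, key=lambda u: u.start_sec):
        if u.speaker in selected_speakers:
            h, rem = divmod(int(u.start_sec), 3600)
            m, s = divmod(rem, 60)
            lines.append(f"{h:02d}:{m:02d}:{s:02d}  participant {u.speaker}: {u.text}")
    return lines
```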
The TV conference end notification is omitted from the flowchart, but when the TV conference ends, it is assumed that an end notification is sent to the TV conference device 100 at the other end.
FIG. 15 is a diagram showing an example of the display produced by the face list display processing and the speech history display processing. The participant 1001 is displayed in the area 1501 of the output unit 150, the participant 1002 in the area 1502, the participant 1003 in the area 1503, and the participant 1004 in the area 1504, so each participant's face is displayed large and the facial expressions are easy to read. Here, the participant 1001 is labeled "Participant A" 1506, the participant 1002 "Participant B" 1507, the participant 1003 "Participant C" 1508, and the participant 1004 "Participant D" on the output unit 150. To display the speech history of the participant C, the area 1503 or the label 1508 of the participant C may be selected with the pointer 1510, and the fact that "Participant C" is selected may be shown on the output unit 150 in an easily understandable manner. In this example, the content of the speech of "Participant C" is displayed in the area 1505 by the speech history display of step S909. If there are more utterances than can be displayed at once, a scroll bar 1511 or the like may be provided in the speech history display area 1505 so that past utterances can be viewed. The participant who made each utterance is displayed together with its content, and the speech history is easier to follow if the time of each utterance is also displayed.
As described above, in a TV conference system in which participants hold a TV conference, even when there are a plurality of participants at one site, the faces of all of them are displayed appropriately at the connection destination, so it is possible to provide a TV conference system, a TV conference method, and a TV conference program with which facial expressions can be read and it is easy to understand who said what.
The means and functions described above are realized by a computer (including a CPU, an information processing apparatus, and various terminals) reading and executing a predetermined program. The program may be provided, for example, from a computer via a network (SaaS: Software as a Service), or on a computer-readable recording medium such as a flexible disk, a CD (CD-ROM or the like), a DVD (DVD-ROM, DVD-RAM, or the like), or a compact memory. In that case, the computer reads the program from the recording medium, transfers it to an internal or external storage device, stores it, and executes it. The program may also be recorded in advance in a storage device (recording medium) such as a magnetic disk, an optical disk, or a magneto-optical disk, and provided from that storage device to the computer via a communication line.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

[Problem] To display participants' facial expressions, and who said what, in an easily understandable manner in a TV conference system in which participants hold a TV conference, even when there are a plurality of participants at one place. [Solution] A TV conference system in which participants hold a TV conference, provided with an image analysis module 111 for analyzing an image of the TV conference in which the participants appear, and a face detection module 112 for detecting a part of a participant that includes the face as a face part, the detected face parts of a plurality of participants being displayed in list form by a face list display module 151.

Description

TV conference system, TV conference method, and program
The present invention relates to a TV conference system, a TV conference method, and a program that, in a TV conference system in which participants hold a TV conference, can appropriately display the faces of all participants at the connection destination even when there are a plurality of participants at one site, so that their facial expressions can be read and it is easy to understand who said what.
For TV conference systems in which a plurality of participants hold a conference, a video conference apparatus has been disclosed that can adjust the image quality of the image displayed during a remote conference according to the progress of the conference, so that at least the portions that participants are likely to focus on have adequate image quality (Patent Document 1).
JP 2012-138823 A
However, the apparatus of Patent Document 1 merely performs appropriate brightness and white balance adjustment on either the speaker's face area or the display area of an information display object. When there are many participants, even with such adjustment, the face areas may be displayed so small that it is difficult to read each person's expression. This situation is particularly likely to arise in a large conference room where participants sit with space between them. Furthermore, if the background of the meeting place is cluttered, reading every participant's expression at once becomes even more difficult. In addition, when there are many participants, it is also hard to tell who said what.
In view of these problems, the present invention aims to provide a TV conference system, a TV conference method, and a program that, even when there are a plurality of participants at one site, appropriately display the faces of all participants at the connection destination, allow their facial expressions to be read, and make it easy to understand who said what.
The present invention provides the following solutions.
The invention according to the first feature is a TV conference system in which participants hold a TV conference, comprising:
image analysis means for analyzing an image of the TV conference in which the participants appear;
face detection means for detecting, from the result of the image analysis, a part including a participant's face as a face part; and
face list display means for displaying a list of the detected face part images of a plurality of participants.
According to the invention of the first feature, a TV conference system in which participants hold a TV conference comprises image analysis means for analyzing an image of the TV conference in which the participants appear, face detection means for detecting, from the result of the image analysis, a part including a participant's face as a face part, and face list display means for displaying a list of the detected face part images of a plurality of participants.
The invention according to the first feature falls in the category of TV conference systems, but a TV conference method and a TV conference program exhibit the same operations and effects.
The invention according to the second feature is the TV conference system of the first feature, wherein the face list display means places each detected face part at the center of its display area and arranges the display areas with the centered face parts side by side to form the list display.
According to the invention of the second feature, in the TV conference system of the first feature, the face list display means places each detected face part at the center of its display area and arranges the display areas with the centered face parts side by side to form the list display.
The invention according to the third feature is the TV conference system of the first or second feature, wherein, when displaying the list of face part images, the face list display means replaces the background other than the detected faces before display.
According to the invention of the third feature, in the TV conference system of the first or second feature, when displaying the list of face part images, the face list display means replaces the background other than the detected faces before display.
The invention according to the fourth feature is the TV conference system of any one of the first to third features, wherein, after the list display starts, when a detected face part satisfies a predetermined condition, the face list display means replaces it with another image for display.
According to the invention of the fourth feature, in the TV conference system of any one of the first to third features, after the list display starts, when a detected face part satisfies a predetermined condition, the face list display means replaces it with another image for display.
The invention according to the fifth feature is the TV conference system of any one of the first to fourth features, wherein the face detection means further detects whether each participant with a detected face part is speaking, and the face list display means further changes the degree of attention drawn to a participant who is detected to be speaking when displaying the list of face part images.
According to the invention of the fifth feature, in the TV conference system of any one of the first to fourth features, the face detection means further detects whether each participant with a detected face part is speaking, and the face list display means further changes the degree of attention drawn to a participant who is detected to be speaking when displaying the list of face part images.
The invention according to the sixth feature is the TV conference system of any one of the first to fifth features, further comprising:
speech detection means for detecting the participants' speech;
speaker determination means for determining the speaker of the detected speech; and
speech history display means for displaying the speech history of a participant selected from the list display.
According to the invention of the sixth feature, the TV conference system of any one of the first to fifth features comprises speech detection means for detecting the participants' speech, speaker determination means for determining the speaker of the detected speech, and speech history display means for displaying the speech history of a participant selected from the list display.
The invention according to the seventh feature is a TV conference system in which participants hold a TV conference, comprising:
image analysis means for analyzing an image of the TV conference in which the participants appear;
face detection means for detecting, from the result of the image analysis, a part including a participant's face as a face part, and further detecting whether each participant with a detected face part is speaking;
face list display means for displaying a list of the detected face part images of a plurality of participants, and further changing the degree of attention drawn to a participant who is detected to be speaking when displaying the list;
speech detection means for detecting the participants' speech;
speaker determination means for determining the speaker of the detected speech; and
speech history display means for displaying the participants' speech history.
According to the invention of the seventh feature, a TV conference system in which participants hold a TV conference comprises image analysis means for analyzing an image of the TV conference in which the participants appear; face detection means for detecting, from the result of the image analysis, a part including a participant's face as a face part and further detecting whether each such participant is speaking; face list display means for displaying a list of the detected face part images of a plurality of participants and further changing the degree of attention drawn to a participant who is detected to be speaking; speech detection means for detecting the participants' speech; speaker determination means for determining the speaker of the detected speech; and speech history display means for displaying the participants' speech history.
The invention according to the eighth feature is a TV conference method in which participants hold a TV conference, comprising the steps of:
analyzing an image of the TV conference in which the participants appear;
detecting, from the result of the image analysis, a part including a participant's face as a face part; and
displaying a list of the detected face part images of a plurality of participants.
The invention according to the ninth feature is a program for causing a computer system in which participants hold a TV conference to execute the steps of:
analyzing an image of the TV conference in which the participants appear;
detecting, from the result of the image analysis, a part including a participant's face as a face part; and
displaying a list of the detected face part images of a plurality of participants.
According to the present invention, in a TV conference system in which participants hold a TV conference, even when there are a plurality of participants at one site, the faces of all participants are displayed appropriately at the connection destination and their facial expressions can be read, so it is possible to provide a TV conference system, a TV conference method, and a TV conference program with which it is easy to understand who said what.
FIG. 1 is a schematic diagram of a preferred embodiment of the present invention.
FIG. 2 is a diagram showing the relationship between the functional blocks of the TV conference device 100 and their functions.
FIG. 3 is a flowchart of the face list display processing in the TV conference device 100.
FIG. 4 is a diagram showing the relationship between the functional blocks and their functions when the image analysis processing and the face detection processing are performed by the transmitting-side TV conference device 100a and the face list display processing is performed by the receiving-side TV conference device 100b.
FIG. 5 is a flowchart for the case where the image analysis processing and the face detection processing are performed by the transmitting-side TV conference device 100a and the face list display processing is performed by the receiving-side TV conference device 100b.
FIG. 6 is a diagram showing the relationship between the functional blocks and their functions when the image analysis processing and the face detection processing are performed by the computer 200 and the face list display processing is performed by the receiving-side TV conference device 100b.
FIG. 7 is a flowchart for the case where the image analysis processing and the face detection processing are performed by the computer 200 and the face list display processing is performed by the receiving-side TV conference device 100b.
FIG. 8 is a diagram showing the relationship between the functional blocks and their functions when the TV conference device 100 performs the face list display processing and the speech history display processing.
FIG. 9 is a flowchart of the face list display processing and the speech history display processing in the TV conference device 100.
FIG. 10 is a diagram showing an example of the display of a general TV conference.
FIG. 11 is a diagram showing an example of the display of the face list display processing.
FIG. 12 is a diagram showing an example of a display in which the background is replaced in the face list display processing.
FIG. 13 is a diagram showing an example of a display in which a face part satisfying a predetermined condition is replaced in the face list display processing.
FIG. 14 is a diagram showing an example of a display in which the degree of attention drawn to a participant detected to be speaking is changed in the face list display processing.
FIG. 15 is a diagram showing an example of the display of the face list display processing and the speech history display processing.
Hereinafter, the best mode for carrying out the present invention will be described with reference to the drawings. This is merely an example, and the technical scope of the present invention is not limited to it.
[Outline of TV conference system]
FIG. 1 is a schematic diagram of a preferred embodiment of the present invention. The outline of the present invention will be described with reference to FIG. 1.
As shown in FIG. 2, the TV conference device 100 comprises a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The control unit 110 realizes the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130, and realizes the face detection module 112 in cooperation with the storage unit 130. The output unit 150 realizes the face list display module 151 in cooperation with the control unit 110 and the storage unit 130.
The TV conference device 100 need only include each of the above-described components as a whole; it may take the form of a built-in device or an external device. For example, the TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, or a personal computer, as well as an appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, a wearable terminal such as smart glasses or a head-mounted display, or another such article. The smartphone illustrated as the TV conference device 100a in FIG. 2 and the personal computer, display, and WEB camera illustrated as the TV conference device 100b are merely examples.
In the TV conference system of FIG. 1, the image analysis module 111 of the TV conference device 100 first receives a captured image from the connected TV conference device 100 via the communication unit 120 (step S01). The captured image is assumed to be a precise image carrying enough information for image analysis, with the number of pixels and the image quality specifiable. Audio data is received together with the captured image, and the received image and audio data are stored in the storage unit 130.
Next, the image analysis module 111 of the TV conference device 100 analyzes the received captured image (step S02). The image analysis here determines the positions and the number of the conference participants. In addition, gender and age may be analyzed, or individual participants may be identified using an employee database or the like. In the example of FIG. 1, the image analysis module 111 identifies by image analysis that there are four participants and where each of them is.
Next, the face detection module 112 of the TV conference device 100 detects a part including each participant's face as a face part, based on the image analysis result of step S02 (step S03). The face detection here serves to locate each participant's head; for example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the side or the back of the head may be detected as a face. In the example of FIG. 1, the face detection module 112 detects the four faces of the participants 1001, 1002, 1003, and 1004.
For this image analysis and face detection, the training may be supervised by a human, or machine learning or deep learning may be used. Since learning requires a large amount of data, a trained image analysis module 111 and face detection module 112 may be acquired from the outside via the communication unit 120. The method of image analysis and face detection does not limit this patent; existing techniques can be used.
Finally, the face list display module 151 of the TV conference device 100 displays a list of the detected face part images of the plurality of participants on the output unit 150 (step S04). Specifically, the output unit 150 is divided into at least as many areas as the number of detected participants, and the detected face part is placed at the center of each divided display area. According to the number of participants, the display areas with the centered face parts are arranged side by side to form the face list display. The face part displayed here is not limited to the head; it may extend from the head to the chest. In the example of FIG. 1, the participant 1001 is placed in the area 101, the participant 1002 in the area 102, the participant 1003 in the area 103, and the participant 1004 in the area 104 to form the face list display. As shown in FIG. 1, a captured image showing the whole scene may also be displayed in the empty area 105 where no face part is displayed.
FIG. 10 is a diagram showing an example of the display of a general TV conference. Since the captured image is displayed as is on the output unit 150, the overall atmosphere comes across, but the face parts are displayed so small that the expressions of the participants 1001, 1002, 1003, and 1004 are hard to read.
FIG. 11 is a diagram showing an example of the display of the face list display processing of the present invention. The participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 in the area 1102, the participant 1003 in the area 1103, and the participant 1004 in the area 1104, so each participant's face is displayed large and the facial expressions are easy to read. Here, an example is shown in which the area 1105 displays "TV conference system 2016/9/9 15:07:19 <<Connected to XX office>> Call start: 2016/9/9 14:05:33 Remote participants: 4", that is, the date and time, the connection destination, the call start time, and the number of participants at the other end. Information may be displayed in the empty area in this way, or a captured image showing the whole scene may be displayed there. When dividing the screen, the whole output unit 150 may instead be divided into face list display areas without leaving an empty area, and in that case the sizes of the participants' areas may differ.
As described above, according to the present invention, in a TV conference system in which participants hold a TV conference, even when there are a plurality of participants at one site, it is possible to provide a TV conference system, a TV conference method, and a TV conference program that appropriately display the faces of all participants at the connection destination so that their facial expressions can be read.
[Description of each function]
FIG. 2 is a diagram showing the relationship between the functional blocks of the TV conference device 100 and their functions. The TV conference device 100 comprises a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The control unit 110 realizes the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130, and realizes the face detection module 112 in cooperation with the storage unit 130. The output unit 150 realizes the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. The communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference devices 100a and 100b.
The TV conference device 100 need only include each of the above-described components as a whole; it may take the form of a built-in device or an external device. For example, the TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, or a personal computer, as well as an appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, a wearable terminal such as smart glasses or a head-mounted display, or another such article. The smartphone illustrated as the TV conference device 100a in FIG. 2 and the personal computer, display, and WEB camera illustrated as the TV conference device 100b are merely examples.
As the camera unit 10, the TV conference device 100 includes imaging devices such as a lens, an image sensor, various buttons, and a flash, and captures moving images and still images. The image obtained by imaging is a precise image carrying enough information for image analysis, with the number of pixels and the image quality specifiable. The camera unit 10 either includes a microphone for acquiring audio data along with the moving image, or can use the microphone function of the input unit 140.
As the control unit 110, the device includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like. The control unit 110 realizes the image analysis module 111 in cooperation with the communication unit 120 and the storage unit 130, and realizes the face detection module 112 in cooperation with the storage unit 130.
As the communication unit 120, the device includes hardware for communicating with other equipment, for example a WiFi (Wireless Fidelity) device conforming to IEEE 802.11 or a wireless device conforming to an IMT-2000 standard such as a third- or fourth-generation mobile communication system. A wired LAN connection may also be used.
As the storage unit 130, the device includes data storage such as a hard disk or semiconductor memory, and stores the captured images, the image analysis results, the face detection results, and other data needed for processing.
The input unit 140 has the functions necessary for using the TV conference system. Examples for realizing input include a liquid crystal display with a touch panel function, a keyboard, a mouse, a pen tablet, hardware buttons on the device, and a microphone for voice recognition. The input method does not particularly limit the functions of the present invention.
The output unit 150 has the functions necessary for using the TV conference system. The output unit 150 realizes the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. Examples for realizing output include a liquid crystal display, a PC display, projection onto a projector, and audio output. The output method does not particularly limit the functions of the present invention.
[Face list display processing]
FIG. 3 is a flowchart of the face list display processing in the TV conference device 100. The processing executed by each of the modules described above is explained along with this flowchart. Here, a flowchart is shown as an example for the case where the image analysis processing, the face detection processing, and the face list display processing are all performed by the TV conference device 100a on the captured-image receiving side.
First, the control unit 110 of the TV conference device 100a notifies the connected TV conference device 100b of the start of the TV conference via the communication unit 120 (step S301).
Upon receiving this notification, the control unit 110 of the TV conference device 100b starts imaging with the camera unit 10 (step S302). The captured image here is a precise image carrying enough information for the image analysis in the TV conference device 100a, with the number of pixels and the image quality specifiable. Audio data is also acquired along with the moving image.
Next, the control unit 110 of the TV conference device 100b transmits the captured image to the TV conference device 100a via the communication unit 120 (step S303). When the captured image is a moving image, the audio data is transmitted as well.
The image analysis module 111 of the TV conference device 100a receives the captured image from the TV conference device 100b via the communication unit 120 (step S304). Audio data is received together with the captured image, and the received image and audio data are stored in the storage unit 130.
The image analysis module 111 of the TV conference device 100a analyzes the received captured image (step S305). The image analysis here determines the positions and the number of the conference participants. In addition, gender and age may be analyzed, or individual participants may be identified using an employee database or the like.
Next, the face detection module 112 of the TV conference device 100a detects a part including each participant's face as a face part, based on the image analysis result of step S305 (step S306). The face detection here serves to locate each participant's head; for example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the side or the back of the head may be detected as a face.
For this image analysis and face detection, the training may be supervised by a human, or machine learning or deep learning may be used. Since learning requires a large amount of data, a trained image analysis module 111 and face detection module 112 may be acquired from the outside via the communication unit 120. The method of image analysis and face detection does not limit this patent; existing techniques can be used.
Based on the results of the image analysis and the face detection, the face list display module 151 of the TV conference device 100a displays a list of the detected face part images of the plurality of participants on the output unit 150 (step S307). Specifically, the output unit 150 is divided into at least as many areas as the number of detected participants, and the detected face part is placed and displayed at the center of each divided display area. According to the number of participants, the display areas with the centered face parts are arranged side by side to form the face list display. The face part displayed here is not limited to the head; it may extend from the head to the chest. When placing the face parts, if simply cutting them out of the whole image would make the faces differ in size from participant to participant, their sizes may be adjusted automatically so that they appear roughly equal, for example as sketched below.
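The automatic size adjustment mentioned here could be as simple as scaling every crop to a common height before it is centered in its cell; this is a minimal sketch, and the target height is an assumed value.

```python
import cv2

def normalize_face_crop(crop, target_h=240):
    """Scale a face crop to an assumed common height, preserving aspect ratio,
    so participants sitting nearer or farther from the camera appear equal."""
    h, w = crop.shape[:2]
    scale = target_h / h
    return cv2.resize(crop, (max(1, int(w * scale)), target_h))
```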
FIG. 10 is a diagram showing an example of the display of a general TV conference. Since the captured image is displayed as is on the output unit 150, the overall atmosphere comes across, but the face parts are displayed so small that it is difficult to read the expressions of the participants 1001, 1002, 1003, and 1004.
FIG. 11 is a diagram showing an example of the display of the face list display processing of the present invention. The participant 1001 is displayed in the area 1101 of the output unit 150, the participant 1002 in the area 1102, the participant 1003 in the area 1103, and the participant 1004 in the area 1104, so each participant's face is displayed large and the facial expressions are easy to read. Here, an example is shown in which the area 1105 displays "TV conference system 2016/9/9 15:07:19 <<Connected to XX office>> Call start: 2016/9/9 14:05:33 Remote participants: 4", that is, the date and time, the connection destination, the call start time, and the number of participants at the other end. Information may be displayed in the empty area in this way, or a captured image showing the whole scene may be displayed there. When dividing the screen, the whole output unit 150 may instead be divided into face list display areas without leaving an empty area, and in that case the sizes of the participants' areas may differ.
After the face list is displayed, the control unit 110 of the TV conference device 100a checks whether to end the TV conference (step S308). The user can request the end of the TV conference via the input unit 140. If the TV conference is to end, the process proceeds to step S309; otherwise, the process returns to step S304 and continues.
When ending the TV conference, the control unit 110 of the TV conference device 100a notifies the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S309).
The TV conference device 100b checks whether to end the TV conference (step S310); if not, the process returns to step S302 and continues, and if so, the TV conference ends.
Here, to simplify the flowchart, only the processing for displaying the image captured by the TV conference device 100b on the TV conference device 100a is described; in a normal TV conference, the processing for displaying the image captured by the TV conference device 100a on the TV conference device 100b runs in parallel. Likewise, only the flow in which the TV conference start and end notifications are sent from the TV conference device 100a to the TV conference device 100b is described, but they may equally be sent from the TV conference device 100b to the TV conference device 100a.
As described above, according to the present invention, in a TV conference system in which participants hold a TV conference, even when there are a plurality of participants at one site, it is possible to provide a TV conference system, a TV conference method, and a TV conference program that appropriately display the faces of all participants at the connection destination so that their facial expressions can be read.
[Background replacement function]
FIG. 12 is a diagram showing an example of a display in which the background is replaced in the face list display processing. Compared with FIG. 11, the background of each participant in the areas 1201, 1202, 1203, and 1204 of the output unit 150 has been replaced with a uniform background. The background replacement may be performed by the face list display module 151 together with the face list display processing of step S307 in the flowchart of FIG. 3. Replacing the background in this way removes extraneous information and makes each participant's expression even easier to read. Here, the same background is used for all of the participants 1001, 1002, 1003, and 1004, but the background may differ from participant to participant. When a TV conference is held among three or more sites, the background may be varied per site so that it is easy to see which participants are in the same room. Here, a captured image showing the whole scene is displayed in the area 1205, but the date and time, the connection destination, the call start time, the number of participants at the other end, and so on may be displayed as in FIG. 11. Information such as "Background: replaced" may also be shown somewhere on the output unit 150, as in the display 1206. Whether to use the background replacement function may be set by the user at any time, and the setting may be saved. A box-mask sketch of this replacement follows.
[Face part replacement function]
FIG. 13 is a diagram showing an example of a display in which a face part satisfying a predetermined condition is replaced in the face list display processing. The face part replacement may be performed by the face list display module 151 together with the face list display processing of step S307 in the flowchart of FIG. 3: the module judges a predetermined condition and replaces the face part when the condition is met. The predetermined condition may be, for example, that a participant is no longer in the captured image, that is, has left the camera's field of view. Comparing FIG. 13 with FIG. 12, the participant 1002 has left, so an image 1307 is displayed in the area 1302 instead of the participant 1002. Since this example displays a captured image of the whole scene in the area 1305, the absence of the participant 1002 can be confirmed there. The other three participants are present, so the participant 1001 is shown in the area 1301, the participant 1003 in the area 1303, and the participant 1004 in the area 1304, as in FIG. 12. The image 1307 may be a still image of the participant 1002, a favorite illustration, an avatar of the participant 1002, or the like, and each participant can set it as preferred. The area 1305 may also display the date and time, the connection destination, the call start time, the number of participants at the other end, and so on, and information such as "Away" may be shown somewhere in the area 1302, as in the display 1306. A sketch of this absence handling follows.
[Attention level change function]
FIG. 14 is a diagram showing an example of a display that changes the degree of attention drawn to a participant who is detected to be speaking in the face list display processing. In this processing, when the face detection module 112 performs the face detection of step S306 in the flowchart of FIG. 3, it also detects whether each participant is speaking, and when the face list display module 151 performs the face list display of step S307, it changes the degree of attention drawn to a participant detected to be speaking. The purpose of this change is to make the speaker stand out and attract attention. Specifically, as a comparison of FIG. 13 and FIG. 14 shows, the background color may be changed, as in the area 1404. The other three participants are not speaking, so the participant 1001 is shown in the area 1401, the participant 1002 in the area 1402, and the participant 1003 in the area 1403, as in FIG. 13. As another example, information such as "Speaking" may be shown somewhere in the area 1404, as in the display 1406, or the area 1404 showing the speaking participant 1004 may itself be moved to a prominent position such as the center of the output unit 150 (not shown). Furthermore, when the whole scene is displayed in the area 1405, the speaker's position may be indicated with a mark or the like, as in the display 1407, together with the attention change. Here, only the participant 1004 is speaking, but when several participants are speaking, the attention change may be applied to all of them. One way to realize such a change is sketched below.
[Image analysis and face detection on the transmitting side, face list display on the receiving side]
FIG. 4 is a diagram showing the relationship between the functional blocks and their functions when the image analysis processing and the face detection processing are performed by the transmitting-side TV conference device 100a and the face list display processing is performed by the receiving-side TV conference device 100b. The TV conference device 100 comprises a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The control unit 110 realizes the image analysis module 113 in cooperation with the camera unit 10 and the storage unit 130, and realizes the face detection module 112 in cooperation with the storage unit 130. The output unit 150 realizes the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. The communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference devices 100a and 100b.
FIG. 5 is a flowchart for the case where the image analysis and face detection processes are performed by the sending TV conference device 100a and the face list display process is performed by the receiving TV conference device 100b. The processing executed by each of the modules described above is explained along with this flow.
First, the control unit 110 of the TV conference device 100a notifies the connected TV conference device 100b of the start of the TV conference via the communication unit 120 (step S501).
Next, the control unit 110 of the TV conference device 100a starts imaging with the camera unit 10 (step S502). The captured image here is assumed to be a precise image carrying enough information for the image analysis on the TV conference device 100a, and the pixel count and image quality can be specified. Audio data is also acquired along with the capture of the moving image. The captured image and audio data are stored in the storage unit 130.
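A minimal sketch of this capture step with OpenCV is shown below; requesting a pixel count is best-effort (the driver may deliver a different size), and audio capture, which OpenCV does not provide, would use a separate audio API.

```python
# A hedged sketch of step S502 with OpenCV: open the camera, request a pixel
# count suitable for precise analysis, and grab one frame.
import cv2

cap = cv2.VideoCapture(0)                    # default camera
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)      # specify the pixel count
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
ok, frame = cap.read()                       # one captured image
if ok:
    cv2.imwrite("captured.jpg", frame)       # stand-in for the storage unit 130
cap.release()
```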
The image analysis module 113 of the TV conference device 100a performs image analysis of the captured image (step S503). The image analysis here determines, for example, the positions and the number of the conference participants. In addition, attributes such as gender and age may be analyzed, or an analysis that identifies individual participants using an employee database or the like may be performed.
Next, the face detection module 112 of the TV conference device 100a detects the part containing each participant's face as a face part, based on the image analysis result of step S503 (step S504). The face detection here serves to locate each participant's head; for example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the side or back of the head may still be detected as a face.
To perform this image analysis and face detection, humans may carry out supervised learning, or machine learning and deep learning may be used. Since learning requires a large amount of data, trained versions of the image analysis module 113 and the face detection module 112 may be obtained from outside via the communication unit 120. The methods of image analysis and face detection do not limit this patent; existing techniques may be used.
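By way of example only, the following is a minimal face-detection sketch using OpenCV's bundled Haar cascade; the function name and tuning parameters are assumptions, and since a frontal-face cascade cannot find the side or back of a head, a real system would substitute one of the learned detectors described above.

```python
# A minimal face-detection sketch using OpenCV's bundled Haar cascade.
import cv2

def detect_face_parts(frame_bgr):
    """Return bounding boxes (x, y, w, h) of face parts in a captured frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # minSize filters out spurious small detections in a wide conference shot.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=5, minSize=(40, 40))
    return list(faces)
```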
Next, the control unit 110 of the TV conference device 100a transmits the analysis image to the TV conference device 100b via the communication unit 120 (step S505). When the captured image is a moving image, the audio data is transmitted as well. The analysis image transmitted here may consist only of the data judged, from the results of the image analysis and face detection, to be necessary for the face list display, or it may additionally include the captured image itself, as captured by the TV conference device 100a and used for the image analysis, or a version of it with changed resolution.
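The patent leaves the format of the analysis image open; assuming a simple JSON wire format with base64-encoded JPEG face crops, a sketch of packaging only the data needed for the face list display might look as follows.

```python
# Sketch of packaging only the face-list data; the wire format is an assumption.
import base64
import json
import cv2

def build_analysis_payload(frame_bgr, face_boxes):
    faces = []
    for (x, y, w, h) in face_boxes:
        crop = frame_bgr[y:y + h, x:x + w]
        ok, jpg = cv2.imencode(".jpg", crop)
        if ok:
            faces.append({"box": [int(x), int(y), int(w), int(h)],
                          "jpeg": base64.b64encode(jpg.tobytes()).decode()})
    # Sending only cropped faces keeps the payload small, which underlies the
    # data-volume advantage noted for this configuration.
    return json.dumps({"participants": len(faces), "faces": faces})
```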
The TV conference device 100b receives the analysis image from the TV conference device 100a via the communication unit 120 (step S506). The audio data is received along with the analysis image. The received analysis image and audio data are stored in the storage unit 130.
Based on the received analysis image and audio data, the face list display module 151 of the TV conference device 100b displays the images of the detected face parts of the multiple participants as a list on the output unit 150 (step S507). Concretely, the output unit 150 is divided into at least as many regions as there are detected participants, and each detected face part is placed and displayed at the center of one of the divided display regions. According to the number of participants, the display regions with a face part centered in them are arranged to form the face list display. The face part displayed here may cover not only the head but the area from the head down to the chest. If the face parts are simply cropped from the whole image, the face sizes may differ between participants, so when placing them the sizes may be adjusted automatically so that the faces appear roughly equal.
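A minimal layout sketch for this step is shown below: it divides the screen into a near-square grid with at least as many cells as participants, scales each face crop to a comparable size, and centers it in its cell. The function name, margin factor, and default screen size are assumptions. Called with the face crops obtained in step S504, this would produce a tiled screen like FIG. 11.

```python
# A minimal layout sketch for step S507: near-square grid, one cell per
# participant, each resized face centered in its cell.
import math
import numpy as np
import cv2

def face_list_canvas(face_crops, screen_w=1280, screen_h=720):
    n = max(1, len(face_crops))
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    cell_w, cell_h = screen_w // cols, screen_h // rows
    canvas = np.zeros((screen_h, screen_w, 3), dtype=np.uint8)
    for i, crop in enumerate(face_crops):
        # Scale every face to the same fraction of its cell so all faces
        # appear roughly equal in size, as the text suggests.
        scale = 0.8 * min(cell_w / crop.shape[1], cell_h / crop.shape[0])
        face = cv2.resize(crop, None, fx=scale, fy=scale)
        r, c = divmod(i, cols)
        y = r * cell_h + (cell_h - face.shape[0]) // 2
        x = c * cell_w + (cell_w - face.shape[1]) // 2
        canvas[y:y + face.shape[0], x:x + face.shape[1]] = face
    return canvas
```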
FIG. 10 is a diagram showing an example of an ordinary TV conference display. Because the captured image is displayed as-is on the output unit 150, the overall atmosphere comes across, but the faces are displayed so small that it is difficult to read the individual expressions of participants 1001, 1002, 1003, and 1004.
FIG. 11 is a diagram showing an example of the face list display process of the present invention. Participant 1001 is displayed in region 1101 of the output unit 150, participant 1002 in region 1102, participant 1003 in region 1103, and participant 1004 in region 1104, so each participant's face is displayed large and the expressions are easy to read. In this example, region 1105 shows "TV conference system 2016/9/9 15:07:19 <<Connected to ○○ office>> Call start: 2016/9/9 14:05:33 Remote participants: 4", that is, the date and time, the connection destination, the call start time, and the number of remote participants. Information may be displayed in the empty region in this way, or a captured image showing the whole scene may be displayed there. Alternatively, when dividing the screen, the whole output unit 150 may be divided into the face list display without leaving an empty region, in which case the regions of the individual participants may differ in size.
The control unit 110 of the TV conference device 100a checks whether to end the TV conference (step S508). The user can request the end of the TV conference via the input unit 140. If the TV conference is to end, the flow proceeds to the next step, S509; if not, it returns to step S502 and continues.
When ending the TV conference, the control unit 110 of the TV conference device 100a notifies the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S509).
For simplicity of the flowchart, only the processing for displaying images captured by the TV conference device 100a on the TV conference device 100b is described here; in an ordinary TV conference, the processing for displaying images captured by the TV conference device 100b on the TV conference device 100a runs in parallel. Likewise, only the flow in which the TV conference device 100a notifies the TV conference device 100b of the conference start and end is described, but these notifications may equally be made from the TV conference device 100b to the TV conference device 100a.
Comparing the configuration of FIGS. 2 and 3, in which the image analysis, face detection, and face list display processes are all performed on the receiving side, with the configuration of FIGS. 4 and 5, in which the image analysis and face detection are performed on the sending side and the face list display on the receiving side, the latter only needs to transmit the analysis image, so a reduction in the amount of communication data can be expected, especially when the background has already been replaced.
As described above, according to the present invention, in a TV conference system in which participants hold a TV conference, it is possible to provide a TV conference system, a TV conference method, and a TV conference program that, even when there are multiple participants at one site, appropriately display all of their faces at the connection destination and allow the participants' expressions to be read.
[Image analysis and face detection on the computer 200, face list display on the receiving side]
FIG. 6 is a diagram showing the functional blocks and the relationships between the functions when the image analysis and face detection processes are performed by the computer 200 and the face list display process is performed by the receiving TV conference device 100b. The TV conference device 100 comprises a camera unit 10, a control unit 110, a communication unit 120, a storage unit 130, an input unit 140, and an output unit 150. The output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. The computer 200 comprises a control unit 210, a communication unit 220, and a storage unit 230. The control unit 210 implements the image analysis module 211 in cooperation with the communication unit 220 and the storage unit 230, and implements the face detection module 212 in cooperation with the storage unit 230. The communication network 300 may be a public communication network such as the Internet or a dedicated communication network, and enables communication between the TV conference device 100a and the computer 200 and between the TV conference device 100b and the computer 200.
The TV conference device 100 need only provide the configurations described above as a device overall; it may take the form of a built-in device, an external device, or any other form. For example, the TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, or a personal computer, as well as an appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, a wearable terminal such as smart glasses or a head-mounted display, or another such article. The smartphone shown as the TV conference device 100a in FIG. 6 and the personal computer, display, and web camera shown as the TV conference device 100b are merely examples.
As the camera unit 10, the TV conference device 100 includes imaging devices such as a lens, an image sensor, various buttons, and a flash, and captures moving images, still images, and the like as captured images. The image obtained by capturing is assumed to be a precise image carrying enough information for the image analysis, and the pixel count and image quality can be specified. Further, the camera unit 10 either includes a microphone for acquiring audio data along with the capture of moving images, or can use the microphone function of the input unit 140.
As the control unit 110, the device includes a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), and the like.
As the communication unit 120, the device includes a device for enabling communication with other equipment, for example a WiFi (Wireless Fidelity) device conforming to IEEE 802.11 or a wireless device conforming to the IMT-2000 standard, such as for third- and fourth-generation mobile communication systems. A wired LAN connection may also be used.
As the storage unit 130, the device includes a data storage unit realized by a hard disk or semiconductor memory, which stores captured images, image analysis results, face detection results, and other data needed for processing.
The input unit 140 has the functions necessary for using the TV conference system. Examples for realizing input include a liquid crystal display realizing a touch panel function, a keyboard, a mouse, a pen tablet, hardware buttons on the device, and a microphone for voice recognition. The functions of the present invention are not particularly limited by the input method.
The output unit 150 has the functions necessary for using the TV conference system. The output unit 150 implements the face list display module 151 in cooperation with the control unit 110 and the storage unit 130. Examples for realizing output include display forms such as a liquid crystal display, a PC display, and projection by a projector, as well as audio output. The functions of the present invention are not particularly limited by the output method.
The computer 200 may be a general computer having the functions described below. Although not shown here, it may include an input unit and an output unit as needed.
The computer 200 includes a CPU, RAM, ROM, and the like as the control unit 210. The control unit 210 implements the image analysis module 211 in cooperation with the communication unit 220 and the storage unit 230, and implements the face detection module 212 in cooperation with the storage unit 230.
As the communication unit 220, the computer includes a device for enabling communication with other equipment, for example a WiFi device conforming to IEEE 802.11 or a wireless device conforming to the IMT-2000 standard, such as for third- and fourth-generation mobile communication systems. A wired LAN connection may also be used.
As the storage unit 230, the computer includes a data storage unit realized by a hard disk or semiconductor memory. The storage unit 230 holds data such as the acquired captured images, image analysis results, and face detection results.
FIG. 7 is a flowchart for the case where the image analysis and face detection processes are performed by the computer 200 and the face list display process is performed by the receiving TV conference device 100b. The processing executed by each of the modules described above is explained along with this flow.
First, the control unit 110 of the TV conference device 100a notifies the computer 200 and the connected TV conference device 100b of the start of the TV conference via the communication unit 120 (step S701). In a configuration in which the TV conference device 100a and the TV conference device 100b do not communicate directly, the computer 200, having received the conference start notification from the TV conference device 100a, notifies the TV conference device 100b of the start of the TV conference.
Next, the control unit 110 of the TV conference device 100a starts imaging with the camera unit 10 (step S702). The captured image here is assumed to be a precise image carrying enough information for the image analysis on the computer 200, and the pixel count and image quality can be specified. Audio data is also acquired along with the capture of the moving image. The captured image and audio data are stored in the storage unit 130.
Next, the control unit 110 of the TV conference device 100a transmits the captured image to the computer 200 via the communication unit 120 (step S703). When the captured image is a moving image, the audio data is transmitted as well.
The image analysis module 211 of the computer 200 receives the captured image from the TV conference device 100a via the communication unit 220 (step S704). The audio data is received along with the captured image. The received captured image and audio data are stored in the storage unit 230.
The image analysis module 211 of the computer 200 performs image analysis of the received captured image (step S705). The image analysis here determines, for example, the positions and the number of the conference participants. In addition, attributes such as gender and age may be analyzed, or an analysis that identifies individual participants using an employee database or the like may be performed.
Next, the face detection module 212 of the computer 200 detects the part containing each participant's face as a face part, based on the image analysis result of step S705 (step S706). The face detection here serves to locate each participant's head; for example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the side or back of the head may still be detected as a face.
To perform this image analysis and face detection, humans may carry out supervised learning, or machine learning and deep learning may be used. Trained versions of the image analysis module 211 and the face detection module 212 may also be obtained from outside. The methods of image analysis and face detection do not limit this patent; existing techniques may be used.
Next, the control unit 210 of the computer 200 transmits the analysis image to the TV conference device 100b via the communication unit 220 (step S707). When the captured image is a moving image, the audio data is transmitted as well. The analysis image transmitted here may consist only of the data judged, from the results of the image analysis and face detection, to be necessary for the face list display, or it may additionally include the captured image itself, as captured by the TV conference device 100a and used for the image analysis on the computer 200, or a version of it with changed resolution.
The TV conference device 100b receives the analysis image from the computer 200 via the communication unit 120 (step S708). The audio data is received along with the analysis image. The received analysis image and audio data are stored in the storage unit 130.
Based on the received analysis image and audio data, the face list display module 151 of the TV conference device 100b displays the images of the detected face parts of the multiple participants as a list on the output unit 150 (step S709). Concretely, the output unit 150 is divided into at least as many regions as there are detected participants, and each detected face part is placed and displayed at the center of one of the divided display regions. According to the number of participants, the display regions with a face part centered in them are arranged to form the face list display. The face part displayed here may cover not only the head but the area from the head down to the chest. If the face parts are simply cropped from the whole image, the face sizes may differ between participants, so when placing them the sizes may be adjusted automatically so that the faces appear roughly equal.
FIG. 10 is a diagram showing an example of an ordinary TV conference display. Because the captured image is displayed as-is on the output unit 150, the overall atmosphere comes across, but the faces are displayed so small that it is difficult to read the individual expressions of participants 1001, 1002, 1003, and 1004.
FIG. 11 is a diagram showing an example of the face list display process of the present invention. Participant 1001 is displayed in region 1101 of the output unit 150, participant 1002 in region 1102, participant 1003 in region 1103, and participant 1004 in region 1104, so each participant's face is displayed large and the expressions are easy to read. In this example, region 1105 shows "TV conference system 2016/9/9 15:07:19 <<Connected to ○○ office>> Call start: 2016/9/9 14:05:33 Remote participants: 4", that is, the date and time, the connection destination, the call start time, and the number of remote participants. Information may be displayed in the empty region in this way, or a captured image showing the whole scene may be displayed there. Alternatively, when dividing the screen, the whole output unit 150 may be divided into the face list display without leaving an empty region, in which case the regions of the individual participants may differ in size.
The control unit 110 of the TV conference device 100a checks whether to end the TV conference (step S710). The user can request the end of the TV conference via the input unit 140. If the TV conference is to end, the flow proceeds to the next step, S711; if not, it returns to step S702 and continues.
When ending the TV conference, the control unit 110 of the TV conference device 100a notifies the computer 200 and the TV conference device 100b of the end of the TV conference via the communication unit 120 (step S711). In a configuration in which the TV conference device 100a and the TV conference device 100b do not communicate directly, the computer 200, having received the conference end notification from the TV conference device 100a, notifies the TV conference device 100b of the end of the TV conference.
For simplicity of the flowchart, only the processing for displaying images captured by the TV conference device 100a on the TV conference device 100b is described here; in an ordinary TV conference, the processing for displaying images captured by the TV conference device 100b on the TV conference device 100a runs in parallel. Likewise, only the flow in which the TV conference device 100a notifies the TV conference device 100b of the conference start and end is described, but these notifications may equally be made from the TV conference device 100b to the TV conference device 100a. In these cases too, when the TV conference device 100a and the TV conference device 100b do not communicate directly, each notification is made via the computer 200.
Comparing the configuration in which the TV conference device 100 performs the image analysis, face detection, and face list display processes with the configuration in which the computer 200 performs the image analysis and face detection processes, the latter has the advantage that the image analysis module 211 and the face detection module 212 are easy to update. Moreover, since machine learning and deep learning require large amounts of data, the computer 200 also has the advantage in this respect that its storage unit 230 can readily be equipped with large-capacity storage.
As described above, according to the present invention, in a TV conference system in which participants hold a TV conference, it is possible to provide a TV conference system, a TV conference method, and a TV conference program that, even when there are multiple participants at one site, appropriately display all of their faces at the connection destination and allow the participants' expressions to be read.
[Speech history display function]
FIG. 8 is a diagram showing the functional blocks and the relationships between the functions when the TV conference device 100 performs the face list display process and the speech history display process. In addition to the configuration of FIG. 2, the control unit 110 implements the speech detection module 114 and the speaker determination module 115 in cooperation with the communication unit 120 and the storage unit 130. The output unit 150 implements the speech history display module 152 in cooperation with the control unit 110 and the storage unit 130.
The TV conference device 100 need only provide the configurations described above as a device overall; it may take the form of a built-in device, an external device, or any other form. For example, the TV conference device 100 may be a mobile phone, a portable information terminal, a tablet terminal, or a personal computer, as well as an appliance such as a netbook terminal, a slate terminal, an electronic book terminal, or a portable music player, a wearable terminal such as smart glasses or a head-mounted display, or another such article. The smartphone shown as the TV conference device 100a in FIG. 8 and the personal computer, display, and web camera shown as the TV conference device 100b are merely examples.
FIG. 9 is a flowchart of the face list display process and the speech history display process in the TV conference device 100. The processing executed by each of the modules described above is explained along with this flow. Shown here is an example in which the whole sequence from the image analysis process to the speech history display process is performed by the TV conference device 100 on the captured-image receiving side. However, as described above for the face list display process, the configuration may instead have the TV conference device 100 on the captured-image sending side or the computer 200 perform the image analysis, face detection, speech detection, and speaker determination processes, with the TV conference device 100 on the receiving side performing only the face list display and speech history display processes.
First, the image analysis module 111 of the TV conference device 100 on the captured-image receiving side receives the captured image via the communication unit 120 (step S901). The audio data is received along with the captured image. The received captured image and audio data are stored in the storage unit 130. For simplicity of the flowchart, the TV conference start notification is not shown, but it is assumed to have been made before step S901.
Next, the image analysis module 111 performs image analysis of the received captured image (step S902). The image analysis here determines, for example, the positions and the number of the conference participants. In addition, attributes such as gender and age may be analyzed, or an analysis that identifies individual participants using an employee database or the like may be performed.
Next, the face detection module 112 detects the part containing each participant's face as a face part, based on the image analysis result of step S902 (step S903). The face detection here serves to locate each participant's head; for example, even when a participant is not facing the camera and parts such as the eyes and mouth cannot be found, the side or back of the head may still be detected as a face.
To perform this image analysis and face detection, humans may carry out supervised learning, or machine learning and deep learning may be used. Since learning requires a large amount of data, trained versions of the image analysis module 111 and the face detection module 112 may be obtained from outside via the communication unit 120. The methods of image analysis and face detection do not limit this patent; existing techniques may be used.
Next, the speech detection module 114 detects each participant's speech based on the received audio data (step S904). Speech detection here analyzes the content of the received audio data by speech recognition and converts it into text. When multiple people are speaking at the same time and separating the utterances is difficult, the image analysis result of step S902, the face detection result of step S903, and the like may be used together with the voice pitch, the input direction, and so on to improve recognition. The method of speech recognition does not limit this patent; existing techniques may be used.
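As a deliberately simplified stand-in for step S904, the following sketch detects speech intervals by frame energy alone; an actual system would run a speech recognizer to obtain text, and the frame size and threshold here are assumptions for illustration.

```python
# Mark intervals whose RMS frame energy exceeds a threshold as speech.
# Samples are expected as floats normalized to [-1, 1].
import numpy as np

def detect_speech_intervals(samples, rate=16000, frame_ms=30, threshold=0.02):
    x = np.asarray(samples, dtype=np.float32)
    frame = int(rate * frame_ms / 1000)
    n = len(x) // frame
    active = [float(np.sqrt(np.mean(x[i * frame:(i + 1) * frame] ** 2))) > threshold
              for i in range(n)]
    intervals, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000, n * frame_ms / 1000))
    return intervals  # list of (start_sec, end_sec) speech intervals
```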
Next, the speaker determination module 115 determines the speaker based on the image analysis result of step S902, the face detection result of step S903, the speech detection result of step S904, and the like (step S905). Speaker determination here identifies which participant is speaking, using the mouth movements in the captured and analysis images, the voice pitch, the input direction, and so on, and links that participant to the speech content detected in step S904. The results of these processes are stored in the storage unit 130 as data recording which participant said what and when.
To perform this speech detection and speaker determination, humans may carry out supervised learning, or machine learning and deep learning may be used. Since learning requires a large amount of data, trained versions of the speech detection module 114 and the speaker determination module 115 may be obtained from outside via the communication unit 120. The methods of speech detection and speaker determination do not limit this patent; existing techniques may be used.
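Continuing the sketch under the same assumptions, speaker determination might attribute each detected speech interval to the participant whose mouth region moved the most during it; mouth_motion below is a hypothetical helper standing in for the image-based cues described above.

```python
# A hedged sketch of step S905: attribute a detected speech interval to the
# participant whose mouth region moved the most during it. mouth_motion is a
# hypothetical callable scoring frame-to-frame change in the lower half of
# that participant's face box.
def determine_speaker(speech_intervals, face_ids, mouth_motion):
    attributed = []
    for (t0, t1) in speech_intervals:
        scores = {fid: mouth_motion(fid, t0, t1) for fid in face_ids}
        speaker = max(scores, key=scores.get)
        attributed.append({"start": t0, "end": t1, "speaker": speaker})
    return attributed
```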
Based on the results of the image analysis and face detection, the face list display module 151 displays the images of the detected face parts of the multiple participants as a list on the output unit 150 (step S906). Concretely, the output unit 150 is divided into at least as many regions as there are detected participants, and each detected face part is placed and displayed at the center of one of the divided display regions. According to the number of participants, the display regions with a face part centered in them are arranged to form the face list display. The face part displayed here may cover not only the head but the area from the head down to the chest. If the face parts are simply cropped from the whole image, the face sizes may differ between participants, so when placing them the sizes may be adjusted automatically so that the faces appear roughly equal.
After the face list is displayed, the speech history display module 152 checks whether to display the speech history (step S907). The user can request the speech history display via the input unit 140. If the speech history is to be displayed, the flow proceeds to the next step, S908; if not, the processing ends.
When displaying the speech history, the speech history display module 152 has the user select, via the input unit 140, the participants whose speech history is to be displayed (step S908). Any number of participants may be selected here: one, several, or all of them. A setting that always displays everyone's speech history, without selecting speakers, may also be made selectable.
Finally, the speech history display module 152 displays the selected participants' speech history on the output unit 150 (step S909). For simplicity of the flowchart, the TV conference end notification is not shown, but at the end of the TV conference the device notifies the remote TV conference device 100 of the end.
FIG. 15 is a diagram showing an example of the display of the face list display process and the speech history display process. Participant 1001 is displayed in region 1501 of the output unit 150, participant 1002 in region 1502, participant 1003 in region 1503, and participant 1004 in region 1504, so each participant's face is displayed large and the expressions are easy to read. Here, participant 1001 is labeled "Participant A" in display 1506, participant 1002 "Participant B" in display 1507, participant 1003 "Participant C" in display 1508, and participant 1004 "Participant D" in display 1509 on the output unit 150. To select "Participant C" in the participant selection of step S908, it suffices to select participant C's region 1503 or display 1508 with the pointer 1510. The fact that "Participant C" is selected may be indicated clearly on the output unit 150. The figure shows an example in which what "Participant C" said is displayed in region 1505 by the speech history display of step S909. When there is too much speech history to display, a scroll bar 1511 or the like may be provided in the speech history region 1505 so that earlier statements can be scrolled back to. When the speech history of multiple participants is displayed, the participant who made each statement is displayed together with its content. When displaying the speech history, also displaying the time of each statement makes it even easier to follow.
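Behind steps S905 to S909, the speech history could be held in a simple store that records who said what and when and is filtered by the participants selected in step S908; the field names in this sketch are assumptions.

```python
# A small sketch of the speech-history store: append attributed utterances,
# then filter by the participants selected in step S908.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SpeechHistory:
    entries: list = field(default_factory=list)

    def add(self, speaker: str, text: str, at: datetime):
        self.entries.append({"speaker": speaker, "text": text, "time": at})

    def for_participants(self, selected):
        # Returning speaker and time with each entry matches the note that
        # showing who spoke, and when, makes the history easier to follow.
        return [e for e in self.entries if e["speaker"] in selected]
```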
As described above, according to the present invention, in a TV conference system in which participants hold a TV conference, it is possible to provide a TV conference system, a TV conference method, and a TV conference program that, even when there are multiple participants at one site, appropriately display all of their faces at the connection destination, allow the participants' expressions to be read, and make it easy to understand who said what.
The means and functions described above are realized by a computer (including a CPU, an information processing device, and various terminals) reading and executing a predetermined program. The program may be provided, for example, in a form delivered from a computer over a network (SaaS: Software as a Service), or in a form recorded on a computer-readable recording medium such as a flexible disk, a CD (CD-ROM and the like), a DVD (DVD-ROM, DVD-RAM, and the like), or a compact memory. In that case, the computer reads the program from the recording medium, transfers it to an internal or external storage device, stores it, and executes it. The program may also be recorded in advance on a storage device (recording medium) such as a magnetic disk, an optical disk, or a magneto-optical disk, and provided from that storage device to the computer via a communication line.
The embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. Moreover, the effects described in the embodiments of the present invention merely enumerate the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.
100 TV conference device, 200 computer, 300 communication network

Claims (9)

  1.  A TV conference system in which participants conduct a TV conference, comprising:
     image analysis means for performing image analysis of an image of the TV conference in which the participants appear;
     face detection means for detecting, from the result of the image analysis, a part including a participant's face as a face part; and
     face list display means for displaying a list of the images of the detected face parts of a plurality of participants.
  2.  The TV conference system according to claim 1, wherein the face list display means places each detected face part at the center of a display region and arranges the display regions with the face parts centered in them to form the list display.
  3.  The TV conference system according to claim 1 or claim 2, wherein, when displaying the list of the face part images, the face list display means replaces and displays the background portions other than the detected faces.
  4.  The TV conference system according to any one of claims 1 to 3, wherein, after the start of the list display, when a detected face part satisfies a predetermined condition, the face list display means replaces it with another image for display.
  5.  The TV conference system according to any one of claims 1 to 4, wherein the face detection means further detects whether each participant whose face part was detected is speaking, and the face list display means further changes, when displaying the list of the face part images, the attention level of any participant detected to be speaking.
  6.  The TV conference system according to any one of claims 1 to 5, comprising:
     speech detection means for detecting the participants' speech;
     speaker determination means for determining the speaker of detected speech; and
     speech history display means for displaying the speech history of a participant selected from the list display.
  7.  A TV conference system in which participants conduct a TV conference, comprising:
     image analysis means for performing image analysis of an image of the TV conference in which the participants appear;
     face detection means for detecting, from the result of the image analysis, a part including a participant's face as a face part, and further detecting whether each participant whose face part was detected is speaking;
     face list display means for displaying a list of the images of the detected face parts of a plurality of participants, and further changing, when displaying the list of the face part images, the attention level of any participant detected to be speaking;
     speech detection means for detecting the participants' speech;
     speaker determination means for determining the speaker of detected speech; and
     speech history display means for displaying the participants' speech history.
  8.  A TV conference method in which participants conduct a TV conference, comprising the steps of:
     performing image analysis of an image of the TV conference in which the participants appear;
     detecting, from the result of the image analysis, a part including a participant's face as a face part; and
     displaying a list of the images of the detected face parts of a plurality of participants.
  9.  A program for causing a computer system in which participants conduct a TV conference to execute the steps of:
     performing image analysis of an image of the TV conference in which the participants appear;
     detecting, from the result of the image analysis, a part including a participant's face as a face part; and
     displaying a list of the images of the detected face parts of a plurality of participants.
PCT/JP2016/078992 2016-09-30 2016-09-30 Tv conference system, tv conference method, and program WO2018061173A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/078992 WO2018061173A1 (en) 2016-09-30 2016-09-30 Tv conference system, tv conference method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/078992 WO2018061173A1 (en) 2016-09-30 2016-09-30 Tv conference system, tv conference method, and program

Publications (1)

Publication Number Publication Date
WO2018061173A1 (en)

Family

ID=61760351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/078992 WO2018061173A1 (en) 2016-09-30 2016-09-30 Tv conference system, tv conference method, and program

Country Status (1)

Country Link
WO (1) WO2018061173A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004153674A (en) * 2002-10-31 2004-05-27 Sony Corp Camera apparatus
JP2009194857A (en) * 2008-02-18 2009-08-27 Sharp Corp Communication conference system, communication apparatus, communication conference method, and computer program
JP2009206924A (en) * 2008-02-28 2009-09-10 Fuji Xerox Co Ltd Information processing apparatus, information processing system and information processing program
JP2012054897A (en) * 2010-09-03 2012-03-15 Sharp Corp Conference system, information processing apparatus, and information processing method
JP2014175866A (en) * 2013-03-08 2014-09-22 Ricoh Co Ltd Video conference system
JP2015019162A (en) * 2013-07-09 2015-01-29 大日本印刷株式会社 Convention support system
JP2016134781A (en) * 2015-01-20 2016-07-25 株式会社リコー Information processing device, voice output method, program and communication system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021521497A (en) * 2018-05-04 2021-08-26 グーグル エルエルシーGoogle LLC Adaptation of automated assistants based on detected mouth movements and / or gaze
US11493992B2 (en) 2018-05-04 2022-11-08 Google Llc Invoking automated assistant function(s) based on detected gesture and gaze
US11614794B2 (en) 2018-05-04 2023-03-28 Google Llc Adapting automated assistant based on detected mouth movement and/or gaze
US11688417B2 (en) 2018-05-04 2023-06-27 Google Llc Hot-word free adaptation of automated assistant function(s)
JP7471279B2 (en) 2018-05-04 2024-04-19 グーグル エルエルシー Adapting an automated assistant based on detected mouth movements and/or gaze
EP3627832A1 (en) 2018-09-21 2020-03-25 Yamaha Corporation Image processing apparatus, camera apparatus, and image processing method
US10965909B2 (en) 2018-09-21 2021-03-30 Yamaha Corporation Image processing apparatus, camera apparatus, and image processing method
WO2020151443A1 (en) * 2019-01-23 2020-07-30 广州视源电子科技股份有限公司 Video image transmission method, device, interactive intelligent tablet and storage medium
US12020704B2 (en) 2022-01-19 2024-06-25 Google Llc Dynamic adaptation of parameter set used in hot word free adaptation of automated assistant

Similar Documents

Publication Publication Date Title
WO2018061173A1 (en) Tv conference system, tv conference method, and program
KR20140100704A (en) Mobile terminal comprising voice communication function and voice communication method thereof
JP7100824B2 (en) Data processing equipment, data processing methods and programs
JP7283384B2 (en) Information processing terminal, information processing device, and information processing method
US9247206B2 (en) Information processing device, information processing system, and information processing method
US20170185365A1 (en) System and method for screen sharing
JP2004128614A (en) Image display controller and image display control program
JP2014220619A (en) Conference information recording system, information processing unit, control method and computer program
JP2011061450A (en) Conference communication system, method, and program
WO2018158852A1 (en) Telephone call system and communication system
CN114531564A (en) Processing method and electronic equipment
JP2020136921A (en) Video call system and computer program
AU2013222959A1 (en) Method and apparatus for processing information of image including a face
JP4973908B2 (en) Communication terminal and display method thereof
US20230093298A1 (en) Voice conference apparatus, voice conference system and voice conference method
WO2019026395A1 (en) Information processing device, information processing method, and program
JP5432805B2 (en) Speaking opportunity equalizing method, speaking opportunity equalizing apparatus, and speaking opportunity equalizing program
JP2004112511A (en) Display controller and method therefor
US11928253B2 (en) Virtual space control system, method for controlling the same, and control program
KR101562901B1 (en) System and method for supporing conversation
US11949727B2 (en) Organic conversations in a virtual group setting
JP5613102B2 (en) CONFERENCE DEVICE, CONFERENCE METHOD, AND CONFERENCE PROGRAM
JP2005091463A (en) Information processing device
WO2006106671A1 (en) Image processing device, image display device, reception device, transmission device, communication system, image processing method, image processing program, and recording medium containing the image processing program
JP2023184519A (en) Information processing system, information processing method, and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 16917723
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 16917723
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: JP