WO2022249555A1

WO2022249555A1 - Image output device, image output method, and program

Info

Publication number: WO2022249555A1
Application number: PCT/JP2022/004219
Authority: WO
Inventors: 和博嶋内
Original assignee: ソニーグループ株式会社
Priority date: 2021-05-26
Filing date: 2022-02-03
Publication date: 2022-12-01

Abstract

The present disclosure pertains to an image output device, an image output method, and a program that enable more suitable switching of different types of images. An image selection unit selects at least one image to be outputted, on the basis of a detection result obtained by performing a different type of a detection process on each of a plurality of images. The feature of the present disclosure can be applied to, for example, a switcher.

Description

Image output device, image output method, and program

The present disclosure relates to an image output device, an image output method, and a program, and more particularly to an image output device, an image output method, and a program that enable more suitable switching between different types of images.

In general, videos of lectures, lessons, product reviews, and the like are often constructed by combining not only shots with only the instructor as the subject, but also shots of objects in the instructor's hands and of the work at hand, and PC images for conveying PC operations. A user who shoots and creates such moving images needs to appropriately switch and edit a plurality of camera inputs according to the progress of a lecture or the like.

Patent Literature 1 and Patent Literature 2 disclose techniques for accepting a plurality of moving images as input and appropriately switching between them. In either case, the line of sight of the user is detected, and the image from the camera corresponding to the line of sight is specified and output.

In order to detect the line of sight correctly, the user's face must be fixed so that it faces the front to some extent with respect to the camera for detecting the line of sight. The configuration of Patent Literature 1 requires the user to wear a head-mounted display, and the configuration of Patent Literature 2 assumes that the user is always facing the camera. In lectures, lessons, product reviews, and the like, these restrictions hinder the performance of the instructor and the progress of the lecture.

Patent Literature 1: WO2017/145645
Patent Literature 2: JP 2010-161655 A

A surveillance camera system can also be cited as a configuration that accepts multiple moving images as input and switches them appropriately. In a monitoring camera system, a uniform abnormality detection process or the like is performed on a plurality of camera images, and a camera image in which an abnormality or the like is detected is selected.

However, it was not always possible to switch the images appropriately for the user and the viewer.

The present disclosure has been made in view of such circumstances, and is intended to make it possible to switch between different types of images in a more suitable manner.

The image output device of the present disclosure is an image output device including an image selection unit that selects one or more images to be output based on detection results obtained by different types of detection processing for each of a plurality of images.

The image output method of the present disclosure is an image output method in which an image output device selects one or more images to be output based on detection results of different types of detection processes for each of a plurality of images.

The program of the present disclosure is a program for causing a computer to execute processing for selecting one or more images to be output based on detection results obtained by different types of detection processing for each of a plurality of images.

In the present disclosure, one or more images to be output are selected based on detection results obtained by different types of detection processing for each of the plurality of images.

FIG. 1 is a block diagram showing a configuration example of an image output device to which the technology according to the present disclosure is applied.
FIG. 2 is a diagram showing a configuration example of an image output system according to a first embodiment of the present disclosure.
FIG. 3 is a block diagram showing a functional configuration example of a switcher.
FIG. 4 is a diagram showing an example of an output image.
FIG. 5 is a flowchart for explaining the flow of image output processing.
FIG. 6 is a flowchart for explaining the flow of image output processing.
FIG. 7 is a flowchart for explaining the flow of image output processing.
FIG. 8 is a diagram showing an example of a synthesized image.
FIG. 9 is a diagram showing a configuration example of an image output system according to a second embodiment of the present disclosure.
FIG. 10 is a block diagram showing a functional configuration example of a switcher.
FIG. 11 is a diagram showing an example of an output image.
FIG. 12 is a diagram showing a configuration example of an image output system according to a third embodiment of the present disclosure.
FIG. 13 is a block diagram showing a functional configuration example of a switcher.
FIG. 14 is a diagram showing an example of an output image.
FIG. 15 is a flowchart for explaining the flow of image output processing.
FIG. 16 is a diagram showing a configuration example of an image output system according to a fourth embodiment of the present disclosure.
FIG. 17 is a block diagram showing a functional configuration example of a switcher (PC).
FIG. 18 is a flowchart for explaining the flow of image output processing.
FIG. 19 is a flowchart for explaining the flow of student image selection.
FIG. 20 is a diagram showing an example of a synthesized image.
FIG. 21 is a block diagram showing a configuration example of the hardware of a computer.

Hereinafter, modes for carrying out the present disclosure (hereinafter referred to as embodiments) will be described. The description will be given in the following order.

1. Overview of technology according to the present disclosure
2. First embodiment (example of application to product reviews)
3. Second embodiment (example of application to piano lessons)
4. Third embodiment (example of application to cooking lessons)
5. Fourth embodiment (example of application to online lectures)
6. Modifications
7. Computer configuration example

<1. Overview of technology according to the present disclosure>
The technology according to the present disclosure is applied to a system for live distribution of lectures, lessons, product reviews, and the like using moving images (hereinafter simply referred to as images), and for editing these images. It realizes appropriate switching and output of a plurality of images according to the progress of the lecture or the like (the behavior of the instructor).

FIG. 1 is a block diagram showing a configuration example of an image output device to which technology according to the present disclosure is applied.

The image output device 10 of FIG. 1 accepts as input a plurality of images captured by a plurality of cameras, performs a different type of detection processing on each image, and selects and outputs an image to be output based on the detection results.

At least one of the images input to the image output device 10 is a person image (first image) with a person as the subject, and the other images are object images (second images) with an arbitrary object as the subject. The person here is, for example, an instructor who gives lectures, lessons, product reviews, and the like, and the object is, for example, the hands of the instructor or various objects handled by the instructor.

The object images also include the display screen of the computer operated by the lecturer, and the audience image centering on the audience listening to the lecturer's speech over the network.

That is, the images input to the image output device 10 can be said to be different types of images.

The image output device 10 includes state detection units 11, 12, and 13, an image selection unit 14, and an image output/synthesis unit 15.

The state detection units 11, 12, and 13 perform different types of detection processing on each of the plurality of images input to the image output device 10.

The state detection unit 11 detects the state of the person in the person image input to the image output device 10 and supplies the detection result to the image selection unit 14 . The person's state detected by the state detection unit 11 includes at least one of the person's posture, position, and presence/absence of speech.

The state detection units 12 and 13 each detect the state of the object in an object image input to the image output device 10 and supply the detection result to the image selection unit 14.

The object image targeted for detection processing by the state detection unit 12 and the object image targeted for detection processing by the state detection unit 13 may be images with different objects as subjects, or one of them may be the display screen (PC image) of a personal computer (PC) operated by the person.

Therefore, the state of the object detected by the state detection units 12 and 13 is at least one of the presence/absence, number, position, orientation, and shape of the object, or a change in the display screen of the PC operated by the instructor. The state of the object detected by the state detection units 12 and 13 may also be the result of voice detection, or at least one of the posture and the presence or absence of speech of a listener in a listener image input via the network.

The image selection unit 14 selects one or more images to be output from the person image and the object images input to the image output device 10, based on the detection results from the state detection units 11, 12, and 13. In particular, the image selection unit 14 selects an object image as an output target when the state of the object in the object image satisfies a predetermined condition and the state of the person in the person image indicates a predetermined relationship with the object. For example, when an object is detected in the object image and the person in the person image faces the object or is positioned near the object, the object image is selected as an output target. The selected image is supplied to the image output/synthesis unit 15.
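
The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the types `PersonState` and `ObjectState`, their fields, and the function `select_images` are hypothetical names, the "predetermined relationship" is reduced to a label naming the object image the person faces or is near, and falling back to the person image when no object image qualifies is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PersonState:
    """Hypothetical result of the person-state detection (state detection unit 11)."""
    facing: Optional[str]  # name of the object image the person faces or is near, if any


@dataclass
class ObjectState:
    """Hypothetical result of an object-state detection (state detection units 12, 13)."""
    condition_met: bool  # e.g. an object is present, or the display screen changed


def select_images(person: PersonState, objects: Dict[str, ObjectState],
                  person_image: str = "person") -> List[str]:
    """Select an object image only when both detections agree.

    Falling back to the person image is an assumption made for this sketch.
    """
    selected = [name for name, state in objects.items()
                if state.condition_met and person.facing == name]
    return selected or [person_image]
```

For instance, `select_images(PersonState("hand"), {"hand": ObjectState(True)})` yields the hand image, while the same object state with a person who is not facing the object yields the person image.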

The image selection unit 14 has a metadata generation unit 14m. The metadata generation unit 14m generates metadata of the selected image selected as an output target by the image selection unit 14 and supplies the metadata to the image output/synthesis unit 15. This metadata includes what was detected from the image, the state of the detected object, the detection conditions, the timing of switching the selected image, and the like.

The image output/synthesis unit 15 functions as an image output unit that outputs an image selected as an output target by the image selection unit 14 as an output image. When two or more images are selected as output targets in the image selection unit 14, the image output/synthesis unit 15 outputs a synthesized image obtained by synthesizing the two or more images.

The image output/synthesis unit 15 outputs an output image based on the metadata from the metadata generation unit 14m. For example, the image output/synthesis unit 15 outputs an output image at a timing based on the metadata, or, when two or more images are selected, synthesizes the selected images in a layout based on the metadata. The image output/synthesis unit 15 outputs the output image together with its metadata. The metadata of the output image includes the layout information of the synthesized image and the like in addition to the metadata from the metadata generation unit 14m.

An output image output from the image output device 10 is live-delivered via a network such as the Internet, or supplied to an editing device or the like for editing moving images. Further, the metadata of the output image may be recorded on a recording medium detachably attached to the image output device 10 or recorded in a recording unit (not shown) provided in the image output device 10 . The metadata recorded in this way can be used for editing the output image.

With the above configuration, it is possible to appropriately switch and output person images and object images according to the progress of lectures and the like.

In the example of FIG. 1, the images input to the image output device 10 include one person image, but they may include a plurality of person images with different persons as subjects, or all images input to the image output device 10 may be object images.

An embodiment of an image output system to which the technology according to the present disclosure is applied will be described below.

<2. First Embodiment>
(Configuration example of image output system)
FIG. 2 is a diagram illustrating a configuration example of an image output system according to the first embodiment of the present disclosure.

In the image output system 100 of FIG. 2, while the reviewer L1 is reviewing the product, a plurality of images are appropriately switched and output according to the progress of the product review (behavior of the reviewer L1).

The image output system 100 is composed of an instructor camera 101 , a hand camera 102 , a PC 103 and a switcher 110 .

The instructor camera 101 is a camera that shoots the reviewer L1, who acts as the instructor, as the central subject. The reviewer L1 looks at the instructor camera 101 and explains or talks to the audience.

The instructor camera 101 is installed so as to shoot at a camera angle at which the orientation of the body of the reviewer L1 when giving an explanation toward the instructor camera 101 differs from the orientation of the body of the reviewer L1 when working at hand or operating the PC 103. For example, as shown in FIG. 2, the instructor camera 101 is installed so as to photograph the reviewer L1 from the side of the reviewer L1 facing the desk. The instructor camera 101 is connected to the switcher 110, and the instructor image of the reviewer L1 is output to the switcher 110.

The hand camera 102 is a camera whose shooting range is the area around the hands of the reviewer L1. The hand camera 102 captures the work of the reviewer L1 and objects in the reviewer's hands.

The hand camera 102 is installed so that the reviewer L1's hand and objects at hand are within the shooting range. For example, as shown in FIG. 2, the hand camera 102 is installed so as to photograph the desk from directly above the desk. The hand camera 102 is connected to the switcher 110 , and a hand image obtained by photographing the hand of the reviewer L1 is output to the switcher 110 .

PC 103 is a computer used by reviewer L1 for explanation. The PC 103 presents materials for product review and screens for explaining applications and programming to the viewer. PC 103 is connected to switcher 110 , and PC images presented by PC 103 are output to switcher 110 .

The instructor camera 101, hand camera 102, PC 103, and switcher 110 are directly connected via interfaces such as HDMI (High-Definition Multimedia Interface) (registered trademark), SDI (Serial Digital Interface), and USB (Universal Serial Bus). Alternatively, they may be connected to each other by a wired or wireless LAN (Local Area Network) or WAN (Wide Area Network).

The switcher 110 performs various image analysis, detection processing, recognition processing, etc. corresponding to the images from the instructor camera 101, the hand camera 102, and the PC 103, respectively. Based on these processing results, the switcher 110 selects and outputs an image suitable for the progress of the product review (the behavior of the reviewer L1) as an output target. The switcher 110 can also output the metadata of the output image along with the output image selected for output.

The switcher 110 is connected to a network NW such as the Internet, and the output image and its metadata are supplied to the distribution server of the video distribution site via the network NW. Also, the switcher 110 may be directly connected to a distribution server, recording device, or the like via an interface such as HDMI, SDI, or USB.

(Example of switcher functional configuration)
FIG. 3 is a block diagram showing a functional configuration example of the switcher 110.

The switcher 110 receives the instructor image from the instructor camera 101, the hand image from the hand camera 102, and the PC image from the PC 103 as inputs, and outputs the output image and its metadata.

The switcher 110 includes a body orientation detection unit 111 , a hand detection unit 112 , a screen change detection unit 113 , an image selection unit 114 , and an image output/synthesis unit 115 .

The body orientation detection unit 111 performs body orientation detection processing to detect the body orientation of the reviewer L1 in the lecturer image input to the switcher 110 and supplies the detection result to the image selection unit 114 .

The body orientation detection processing is, for example, processing that obtains a person's skeleton by a general skeleton estimation technique using deep learning or the like, and detects the orientation of the body by specifying which direction the skeleton is facing with respect to the instructor camera 101. The detection result of the body orientation detection processing is expressed, for example, as a position (x, y, z) and an angle (pitch, yaw, roll) with respect to an arbitrarily defined coordinate system and reference point.

In addition, as the detection result of the body orientation detection processing, the position (x, y, z) and angle (pitch, yaw, roll) may be converted into a meaningful orientation such as "facing the instructor camera", "facing the direction of working at hand", or "facing the PC". In this case, the ranges of position (x, y, z) and angle (pitch, yaw, roll) that correspond to "facing the instructor camera", "facing the direction of working at hand", and "facing the PC" are set in advance. The ranges may be set, for example, by the user inputting values or by automatically recognizing the desk or PC appearing in the instructor image.
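
The conversion from a detected angle to a meaningful orientation can be sketched as a lookup against preset ranges. The sketch below is an assumption-laden illustration: it uses only the yaw angle, and the numeric ranges, the table `ORIENTATION_RANGES`, the function `classify_orientation`, and the fallback label "other" are all hypothetical, not values or names taken from the disclosure.

```python
from typing import Dict, Tuple

# Hypothetical yaw ranges in degrees, set in advance (values are illustrative only).
ORIENTATION_RANGES: Dict[str, Tuple[float, float]] = {
    "facing the instructor camera": (-20.0, 20.0),
    "facing the direction of working at hand": (60.0, 120.0),
    "facing the PC": (130.0, 170.0),
}


def classify_orientation(yaw_deg: float,
                         ranges: Dict[str, Tuple[float, float]] = ORIENTATION_RANGES) -> str:
    """Map a detected yaw angle to a meaningful orientation label."""
    for label, (low, high) in ranges.items():
        if low <= yaw_deg <= high:
            return label
    return "other"  # none of the preset ranges matched
```

A fuller version would test position and all three angles against their ranges in the same way; the single-angle form is used here only to keep the idea visible.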

The hand detection unit 112 performs hand detection processing to detect the state of the hand of the reviewer L1 in the hand image input to the switcher 110, and supplies the detection result to the image selection unit 114.

The hand detection processing detects, in addition to the presence or absence of hands in the hand image, at least one of the number, position (coordinates), orientation, and shape (shape of the fingers) of the hands. As the result of the hand detection processing, these hand states may be output as they are, or information given a meaning derived from these hand states (such as "hand closed" or "hand open") may be output.

The screen change detection unit 113 performs screen change detection processing for detecting changes in the display screen of the PC image input to the switcher 110 and supplies the detection result to the image selection unit 114 .

Screen change detection processing is, for example, processing for detecting cursor movement in a PC image. In this case, the detection result of the screen change detection process may be information indicating not only the presence or absence of movement of the cursor, but also the position of the cursor on the PC image, the movement speed and acceleration of the cursor, and the like.

In addition, the screen change detection processing may include processing for detecting page transitions of slides of presentation materials and processing for detecting playback of moving images. In this case, the detection result of the screen change detection processing may include not only the presence or absence of a screen change, but also the type of screen change, indicating which of cursor movement, slide page transition, and video playback was detected.

The image selection unit 114 selects one or more images to be output from among the instructor image, the hand image, and the PC image input to the switcher 110, based on the detection results from the body orientation detection unit 111, the hand detection unit 112, and the screen change detection unit 113. The selected image is supplied to the image output/synthesis unit 115.

The image selection unit 114 has a metadata generation unit 114m. The metadata generation unit 114m generates metadata of the selected image selected as an output target by the image selection unit 114 and supplies the metadata to the image output/synthesis unit 115. As the detection result of the detection processing for the image selected as the output target, this metadata includes what was detected from the image, the state of the detected object, the detection conditions, and the timing of switching the selected image.
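
As an illustration of what such metadata might look like as a data record, the following sketch defines one possible structure. The class `SelectionMetadata`, its field names, and the example values are hypothetical; the disclosure only lists the kinds of information the metadata contains.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict


@dataclass
class SelectionMetadata:
    """Hypothetical record for one selected image; field names are illustrative."""
    selected_image: str        # which input was selected, e.g. "hand"
    detected: str              # what was detected from the image
    state: Dict[str, Any]      # state of the detected object
    condition: str             # detection condition that was satisfied
    switch_time_s: float       # timing at which the selected image was switched


meta = SelectionMetadata(
    selected_image="hand",
    detected="hand",
    state={"count": 2, "shape": "open"},
    condition="hand detected and body facing the direction of working at hand",
    switch_time_s=12.5,
)
record = asdict(meta)  # plain dict, e.g. for recording alongside the output image
```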

The image output/synthesis unit 115 outputs one or more images selected as output targets by the image selection unit 114 as output images based on the metadata from the metadata generation unit 114m. The output image may be a through image of one selected image, or may be a synthesized image obtained by synthesizing two or more selected images side-by-side or picture-in-picture, for example. Also, a telop, other content, an effect, or the like may be superimposed on the through image or the synthesized image of the selected image.
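
The two layouts named above, side-by-side and picture-in-picture, can be sketched on a toy image representation. This is a minimal sketch under assumptions: an image is modeled as a list of pixel rows, and the type alias `Image` and the functions `side_by_side` and `picture_in_picture` are hypothetical names; a real switcher would operate on video frames and typically scale the inputs first.

```python
from typing import List

Image = List[List[int]]  # hypothetical image type: rows of pixel values


def side_by_side(left: Image, right: Image) -> Image:
    """Place two images of equal height next to each other."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def picture_in_picture(main: Image, inset: Image, top: int, left: int) -> Image:
    """Overlay a smaller inset image on a copy of the main image at (top, left)."""
    if top < 0 or left < 0 or top + len(inset) > len(main) or \
            any(left + len(row) > len(main[0]) for row in inset):
        raise ValueError("inset must fit inside the main image")
    out = [row[:] for row in main]  # copy so the main image is left untouched
    for dy, row in enumerate(inset):
        out[top + dy][left:left + len(row)] = row
    return out
```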

With the above configuration, the switcher 110 can select at least one of the instructor image 151, the hand image 152, and the PC image 153 and output it as an output image, as shown in FIG. 4.

For example, either the instructor image 151 or the hand image 152 may be output as the output image as indicated by arrow #1, or either the instructor image 151 or the PC image 153 may be output as the output image as indicated by arrow #2. Also, since the instructor image 151 is mainly output in many product reviews, any one of the instructor image 151, the hand image 152, and the PC image 153 may be output as the output image.

Here, the flow of image output processing for selecting and outputting at least one of the instructor image 151, the hand image 152, and the PC image 153 will be described.

(Flow 1 of image output processing)
FIG. 5 is a flowchart for explaining the flow of image output processing for selecting and outputting either the instructor image 151 or the hand image 152 indicated by arrow #1 in FIG. 4.

In step S11 , the hand detection unit 112 of the switcher 110 performs hand detection processing on the hand image 152 .

In step S12, the hand detection unit 112 determines whether or not a hand has been detected in the hand image 152 based on the detection result of the hand detection process. If it is determined that the hand has been detected, the process proceeds to step S13.

In step S13 , the body orientation detection unit 111 performs body orientation detection processing on the instructor image 151 .

In step S14, the body orientation detection unit 111 determines whether or not the body of the reviewer L1 is oriented in the direction of working at hand, based on the detection result of the body orientation detection processing. If it is determined that it faces the direction of working at hand, the process proceeds to step S15.

In step S15, the image selection unit 114 selects the hand image 152 as an output target.

On the other hand, if it is determined in step S12 that the hand has not been detected, or if it is determined in step S14 that the body is not facing the direction of working at hand, the process proceeds to step S16.

In step S16, the image selection unit 114 selects the lecturer image 151 as an output target.

After step S15 or step S16, in step S17, the image output/synthesis unit 115 outputs the image selected by the image selection unit 114 as an output target.

If the hand image 152 were selected as an output target simply because a hand was detected in it, the hand image 152 would be output even when a hand merely happened to be resting on the desk. In the processing described above, by contrast, not only is the hand detected in the hand image 152, but the orientation of the body is also detected in the instructor image 151.

As a result, even if a hand is detected in the hand image 152, the hand image 152 is not output if the body is not facing the direction of working at hand, because the reviewer L1 is then likely to be talking to the instructor camera 101 or operating the PC 103. Likewise, even if the body is facing the direction of working at hand, the hand image 152 is not output if no hand is detected in the hand image 152, because the reviewer L1 is then likely not to be working at hand.

According to the above processing, it is possible to avoid erroneously switching to the hand image 152 merely because a hand appears in the hand image 152 or merely because the body is turned toward the direction of working at hand. That is, a different type of detection processing is performed on each image, and an image to be output is appropriately selected based on the detection results, so that different types of images can be switched in a more suitable manner.
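
The decision made in steps S12 to S16 can be written as a small pure function. The sketch below assumes the two detection results are already available as booleans; the function name `select_flow1` and the string labels for the images are hypothetical.

```python
def select_flow1(hand_detected: bool, facing_hand_work: bool) -> str:
    """Steps S12 to S16 of FIG. 5 as a pure decision function."""
    # S12: was a hand detected in the hand image?
    if not hand_detected:
        return "instructor"  # S16
    # S14: is the body facing the direction of working at hand?
    if not facing_hand_work:
        return "instructor"  # S16
    return "hand"  # S15
```

Only the combination of both detections selects the hand image; either detection alone leaves the instructor image selected.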

(Flow 2 of image output processing)
FIG. 6 is a flowchart for explaining the flow of image output processing for selecting and outputting either the instructor image 151 or the PC image 153 indicated by arrow #2 in FIG. 4.

In step S21 , the screen change detection unit 113 of the switcher 110 performs cursor movement detection processing for detecting cursor movement on the PC image 153 .

In step S22, the screen change detection unit 113 determines whether or not a cursor movement has been detected in the PC image 153 based on the detection result of the cursor movement detection process. If it is determined that the movement of the cursor has been detected, the process proceeds to step S23.

In step S23 , the body orientation detection unit 111 performs body orientation detection processing on the instructor image 151 .

In step S24, the body orientation detection unit 111 determines whether or not the body of the reviewer L1 is facing the PC 103 based on the detection result of the body orientation detection processing. If it is determined that the body is facing the PC 103, the process proceeds to step S25.

In step S25, the image selection unit 114 selects the PC image 153 as an output target.

On the other hand, if it is determined in step S22 that no cursor movement has been detected, or if it is determined in step S24 that the body is not facing the PC 103, the process proceeds to step S26.

In step S26, the image selection unit 114 selects the lecturer image 151 as an output target.

After step S25 or step S26, in step S27, the image output/synthesis unit 115 outputs the image selected by the image selection unit 114 as an output target.

If the PC image 153 were selected as an output target simply because cursor movement (a change in the display screen) was detected in it, the PC image 153 would be output even when a hand merely bumped the mouse or the like by accident. In the processing described above, by contrast, not only is a change in the display screen detected in the PC image 153, but the orientation of the body is also detected in the instructor image 151.

As a result, even if a change in the display screen is detected in the PC image 153, the PC image 153 is not output if the body is not facing the PC 103, because the reviewer L1 is then likely to be speaking toward the instructor camera 101 or working at hand. Likewise, even if the body is facing the PC 103, the PC image 153 is not output if no change in the display screen is detected in the PC image 153, because the reviewer L1 is then likely not to be operating the PC 103.

According to the above processing, it is possible to avoid erroneously switching to the PC image 153 merely because the cursor moves in the PC image 153 or merely because the body is turned toward the PC 103. That is, a different type of detection processing is performed on each image, and an image to be output is appropriately selected based on the detection results, so that different types of images can be switched in a more suitable manner.
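
The flow of FIG. 6 can be sketched in the same way, this time including a simple stand-in for the cursor movement detection. The disclosure does not state how cursor movement is detected; comparing cursor positions between two frames against a pixel threshold, as well as the names `cursor_moved`, `select_flow2`, and `min_distance_px`, are assumptions made for this sketch.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def cursor_moved(prev: Point, curr: Point, min_distance_px: float = 1.0) -> bool:
    """Hypothetical cursor-movement test: displacement between two frames, in pixels."""
    return math.dist(prev, curr) >= min_distance_px


def select_flow2(prev_cursor: Point, curr_cursor: Point, facing_pc: bool) -> str:
    """Steps S22 to S26 of FIG. 6: PC image only if the cursor moved and the body faces the PC."""
    if cursor_moved(prev_cursor, curr_cursor) and facing_pc:
        return "pc"          # S25
    return "instructor"      # S26
```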

(Flow 3 of image output processing)
FIG. 7 is a flowchart for explaining the flow of image output processing for selecting and outputting any one of the instructor image 151, the hand image 152, and the PC image 153 indicated by arrows #1, #2, and #3 in FIG. 4.

As described above, there are many cases in which the instructor image 151 is mainly output in product reviews, so the instructor image 151 is output in step S31.

In step S32 , the body orientation detection unit 111 performs body orientation detection processing on the instructor image 151 to determine whether or not the body of the reviewer L1 faces the instructor camera 101 . If it is determined that the body is facing the instructor camera 101, it is considered that the reviewer L1 is speaking toward the instructor camera 101, so the process returns to step S31 and the instructor image 151 is continuously output.

On the other hand, if it is determined that the body is not facing the instructor camera 101, the process proceeds to step S33.

In step S33, the body orientation detection unit 111 performs body orientation detection processing on the instructor image 151 to determine whether the body of the reviewer L1 is facing the direction of working at hand or the direction of the PC 103.

If it is determined that the body is facing the direction of working at hand, the process proceeds to step S34.

In step S34 , the hand detection unit 112 performs hand detection processing on the hand image 152 to determine whether or not a hand is detected in the hand image 152 . If it is determined that the hand is not detected, it is considered that the reviewer L1 is not working at hand, so the process returns to step S31 and the instructor image 151 is output.

On the other hand, if it is determined that a hand has been detected, the process proceeds to step S35, the image selection unit 114 selects the hand image as the output image, and the image output/synthesis unit 115 outputs the hand image. After that, the process returns to step S32 and the subsequent processes are repeated.

Now, if it is determined in step S33 that the body is facing the direction of the PC 103, the process proceeds to step S36.

In step S36, the screen change detection unit 113 performs cursor movement detection processing on the PC image 153 to determine whether cursor movement is detected in the PC image 153. If it is determined that cursor movement has not been detected, it is considered that the reviewer L1 is not operating the PC 103, so the process returns to step S31 and the instructor image 151 is output.

On the other hand, if it is determined that cursor movement has been detected, the process proceeds to step S37, the image selection unit 114 selects the PC image as the output image, and the image output/synthesis unit 115 outputs the PC image. After that, the process returns to step S32 and the subsequent processes are repeated.

Further, if it is determined in step S33 that the body is facing neither the direction of working at hand nor the direction of the PC 103, for example when the reviewer L1 is facing backwards, the process returns to step S31 and the instructor image 151 is output.

According to the above processing, different types of detection processing are performed on each of the instructor image 151, the hand image 152, and the PC image 153, and an image to be output is appropriately selected based on the detection results. As a result, it is possible to switch between different types of images in a more suitable manner.
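
One pass through the flow of FIG. 7 can be sketched as a function of the three detection results. This is a simplified illustration: in the flowchart the hand detection and cursor movement detection are run only after the body orientation has been determined, whereas the sketch takes all three results as precomputed inputs. The enum `Orientation`, the function `select_flow3`, and the sample `frames` are hypothetical.

```python
from enum import Enum


class Orientation(Enum):
    CAMERA = "instructor camera"
    HAND_WORK = "working at hand"
    PC = "PC"
    OTHER = "other"


def select_flow3(orientation: Orientation, hand_detected: bool, cursor_moved: bool) -> str:
    """One pass through steps S32 to S37 of FIG. 7; the instructor image is the default (S31)."""
    if orientation is Orientation.HAND_WORK and hand_detected:   # S33, S34
        return "hand"                                            # S35
    if orientation is Orientation.PC and cursor_moved:           # S33, S36
        return "pc"                                              # S37
    return "instructor"                                          # S31


# Hypothetical per-frame detection results: (orientation, hand detected, cursor moved).
frames = [
    (Orientation.CAMERA, True, False),
    (Orientation.HAND_WORK, True, False),
    (Orientation.PC, True, True),
    (Orientation.PC, False, False),
    (Orientation.OTHER, True, True),
]
outputs = [select_flow3(*frame) for frame in frames]
```

In the third sample frame a hand is visible in the hand image, but because the body faces the PC and the cursor moved, the PC image is selected rather than the hand image.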

In the processing of FIGS. 5 and 6, the state of the person in the person image is detected after the state of the object in the object image is detected, whereas in the processing of FIG. 7, the state of the object is detected after the state of the person is detected. In the image output processing of the image output system of this embodiment, either the detection of the state of the person or the detection of the state of the object may be executed first, or they may be executed in parallel.

In the above-described processing, the instructor image 151 is assumed to be mainly output, and the image to be output is switched to the hand image 152 or the PC image 153 according to the detection result. Alternatively, the hand image 152 may be mainly output, and the image to be output may be switched to the lecturer image 151 or the PC image 153 according to the detection result. Further, assuming that the PC image 153 is mainly output, the image to be output may be switched to the lecturer image 151 or the hand image 152 according to the detection result.

In the configuration of the switcher 110 in FIG. 3, an object detection unit may be provided so that it can be detected whether or not a product desired to be shown to the viewer appears in the image at hand. In this case, an object detection unit may be provided instead of the hand detection unit 112, or both the hand detection unit 112 and the object detection unit may be provided. In the latter case, in step S12 of FIG. 5 or step S34 of FIG. 7, it is determined whether or not a hand has been detected in the image at hand and a product to be shown to the viewer has been detected.

In the above, an example was explained in which reviewer L1 changes only the orientation of the body while staying at the same position. In addition to this, if the position of reviewer L1 changes between speaking to lecturer camera 101, working at hand, and operating PC 103, the switcher 110 may be provided with a body position detection unit so that the position of the body of the reviewer L1 is detected. In this case, a body position detection unit may be provided instead of the body orientation detection unit 111, or both the body orientation detection unit 111 and the body position detection unit may be provided.

In addition, the image selection unit 114 can select an image to be output at various timings and switch the output image.

For example, the image selection unit 114 can select an image to be output based on the frame-by-frame detection result of the detection process for each image. In this case, the image is selected and the output image is switched at the moment when the detection result satisfying the condition to be output is obtained in one frame of the predetermined image.

In addition, the image selection unit 114 may select an image to be output based on a detection result obtained continuously for a certain period of time in the detection process for each image, or may select an image to be output based on a detection result obtained at a specific frequency.
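The three timing policies above (frame-by-frame, continuous for a certain period, and a specific frequency) can all be expressed with one sliding window over per-frame detection results. The following is a rough sketch under that assumption; the class name `DetectionFilter` and its parameters `window` and `min_hits` are hypothetical and do not appear in the disclosure.

```python
from collections import deque


class DetectionFilter:
    """Turn per-frame detection results into a switching condition.

    window=1, min_hits=1      -> switch on a single frame
    window=N, min_hits=N      -> detection must hold for N consecutive frames
    window=N, min_hits=K (<N) -> detection must occur in K of the last N frames
    """

    def __init__(self, window: int, min_hits: int):
        if not 1 <= min_hits <= window:
            raise ValueError("min_hits must be between 1 and window")
        self.history = deque(maxlen=window)  # keeps only the last `window` results
        self.min_hits = min_hits

    def update(self, detected: bool) -> bool:
        """Record one frame's result and report whether the condition holds."""
        self.history.append(detected)
        return sum(self.history) >= self.min_hits


if __name__ == "__main__":
    continuous = DetectionFilter(window=3, min_hits=3)
    frames = [True, True, False, True, True, True]
    print([continuous.update(f) for f in frames])
```

A larger window trades switching latency for stability, which matters most when the images are switched in real time.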

Selection of images to be output (switching of output images) may be performed on multiple images that are being input in real time, or may be performed on multiple images that have been recorded in advance.

In addition, in order to enable image editing in a stage subsequent to the image selection unit 114, the selection timing of an image to be output (output image switching timing), the selection conditions (switching conditions), and the like may be output as metadata.
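As an illustration of what such metadata might look like, the sketch below records one entry each time the selected image changes. The record layout (`frame`, `selected`, `condition`) and the function name `build_switch_log` are assumptions made for this example; the disclosure does not specify a metadata format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Tuple


@dataclass
class SwitchEvent:
    frame: int       # frame index at which the output image changed
    selected: str    # label of the newly selected image
    condition: str   # condition that caused the selection


def build_switch_log(selections: Iterable[Tuple[str, str]]) -> List[SwitchEvent]:
    """Emit one event per change, given (selected, condition) for every frame."""
    events = []
    previous = None
    for frame, (selected, condition) in enumerate(selections):
        if selected != previous:
            events.append(SwitchEvent(frame, selected, condition))
            previous = selected
    return events


if __name__ == "__main__":
    per_frame = [
        ("instructor", "default"),
        ("instructor", "default"),
        ("hand", "hand detected and body facing hand work"),
        ("hand", "hand detected and body facing hand work"),
        ("pc", "cursor moved and body facing PC"),
    ]
    print(json.dumps([asdict(e) for e in build_switch_log(per_frame)], indent=2))
```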

As described above, the image selection unit 114 can select two or more images as output targets. Accordingly, the image output/synthesis unit 115 can output a synthesized image obtained by side-by-side synthesis or picture-in-picture synthesis.

(Operation of image output/compositing unit)
As described above, the image output/synthesis unit 115 can output the selected image from the image selection unit 114 as a through image, and can superimpose a telop, other contents, effects, and the like on the through image.

For example, the image output/synthesis unit 115 can output, as an output image, a synthesized image P101 in which a telop 171 is superimposed on the hand image 152, as shown in FIG. 8A.

Also, the image output/synthesis unit 115 can synthesize the plurality of selected images based on the plurality of selected images from the image selection unit 114 and the metadata.

For example, based on metadata indicating that the selected image is the lecturer image 151, the image output/synthesis unit 115 can output, as an output image, a composite image P102 in which the PC image 153 is synthesized with the lecturer image 151 as picture-in-picture, as shown in FIG. 8B. In addition, based on metadata indicating that the selected image is the PC image 153, the image output/synthesis unit 115 can output, as an output image, a composite image P103 in which the instructor image 151 and the hand image 152 are synthesized side by side with the PC image 153, as shown in FIG. 8C.

Note that the combination and layout of images in each of the combined images P101, P102, and P103 shown in FIG. 8 are not limited to this. Also, the layout of the synthesized images P101, P102, and P103 and the output timing (combination switching timing) may be determined based on the metadata, or may be determined based on the user's instruction.
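The two layouts mentioned above reduce to simple rectangle arithmetic. The sketch below computes placement rectangles only, without touching pixel data; the function names, the bottom-right inset position, the 25% inset scale, and the 16-pixel margin are all hypothetical choices for illustration.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in output-frame pixels


def picture_in_picture_rect(frame_w: int, frame_h: int,
                            scale: float = 0.25, margin: int = 16) -> Rect:
    """Rectangle for an inset image placed in the bottom-right corner."""
    inset_w = int(frame_w * scale)
    inset_h = int(frame_h * scale)
    return (frame_w - inset_w - margin, frame_h - inset_h - margin, inset_w, inset_h)


def side_by_side_rects(frame_w: int, frame_h: int, count: int) -> List[Rect]:
    """Equal-width columns for `count` images placed side by side."""
    if count < 1:
        raise ValueError("count must be at least 1")
    column_w = frame_w // count
    return [(i * column_w, 0, column_w, frame_h) for i in range(count)]


if __name__ == "__main__":
    print(picture_in_picture_rect(1920, 1080))
    print(side_by_side_rects(1920, 1080, 3))
```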

<3. Second Embodiment>
(Configuration example of image output system)
FIG. 9 is a diagram showing a configuration example of an image output system according to the second embodiment of the present disclosure.

In the image output system 200 of FIG. 9, while the instructor L2 is conducting an online piano lesson, a plurality of images are appropriately switched and output according to the progress of the lesson (behavior of the instructor L2).

The image output system 200 is composed of an instructor camera 201, a hand camera 202, a foot camera 203, and a switcher 210.

The lecturer camera 201 is a camera that shoots the lecturer L2 as the main subject. The lecturer L2 looks at the lecturer camera 201 and explains or talks to the students.

The instructor camera 201 is installed at a camera angle such that the orientation of the body of the instructor L2 when giving an explanation toward the instructor camera 201 differs from the orientation of the body of the instructor L2 when playing the piano. The lecturer camera 201 is connected to the switcher 210, and the lecturer image of the lecturer L2 is output to the switcher 210.

The hand camera 202 is a camera whose shooting range covers the hands of the lecturer L2. The hand camera 202 is installed so that the hands of the lecturer L2 and the keyboard of the piano are within the shooting range. The hand camera 202 is connected to the switcher 210, and a hand image showing the hands of the lecturer L2 is output to the switcher 210.

The foot camera 203 is a camera whose shooting range covers the feet of the lecturer L2. The foot camera 203 is installed so that the feet of the lecturer L2 and the pedals of the piano are within the shooting range. The foot camera 203 is connected to the switcher 210, and a foot image showing the feet of the lecturer L2 is output to the switcher 210.

The switcher 210 performs various image analysis, detection processing, recognition processing, etc. corresponding to the images from the instructor camera 201, the hand camera 202, and the foot camera 203, respectively. Based on these processing results, the switcher 210 selects and outputs an image suitable for the progress of the lesson (behavior of the lecturer L2) as an output target.

(Example of switcher functional configuration)
FIG. 10 is a block diagram showing a functional configuration example of the switcher 210.

The switcher 210 receives as inputs the instructor image from the instructor camera 201, the hand image from the hand camera 202, and the foot image from the foot camera 203, and outputs an output image and its metadata.

The switcher 210 includes a body orientation detection unit 211, a hand detection unit 212, an image selection unit 213, and an image output/synthesis unit 214.

Note that the body orientation detection unit 211, the hand detection unit 212, the image selection unit 213, and the image output/synthesis unit 214 basically have the same functions as the body orientation detection unit 111, the hand detection unit 112, the image selection unit 114, and the image output/synthesis unit 115, respectively, described with reference to FIG. A metadata generation unit 213m included in the image selection unit 213 basically has the same function as the metadata generation unit 114m described with reference to FIG.

However, the switcher 210 does not perform detection processing on the foot image.

With the above configuration, the switcher 210 can select the instructor image 251 shown on the left side of FIG. and output it as the output image.

Here, since instructor L2 may use the pedals while playing the piano, the foot image may be selected while the piano is being played. Also, the part of the hand image that the viewer wants to pay attention to is the keyboard area. If only the keyboard area is cut out as an output image, the image will have an extremely horizontally long aspect ratio.

Therefore, in order to keep a general aspect ratio, the keyboard area cut out from the hand image 252 may be combined with the instructor image 251 and the foot image 253, as shown on the right side of FIG. In this case, even if the hand image 252 is selected as an output target, the instructor image 251 and the foot image 253 are also supplied from the image selection unit 213 to the image output/synthesis unit 214.
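The aspect-ratio problem can be made concrete with a small layout calculation: the keyboard crop is scaled to the full frame width, and the height that remains is shared by the other two images. The arrangement used here (keyboard strip at the bottom, instructor and foot images side by side above it) and the function name `compose_layout` are assumptions for illustration; the actual arrangement in the figure may differ.

```python
from typing import Dict, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def compose_layout(frame_w: int, frame_h: int,
                   keyboard_w: int, keyboard_h: int) -> Dict[str, Rect]:
    """Fit a very wide keyboard crop plus two other images into one frame."""
    # Scale the keyboard crop to the full frame width, keeping its aspect ratio.
    strip_h = round(frame_w * keyboard_h / keyboard_w)
    if strip_h >= frame_h:
        raise ValueError("keyboard crop is not wide enough for this layout")
    top_h = frame_h - strip_h  # height left over for the other two images
    half_w = frame_w // 2
    return {
        "instructor": (0, 0, half_w, top_h),
        "foot": (half_w, 0, frame_w - half_w, top_h),
        "keyboard": (0, top_h, frame_w, strip_h),
    }


if __name__ == "__main__":
    # A 1600x200 keyboard crop (8:1) placed into a 1920x1080 (16:9) frame.
    print(compose_layout(1920, 1080, 1600, 200))
```

With these numbers the keyboard strip occupies 240 of the 1080 rows, which shows why outputting the crop alone would leave most of a 16:9 frame unused.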

(Flow of image output processing)
The flow of image output processing in which the switcher 210 selects and outputs either the instructor image or the hand image is basically the same as the flow of the image output processing described with reference to the flowchart of FIG.

If the hand image 252 were selected as an output target simply because a hand is detected in the hand image 252, the hand image 252 would be output even when a hand is merely resting on the keyboard. In contrast, in the configuration described above, not only is the hand detected in the hand image 252, but the orientation of the body is also detected in the lecturer image 251.

As a result, even if a hand is detected in the hand image 252, if the body is not facing the direction of the piano, there is a high possibility that the instructor L2 is speaking toward the instructor camera 201, so the hand image 252 is not output. Also, even if the body is facing the direction of the piano, if no hand is detected in the hand image 252, it is highly likely that the instructor L2 is not playing, so the hand image 252 is likewise not output.

Therefore, it is possible to avoid erroneously switching to the hand image 252 merely because a hand appears in the hand image 252 or merely because the body faces the direction of the piano. That is, different types of detection processing are performed for each image, and an image to be output is appropriately selected based on the detection results, so that it becomes possible to switch between different types of images more suitably.

When an utterance of lecturer L2 is detected, lecturer L2 is likely to be speaking to the students and unlikely to be performing at the same time. Therefore, in the configuration of the switcher 210 of FIG. 10, a speaker detection unit may be provided to detect the speech of the lecturer L2. A conventional general technique may be applied to the speaker detection. For example, face parts may be detected in the lecturer image 251 to detect that the mouth is open, or the voice of lecturer L2 may be detected based on the audio input together with the lecturer image 251. In this case, a speaker detection unit may be provided instead of the body orientation detection unit 211, or both the body orientation detection unit 211 and the speaker detection unit may be provided.
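One way to picture how the optional speaker detection could sit alongside, or replace, the body orientation detection is shown below. This is only a sketch: the function name `playing_likely`, the use of `None` to mean "this detection unit is not provided", and the rule that any available person-side signal can veto the switch are assumptions, not something the disclosure specifies.

```python
from typing import Optional


def playing_likely(hand_detected: bool,
                   facing_piano: Optional[bool] = None,
                   speaking: Optional[bool] = None) -> bool:
    """Decide whether to switch to the hand image 252.

    `facing_piano` and `speaking` may be None when the corresponding
    detection unit is not provided in the switcher.
    """
    if not hand_detected:
        return False                 # no hand on the keyboard: not playing
    if facing_piano is not None and not facing_piano:
        return False                 # body turned away from the piano
    if speaking is not None and speaking:
        return False                 # talking to the students, not performing
    return True


if __name__ == "__main__":
    # Speaker detection used instead of body orientation detection.
    print(playing_likely(True, speaking=False))
    # Both units provided; the instructor is speaking while facing the piano.
    print(playing_likely(True, facing_piano=True, speaking=True))
```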

In addition, in the configuration of the switcher 210 in FIG. 10, a sound detection unit may be provided that detects, in the audio from a microphone (not shown), the sound of the piano, or the sound of a metronome or another musical instrument played in time with the piano performance. Furthermore, in the configuration of the switcher 210 of FIG. 10, a pedal use detection unit may be provided that detects, in the foot image, whether or not a pedal is being used. In this case, a sound detection unit or a pedal use detection unit may be provided instead of the hand detection unit 212, or both the hand detection unit 212 and the sound detection unit or pedal use detection unit may be provided.

(Operation of image output/compositing unit)
As described above, when the hand image 252 is selected as an output target, the image output/synthesis unit 214 can cut out the keyboard region from the hand image 252 and output the synthesized image P201 in which that region is combined with the lecturer image 251 and the foot image 253, as shown in FIG. 11.

Note that the combination of images in the synthesized image P201 shown in FIG. 11 is not limited to this. The hand image 252 selected as an output target may also be output as it is, without being synthesized.

<4. Third Embodiment>
(Configuration example of image output system)
FIG. 12 is a diagram showing a configuration example of an image output system according to the third embodiment of the present disclosure.

In the image output system 300 of FIG. 12, a plurality of images are appropriately switched and output according to the progress of the lesson (behavior of the instructor L3) while the instructor L3 is conducting the lesson of the online cooking class.

The image output system 300 is composed of an instructor camera 301, three hand cameras 302A, 302B, and 302C, and a switcher 310.

The lecturer camera 301 is a camera that shoots the lecturer L3 as the main subject. The lecturer L3 looks at the lecturer camera 301 and explains or talks to the students.

The instructor camera 301 is installed so that the instructor L3, who works while changing his standing position for each kitchen sink, cooking table, stove, etc., is photographed from one camera angle. The lecturer camera 301 is connected to the switcher 310 , and the lecturer image of the lecturer L3 is output to the switcher 310 .

The hand cameras 302A, 302B, and 302C are cameras whose shooting ranges cover the hands of instructor L3.

The hand cameras 302A, 302B, and 302C capture images of instructor L3 cooking. Specifically, the hand camera 302A is installed so that the hands of the lecturer L3 working in front of the sink, cooking utensils such as a cutting board, and the like are within the shooting range. The hand camera 302B is installed so that the hands of the instructor L3 working in front of the cooking table, cooking utensils such as bowls, and the like are within the shooting range. The hand camera 302C is installed so that the hands of the instructor L3 working in front of the stove, cooking utensils such as a frying pan, and the like are within the shooting range. The hand cameras 302A, 302B, and 302C are connected to the switcher 310, and the hand images showing the hands of the lecturer L3 are output to the switcher 310.

The switcher 310 performs various image analysis, detection processing, recognition processing, etc. corresponding to the images from the instructor camera 301 and the hand cameras 302A, 302B, and 302C. Based on these processing results, the switcher 310 selects and outputs an image appropriate for the progress of the lesson (behavior of the lecturer L3) as an output target.

(Example of switcher functional configuration)
FIG. 13 is a block diagram showing a functional configuration example of the switcher 310.

The switcher 310 receives the instructor image from the instructor camera 301 and the hand images from the hand cameras 302A, 302B, and 302C as inputs, and outputs output images and their metadata.

The switcher 310 includes a body position detection unit 311, cooking utensil detection units 312, 313, and 314, an image selection unit 315, and an image output/synthesis unit 316.

Note that the image selection unit 315 and the image output/synthesis unit 316 basically have the same functions as the image selection unit 114 and the image output/synthesis unit 115 described with reference to FIG. A metadata generation unit 315m included in the image selection unit 315 basically has the same function as the metadata generation unit 114m described with reference to FIG.

The body position detection unit 311 performs body position detection processing to detect the body position of the lecturer L3 in the lecturer image input to the switcher 310 and supplies the detection result to the image selection unit 315 .

Body position detection processing is, for example, processing that finds the skeleton of a person using a general skeleton estimation technique based on deep learning or the like, and detects the position of the body by specifying the position of the skeleton. The detection result of the body position detection process is expressed, for example, as a position (x, y, z), an angle (pitch, yaw, roll), and the like, by defining an arbitrary coordinate system and reference point.

In this embodiment, it is necessary to associate the position of the instructor L3's body detected by the body position detection process with the positions of a plurality of cooking utensils, which will be described later. Therefore, in the same coordinate system, the range of the body position (x, y, z) and angle (pitch, yaw, roll) of the instructor L3 that corresponds to the position of each cooking utensil is set in advance by calibration.

For example, the range may be set by a user's operation such as enclosing a predetermined area of the instructor image with a frame or clicking a predetermined position, or the values may be set by automatically recognizing the cooking utensils shown in the instructor image. Cooking utensils that are automatically recognized here include the kitchen sink and countertop, whose positions are fixed, and cutting boards, bowls, frying pans, pots, and the like that are used without being moved much, for example on the stove. Also, the position of the body of the instructor L3 and the position of each cooking utensil may be associated with each other by automatically recognizing the cooking utensils appearing in the instructor image and the cooking utensils appearing in the hand images.
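The result of such a calibration can be pictured as a table of position ranges, one per work area, against which each detected body position is tested. The sketch below uses only the position (x, y, z) and ignores the angle for brevity; the zone names, the numeric ranges, and the names `WorkZone` and `find_zone` are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Range = Tuple[float, float]  # (min, max) along one axis of the shared coordinate system


@dataclass(frozen=True)
class WorkZone:
    name: str
    x: Range
    y: Range
    z: Range

    def contains(self, position: Tuple[float, float, float]) -> bool:
        """True if the body position lies inside this zone on every axis."""
        return all(low <= value <= high
                   for value, (low, high) in zip(position, (self.x, self.y, self.z)))


# Hypothetical calibration result: one zone per fixed work area.
CALIBRATED_ZONES = [
    WorkZone("sink", (0.0, 0.9), (0.0, 2.0), (0.0, 1.0)),
    WorkZone("cooking_table", (1.0, 1.9), (0.0, 2.0), (0.0, 1.0)),
    WorkZone("stove", (2.0, 2.9), (0.0, 2.0), (0.0, 1.0)),
]


def find_zone(position: Tuple[float, float, float],
              zones: Sequence[WorkZone] = CALIBRATED_ZONES) -> Optional[str]:
    """Return the name of the zone the instructor is standing in, if any."""
    for zone in zones:
        if zone.contains(position):
            return zone.name
    return None


if __name__ == "__main__":
    print(find_zone((0.5, 1.0, 0.5)))
```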

The cooking utensil detection units 312, 313, and 314 perform cooking utensil detection processing for detecting the presence or absence of cooking utensils in the hand images input to the switcher 310 and supply the detection results to the image selection unit 315.

The targets of the cooking utensil detection process include, in addition to the cooking utensils used in relatively fixed positions at the kitchen sink, countertop, and stove as described above, cooking utensils such as kitchen knives, tongs, and chopsticks that basically move together with the hand of instructor L3. In the cooking utensil detection process, for example, a conventional general object recognition technique using deep learning or the like is used. The detection results for cooking utensils that are used without being moved much are used for coordinate matching and position correspondence between the above-described instructor image and hand images. The detection results for cooking utensils that basically move together with the hand of lecturer L3 are used for selecting (switching to) the hand images in which those cooking utensils are detected.

With the above configuration, as shown in FIG. 14, the switcher 310 can select the instructor image 351 when the instructor L3 is speaking toward the instructor camera 301, can switch to any one of the hand images 352A, 352B, and 352C, and can output the selected image as an output image. In the example of FIG. 14, the hand images 352B and 352C are synthesized with telops and output.

Here, the flow of image output processing for selecting and outputting at least one of the instructor image 351 and the hand images 352A, 352B, and 352C will be described.

(Flow of image output processing)
FIG. 15 is a flowchart for explaining the flow of image output processing for selecting and outputting one of the instructor image 351 and the hand images 352A, 352B, and 352C.

In step S111, the cooking utensil detection units 312, 313, and 314 of the switcher 310 perform cooking utensil detection processing on the hand images 352A, 352B, and 352C, respectively.

In step S112, the cooking utensil detection units 312, 313, and 314 determine whether cooking utensils have been detected in any of the hand images 352A, 352B, and 352C based on the detection results of the cooking utensil detection process. Cooking utensils to be detected here are cooking utensils such as kitchen knives, tongs, and chopsticks that basically move together with the hand of instructor L3. If it is determined that cooking utensils have been detected in any of the hand images 352A, 352B, and 352C, the process proceeds to step S113.

In step S113, the body position detection unit 311 performs body position detection processing on the lecturer image 351.

In step S114, the body position detection unit 311 determines whether the instructor L3 is in front of any cooking utensil based on the detection result of the body position detection process. Specifically, it is determined whether or not the instructor L3 is in front of any of the cutting boards, bowls, frying pans, pots, etc. that are used relatively unmoved. If it is determined that instructor L3 is in front of any cookware, the process proceeds to step S115.

In step S115, the image selection unit 315 selects the hand image corresponding to the position of the body as an output target. For example, when it is determined that instructor L3 is in front of the chopping board above the sink, hand image 352A from hand camera 302A whose shooting range is the sink is selected as an output target.

On the other hand, if it is determined in step S112 that cooking utensils are not detected in any of the hand images, or if it is determined in step S114 that instructor L3 is not in front of any cooking utensils, the process proceeds to step S116.

In step S116, the image selection unit 315 selects the lecturer image 351 as an output target.

After step S115 or step S116, in step S117, the image output/synthesis unit 316 outputs the image selected by the image selection unit 315 as an output target.

If a hand image were selected as an output target simply because a cooking utensil is detected in it, that hand image would be output even when, for example, a kitchen knife is merely lying on a cutting board. In contrast, in the above-described processing, not only is the cooking utensil (for example, a kitchen knife) detected in the hand image, but the body position of the instructor L3 is also detected in the instructor image 351.

As a result, even if the kitchen knife is detected in the hand image 352A, if the instructor L3 is not in front of the cutting board, it is highly likely that the instructor L3 is talking to the instructor camera 301 or cooking in front of the bowl or frying pan, so the hand image 352A is not output. Further, even if the instructor L3 is in front of the cutting board, if the kitchen knife is not detected in the hand image 352A, it is highly likely that the instructor L3 is not using the kitchen knife for cooking, so the hand image 352A is not output.

According to the above processing, it is possible to avoid switching to an unintended hand image merely because a cooking utensil appears in the hand image or merely because the instructor L3 is standing in front of a cooking utensil. That is, different types of detection processing are performed for each image, and an image to be output is appropriately selected based on the detection results, so that it becomes possible to switch between different types of images more suitably.
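Steps S111 to S116 can be sketched as follows. The mapping `ZONE_TO_HAND_IMAGE`, the zone names, and the function name `select_output` are hypothetical. The sketch also adopts one reading of the explanation above, namely that the utensil must be detected in the hand image covering the place where the instructor stands; the flowchart description alone only requires detection in any of the hand images.

```python
from typing import Dict, Optional

# Hypothetical mapping from the instructor's work zone to the hand image covering it.
ZONE_TO_HAND_IMAGE = {"sink": "352A", "cooking_table": "352B", "stove": "352C"}

INSTRUCTOR_IMAGE = "351"


def select_output(utensil_detected: Dict[str, bool],
                  instructor_zone: Optional[str]) -> str:
    """Return the label of the image to output.

    utensil_detected maps each hand image label to whether a hand-held
    utensil (knife, tongs, chopsticks, ...) was detected in it.
    instructor_zone is the work zone derived from the body position, or
    None if the instructor is not in front of any cooking utensil.
    """
    # Steps S111-S112: was a utensil detected in any hand image?
    if not any(utensil_detected.values()):
        return INSTRUCTOR_IMAGE                      # step S116
    # Steps S113-S114: is the instructor in front of any cooking utensil?
    hand_image = ZONE_TO_HAND_IMAGE.get(instructor_zone)
    if hand_image is None:
        return INSTRUCTOR_IMAGE                      # step S116
    # Assumed extra check: the utensil must appear in that same hand image.
    if not utensil_detected.get(hand_image, False):
        return INSTRUCTOR_IMAGE
    return hand_image                                # step S115


if __name__ == "__main__":
    detections = {"352A": True, "352B": False, "352C": False}
    print(select_output(detections, "sink"))   # knife in use at the cutting board
    print(select_output(detections, "stove"))  # knife left behind on the cutting board
```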

In the configuration of the switcher 310 of FIG. 13, hand detection units may be provided to detect the hands of the lecturer L3 in each of the hand images 352A, 352B, and 352C. In this case, hand detection units may be provided instead of the cooking utensil detection units 312, 313, and 314 that detect cooking utensils that basically move together with the hands of the lecturer L3, or both the cooking utensil detection units 312, 313, and 314 and the hand detection units may be provided.

In addition, the image selection unit 315 may use the skeleton obtained from the instructor image by the body position detection unit 311 to estimate a posture, such as a hand reaching for food at a position corresponding to one of the hand images, and may select the hand image corresponding to the position (coordinates) to which the hand is extended.

(Operation of image output/compositing unit)
As described above, the image output/synthesis unit 316 can output the selected image from the image selection unit 315 as a through image, and can superimpose a telop or the like on the through image, as with the hand images 352B and 352C in FIG. 14.

Further, the image output/synthesis unit 316 may output a composite image obtained by, for example, picture-in-picture synthesis or side-by-side synthesis of the instructor image and a hand image, based on the plurality of selected images and the metadata from the image selection unit 315.

Furthermore, when multiple instructors are cooking, a composite image may be output in which images of the hands of the respective instructors are displayed at the same time. The layout of these synthesized images and the output timing (combination switching timing) may be determined based on metadata or may be determined based on a user's instruction.

In addition, by using the metadata that is output together with the output image, when the output image is recorded in a recording device or the like and played back, it is also possible to cue up a desired scene, such as a scene of cooking with a kitchen knife or a scene in which the lecturer L3 is speaking.
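Cueing from metadata amounts to searching the recorded switch events for the ones whose selected image matches the scene of interest. The sketch below assumes the metadata has already been reduced to a list of `(time_in_seconds, selected_image)` pairs; that format and the function names `cue_points` and `next_cue` are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Event = Tuple[float, str]  # (time in seconds, label of the image selected at that time)


def cue_points(events: Sequence[Event], label: str) -> List[float]:
    """All times at which the output switched to the given image."""
    return [time for time, selected in events if selected == label]


def next_cue(events: Sequence[Event], label: str, current_time: float) -> Optional[float]:
    """The first switch to `label` strictly after `current_time`, if any."""
    for time in sorted(cue_points(events, label)):
        if time > current_time:
            return time
    return None


if __name__ == "__main__":
    recorded = [(0.0, "351"), (42.5, "352A"), (90.0, "351"), (130.0, "352C"), (200.0, "352A")]
    # Jump to the next scene shot by the hand camera over the sink.
    print(next_cue(recorded, "352A", current_time=50.0))
```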

<5. Fourth Embodiment>
(Configuration example of image output system)
FIG. 16 is a diagram illustrating a configuration example of an image output system according to the fourth embodiment of the present disclosure.

In the image output system 400 of FIG. 16, while lecturer L4 is conducting an online lecture, a plurality of images are appropriately switched and output according to the progress of the lecture (behavior of lecturer L4).

The image output system 400 is composed of an instructor camera 401, a hand camera 402, and a switcher 410.

The lecturer camera 401 is a camera that shoots the lecturer L4 as the main subject. The lecturer L4 looks at the lecturer camera 401 and explains or talks to the students.

The lecturer camera 401 is installed so that the orientation of the body of the lecturer L4 when giving an explanation toward the lecturer camera 401 differs from the orientation of the body of the lecturer L4 when working on the lecture desk. The lecturer camera 401 is connected to the switcher 410, and the lecturer image of the lecturer L4 is output to the switcher 410.

The hand camera 402 is a camera whose shooting range covers the hands of the lecturer L4. Specifically, the hand camera 402 is configured as a document camera and captures the text and writing utensils on the lecture desk and how the lecturer L4 writes on them. The hand camera 402 is connected to the switcher 410, and a hand image showing the hands of the lecturer L4 is output to the switcher 410.

In the example of FIG. 16, the switcher 410 is configured by a PC. This eliminates the need for the switcher 410 to accept PC images as input from the outside.

Also, the switcher 410 is connected via the network NW to a PC 420 used by a student U4 who is taking an online lecture (a student listening to lecturer L4's speech). From the PC 420, a student image centering on the student U4 photographed by a PC camera incorporated in the PC 420 is input to the switcher 410.

The switcher 410 performs various image analysis, detection processing, recognition processing, etc. corresponding to the images from the instructor camera 401, the hand camera 402, and the PC 420, respectively. Based on these processing results, the switcher 410 selects and outputs an image suitable for the progress of the online lecture (behavior of lecturer L4) as an output target.

Although only one student U4 is shown in the example of FIG.

(Example of switcher functional configuration)
FIG. 17 is a block diagram showing a functional configuration example of the switcher 410.

The switcher 410 receives as input the instructor image from the instructor camera 401, the hand image from the hand camera 402, and the student image from the PC 420, and outputs the output image and its metadata. Note that the description of the PC image of the main body of the switcher 410 (PC) is omitted here.

The switcher 410 includes a body orientation detection unit 411, an object detection unit 412, a speaker detection unit 413, a body orientation detection unit 414, an image selection unit 415, and an image output/synthesis unit 416.

Note that the image selection unit 415 and the image output/synthesis unit 416 basically have the same functions as the image selection unit 114 and the image output/synthesis unit 115 described with reference to FIG. A metadata generation unit 415m included in the image selection unit 415 basically has the same function as the metadata generation unit 114m described with reference to FIG.

The body orientation detection unit 411 performs body orientation detection processing for detecting the orientation of the body of the instructor L4 in the instructor image input to the switcher 410 and supplies the detection result to the image selection unit 415 .

The object detection unit 412 performs object detection processing to detect whether or not an object to be shown to the student U4 appears in the hand image input to the switcher 410, and supplies the detection result to the image selection unit 415. Objects to be detected here are texts, printed matter, writing utensils, and other educational materials.

The speaker detection unit 413 performs speaker detection processing to detect the utterance of the student U4 in the student image input to the switcher 410, and supplies the detection result to the image selection unit 415. The speaker detection unit 413 may, for example, detect facial parts in the student image to detect that the mouth of the student U4 is open, or may detect the voice of the student U4.

The body orientation detection unit 414 performs body orientation detection processing to detect the body orientation of the student U4 in the student image input to the switcher 410, and supplies the detection result to the image selection unit 415.

With the above configuration, the switcher 410 can select at least one of the instructor image, the hand image, and the student image and output it as an output image.

(Flow of image output processing)
Here, the flow of image output processing for selecting and outputting either the instructor image or the hand image will be described with reference to the flowchart of FIG.

In step S211, the object detection unit 412 of the switcher 410 performs object detection processing on the hand image.

In step S212, the object detection unit 412 determines whether an object has been detected in the hand image based on the detection result of the object detection process. Specifically, it is determined whether or not texts, printed materials, writing utensils, and other teaching materials have been detected on the teacher's desk. If it is determined that these objects have been detected, the process proceeds to step S213.

In step S213, the body orientation detection unit 411 performs body orientation detection processing on the lecturer image.

In step S214, the body orientation detection unit 411 determines, based on the detection result of the body orientation detection processing, whether the body of the instructor L4 is facing the direction of working at hand. That is, it is determined whether or not the posture is one for giving an explanation. If it is determined that the body faces the direction of working at hand, the process proceeds to step S215.

In step S215, the image selection unit 415 selects the hand image as an output target.

On the other hand, if it is determined in step S212 that no object has been detected, or if it is determined in step S214 that the body is not facing the direction of working at hand, the process proceeds to step S216.

In step S216, the image selection unit 415 selects the lecturer image as an output target.

After step S215 or step S216, in step S217, the image output/synthesis unit 416 outputs the image selected by the image selection unit 415 as an output target.
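The branching of steps S211 to S217 can be sketched as follows. This is only an illustrative sketch, not the implementation of the switcher 410: the function name `select_output` and the two callables standing in for the object detection unit 412 and the body orientation detection unit 411 are assumptions introduced here.

```python
from typing import Callable


def select_output(
    detect_object_in_hand_image: Callable[[], bool],
    detect_body_facing_work: Callable[[], bool],
) -> str:
    """Return "hand" or "lecturer" following the order of steps S211 to S216.

    Both callables are hypothetical stand-ins for the detection units;
    each returns True when its condition is detected.
    """
    # Steps S211/S212: object detection on the hand image.
    if not detect_object_in_hand_image():
        return "lecturer"  # step S216
    # Steps S213/S214: body orientation detection on the lecturer image.
    # As in the flowchart, this runs only after an object has been detected.
    if not detect_body_facing_work():
        return "lecturer"  # step S216
    return "hand"  # step S215


if __name__ == "__main__":
    print(select_output(lambda: True, lambda: True))
    print(select_output(lambda: True, lambda: False))
```

Passing the detectors as callables keeps the second detection lazy, which mirrors the flowchart: body orientation detection is skipped when no object is found.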

If the hand image were selected as an output target simply because text or printed matter was detected in it, the hand image would be output even when the text or printed matter is merely lying on the lecturer's desk. In the above-described processing, by contrast, not only are text and printed matter detected in the hand image, but the orientation of the body is also detected in the lecturer image.

As a result, even if text or printed matter is detected in the hand image, the hand image is not output if the body is not facing the direction of the work at hand, because it is highly likely that the instructor L4 is speaking toward the instructor camera 401. Likewise, even if the body is facing the direction of the work at hand, the hand image is not output if no text or printed matter is detected in it, because it is highly likely that the instructor L4 is not explaining any text or printed matter.

According to the above processing, it is possible to avoid erroneously switching to the hand image merely because text or printed matter appears in it, or merely because the body is facing the direction of the work at hand. That is, a different type of detection processing is performed on each image, and the image to be output is appropriately selected based on the detection results, so that different types of images can be switched more suitably.

In general, in online lectures, web conferences, and the like, the image is often switched so as to focus on the speaker. Therefore, in the image output system 400 of this embodiment, in addition to the instructor image and the hand image, the student image can also be selected as an output target according to the detection results of the speaker detection processing and the body orientation detection processing performed on the student image.

(Flow of student image selection)
FIG. 19 is a flowchart for explaining the flow of student image selection. The processing of FIG. 19 can be executed in parallel with the image output processing described with reference to the flowchart of FIG.

In step S231, the speaker detection unit 413 performs speaker detection processing on the student image.

In step S232, the speaker detection unit 413 determines whether or not the utterance of student U4 has been detected based on the detection result of the speaker detection process. If it is determined that student U4's speech has been detected, the process proceeds to step S233.

In step S233, the body orientation detection unit 414 performs body orientation detection processing on the student image.

In step S234, the body orientation detection unit 414 determines whether or not the body of the student U4 is facing the PC camera of the PC 420 based on the detection result of the body orientation detection processing. If it is determined that the body is facing the PC camera, the process proceeds to step S235.

In step S235, the image selection unit 415 selects the student image as an output target. At this time, the image selection unit 415 may select the lecturer image as an output target together with the student image, or may select both the lecturer image and the hand image as output targets together with the student image.

On the other hand, if it is determined in step S232 that student U4's speech has not been detected, or if it is determined in step S234 that the body is not facing the PC camera, step S235 is skipped. That is, the student image is not selected as an output target.
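The student image selection of steps S231 to S235 can be sketched in the same style. This is a minimal sketch under assumptions: the function names and the string labels for the images are hypothetical, and the choice of which images accompany the student image is left to the caller, since the description permits several combinations.

```python
from typing import List


def student_image_selected(speech_detected: bool, body_facing_pc_camera: bool) -> bool:
    """Steps S232 and S234: select the student image only when both hold."""
    return speech_detected and body_facing_pc_camera


def output_targets(base_targets: List[str], speech_detected: bool,
                   body_facing_pc_camera: bool) -> List[str]:
    """Add "student" to the targets already chosen for the lecturer/hand images.

    base_targets is the result of the image output processing that runs in
    parallel (for example ["lecturer"] or ["lecturer", "hand"]).
    """
    targets = list(base_targets)
    if student_image_selected(speech_detected, body_facing_pc_camera):
        targets.append("student")  # step S235
    return targets  # otherwise step S235 is skipped


if __name__ == "__main__":
    print(output_targets(["lecturer"], True, True))
    print(output_targets(["lecturer", "hand"], True, False))
```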

According to the above processing, the student image is output as an output image in addition to the instructor image and the hand image, so that smooth communication between the lecturer L4 and the student U4 at a remote location can be realized even in an online lecture.

In the configuration of the switcher 410 in FIG. 17, the object detection unit 412 may, instead of detecting an object by image processing, use sensor data from a physical sensor to detect that an object has been placed in a work area such as a classroom.

In addition, the object detection unit 412 may use OCR (Optical Character Recognition) technology to recognize characters printed on text or printed matter, or handwritten characters written by the lecturer L4, and thereby detect whether or not an object to be shown to the student U4 appears in the image. Furthermore, in this case, actions such as writing and erasing of characters performed in time series may be detected.
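One way to picture the time-series detection of writing and erasing is to compare the sets of character strings recognized in successive samples. The sketch below assumes the OCR step has already produced a set of recognized strings per sample; the OCR itself is out of scope, and the function names and action labels are hypothetical.

```python
from typing import List, Set, Tuple


def diff_recognized_text(prev: Set[str], curr: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (newly appeared strings, disappeared strings) between two samples."""
    return curr - prev, prev - curr


def detect_actions(samples: List[Set[str]]) -> List[str]:
    """Label each transition between consecutive samples.

    "writing"   : strings appeared, none disappeared
    "erasing"   : strings disappeared, none appeared
    "rewriting" : both happened
    "none"      : no change
    """
    actions = []
    for prev, curr in zip(samples, samples[1:]):
        written, erased = diff_recognized_text(prev, curr)
        if written and erased:
            actions.append("rewriting")
        elif written:
            actions.append("writing")
        elif erased:
            actions.append("erasing")
        else:
            actions.append("none")
    return actions


if __name__ == "__main__":
    samples = [set(), {"E=mc^2"}, {"E=mc^2", "F=ma"}, {"F=ma"}, {"F=ma"}]
    print(detect_actions(samples))
```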

It should be noted that in the object detection process, only specific objects among the objects listed as detection targets described above may be used as detection targets or recognition targets.

Further, in the configuration of the switcher 410 in FIG. 17, a hand detection unit may be provided to detect the hand of the instructor L4 in the hand image, or a body position detection unit may be provided to detect the body position of the instructor L4 in the instructor image. Furthermore, in the configuration of the switcher 410 of FIG. 17, a face orientation detection unit may be provided to detect the orientation of the face of the instructor L4 in the instructor image and the orientation of the face of the student U4 in the student image.

In addition, only specific students may be targeted for detection or recognition by means of identifying individuals by face identification on student images or other methods. Further, the image selection unit 415 may select an image based on the result of recognition of a specific facial expression or emotion by performing facial expression recognition or emotion recognition on the student image.

(Operation of image output/compositing unit)
The image output/synthesis unit 416 can output at least one of the instructor image, the hand image, and the student image as an output image.

For example, when the lecturer image is selected as an output target, the image output/synthesis unit 416 may output only the lecturer image as the output image, or, as shown on the left side of FIG. 20, may output as the output image a composite image P401 obtained by combining the lecturer image 451 with at least one student image 461.

Further, when the hand image is selected as an output target, the image output/synthesis unit 416 may output only the hand image as the output image, or, as shown on the right side of FIG. 20, may output as the output image a composite image P402 obtained by combining the hand image 452 with the lecturer image 451 and at least one student image 461.

Further, although not illustrated, when a student image is selected as an output target, the image output/synthesis unit 416 may output only the student image showing the student who is speaking, or may output as the output image a composite image obtained by combining that student image with the instructor image, the hand image, and student images showing other students.

Note that the combination and layout of images in each of the synthesized images P401 and P402 shown in FIG. 20 are not limited to this. Also, the layout of the synthesized images P401 and P402 and the output timing (combination switching timing) may be determined based on the metadata, or may be determined based on the user's instruction.
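The combinations described above for each output target can be summarized as a small lookup. This sketch only lists which images would be combined; it does not perform any image synthesis, and the function name, labels, and the `composite` flag are assumptions made for illustration. As noted above, other combinations and layouts are equally possible.

```python
from typing import List


def images_for_output(target: str, composite: bool, other_students: int = 1) -> List[str]:
    """List the images that make up the output image for a selected target.

    The first element is the main image; the rest are combined with it
    when composite is True.
    """
    students = [f"student_{i}" for i in range(1, other_students + 1)]
    if target == "lecturer":
        # Lecturer image alone, or combined with student images (as in P401).
        return ["lecturer"] + (students if composite else [])
    if target == "hand":
        # Hand image alone, or combined with lecturer and student images (as in P402).
        return ["hand"] + (["lecturer"] + students if composite else [])
    if target == "student":
        # Speaking student alone, or combined with lecturer, hand, and other students.
        return ["speaking_student"] + (["lecturer", "hand"] + students if composite else [])
    raise ValueError(f"unknown target: {target}")


if __name__ == "__main__":
    print(images_for_output("lecturer", composite=True))
    print(images_for_output("hand", composite=True, other_students=2))
```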

Also, as the metadata output together with the output image, characters recognized in the image at hand by OCR technology may be output. As a result, when an output image is recorded in a recording device or the like and reproduced, it is possible to cue a desired scene by searching for a keyword.
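As an illustration of keyword-based cueing, the sketch below assumes that the recorded metadata is available as a list of per-frame records holding a timestamp and the text recognized by OCR. The record type `FrameMetadata`, its fields, and the function `find_cue_points` are hypothetical names, and the actual metadata format is not specified in this description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameMetadata:
    timestamp_s: float  # playback position of the frame, in seconds
    ocr_text: str       # characters recognized in the hand image for that frame


def find_cue_points(metadata: List[FrameMetadata], keyword: str) -> List[float]:
    """Return the timestamps whose recognized text contains the keyword.

    Matching is case-insensitive; records are returned in input order.
    """
    needle = keyword.casefold()
    return [m.timestamp_s for m in metadata if needle in m.ocr_text.casefold()]


if __name__ == "__main__":
    records = [
        FrameMetadata(0.0, "Lesson 3: Overview"),
        FrameMetadata(12.0, "Fourier series"),
        FrameMetadata(30.5, "Example: fourier transform of a pulse"),
    ]
    print(find_cue_points(records, "fourier"))
```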

<6. Variation>
Modifications of the above-described embodiment will be described below.

(Image output based on priority)
In the image output system according to the above-described embodiment, the subject of the lecturer image as the person image is one lecturer, but there may be a plurality of lecturers. In this case, a plurality of lecturer images obtained by photographing each lecturer as a subject are input to the switcher. The switcher performs similar detection processing on each of the plurality of lecturer images, and selects the lecturer image to be output based on the respective detection results.

Here, when all instructor images input to the switcher are selected as output targets, all of the instructor images may be output as output images, or the instructor image to be output may be specified based on a priority.

For example, a priority may be given to each teacher's face registered in advance, and the teacher image to be output may be specified based on the priority given to the face recognized in the teacher image.

Also, a priority may be given to each camera that captures the lecturer, and the lecturer image to be output may be specified based on the priority given to the camera that captured the lecturer image.

Further, a priority may be assigned to each instructor image in the order of the times at which the instructor starts appearing in the respective instructor images.
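The three ways of assigning priority can be sketched with a single selection function and interchangeable ranking rules. The record type, field names, and priority tables below are hypothetical. Two points are assumptions rather than statements from this description: a smaller number means a higher priority, and the instructor who appeared earliest is ranked first.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LecturerImage:
    camera_id: str       # camera that captured the image
    face_id: str         # identity recognized in the image
    first_seen_s: float  # time at which the lecturer started appearing


def pick(images: List[LecturerImage],
         rank: Callable[[LecturerImage], float]) -> LecturerImage:
    """Return the image with the smallest rank value (highest priority)."""
    return min(images, key=rank)


def by_face(priority: Dict[str, int]) -> Callable[[LecturerImage], float]:
    # Unregistered faces get the lowest priority.
    return lambda img: priority.get(img.face_id, float("inf"))


def by_camera(priority: Dict[str, int]) -> Callable[[LecturerImage], float]:
    return lambda img: priority.get(img.camera_id, float("inf"))


def by_appearance(img: LecturerImage) -> float:
    return img.first_seen_s


if __name__ == "__main__":
    images = [
        LecturerImage("cam_a", "lecturer_x", 40.0),
        LecturerImage("cam_b", "lecturer_y", 10.0),
    ]
    print(pick(images, by_face({"lecturer_x": 1, "lecturer_y": 2})).camera_id)
    print(pick(images, by_appearance).camera_id)
```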

(Support for cloud computing)
In the image output system according to the embodiment described above, the switcher (image output device) is provided in an on-premises environment together with each camera. However, the configuration is not limited to this, and some functions of the switcher may be provided in a cloud environment.

For example, the image selection unit and the image output/synthesis unit included in the switcher may be provided in the cloud environment. In this case, detection processing for images captured by each camera is performed in an edge environment. From the edge environment, each image and the detection result of detection processing for each image are uploaded to the cloud environment. In the cloud environment, an image to be output is selected based on the image from the edge environment and the detection result.

By distributing switcher functions in the cloud environment and edge environment in this way, it is possible to reduce running costs compared to building all switcher functions in the cloud environment.
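The division of roles between the edge environment and the cloud environment can be pictured as follows. This sketch makes several assumptions that are not part of this description: detection results are serialized as JSON, each image is represented by a reference string rather than pixel data, and the field names (`image`, `frame_ref`, `detection`, `object_detected`, `facing_work_area`) are hypothetical. The selection rule reuses the hand image and lecturer image conditions of the fifth embodiment purely as an example.

```python
import json
from typing import Dict, List


def edge_payload(image_name: str, frame_ref: str, detection: Dict[str, bool]) -> str:
    """Edge side: bundle an image reference with its detection result for upload."""
    return json.dumps({"image": image_name, "frame_ref": frame_ref, "detection": detection})


def cloud_select(payloads: List[str]) -> str:
    """Cloud side: choose the image to output from the uploaded detection results.

    Returns the frame reference of the selected image.
    """
    by_image = {p["image"]: p for p in (json.loads(s) for s in payloads)}
    object_detected = by_image["hand"]["detection"].get("object_detected", False)
    facing_work = by_image["lecturer"]["detection"].get("facing_work_area", False)
    chosen = "hand" if (object_detected and facing_work) else "lecturer"
    return by_image[chosen]["frame_ref"]


if __name__ == "__main__":
    uploads = [
        edge_payload("lecturer", "lecturer/000123", {"facing_work_area": True}),
        edge_payload("hand", "hand/000123", {"object_detected": True}),
    ]
    print(cloud_select(uploads))
```

In this arrangement the cloud side only evaluates lightweight detection results, while the detection processing itself stays at the edge.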

<7. Computer configuration example>
The series of processes described above can be executed by hardware or by software. When executing a series of processes by software, a program that constitutes the software is installed from a program recording medium into a computer built into dedicated hardware or a general-purpose personal computer.

FIG. 21 is a block diagram showing a hardware configuration example of a computer that executes the series of processes described above by a program.

A switcher as an image output device to which the technology according to the present disclosure can be applied is implemented by a computer 500 having the configuration shown in FIG.

A CPU 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are interconnected by a bus 504.

An input/output interface 505 is further connected to the bus 504. The input/output interface 505 is connected to an input unit 506 such as a keyboard and a mouse, and an output unit 507 such as a display and a speaker. The input/output interface 505 is also connected to a storage unit 508 including a hard disk or nonvolatile memory, a communication unit 509 including a network interface, and a drive 510 for driving a removable medium 511.

In the computer configured as described above, the CPU 501, for example, loads a program stored in the storage unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the above-described series of processes is performed.

The programs executed by the CPU 501 are recorded on the removable media 511, or provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital broadcasting, and installed in the storage unit 508.

It should be noted that the program executed by the computer may be a program in which processing is performed in chronological order according to the order described in this specification, or a program in which processing is performed in parallel or at necessary timing, such as when a call is made.

The embodiments of the present disclosure are not limited to the embodiments described above, and various modifications are possible without departing from the gist of the present disclosure.

In addition, the effects described in this specification are merely examples and are not limiting, and other effects may be provided.

Furthermore, the present disclosure can be configured as follows.
(1)
An image output device comprising: an image selection unit that selects one or more images to be output based on detection results obtained by different types of detection processing for each of a plurality of images.
(2)
The image output device according to (1), wherein the image selection unit selects at least one of a first image and a second image as the output target based on, as the detection results, the state of a person in the first image and the state of an object in the second image.
(3)
The image output device according to (2), wherein the image selection unit selects the second image as the output target when the state of the object in the second image satisfies a predetermined condition and the state of the person in the first image indicates a predetermined relationship with the object.
(4)
The image output device according to (2), wherein the first image is a person image captured with the person as the central subject, and the state of the person includes at least one of the posture, position, and presence or absence of speech of the person.
(5)
The image output device according to (2), wherein the second image includes a hand image whose shooting range is the person's hands, and the state of the object includes at least one of the presence or absence, number, position, orientation, and shape of the object.
(6)
The image output device according to (5), wherein the object is the person's hand.
(7)
The image output device according to (5), wherein the object is an object handled by the person.
(8)
The image output device according to (2), wherein the second image includes a display screen of a computer operated by the person, and the state of the object includes a change in the display screen.
(9)
The image output device according to (2), wherein the second image includes a listener image captured with, as the central subject, a listener who listens to the person's utterance via a network, and the state of the object includes at least one of the posture and the presence or absence of speech of the listener.
(10)
The image output device according to (2), wherein the state of the object includes a sound detection result.
(11)
The image output device according to any one of (1) to (10), further comprising an image output unit that outputs the image selected as the output target, wherein, when two or more of the images are selected as output targets, the image output unit synthesizes and outputs the two or more images.
(12)
The image output device according to (11), wherein the image output unit outputs the image based on metadata of the image selected as the output target.
(13)
The image output device according to (12), wherein the image output unit outputs the image at timing based on the metadata.
(14)
The image output device according to (12), wherein, when two or more of the images are selected as output targets, the image output unit synthesizes the two or more images in a layout based on the metadata.
(15)
The image output device according to (12), wherein the image selection unit generates the metadata including the detection result, and the image output unit outputs the image based on the metadata generated by the image selection unit.
(16)
The image output device according to any one of (1) to (15), wherein the image selection unit selects the image to be output based on the detection result for each frame.
(17)
The image output device according to any one of (1) to (15), wherein the image selection unit selects the image to be output based on the detection result continuously obtained for a certain period of time.
(18)
The image output device according to any one of (1) to (15), wherein the image selection unit selects the image to be output based on the detection result obtained at a specific frequency.
(19)
The image output device according to any one of (1) to (18), further comprising a plurality of detection units that perform different types of detection processing on each of the plurality of images.
(20)
An image output method comprising: selecting, by an image output device, one or more images to be output based on detection results obtained by different types of detection processing for each of a plurality of images.
(21)
A program for causing a computer to execute a process of selecting one or more images to be output based on detection results obtained by different types of detection processes for each of a plurality of images.
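
Configurations (16) and (17) above differ in how long a detection result must hold before the output changes. The following sketch shows one possible way to realize both with a single class: the output switches only after the same per-frame selection has been observed for a given number of consecutive frames. The class name `StableSelector` and the parameter `hold_frames` are hypothetical. Setting `hold_frames` to 1 gives per-frame switching as in (16), and a larger value approximates the "certain period of time" of (17).

```python
class StableSelector:
    """Switch the output only after a new selection persists for hold_frames frames."""

    def __init__(self, initial: str, hold_frames: int) -> None:
        self.current = initial          # image currently being output
        self.hold_frames = hold_frames  # consecutive frames required to switch
        self._candidate = initial       # selection waiting to be confirmed
        self._count = 0                 # consecutive frames seen for the candidate

    def update(self, per_frame_choice: str) -> str:
        """Feed the selection computed for one frame; return the image to output."""
        if per_frame_choice == self.current:
            # The current output is confirmed again; drop any pending candidate.
            self._candidate = self.current
            self._count = 0
        elif per_frame_choice == self._candidate:
            self._count += 1
        else:
            # A different selection appeared; start counting it from this frame.
            self._candidate = per_frame_choice
            self._count = 1
        if self._candidate != self.current and self._count >= self.hold_frames:
            self.current = self._candidate
            self._count = 0
        return self.current


if __name__ == "__main__":
    selector = StableSelector("lecturer", hold_frames=3)
    frames = ["hand", "hand", "lecturer", "hand", "hand", "hand", "hand"]
    print([selector.update(f) for f in frames])
```

A brief detection of two frames is ignored in this example, so a momentary glance at the desk would not cause a switch.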

10 image output device, 11, 12, 13 state detection unit, 14 image selection unit, 14m metadata generation unit, 15 image output/synthesis unit

Claims

An image output device comprising: an image selection unit that selects one or more images to be output based on detection results obtained by different types of detection processing for each of a plurality of images.
The image output device according to claim 1, wherein the image selection unit selects at least one of a first image and a second image as the output target based on, as the detection results, the state of a person in the first image and the state of an object in the second image.
The image output device according to claim 2, wherein the image selection unit selects the second image as the output target when the state of the object in the second image satisfies a predetermined condition and the state of the person in the first image indicates a predetermined relationship with the object.
The image output device according to claim 2, wherein the first image is a person image captured with the person as the central subject, and the state of the person includes at least one of the posture, position, and presence or absence of speech of the person.
The image output device according to claim 2, wherein the second image includes a hand image whose shooting range is the person's hands, and the state of the object includes at least one of the presence or absence, number, position, orientation, and shape of the object.
The image output device according to claim 5, wherein the object is the person's hand.
The image output device according to claim 5, wherein the object is an object handled by the person.
The image output device according to claim 2, wherein the second image includes a display screen of a computer operated by the person, and the state of the object includes a change in the display screen.
The image output device according to claim 2, wherein the second image includes a listener image captured with, as the central subject, a listener who listens to the person's utterance via a network, and the state of the object includes at least one of the posture and the presence or absence of speech of the listener.
The image output device according to claim 2, wherein the state of the object includes a sound detection result.
The image output device according to claim 1, further comprising an image output unit that outputs the image selected as the output target, wherein, when two or more of the images are selected as output targets, the image output unit synthesizes and outputs the two or more images.
The image output device according to claim 11, wherein the image output unit outputs the image based on metadata of the image selected as the output target.
The image output device according to claim 12, wherein the image output unit outputs the image at timing based on the metadata.
The image output device according to claim 12, wherein, when two or more of the images are selected as output targets, the image output unit synthesizes the two or more images in a layout based on the metadata.
The image output device according to claim 12, wherein the image selection unit generates the metadata including the detection result, and the image output unit outputs the image based on the metadata generated by the image selection unit.
The image output device according to claim 1, wherein the image selection unit selects the image to be output based on the detection result for each frame.
The image output device according to claim 1, wherein the image selection unit selects the image to be output based on the detection result continuously obtained for a certain period of time.
The image output device according to claim 1, wherein the image selection unit selects the image to be output based on the detection result obtained at a specific frequency.
An image output method comprising: selecting, by an image output device, one or more images to be output based on detection results obtained by different types of detection processing for each of a plurality of images.
A program for causing a computer to execute a process of selecting one or more images to be output based on detection results obtained by different types of detection processes for each of a plurality of images.