WO2015125815A1

WO2015125815A1 - Video image editing apparatus

Info

Publication number: WO2015125815A1
Application number: PCT/JP2015/054406
Authority: WO
Inventors: 内海　端; 将伸八杉; 貴也山本; 知宏猪飼; 山本　智幸
Original assignee: シャープ株式会社
Priority date: 2014-02-20
Filing date: 2015-02-18
Publication date: 2015-08-27
Also published as: JPWO2015125815A1

Abstract

In order to address the problem of checking and viewing a large amount or number of still images or moving images in a short amount of time without requiring much effort, the present invention provides a video image editing apparatus (100) provided with: a scene information generation unit (102) which divides an image data group including moving images into one or more scenes as well as generates scene information representing the characteristics for each scene-unit; and a digest moving image generation unit (103) which generates a digest moving image on the basis of the scene information. The digest moving image generation unit (103) determines whether or not to use each scene when generating the digest moving image and whether or not to place a plurality of scenes in the same frame, and determines a spatial placement pattern for the scenes when placing a plurality of scenes within the same frame.

Description

Video editing device

The present invention relates to a video editing apparatus that automatically edits video information such as moving images and still images.

With the widespread use of video equipment with shooting functions for still images and moving images, such as digital cameras and smartphones, and the increased capacity of recording media such as memory cards, it is now possible to easily capture video information. ing. As one of means for utilizing video information taken by a user of such video equipment, there is generation of a digest moving image. The digest moving image is a moving image of a relatively short time that is reconstructed so that a large number or a long time moving image can be input, and all or all of the moving images can be viewed in summary or partially.

Patent Document 1 discloses an image display device that simultaneously and continuously displays a digest moving image and a still image. In Patent Document 1, still images or moving images arranged continuously are assigned to areas arranged in a frame shape on a movie film, and a plurality of images can be viewed simultaneously.

Japanese Patent Publication “Japanese Patent Laid-Open No. 2010-258768 (Publication Date: November 11, 2010)”

However, the conventional digest moving image generation and its display device have the following problems. Hereinafter, when “image” is described in this specification, it means either or both of a still image and a moving image unless otherwise noted. The same applies to “image file” and “image data”.

In Patent Document 1, a still image or a moving image that is desired to be displayed needs to be displayed once on the screen as a thumbnail image and selected from the displayed images. If the number of captured images is small, there is no particular problem. However, if a large number of images are taken and not organized, the user can select each of the huge thumbnail images. There is a need to choose. As the number of images increases, the time and labor required for the selection work increase, and the burden on the user increases. Further, if a still image is a target, it is easy to grasp the content from the thumbnail image, but if a moving image is a target, it may be difficult to grasp the content from the thumbnail image. In such a case, it leads to dissatisfaction that the user cannot select an appropriate image instead of selecting it with effort.

In Patent Document 1, there is a problem that the display is monotonous and easy to get tired because a still image or a moving image is separately displayed in a display area fixedly arranged on the display device. In addition, in a display device with a small screen such as a smartphone or a small tablet PC, there is a problem that it is difficult to see each image displayed separately because the display area is small.

The present invention has been made in view of the above points, and it takes a long time to confirm or appreciate the contents of an image, and it takes a lot of still images and moving images to take a long time or troublesome operation. Provided is a video editing apparatus or method that can be confirmed and viewed in a short time without spending time.

In order to solve the above-described problem, a video editing apparatus according to the present invention divides an image data group including a moving image into one or more scenes and generates scene information indicating scene-specific features. And a digest moving image generating unit that generates a digest moving image of the image data based on the scene information, wherein the digest moving image generating unit is based on the scene information, Whether to use each scene when generating a digest video, whether to place multiple scenes in the same frame, and the spatial arrangement pattern of scenes when placing multiple scenes in the same frame It is characterized by deciding.

In order to solve the above-described problem, a video editing apparatus according to the present invention includes a playback time candidate derivation unit that derives a playback time candidate for a digest moving image based on an image data group, and the playback time candidate to a user. A playback time candidate display unit that presents and sets a designated time based on a user event, a scene information generation unit that divides an image data group including a moving image into one or more scenes, and an image based on the scene A video editing apparatus comprising: a digest moving image generating unit configured to generate a clip and generate a digest moving image by temporally combining the image clips, wherein the digest moving image generating unit Adjustment is performed such that the reproduction time becomes the specified time.

In order to solve the above-described problem, the video editing apparatus according to the present invention divides an image data group including a moving image into one or more scenes and generates scene information indicating scene-specific features. An information generation unit, an output control unit for determining a digest moving image generation policy, and notifying the determined generation policy to the digest moving image generation unit, and a plurality of scenes on the screen based on the scene information and the generation policy A digest moving image generating unit that generates a digest moving image of the image data group, and a video display unit that displays video and operation information; A digest moving image editing control unit that reproduces the digest moving image and outputs it to the video display unit, and an operation unit that detects an operation input from the outside. That a video editing apparatus is characterized by changing the configuration of the digest moving image by the detected operation input by the operation unit.

According to the present invention, it is possible to confirm and view a large number and a large number of still images and moving images in a short time without trouble.

It is the schematic which shows the internal structure of the video editing apparatus based on this invention. It is a figure which shows an example of the scene information by this invention. It is a conceptual diagram which shows the production | generation process of the digest moving image by this invention. It is a figure which shows an example of the scene information by this invention, and the kind of scene. It is a figure which shows the example of a display of the digest moving image by this invention. It is a figure which shows another example of a display of the digest moving image by this invention. It is the schematic which shows the internal structure of a digest moving image generation part. It is the schematic which shows the internal structure of the video editing apparatus and digest moving image generation part which concern on 3rd Example of this invention. It is a conceptual diagram of the person information in the 3rd Example of this invention. It is a figure which shows an example of the scene information in the 3rd Example of this invention. It is a figure which shows the example of the scene arrangement | positioning in the 3rd Example of this invention. It is a figure which shows another example of the scene arrangement | positioning in the 3rd Example of this invention. It is a figure which shows another example of the scene arrangement | positioning in the 3rd Example of this invention. It is a figure which shows the example of the relationship between the scene information and filter strength in 3rd Example of this invention. It is a figure which shows the example of the classification | category of the image data group in this invention. It is a figure which shows an example of the display screen at the time of selecting the image data group of edit object in this invention. It is the schematic which shows the internal structure of the video editing apparatus and digest moving image generation part which concern on 4th Example of this invention. It is a figure which shows the example of the title screen and image clip which the video editing apparatus concerning 4th Example of this invention produces | generates. It is a figure which shows the example of the scene arrangement | positioning for the landscape screen by this invention. It is a figure which shows the example of the scene arrangement | positioning for portrait screens by this invention. It is a figure which shows the determination process of the image area | region used for the scene arrangement | positioning for the landscape screen by this invention. It is a figure which shows the determination process of the image area | region used for the scene arrangement | positioning for portrait screens by this invention. It is the schematic which shows the internal structure of the video editing apparatus based on the 5th Example of this invention. It is a figure which shows an example of the user interface which designates the reproduction time of the digest moving image which concerns on 5th Example of this invention. It is a conceptual diagram which shows the production | generation process of the digest moving image by the 5th Example of this invention. It is a conceptual diagram for demonstrating the reproduction | regeneration time and designated time of a digest moving image by the 5th Example of this invention. It is the schematic which shows the internal structure of the digest moving image generation part which concerns on the 5th Example of this invention. It is a conceptual diagram for demonstrating the process which shortens the reproduction time of an image clip. It is a conceptual diagram for demonstrating the process which lengthens the reproduction time of an image clip. It is the schematic which shows the internal structure of the digest moving image generation part which concerns on the 6th Example of this invention. It is the schematic which shows the internal structure of the video editing apparatus based on the 7th Example of this invention. It is a flowchart which shows the process of the digest moving image edit which concerns on the 7th Example of this invention. It is a figure which shows the example of edit operation of the digest moving image by the 7th Example of this invention. It is a figure which shows another example of edit operation of the digest moving image by the 8th Example of this invention. It is a figure which shows the example of a change of the digest moving image before and behind the scene deletion operation of the digest moving image by the 8th Example of this invention.

(First embodiment)
Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a schematic diagram showing the configuration of a video editing apparatus according to the first embodiment of the present invention. The video editing apparatus 100 includes an image data classification unit 101, a scene information generation unit 102, a digest moving image generation unit 103, an event selection unit 104, and an output control unit 105. Although not shown, the video editing apparatus 100 may further include a data recording unit that stores image data, a video display unit that displays images, or a data recording device that has the same functions as those described above. The video display device may be configured to be connectable to the outside.

The image data classification unit 101 classifies image data. The image data is data that records a moving image. The playback time of the moving image, date and time information indicating the date and time when the image was shot or created, position information that indicates the location (position) when the image was shot and created, and shooting or creation. Electronic data including metadata such as creator information indicating the user or the device that has performed. Each image data may be an electronic file stored in a recording medium (not shown), or may be digital data including an image / audio signal input from the photographing apparatus. The image data may include a still image.

The image data classification unit 101 classifies each image into one or more image data groups that match a predetermined condition based on metadata included in the image data. For example, image data captured on the same date is classified as one image data group. Furthermore, referring to position information at the time of shooting, a plurality of image data having the same shooting date and time and position information within a predetermined range may be classified as one image data group. Alternatively, a plurality of pieces of image data whose position information at the time of shooting is within a predetermined range even when the shooting dates and times are different may be classified as one image data group. Further, for example, a plurality of pieces of image data having position information within a predetermined range and the same creator information may be classified as one image data group.

FIG. 15 shows an example of an image data group classified by the image data classification unit 101. It is assumed that

image data

11, 12, 13,... 1n, 21, 22, 23,. The image data group 10 includes

image data

11, 12, 13,... 1n, and the image data group 20 includes

image data

21, 22, 23,.

Image data

11, 12, 13,..., 1n includes

metadata

11a, 12a, 13a... Included in each image data, the date and time information is “01/01/2014” and the position information is “around home”. The point is common. The image data group 10 is an example in which the

image data

11, 12, 13,... Having the same date information (shooting date) and position information are classified as one image data group. The

image data

21, 22, 23,... 2n, among the

metadata

21a, 22a, 23a... Included in the respective image data, has different date / time information in the range from January 02, 2014 to January 05, 2014. The point is that the location information is “Hawaii Island”. The image data group 20 is an example in which the

image data

21, 22, 23,..., Which have different date information (shooting date) but whose position information is within a predetermined range, is classified as one image data group. The image data classification unit 101 generates image data

group identification information

10A and 20A as information indicating the image data groups classified in this way. The image data

group identification information

10A and 20A includes the name of the image data group and information indicating the image data included in the image data group in order to identify the image data group. In the example of FIG. 15, the image data classifying unit 101 gives a character string “01/01/2014 around the home” as the name of the image data group 10. In addition, as the name of the image data group 20, a character string “02.05 to 05.2014, Hawaii Island” is given. Although omitted in the figure, as the information constituting the image data

group identification information

10A, 20A, the name of each image data included in the image data group (file name such as /data/DSC_1001.mov in the figure) In addition, the image data group identification information may be configured to include the shooting date and time.

Returning to FIG. 1, the scene information generation unit 102 analyzes the image data, classifies the image data into one or more scenes characterized by an image signal or an audio signal, and generates scene information that is information indicating the feature of each scene. Generate. The scene information is, for example, motion information indicating changes in the time direction in the image, person information indicating the number or size of a person's area appearing in the image, or conversation indicating the presence or length of an utterance section in an audio signal. It includes information. Details of the scene information generation unit 102 and the generated scene information will be described later.

The digest moving image generation unit 103 reads the scene information generated by the scene information generation unit 102 in units of image data groups classified by the image data classification unit 101, and follows the time series of image data shooting / creation. Then, a digest moving image is generated. When there are a plurality of image data groups, a digest moving image is generated for the image data group selected by the event selection unit 104 described later. In addition, the digest moving image generation unit 103 generates a digest moving image according to a generation policy notified from the output control unit 105 described later when generating a digest moving image. The video editing device 100 outputs the generated digest moving image to a video display unit built in the video editing device 100 or an externally connected video display device, or a built-in data recording unit or an externally connected data recording Output to the device. Details of the operation of the digest moving image generating unit 103 will be described later.

(Event selection part)
The event selection unit 104 determines which image data group among the image data groups classified by the image data classification unit 101 is to be edited. For example, the image data group captured on the previous day, that is, the image data group whose shooting date and time is the previous day of the editing date, is determined as an editing target based on the editing date on which the digest moving image is automatically edited. Further, on the basis of the designated date and time designated by the user instead of the editing date, the image data group with the shooting date and time before and after the designated date and time may be determined as the editing target. The image data group that the event selection unit 104 determines to be edited may be based not only on date / time information but also on position information and creator information. For example, an image data group including image data having position information specified by the user or position information within a predetermined range including the position may be determined as an editing target. Alternatively, among a plurality of image data groups including image data having position information within a predetermined range for different creators, only an image data group having specific creator information may be determined as an editing target. Conversely, an image data group excluding an image data group having specific creator information may be determined as an editing target. The number of image data groups that the event selection unit 104 determines to be edited is not limited to one, and may be two or more. In addition, as a timing when the event selection unit 104 determines an image data group to be edited, a day change may be used as a trigger. For example, when midnight has passed, the image data group captured on the previous day may be determined as the editing target. Further, as another example of a method in which the event selection unit 104 determines an image data group to be edited, the event selection unit 104 may determine the image data group according to a user selection. For example, the event selection unit 104 displays information indicating one or more image data groups classified by the image data classification unit 101 on a display unit (not shown). The information indicating the image data group may be, for example, a character string representing the shooting date or creator of the image data group, or may be an icon or thumbnail image on a map image indicating the range of shooting position information included in the image data group. Good. From the displayed information, the user designates an image data group desired to be edited for the digest moving image. The event selection unit 104 determines the image data group designated by the user as the image data group to be edited. The event selection unit 104 notifies the digest moving image generation unit 103 of information (selection information) indicating the determined image data group.

FIG. 16 shows an example of a display screen by the event selection unit 104 when the user selects and determines an image data group to be edited. The event selection unit 104 uses the image data

group identification information

10A, 20A,... Indicating the

image data groups

10, 20,. And the selection display screen 40 including the

names

41, 42, 43,... Is output. The user inputs the name (41, 42, 43..., Etc.) of the image data group desired to be edited via an operation means (for example, touch panel, mouse, keyboard, etc.) connected to or incorporated in the video editing apparatus 100. specify. The event selection unit 104 selects the name of the image data group designated by the user (“January 02 to 05, 2014, Hawaii Island” in the example of FIG. 16) or image data group identification information (10A, 20A or the like) is sent to the digest moving image generation unit 103 as selection information indicating the image data group.

(Output control unit)
The output control unit 105 determines an output destination and a generation policy of the digest moving image generated by the digest moving image generating unit 103. The output control unit 105 inputs capability information indicating the number of display pixels of a video display device (not shown), audio output specifications, and the like, and determines a digest moving image generation policy based on the capability information. When there are a plurality of output destination video display devices, for example, the video editing device of this embodiment has a built-in video display unit capable of displaying a digest moving image, and another video display device is connected to the outside. In this case, a digest moving image generation policy is determined for each of the built-in video display unit and the externally connected video display device. Also, whether the digest video to be generated is displayed as video, encoded as data and stored in recording media, or output to the outside via communication media, regardless of whether it is internal or external output . The output control unit 105 notifies the digest video generation unit 103 of the determined generation policy.

Details of processing contents in the output control unit 105 will be described below. The output control unit 105 determines a digest moving image generation policy based on information constituting a digest moving image generation policy given by an input unit (not shown). The digest moving image generation policy is a kind of parameter set including information indicating output destination information, output image specifications, output audio specifications, scene selection criteria, and simultaneous arrangement of a plurality of scenes. Hereinafter, the process of determining the digest moving image generation policy in the output control unit 105 will be described for each parameter.

The output control unit 105 determines, as output destination information, information indicating whether the output destination video display device is a video display unit built in the video editing device 100 or an externally connected video display device. To do. Whether the output destination is built-in or externally connected is determined by electrically detecting whether or not a display device is connected to the outside of the video editing apparatus 100, or specified by an input unit (not shown). When the video editing device 100 has a built-in video display unit and an external video display device is connected, the output control unit 105 determines output destination information so that both are output destinations. Also good.

The output control unit 105 displays information indicating image display specifications such as the number of display pixels of an output destination video display device, that is, a video display unit built in the video editing device or an externally connected video display device. To determine the output image specifications. The output image specification is configured to include at least the number of output horizontal pixels and the number of output vertical pixels. Basically, the number of display pixels in the horizontal direction and the vertical direction of the output destination video display device is used as it is. Set to the number of output vertical pixels. However, if it is known that the output video display device does not display an image on the full screen of the display device, such as displaying an image in a window, the value is smaller than the number of display pixels of the output video display device. May be set to the number of output horizontal pixels and the number of output vertical pixels, respectively.

The output control unit 105 outputs information indicating the audio playback capability of the audio output device of the output destination, that is, the audio output unit incorporated in the video editing device or the audio output unit of the video display device connected to the outside, to the video display device. Receive from inside or outside and determine output audio specifications. The output audio specification includes at least the number of output audio channels. If the audio output corresponds to stereo, the number of output audio channels = 2, and if the audio output does not correspond to stereo, the number of output audio channels = 1, the audio output function If there is not, the number of output audio channels = 0 is set. In addition to the output audio channel, the output audio specification may include a sampling frequency, the number of quantization bits, etc., and any information sets information indicated by the audio reproduction capability of the output audio reproduction device. For example, there are examples of sampling frequencies such as 32 kHz, 44.1 kHz, 48 kHz, and 96 kHz, and examples of the number of quantization bits include 8 bits, 16 bits, and 24 bits.

The output control unit 105 receives information indicating the user's preference regarding the digest moving image generation by an input unit (not shown), and determines scene selection criteria such as “person main” and “landscape main”. The information indicating the user's preference may be language information such as “person” or “landscape”, or may be information indicating the image itself obtained by selecting from thumbnail images having different tendencies, for example. Information indicating user preferences is not only “person-centric” or “landscape-centric”, but also information indicating major subjects such as “specific persons”, “animals”, and “flowers” based on face recognition and shape recognition. Information indicating the details of the image, such as “seaside” and “forest” based on a distribution analysis of pixel values, may be used. When information indicating user preference is not specified, the output control unit 105 may set “person subject” as a standard scene selection criterion.

The output control unit 105 sets, as “multiple scene simultaneous arrangement”, information indicating whether or not multiple scenes are allowed to be simultaneously arranged in the same image frame in accordance with the video display device of the output destination. For example, if the video display unit built in the video editing device is valid as the output destination, the simultaneous arrangement of multiple scenes is set to “No”, and if the video display device connected externally is valid as the output destination, multiple scenes Set simultaneous placement to “Yes”. This criterion is based on the premise that the display device built in the video editing apparatus is small (for example, when the video editing apparatus is a smartphone), and therefore, when connected externally, it is connected to a large display device. is there. In addition, when the size of the display device of the output destination is known in advance, or when information capable of calculating the size of the display device (for example, information indicating pixel density: dpi) is obtained, The size is calculated, and if it is larger than the predetermined threshold, the simultaneous arrangement of a plurality of scenes is set to “permitted”, and if not, the simultaneous arrangement of a plurality of scenes is set to “No”.

(Scene information generator)
Next, details of the scene information generation unit 102 and the generated scene information will be described. FIG. 2 is an example of scene information generated by the scene information generation unit. The scene information 200 shown in FIG. 2 describes scene-related information in units of lines, and each

line

201, 202, 203,... Is configured to correspond to one scene. Information described in each

row

201, 202, 203,... Indicates an image file name, shooting date, shooting time, scene start frame number, scene end frame number, person information, motion information, and conversation information in order from the left. ing. The image file name is a character string indicating a storage location of still image data or moving image data including each scene. The shooting date and shooting time are basically character strings indicating the date and time when the image file including each scene is recorded. The scene start frame number and the scene end frame number are information indicating the time range (scene length) of the scene in the corresponding image file. For example, when the scene start frame number is 0 and the scene end frame number is 149, if the corresponding image file is moving image data of 30 fps, it indicates that the scene is 5 seconds from the start of the file. The person information, motion information, and conversation information are information indicating the characteristics of the image signal / audio signal of the scene. Next, personal information, motion information, and conversation information, which are information indicating the characteristics of the image signal / audio signal of each scene, will be described.

Person information is information including the presence or absence of a person in the scene. Furthermore, information indicating the number of persons, the personal name, the posture, the size of the person area, and the distribution pattern of a plurality of persons may be included. The motion information is information indicating the presence / absence and type of motion in the scene. The movement of each object may be shown, and the movement for each area may be shown. Conversation information is information indicating the volume and type of sound (silence, human voice, music, etc.) for a scene. Furthermore, information for identifying a speaker and information on a sound source such as music type may be included. In FIG. 2, the three types of information are represented as index numbers corresponding to predetermined types.

With respect to the person information, for example, three types of values of “no person (0)”, “main person (1)”, and “other person (2)” can be taken. No person (0) means that there is no or almost no figure of the person throughout the scene. The main character (1) means that one or two persons are shown in the scene and the area is larger than a predetermined size. For example, a scene in which a photographer has photographed a specific person with an intention corresponds. The other person (2) means that a person is captured in the scene, but the number of persons is large, or the captured area is smaller than a predetermined size. For example, it corresponds to a group photo-like scene including a specific person, or a scene shot so that people can see how people move without knowing who they are.

Regarding the motion information, for example, three types of values of “no motion (0)”, “part of motion (1)”, and “entire motion (2)” can be taken. No motion (0) means that there is almost no change in the image throughout the scene. Also, the motion part (1) means that there is motion in a part of the image area in the scene. For example, a scene in which a person is dancing in front of a fixed camera corresponds. Further, the entire motion (2) means that there is a motion over the entire image area in the scene. For example, it corresponds to a scene shot while moving the camera itself horizontally.

Concerning conversation information, for example, three kinds of values of “no sound (0)”, “conversation exists (1)”, and “other voice (2)” can be taken. No sound (0) means that no usable sound signal is recorded, for example, the sound signal level is extremely low throughout the scene. Also, “with conversation (1)” means that a voice including a conversation of a person is recorded in the scene. The other voice (2) means that a sound signal of a predetermined level or higher is continuously recorded although it is not a conversation. For example, a scene where music is flowing corresponds.

The scene information generation unit 102 determines and generates the scene information as described above by analyzing the image signal and the audio signal in the image data. At that time, for example, when an image signal and an audio signal are analyzed in units of length of 1 second, and there is no change in the three types of information indicating the characteristics of the image signal / audio signal, the scene is regarded as one continuous scene. Generate information. On the other hand, when a change occurs in any of the three types of information, the change is regarded as a scene break, and one piece of moving image data is divided into a plurality of scenes to generate respective scene information.

Furthermore, the scene information generation unit 102 may generate scene information so as to exclude scenes unsuitable for inclusion in the digest moving image in the process of generating scene information. For example, in the process of image signal analysis, scenes that are likely to not know what was captured even if you look at the image, such as when there is a sudden movement of the entire image or when there is no focus The information generation unit 102 does not generate scene information itself. Alternatively, the scene information generation unit 102 generates a digest inappropriate flag indicating that the scene is not suitable for inclusion in the digest moving image. Thus, for example, when starting to take a moving image with a digital camera, a smartphone, or the like, a scene that is not useful for inclusion in the digest moving image can be eliminated, such as when a large camera shake or focus shift occurs.

In the

scene information

200, 209 and 211 are scene information corresponding to a scene that is a still image. Since there is no time element in the still image, there is no scene start frame number and scene end frame number, which are information indicating the time range of the scene. In addition, since there is no movement or sound on the image, neither movement information nor conversation information exists. Such non-existing information is represented by a symbol * in FIG. On the other hand, with respect to the person information, the scene information generation unit 102 analyzes the image signal of the still image, and “no person (0)”, “main person (1)”, “other person (2)” like the moving image. Give any information.

Here, the relationship between the scene start frame number and the scene end frame number and the shooting time in the information included in the scene information will be described. The shooting time of the image file is normally recorded as the time when the image is recorded as a file, that is, the time when shooting is completed. When the scene information generation unit 102 does not divide the image file into a plurality of scenes in the process of generating scene information of a certain image file, the shooting time of the image file corresponds to the shooting time of the corresponding scene as it is. However, when the scene information generation unit 102 divides one image file into a plurality of scenes, the time corresponding to the shooting time of each scene may not match the time indicated by the shooting time of the original image file. is there. Therefore, when the scene information generation unit 102 divides one image file into a plurality of scenes, the shooting time ShootTime recorded for the image file, the frame rate fr_rate of the moving image, the number of frames FALL, and each divided The time EndTime (EndTime = ShootTime- (FALL-Send) / fr_rate) corresponding to the end time of each scene is calculated from the relationship with the scene end frame number Send of the scene, and this is used as the shooting time for each scene. Generate information. By doing so, when referring to the scene information later, the image file name and frame number are not referred to, and only the shooting date and time are compared to compare the scenes in time. It becomes possible to specify the relationship. The scene information generation unit 102 acquires the time length of each scene in the process of analyzing the image signal or audio signal of each scene, and the acquired scene length and the shooting time of the image data including the scene. Based on this comparison, adjustment is made so that an appropriate shooting date and shooting time are recorded. For example, even if the shooting date as an image file is recorded as January 1, 2014, and the shooting time is recorded as 0:00 am, the length of the entire moving image of the image file is several minutes. In this case, when the head part of the image file is extracted as a scene, the actual shooting date of the extracted scene is December 31, 2013 which is different from the shooting date of the image file, and the shooting time is For example, 23:55:00. In this way, by calculating an appropriate shooting date and time for each scene and recording it as scene information, a scene included in the image data group to be edited determined by the event selection unit 104 is appropriately selected. The

Note that the scene start frame number and the scene end frame number generated as the scene information may be replaced with other information specifying the temporal position in the image file. For example, instead of the scene start frame number and the scene end frame number, a character string indicating the scene start time of each scene and a character string indicating time information (= elapsed time from the scene start time) corresponding to the scene length are generated. May be. Alternatively, instead of the scene start frame number and the scene end frame number, an image file elapsed time indicating the scene start and an image file elapsed time indicating the scene end may be generated as the scene information. The elapsed time in the image file is expressed, for example, in seconds, milliseconds, or seconds + frame number with respect to the head of the image file. The information for specifying the temporal position of the scene may be represented by a character string as described above, or represented by a numerical value (for example, a numerical value representing an elapsed time with reference to a predetermined date and time or time). Also good. Further, information representing the frame rate of the moving image may be included.

In the scene information 200 of FIG. 2, the example in which the person information, the motion information, and the conversation information are represented by numerical values is shown, but may be represented by a character string that represents the feature of each information. For example, with regard to the person information, characters such as “NO_HUMAN” meaning “no person (0)”, “HERO” meaning “main person (1)”, “OTHERS” meaning “other person (2)”, etc. Represented by a column. Similarly, movement information and conversation information may be represented by character strings.

2 shows an example in which scene information, person information, motion information, and conversation information are indicated by numerical values (indexes corresponding to predetermined types). As described above, in addition to the numerical index, it may be stored as a character string corresponding to a predetermined type, or instead of a single numerical value or character string, parameters (number of people, motion vector, volume at each frequency, etc.) May be stored as a set of Furthermore, the data need not be readable text data, and may be binary data.

(Digest video generator)
Next, details of processing contents in the digest moving image generation unit 103 will be described. FIG. 3 is a conceptual diagram illustrating a process of generating a digest moving image by the video editing apparatus according to the present embodiment. As shown in the figure, the digest moving image generation unit 103 reads the corresponding scene information 303 for the selected image data group 302 of the image data group 301, and performs the digest according to the predetermined digest moving image generation policy 305. Generate a moving image. A group of image data 302 for which a digest moving image is to be generated is, for example, all image data photographed on a certain day. This image data group is determined by the image data classification unit 101 and the event selection unit 104 as described above. In this case, the event selection unit 104 notifies the digest moving image generation unit 103 of a parameter indicating “shooting date = YY year ΔΔ month □□ day” as selection information 304 indicating the target image data group. To do. The digest moving image generation unit 103 refers to the scene information generated by the scene information generation unit 102 from the top, and reads the scene information of the scene corresponding to the selection information. Next, the digest moving image generation unit 103 refers to the read scene information in the order of shooting date and shooting time, and selects the type of scene, such as a scene used alone and a scene used in combination with other scenes. decide. Based on the determined scene type, the digest moving image generation unit 103 generates

image clips

306a, 306b, 306c,... That are spatially arranged image data, and converts a plurality of image clips into time. Are combined to generate a digest video 307. In FIG. 3, notations such as S01, S02, and S03 each represent a scene. The notation “S01 + S02” in the image clip 306a indicates that the image clip 306a is an image clip in which both the scene S01 and the scene S02 are spatially arranged. The image clips 306a, 306b, 306c, and the like are still images or moving images that include at least one scene and have an appropriate length (for example, 1 second or longer).

FIG. 7 shows an internal configuration of the digest moving image generating unit 103 in the present embodiment. The digest moving image generation unit 103 includes a target image extraction unit 1031, a scene type determination unit 1032, a scene space arrangement unit 1033, a scene time arrangement unit 1034, and a digest control unit 1035.

The target image extraction unit 1031 refers to the selection information indicating the target image data group notified from the event selection unit 104, and extracts an input image when generating a digest moving image. Information indicating the extracted image data is notified to the scene type determination unit 1032 and the scene space arrangement unit 1033. The scene type determination unit 1032 refers to the scene information generated by the scene information generation unit 102, reads the scene information of the scene corresponding to the information indicating the image data extracted by the target image extraction unit 1031, and determines the scene type. decide.

FIG. 4 shows an example of the relationship between the scene information and the scene type determined by the scene type determination unit 1032. FIG. 4 shows an example of scene information in the same manner as FIG. 2, and each

row

401, 402, 403... Included in the scene information 400 describes scene information corresponding to one scene. Hereinafter, in the description related to FIG. 4, the

scene information

401, 402, 403... Is also described as meaning the scene itself in order to simplify the description.

The scene type determination unit 1032 refers to the scene information 400, compares the shooting times of two consecutive scenes in the order of the shooting times, and the difference ΔT between the two shooting times is within a predetermined threshold (scene proximity determination threshold THt). Each scene is determined as a “single scene” to be used alone or a “combined scene” to be used in combination depending on whether or not they are close to each other. The scene proximity determination threshold THt is, for example, 5 minutes (= 300 seconds). Since the difference ΔT between the shooting times of the

scenes

401 and 402 is ΔT = 1 minute 41 seconds = 101 seconds <THt, the scene 401 and the scene 402 are determined as “combined scenes”. Similarly, the scene 403 and the scene 404 are determined as “combined scenes” because they are close in time, and the

scenes

405 and 406 are also determined as “combined scenes” because they are close in time (in FIG. A set of scene information enclosed in brackets indicates a combined scene). The scene type determination unit 1032 further determines a main scene or a sub-scene for each scene determined as a combination scene as follows. The scene type determination unit 1032 refers to the person information, motion information, and conversation information included in the scene information of each scene. If it is determined that the scene is a main scene, the scene type determination unit 1032 determines that the scene is a main scene. Each scene is classified. For example, in the example of FIG. 4, since the person information of both the scene 401 and the scene 402 is “main person (1)”, both are determined to be main scenes and classified as main scenes. Since the scene information of the scene 403 is “main person (1)”, the scene 403 is determined to be a main scene and classified as a main scene. Since the scene information of the scene 404 is “other person (2)”, the scene 404 is determined not to be a main scene and is classified as a sub-scene. In the

scenes

405 and 406, the person information is “other person (2)” and “no person (0)”, respectively, and the scene 405 is relatively more important than the scene 406 and is a main scene. The scene 405 is classified as a main scene and the scene 406 is classified as a sub-scene.

It should be noted that in the examples of the scene 405 and the scene 406, since the person information of any scene is other than “main person”, it may be determined that both are not main scenes. In that case, it may be determined that a less important scene (scene 406 in the above example) is not used for the digest. By making such a determination, it is possible to prevent a part of a plurality of non-major scenes close in time from being used for the digest moving image, and reduce the redundancy of the generated digest moving image.

Next, the scene space placement unit 1033 determines the spatial placement of each scene and generates an image clip in which the scene is spatially placed. FIG. 5 shows an example of scene arrangement by the scene space arrangement unit 1033. The scene space arrangement unit 1033 determines the spatial arrangement (layout) of each scene based on the scene type determined by the scene type determination unit 1032 and the relationship between the scene information of the combination scenes as described above. For example, since the scene 401 and the scene 402 shown in the example of FIG. 4 are both main scenes, they are determined to be “parallel arrangement”, which is an arrangement that displays them in parallel at the same size (FIG. 5A). . At this time, since the person information of each of the

scenes

401 and 402 is “main person (1)”, there is a high possibility that a person appears in the central area of each scene. Therefore, the central area of each scene is cut out and placed in

areas

501 and 502, respectively.

In the following example, since the scene 403 and the scene 404 are the main scene and the sub scene, respectively, the main scene is displayed in the area 503 in the central portion of the screen while the sub scene is displayed on the entire image frame so that the main scene is noticed. The central area of the scene is determined to be “central arrangement”, which is an arrangement for superimposing and displaying (FIG. 5B). The reason why the central area of the main scene is displayed in a superimposed manner is that the person information of the scene 403 that is the main scene is “main person (1)”. A scene whose person information is “main person (1)” means that one or two persons having a relatively large size are captured in the image frame. In such a scene, the photographer intends to photograph a specific person, and therefore, the region where the person is photographed is highly likely to be the central portion of the image frame. Therefore, a central area where a person is highly likely to be captured in the main scene is cut out and placed in an area 503 at the center of the screen so that the area where the person is captured is noted. In FIG. 5B, instead of the central area of the main scene, the entire image frame of the main scene may be reduced and displayed in the area 503. As another example of “center arrangement”, an arrangement as shown in FIG. 5D may be selected. In the arrangement of FIG. 5D, the sub-scene is displayed in the entire image frame as in FIG. 5B, and the central area of the main scene is cut out larger than that in FIG. 5B and arranged in the area 507. It is a thing. In the arrangement of FIG. 5D, the display area of the sub-scene is smaller than that in FIG. The scene space arrangement unit 1033 selects this arrangement when, for example, the motion information of the sub-scene is “whole movement (2)”. By displaying a scene with large movement in the background area 504 of the main scene, a dynamic feeling is produced as an entire image, and the area of the background area 504 is smaller than that in FIG. It is possible to display an image layout that does not prevent viewing of the main scene area 507.

As another example, the scene 405 and the scene 406 shown in FIG. 4 are the same as the scene 403 and the scene 404 in that the scene 405 and the scene 406 are the main scene and the sub scene, respectively. Is “other person (2)”, it is unlikely that a specific area such as the center area of the image has an important meaning in the scene 405. Therefore, while the main scene is arranged in the entire image frame, an image obtained by reducing the sub-scene is determined as the “sub-screen arrangement”, which is an arrangement for superimposing the main scene on the main scene (FIG. 5C). At this time, the size of the small-screen area 506 is determined to be smaller than the main scene area (503, 507) in the above-described central arrangement. The reason is that the scene to be noticed is basically the main scene and the sub-scene is not particularly noticeable. For example, the size of the area 503 where the main scene in FIG. 5B, which is the central arrangement, is about 1/4 of the entire image frame (the area 507 in FIG. 5D is the number of pixels in the horizontal direction). Is about 1/2 of the total number of pixels in the horizontal direction of the entire image frame), and the size of the area 506 in which the sub-scene is arranged in the sub-screen arrangement is set to about 1/9 of the entire image frame. Cut out from the original image or reduce the original image. Thus, by making a difference in the size of the area where the scene is arranged, it is possible to make the area or scene to be noticed stand out.

Note that another example of the “small screen arrangement” shown in FIG. 5C is shown in FIG. In the example of FIG. 5E, the main scene is arranged in the area 505 in the same manner as in FIG. 5C, but the area 508 in which the sub-scene is arranged is different from the area 506 in FIG. 5C. It was changed to a spatial position. 5 (c) and FIG. 5 (e) is characterized in that the sub-scene is arranged in an area that does not hinder attention to the main scene. As an example of the case where the arrangement shown in FIG. 5E is determined instead of the arrangement shown in FIG. 5C, an area 506 on the area 505 where the main scene is arranged in the process of scene analysis in the scene information generation unit 102. In other cases, it is found that a person or a part of the person is reflected. In such a case, the area where the sub-scene is superimposed is changed to the area 508 instead of the area 506 so that the person area on the area 505 where the main scene is arranged is not hidden in the sub-scene. By changing the arrangement of the scenes in this way, attention to the main image areas shown in the main scene can be prevented from being disturbed.

Furthermore, a spatial filter may be applied to some scenes to make an image that emphasizes the difference between the main scene and the sub-scene. For example, if the sharpness of the image is reduced by applying a smoothing filter to the region 504 in FIGS. 5B and 5D, the difference between the central region displaying the main scene and the peripheral region displaying the sub-scene is at a glance. You will be able to understand and the areas of interest will become clearer. Whether or not to apply such a spatial filter is determined based on, for example, the similarity between the images of the main scene and the sub-scene. For example, the scene space arrangement unit 1033 applies a smoothing filter to the sub scene when the similarity between the main scene and the sub scene is high, and does not apply the smoothing filter to the sub scene when the similarity is low. For example, in the central arrangement as shown in FIGS. 5B and 5D, the average values for the color components of the pixel values in the images of the

main scene region

503 or 507 and the sub-scene region 504 are compared and averaged. When the difference between the values is smaller than the predetermined value, that is, when the similarity between the pixel values between the

regions

503 and 507 and the region 504 is high, it is determined to apply the spatial filter to the region 504. This makes it easier to focus on the

areas

503 and 507, that is, the main scene, compared to the case where no spatial filter is applied, and the image can be easily viewed as a whole. The spatial filter is not limited to the smoothing filter, and may be a color conversion filter that changes the color tone for each region. For example, the scene space arrangement unit 1033 may convert the sub-scene into gray scale or so-called sepia tone. If the image of the area 504 is converted to a gray scale or so-called sepia tone by color conversion, the

areas

503 and 507 which are main scenes can be easily noticeable. Alternatively, the scene space arrangement unit 1033 does not apply a spatial filter, but makes the change in the time direction of the image in the region 504 zero, that is, makes a still image, so that the difference from the

regions

503 and 507 that are the main scenes is different. May be emphasized.

Note that more than two scenes may be arranged on one screen. FIG. 5F shows an example in which three scenes are arranged. The example shown in FIG. 5F is an arrangement example in the case where all of the person information of three scenes close in time are “main person (1)”. In this case, the scene space arrangement unit 1033 determines that the three scenes are combination scenes with each other, and determines all of them as main scenes. Since three are main scenes, the central area of each scene is cut out and arranged in parallel in

areas

509, 510, and 511 so that they have the same size. When multiple scenes are placed in the same image clip, if the scenes are not the same length in time, other scenes can be matched to the scene with the shortest time between scenes placed in the same image clip. Cut off part of the scene and adjust.

The scene space arrangement unit 1033 outputs the image clip generated by the above method to the scene time arrangement unit 1034.

The scene time arrangement unit 1034 further combines the image clips in which the scenes are arranged spatially as described above in the time direction. 3,

image clips

306a, 306b, 306c,... Each correspond to an image clip composed of only a single scene or an image clip in which a combination scene is arranged. The scene time arrangement unit 1034 combines a plurality of image clips according to the context of the shooting time of the scene corresponding to each image clip. When an image clip is composed of combination scenes, that is, when one image clip contains multiple scenes, the shooting time of that image clip is the scene information of the scene with the latest shooting time among the multiple scenes included in the image clip. It is regarded as shooting time information.

The combination scene described above is a scene in which the difference in shooting time is relatively small, that is, the shooting time is close compared to the length of the entire event. There is a high possibility that scenes whose shooting times are close to each other are the same or similar scenes. In the generation of digest moving images, if scenes with high similarity are combined so as to be continuous in time, the generated digest moving images will become redundant when the similar scenes continue to be easily bored. . Therefore, by arranging scenes with high similarity in parallel in parallel or including them in a part of the same frame, it is possible to effectively use a large number of captured images and diversify the display layout. . Thereby, it is possible to generate a digest moving image that is difficult to get tired of, and to improve the satisfaction of the user.

Here, how to handle an audio track when generating a digest video will be described. As an audio track for generating a digest moving image, an audio track included in image data corresponding to each scene used for the digest moving image is used as it is. At that time, when the scene to be used is a single scene, the audio track of the scene is used as it is. However, in the case of a combination scene, since there are a plurality of audio tracks, the audio track to be used is determined by the method described below. When the combination scene arrangement is other than “parallel arrangement”, that is, “center arrangement” or “sub-screen arrangement”, the audio track of the main scene is used as the audio track of the digest moving image. When the arrangement of the combination scene is “parallel arrangement”, the audio track of each scene is used so as to be allocated to the left channel and the right channel of the audio track of the digest moving image in accordance with the positional relationship of the arranged scenes. By doing in this way, the scene to be noticed as an image matches the sound that can be heard, and the digest moving image can be viewed without a sense of incongruity.

(Switching digest video generation method)
Next, switching of the generation method when the video editing apparatus 100 generates a digest moving image will be described. The digest control unit 1035 changes the digest moving image generation method (generation algorithm) according to the digest moving image generation policy determined by the output control unit 105. Specifically, whether or not to include a scene in the digest, switching judgment criteria of the main scene and sub-scene, presence / absence and arrangement pattern of multiple scenes, image coding quality, audio coding quality, etc. A digest video is generated. Changes in the digest moving image generation method will be described in detail below.

In the digest moving image generation unit 103, the scene type determination unit 1032 determines whether or not each scene included in the image data group that is the generation target of the digest moving image is a main scene. At this time, the scene type determination unit 1032 may make the above determination based on a scene selection criterion included in the digest moving image generation policy. For example, the above description is used to indicate that the scene selection criterion is “human subject”, and when the scene selection criterion is different from this, the digest control unit 1035 determines the criterion for determining the main scene. The information indicating the determination criterion is notified to the scene type determination unit 1032, and the scene type determination unit 1032 determines the type of the scene according to the information. For example, in the case of “landscape-based”, a scene other than a scene capturing a person's figure or conversation, that is, a scene mainly including scenery such as nature is determined as a main scene. For example, in combination scenes that are close in time, scenes whose person information is “no person” or whose conversation information is other than “conversation” are classified as main scenes, and other combination scenes Are classified into sub-scenes. For single scenes that do not have temporally close scenes, select only scenes with the person information “No People”, and other scenes, that is, scenes with people in them, are converted into digest video images as single scenes. Select a scene not to use. With such a configuration, it is possible to preferentially select a scene that matches a specified feature and generate a digest moving image that reflects user preferences.

Based on the multiple scene simultaneous arrangement included in the digest moving image generation policy, the digest control unit 1035 switches whether or not to arrange in the same image frame when a plurality of scenes are close in time. Also good. The digest control unit 1035 determines whether or not to arrange a plurality of scenes in the same image frame, and notifies the scene type determination unit 1032 and the scene space arrangement unit 1033. When the multiple scene simultaneous arrangement notified by the digest control unit 1035 is “possible”, the scene space arrangement unit 1033 treats a plurality of scenes close in time as a combined scene as described above, and uses the same image frame. A digest moving image is generated so as to be placed inside. On the other hand, when the multiple scene simultaneous arrangement is “NO”, the scene space arrangement unit 1033 treats each scene as a single scene and generates a digest moving image so as not to arrange the scenes in the same image frame. As already described with respect to the output control unit 105, when the screen of the display device of the output destination is small, the output control unit 105 sets “simultaneous placement of multiple scenes”, for example, a child whose scene is reduced. It is possible to avoid selecting the layout of the screen layout and not to impair the visibility of the generated digest.

The digest control unit 1035 determines whether to encode an image or sound based on output destination information included in the digest moving image generation policy. When not encoded, the generated digest moving image is output to a built-in video display unit or an externally connected video display device so as to be displayed and reproduced as it is. In the case of encoding, when a digest video is generated, an image or sound is encoded according to a predetermined encoding method, and a digest moving image is output as encoded data. The encoding method used for encoding is, for example, MPEG-2, AVC / H. H.264, HEVC / H. In accordance with a method such as H.265, the sound follows a method such as MPEG-1, AAC-LC, HE-AAC. The digest control unit 1035 is a method having the highest performance as a basic encoding method for encoding, for example, an image is HEVC / H. 265. It is assumed that HE-AAC is used for audio, and the encoding method and encoding quality to be actually used are determined based on output image specifications and output audio specifications described later. The encoding method and encoding quality will be described later.

The digest control unit 1035 determines the image coding quality of the generated digest moving image and the arrangement pattern of a plurality of scenes based on the output image specifications included in the digest moving image generation policy. The output image specification includes at least information indicating the number of display pixels of an output destination video display device. The number of display pixels is composed of the number of pixels in the horizontal direction and the number of pixels in the vertical direction. As a result, the screen aspect ratio of the display device is also found. When the number of pixels and the screen aspect ratio of the input image to be edited match the number of display pixels and the screen aspect ratio of the output destination, the digest control unit 1035 performs a digest so as to maintain the number of pixels of the input image as it is. Generate a moving image. If the number of pixels of the input image and the screen aspect ratio do not match the number of display pixels of the output destination and the screen aspect ratio, the number of pixels of the input image should be maintained or utilized within the range not exceeding the number of display pixels of the output destination. Then, the arrangement of the scene is determined and a digest moving image is generated. The arrangement of the scene when the number of pixels of the input image and the screen aspect ratio do not match the number of display pixels of the output destination and the screen aspect ratio will be described later. When the digest moving image is encoded and recorded / transmitted as a file based on the output destination information, the digest control unit 1035 determines the image encoding rate based on the number of display pixels of the output destination. For example, a plurality of information indicating the number of pixels and an information table indicating the corresponding encoding rate are prepared, and the digest control unit 1035 displays the encoding rate corresponding to the display pixel number of the output destination in the information table. To determine. In addition, when the output image specifications include the output image reproduction capability in addition to the number of pixels, the digest control unit 1035 determines an encoding method for encoding the digest moving image according to the image reproduction capability. To do. For example, an image encoding method supported by an externally connected video display apparatus is MPEG-2 and AVC / H. In the case of H.264, the digesting control unit 1035 performs HEVC / H. Instead of selecting H.265, AVC / H.264 is set in accordance with the image reproduction capability indicated by the output image specification. H.264 is selected as the encoding method, and the digest moving image is encoded.

The digest control unit 1035 determines the audio coding quality of the digest video to be generated and the configuration of the audio track based on the output audio specifications included in the digest video generation policy. The output audio specification is information indicating an audio reproduction capability of at least an output destination audio output device, that is, an audio output unit built in the video editing apparatus 100 or an audio output unit of an externally connected video display apparatus. It is configured including the number of channels, sampling frequency, number of quantization bits, and the like. The digest control unit 1035 determines whether or not to use the audio track of the scene to be included in the digest moving image and channel allocation according to the number of output audio channels in the output audio specification. Also, audio resampling and bit number conversion are performed in accordance with the sampling frequency and quantization bit number of the output audio specification. Further, when the digest moving image is encoded and recorded and transmitted as a file based on the output destination information described above, information indicating the encoding method supported by the output destination audio output device can be obtained as the output audio specification. In this case, the digest control unit 1035 determines a speech encoding method based on the information. For example, if the audio encoding method supported by the externally connected video display apparatus is only MPEG-1 or AAC-LC, the digest control unit 1035 may select HE-AAC with higher encoding performance. AAC-LC is selected as the encoding method, and the audio track of the digest moving image is encoded.

Here, an example of a scene arrangement in the case where the number of pixels of the input image and the screen aspect ratio do not match the number of display pixels and the screen aspect ratio of the output destination indicated in the above-described output image specification will be described. FIG. 6 shows an arrangement example of a plurality of scenes determined by the digest moving image generation unit 103 when the screen aspect ratio of the input image is horizontally long and the screen aspect ratio of the output destination is vertically long. FIG. 6A is an example in which two scenes that are close in time are both main scenes, as in the “parallel arrangement” example of FIG. 5A, and are displayed in parallel at the same size. Arrangement. At that time, since the original image is a horizontally long image including the region 602 and the region 602 ′ and a horizontally long image including the region 603 and the region 603 ′, the original image is matched with the screen aspect ratio of the display region 601. The central area (602, 603) of each scene is cut out and arranged. 6B, as in the “center arrangement” example of FIG. 5B, the sub-scene is arranged in the entire image frame (display area) 601, and the center of the main scene is displayed in the area 604 in the central portion of the screen. It is an example arrange | positioned so that an area | region may overlap. Instead of the central area of the main scene, the entire image frame of the main scene may be reduced and displayed in the area 604. Further, as another example of “center arrangement”, an arrangement as shown in FIG. 6D may be selected. In the arrangement of FIG. 6D, the sub-scene is displayed on the entire image frame 601 similarly to FIG. 6B, and the central area of the main scene is cut out larger than FIG. It is arranged. Reference numeral 608 ′ in the figure denotes a partial area of the original image that is discarded by cutting out the area 608. FIG. 6C shows an example in which an image obtained by reducing the sub-scene is arranged as a sub-screen area 606 as in the “sub-screen arrangement” of FIG. However, in this case, since the input image is horizontally long, it is arranged in the area 604 so as to display the entire main scene, and the sub-scene is arranged in a part of the vacant area on the screen. At this time, the size of the small-screen area 606 is determined so as to be smaller than the main scene area 604 so that it can be distinguished from the main scene that is the scene to be noticed. For example, the main scene area 604 is determined so that the horizontal size is the same as the horizontal size of the entire image frame 601, and the sub-scene area 606 is horizontal in the horizontal direction of the entire image frame 601. It is determined to be about 2/3 of the size of the direction.

Furthermore, a spatial filter may be applied to some scenes to make an image that emphasizes the difference between the main scene and the sub-scene. For example, if the sharpness of the image is reduced by applying a smoothing filter to the region 605 in FIGS. 6B and 6D, the difference between the central region displaying the main scene and the peripheral region displaying the sub-scene is at a glance. You will be able to understand and the areas of interest will become clearer. In addition, in a region 607 in FIG. 6C, a main scene or a sub scene to which a spatial filter is applied may be displayed. By displaying the image on the entire image frame 601 including the region 607, the size of the displayed image is made the same as the image of the arrangement pattern other than that in FIG. 6C, so that no image is displayed in the region 607. Compared with other arrangements of image clips that may be combined in the time direction, it is possible to avoid the uncomfortable feeling that may occur when continuously viewing images. be able to.

Note that more than two scenes may be arranged on one screen. FIG. 6E shows an example in which three scenes are arranged. The example shown in FIG. 6 (e) is an arrangement example of three main scenes that are close in time as in FIG. 5 (f). Each scene is arranged in parallel in the

areas

609, 610, and 611 so as to include the central area.

Here, the display layout of a single scene when the screen aspect ratio is vertically long will be described. When arranging a single scene, for example, in the arrangement of FIG. 6B, the single scene is arranged in the area 604 and the single scene is arranged in the area 605 as well. At that time, the sharpness of the image is lowered by applying the smoothing filter as described above to the region 605. With such a configuration, as described above with reference to FIG. 6C, a spatial feeling when viewing an image can be obtained, and the size of the displayed image can be combined in the time direction. Therefore, it is possible to avoid a sense of incongruity that may occur when continuously watching images.

The above spatial filter is not limited to a smoothing filter, and may be a color conversion filter that changes the color tone of each region. For example, if the images of the

areas

605 and 607 are converted to gray scale or so-called sepia tone by color conversion, the

areas

604 and 608 that are the main scene can be made conspicuous. Alternatively, the difference from the

regions

604 and 608 that are the main scenes may be emphasized by making the change in the time direction of the images of the

regions

605 and 607 zero, that is, making a still image instead of the spatial filter.

As described above, with reference to FIG. 6, the example in which the image with the screen aspect ratio of the landscape is arranged on the screen with the screen aspect ratio of the portrait has been described. In the case of arranging on a horizontally long screen, the size and position of the area where the scene is arranged, the area cut out from each scene, and whether or not to apply the spatial filter can be determined based on the same concept.

FIG. 19 shows a scene space layout when an image with a screen aspect ratio of portrait (hereinafter referred to as “portrait image”) is placed on a screen with a screen aspect ratio of landscape (hereinafter referred to as “landscape screen”). An example of scene arrangement by the unit 1033 is shown. FIG. 19A is an example of “parallel arrangement” for a landscape screen, as in FIG. 5A. The images arranged in FIG. 19A are a main scene A that is a portrait image including areas 1902 and 1902 ', and a main scene B that is a portrait image including areas 1903 and 1903'. The scene space arrangement unit 1033 cuts out the central areas of the main scene A and the main scene B and arranges them in the

areas

1902 and 1903 so that they are displayed in parallel in the display area 1901. A region 1902 'and a region 1903' are regions that are not displayed in the main scene A and the main scene B, respectively.

FIG. 19 (b) is an example of “center arrangement” for the landscape screen, as in FIG. 5 (b). The image arranged in FIG. 19B is a main scene A which is a portrait image corresponding to the area 1904 and a sub-scene B which is a portrait image including areas 1905 and 1905 '. The scene space arrangement unit 1033 arranges the central part of the sub-scene B so as to be displayed in the area 1905 corresponding to the entire display area 1901, and the main scene A in the area 1904 located in the central part of the display area 1901. Deploy. Of the sub-scene B, an area 1905 'is an area that is not displayed.

FIG. 19D is another example of the “center arrangement” for the landscape screen. Similar to FIG. 5D for FIG. 5B, the main scene A is compared with FIG. 19B. The display area of the sub-scene B is accordingly reduced. The scene space arrangement unit 1033 arranges the central part of the sub-scene B so as to be displayed in the area 1905 corresponding to the entire display area 1901, and the central part of the main scene A is located in the central part of the display area 1901. Arrange in the area 1906. Of the main scene A, an area 1906 'is an area that is not displayed. Also, in the sub-scene B, the area 1905 'is an area that is not displayed.

FIG. 19C is an example of “child screen arrangement” for the landscape screen, as in FIG. 5C. The images arranged in FIG. 19C are a main scene A that is a portrait

image including areas

1906 and 1906 ′, and a sub-scene B that is a portrait image corresponding to the area 1907. The scene space arrangement unit 1033 arranges the central part of the main scene A in the area 1906 located in the central part of the display area 1901, reduces the sub-scene B, and arranges it in the area 1907 adjacent to the area 1906 of the main scene. To do. Of the main scene A, an area 1906 'is an area that is not displayed. The scene space arrangement unit 1033 may also be arranged to display the central part of the main scene A or the central part of the sub-scene B in the area 1908 as the background of the

areas

1906 and 1907.

FIG. 19 (e) is an example in which three scenes are arranged for the landscape screen, as in FIG. 5 (f). The images arranged in FIG. 19E are a main scene A that is a portrait image including a region 1909, a main scene B that is a portrait image including a region 1910, and a main scene that is a portrait image including a region 1911. C. The scene space arrangement unit 1033 cuts out the central areas of the main scene A, the main scene B, and the main scene C, and displays them in

areas

1909, 1910, and 1911 so that they are displayed in parallel in the display area 1901. Deploy.

Next, a description will be given of a display layout in which an image having a vertically long screen aspect ratio (portrait image) is arranged on a vertically long screen having the same screen aspect ratio (hereinafter referred to as “portrait screen”). FIG. 20 shows an example of scene arrangement by the scene space arrangement unit 1033 when the portrait image is arranged on the portrait screen. FIG. 20A is an example of “parallel arrangement” for a portrait screen in which two scenes are arranged in the vertical direction. The images arranged in FIG. 20A are a main scene A that is a portrait image including a region 2002 and a main scene B that is a portrait image including a region 2003. The scene space arrangement unit 1033 cuts out the central areas of the main scene A and the main scene B and displays them in the

areas

2002 and 2003 so that they are displayed in parallel in the vertical direction in the display area 2001 corresponding to the portrait screen. Deploy. Note that in FIG. 20, an area that is not displayed when the image area is cut out is not illustrated and will be described separately with reference to FIGS. 21 and 22.

FIG. 20B shows an example of “center arrangement” for a portrait screen in which the sub-scene is arranged as a background over the entire display area and the main scene is arranged so as to be superimposed on the center portion. The images arranged in FIG. 20B are a main scene A that is a portrait image including a region 2004 and a sub-scene B that is a portrait image including a region 2005. The scene space arrangement unit 1033 arranges the sub-scene B in the area 2005 corresponding to the entire display area 2001, cuts out the central area of the main scene A, and arranges it in the central area 2004 in the vertical direction of the display area 2001.

FIG. 20C shows a “child screen arrangement” for a portrait screen in which the main scene A is arranged in an area corresponding to the entire display area, and the sub scene B is arranged as a child screen area so as to be superimposed on the main scene. It is an example. The images arranged in FIG. 20C are a main scene A that is a portrait image corresponding to the area 2006 and a sub-scene B that is a portrait image corresponding to the area 2007. The scene space arrangement unit 1033 arranges the main scene A in the area 2006 corresponding to the entire display area 2001, and arranges the sub scene B so as to fit in the area 2007 having a size smaller than a quarter of the area of the entire display area 2001. To do. The size of the area 2007 is, for example, about 1/9 of the area of the entire display area 2001.

FIG. 20D shows an example in which three scenes are arranged in the vertical direction for the portrait screen. The images arranged in FIG. 20D are a main scene A that is a portrait image including a region 2008, a main scene B that is a portrait image including a region 2009, and a main scene that is a portrait image including a region 2010. C. The scene space arrangement unit 1033 cuts out the central areas of the main scene A, the main scene B, and the main scene C and displays them in the

areas

2008, 2009, and 2010 so that they are displayed in parallel in the vertical direction in the display area 2001. Deploy.

As described above with reference to FIGS. 19 and 20, the main scene and the sub-scene to be output are output regardless of whether the screen aspect ratio of the output video display device is landscape (landscape screen) or portrait (portrait screen). Each of these is an example of a scene arrangement in the case of a vertically long screen aspect ratio image (portrait image). Similarly, what has been described with reference to FIGS. 5 and 6 is an example of a scene arrangement in the case where both the main scene and the sub scene to be output are images having a horizontally long screen aspect ratio (hereinafter referred to as “landscape images”). It is. However, the main scene and the sub scene arranged on the same screen are not necessarily images having the same screen aspect ratio. Next, a method for determining an image area to be output for display from each image when images having different screen aspect ratios are mixed in the main scene and the sub-scene will be described with reference to FIGS.

In the scene space arrangement unit 1033, when determining the arrangement of a plurality of scenes to “parallel arrangement”, “center arrangement”, “sub-screen arrangement”, etc., the size and screen aspect ratio of the display area of the video display device of the output destination and Based on the image size and aspect ratio of each scene to be arranged, an area to be displayed in the image of each scene is determined. At this time, image processing such as scaling (enlargement / reduction) and cropping (cutting) of each image is performed so that the image of each scene can be used most effectively according to the arrangement pattern. These image processing steps will be described with reference to FIGS. 21 and 22.

FIG. 21 shows a processing example of image scaling and cropping in the scene space arrangement unit 1033 when an image is output to the landscape screen. As for the case of outputting an image to the landscape screen, examples of the scene arrangement are shown in FIGS. 5 and 19, but in FIG. 21, which areas 2101 ′ to 2104 ′ are extracted from the original images 2101 to 2104 for display. An example of how to determine this will be described. In the figure, the shaded area indicates a display area extracted from each original image. Further, Ho and Vo mean the horizontal size (number of pixels) and the vertical size (number of pixels) of the display area of the output destination, respectively, and H and V are the values before scaling and cropping processing. It means the size (number of pixels) in the horizontal direction and the size (number of pixels) in the vertical direction of the original image.

FIG. 21A shows an example of how to determine the display area 2101 ′ when the landscape image 2101 is used as the main scene of the “parallel arrangement” for the landscape screen as shown in FIG. 5A. . The scene space arrangement unit 1033 first scales the entire original image 2101 so that the vertical size V of the original image 2101 matches the vertical size Vo of the display area of the output destination (V → Vo). Thereafter, the scene space arrangement unit 1033 crops the central portion of the scaled original image 2101 so that the horizontal size becomes Ho / 2, and extracts a display area 2101 ′. When scaling the original image 2101, enlargement / reduction is performed so as to maintain the screen aspect ratio of the original image so as not to cause distortion of the image in the scene. In other words, scaling is performed so that the size ratio in the horizontal direction and the size ratio in the vertical direction are the same before and after scaling. For example, when the original image whose image size is H × V is scaled so that the vertical size V matches the vertical size Vo of the display area as described above, the vertical size of the scaled image V ′ can be expressed as V ′ = (Vo / V) × V, and the size ratio before and after scaling is Vo / V. Accordingly, scaling is performed so that H ′ = (Vo / V) × H for the horizontal size H ′ of the scaled image. Hereinafter, all the scaling processing in the description related to FIGS. 21 and 22 is performed based on the same concept.

FIG. 21B shows an example of how to determine the display area 2102 ′ when the portrait image 2102 is used as the main scene of the “parallel arrangement” for the landscape screen as shown in FIG. is there. The scene space arrangement unit 1033 first scales the entire original image 2102 so that the horizontal size H of the original image 2102 is adjusted to one half (= Ho / 2) of the horizontal size of the display area of the output destination. (H → Ho / 2). Thereafter, the scene space arrangement unit 1033 crops the central portion of the scaled original image 2102 so that the vertical size is Vo, and extracts a display area 2102 ′. The scene space arrangement unit 1033 also uses the portrait image 2102 as the main scene of the “center arrangement” for the landscape screen as shown in FIG. 5B, according to FIG. 21B. Display area 2102 ′ may be determined from 2102.

FIG. 21C shows how to determine the display area 2103 ′ when the landscape image 2103 is used as another “center arrangement” main scene for the landscape screen as shown in FIG. 5D. It is an example. The scene space arrangement unit 1033 first scales the entire original image 2103 so that the vertical size V of the original image 2103 matches the vertical size Vo of the display area of the output destination (V → Vo). Next, the scene space arrangement unit 1033 sets the central portion of the original image 2103 so that the horizontal size is half the horizontal size (= Ho / 2) of the display area of the output destination, and the vertical direction The display area 2103 ′ is extracted by cropping the size so that the size is smaller by a predetermined number of pixels Ω (= Vo−Ω) than the vertical size of the display area of the output destination. The predetermined number of pixels Ω is determined to be, for example, 5% of the vertical size Vo of the output destination display area.

FIG. 21D shows an example of how to determine the display area 2104 ′ when the portrait image 2104 is used as the main scene of the “child screen arrangement” for the landscape screen as shown in FIG. 5C. It is. First, the scene space arrangement unit 1033 scales the entire original image 2104 so that the horizontal size H of the original image 2104 matches the horizontal size Ho of the display area of the output destination (H → Ho). Thereafter, the scene space arrangement unit 1033 crops the central portion of the scaled original image 2104 so that the vertical size is Vo, and extracts a display area 2104 ′. The scene space arrangement unit 1033 also determines the display area 2104 ′ from the original image 2104 according to FIG. 21D when using the portrait image 2104 as a sub-scene of the “center arrangement” for the landscape screen. May be.

FIG. 22 shows an example of scene scaling and cropping in the scene space layout unit 1033 when an image is output on the portrait screen. Regarding the case of outputting an image on the portrait screen, examples of scene arrangement are shown in FIGS. 6 and 20, but in FIG. 22, regions 2201 ′ to 2203 ′ extracted for display from the original images 2201 to 2203 are displayed. An example of how to determine is described. The meanings of the symbols in the figure are the same as those in FIG.

FIG. 22A shows an example of how to determine the display area 2201 ′ when the landscape image 2201 is used as the main scene of the “parallel arrangement” for the portrait screen as shown in FIG. 6A. is there. The scene space arrangement unit 1033 first scales the entire original image 2201 so that the vertical size V of the original image 2201 is adjusted to one half (Vo / 2) of the vertical size of the display area of the output destination. (V → Vo / 2). Thereafter, the scene space arrangement unit 1033 crops the central portion of the scaled original image 2201 so that the horizontal size becomes Ho, and extracts a display region 2201 ′. The scene space arrangement unit 1033 also uses the landscape image 2201 as the main scene of the “center arrangement” for the portrait screen as shown in FIG. 6B in accordance with FIG. 22A. The display area 2201 ′ may be determined from 2201. The scene space layout unit 1033 also uses the landscape image 2201 as the main scene of the “child screen layout” for the portrait screen as shown in FIG. 6C according to FIG. 22A. The display area 2201 ′ may be determined from the image 2201.

FIG. 22B shows an example of how to determine the display area 2202 ′ when the portrait image 2202 is used as the main scene of the “parallel arrangement” for the portrait screen as shown in FIG. It is. The scene space arrangement unit 1033 first scales the entire original image 2202 so that the horizontal size H of the original image 2202 matches the horizontal size (Ho) of the display area of the output destination (H → Ho). After that, the scene space arrangement unit 1033 crops the central portion of the scaled original image 2202 so that the vertical size is one half (Vo / 2) of the vertical size of the display area of the output destination. Thus, the display area 2202 ′ is extracted. The scene space arrangement unit 1033 also uses the portrait image 2202 as the main scene of the “center arrangement” for the portrait screen as shown in FIG. 6B according to FIG. 22B. The display area 2202 ′ may be determined from the image 2202.

FIG. 22C shows an example of how to determine the display area 2203 ′ when the landscape image 2203 is used as the “center arrangement” sub-scene for the portrait screen as shown in FIG. 6B. is there. The scene space arrangement unit 1033 first scales the entire original image 2203 so that the vertical size V of the original image 2203 matches the horizontal size (Vo) of the display area of the output destination (V → Vo). After that, the scene space arrangement unit 1033 crops the central portion of the scaled original image 2203 so that the horizontal size is Ho, and extracts a display area 2203 ′. The scene space layout unit 1033 also uses the landscape image 2203 as the main scene or sub-scene of the background portion (area 607) of the “child screen layout” for the portrait screen as shown in FIG. Alternatively, the display area 2203 ′ may be determined from the original image 2203 in accordance with FIG.

As described above, when a plurality of images are combined and output from an image group that includes both landscape images and portrait images, it matches the screen size and screen aspect ratio of the output video display device. When scaling or cropping is performed for each original image, the screen aspect ratio of the original image and the screen aspect ratio of the output destination are different, or a scene arrangement in which multiple images with different image sizes or screen aspect ratios are mixed Even so, it is possible to output a high-quality video that does not cause image distortion while maximally effectively using the display area of the output screen.

When scaling is performed, either the number of pixels in the horizontal direction (H) or the number of pixels in the vertical direction (V) of the original image corresponds to the number of pixels (Ho, Ho) in the corresponding direction in the display area of the output destination. / 2, Vo, Vo / 2, etc.). By doing so, it is possible to reduce the frequency of performing the enlargement process of the original image, and to suppress the deterioration in image quality and the increase in the amount of image data accompanying the enlargement process.

As described above, by adjusting the output image specifications and output audio specifications of the generated digest moving image to the specifications and capabilities of the output destination video display device and audio output device, the digest moving image suitable for the output destination device Can be generated. Particularly for video, it is possible to generate an easy-to-see digest moving image in which a plurality of scenes are effectively arranged according to the size of the display device and the screen aspect ratio. Also, when encoding a digest video, it is possible to output high-quality video / audio that can make the most of the capabilities of the output destination device.

(Second Embodiment)
Next, a video editing apparatus according to the second embodiment of the present invention will be described. The video editing apparatus according to the second embodiment is characterized in that the digest moving image generation unit 103 is different from the video editing apparatus according to the first embodiment. Although illustration is omitted, the digest moving image generation unit in the present embodiment is configured to further include a digest moving image generation count unit and a random arrangement pattern determination unit. Hereinafter, the difference from the first embodiment will be described in detail.

In the digest moving image generation unit 103 in the present embodiment, the digest moving image generation counting unit counts the number of times a digest moving image is generated in units of image data groups indicated by the selection information notified from the event selection unit 104. The digest moving image generation counting unit notifies the random arrangement pattern determination unit of the counted number of generations. When the notified number of generations is 1, the random arrangement pattern determination unit does nothing, and when the number of generations is two or more, when determining the spatial arrangement pattern of a plurality of scenes, The arrangement pattern is changed randomly. As a result, when the digest moving image generating unit generates the digest moving image, the scene space arranging unit 1033 according to the first embodiment of the present invention has been described in the unit of the selected image data group at the first generation. As described above, the arrangement pattern of a plurality of scenes is determined based on the relationship between the type of scene and the scene information between the combination scenes. However, at the second and subsequent generations, the arrangement pattern of the plurality of scenes is randomly changed for each combination scene. Regarding the determination of the combination scene, as described with respect to the scene type determination unit 1032 in the first embodiment, it is determined that scenes close in time are selected.

With such a configuration, when a digest moving image is generated many times from the same image data group, the time sequence of the scene used for the digest moving image is generated based on the shooting time. Digest moving images having different layouts can be generated each time. As a result, it is possible for the user to view the same image data group with a fresh sensation, and to provide a digest moving image in which the user is less tired.

(Third embodiment)
Next, a video editing apparatus according to the third embodiment of the present invention will be described. The video editing apparatus according to the third embodiment is different from the video editing apparatus according to the first embodiment in that the scene information generation unit and the digest moving image generation unit are different. FIG. 8A shows the internal configuration of the video editing apparatus 100a according to the present embodiment. The video editing apparatus 100a includes an image data classification unit 101, a scene information generation unit 102a, a digest moving image generation unit 103a, an event selection unit 104, and an output control unit 105. Hereinafter, the difference from the video editing apparatus 100 according to the first embodiment will be described in detail.

(Scene information generation unit 102a)
The scene information generation unit 102a analyzes the image data, classifies the image data into one or more scenes characterized by an image signal or an audio signal, and generates scene information that is information indicating the feature of each scene. The scene information is configured to include “number of persons”, “maximum person size”, and “maximum person position” as information regarding the feature region in the image (hereinafter, these three types of information are collectively referred to as person information. Called). “Number of persons” represents the maximum number of image areas (person areas) of persons appearing in the image of each scene in units of image frames. “Maximum person size” represents the number of person areas in each scene. The size of the area with the largest area is represented, and the “maximum person position” represents the position (coordinate in the image) of the area corresponding to the maximum person size. The scene information generation unit 102a detects a face image and a whole body image in the image as feature regions of each scene, and when a face image is detected, generates scene information from information related to the region of the face image and detects the face image. If it is not performed (such as a person appearing but looking sideways or behind), scene information is generated from information about the whole body image area. As a method for detecting a face image, for example, image feature amounts are extracted in units of regions of a predetermined size, and face regions are detected (identified) based on a face image discriminator using Haar-Like feature amounts. There is a way. Further, as a method for detecting a whole body image, a gradient direction histogram (HOG: Histograms of oriented gradients) is calculated in a predetermined image area unit, and a whole body image region based on a whole body image discriminator using the HOG feature amount is calculated. There is a method of detecting (identifying). The method for detecting the face image and the whole body image is an example, and the method is not limited to the above method as long as the size and position of the area to be detected can be obtained. Also, not only the face image and the whole body image, but also the upper body image and lower body image areas are detected based on a classifier using separately prepared feature amounts. For example, when the face image and the whole body image are not detected, the upper body image Scene information may be generated based on information (number, size, position) on the area, and if no upper body image is detected, scene information based on information (number, size, position) on the lower body image area May be generated.

FIG. 9 is a diagram showing the concept of the person information. FIG. 9A shows an example of a scene in which the person area 701 is located at the coordinates 702 (x1, y1) and the size thereof is (H1 × V1). Thus, when the number of persons (number of person areas) in the scene is one, the “maximum person size” and the “maximum person position” are uniquely determined as (H1 × V1) and (x1, y1), respectively. In FIG. 9B, two person regions 703 and 704 are located at coordinates 705 (x2, y2) and 706 (x3, y3), respectively, and the sizes of the regions are (H2 × V2) and (H3 ×), respectively. It is an example of the scene which is V3). As shown in FIG. 9B, when the number of persons in the scene is two, information indicating the size (H2 × V2) of the area 703 having the larger area out of the two person areas (703, 704) is “ “Maximum person size” is defined, and information indicating the coordinates (x2, y2) of the area 703 is defined as “maximum person position”. FIG. 9C is an example of a scene including four person areas 707, 708, 709, and 710. In the example of FIG. 9C, it is assumed that the person area 707 has the largest area among the four person areas. In this case, information indicating the size (H4 × V4) of the area 707 is defined as “maximum person size”, and information indicating the coordinates (x4, y4) of the area 707 is defined as “maximum person position”.

FIG. 10 shows an example of scene information corresponding to the example shown in FIG. Like the scene information 200, the scene information 800 describes information about the scene in units of lines, and each

line

801, 802, 803,... Is configured to correspond to one scene. The information described in each line is the image file name, shooting date, shooting time, scene start frame number, scene end frame number, number of people, maximum person size, maximum person position, motion information, conversation information in order from the left. Is shown. Hereinafter, the number of persons, the maximum person size, and the maximum person position in the scene information 800 will be described. Hereinafter, in the description related to FIG. 10, for the sake of simplicity of description, the symbols of each scene information are also described as meaning the scenes themselves.

Scene information of the scene 801 is an example corresponding to the example of FIG. In FIG. 9A, since there is only one person area (area 701), the scene information regarding the person area of the scene 801 is the number of persons = 1, and the maximum person size and the maximum person position are the size of the area 701. (H1 × V1) and coordinates (x1, y1). In FIG. 10, H1 = 400, V1 = 500, x1 = 500, and y1 = 300 are respectively described as numerical values corresponding to H1, V1, x1, and y1 (scene information of the scene 801). The scene information of the scene 802 is an example corresponding to a scene having one person area, as in the example of FIG. The scene information of the scene 803 is an example corresponding to the example of FIG. In FIG. 9B, there are two person regions (regions 703 and 704), and the region with the larger area is the region 703. Accordingly, the scene information related to the person area of the scene 803 is the number of persons = 2, and the maximum person size and the maximum person position are the size (H2 × V2) and coordinates (x2, y2) of the area 703. In FIG. 10, H2 = 360, V2 = 480, x2 = 400, and y2 = 500 are respectively described as numerical values corresponding to H2, V2, x2, and y2 (scene information of the scene 803). The scene information of the scene 804 is an example corresponding to the example of FIG. In FIG. 9C, there are four person regions (regions 707, 708, 709, and 710), and the region 707 has the largest area. Accordingly, the scene information related to the person area of the scene 804 is the number of persons = 4, and the maximum person size and the maximum person position are the size (H4 × V4) and coordinates (x4, y4) of the area 707. In FIG. 10, H4 = 450, V4 = 520, x4 = 100, and y4 = 300 are described as numerical values corresponding to H4, V4, x4, and y4 (scene information of the scene 804). The scene information of the scene 805 is an example corresponding to a scene having five person areas. The scene information of the scene 806 is an example corresponding to a scene where the number of persons is zero, that is, no person is detected in the image. When the number of persons is zero, there is no scene information corresponding to the maximum person size and the maximum person position. In FIG. 10, these non-existing information is represented by the symbol “*”.

In the above description, regarding the scene information 800, “maximum person size” is represented by the number of pixels in the horizontal and vertical directions of the rectangular area corresponding to the person area, and “maximum person position” is the upper left pixel in the image as the origin. In the above description, it is expressed by the coordinates of the upper left pixel of the rectangular area. However, the shape of the region corresponding to the face image in the person region may be a circle instead of a rectangle, and in this case, the “maximum person size” may be expressed by the number of pixels corresponding to the diameter of the circle. Further, the coordinates corresponding to the “maximum person position” may be the coordinates of the pixel at the center of the area instead of the upper left of the area.

The scene information generation unit 102a generates scene information including the person information (number of persons, maximum person size, maximum person position) described above, and outputs the generated scene information to the digest moving image generation unit 103a.

(Digest video generation unit 103a)
The digest moving image generation unit 103a reads the scene information generated by the scene information generation unit 102a, and digests the image data group classified by the image data classification unit 101 or the image data group selected by the event selection unit 104 as a target. Generate a moving image. FIG. 8B shows an internal configuration of the digest moving image generating unit 103a in the present embodiment. The digest moving image generation unit 103a includes a target image extraction unit 1031, a scene type determination unit 1032a, a scene space arrangement unit 1033a, a scene time arrangement unit 1034, and a digest control unit 1035. Hereinafter, the differences from the first embodiment will be mainly described.

The target image extraction unit 1031 refers to the selection information indicating the target image data group notified from the event selection unit 104, and extracts an input image when generating a digest moving image. Information indicating the extracted image data is notified to the scene type determination unit 1032a and the scene space arrangement unit 1033a. The scene type determination unit 1032a refers to the scene information generated by the scene information generation unit 102a, reads the scene information of the scene corresponding to the information indicating the image data extracted by the target image extraction unit 1031, and determines the scene type. decide.

Hereinafter, the process of determining the scene type in the scene type determination unit 1032a will be described with reference to FIG. The scene type determination unit 1032a refers to the scene information 800, compares the shooting times of two scenes that are consecutive in the order of the shooting times, and determines whether the difference ΔT between the two shooting times is within the scene proximity determination threshold THt or exceeds it. That is, depending on whether or not they are close in time, each scene is determined as a “single scene” used alone or a “combined scene” used in combination. Assuming that the scene proximity determination threshold THt = 300 seconds, the difference ΔT between the shooting times of the

scenes

801 and 802 is ΔT = 1 minute 41 seconds = 101 seconds <THt, so the

scenes

801 and 802 are determined as “combined scenes”. To do. Similarly, the

scenes

803 and 804 are determined as “combination scenes” because they are close in time, and the

scenes

805 and 806 are also determined as “combination scenes” because they are close in time. The scene type determination unit 1032a refers to person information (number of persons, maximum person size, maximum person position), motion information, and conversation information included in the scene information for each scene determined as a combination scene, If it is determined that there is a scene, the main scene is classified. If it is determined that the scene is not a main scene, each scene is classified into a sub-scene. Since both the scene 801 and the scene 802 in FIG. 10 have “number of persons” = 1, both are determined to be main scenes and classified as main scenes. Since the

scenes

803 and 804 have “number of persons” of 2 and 4, respectively, the scene 803 is classified as a main scene and the scene 804 is classified as a sub-scene. Since the

scenes

805 and 806 have “number of persons” of 5 and 0, respectively, the scene 805 is classified as a main scene and the scene 806 is classified as a sub-scene. As described above, the scene type determination unit 1032a determines that a scene having a smaller (number of people) in the scene (but not zero) is the main in the combination scene, and a scene having a relatively larger number of persons. Is determined not to be major. A scene with a “number of persons” of zero is determined to be less important than a scene with a non-zero “number of persons”. If the “number of persons” of the combination scenes is the same, it is determined that both are main, and both are classified as main scenes.

The scene space arrangement unit 1033a determines the spatial arrangement of each scene, generates an image clip in which the scene is arranged spatially, and outputs the image clip to the scene time arrangement unit 1034. The scene space arrangement unit 1033a determines the spatial arrangement (layout) of each scene based on the scene type determined by the scene type determination unit 1032a and the scene information relationship between the combination scenes. The method for determining the layout of the scene in the scene space arrangement unit 1033a is basically the same as the method in the scene space arrangement unit 1033 described above, but in the scene space arrangement unit 1033a, the “person” included in the scene information is displayed. The “number” is used for layout determination in association with the “person information” used as the layout determination reference in the scene space arrangement unit 1033. For example, when the “number of persons” included in the scene information is 1 or 2, it is handled in the same way as a scene whose “person information” is “main person (1)”. Further, when the “number of persons” is 3 or more, it is handled in the same manner as a scene in which “person information” is “other person (2)”. Further, when the “number of persons” is 0, it is handled in the same way as a scene where the “person information” is “no person (0)”. In addition, regarding the layout determination method by the scene space arrangement unit 1033a, the difference from the scene space arrangement unit 1033 is that the control of the arrangement position of the scene according to the “maximum person position” indicated by the scene information and the “maximum person size” And effect control according to the “number of persons”.

FIG. 11 to FIG. 13 show processing examples related to scene arrangement position control and effect control by the scene space arrangement unit 1033a. A scene 901 and a scene 902 in FIG. 11 correspond to the

scenes

801 and 802 in FIG. Since the

scenes

801 and 802 are combined scenes as described above, and both are main scenes, the scene space layout unit 1033a displays the layouts of both scenes in parallel with the

scenes

901 and 902 having the same size. The arrangement is determined to be “parallel arrangement” (FIG. 11C). At this time, the scene space arrangement unit 1033a determines the areas (areas 921 and 922) each including the area (area 911 and 912) indicated by the “maximum person position” included in the scene information of each scene near the center. These

areas

921 and 922 are cut out from the images of the

scenes

901 and 902, respectively, and are arranged in

areas

931 and 932 in the output image 930.

As a next example,

scenes

903 and 904 in FIG. 12 correspond to

scenes

803 and 804 in FIG.

Scenes

803 and 804 are combined scenes as described above, and are a main scene and a sub-scene, respectively. Further, since the scene 803 which is the main scene is “number of persons” = 2, it is handled in the same way as the scene whose person information is “main person (1)”. The layout is determined to be “center arrangement”, which is an arrangement in which the main scene 903 is superimposed and displayed on the area 941 in the center portion of the screen while the sub-scene 904 is displayed over the entire area in the output image 940 (FIG. 12C). )). At this time, the scene space arrangement unit 1033a determines an area 923 that includes the area (area 913) indicated by the “maximum person position” included in the scene information of the main scene 903, and this area 923 is determined as the area 923. It is cut out from the image of the scene 903 and arranged in an area 941 in the output image 940.

As a next example,

scenes

905 and 906 in FIG. 13 correspond to

scenes

805 and 806 in FIG. As described above, the scene 805 and the scene 806 are combined scenes, and are a main scene and a sub-scene, respectively. Further, since the scene 805 which is the main scene is “number of persons” = 5, the scene information is treated in the same way as the scene whose person information is “other person (2)”. While displaying the main scene 905 in the entire area 951 in the output image 950, the layout is changed to “sub-screen arrangement” which is an arrangement in which an image obtained by reducing the sub-scene 906 is superimposed on the main scene as a sub-screen area 952. It decides (FIG.13 (c)). At this time, the scene space arrangement unit 1033a causes the region (region 915) indicated by the “maximum person position” included in the scene information of the main scene 805 to be output in the image 950 so that it is not hidden by the sub-scene to be superimposed. The position of the sub-screen area 952 to be superimposed on is determined. At that time, the scene space arrangement unit 1033a selects a position that is farthest from the position of the area (area 915) indicated by the “maximum person position” as one of the four corners in the screen as the position of the sub-screen area. Thus, the position of the small screen area 952 is determined. Note that the position of the sub-screen area 952 is not limited to the four corners in the screen as long as it does not overlap with the “maximum person position” included in the scene information of the main scene.

As described above, the scene space placement unit 1033a determines the layout of a plurality of scenes, so that a large subject (for example, a person area that is likely to be noticed) in the main scene becomes a boundary with another scene placed in the same screen. Therefore, it is possible to avoid cases that do not fit within the screen or are hidden in other scenes that are not main, and as a result, it is possible to generate a digest moving image that is easy to watch.

Next, effect control by the scene space arrangement unit 1033a will be described. When determining the layout of a plurality of scenes, the scene space arrangement unit 1033a may further apply a spatial filter to a part of the scenes to make an image that emphasizes the difference between the main scene and the sub-scene. For example, by applying a smoothing filter to a region 942 in an image 940 “centered” as shown in FIG. 12, the sharpness of the image between the central region 941 displaying the main scene and the peripheral region 942 displaying the sub-scene. To make the areas of interest clearer. At this time, the scene space arrangement unit 1033a controls the strength of the smoothing filter in accordance with the “maximum person size” included in the scene information. For example, a ratio HSratio (= HSmain / HSsub) of “maximum person size” HSmain included in the scene information of the main scene and “maximum person size” HSsub included in the scene information of the sub-scene is defined, and the size of HSratio is set. Control the strength of the smoothing filter to be inversely proportional. For example, the scene space arrangement unit 1033a uses three types of parameters α, β, and γ in order of increasing smoothness as the parameter Ff that controls the smoothness of the smoothing filter. At that time, control is performed such that the parameter γ is selected when the HS ratio is small, and the parameter α is selected when the HS ratio is large. FIG. 14A is a graph showing an example of the relationship between HSratio and the smoothing filter strength Ff. When the difference in maximum person size between the main scene and the sub scene is small (HSratio: small), when the main scene and the sub scene are superimposed, the sub scene image obstructs the viewing of the main scene image. Since it is easy (they are easily confused), the smoothing applied to the sub-scene display area (area 942) is strengthened to increase the sharpness difference between the main scene and the sub-scene. On the other hand, when the maximum person size difference between the main scene and the sub-scene is large (HSratio: large), when the main scene and the sub-scene are superimposed and displayed, the sub-scene image hinders viewing of the main scene image. Therefore, the difference in sharpness between the main scene and the sub-scene is reduced by reducing the smoothing applied to the sub-scene display area (area 942). When the difference between the maximum person sizes in the main scene and the sub-scene is larger, smoothing may not be applied (when HS Ratio> r3 in FIG. 14A). The purpose of differentiating the sharpness of the main scene and the sub-scene by the smoothing filter is mainly to increase the degree of attention to the main scene, but if the difference in sharpness is too large, the digest video will be viewed. In this case, it becomes difficult to determine what the sub-scene is copied, and the effect of spatially arranging a plurality of scenes is halved. Therefore, for a scene that is unlikely to obstruct attention to the main scene, the smoothing filter is weakened so that the sharpness is not significantly different, or the smoothing filter itself is not applied. With such a configuration, while increasing the degree of attention to the main scene, when displaying multiple scenes in various layouts, the display (appearance) variations will be further increased and the digest video will be viewed. In addition, it is possible to provide a moving image that is easier to see and less likely to get tired of the user.

FIG. 14B shows another example relating to the control of the smoothing filter strength Ff by the scene space arrangement unit 1033a. FIG. 14B is a graph showing an example of the relationship between HNsub and Ff when the smoothing filter strength Ff is determined by the “number of persons” HNsub included in the scene information of the sub-scene. As shown in the graph, the scene space arrangement unit 1033a selects a parameter γ having a high degree of smoothing when HNsub is small, and selects a parameter α having a low degree of smoothing when HSsub is large. When HNsub is 0, it may be controlled not to perform smoothing itself (when 0 ≦ HNsub <n1 in FIG. 14B). According to this method, the smoothing intensity can be easily determined from only the scene information of the sub-scene without referring to the scene information of the main scene. Since the target of smoothing is a sub-scene, if the strength of the smoothing filter is controlled according to the scene information (number of persons) of the sub-scene, the sub-scene image is effectively used while increasing the attention of the main scene. A digest moving image can be generated. Further, the smoothing filter strength Ff may be controlled so as to satisfy both the relationship shown in FIG. 14A and the relationship shown in FIG. For example, Ff can select not only three types of α, β, and γ, but also a large number of coefficients. First, the rough smoothing filter strength Ff is determined based on the “number of persons” HNsub of the sub-scene. Thereafter, Ff is controlled more finely based on the ratio HSratio (= HSmain / HSsub) of the “maximum person size” HSmain of the main scene and the “maximum person size” HSsub of the sub-scene. With such a configuration, it is possible to increase the types of filter strengths to be applied, and to further increase display (appearance) variations when a plurality of scenes are spatially arranged in various layouts.

The smoothing filter strength Ff described above may be, for example, a parameter indicating the number of thinned pixels when performing simple pixel thinning as a smoothing filter. In the example shown in FIG. 14, for example, α = 2, β = 4, and γ = 8 are set, and the scene space arrangement unit 1033a thins out the pixels so that the reciprocal of α, β, and γ is obtained. The image is smoothed by interpolating the pixels to the original number of pixels. For example, if Ff = α = 2, the pixels are thinned out so that the number of pixels of the image to be smoothed is ½ in both the horizontal direction and the vertical direction, and then the pixel values at the thinned pixel positions are thinned out. Copy and interpolate later (remaining) pixel values. When the value of the parameter Ff = 1, this means that thinning is not performed, and in this case, smoothing is not performed. Alternatively, the smoothing filter strength Ff may be, for example, a parameter indicating a window size corresponding to a pixel range to which a filter is applied when a moving average filter is used as the smoothing filter. In the example shown in FIG. 14, for example, α = 3, β = 5, and γ = 9 are determined, and the scene space arrangement unit 1033a is a unit of the window size indicated by α, β, γ (for example, Ff = α = 3). If so, the image is smoothed by averaging the pixel values. Similar to the above example, smoothing is not performed when the value of the parameter Ff = 1. In addition to these examples, the smoothing filter strength Ff may be a parameter indicating a predetermined coefficient set according to the smoothing filter method used by the scene space arrangement unit 1033a, such as a Gaussian filter or a weighting filter. .

Note that the spatial filter applied by the scene space arrangement unit 1033a is not limited to the smoothing filter, and may be a color conversion filter that changes the color tone of each region. For example, the scene space arrangement unit 1033a may change the saturation instead of smoothing the sub-scene image. For example, with respect to the pixels in the sub-scene region 942, the saturation of the pixels is changed so as to be proportional to the HSratio or HNsub described above. For example, as shown in FIG. 14 (c), a characteristic indicating the relationship between HSratio and saturation S is defined in the range from 0 to Smax in saturation S, and the sub-scene region 942 is adjusted to match that characteristic. Convert pixel values. Here, Smax means the maximum saturation in the target sub-scene before the pixel value is converted. With such a configuration, when the difference in maximum person size between the main scene and the sub-scene is small (HSratio: small), the sub-scene image is the main scene image when the main scene and the sub-scene are superimposed and displayed. Therefore, by reducing the saturation S of the sub scene, the saturation of the main scene and the sub scene is differentiated to make the main scene area stand out. At this time, if HSratio is smaller than a predetermined threshold value, the saturation of the sub-scene is set to S = 0 (when HSratio <r0 in FIG. 14C), that is, a gray scale image is obtained. The saturation difference may be particularly emphasized. If HSratio is greater than a predetermined threshold, the saturation of the sub-scene may not be changed (when HSratio> r4 in FIG. 14C). Note that the saturation of the sub-scene may be converted based on the same characteristic indicating the relationship between HNsub and saturation S instead of the characteristic indicating the relationship between HSratio and saturation S as shown in FIG. Good. As described above, the sub-scene region (for example, the region 942) is converted by changing the pixel value so as to lower the saturation of the region where the sub-scene is arranged according to the relationship between the scene information of the main scene and the sub-scene. An area of the main scene (for example, the area 941) arranged on the same screen can be made conspicuous as it approaches a gray scale image or becomes a gray scale image. Alternatively, the scene space arrangement unit 1033a does not apply a spatial filter, but makes the change in the time direction of the image in the sub-scene region 942 zero, that is, makes a still image, so that the difference from the main scene region 941 is different. May be emphasized.

As described above, according to the video editing apparatus 100a according to the present embodiment, in generating a digest moving image in which a plurality of scenes are combined and spatially arranged, an image that is easily noticed, such as a human region in a main scene. It is possible to provide a digest moving image in which an area is easily seen. In addition, depending on the difference in characteristics between major and non-major scenes spatially arranged on the same screen, sharpness, color, etc. While increasing the degree of attention, the number of display (viewing) variations when spatially arranging multiple scenes in various layouts is further increased, and it is easier to see and less tired of the user when watching a digest video. A moving image can be provided.

In the description of the third embodiment, an example is shown in which a person (a face image or a whole body image) is detected as a feature area in an image showing the features of each scene. However, another subject is used instead of a person. Information that is detected as a region and indicates the “number of regions”, “maximum region size”, and “maximum region position” corresponding to the feature regions may be included in the scene information instead of the person information. As a method for detecting a subject other than a person as a feature region, when using the Haar-Like feature amount or the HOG feature amount described above, a subject to be focused on, for example, an animal (dog, bird, etc.), a vehicle (vehicle, aircraft) Etc.) in advance, and a subject in the image may be detected (identified) based on the identifier.

(Fourth embodiment)
Next, a video editing apparatus according to the fourth embodiment of the present invention will be described. The video editing apparatus according to the fourth embodiment is different from the video editing apparatus according to the first embodiment in the target image extraction unit, the scene space arrangement unit, and the scene time arrangement unit included in the digest moving image generation unit. The point is a feature. In the fourth embodiment, the video editing apparatus 100b is configured to include a digest moving image generation unit 103b, and the digest moving image generation unit 103b includes a target image extraction unit 1031b, a scene space arrangement unit 1033b, and a scene time arrangement unit. 1034b. FIG. 17 shows an internal configuration of the video editing apparatus 100b and the digest moving image generating unit 103b according to the present embodiment.

The digest moving image generation unit 103b reads the scene information generated by the scene information generation unit 102, and uses the image data group classified by the image data classification unit 101 or the image data group selected by the event selection unit 104 as a target. Generate a moving image. The digest moving image generation unit 103b includes a target image extraction unit 1031b, a scene type determination unit 1032, a scene space arrangement unit 1033b, a scene time arrangement unit 1034b, and a digest control unit 1035. Hereinafter, the differences from the first embodiment will be mainly described.

The target image extraction unit 1031b refers to the selection information indicating the target image data group notified from the event selection unit 104, and extracts the input image when generating the digest moving image. The target image extraction unit 1031b notifies the scene type determination unit 1032 and the scene space arrangement unit 1033b of information indicating the extracted image data. At that time, the target image extraction unit 1031b extracts the name of the image data group, the image data name, and the shooting date / time of the image data from the image data group identification information, and notifies the scene space arrangement unit 1033b of the extracted image data.

The scene space arrangement unit 1033b determines the spatial arrangement of each scene and generates an image clip in which the scene is arranged spatially in the same manner as the scene space arrangement unit 1033 described with respect to the first embodiment. The scene space arrangement unit 1033b further has a function of superimposing a text image indicating image information when generating an image clip and a function of generating a title image as an additional image clip. This is a difference from the form.

FIG. 18 shows an example of an image clip generated by the scene space arrangement unit 1033b. FIG. 18A shows an example of a title screen generated by the scene space layout unit 1033b. The title screen 1000 is, for example, an image in which white text 1002 is superimposed on a black background 1001 and is a still image of about 5 seconds, for example. The scene space arrangement unit 1033b generates the title screen 1000 by superimposing the text 1002 indicating the name of the image data group notified via the target image extraction unit 1031b on the separately generated background image 1001. FIG. 18B is an example of an image clip including text information indicating image information for each scene, which is generated by the scene space arranging unit 1033b. The image clip 1003 is an image clip in which the scene 1004 and the scene 1005 are spatially arranged (corresponding to the image 930 in FIG. 11C), and text indicating the shooting date / time information is superimposed on the

scenes

1004 and 1005. It is an image. The scene space arrangement unit 1033b captures images corresponding to the image data of the scene 1004 and the scene 1005 (DSC_2001.mov and DSC_2002.mov in FIG. 15) included in the image data group identification information notified via the target image extraction unit 1031b. An image clip 1003 is generated by superimposing text (1006, 1007) indicating date and time information on each scene.

The scene time arrangement unit 1034b combines the image clips generated by the scene space arrangement unit 1033b in the time direction in the same manner as the scene time arrangement unit 1034 described with reference to the first embodiment. At that time, the scene time arrangement unit 1034b combines the image clips in the time direction so that the image clip of the title screen generated by the scene space arrangement unit 1033b is positioned at the head in time.

Thus, by automatically generating the title screen based on the date / time information and position information of the input image data, it is possible to reduce the trouble of user operation when generating the digest moving image, and the generated digest moving image is used by the user. Can see at a glance when and where the digest is for image data taken. Therefore, even when a digest moving image including image data that has passed the number of days after shooting is generated or viewed, the situation at the time of shooting can be easily remembered, and the user's satisfaction with the digest moving image can be improved. Furthermore, by superimposing the shooting date and time information in units of scenes, it is possible to easily specify image data when the user wants to watch the digest moving images carefully after watching the digest moving images.

In the scene space arrangement unit 1033b, whether or not to superimpose text indicating shooting date / time information on each scene may be determined in advance according to the user's selection. In that case, the digest control unit 1035 notifies the scene space layout unit 1033b whether or not to superimpose text according to the user's selection, and the scene space layout unit 1033b sends the text indicating the shooting date / time information to each scene according to the notification. Toggle control of whether or not to superimpose. In addition, when superimposing the text indicating the shooting date / time information on the image clip, for example, only the shooting date / time information of the main scene may be superimposed instead of superimposing the text for every scene.

By providing the configuration as described above, the video editing apparatus according to the present embodiment allows a user to confirm and view a large number and a large number of still images and moving images in a short time without trouble. In addition, it is possible to view the image with a quality suitable for a display device that displays the image and an easy-to-view image, and further, it is possible to view the same image data group again and again without getting tired.

(Fifth embodiment)
Hereinafter, embodiments of the present invention will be described with reference to the drawings. For convenience of explanation, members having the same functions as those shown in the above embodiment are given the same reference numerals, and the explanation thereof is omitted.

FIG. 23 is a schematic diagram showing the configuration of a video editing apparatus according to the fifth embodiment of the present invention.

The video editing apparatus 100c includes a target image data extraction unit 109, a scene information generation unit 102, a reproduction time candidate derivation unit 110, a reproduction time candidate display unit 111, and a digest moving image generation unit 103c. Although not shown, the video editing apparatus 100c may further include a data recording unit that stores image data, a video display unit that displays images, or a data recording device that has the same functions as those described above. The video display device may be configured to be connectable to the outside.

Next, each functional block of the video editing apparatus 100c will be described.

The target image data extraction unit 109 extracts image data that meets a predetermined condition based on the metadata included in the image data. The extracted image data is collected as an image data group.

For example, on the basis of the editing date on which the digest moving image is edited, the image data shot on the previous day, that is, the image data whose shooting date is the previous day of the editing date is determined as the editing target. Further, based on the designated date and time specified by the user, not the editing date, the image data with the shooting date and time before and after the designated date and time may be determined as the editing target. Further, the image data determined by the target image data extraction unit 109 as the editing target may be based not only on the date / time information but also on position information and creator information. For example, image data having position information specified by the user or position information within a predetermined range including the position may be determined as an editing target. Alternatively, among a plurality of image data having position information within a predetermined range for different creators, only image data having specific creator information may be determined to be edited, and conversely, a specific creation The image data excluding the image data having the person information may be determined as an editing target. The image data determined by the target image data extraction unit 109 as an editing target is not limited to one, and may be two or more. Note that the target image data extraction unit 109 may use a day change as a trigger as the timing for determining image data to be edited. For example, when midnight has passed, image data captured on the previous day may be determined as an editing target. The target image data extraction unit 109 outputs the image data group to the digest moving image generation unit 103c.

Also, the target image data extraction unit 109 calculates the total reproduction time by summing the reproduction times of all the extracted image data. The target image data extraction unit 109 outputs the total reproduction time to the reproduction time candidate derivation unit 110.

The reproduction time candidate derivation unit 110 derives digest video reproduction time candidates based on the total reproduction time input from the target image data extraction unit 109. As a derivation method, the square root of the total playback time is calculated as a playback time candidate. The square root of the total playback time in the unit of “minute” is calculated, and a value obtained by rounding down the decimal point is set as a playback time candidate. For example, when the total playback time is 1 hour, 7 which is a value obtained by rounding down the decimal point from the square root of 60 is a playback time candidate. The reproduction time candidate derivation unit 110 outputs the derived reproduction time candidate to the reproduction time candidate display unit 111.

The playback time candidate display unit 111 displays the playback time candidates input from the playback time candidate derivation unit 110 on a display device (not shown). It is assumed that the display device includes user input means such as a touch panel and a mouse. The reproduction time candidate display unit 111 receives a user event via the input means, and sets the reproduction time candidate selected by the user event as a designated time. The reproduction time candidate display unit 111 outputs the designated time to the digest moving image generation unit 103c.

FIG. 24 is an example of a user interface for designating the playback time of the digest moving image in the video editing apparatus 100c of the present embodiment. The user can select a desired reproduction time by sliding the button 32 of the bar 31 displayed on the lower side of the “digest moving image reproduction time” display to the left and right. Below the bar 31, the minimum value and the maximum value of the playback time that can be specified are displayed. In the case of FIG. 3, the minimum value is 1 minute, and the maximum value is 7 minutes, which is a reproduction time candidate. For example, when the user slides the button 32 to the left end of the bar 31, the specified time is 1 minute, and when the user slides to the right end, the specified time is 7 minutes, and the button 32 is slid in the middle of the bar 31. The specified time is 4 minutes which is an intermediate value between 1 minute and 7 minutes. In this embodiment, the example in which the playback time is selected by sliding the button 32 on the bar 31 has been described. However, the playback time may be selected from a pull-down menu, or a numerical value may be input. .

Next, processing contents of digest moving image generation performed by the digest moving image generating unit 103c will be described. FIG. 25 is a conceptual diagram showing a digest moving image generation process by the video editing apparatus 100c of the present embodiment. As shown in FIG. 25, the video editing apparatus 100c reads the corresponding scene information 303 for the image data group 302, which is a set of selected image data, from the image data 301, and from the reproduction time candidate display unit 111. A digest moving image is generated according to the input specified time. A group of image data 302 for which a digest moving image is to be generated is, for example, all image data photographed on a certain day. This image data group is determined by the target image data extraction unit 109. The image data group 302 is classified into one or more scenes by the scene information generation unit 102, and scene information, which is information indicating the feature of each scene, is generated. Next, the digest moving image generation unit 103c refers to the scene information in the order of shooting date and shooting time, and determines the type of scene such as a scene to be used alone and a scene to be used in combination with another scene. To do. Based on the determined scene type, the digest moving image generation unit 103c generates

image clips

306a, 306b, 306c,. Are combined to generate a digest moving image 307. The image clips 306a, 306b, 306c, and the like are moving images including at least one scene, but may include still images.

Also, the digest moving image generating unit 103c adjusts the digest moving image so that the reproduction time of the generated digest moving image becomes the specified time. Here, “the playback time becomes the specified time” may mean that the playback time matches the specified time, or means that there is a slight difference between the playback time and the specified time. Also good.

For example, as shown in FIG. 26A, the digest moving image 50A is composed of image clips 51 to 57, and the specified time passes during the reproduction of the last image clip 57 of the digest moving image 50A. However, it may be considered that the playback time has reached the specified time.

Also, as shown in FIG. 26B, the digest moving image 50B is composed of image clips 51 to 56, and the playback time of the digest moving image 50B is shorter than the specified time, but one more image clip, In the case of FIG. 26B, when the image clip 57 is combined, even if the playback time of the digest moving image 50B becomes longer than the specified time, it may be considered that the playback time has reached the specified time.

That is, if the difference between the playback time and the specified time is less than or equal to one image clip, the playback time may be regarded as the specified time. In addition, as a range allowing the difference between the reproduction time and the designated time, a specific numerical value such as 30 seconds or 1 minute may be used, or a ratio with respect to the designated time, for example, 1% of the designated time may be used.

FIG. 27 shows the internal configuration of the digest moving image generating unit 103c in the present embodiment. The digest moving image generating unit 103c includes a scene type determining unit 1032, a scene space arranging unit 1033, a scene time arranging unit 1034, and a digest moving image editing unit 1036. The processing contents of the scene type determination unit 1032, the scene space arrangement unit 1033, and the scene time arrangement unit 1034 are the same as those in the first embodiment.

(Adjustment of digest video playback time)
A method for adjusting the reproduction time when the digest moving image generating unit 103c generates a digest moving image will be described.

The digest moving image editing unit 1036 adjusts the reproduction time of the digest moving image by editing the digest moving image output from the scene time arranging unit 1034.

The digest moving image editing unit 1036 outputs the input digest moving image as it is when the reproduction time of the digest moving image is the designated time.

The digest moving image editing unit 1036 edits the digest moving image so that the reproduction time of the digest moving image becomes the specified time when the reproduction time of the digest moving image is not the specified time.

Specifically, when the digest video playback time is longer than the specified time, the digest video editing unit 1036 shortens each image clip included in the digest video. First, the digest moving image editing unit 1036 adjusts the reproduction time of the digest moving image by shortening the reproduction time of the image clip having no motion. Specifically, referring to the motion information of the scene information of the scene included in the image clip in order from the beginning of the digest moving image, the motion information of all the scenes included in the image clip is “no motion (0)”. Reduces the playback time by thinning out the frames of the image clip. For example, by halving the number of frames by simple decimation, the reproduction time of the image clip is halved, that is, the reproduction speed is doubled. FIG. 28 is a conceptual diagram for explaining processing for shortening the reproduction time of an image clip in the digest moving image editing unit 1036. The image clip 60A is an image clip composed of only a scene whose motion information is “no motion (0)”. Frames 61 to 66 are frames constituting the image clip 60 </ b> A, and are arranged in chronological order from the frame 61 to the frame 66. When the digest moving image playback time is longer than the specified time, the digest moving image editing unit 1036 displays one frame for every two frames in the image clip 60A, and in the case of FIG. 28, the frame 62, the frame 64, the frame 66. By deleting the image clip 60B, the number of frames is half that of the image clip 60A. Since the frame rate for displaying the image clip 60B is the same as that of the image clip 60A, the image clip 60B is an image clip in which the playback speed of the image clip 60A is doubled. The digest moving image editing unit 1036 repeats the above processing until the digest moving image reproduction time reaches the specified time.

The digest moving image editing unit 1036 performs the above processing up to the last image clip of the digest moving image, and if the digest moving image playback time is not the specified time, the motion of all scenes included in the image clip The playback time of the digest moving image is adjusted by cutting out a part of the image clip whose information is not “no motion (0)”. More specifically, assuming that the digest video playback time is Td and the specified time is Ts, the digest video editing unit 1036 has a part of the image clip so that the playback time Ti of each image clip is Ts / Td times. Cut out. For example, a time corresponding to 1− (Ts / Td) times of the reproduction time Ti of the image clip is cut from the head of the image clip. In addition to the beginning, the part to be cut may be the last part of the image clip, or may be cut from both the beginning and the end.

If the digest moving image playback time is shorter than the specified time, first, the digest moving image editing unit 1036 adjusts the playback time of the digest moving image by increasing the playback time of the image clip having no motion. Specifically, referring to the motion information of the scene information of the scene included in the image clip in order from the beginning of the digest moving image, the motion information of all the scenes included in the image clip is “no motion (0)”. Extends the playback time by interpolating the frames of the image clip. For example, by interpolating one frame between each frame, the reproduction time of the image clip is doubled, that is, the reproduction speed is halved. As another example, by interpolating two frames between each frame and then deleting even-numbered frames, the playback time of the image clip is 1.5 times, that is, the playback speed is 2/3 times. . FIG. 29 is a conceptual diagram for explaining processing for extending the playback time of an image clip in the digest moving image editing unit 1036. The image clip 70A is an image clip composed of only a scene whose motion information is “no motion (0)”.

Frames

71, 74, and 77 are frames constituting the image clip 70A, and are arranged in time series in the order of the

frames

71, 74, and 77. When the digest moving image playback time is shorter than the specified time, the digest moving image editing unit 36 first sets two frames between the frames in the image clip 70A, and in the case of FIG. 29, the frame 72, the frame 73, and the frame 75. The frames 76... Are interpolated by frame interpolation to obtain an image clip 70B having three times as many frames as the image clip 70A. Next, the digest moving image editing unit 1036 deletes one frame out of two frames in the image clip 70B, and in the case of FIG. 29, deletes the frame 72, the frame 74, the frame 76,. Thus, the image clip 70C has half the number of frames. Since the number of frames of the image clip 70C is 3/2 times that of the image clip 70A and the frame rate at the time of display is the same as that of the image clip 70A, the image clip 70C is an image obtained by doubling the playback speed of the image clip 70A. It becomes a clip. Although the specific method of frame interpolation is not particularly limited, for example, linear interpolation or a method of estimating motion between frames and interpolating based on the motion may be used. The digest moving image editing unit 1036 repeats the above processing until the digest moving image reproduction time reaches the specified time.

The digest moving image editing unit 1036 performs the above processing up to the last image clip of the digest moving image, and when the digest moving image playback time is not the specified time, from the image clip included in the digest moving image, Randomly selected image clips are combined at the end of the digest video. However, when the randomly selected image clip is the same as the last image clip of the digest moving image so that the same image clip is not continuously reproduced, the combination of the selected image clips may be skipped. The digest moving image editing unit 1036 repeats the above processing until the digest moving image reproduction time reaches the specified time.

Also, as another adjustment method, there is a method of using a video effect such as a transition when switching image clips. While the video effect is being played back, the playback of the image clip is stopped, so that the digest video playback time can be extended. The digest moving image editing unit 1036 extends the digest moving image reproduction time Td by inserting video effects in order from the location where the difference in shooting time between image clips is large. The digest moving image editing unit 1036 repeats the above processing until the digest moving image playback time reaches the specified time or video effects are inserted between all the image clips. If the digest video playback time does not reach the specified time even if the video effect is inserted between all the image clips, the above-described method of combining the randomly selected image clip at the end of the digest video is used. The specific video effect to be inserted is not particularly limited, but for example, a crossfade, dissolve, or wedge-shaped wipe that is one type of transition may be used.

As described above, with regard to an image clip with no motion, by adjusting the playback time by simple decimation or frame interpolation, it is possible to view a digest moving image with a desired playback time without giving the user a sense of discomfort as much as possible. The effect of is obtained.

(Sixth embodiment)
Next, a video editing apparatus according to the sixth embodiment of the present invention will be described. The video editing apparatus according to the sixth embodiment is different from the video editing apparatus according to the fifth embodiment in that a digest moving image generating unit 103d is provided instead of the digest moving image generating unit 103c. is there.

In the fifth embodiment, when shortening an image clip, a part of the image clip may be cut out, but there is a problem that the user may cut out a portion that the user wants to view.

On the other hand, the video editing apparatus according to the present embodiment provides an adjustment method in which the playback time of the digest moving image becomes the specified time without cutting the image clip.

FIG. 30 shows an internal configuration of the digest moving image generating unit 103d in the present embodiment. The digest moving image generation unit 103d includes a scene type determination unit 1032d, a scene space arrangement unit 1033, and a scene time arrangement unit 1034. The processing contents of the scene space arrangement unit 1033 and the scene time arrangement unit 1034 are the same as those in the first embodiment.

Scene type determination unit 1032d determines the type of scene based on the scene information and threshold value THt in the same manner as scene type determination unit 1032.

The scene type determination unit 1032d calculates the digest video playback time Td from the scene information and the scene type. The initial value of the playback time Td is 0, and for a scene whose scene type is “single scene”, the playback time of that scene is added to the playback time Td. For a plurality of scenes whose scene type is “multiple scenes”, the playback time of the scene with the shortest playback time among the scenes is added to the playback time Td.

Then, the scene type determination unit 1032d adjusts the digest video playback time Td to be the specified time when the calculated digest video playback time Td is not the specified time.

As a specific adjustment method, the scene type determination unit 1032d changes the threshold THt so that “multiple scenes” are more easily selected than “single scenes” when the digest video playback time Td is longer than the specified time. To do. For example, the threshold value THt is changed from 5 minutes to 10 minutes. By changing the threshold value, the ratio of “multiple scenes” included in the digest moving image increases, so that the reproduction time of the digest moving image can be adjusted to be short. In addition, when the digest moving image playback time Td is shorter than the specified time Ts, the scene type determination unit 1032d changes the threshold value THt so that “single scene” is more easily selected than “multiple scenes”. For example, the threshold value THt is changed from 5 minutes to 3 minutes. By changing the threshold value, the ratio of the “single scene” included in the digest moving image increases, so that the reproduction time of the digest moving image can be adjusted to be longer.

Then, the scene type determination unit 1032d determines the scene type based on the changed threshold value THt, and calculates the digest moving image playback time Td again. The scene type determination unit 1032d repeats the above processing until the digest moving image playback time Td reaches the specified time.

According to the method described above, by adjusting the ratio of “single scene” and “multiple scenes” of the digest video, the playback time of the digest video without deleting part of the image clip that the user wants to view Since the reproduction time desired by the user can be set, an effect of generating a digest moving image that satisfies the user can be obtained.

(Seventh embodiment)
Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 31 is a schematic diagram showing the configuration of the video editing apparatus according to the first embodiment of the present invention. The video editing apparatus 100e includes an image data classification unit 101, a scene information generation unit 102, a digest moving image generation unit 103, an event selection unit 104, an output control unit 105, a video display unit 106, a digest moving image editing control unit 107, and an operation. The unit 108 is configured to be included. Although not shown, the video editing apparatus 100e may include a data recording unit that stores image data inside, or may be configured to connect a data recording apparatus having the same function as the outside to the outside. . The basic processing contents of the image data classification unit 101, the scene information generation unit 102, the digest moving image generation unit 103, the event selection unit 104, and the output control unit 105 are the same as those in the first embodiment.

The video display unit 106 outputs a video including a digest moving image generated by the video editing apparatus 100e and a user interface (UI) used for operation to a display device. The display device is built in the video editing apparatus 100e or connected to the outside.

The digest moving image editing control unit 107 reproduces the digest moving image generated by the digest moving image generating unit 103, and outputs it to the video display unit 106 while synchronizing the image and sound and adjusting the frame rate. In parallel with this, video editing processing is performed based on the input from the user. The digest moving image to be reproduced may be image data once stored in the recording medium by the digest moving image generation unit 103 or may be image data directly input from the digest moving image generation unit 103. Further, it may be image data in which a digest moving image generated by another video editing device equivalent to the video editing device 100e is stored in a recording medium. When the format of the digest moving image is different from the format of the display data that can be directly used by the video display unit 106, the digest moving image is converted into display data that can be used by the video display unit 106. For example, when the digest moving image is compressed by an encoding method such as HEVC or AAC, the digest moving image editing control unit 107 decodes the image data and outputs the decoded image data to the video display unit 106. In addition, if digest video data is stored in a format that requires video generation processing, including video layout, transformation, segmentation, and overlay processing during playback, digest video generation The generation processing is performed by controlling the unit 103 to acquire and reproduce the video. In addition, it is desirable that the digest moving image editing control unit 107 can perform reproduction control including pause / fast forward / rewind / movement between scenes in reproduction.

The operation unit 108 detects an input operation from the user including a position designation on the display screen by using, for example, a touch sensor integrated with the video display unit 106, a mouse or a keyboard connected to the outside. As an example, when the video editing apparatus 100e includes a touch sensor as an operation unit, the user can input by a general operation such as tap, flick, or pinch. In addition, buttons and keys for recording and playback control may be provided.

Next, each part will be described in more detail.

(Output control unit)
In the present embodiment, the output control unit 105 is based on a condition including image display specifications of the video display unit 106, audio output specifications of an audio output device (not shown), and information indicating user preferences for the digest moving image. Set the generation policy. Information indicating user preferences is received by the operation unit 108 or other input means. For example, there is a method of displaying options such as “person main” and “landscape main” on the video display unit 106 and allowing the user to select, but the present invention is not limited to this. Further, when information indicating user preferences is not input, for example, “person main” may be set as a standard value.

The output control unit 105 sets, as “multiple scene simultaneous arrangement”, information indicating whether or not multiple scenes are allowed to be simultaneously arranged in the same image frame in accordance with the video display device of the output destination. For example, when the resolution or screen size of the display device used by the video display unit 106 is smaller than a certain threshold, the simultaneous placement of multiple scenes is set to “No”, and when the resolution is larger, the simultaneous placement of multiple scenes is set to “Yes”. .

(Digest video editing control unit)
The digest moving image editing control unit 107 plays back and edits the digest moving image generated as described above. This may be started by an instruction from the user, or may be started when the generation of the digest moving image is completed.

The digest moving image editing control unit 107 reproduces the digest moving image, and receives an input from the user and further edits the digest moving image being reproduced. FIG. 32 is a diagram showing an editing process when a digest moving image is reproduced. Although the reproduction process itself and the reproduction control process such as fast forward / rewind are not shown, they are executed in parallel with the editing process. Hereinafter, steps S101 to S104 in the figure will be described.

In step S101, the digest moving image editing control unit 107 is used to start reproduction of the digest moving image. Furthermore, a process for interpreting and executing an input from the operation unit 108 as an editing process is started.

Step S102 is a check of whether or not a moving image is being reproduced. If it is detected that the reproduction of the moving image is finished or interrupted, the editing process is finished.

Step S103 is an input operation check. It is checked whether an operation that can be interpreted as an instruction for editing processing is input. If the operation has not been input, the process returns to step S102.

It should be noted that the detection of the event related to the reproduction or operation performed in step S102 and step S103 can be realized by a periodic or non-periodic interruption, and therefore does not necessarily have to be executed in the order shown in FIG. Further, before the check step, a waiting time for waiting for a change in reproduction state or occurrence of input may be inserted.

Step S104 is execution of an editing operation. When an editing operation occurs, a process corresponding to the type of editing operation is executed with the scene being reproduced as the scene to be edited. The reproduction of the digest moving image is temporarily stopped at the start of the editing process, and resumed using the edited data after the editing process is completed. Hereinafter, some editing operations in step S104 will be described in more detail.

The type of editing operation is distinguished by input from the operation unit. For example, when the video display unit 106 includes a touch panel, direct designation of coordinates on the screen and editing operation by gestures can be realized. In addition, the above-described distinction can be made by a pointing device such as a mouse. Thus, although an input device is not necessarily a touch panel, the example using the operation input by a touch panel in which the most intuitive operation for a user is possible is demonstrated here. Note that windows, icons, and other GUI components may be displayed on the screen for operations other than the editing operation.

First, there are various types of touch panel operations (hereinafter referred to as touch operations) generally seen. For example, tap (tap the screen with the fingertip), double tap (tap the screen twice with the fingertip), flick (move the fingertip to touch the screen and play quickly), swipe (with the fingertip touching the screen in a certain direction) ), Drag (move while the fingertip is in contact with the screen, not necessarily in a certain direction), pinch in (two or more fingertips touch the screen and move closer to close), pinch out (fingertip on the screen 2 or more in contact and release to open), twist or rotate (2 or more fingertips touch the screen and move to twist). In addition, the functions may be distinguished depending on the number of fingers used for each operation and the position / shape / speed of the fingertip locus. The above is a description of a general touch operation, and not all of the touch operations are used in the editing operation of the video editing apparatus 100e, and other touch operations may be assigned to the editing operations described below.

FIG. 33 schematically shows an example of an operation performed on the digest moving image. The thick frame indicates the entire image area of the digest moving image, and when there is a rectangular frame in the thick frame, it indicates that the main scene or the sub scene of the combination scene is displayed. A dotted frame indicates a change caused by editing. The arrow indicates the approximate trajectory and length of the touch operation. Further, the coordinates where the touch operation is started are called start point coordinates, and the coordinates where the touch operation is finished are called end point coordinates.

FIG. 33A shows a flick operation on the screen 81. The flick operation is associated with scene deletion in the video editing apparatus 100e. Here, the scene to be edited is set as a scene to be deleted. The digest moving image editing control unit 107 deletes the deletion target scene data from the digest moving image, or marks the deletion target scene not to be reproduced. After editing, since the total playback time of the digest moving image is shortened by the deletion operation, the playback is resumed from the scene next to the deleted scene. When the last scene of the digest moving image is deleted, playback is stopped. In addition, it is preferable that a user can easily understand the visual effect that the scene to be deleted moves in the flicked direction when the deletion operation is accepted. In this way, scenes that the user feels unnecessary during playback can be easily deleted.

FIG. 33 (b) is an example in which a twist operation is performed in a combination scene arranged in parallel as shown in FIG. 5 (a). The twist operation is associated with the change of the arrangement pattern in the video editing apparatus 100e. FIG. 33B shows an arrangement pattern in which two

element scenes

82 and 83 are present on the screen, but the two

element scenes

82 and 83 can be switched left and right by a twist operation. When the digest moving image is encoded, the digest moving image editing control unit 107 generates an editing target scene that has been changed to a new layout and encodes it again. As in the example of deletion, it is preferable to present a visual effect such that the

element scenes

82 and 83 are switched when a twist operation is received. In this way, even when the user feels that the right and left arrangement of the element scene is unnatural during reproduction, it can be easily changed to a more preferable arrangement.

FIG. 33 (c) is an example in which a twist operation is performed in a combination scene obtained by dividing the screen into three equal parts as shown in FIG. 5 (f). In the twist operation in this case, the video editing apparatus 100e selects and changes the spatial arrangement of the element scenes in order from possible combinations. For example, in the original digest moving image,

element scenes

84, 85, and 86 are arranged in this order from the left of the screen. Assuming that the three element scenes are A, B, and C, the digest moving image editing control unit 107 changes this arrangement to {A, B, C} → {A, C, B} → {B every time a twist operation is executed. , A, C} → {B, C, A} → {C, A, B} → {C, B, A} → {A, B, C}.

The twist operation in the example of FIGS. 33 (b) and 33 (c) may be performed anywhere on the screen, but in the twist operation performed near the boundary of the combination scene, only the element scene that touches the boundary is used. May be replaced. In this way, it is possible to quickly select a more preferable arrangement for the user even in a combination scene including three or more element scenes.

FIG. 33D shows an example of performing a pinch-out operation in a centrally arranged combination scene in which a reduced main scene 88 is arranged at the center of a sub-scene 87 arranged on the entire screen as shown in FIG. 5B. It is. In the video editing apparatus 100e, the pinch out operation is associated with an increase in size with respect to the element scene. The enlargement ratio of the element scene is determined according to the distance between the start point coordinate and the end point coordinate of the pinch out operation, the minimum value is the size of the main scene 88 before the operation, and the maximum value is the size of the entire screen, that is, the sub scene 87. . The digest moving image editing control unit 107 extracts the area of the main scene 88 from the editing target scene, expands it according to the expansion ratio, generates an image rearranged on the editing target scene, and re-encodes it. To do. The position of the main scene 88 is maintained at the center so that the main scene 88 before editing is completely hidden by the enlarged main scene 88. Thus, by enlarging the size of the main scene 88, the contents and persons of the main scene 88 can be made more conspicuous.

FIG. 33 (e) shows a central arrangement similar to the example of FIG. 33 (d), but in a combination scene in which the main scene 89 is cut out and arranged at the center of the screen as shown in FIG. 5 (d). This is an example of a pinch-out operation. In this case, the upper limit of the scene enlargement ratio is determined as w0 / w1 using the digest moving image, that is, the horizontal pixel number of the sub-scene 87 (denoted w0) and the horizontal pixel number of the area of the main scene 89 (denoted w1). . If the number of pixels in the vertical direction (h1) of the main scene 89 exceeds the number of vertical pixels (h0) of the entire digest moving image as a result of the enlargement, {h0 × ( Trimming by w0 / w1-1) / 2} pixels to match the image aspect ratio of the main scene 89 with the image aspect ratio of the entire digest moving image. The digest moving image editing control unit 107 generates and encodes an image in which the main scene 89 enlarged and trimmed in this way is rearranged on the scene to be edited. As in the example of FIG. 33 (d), the content and person of the main scene 89 can be made more conspicuous by doing in this way.

33 (b) to 33 (e) are based on the premise that the scene to be edited is a combination scene. If there is no information on whether or not the editing scene is a combination scene in the digest moving image being played back, or if the scene information related to the digest moving image cannot be acquired, a pixel value histogram generation for each frame, If contour extraction (particularly a straight line portion), motion detection for each region, or the like is used, it can be determined whether or not the scene to be edited is a combination scene.

FIG. 33 (f) shows an example of a drag operation having a complicated trajectory. Such a drag operation is associated with a filter effect for an area near the locus in the video editing apparatus 100e. The trajectory as in the example of FIG. 33F is determined as, for example, a trajectory in which the distribution of the coordinate values of the points forming the trajectory is wide without deviation in the horizontal direction, and the vertical distribution is biased between the maximum value and the minimum value. To do. When such a drag operation is input, the digest moving image editing control unit 107 sets an area including the start point and end point of the drag operation as a filter target area. For example, a region including pixels within a certain distance from the line segment of the locus or a region including a rectangular region having a straight line connecting the start point and the end point as a diagonal line. The digest moving image editing control unit 107 filters the pixels in the filter target region in all frames included in such an editing target scene, and updates the digest moving image. At this time, if necessary, the filter result is encoded again. The filter used here has a function to make the target area inconspicuous, including a non-sharpening filter and simple filling with a predetermined pixel value, if the drag operation shows a trajectory that erases or disturbs a certain area. Things are desirable. In addition, if a drag operation of a trajectory that can be interpreted as meaning attention, such as surrounding a certain region, it is desirable to have a function that makes the target region stand out, including a sharpening filter and a filter that increases luminance. By such an operation, it is possible to make an area that is not necessary or unnoticeable from an editing target scene inconspicuous or to make an important area clear. Note that the filter target area needs to be changed for each frame when the object or the camera is moving. For this reason, it is desirable that the digest moving image editing control unit 107 performs a motion detection process of the filter target region for the editing target scene and adjusts the position, shape, and size of the filter target region. Furthermore, if the locus pattern is recognized in more detail, a plurality of filters can be automatically switched even with a similar target filter. For example, an unsharpening filter is applied in the case of a trajectory that reciprocates in the vertical direction as shown in FIG. 33 (f), and a fill filter is applied in the case of a trajectory that reciprocates in the horizontal direction as in FIG. it can.

FIG. 33 (h) is an example in which a button 90 indicating the start of video shooting is displayed on the screen, and a scene can be added. When the button 90 is tapped, the digest moving image editing control unit 107 stops the reproduction of the digest moving image and displays the display if there is a camera (not shown) built in the video editing apparatus 100e or connected to the outside. Switch the screen to the input image from and start shooting. Shooting is terminated by the user's operation, and the shot image data V is stored in a recording medium. The digest moving image editing control unit uses the image data V to perform the same processing as the already described digest moving image generation. However, only the image data V is input. The digest moving image of the image data V to be output is added to the end of the digest moving image that is already being reproduced, and is stored as a new digest moving image. More simply, the captured image data V may be added to the end of the digest moving image as it is.

In addition, it is desirable that all of the editing processes described above are executed immediately upon input of an operation and the result is obtained. However, in order to save the editing result, high-load processing such as video and audio re-encoding is required. There is a case. If the video editing apparatus 100e is in a high load state during playback, or if playback or operation is hindered when the editing process is executed, the editing process type and the operation unit 108 are avoided by avoiding immediate execution of the editing process. The input information and the instruction information including the scene to be edited are stored, and when the video editing apparatus 100e is in a low load state later, editing is performed based on the instruction information and the digest moving image is updated. . Even in such a case, it is desirable to present to the user that the scene is to be edited using an animation, an icon, a temporary processing result image obtained by processing with a low load, and the like at the time of operation input.

As described above, by matching the output video specifications and output audio specifications of the generated digest video to the specifications and capabilities of the output video display device and audio output device, it is possible to achieve high quality suitable for the output device. A digest moving image is generated, and the configuration of the digest moving image can be edited at the time of reproduction with a simple operation.

(Eighth embodiment)
Next, a video editing apparatus according to an eighth embodiment of the present invention will be described. The video editing apparatus according to the eighth embodiment has the same configuration as that of the video editing apparatus according to the seventh embodiment, but includes information indicating the spatial and temporal arrangement of scenes in the digest video. The difference is that the information used for generation (hereinafter referred to as arrangement information) and input image data are stored in a recording medium or memory so that they can be used during reproduction. Alternatively, the digest moving image itself may be in a format including input image data and the arrangement information. For example, the digest moving image may be data composed of one or more files including input image data and data corresponding to the arrangement information. At the time of reproduction, the video image intended by the digest moving image generation unit 103 can be generated by arranging the input image data with reference to the arrangement information.

Of the arrangement information, information indicating the spatial arrangement of the scene refers to the index of the input image data corresponding to the element scene in each arrangement pattern as described in the seventh embodiment, and the vertical and horizontal directions of the element scene. It includes the size (number of pixels), the position (coordinates) on the screen, and the cutout position on the input image data. Alternatively, indirect information for deriving these may be used, for example, an index for selecting an arrangement pattern / a predetermined size / a predetermined position. The information indicating the temporal arrangement of scenes is information indicating where each scene corresponds on the time axis of the final digest moving image, and at least the start time and end time (or length) of each scene. )including. The time and length may be expressed using the number of frames.

The digest moving image generating unit 103 according to the present embodiment can store and reuse data used for generating a previous digest moving image including arrangement information in a recording medium or a memory. As a result, even when partially or completely the same digest moving image is generated again, the load can be reduced by avoiding re-execution of the same process.

34 (a) to 34 (b) show a flick operation on the screen. The flick operation is associated with the deletion of the scene, as in the seventh embodiment. FIG. 34A shows a case where the scene to be edited is a combination scene, and the sub-scene 92 in which the starting point coordinates of the flick operation are arranged so as to overlap the main scene 91. At this time, the scene to be deleted is the sub-scene 92. FIG. 34B shows the case where the scene displayed at the starting point coordinates is the main scene 91, and the deletion target scene is the main scene 91. Alternatively, all the editing target scenes may be set as deletion target scenes. It should be noted that if the deletion process is limited to only the flick operation in the up and down direction, it is possible to easily distinguish the operation from the drag operation in the left and right direction, which will be described below, and to reduce operation errors. The digest moving image editing control unit 107 deletes the editing target scene from the spatial and temporal arrangement information so that the editing target scene is not reproduced, and generates a digest moving image again. In deleting an element scene, if an element scene that is not placed over another element scene is deleted, an area where no scene is displayed is generated on the screen of the digest moving image. In such a case, the screen may be rearranged as a single scene if there is only one element scene after deletion, and the screen may be subdivided if there are two or more element scenes after deletion. For example, when one is deleted from the parallel arrangement of three element scenes as shown in FIG. 5 (f), the remaining two element scenes may be rearranged using the parallel arrangement of FIG. 5 (b).

34 (c) to 34 (e) show examples of the drag operation using the sub-scene 92 of the combination scene as the starting point coordinates.

FIG. 34 (c) shows a case where dragging continues to either the left or right screen edge. At this time, the digest moving image editing control unit 107 changes the sub-scene 92 to a single scene and deletes the sub-scene 92 from the editing target scene. A newly generated single scene corresponding to the original sub-scene 92 is inserted immediately before the scene to be edited if the end point of the drag is the left screen edge, and immediately after the scene to be edited if the right screen edge. . FIG. 35 shows changes in the digest video before and after editing when the end point of the drag is the right screen edge. The digest moving image 1100 before editing includes

scenes

1100a, 1100b, and 1100c on the way. The scene 1100b is a combination scene, and includes a main scene S21 and a sub scene S22 as element scenes. When the scene 1100b is an editing target scene and the drag operation started from the sub-scene S22 is performed up to the right end of the screen, the sub-scene S22 is independent from the scene 1100b, and the edited digest moving image 1101 has a single scene 1100b2. It becomes. The scene 1100b from which the sub-scene S22 is deleted is only the main scene S21 and becomes a single scene 1100b1. As a simpler embodiment, it may be inserted at a predetermined position immediately before or after the scene to be edited regardless of the position of the end point. The digest moving image editing control unit 107 first deletes the original scene to be edited, and a new scene in which the sub scene 92 is deleted from the original scene to be edited and the sub scene at the temporal position where the scene to be edited exists. Two scenes of a single scene corresponding to 92 are inserted in the order described above. When the editing target scene includes only one element scene other than the sub-scene 92 as shown in FIG. 34C, the editing target scene becomes two single scenes after editing. When there are three or more element scenes included in the editing target scene, the editing target scene becomes one combination scene and one single scene after editing. 34D and 34E show a case where the end point of the drag operation does not reach the screen edge. The drag direction in the figure is an example.

FIG. 34 (d) shows a case where a portion other than the boundary portion of the sub scene 92 is dragged. At this time, the digest moving image editing control unit 107 moves the sub-scene 92 to another place on the main scene. As a result, even when the sub-scene 92 covers a person or object to be noticed on the combination scene, the sub-scene 92 can be moved to make the video more preferable for the user. The movement destination may be an arbitrary position near the end point of the drag, or, as indicated by a dotted rectangle in FIG. 34 (d), a position closest to the end point of the drag among a plurality of positions defined by the system. May be. The digest moving image editing control unit 107 rewrites the information of the editing target scene in the arrangement information so as to correspond to the above, and generates and stores the editing target scene again.

FIG. 34 (e) shows a case where the boundary portion between the main scene and the sub scene 92 in the sub-screen arrangement is dragged. At this time, the digest moving image editing control unit 107 changes the display size (number of vertical and horizontal pixels) of the sub-scene 92. The new size of the sub-scene may be an arbitrary size derived from the end point of the drag, or may be a size closest to the size represented by the end point of the drag among a plurality of sizes determined by the system. The new size and area of the sub-scene 92 may be larger or smaller than the size and area before the operation, but an upper limit and a lower limit may be provided. For example, the upper limit is 1/4 of the area of the entire moving image, and the lower limit is 1/16. Thereby, the size of the sub-scene 92 can be adjusted within a range that does not hinder viewing of the main scene, and the balance for superimposing the sub-scene 92 can be set more suitably for the user. FIG. 34F shows a pattern in which the main scene 94 is arranged in the center of the screen with the sub-scene 93 as a background. The arrangement in which the main scene overlaps the sub-scene is basically the same except that the scene whose size is to be changed becomes the main scene. In the case of FIG. 34 (f), the boundary between the main scene 94 and the sub-scene 93 is only the right and left vertical sides of the main scene.

In either case of FIG. 34 (e) or FIG. 34 (f), basically, the size change of the element scene is performed while maintaining the original image aspect ratio of the element scene regardless of the drag direction. If only the rate or the reduction rate is changed, the user can easily understand and the operation is simple. On the other hand, if the size can be changed without maintaining the aspect ratio, the size can be changed more flexibly. In that case, the operation may be changed depending on the starting point of the drag. If the starting point of the drag is the corner of the boundary portion, the vertical and horizontal sizes of the element scene are simultaneously changed according to the drag operation. If the drag viewpoint is one of the four sides excluding the corner, dragging the vertical side changes the size in the horizontal direction, and if it is a horizontal side, changes the size in the vertical direction. The image aspect ratio specified as a result of such dragging is different in many cases from the aspect ratio of the input image data corresponding to the element scene. In that case, the input image data is scaled at different magnifications in the vertical and horizontal directions, or the input image data is trimmed in accordance with a new image aspect ratio. In the latter case, in order to cut out as large an area as possible with a new image aspect ratio, the following is performed. Hereinafter, the size of the input image data corresponding to the element scene is set to ws0: hs0 (horizontal: vertical), and the new size is set to ws1: hs1.

If ws0 / hs0 <ws1 / hs1, the top and bottom of the input image data corresponding to the element scene are deleted in a horizontal band to form an image with the number of pixels ws0: (hs1 × ws1 / ws0) and reduced to ws1: hs1. .

If ws0 / hs0> ws1 / hs1, the left and right sides of the input image data corresponding to the element scene are deleted in a vertical band shape to form an image with the number of pixels (ws1 × hs1 / hs0): hs0 and reduced to ws1: hs1. . When an important object such as a main person is included in the element scene, the image may be simply trimmed so that the new image aspect ratio is obtained centering on the object.

The digest moving image editing control unit 107 changes the arrangement information in the editing target scene to arrange the element scene whose size has been adjusted as described above, and uses the input image data corresponding to the element scene to edit the editing target scene Is generated again. When any element scene disappears from the screen as a result of the size change, the load when generating the edit target scene can be reduced by deleting the element scene from the arrangement information of the edit target scene.

FIG. 34 (g) shows an example in which a twist operation is performed in a combination scene. Each time a twist operation is performed, the digest moving image editing control unit 107 selects another arrangement pattern from the possible arrangement patterns using the input image data corresponding to the element scene included in the scene to be edited, and edits it. The target scene is generated again. In the present embodiment, since all input image data is stored, only the arrangement order of arrangement is changed as in the examples shown in FIGS. 33B and 33C, and the arrangement pattern is changed to change the overlapping manner. It is also possible to do. Thereby, a preferable arrangement pattern for the user can be easily selected.

In the example of FIG. 34 (g), when the twist operation is performed near the boundary of the element scene, the arrangement pattern may be changed so that only the element scene near the boundary is changed. Further, when the boundary is a boundary between the main scene and the sub-scene, an arrangement in which the assignment of the main scene and the sub-scene is changed may be included in the possible arrangement pattern. In the example shown in FIG. 34 (g), the original main scene is used as a new sub-scene and the original sub-scene is replaced with a new main scene without changing the sub-screen layout. Thereby, even when the user feels that the scene set as the sub-scene is more important than the main scene, the arrangement pattern can be easily changed.

The digest moving image editing control unit 107 changes the arrangement information according to the new arrangement pattern, and again generates the editing target scene based on the arrangement information. As described above, the video editing apparatus that stores the input image data and the arrangement information used for generating the digest moving image can generate the digest moving image again based on a simple operation at the time of reproduction, which is more preferable for the user. Can be modified to a moving image.

Note that some of the video editing apparatuses 100, 100a, 100b, 100c, and 100e in the above-described embodiment, for example, the image data classification unit 101, the scene information generation unit 102, the digest moving

image generation units

103, 103a, 103b, and 103c, 103d, event selection unit 104, output control unit 105, video display unit 106, digest moving image editing control unit 107, operation unit 108, target image data extraction unit 109, reproduction time candidate derivation unit 110, and reproduction time candidate display unit 111 It may be realized by a computer. In that case, the program for realizing the control function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read by a computer system and executed. Here, the “computer system” is a computer system built in the video editing apparatuses 100, 100a, 100b, and includes an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to a portable medium such as a memory card, a magneto-optical disk, a CD-ROM, a DVD-ROM, a storage device such as a hard disk built in a computer system, an SSD. . Furthermore, the “computer-readable recording medium” is a medium that dynamically holds a program for a short time, such as a communication line when transmitting a program via a network such as the Internet or a communication line such as a telephone line, In this case, it is possible to include a server or client that holds a program for a certain period of time, such as a volatile memory inside a computer system. The program may be a program for realizing a part of the functions described above, and may be a program capable of realizing the functions described above in combination with a program already recorded in a computer system.

Further, a part or all of each video editing apparatus in the above-described embodiment may be realized as an integrated circuit such as an LSI (Large Scale Integration). Each functional block of the video editing apparatus may be individually made into a processor, or a part or all of them may be integrated into a processor. Further, the method of circuit integration is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. Further, in the case where an integrated circuit technology that replaces LSI appears due to progress in semiconductor technology, an integrated circuit based on the technology may be used.

(Summary)
The video editing apparatus according to the first aspect of the present invention includes a scene information generation unit that divides an image data group including a moving image into one or more scenes and generates scene information indicating features of a scene unit, and the scene A digest moving image generating unit that generates a digest moving image of the image data based on information, wherein the digest moving image generating unit generates a digest moving image based on the scene information Whether or not to use each scene, whether to place multiple scenes in the same frame, and the spatial arrangement pattern of scenes when placing multiple scenes in the same frame It is said.

According to the above-described configuration, a large number and a large number of still images and moving images can be confirmed and viewed in a short time without trouble. Furthermore, it is easy to see according to the size and shape of the screen displaying the image, and the image can be viewed without getting tired.

The video editing apparatus according to aspect 2 of the present invention is the video editing apparatus according to aspect 1, wherein the digest moving image generating unit compares scene information of a plurality of scenes that are temporally close to each other, and is a type of scene based on the comparison result. A scene and a sub-scene are determined, and at least two or more main scenes are included in the same frame as the spatial arrangement pattern of the plurality of scenes based on the relationship of scene types between scenes that are temporally close to each other. "Parallel arrangement" pattern, the main scene and the sub-scene are arranged in the center area of the screen, and the "center arrangement" pattern, the main scene is arranged so that the sub-scene is located around the main scene area. It may be selected and selected from any of “sub-screen arrangement” patterns that are displayed over the entire frame and arranged so that the sub-scene is superimposed on a part of the area.

In the video editing apparatus according to aspect 3 of the present invention, in the aspect 2, the digest moving image generating unit arranges the main scene in the center area of the screen and the sub-scene so as to be located around the area of the main scene. When the “center arrangement” pattern is selected, a spatial filter may be applied to the sub-scene to differentiate it from the main scene area in terms of image sharpness or color tone.

The video editing apparatus according to Aspect 4 of the present invention is the video editing apparatus according to any one of Aspects 1 to 3, wherein the digest moving image generating unit is further configured to generate a digest moving image in units of image data groups to be generated as a digest moving image. The number of generations may be counted, and the arrangement pattern for arranging a plurality of scenes may be changed according to the number of generations.

The video editing apparatus according to aspect 5 of the present invention is the video editing apparatus according to any one of aspects 1 to 4, wherein the scene information generation unit is information indicating the number of feature areas in the image frame in units of scenes. And “maximum region size” which is information indicating the size of the region having the largest area among the feature regions, and information indicating the position in the image of the region having the maximum area among the feature regions. `` Maximum region position '' is generated as a part of the scene information, and the digest moving image generation unit, when arranging a plurality of scenes in the same frame, based on each information indicated by the scene information, The strength of the filter when applying the spatial filter to the image area to be cut out as the main scene or the sub-scene may be varied.

The video editing apparatus according to aspect 6 of the present invention is the video editing apparatus according to any one of the aspects 1 to 5, wherein the video editing apparatus generates a digest moving image based on an output condition including characteristics of an output device that outputs the digest moving image. An output control unit for determining a policy and notifying the determined generation policy to the digest moving image generation unit, wherein the digest moving image generation unit generates a digest moving image based on the generation policy and the scene information; You can do it.

The video editing apparatus according to Aspect 7 of the present invention is the video editing apparatus according to any one of Aspects 1 to 6, wherein the video editing apparatus stores image data in units of events based on metadata indicating a shooting condition included in the image data. An image data classifying unit for classifying, and an event selecting unit for selecting, from the image data classified into event units, an image data group composed of image data whose metadata conforms to a predetermined condition as a digest moving image generation target The digest moving image generation unit may generate a digest moving image with the image data group selected by the event selection unit as an input.

The video editing apparatus according to aspect 8 of the present invention, in aspect 1, determines a digest moving image generation policy based on an output condition including characteristics of an output device that outputs a digest moving image, and the determined generation policy is An output control unit for notifying the digest moving image generation unit may be further provided, and the digest moving image generation unit may determine a spatial arrangement pattern of the scene in the digest moving image based on the generation policy and the scene information. .

The video editing apparatus according to aspect 9 of the present invention is the video editing apparatus according to aspect 3, wherein the scene information generation unit includes information indicating the number of feature areas in an image frame in units of scenes, and the feature area. "Maximum region size" that is information indicating the size of a region having the largest area, and "maximum region position" that is information indicating the position in the image of the region having the largest area among the feature regions Is generated as a part of the scene information, and the digest moving image generation unit sets the “number of areas” or “maximum area size” indicated by the scene information when arranging a plurality of scenes in the same frame. Based on the size, the strength of the filter when applying the spatial filter to the sub-scene may be varied.

In the video editing apparatus according to aspect 10 of the present invention, in the

aspect

2 or 3, the digest moving image generation unit further determines the number of times that the digest moving image is generated in units of image data groups that are the generation targets of the digest moving image. The arrangement pattern at the time of arranging a plurality of scenes may be changed according to the number of generations.

The video editing apparatus according to aspect 11 of the present invention is the video editing apparatus according to aspect 8, wherein the digest moving image generation unit encodes a digest moving image based on the generation policy, and a code used when encoding the digest moving image. The conversion quality may be determined.

The video editing apparatus according to the twelfth aspect of the present invention provides a playback time candidate derivation unit for deriving a playback time candidate of a digest video based on an image data group, presents the playback time candidate to the user, and based on a user event. A playback time candidate display unit that sets a designated time, a scene information generation unit that divides an image data group including moving images into one or more scenes, generates an image clip based on the scene, and A video editing apparatus including a digest video generation unit that generates a digest video by temporally combining clips, wherein the digest video generation unit includes a reproduction time of the digest video and the specified time. It is characterized by performing such adjustment.

According to the above-described configuration, a large number and a large number of still images and moving images can be confirmed and viewed in a short time without trouble. Furthermore, the image can be viewed by various display methods and can be viewed at a time desired by the user.

In the video editing apparatus according to aspect 13 of the present invention, in the aspect 12, the digest moving image generation unit may shorten the reproduction time of the image clip with less movement as the designated time becomes shorter.

In the video editing apparatus according to aspect 14 of the present invention, in the aspect 13, the digest moving image generation unit may shorten the reproduction time by thinning out the frames of the image clip.

In the video editing apparatus according to aspect 15 of the present invention, in the aspect 12, the digest moving image generation unit may lengthen the reproduction time of the image clip with less movement as the designated time increases.

In the video editing apparatus according to aspect 16 of the present invention, in the aspect 15, the digest moving image generation unit may extend the reproduction time by interpolating the frame of the image clip.

In the video editing apparatus according to aspect 17 of the present invention, in the aspect 12, the digest moving image generating unit classifies the scene as either a single scene using the scene alone or a plurality of scenes using a combination of multiple scenes. As the specified time becomes shorter, the ratio of the plurality of scenes constituting the digest moving image may increase.

In the video editing apparatus according to aspect 18 of the present invention, in the aspect 12, the reproduction time candidate derivation unit makes the reproduction time candidate shorter than the total reproduction time of the image data group, and the image data group The longer the total playback time, the longer the playback time candidate may be.

The video editing apparatus according to the nineteenth aspect of the present invention includes a scene information generation unit that divides an image data group including a moving image into one or more scenes and generates scene information indicating features of a scene unit, a digest video An output control unit that determines an image generation policy and notifies the digest moving image generation unit of the determined generation policy, and a scene in which a plurality of scenes are spatially arranged on the screen based on the scene information and the generation policy (Hereinafter, referred to as a combination scene), a digest moving image generating unit that generates a digest moving image of the image data group, a video display unit that displays video and operation information, and playing back the digest moving image A video editing apparatus comprising: a digest video editing control unit that outputs to the video display unit; and an operation unit that detects an operation input from the outside. It is characterized by changing the configuration of the digest moving image by the detected operation input by the operation unit.

According to the above-described configuration, a large number and a large number of still images and moving images can be confirmed and viewed in a short time without trouble. Furthermore, an image configured so that a large number and a large number of still images and moving images can be easily checked and viewed can be easily corrected at the time of reproduction so as to have a configuration preferable for the user.

The video editing apparatus according to aspect 20 of the present invention is the video editing apparatus according to aspect 19, wherein the video editing apparatus deletes a part of a scene constituting a scene or combination scene specified by the operation input from the digest moving image. Good.

In the video editing apparatus according to aspect 21 of the present invention, in the aspect 19, the video editing apparatus may change a spatial arrangement pattern of the combination scene designated by the operation input.

In the video editing apparatus according to aspect 22 of the present invention, in the aspect 19, the video editing apparatus may filter a moving image with respect to an area designated by the operation input.

In the video editing apparatus according to aspect 23 of the present invention, in the aspect 19, the video editing apparatus may add newly captured image data to the digest moving image by the operation input.

The video editing apparatus according to aspect 24 of the present invention is the video editing apparatus according to aspect 19, wherein the video editing apparatus selects any scene constituting the combination scene as a single scene from the combination scene specified by the operation input. You may insert in the said digest moving image before or after the said combination scene in time.

The video editing apparatus according to Aspect 25 of the present invention is the video editing apparatus according to any one of Aspects 19 to 24, wherein the video editing apparatus is an image used for generating the digest moving image by an operation input during reproduction of the digest moving image. And the content of the digest moving image may be changed using information indicating the spatial and temporal arrangement of the image.

In the aspect 19, the video editing apparatus according to aspect 26 of the present invention may delete a part of the scenes constituting the combination scene designated by the operation input from the combination scene.

In the aspect 25, the video editing apparatus according to aspect 27 of the present invention may change a spatial arrangement pattern of the combination scene by the operation input.

In the video editing apparatus according to aspect 26 of the present invention, in the aspect 25, any one of the scenes constituting the combination scene from the combination scene specified by the operation input as the single scene in the digest moving image. It may be inserted before or after the combined scene.

As described above, the embodiments of the present invention have been described in detail with reference to the drawings. However, the specific configuration is not limited to that described above, and various design changes can be made without departing from the scope of the present invention. Etc. are possible.

The present invention can be suitably applied to a video editing apparatus that generates a so-called digest moving image by inputting a still image or a moving image.

100, 100a, 100b, 100c ... Video editing apparatus 101 ... Image

data classification unit

102, 102a ... Scene

information generation unit

103, 103a, 103b, 103c, 103d ... Digest moving image generation unit 104 ... Event selection unit 105 ... Output control unit 106 ... Video display unit 107 ... Digest video editing control unit 108 ... Operation unit 109 ... Target image data extraction unit 110 ... Reproduction time candidate derivation unit 111 ... Reproduction time

candidate display units

301, 302 ...

Image data groups

200, 303, 400 , 800 ... Scene information 304 ... Selection information 305 ... Digest moving image generation policy 307 ... Digest moving

images

1031, 1031b ... Target

image extracting units

1032, 1032a, 1032d ... Scene

type determining units

1033, 1033a, 1033b ... Scene

space arranging unit

1034 , 103 b ... scene time arranging unit 1035 ... digest of the control unit 1036 ... digest moving image editing unit

Claims

A scene information generation unit that divides an image data group including a moving image into one or more scenes and generates scene information indicating features of each scene;
A video editing device comprising a digest video generation unit that generates a digest video of the image data group based on the scene information,
The digest moving image generating unit determines whether to use each scene when generating a digest moving image based on the scene information, whether to arrange a plurality of scenes in the same frame, and a plurality of scenes. A video editing apparatus characterized by determining a spatial arrangement pattern of scenes when arranged in the same frame.
The digest moving image generation unit compares scene information of a plurality of scenes that are temporally close to each other, determines a main scene and a sub-scene that are types of scenes based on the comparison result, and further, scenes that are temporally close to each other As a spatial arrangement pattern of the plurality of scenes based on the relationship of the scene types, at least,
A “parallel arrangement” pattern in which two or more main scenes are arranged in the same frame,
A “center placement” pattern that places the main scene and the sub-scene so that the main scene is placed in the center area of the screen and the sub-scene is located around the area of the main scene.
The main scene is displayed over the entire frame, and is selected and determined from any of the “small screen layout” patterns arranged so that the sub-scene is superimposed on a part of the area. The video editing device described in 1.
The digest moving image generator generates a spatial filter for the sub-scene when selecting the “center arrangement” pattern in which the main scene is arranged in the center area of the screen and the sub-scene is arranged so as to be positioned around the area of the main scene. 3. The video editing apparatus according to claim 2, wherein the video editing apparatus is applied to make a difference from an area of the main scene in terms of image sharpness or color tone.
The digest moving image generation unit further counts the number of times the digest moving image is generated in units of the image data group that is the target of the digest moving image generation, and changes the arrangement pattern when multiple scenes are arranged according to the number of generations. The video editing apparatus according to any one of claims 1 to 3, wherein the video editing apparatus is characterized in that:
The scene information generating unit is information indicating the “number of regions” that is information indicating the number of feature regions in an image frame in units of scenes and the size of a region having the largest area among the feature regions. Generating a `` maximum region size '' and a `` maximum region position '' that is information indicating a position in the image of the region having the largest area among the feature regions, as a part of the scene information;
The digest moving image generation unit applies a spatial filter to an image region to be cut out as a main scene or a sub-scene based on each information indicated by the scene information when a plurality of scenes are arranged in the same frame. 5. The video editing apparatus according to claim 1, wherein the strength of the filter is variable.
The video editing apparatus determines a digest moving image generation policy based on an output condition including characteristics of an output device that outputs a digest moving image, and notifies the digest moving image generation unit of the determined generation policy Further comprising
6. The video editing apparatus according to claim 1, wherein the digest moving image generating unit generates a digest moving image based on the generation policy and the scene information.
The video editing apparatus includes: an image data classification unit that classifies image data in units of events based on metadata indicating image capturing conditions included in the image data;
An event selection unit that selects, from the image data classified into event units, an image data group composed of image data whose metadata conforms to a predetermined condition as a digest moving image generation target;
7. The video editing apparatus according to claim 1, wherein the digest moving image generation unit generates a digest moving image with the image data group selected by the event selection unit as an input. 8. .
A playback time candidate derivation unit for deriving a digest video playback time candidate based on the image data group;
A playback time candidate display unit that presents the playback time candidates to a user and sets a designated time based on a user event;
A scene information generation unit that divides an image data group including a moving image into one or more scenes;
A video editing device comprising: a digest moving image generating unit that generates an image clip based on the scene and generates a digest moving image by temporally combining the image clips;
The digest moving image generating unit
An image editing apparatus, wherein adjustment is performed such that a reproduction time of the digest moving image becomes the specified time.
The digest moving image generating unit
9. The video editing apparatus according to claim 8, wherein the playback time of the image clip with little motion is shortened as the designated time is shortened.
The digest moving image generating unit
The video editing apparatus according to claim 9, wherein the playback time is shortened by thinning out frames of the image clip.
The digest moving image generating unit
9. The video editing apparatus according to claim 8, wherein the playback time of the image clip with little motion is lengthened as the designated time becomes longer.
The digest moving image generating unit
The video editing apparatus according to claim 11, wherein a reproduction time is extended by interpolating a frame of the image clip.
The digest moving image generating unit
Classify the scene as either a single scene using a single scene or a plurality of scenes using a combination of multiple scenes.
The video editing apparatus according to claim 8, wherein a proportion of the plurality of scenes constituting the digest moving image increases as the designated time becomes shorter.
A scene information generation unit that divides an image data group including a moving image into one or more scenes and generates scene information indicating features of each scene;
An output control unit that determines a digest moving image generation policy, and notifies the digest moving image generation unit of the determined generation policy;
Based on the scene information and the generation policy, a digest moving image generating unit that generates a digest moving image of the image data group including a scene in which a plurality of scenes are spatially arranged in a screen (hereinafter referred to as a combined scene). When,
A video display unit for displaying video and operation information;
A digest video editing control unit that reproduces the digest video and outputs the digest video to the video display unit;
A video editing device including an operation unit for detecting an operation input from the outside,
A video editing apparatus, wherein the configuration of the digest moving image is changed by an operation input detected by the operation unit.
15. The video editing apparatus according to claim 14, wherein the video editing apparatus deletes a part of a scene constituting a scene designated by the operation input or a combination scene from the digest moving image.
15. The video editing apparatus according to claim 14, wherein the video editing apparatus changes a spatial arrangement pattern of a combination scene designated by the operation input.
15. The video editing apparatus according to claim 14, wherein the video editing apparatus filters a moving image with respect to an area designated by the operation input.
15. The video editing apparatus according to claim 14, wherein the video editing apparatus adds newly captured image data to the digest moving image by the operation input.
The video editing device inserts any scene constituting the combination scene from the combination scene specified by the operation input as a single scene before or after the combination scene in the digest moving image. The video editing apparatus according to claim 14, wherein:
The video editing apparatus uses the image used for generating the digest moving image and information indicating the spatial and temporal arrangement of the image by an operation input at the time of reproduction of the digest moving image, and the digest moving image. 20. The video editing apparatus according to claim 14, wherein the content of the image is changed.