CN113269854B - Method for intelligently generating interview-type comprehensive programs - Google Patents

Method for intelligently generating interview-type comprehensive programs

Info

Publication number
CN113269854B
CN113269854B (application CN202110803384.0A, published as CN113269854A)
Authority
CN
China
Prior art keywords
face
frame
video
channel
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110803384.0A
Other languages
Chinese (zh)
Other versions
CN113269854A (en)
Inventor
袁琦
李�杰
杨瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sobei Video Cloud Computing Co ltd
Original Assignee
Chengdu Sobei Video Cloud Computing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sobei Video Cloud Computing Co ltd filed Critical Chengdu Sobei Video Cloud Computing Co ltd
Priority to CN202110803384.0A priority Critical patent/CN113269854B/en
Publication of CN113269854A publication Critical patent/CN113269854A/en
Application granted granted Critical
Publication of CN113269854B publication Critical patent/CN113269854B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation

Abstract

The invention discloses a method for intelligently generating interview-type variety programs, which comprises the following steps: S1, recording, through multichannel recording software, the program videos shot by a plurality of cameras at the program site; S2, setting the role played by each channel material according to the picture shot by its camera; S3, extracting video features from each channel material; S4, generating a plurality of candidate video clips for each channel according to the extracted video features; and S5, selecting candidate video clips according to predefined rules and synthesizing the rough cut of the program. The invention can quickly generate a rough cut that later-stage editors can quickly trim into the finished film, reducing the manual workload.

Description

Method for intelligently generating interview-type comprehensive programs
Technical Field
The invention relates to the field of video program synthesis, and in particular to a method for intelligently generating interview-type variety programs.
Background
An interview-type program is a television format with a relaxed and pleasant atmosphere, built mainly around conversation between a host and guests on a given topic. An interview-type variety program is an interview program aimed chiefly at entertainment and relaxation, with additional variety elements and comedic situation design added for dramatic effect, and is therefore widely favored. Its guests are mainly entertainment celebrities and sports stars, so such programs tend to be very popular among young audiences. Although, unlike other variety programs, they are usually shot on a single set and stage, a large number of cameras must still be arranged on site. During shooting, the pictures taken from different angles and at different scene scales are used to synthesize the rough cut of the program through a series of complicated operations, such as real-time coordination between the on-site director and each camera crew and shot switching, which requires the director to have rich command experience and strong on-site ability.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for intelligently generating interview-type variety programs, which can quickly generate a rough cut for later-stage editors to trim into the finished film, thereby reducing the manual workload.
The purpose of the invention is realized by the following scheme:
a method for intelligently generating interview-type integrated art programs comprises the following steps:
s1, recording program video materials shot by a plurality of cameras on a program site through multichannel recording software;
s2, setting the role played by each channel material according to the camera shooting picture in the program video;
s3, extracting video characteristics of each channel material;
s4, generating a plurality of candidate video clips in each channel according to the extracted video features;
and S5, selecting candidate video clips according to predefined rules, and synthesizing the rough cut of the program.
Further, in step S2, the setting of the role played by each channel material includes the following steps: the channel materials are divided into three categories according to shot scale, namely close shot, medium shot and long shot; the picture of a close shot is a close-up of a guest or the host; the picture of a medium shot is interaction between guests, between guests and the host, or between hosts; the picture of a long shot is the whole stage.
Further, in step S3, the following steps are included:
s31, establishing a face library containing the host and the guest of the field program;
s32, performing face recognition analysis on the video material of each channel, and extracting face frame coordinates, face 68 key point coordinates and corresponding names in each frame;
s33, performing picture stability analysis on the video material of each channel, and marking a blurred picture caused by camera movement or focusing error;
and S34, using the data in the step S31 and the face key point data of the same person in continuous time dimension, carrying out mouth shape analysis and judging whether the person is speaking in the set time.
Further, in step S31, assuming the program involves M persons in total, single photos of the host and guests appearing in the program are collected from the Internet, one photo per person, and 512-dimensional face features are extracted through a face recognition network as the representation of each person, yielding an M×512 feature matrix E and an M×1 name matrix P, where M is an integer and E_{i,j} and P_{i,j} denote the element in row i and column j of E and P, respectively.
Further, in step S32, suppose there are N channels of video material, each consisting of K frames, with all frames aligned on the timeline. Face recognition is performed on the t-th frame image I_{n,t} of the n-th material V_n to obtain the processing result set of the frame,

R_{n,t} = {E_t, B_t, L_t, N_t},

where E_t denotes the face feature matrix extracted from frame t, with m_t the number of detected faces; E_t^j denotes the j-th face feature extracted from frame t; B_t denotes all the face frames detected in frame t, and B_t^j the j-th face frame; L_t denotes all the face key points detected in frame t, and L_t^j the key points of the j-th face; N_t denotes the names recognized for the faces detected in frame t, and N_t^j the name of the j-th person, obtained as

N_t^j = P_k, with k = argmax_i sim(E_t^j, E_i),

that is, the name with the highest similarity in the face library is taken as the name corresponding to the face, where P_k denotes the k-th name, argmax takes the index corresponding to the maximum value, and sim(·,·) is the similarity function. The result of extracting video features from all materials is expressed as R = {R_{n,t} | n = 1, …, N; t = 1, …, K}.
Further, in step S33, for the t-th frame image I_{n,t} of the n-th material V_n, with width W and height H, a picture stability score S_{n,t} is computed to characterize whether the frame picture is stable:

F = fftshift(FFT(gray(I_{n,t}))),
A = |F|,
θ = α · max(A),
S_{n,t} = C / (W · H),

where gray(·) converts the frame image to a grayscale image, FFT(·) denotes the Fourier transform, fftshift(·) shifts the zero-frequency component to the center of the spectrum, |·| takes the absolute value, A is the absolute value of F, F is the result of transforming the grayscale image of I_{n,t} to the frequency domain and shifting the zero-frequency component to the center of the spectrum, θ is a threshold set as a fixed fraction α of the maximum value in A, and C is the number of pixels in A greater than the threshold. When S_{n,t} is larger than a set empirical value, the picture of image I_{n,t} is considered stable.
Further, in step S34, for the n-th material V_n, a fixed time window of size T (i.e., a fixed duration T) is taken, and the face key point data of the same person p within the window, {L_p^1, L_p^2, …, L_p^T}, is used to compute the mouth area at each time step,

A_p^j = area(L_p^j),

and then the variance of the person's mouth area over the window,

V_p = (1/T) · Σ_{j=1}^{T} (A_p^j − Ā_p)²,

where Ā_p denotes the mean mouth area of the person over the window, L_p^j denotes the face key points of person p at time j, and area(·) computes the mouth area from the key points. When V_p is larger than a set empirical value, the person named p is considered to be speaking during the time period and is marked as a speaker.
Further, in step S4, the following steps are included:
S41, generating initial candidate video clips for each channel according to the picture stability results obtained by analyzing each channel material in step S33: for the all-frame analysis results {S_{n,1}, S_{n,2}, …, S_{n,K}} of the n-th material V_n, all results are traversed; when S_{n,t} is greater than the set empirical value, frame t is marked as the in point of the current candidate segment; the subsequent results are traversed continuously, and when S_{n,t} is less than or equal to the set empirical value, frame t is marked as the out point of the current candidate segment; and so on, generating for material V_n an initial candidate segment list C_n = {c_1, c_2, …, c_{Q_n}} containing Q_n candidate segments;
S42, traversing the initial candidate segment list generated in S41
Figure 843844DEST_PATH_IMAGE068
Comparing the current segment
Figure 145512DEST_PATH_IMAGE069
Out point of
Figure 302824DEST_PATH_IMAGE070
With the next segment
Figure 221102DEST_PATH_IMAGE071
In the point of entry
Figure 372596DEST_PATH_IMAGE072
If, if
Figure 59929DEST_PATH_IMAGE073
If the value is larger than the set empirical value, the segment is divided
Figure 388142DEST_PATH_IMAGE069
And fragments thereof
Figure 262557DEST_PATH_IMAGE071
Are combined into
Figure 967208DEST_PATH_IMAGE074
At the point of entry is
Figure 509048DEST_PATH_IMAGE069
In the point of entry
Figure 8162DEST_PATH_IMAGE075
At the point of departure is
Figure 635453DEST_PATH_IMAGE071
Out point of
Figure 612636DEST_PATH_IMAGE076
And so on, generating a final candidate segment list
Figure 510447DEST_PATH_IMAGE077
Further, in step S5, the following steps are included:
S51, setting a priority for each channel material according to the scale of its shot picture;
S52, combining the final candidate segment lists C'_1, …, C'_N of the N channel materials from step S42 with the speaker marking results of step S34, the segments in the final candidate list of each channel material are filled into the rough-cut timeline according to the following rules (the higher the priority, the earlier the rule is applied) to obtain the final composite video:
the segment is a close shot, there is a speaker, and the speaker is a guest;
the segment is a close shot, there is a speaker, and the speaker is the host;
the segment is a medium shot, there are speakers, and the number of speakers is not more than 3;
the segment is a long shot.
Further, in step S51, the priority is set as: close shot > medium shot > long shot. Further, in step S52, a timeline gap-filling method is adopted according to the above rules: at the current time, the most suitable candidate segment is selected from the final candidate lists C'_n, the segment is filled into the timeline used to generate the rough cut, the current time is updated to the time corresponding to the out point of that candidate segment, and this is repeated until the timeline for generating the rough cut is completely filled.
The beneficial effects of the invention include:
(1) By observing the on-site directing and shot-switching logic used when shooting interview-type variety programs, the method of the invention provides a rough-cut generation approach based on video face recognition, speaker recognition and picture stability analysis, which extracts the most appropriate shot segments from pictures taken from different angles and automatically generates the rough cut of an interview-type variety program, thereby reducing the workload of the director and of later-stage program editors.
(2) The invention provides a simple and efficient method for automatically synthesizing the rough cut of an interview-type variety program with only a small amount of presetting. Specifically, roles are assigned to the pictures shot by the different on-site cameras according to shot scale, the host and guests are labeled through face recognition, speakers are labeled through mouth-shape analysis, invalid shots are filtered out by computing picture stability scores to generate candidate video clip lists, and finally all candidate video clips are combined according to the rules to generate the rough cut of the program. The method thereby quickly generates a rough cut that later-stage editors can trim into the finished film, reducing the manual workload.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of steps in an embodiment of a method of the present invention;
FIG. 2 is a flow chart of the method embodiment of the present invention for extracting visual features from a channel material.
Detailed Description
All features disclosed in the embodiments of this specification, and all steps of any method or process disclosed, may be combined, expanded or substituted in any way, except where features and/or steps are mutually exclusive.
As shown in fig. 1 and 2, a method for intelligently generating interview-type variety programs includes the following steps:
s1, recording program videos shot by a plurality of cameras on a program site through multichannel recording software;
for example, in this step, videos of programs shot by 6 cameras at the scene of the program "when going on the spring and evening" are recorded, respectively
Figure 592355DEST_PATH_IMAGE079
(ii) a Other programs may also be recorded, and the number of cameras may be 8, 10, 12, and the like, which is not described herein again.
S2, setting the role played by each channel material according to the camera shooting picture in the program video;
The role played by each channel material is set according to the pictures shot by its camera. Specifically, the channels whose fixed cameras shoot close-ups are set as close shots, the channels whose fixed cameras shoot medium shots are set as medium shots, the channels whose fixed cameras shoot the whole stage are set as long shots, and the channel shot by the rocker-arm camera is also set as a long shot.
In step S2, the setting of the role played by each channel material includes the following steps: the channel materials are divided into three categories according to shot scale, namely close shot, medium shot and long shot; the picture of a close shot is a close-up of a guest or the host; the picture of a medium shot is interaction between guests, between guests and the host, or between hosts; the picture of a long shot is the whole stage.
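A minimal sketch of how this per-channel role configuration might be represented in practice; the channel identifiers, the exact assignment of shots to channels and the priority values are illustrative assumptions rather than values fixed by the method:

```python
# Illustrative channel-role configuration (assumed identifiers and assignment).
# Shot scale: "close" (close-up), "medium" (interaction), "long" (whole stage).
CHANNEL_ROLES = {
    "V1": {"shot": "close",  "camera": "fixed"},
    "V2": {"shot": "close",  "camera": "fixed"},
    "V3": {"shot": "medium", "camera": "fixed"},
    "V4": {"shot": "medium", "camera": "fixed"},
    "V5": {"shot": "long",   "camera": "fixed"},
    "V6": {"shot": "long",   "camera": "rocker-arm"},
}

# Priority used later in step S51: close shot > medium shot > long shot.
SHOT_PRIORITY = {"close": 3, "medium": 2, "long": 1}
```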
S3, extracting video features for each channel material, in step S3, the method includes the following steps:
s31, establishing a face library containing the host and the guest of the field program;
In step S31, assuming the program involves M persons in total, single photos of the host and guests appearing in the program are collected from the Internet, one photo per person, and 512-dimensional face features are extracted through a face recognition network as the representation of each person, yielding an M×512 feature matrix E and an M×1 name matrix P, where M is an integer and E_{i,j} and P_{i,j} denote the element in row i and column j of E and P, respectively.
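The following is a minimal sketch of building such a face library, in which the hypothetical extract_embedding function stands in for any 512-dimensional face recognition network; it is an assumed placeholder, not a component specified by the method:

```python
import numpy as np

def extract_embedding(photo_path: str) -> np.ndarray:
    """Placeholder for a face recognition network that returns a
    512-dimensional embedding for the single face in the photo."""
    raise NotImplementedError("plug in any 512-d face embedding model here")

def build_face_library(photos: dict) -> tuple:
    """photos maps a person's name to the path of a single photo.
    Returns (E, names): an M x 512 feature matrix and the list of M names."""
    names = list(photos.keys())                      # name matrix P (M x 1)
    feats = [extract_embedding(photos[n]) for n in names]
    E = np.stack(feats).astype(np.float32)           # feature matrix E (M x 512)
    # L2-normalise so that a dot product acts as cosine similarity later.
    E /= np.linalg.norm(E, axis=1, keepdims=True) + 1e-12
    return E, names
```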
S32, performing face recognition analysis on the video material of each channel, and extracting face frame coordinates, face 68 key point coordinates and corresponding names in each frame; in step S32, if there is any
Figure 756041DEST_PATH_IMAGE011
The video materials of the channels, here, N =6 (may be other numbers), each of which is a video material of one channel
Figure 546143DEST_PATH_IMAGE012
Frames, each frame having been aligned on a timeline, are passed through
Figure 566051DEST_PATH_IMAGE013
An individual material
Figure 99801DEST_PATH_IMAGE014
To (1) a
Figure 154344DEST_PATH_IMAGE015
Frame image
Figure 584189DEST_PATH_IMAGE016
Face recognition processing is carried out to obtain a processing result set of the frame
Figure 861367DEST_PATH_IMAGE017
Figure 933228DEST_PATH_IMAGE018
Wherein
Figure 842279DEST_PATH_IMAGE019
Denotes the first
Figure 974183DEST_PATH_IMAGE015
A face feature matrix obtained by frame extraction,
Figure 703104DEST_PATH_IMAGE020
in order to detect the number of faces,
Figure 578656DEST_PATH_IMAGE021
is shown as
Figure 607792DEST_PATH_IMAGE015
First of frame extraction
Figure 645018DEST_PATH_IMAGE010
The characteristics of the individual's face are,
Figure 893859DEST_PATH_IMAGE022
denotes the first
Figure 307523DEST_PATH_IMAGE015
All the face frames detected by the frame are,
Figure 925586DEST_PATH_IMAGE023
is shown as
Figure 399293DEST_PATH_IMAGE015
First of frame detection
Figure 368386DEST_PATH_IMAGE010
The number of the face frames is one,
Figure 585741DEST_PATH_IMAGE024
denotes the first
Figure 58310DEST_PATH_IMAGE015
The key points of all the faces detected by the frame,
Figure 702918DEST_PATH_IMAGE025
is shown as
Figure 159308DEST_PATH_IMAGE015
First of frame detection
Figure 413309DEST_PATH_IMAGE010
The key points of the face of the individual,
Figure 271544DEST_PATH_IMAGE026
denotes the first
Figure 821474DEST_PATH_IMAGE015
The face detected by the frame corresponds to the identified name,
Figure 765159DEST_PATH_IMAGE027
is shown as
Figure 58737DEST_PATH_IMAGE015
First of frame detection
Figure 771478DEST_PATH_IMAGE010
The name of the person corresponding to the individual person,
Figure 757889DEST_PATH_IMAGE028
namely, the name with the highest similarity in the face database is taken as the name corresponding to the face,
Figure 188870DEST_PATH_IMAGE029
is shown as
Figure 787604DEST_PATH_IMAGE030
The name of the individual person is used,
Figure 354851DEST_PATH_IMAGE031
indicating that the index corresponding to the maximum value is taken,
Figure 246584DEST_PATH_IMAGE032
representing a similarity calculation function. The result of extracting video special frames from all the materials is expressed as
Figure 899282DEST_PATH_IMAGE084
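A minimal sketch of the per-frame name assignment, assuming sim(·,·) is cosine similarity (the similarity function is not fixed by the text) and that a face detector has already produced the embeddings for the faces in one frame:

```python
import numpy as np

def assign_names(face_feats: np.ndarray, E: np.ndarray, names: list) -> list:
    """face_feats: m_t x 512 embeddings of the faces detected in one frame.
    E, names: the face library built in step S31 (E is L2-normalised).
    Returns the recognised name for each detected face (argmax of similarity)."""
    if face_feats.size == 0:
        return []
    f = face_feats / (np.linalg.norm(face_feats, axis=1, keepdims=True) + 1e-12)
    sims = f @ E.T                    # m_t x M cosine similarities
    best = sims.argmax(axis=1)        # index k with the highest similarity
    return [names[k] for k in best]
```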
S33, performing picture stability analysis on the video material of each channel, and marking a blurred picture caused by camera movement or focusing error; in step S33, for the second step
Figure 800242DEST_PATH_IMAGE013
An individual material
Figure 487575DEST_PATH_IMAGE034
To (1) a
Figure 815789DEST_PATH_IMAGE015
Frame image
Figure 955783DEST_PATH_IMAGE016
Given its width of
Figure 394855DEST_PATH_IMAGE035
High is
Figure 429370DEST_PATH_IMAGE036
By counting the picture stability scores
Figure 928485DEST_PATH_IMAGE037
To characterize whether the frame of image picture is stable,
Figure 290196DEST_PATH_IMAGE038
,
Figure 532958DEST_PATH_IMAGE039
,
Figure 194884DEST_PATH_IMAGE040
,
Figure 864900DEST_PATH_IMAGE041
,
wherein the content of the first and second substances,
Figure 713907DEST_PATH_IMAGE042
is to show to
Figure 760360DEST_PATH_IMAGE016
The frame image is taken as a gray-scale image,
Figure 778257DEST_PATH_IMAGE043
which represents the fourier transform of the signal,
Figure 619174DEST_PATH_IMAGE044
representing the conversion of the 0 frequency component to the center of the spectrum,
Figure 689898DEST_PATH_IMAGE045
it is indicated that the absolute value is taken,
Figure 274463DEST_PATH_IMAGE046
is composed of
Figure 910981DEST_PATH_IMAGE047
The absolute value of (a) is,
Figure 922799DEST_PATH_IMAGE047
is composed of
Figure 746399DEST_PATH_IMAGE016
The grayscale map of (a) is transformed to the frequency domain and the 0-frequency component is converted to the result of the center of the frequency spectrum,
Figure 869076DEST_PATH_IMAGE048
is a threshold value set as
Figure 360100DEST_PATH_IMAGE046
Of medium maximum value
Figure 41355DEST_PATH_IMAGE049
Figure 352250DEST_PATH_IMAGE050
Is composed of
Figure 13039DEST_PATH_IMAGE046
The number of pixels greater than the threshold value in
Figure 624149DEST_PATH_IMAGE037
When the image is larger than a certain preset value, the image is represented
Figure 712191DEST_PATH_IMAGE016
And (5) stabilizing the picture. In the present embodiment, for example, the preset value is taken as
Figure 510382DEST_PATH_IMAGE085
I.e. by
Figure 974862DEST_PATH_IMAGE086
Then represent the image
Figure 440478DEST_PATH_IMAGE016
And (5) stabilizing the picture.
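A minimal sketch of this frequency-domain stability score using NumPy; the fraction used for the spectrum threshold (1/1000 of the maximum magnitude) is an assumption for illustration, while the 0.002 decision value is the example given in the embodiment:

```python
import numpy as np

def stability_score(frame_rgb: np.ndarray) -> float:
    """Picture stability score S for one frame (H x W x 3 array).
    S = (number of spectrum magnitudes above a threshold) / (W * H)."""
    gray = frame_rgb.mean(axis=2)                # simple grayscale conversion
    spec = np.fft.fftshift(np.fft.fft2(gray))    # centred 2-D spectrum
    mag = np.abs(spec)
    theta = mag.max() / 1000.0                   # assumed fraction of the maximum
    count = int((mag > theta).sum())
    h, w = gray.shape
    return count / float(w * h)

def is_stable(frame_rgb: np.ndarray, preset: float = 0.002) -> bool:
    """Frame is considered stable when the score exceeds the preset value."""
    return stability_score(frame_rgb) > preset
```

Sharp frames keep more high-frequency energy, so a larger share of spectrum magnitudes exceeds the threshold and the score rises above the preset value, while blurred frames fall below it.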
And S34, using the data from step S31 together with the face key point data of the same person over a continuous time dimension, mouth-shape analysis is performed to judge whether the person is speaking within the set time. In step S34, for the n-th material V_n, a fixed time window of duration T is taken, and the face key point data of the same person p within the window, {L_p^1, L_p^2, …, L_p^T}, is used to compute the mouth area at each time step,

A_p^j = area(L_p^j),

and then the variance of the person's mouth area over the window,

V_p = (1/T) · Σ_{j=1}^{T} (A_p^j − Ā_p)²,

where Ā_p denotes the mean mouth area of the person over the window, L_p^j denotes the face key points of person p at time j, and area(·) computes the mouth area from the key points. When V_p is greater than a predetermined value (which may be 500, for example), the person named p is considered to be speaking during the time period and is marked as a speaker. In this embodiment T may, for example, be 250 units, selected according to the actual conditions.
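A minimal sketch of the mouth-area variance test, assuming the common 68-point landmark layout in which the outer lip contour occupies indices 48 to 59; this landmark convention and the shoelace-formula area are illustrative choices, not details fixed by the method:

```python
import numpy as np

MOUTH_IDX = list(range(48, 60))   # outer-lip points in the usual 68-point layout

def mouth_area(landmarks: np.ndarray) -> float:
    """Polygon area of the outer lip contour (landmarks: 68 x 2 array),
    computed with the shoelace formula."""
    pts = landmarks[MOUTH_IDX]
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def is_speaking(landmark_seq: list, var_threshold: float = 500.0) -> bool:
    """landmark_seq: one person's 68-point landmarks over a window of T frames.
    The person is marked as a speaker when the variance of the mouth area
    over the window exceeds the threshold (500 in the embodiment)."""
    areas = np.array([mouth_area(lm) for lm in landmark_seq])
    return float(areas.var()) > var_threshold
```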
S4, generating a plurality of candidate video clips in each channel according to the extracted video features; in step S4, the method includes the steps of:
S41, generating initial candidate video clips for each channel according to the picture stability results obtained by analyzing each channel material in step S33: for the all-frame analysis results {S_{n,1}, S_{n,2}, …, S_{n,K}} of the n-th material V_n, all results are traversed; when S_{n,t} is greater than a certain preset value (here the preset value can be 0.002, depending on the program), frame t is marked as the in point of the current candidate segment; the subsequent results are traversed continuously, and when S_{n,t} is less than or equal to the preset value (again 0.002 here, depending on the program), frame t is marked as the out point of the current candidate segment; and so on, generating for material V_n an initial candidate segment list C_n = {c_1, c_2, …, c_{Q_n}} containing Q_n candidate segments.
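A minimal sketch of turning the per-frame stability scores of one channel into initial candidate segments; the (in point, out point) frame-index convention used below is an assumption for illustration:

```python
def initial_segments(scores, threshold: float = 0.002):
    """scores: per-frame stability scores S_{n,1..K} of one channel.
    Returns a list of (in_frame, out_frame) pairs for runs of frames
    whose score exceeds the threshold."""
    segments, in_point = [], None
    for t, s in enumerate(scores):
        if s > threshold and in_point is None:
            in_point = t                        # open a candidate segment
        elif s <= threshold and in_point is not None:
            segments.append((in_point, t - 1))  # close it at the last stable frame
            in_point = None
    if in_point is not None:                    # segment still open at the end
        segments.append((in_point, len(scores) - 1))
    return segments
```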
S42, traversing the initial candidate segment list generated in S41
Figure 957380DEST_PATH_IMAGE068
Comparing the current segment
Figure 678211DEST_PATH_IMAGE069
Out point of
Figure 843613DEST_PATH_IMAGE070
With the next segment
Figure 940882DEST_PATH_IMAGE071
In the point of entry
Figure 773709DEST_PATH_IMAGE072
If, if
Figure 665442DEST_PATH_IMAGE073
Above a certain preset value (here, 50 frames) the segment is segmented
Figure 318140DEST_PATH_IMAGE069
And fragments thereof
Figure 219100DEST_PATH_IMAGE071
Are combined into
Figure 407898DEST_PATH_IMAGE074
At the point of entry is
Figure 736111DEST_PATH_IMAGE069
In the point of entry
Figure 610526DEST_PATH_IMAGE075
At the point of departure is
Figure 315177DEST_PATH_IMAGE071
Out point of
Figure 591438DEST_PATH_IMAGE076
And so on, generating a final candidate segment list
Figure 90552DEST_PATH_IMAGE077
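A minimal sketch of this merging pass, assuming the compared quantity is the gap between the next segment's in point and the current segment's out point and that the merge fires when it exceeds the empirical value, as stated above:

```python
def merge_segments(segments, gap_threshold: int = 50):
    """segments: list of (in_frame, out_frame) pairs sorted by in point.
    Adjacent segments are merged, following the rule described above, when the
    difference between the next in point and the current out point exceeds the
    threshold (50 frames in the embodiment)."""
    if not segments:
        return []
    merged = [segments[0]]
    for s_next, o_next in segments[1:]:
        s_cur, o_cur = merged[-1]
        if s_next - o_cur > gap_threshold:
            merged[-1] = (s_cur, o_next)  # keep c_i's in point, take c_{i+1}'s out point
        else:
            merged.append((s_next, o_next))
    return merged
```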
And S5, selecting candidate video clips according to predefined rules, and synthesizing the rough cut of the program. In step S5, the method includes the steps of:
S51, setting a priority for each channel material according to the scale of its shot picture. Specifically, among the 6 channel materials V_1 to V_6, the close-shot channels are given the highest priority, the medium-shot channels the second priority, and the long-shot channels the lowest priority;
S52, combining the final candidate segment lists C'_1, …, C'_N of the N channel materials from step S42 with the speaker marking results of step S34, the segments in the final candidate list of each channel material are filled into the rough-cut timeline according to the following rules to obtain the final composite video:
the segment is a close shot, there is a speaker, and the speaker is a guest;
the segment is a close shot, there is a speaker, and the speaker is the host;
the segment is a medium shot, there are speakers, and the number of speakers is not more than 3;
the segment is a long shot.
Further, in step S51, the priority is set as: close shot > medium shot > long shot. Further, in step S52, a timeline gap-filling method is adopted according to the above rules: at the current time, the most suitable candidate segment is selected from the final candidate lists C'_n, the segment is filled into the timeline used to generate the rough cut, the current time is updated to the time corresponding to the out point of that candidate segment, and this is repeated until the timeline for generating the rough cut is completely filled.
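A minimal sketch of this greedy timeline gap filling, assuming each candidate segment carries its channel, shot scale, time range and the marked speakers; the data layout, the role labels ("guest"/"host") and the scoring order are illustrative assumptions that mirror the rules above rather than a prescribed implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Candidate:
    channel: str
    shot: str                  # "close", "medium" or "long"
    in_time: float
    out_time: float
    speakers: List[str] = field(default_factory=list)
    speaker_roles: List[str] = field(default_factory=list)  # "guest" / "host"

def rule_score(c: Candidate) -> int:
    """Higher score = higher-priority rule, mirroring the rules listed above."""
    if c.shot == "close" and c.speakers and "guest" in c.speaker_roles:
        return 4
    if c.shot == "close" and c.speakers and "host" in c.speaker_roles:
        return 3
    if c.shot == "medium" and 0 < len(c.speakers) <= 3:
        return 2
    if c.shot == "long":
        return 1
    return 0

def fill_timeline(candidates: List[Candidate], total_duration: float) -> List[Candidate]:
    """Greedy gap filling: from the current time, pick the best-scoring candidate
    that covers it, append it to the rough cut and jump to its out point."""
    timeline, now = [], 0.0
    while now < total_duration:
        covering = [c for c in candidates if c.in_time <= now < c.out_time]
        if not covering:
            now += 1.0            # no candidate covers this moment; skip ahead
            continue
        best = max(covering, key=rule_score)
        timeline.append(best)
        now = best.out_time
    return timeline
```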
The parts not involved in the present invention are the same as or can be implemented using the prior art.
The above-described embodiment is only one embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be easily made based on the application and principle of the present invention disclosed in the present application, and the present invention is not limited to the method described in the above-described embodiment of the present invention, so that the above-described embodiment is only preferred, and not restrictive.
Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting and using knowledge or techniques of the relevant art, and features of various embodiments may be interchanged or substituted and such modifications and variations that may be made by those skilled in the art without departing from the spirit and scope of the present invention are intended to be within the scope of the following claims.
The functionality of the present invention, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), or an optical disk.

Claims (2)

1. A method for intelligently generating interview-type comprehensive programs is characterized by comprising the following steps:
s1, recording program videos shot by a plurality of cameras on a program site through multichannel recording software;
s2, setting the role played by each channel material according to the camera shooting picture in the program video; in step S2, the setting of the role played by each channel material includes the following steps: the channel materials are divided into three categories according to shot scale, namely close shot, medium shot and long shot; the picture of a close shot is a close-up of a guest or the host; the picture of a medium shot is interaction between guests, between guests and the host, or between hosts; the picture of a long shot is the whole stage;
s3, extracting video characteristics of each channel material; in step S3, the method includes the steps of:
S31, establishing a face library containing the host and guests of the program: assuming the program involves M persons in total, single photos of the host and guests appearing in the program are collected from the Internet, one photo per person, and 512-dimensional face features are extracted through a face recognition network as the representation of each person, yielding an M×512 feature matrix E and an M×1 name matrix P, where M is an integer and E_{i,j} and P_{i,j} denote the element in row i and column j of E and P, respectively;
S32, performing face recognition analysis on the video material of each channel, and extracting the face frame coordinates, the 68 face key point coordinates and the corresponding name in each frame: supposing there are N channels of video material, each consisting of K frames, with all frames aligned on the timeline, face recognition is performed on the t-th frame image I_{n,t} of the n-th material V_n to obtain the processing result set of the frame,

R_{n,t} = {E_t, B_t, L_t, N_t},

where E_t denotes the face feature matrix extracted from frame t, with m_t the number of detected faces; E_t^j denotes the j-th face feature extracted from frame t; B_t denotes all the face frames detected in frame t, and B_t^j the j-th face frame; L_t denotes all the face key points detected in frame t, and L_t^j the key points of the j-th face; N_t denotes the names recognized for the faces detected in frame t, and N_t^j the name of the j-th person, obtained as

N_t^j = P_k, with k = argmax_i sim(E_t^j, E_i),

that is, the name with the highest similarity in the face library is taken as the name corresponding to the face, where P_k denotes the k-th name, argmax takes the index corresponding to the maximum value, and sim(·,·) is the similarity function; the result of extracting video features from all materials is expressed as R = {R_{n,t} | n = 1, …, N; t = 1, …, K};
S33, performing picture stability analysis on the video material of each channel, and marking a blurred picture caused by camera movement or focusing error; in step S33, for the second step
Figure 883603DEST_PATH_IMAGE013
An individual material
Figure 344671DEST_PATH_IMAGE034
To (1) a
Figure 977778DEST_PATH_IMAGE015
Frame image
Figure 422666DEST_PATH_IMAGE016
Given its width of
Figure 166631DEST_PATH_IMAGE035
High is
Figure 13364DEST_PATH_IMAGE036
By counting the picture stability scores
Figure 817372DEST_PATH_IMAGE037
To characterize whether the frame of image picture is stable,
Figure 483977DEST_PATH_IMAGE038
,
Figure 757264DEST_PATH_IMAGE039
,
Figure 192925DEST_PATH_IMAGE040
,
Figure 167834DEST_PATH_IMAGE041
,
wherein the content of the first and second substances,
Figure 321735DEST_PATH_IMAGE042
is to show to
Figure 407503DEST_PATH_IMAGE016
The frame image is taken as a gray-scale image,
Figure 228828DEST_PATH_IMAGE043
which represents the fourier transform of the signal,
Figure 374639DEST_PATH_IMAGE044
representing the conversion of the 0 frequency component to the center of the spectrum,
Figure 15836DEST_PATH_IMAGE045
it is indicated that the absolute value is taken,
Figure 905294DEST_PATH_IMAGE046
is composed of
Figure 581126DEST_PATH_IMAGE047
The absolute value of (a) is,
Figure 163417DEST_PATH_IMAGE047
is composed of
Figure 288981DEST_PATH_IMAGE016
The grayscale map of (a) is transformed to the frequency domain and the 0-frequency component is converted to the result of the center of the frequency spectrum,
Figure 716551DEST_PATH_IMAGE048
is a threshold value set as
Figure 512469DEST_PATH_IMAGE046
Of medium maximum value
Figure 734502DEST_PATH_IMAGE049
Figure 350292DEST_PATH_IMAGE050
Is composed of
Figure 847132DEST_PATH_IMAGE046
The number of pixels greater than the threshold value in
Figure 497556DEST_PATH_IMAGE037
If the value is larger than the set empirical value, the image is represented
Figure 421650DEST_PATH_IMAGE016
The picture is stable;
S34, using the data from step S31 together with the face key point data of the same person over a continuous time dimension, performing mouth-shape analysis to judge whether the person is speaking within the set time: for the n-th material V_n, a fixed time window of size T is taken, and the face key point data of the same person p within the window, {L_p^1, L_p^2, …, L_p^T}, is used to compute the mouth area at each time step,

A_p^j = area(L_p^j),

and then the variance of the person's mouth area over the window,

V_p = (1/T) · Σ_{j=1}^{T} (A_p^j − Ā_p)²,

where Ā_p denotes the mean mouth area of the person over the window, L_p^j denotes the face key points of person p at time j, and area(·) computes the mouth area from the key points; when V_p is larger than a set empirical value, the person named p is considered to be speaking during the time period and is marked as a speaker;
s4, generating a plurality of candidate video clips in each channel according to the extracted video features; in step S4, the method includes the steps of:
S41, generating initial candidate video clips for each channel according to the picture stability results obtained by analyzing each channel material in step S33: for the all-frame analysis results {S_{n,1}, S_{n,2}, …, S_{n,K}} of the n-th material V_n, all results are traversed; when S_{n,t} is greater than the set empirical value, frame t is marked as the in point of the current candidate segment; the subsequent results are traversed continuously, and when S_{n,t} is less than or equal to the set empirical value, frame t is marked as the out point of the current candidate segment; and so on, generating for material V_n an initial candidate segment list C_n = {c_1, c_2, …, c_{Q_n}} containing Q_n candidate segments;
S42, traversing the initial candidate segment list generated in S41
Figure 891781DEST_PATH_IMAGE068
Comparing the current segment
Figure 924064DEST_PATH_IMAGE069
Out point of
Figure 949789DEST_PATH_IMAGE070
With the next segment
Figure 685664DEST_PATH_IMAGE071
In the point of entry
Figure 353405DEST_PATH_IMAGE072
If, if
Figure 225547DEST_PATH_IMAGE073
If the value is larger than the set empirical value, the segment is divided
Figure 422173DEST_PATH_IMAGE069
And fragments thereof
Figure 645344DEST_PATH_IMAGE071
Are combined into
Figure 585618DEST_PATH_IMAGE074
At the point of entry is
Figure 577845DEST_PATH_IMAGE069
In the point of entry
Figure 210951DEST_PATH_IMAGE075
At the point of departure is
Figure 655839DEST_PATH_IMAGE071
Out point of
Figure 399804DEST_PATH_IMAGE076
And so on, generating a final candidate segment list
Figure 246537DEST_PATH_IMAGE077
S5, selecting candidate video clips according to predefined rules and synthesizing the rough cut of the program; in step S5, the method includes the following steps:
S51, setting a priority for each channel material according to the scale of its shot picture;
S52, combining the final candidate segment lists C'_1, …, C'_N of the N channel materials from step S42 with the speaker marking results of step S34, the segments in the final candidate list of each channel material are filled into the rough-cut timeline according to the following rules to obtain the final composite video:
the segment is a close shot, there is a speaker, and the speaker is a guest;
the segment is a close shot, there is a speaker, and the speaker is the host;
the segment is a medium shot, there are speakers, and the number of speakers is not more than 3;
the segment is a long shot.
2. The method for intelligently generating interview-like variety programs according to claim 1, wherein in step S51, priority is set as: short shot > medium shot > long shot.
CN202110803384.0A 2021-07-16 2021-07-16 Method for intelligently generating interview-type comprehensive programs Active CN113269854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110803384.0A CN113269854B (en) 2021-07-16 2021-07-16 Method for intelligently generating interview-type comprehensive programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110803384.0A CN113269854B (en) 2021-07-16 2021-07-16 Method for intelligently generating interview-type comprehensive programs

Publications (2)

Publication Number Publication Date
CN113269854A CN113269854A (en) 2021-08-17
CN113269854B true CN113269854B (en) 2021-10-15

Family

ID=77236586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110803384.0A Active CN113269854B (en) 2021-07-16 2021-07-16 Method for intelligently generating interview-type comprehensive programs

Country Status (1)

Country Link
CN (1) CN113269854B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174962A (en) * 2022-07-22 2022-10-11 湖南芒果无际科技有限公司 Rehearsal simulation method and device, computer equipment and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005091211A1 (en) * 2004-03-16 2005-09-29 3Vr Security, Inc. Interactive system for recognition analysis of multiple streams of video
CN104732991A (en) * 2015-04-08 2015-06-24 成都索贝数码科技股份有限公司 System and method for rapidly sorting, selecting and editing entertainment program massive materials
CN105307028A (en) * 2015-10-26 2016-02-03 新奥特(北京)视频技术有限公司 Video editing method and device specific to video materials of plurality of lenses
CN106682617A (en) * 2016-12-28 2017-05-17 电子科技大学 Image definition judgment and feature extraction method based on frequency spectrum section information
CN108875602A (en) * 2018-05-31 2018-11-23 珠海亿智电子科技有限公司 Monitor the face identification method based on deep learning under environment
CN111191484A (en) * 2018-11-14 2020-05-22 普天信息技术有限公司 Method and device for recognizing human speaking in video image

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9818136B1 (en) * 2003-02-05 2017-11-14 Steven M. Hoffberg System and method for determining contingent relevance
US8095466B2 (en) * 2006-05-15 2012-01-10 The Directv Group, Inc. Methods and apparatus to conditionally authorize content delivery at content servers in pay delivery systems
US20170032559A1 (en) * 2015-10-16 2017-02-02 Mediatek Inc. Simulated Transparent Device
CN110691258A (en) * 2019-10-30 2020-01-14 中央电视台 Program material manufacturing method and device, computer storage medium and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005091211A1 (en) * 2004-03-16 2005-09-29 3Vr Security, Inc. Interactive system for recognition analysis of multiple streams of video
CN104732991A (en) * 2015-04-08 2015-06-24 成都索贝数码科技股份有限公司 System and method for rapidly sorting, selecting and editing entertainment program massive materials
CN105307028A (en) * 2015-10-26 2016-02-03 新奥特(北京)视频技术有限公司 Video editing method and device specific to video materials of plurality of lenses
CN106682617A (en) * 2016-12-28 2017-05-17 电子科技大学 Image definition judgment and feature extraction method based on frequency spectrum section information
CN108875602A (en) * 2018-05-31 2018-11-23 珠海亿智电子科技有限公司 Monitor the face identification method based on deep learning under environment
CN111191484A (en) * 2018-11-14 2020-05-22 普天信息技术有限公司 Method and device for recognizing human speaking in video image

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"索贝AI剪辑应用于总台综艺访谈类节目";无;《现代电视技术》;20200229(第2期);第160页 *
F'elicien Vallet等."ROBUST VISUAL FEATURES FOR THE MULTIMODAL IDENTIFICATION OF UNREGISTERED SPEAKERS IN TV TALK-SHOWS".《2010 IEEE 17th International Conference on Image Processing》.2010, *
无."索贝AI剪辑应用于总台综艺访谈类节目".《现代电视技术》.2020,(第2期), *
说话人辨认中有效参数的研究;王炳锡等;《应用声学》;19920431;第11卷(第02期);第20-23页 *

Also Published As

Publication number Publication date
CN113269854A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
JP7252362B2 (en) Method for automatically editing video and portable terminal
JP7228682B2 (en) Gating model for video analysis
Chen et al. What comprises a good talking-head video generation?: A survey and benchmark
CN111683209B (en) Mixed-cut video generation method and device, electronic equipment and computer-readable storage medium
Kang Affective content detection using HMMs
US7949188B2 (en) Image processing apparatus, image processing method, and program
US8879788B2 (en) Video processing apparatus, method and system
JP5510167B2 (en) Video search system and computer program therefor
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
TWI253860B (en) Method for generating a slide show of an image
CN107430780B (en) Method for output creation based on video content characteristics
CN109218629B (en) Video generation method, storage medium and device
CN112367551B (en) Video editing method and device, electronic equipment and readable storage medium
WO2011015909A1 (en) System for creating a capsule representation of an instructional video
JPH11514479A (en) Method for computerized automatic audiovisual dubbing of movies
US20170213576A1 (en) Live Comics Capturing Camera
CN110505498A (en) Processing, playback method, device and the computer-readable medium of video
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
WO2022061806A1 (en) Film production method, terminal device, photographing device, and film production system
CN113269854B (en) Method for intelligently generating interview-type comprehensive programs
Zhang et al. Detecting and removing visual distractors for video aesthetic enhancement
US9542976B2 (en) Synchronizing videos with frame-based metadata using video content
JP6389296B1 (en) VIDEO DATA PROCESSING DEVICE, VIDEO DATA PROCESSING METHOD, AND COMPUTER PROGRAM
CN113255628B (en) Scene identification recognition method for news scene

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant