CN114257757A - Automatic cutting and switching method and system of video, video player and storage medium - Google Patents


Info

Publication number
CN114257757A
CN114257757A
Authority
CN
China
Prior art keywords
view
standing
information
close
character
Prior art date
Legal status
Granted
Application number
CN202111576101.XA
Other languages
Chinese (zh)
Other versions
CN114257757B
Inventor
张明
董健
Current Assignee
Ruimo Intelligent Technology Shenzhen Co ltd
Original Assignee
Ruimo Intelligent Technology Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Ruimo Intelligent Technology Shenzhen Co ltd filed Critical Ruimo Intelligent Technology Shenzhen Co ltd
Priority to CN202111576101.XA
Publication of CN114257757A
Application granted
Publication of CN114257757B
Active legal status
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2628Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Studio Devices (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a method and a system for automatically cropping and switching video, a video player and a storage medium. The method comprises the following steps: acquiring, based on a video image sequence, parameter information of a panoramic view, a standing character view and a close-up view, which are arranged in order from low priority to high priority; when the output time of the currently output view exceeds a threshold, if exactly one standing character in the standing character view is speaking, cropping the close-up view from the video image sequence according to the bounding box information of the close-up view and outputting it; otherwise, if at least one standing character is present in the panoramic view, cropping the standing character view from the video image sequence according to the bounding box information of the standing character view and outputting it; otherwise, outputting the panoramic view. The invention can take real-time close-up shots of standing students who are speaking in a classroom, can automatically handle any number of standing speakers, can automatically switch between different standing speakers, and realizes automatic switching between different viewing angles.

Description

Automatic cutting and switching method and system of video, video player and storage medium
Technical Field
The embodiment of the invention relates to the technical field of videos, in particular to a method and a system for automatically cutting and switching videos, a video player and a storage medium.
Background
Classroom recording and broadcasting systems bring accumulated high-quality courses, school resource construction and improved teaching and research levels, and their role in teaching is increasingly prominent. In an intelligent classroom recording system, a student who stands up to answer a question needs to be given a close-up shot in real time. The prior art mainly uses optical zooming or a simple digital cropping method to take close-ups of standing students, but optical zooming has the following problems: 1. a close-up shot has dynamic zoom-in and zoom-out processes, and these dynamic processes affect the display quality of the output video; 2. during a close-up, the viewing angle of the lens is very small and the perception of the whole classroom is lost; for example, when another student stands up, the system cannot handle the situation because the field of view of the lens is limited; 3. the cost is high; for ordinary classrooms and places or regions with limited education budgets, a high-cost optical zoom lens is too expensive. The traditional simple digital cropping method avoids these problems, but its execution logic is very simple: it cannot handle the situation where several students stand up, and its degree of intelligence is very low.
Disclosure of Invention
The invention provides a method and a system for automatically cropping and switching video, a video player and a storage medium, so as to take real-time close-up shots of standing students who are speaking in a classroom, automatically handle any number of standing speakers, complete automatic switching between different standing speakers, and realize automatic switching between different viewing angles.
In a first aspect, an embodiment of the present invention provides an automatic video clipping and switching method, where the automatic video clipping and switching method includes:
A. acquiring, based on a video image sequence, parameter information of a panoramic view, a standing character view and a close-up view, which are arranged in order from low priority to high priority, wherein the parameter information comprises bounding box information and state information;
B. determining that an output time of a current output view has exceeded a threshold;
C. if exactly one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cropped from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character is present in the panoramic view, the state information of the standing character view is valid, and the standing character view is cropped from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output.
In a second aspect, an embodiment of the present invention provides an automatic video cropping switching system, where the automatic video cropping switching system includes:
a view acquisition module, configured to acquire, based on a video image sequence, parameter information of a panoramic view, a standing character view and a close-up view, which are arranged in order from low priority to high priority, wherein the parameter information comprises bounding box information and state information;
an output time determination module for determining that the output time of the current output view has exceeded a threshold;
a view cropping output module, configured to: if exactly one standing character in the standing character view is speaking, determine that the state information of the close-up view is valid, and crop the close-up view from the video image sequence according to the bounding box information of the close-up view and output it; otherwise, if at least one standing character is present in the panoramic view, determine that the state information of the standing character view is valid, and crop the standing character view from the video image sequence according to the bounding box information of the standing character view and output it; otherwise, output the panoramic view.
In a third aspect, an embodiment of the present invention further provides a video player, where the video player includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the automatic video cropping and switching method described above.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for automatic cropping switching of a video as described above.
When the output time of the currently output view exceeds the threshold, if exactly one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cropped from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character is present in the panoramic view, the state information of the standing character view is valid, and the standing character view is cropped from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output. The invention can take real-time close-up shots of standing students who are speaking in a classroom, can automatically handle any number of standing speakers, can automatically switch between different standing speakers, and realizes automatic switching between different viewing angles.
Drawings
Fig. 1 is a flowchart of a method for automatically cutting and switching video according to a first embodiment of the present invention;
fig. 2 is a flowchart of a sub-method of an automatic clipping switching method for video according to an embodiment of the present invention;
FIG. 3 is a flowchart of another automatic video cropping and switching method according to a second embodiment of the present invention;
Fig. 4 is a flowchart of a sub-method of an automatic clipping switching method for video according to a second embodiment of the present invention;
fig. 5 is a flowchart of another sub-method of an automatic cropping switching method for video according to a second embodiment of the present invention;
fig. 6 is a block diagram of a video automatic cropping switching system according to a third embodiment of the present invention;
fig. 7 is a block diagram of another video automatic cropping switching system according to a third embodiment of the present invention;
fig. 8 is a block diagram of a video player according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for automatically cutting and switching a video according to an embodiment of the present invention, where the embodiment is applicable to a classroom recording system requiring switching between different views, and the method can be executed by a video player, and specifically includes the following steps:
Step S110, acquiring, based on the video image sequence, parameter information of a panoramic view, a standing character view and a close-up view, which are arranged in order from low priority to high priority, wherein the parameter information comprises bounding box information and state information.
In this embodiment, priority levels of different viewfinder views are set; the viewfinder views are divided, from low priority to high priority, into a panoramic view, a standing character view and a close-up view, and the bounding box information and state information corresponding to the view at each level are acquired. The state information indicates whether the corresponding view is valid: if the state information is valid, the corresponding viewfinder view satisfies a preset state condition; if the state information is invalid, the corresponding viewfinder view does not satisfy the preset state condition.
Step S120, judging whether the output time of the current output view exceeds a threshold; if so, executing step S130; if not, executing step S140: performing no operation and continuing to output the current output view.
Whether the output time of the current output view exceeds the threshold is judged. If so, the current output view has been on screen long enough, and whether the view needs to be switched is judged according to a preset switching condition; if not, no operation is performed and the current output view continues to be output. The threshold may be set according to actual needs and is not limited here.
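As a rough illustration, the threshold check of step S120 can be sketched as follows. This is a minimal Python sketch: the class name, the threshold value and the use of `time.monotonic` are assumptions, not details from the patent.

```python
import time

SWITCH_THRESHOLD_S = 2.0   # assumed value; the patent leaves the threshold configurable


class OutputTimer:
    """Tracks how long the current view has been on screen (step S120)."""

    def __init__(self):
        self.started = time.monotonic()

    def may_switch(self):
        # Only consider switching once the current view has been shown
        # for longer than the threshold; otherwise keep outputting it.
        return time.monotonic() - self.started > SWITCH_THRESHOLD_S

    def reset(self):
        # Call after actually switching to a new view.
        self.started = time.monotonic()
```

A caller would poll `may_switch()` each period and, when it returns true, evaluate the view-selection cascade of step S130.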
Step S130, if exactly one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cropped from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character is present in the panoramic view, the state information of the standing character view is valid, and the standing character view is cropped from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output.
Specifically, as shown in fig. 2, step S130 includes steps S131 to S135, which are as follows:
step S131, judging whether only one standing character in the standing character view is speaking, if so, executing step S134; if not, go to step S132.
Step S132 is to determine whether there is at least one standing character in the panoramic view, if so, step S133 is performed, otherwise, step S135 is performed.
Step S133, judging that the state information of the standing character view is valid, and cropping the standing character view from the video image sequence according to the bounding box information of the standing character view and outputting it.
Step S134, judging that the state information of the close-up view is valid, and cropping the close-up view from the video image sequence according to the bounding box information of the close-up view and outputting it.
Step S135, outputting the panoramic view.
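The decision cascade of steps S131 to S135 can be sketched in Python as follows. The `View` structure, the function name and the way the speaker and standing counts are passed in are illustrative assumptions, not part of the patent's disclosure.

```python
from dataclasses import dataclass


@dataclass
class View:
    name: str      # "panorama", "standing" or "closeup"
    bbox: tuple    # (x, y, w, h) bounding box information
    valid: bool    # state information


def select_view(panorama, standing, closeup, num_speaking, num_standing):
    """Steps S131-S135: pick the highest-priority valid view.

    num_speaking: standing characters currently judged to be speaking.
    num_standing: standing characters present in the panoramic view.
    """
    if num_speaking == 1:          # S131 -> S134: exactly one speaker
        closeup.valid = True
        return closeup             # crop with closeup.bbox and output
    if num_standing >= 1:          # S132 -> S133: at least one standing character
        standing.valid = True
        return standing            # crop with standing.bbox and output
    return panorama                # S135: fall back to the panoramic view
```

The cascade mirrors the priority order: close-up first, then the standing character view, then the always-valid panorama.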
In this embodiment, the viewfinder views are divided into a panoramic view, a standing character view and a close-up view, whose priority levels increase in that order. If the output time of the current output view exceeds the threshold, the state information of the viewfinder views is judged in order of priority from high to low, that is, the state information of the close-up view, the standing character view and the panoramic view in turn. First, whether the state information of the close-up view is valid is judged; if it is valid, that is, exactly one character in the standing character view (the minimum view containing all standing characters within the shooting range) is speaking, the close-up view is cropped from the video image sequence according to the bounding box information of the close-up view and output, i.e., a close-up shot of the speaking standing character is output. If the state information of the close-up view is invalid, whether the state information of the standing character view is valid is judged; if it is valid, that is, at least one standing character is present in the panoramic view, the standing character view is cropped from the video image sequence according to the bounding box information of the standing character view and output. If the state information of the standing character view is invalid, whether the state information of the panoramic view is valid is judged, and if it is valid, the panoramic view is cropped from the video image sequence according to the bounding box information of the panoramic view and output.
In this embodiment, the state information of the panoramic view is always valid, so if the state information of the standing character view is invalid, the panoramic view is output, which makes the video output logic more robust. This embodiment can be applied to a classroom recording and broadcasting system: it can take real-time close-up shots of standing students who are speaking, automatically handle any number of standing speakers, and automatically switch between different standing speakers; when no student is speaking, or two or more students are speaking, the view is switched to the standing character view, and when no standing character is present, the view is automatically switched to the panoramic view.
When the output time of the currently output view exceeds the threshold, if exactly one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cropped from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character is present in the panoramic view, the state information of the standing character view is valid, and the standing character view is cropped from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output. The invention can take real-time close-up shots of standing students who are speaking in a classroom, can automatically handle any number of standing speakers, can automatically switch between different standing speakers, and realizes automatic switching between different viewing angles, with low cost and simple deployment.
Example two
Fig. 3 is a flowchart of a method for automatically cutting and switching a video according to a second embodiment of the present invention, where this embodiment is applicable to a classroom recording and playing system, and the method can be executed by a video player, and specifically includes the following steps:
Step S210, acquiring, according to a preset period and based on the video image sequence, parameter information of a panoramic view, a standing character view and a close-up view, which are arranged in order from low priority to high priority, wherein the parameter information comprises bounding box information and state information.
In this embodiment, priority levels of the different viewfinder views are set, and the bounding box information and state information corresponding to the viewfinder view of each level are acquired according to a preset period. The state information indicates whether the corresponding viewfinder view is valid: if the state information is valid, the corresponding viewfinder view satisfies a preset state condition; if the state information is invalid, the corresponding viewfinder view does not satisfy the preset state condition. The preset period may be set according to actual needs and determines the switching response time: if it is too short, the system load is high; if it is too long, the response time of view switching becomes too long. The preset period generally ranges from 0.1 s to 0.5 s.
In some embodiments, the panoramic view is the image captured by the camera, the bounding box information of the panoramic view is the entire camera picture, and the state information of the panoramic view is always valid. Obtaining the standing character view specifically comprises: directly cropping, from the panoramic view, the minimum view whose bounding box contains all standing characters, as the standing character view; or acquiring an independent bounding box for each standing character and performing region fusion on the independent bounding boxes of all standing characters to obtain the standing character view. If the number of standing characters in the panoramic image is zero, the state information of the standing character view is invalid; otherwise it is valid, and the standing character view is the view corresponding to the minimum bounding box containing all standing characters, or the view obtained by fusing the independent bounding box regions of all standing characters. For the close-up view, if it is judged, based on a deep convolutional neural network, that exactly one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the bounding box information of the close-up view is the bounding box information corresponding to the close-up picture of that standing character, i.e., the close-up view is the close-up picture of the speaking standing character; otherwise, the state information of the close-up view is invalid. In some embodiments, acquiring the independent bounding box of each standing character and performing region fusion on the independent bounding boxes of all standing characters to obtain the standing character view specifically comprises:
a1, acquiring an independent bounding box of each standing character;
a2, sequentially scaling and splicing the independent bounding boxes of the standing characters so that their heights are consistent, wherein the scaling of each independent bounding box keeps its original aspect ratio. The obtained standing character view thus conforms to composition aesthetics, and keeping the scaling consistent with the original aspect ratio of each independent bounding box prevents the picture from being stretched.
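A minimal sketch of the aspect-ratio-preserving fusion in steps a1 and a2: each independent bounding box is uniformly scaled to a common height (so nothing is stretched) and the scaled tiles are spliced left to right. The function name and the `target_h` value are assumptions for illustration.

```python
def fuse_standing_boxes(boxes, target_h=480):
    """Scale each standing character's bounding box to a common height
    while keeping its original aspect ratio, then splice the scaled
    boxes left to right (steps a1-a2).

    boxes: list of (x, y, w, h) independent bounding boxes.
    Returns the (w, h) of each scaled tile and the fused canvas size.
    """
    tiles = []
    for (x, y, w, h) in boxes:
        scale = target_h / h                  # uniform scale: no stretching
        tiles.append((round(w * scale), target_h))
    canvas_w = sum(w for w, _ in tiles)       # tiles spliced side by side
    return tiles, (canvas_w, target_h)
```

An actual implementation would also resample the pixel regions; the sketch only shows the geometry that keeps every character at a consistent height.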
In some embodiments, as shown in fig. 4, the obtaining of the close-up view parameter information specifically includes steps S211 to S214, and the details are as follows:
step S211, performing person detection on the video image frame sequence, and acquiring person detection information of each standing person in the current frame video image, where the person detection information includes: face feature information, lip feature information, and coordinate information.
In this embodiment, a face detection algorithm is used to perform face detection on people in a video image frame sequence to obtain face feature information and lip feature information of each standing person in a current frame, and coordinate information of each standing person is obtained from a current frame video image.
In some embodiments, the extracting lip feature information from the face detection result includes steps a to c, and the specific content is as follows:
step a, detecting key points of the human face in the human face detection result, and detecting the position of the lips.
And b, cutting out an image block by taking the position of the lip as a center, and zooming to a fixed size to obtain the lip image.
And c, inputting the lip image into a convolutional neural network to obtain lip characteristic information.
Face detection and lip feature extraction should approach real-time processing; a lightweight network architecture and fast image processing algorithms can be used to improve the processing efficiency of the device and increase the response speed.
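Steps a to c can be approximated as below. This sketch covers only cropping the lip-centred image block and scaling it to a fixed size (steps a and b), with a plain nearest-neighbour resize standing in for a real image library; the face keypoint detector and the lip CNN of step c remain external. All names and sizes are illustrative assumptions.

```python
import numpy as np

LIP_PATCH = 32   # assumed fixed patch size fed to the lip CNN


def crop_lip_patch(frame, lip_center, half=24):
    """Steps a-b: cut an image block centred on the detected lip
    position and scale it to a fixed size.

    frame: HxWxC image array; lip_center: (x, y) from keypoint detection.
    """
    h, w = frame.shape[:2]
    cx, cy = lip_center
    x0, x1 = max(0, cx - half), min(w, cx + half)
    y0, y1 = max(0, cy - half), min(h, cy + half)
    patch = frame[y0:y1, x0:x1]
    # Nearest-neighbour resize to LIP_PATCH x LIP_PATCH; a real system
    # would use cv2.resize or similar for better quality.
    ys = np.linspace(0, patch.shape[0] - 1, LIP_PATCH).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, LIP_PATCH).astype(int)
    return patch[np.ix_(ys, xs)]
```

The fixed-size patch would then be fed to the convolutional neural network of step c to obtain the lip feature vector.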
Step S212, a person position sequence table is maintained according to the person detection information.
A person position sequence is constructed for each standing character; the person position sequence information comprises face feature information, lip feature information, coordinate information and a timestamp of the current frame image, and the person position sequence table consists of the person position sequence information of all characters. Preferably, maintaining the person position sequence table according to the person detection information detected from each frame of video image comprises: matching the face feature information against the face feature information of the characters in the person position sequence table, and if a matching character exists, updating the person position sequence information corresponding to that character; if no matching character exists, constructing new person position sequence information for the character; or matching the coordinate information detected in the current frame image against the coordinate information of the characters in the person position sequence table, and if a matching character exists, updating the position sequence information corresponding to that character; if no matching character exists, constructing new person position sequence information for the character. Matching by face feature information improves detection accuracy and thus the accuracy of locating the close-up shot; for venues with fixed seats, especially fixed and orderly arranged seats, updating the person position sequence table by coordinate information places a small burden on the system and gives a faster response speed while still ensuring accuracy.
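A simplified sketch of the coordinate-matching variant of step S212, suited to fixed-seat venues. The table layout, the distance threshold and the field names are assumptions made for illustration.

```python
import time

MATCH_DIST = 50.0   # assumed pixel threshold for coordinate matching


def update_position_table(table, detections):
    """Maintain the person position sequence table (step S212) by
    coordinate matching.

    table: dict mapping person id -> list of sequence entries.
    detections: list of dicts with 'coord' (x, y) and 'lip_feat'
    for the current frame.
    """
    for det in detections:
        x, y = det["coord"]
        match = None
        for pid, seq in table.items():
            px, py = seq[-1]["coord"]
            if ((x - px) ** 2 + (y - py) ** 2) ** 0.5 < MATCH_DIST:
                match = pid
                break
        entry = {"coord": (x, y), "lip_feat": det["lip_feat"],
                 "ts": time.time()}
        if match is None:                 # new character: start a new sequence
            table[len(table)] = [entry]
        else:                             # known character: extend its sequence
            table[match].append(entry)
    return table
```

The face-feature variant would replace the Euclidean distance test with a face-embedding similarity comparison.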
And step S213, determining that only one standing character is speaking according to the lip characteristic information in the character position sequence list.
Each standing character corresponds to a sequence of lip feature information in the person position sequence table, which records the lip track of the lips' changes over time; whether the corresponding character is speaking can be accurately judged from this sequence of lip feature information, and whether exactly one standing character is speaking is then determined from the judgment results.
In some embodiments, as shown in fig. 5, the step S213 of determining that only one standing character is speaking according to the lip feature information in the character position sequence list includes steps S2131 to S2133, which specifically include the following:
step 2131, sequentially extracting the lip feature information in the character position sequence information corresponding to each standing character from the current frame back by N frames at intervals, and sending the lip feature information corresponding to the extracted K frames of video images into a speaking classifier to obtain the real-time score of each standing character.
Calculating the real-time score of the lip track of each standing character once every period of time, extracting K frames of lip characteristic information from the current frame and back at intervals of N frames, splicing the extracted lip characteristic information corresponding to the K frames of video images in sequence, and then sending the spliced lip characteristic information into a speaking classifier to obtain the real-time score of the lip track of the standing character.
Step S2132, averaging the real-time scores of each standing character calculated in the previous M times to obtain the speaking score for the current frame video image; if the speaking score is greater than or equal to a preset threshold, the standing character is judged to be speaking.
The lip track of each standing character consists of a series of real-time scores; the higher the value, the more likely the corresponding character is speaking. The real-time scores calculated in the previous M times for each standing character are averaged as the speaking score for the current frame video image, and if the speaking score of a standing character is greater than or equal to the preset threshold, the standing character is judged to be speaking. The value of M is typically the number of frames captured by the camera within 1 s: if M is too small, the computation load becomes too large and system performance is affected; if M is too large, real-time performance is reduced.
Step S2133, counting the number of standing characters who are speaking; if the number of speaking standing characters is equal to 1, determining that exactly one standing character is speaking.
This embodiment can judge whether a character is speaking according to the character's lip features and count the number of standing characters who are speaking; if that number equals 1, i.e., exactly one standing character is speaking, a close-up shot of that standing character is cropped from the current frame video image according to the bounding box information of that standing character in the current frame.
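The scoring scheme of steps S2131 and S2132 might be sketched as follows. The values of N, K and M, the score threshold and the classifier interface are illustrative assumptions.

```python
from collections import deque

N, K, M = 2, 8, 25        # assumed: frame stride, window size, score history
SPEAK_THRESHOLD = 0.5     # assumed speaking-score threshold


def realtime_score(lip_feats, classifier, frame_idx):
    """Step S2131: take K lip features sampled every N frames going back
    from the current frame and feed them to the speaking classifier.

    classifier: maps a list of lip features to a score in [0, 1].
    """
    idxs = range(frame_idx, frame_idx - N * K, -N)
    window = [lip_feats[i] for i in idxs if 0 <= i < len(lip_feats)]
    return classifier(window)


def is_speaking(score_history, new_score):
    """Step S2132: average the last M real-time scores and compare the
    result with the threshold."""
    score_history.append(new_score)
    while len(score_history) > M:
        score_history.popleft()
    return sum(score_history) / len(score_history) >= SPEAK_THRESHOLD
```

Step S2133 then simply counts how many standing characters pass `is_speaking` and checks whether the count equals 1.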
Step S214, taking the obtained bounding box information corresponding to the close-up picture of the speaking standing character as the bounding box information of the close-up view, and setting the state information of the close-up view to valid.
If no speaking standing character is present in the video image, the state information of the close-up view is invalid; if two or more characters are speaking at the same time, the state information of the close-up view is also invalid; the state information of the close-up view is valid only when the number of speaking standing characters in the video image is 1. In that case, the bounding box information corresponding to the close-up picture of the speaking standing character in the current frame video image is acquired and used as the bounding box information of the close-up view, and the state information of the close-up view is set to valid.
In this embodiment, whether a character is speaking is judged from a sequence of the character's lip features (i.e. the lip track), which allows accurate judgment. If only one standing character is speaking, the bounding box information corresponding to the close-up picture of that standing character in the current frame video image is acquired; that is, the close-up picture of the talking standing character is cut from the current frame video image according to the bounding box information corresponding to the close-up picture.
Step S220, judging whether the output time of the current output view exceeds a threshold value; if so, executing step S230; if not, executing step S290, that is, performing no operation and continuing to output the current output view.
If the output time of the current output view exceeds the threshold value, the current output view has already been output for a long time, and whether the framing view needs to be switched is judged according to preset conditions; otherwise, the system waits for a specified time, that is, performs no further operation and continues to output the current output view. The threshold value may be set according to actual needs and is not limited herein.
Step S230, judging whether only one standing character in the standing character view is speaking, if so, executing step S250; if not, go to step S240.
Step S240, determining whether there is at least one standing character in the panoramic view, if yes, performing step S260, and if no, performing step S270.
And S250, judging that the state information of the close-up view is valid, and cutting the close-up view from the video image sequence according to the bounding box information of the close-up view and outputting it.
And step S260, judging that the state information of the standing character view is effective, and cutting and outputting the standing character view from the video image sequence according to the bounding box information of the standing character view.
And step S270, outputting the panoramic view. In this embodiment, the state information of the panoramic view is always set to valid; that is, in actual operation, step S270 specifically judges whether the state information of the panoramic view is valid, and if so, cuts the panoramic view from the video image sequence according to the bounding box information corresponding to the panoramic view and outputs it. If not, a designated picture may be output. Because the state information of the panoramic view is set to valid in this embodiment, the panoramic view is output directly whenever the state information of the standing character view is invalid, which maintains the fluency and logic of the video framing output and better matches users' usage habits.
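Steps S230-S270 amount to choosing the highest-priority view whose state information is valid. A minimal sketch, assuming each view is represented as a dict with `valid` (state information) and `bbox` (bounding box) keys — a hypothetical representation, not from the patent:

```python
def select_view(close_up, standing, panoramic):
    """Return the highest-priority view whose state information is valid.

    Priority order: close-up > standing character > panoramic.
    """
    for view in (close_up, standing, panoramic):
        if view.get("valid"):
            return view
    # The panoramic view's state is always set valid in this embodiment,
    # so this fallback is only reached if that invariant is broken.
    return panoramic
```

Because the panoramic view's state information is always set to valid, the cascade always terminates with a usable view to crop and output.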
Step S280 is further included after step S250, step S260 and step S270, and the specific contents are as follows:
step S280, if the current output view and the previous output view are not views of the same priority level, starting the timer of the current output view to count again; otherwise, the timer keeps counting. The output time of the output view is calculated accordingly: if the new view has the same priority level, the timer keeps counting; if it has a different priority level, timing restarts. This maintains the integrity of the system logic.
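The timer behaviour of step S280 — re-count on a priority change, keep counting otherwise — can be sketched as below. The explicit `now` parameter is an assumption added for testability; a real deployment would simply use the monotonic clock.

```python
import time

class ViewTimer:
    """Restarts timing when the newly output view has a different priority
    level from the previous output view; otherwise keeps the running count."""
    def __init__(self):
        self.current_priority = None
        self.started_at = None

    def on_output(self, priority, now=None):
        now = time.monotonic() if now is None else now
        if priority != self.current_priority:
            # Different priority level: re-count from now.
            self.current_priority = priority
            self.started_at = now
        # Same priority level: keep counting (no change).

    def output_time(self, now=None):
        now = time.monotonic() if now is None else now
        return 0.0 if self.started_at is None else now - self.started_at
```

The output time returned here is what step S220 compares against the threshold before considering a view switch.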
In the technical scheme of this embodiment, parameter information of a panoramic view, a standing character view and a close-up view is obtained based on the video image sequence images. Whether a character is speaking is judged from a sequence of the character's lip features (i.e. the lip track), which allows accurate judgment. If only one standing character is speaking, the state information of the close-up view is valid, and the bounding box information corresponding to the close-up picture of the talking character in the current frame video image is acquired; that is, the close-up picture of the talking standing character is cut from the current frame video image according to the bounding box information corresponding to the close-up picture. If the state information of the close-up view is invalid and at least one standing character is within the shooting range, the state information of the standing character view is valid, and the standing character view is cut from the video image sequence according to the bounding box information of the standing character view and output. If the state information of the standing character view is invalid, the panoramic view is cut from the video image sequence according to the bounding box information of the panoramic view and output. The invention automatically switches the view according to the priority level and state of each view, realizes automatic switching between different views in small-scale and non-professional scenes, enriches video content, meets users' requirements for different scenes, and is low in implementation cost and simple to deploy.
EXAMPLE III
The automatic video clipping and switching system provided by the embodiment of the invention can execute the automatic video clipping and switching method provided by any embodiment of the invention, has corresponding functional modules and beneficial effects of the execution method, and for detailed contents in the embodiment, reference can be made to corresponding contents in the first embodiment and the second embodiment of the invention.
Fig. 6 is a block diagram of a structure of an automatic video cropping switching system according to a third embodiment of the present invention, and as shown in fig. 6, the automatic video cropping switching system according to the present embodiment includes a view obtaining module 10, an output time determining module 20, and a view cropping output module 30, which includes the following specific contents:
the view acquiring module 10 is configured to acquire parameter information of view views of different priority levels based on the video image sequence images, where the parameter information includes bounding box information and state information.
An output time determination module 20 for determining that the output time of the current output view has exceeded a threshold.
A view cropping output module 30, configured to: if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cut from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output.
The embodiment can perform real-time close-up shooting on students standing up to speak in a classroom, can automatically process any number of students standing up to speak, can complete automatic switching among the students standing up to speak, and realizes automatic switching among different visual angles. In some embodiments, as shown in fig. 7, the view acquiring module 10 specifically includes a detecting unit 11, a list maintaining unit 12, a speaking determining unit 13, and an information acquiring unit 14, and the specific contents are as follows:
a detecting unit 11, configured to perform person detection on the video image frame sequence, and acquire person detection information of each standing person in the current frame video image, where the person detection information includes: face feature information, lip feature information, and coordinate information.
A list maintenance unit 12 for maintaining a character position sequence list according to the character detection information.
And the speaking determining unit 13 is configured to determine that only one standing character is speaking according to the lip feature information in the character position sequence list.
And an information acquisition unit 14 for setting the obtained bounding box information corresponding to the close-up picture of the talking standing character as the bounding box information of the close-up view and setting the state information of the close-up view as effective.
In some embodiments, the utterance determination unit 13 is specifically configured to:
sequentially extracting K frames of lip characteristic information in the character position sequence information corresponding to each standing character from the current frame and back at intervals of N frames, and sending the lip characteristic information corresponding to the extracted K frames of video images into a speaking classifier to obtain the real-time score of each standing character;
carrying out average calculation on the real-time scores of the previous M times of calculation of each standing figure to obtain the speaking score of the current frame video image, and if the speaking score is greater than or equal to a preset threshold value, judging that the standing figure is speaking;
the number of standing characters that are speaking is counted, and if this number is equal to 1, it is determined that only one standing character is speaking.
The embodiment can accurately judge whether the person speaks or not by judging whether the person speaks or not according to the lip characteristics (namely the lip track) of a sequence of the person. In this embodiment, if only one standing character is speaking, the boundary frame information corresponding to the close-up picture of the speaking standing character in the current frame video image is acquired, that is, the close-up picture of the speaking standing character is cut from the current frame video image according to the boundary frame information corresponding to the close-up picture.
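The sampling step — taking K frames of lip feature information backward from the current frame at intervals of N frames before feeding a speaking classifier — can be sketched as follows. The classifier itself is omitted; the list layout with the current frame last is an assumption for illustration.

```python
def sample_lip_features(feature_sequence, k, n):
    """From the current frame backward, take up to K lip-feature entries at
    intervals of N frames.

    feature_sequence[-1] is assumed to be the current frame; the result is
    ordered newest-first and may be shorter than K if history runs out.
    """
    samples = []
    idx = len(feature_sequence) - 1
    while idx >= 0 and len(samples) < k:
        samples.append(feature_sequence[idx])
        idx -= n  # step back N frames
    return samples
```

The K sampled entries would then be passed to the speaking classifier to produce one real-time score for that standing character.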
In some embodiments, the view acquiring module 10 is further specifically configured to: directly cut, from the panoramic view, a minimum view containing the bounding boxes of all standing characters and use it as the standing character view; or acquire the independent bounding box of each standing character, and perform region fusion on the independent bounding boxes of all standing characters to obtain the standing character view. As a preferred embodiment, the view acquiring module 10 is specifically configured to:
acquiring an independent bounding box of each standing figure;
and sequentially performing scaled splicing on the independent bounding boxes of the standing characters so that the heights of the standing characters are consistent, wherein each independent bounding box is scaled in proportion, consistent with its original aspect ratio.
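The splicing described above — scaling each standing character's crop to a consistent size while keeping its original aspect ratio — can be sketched with plain box arithmetic. The `(width, height)` tuple representation and the `target_height` parameter are assumptions for illustration, not from the patent.

```python
def stitch_standing_boxes(boxes, target_height):
    """Scale each standing character's bounding box to a common height while
    preserving its original aspect ratio, then lay the crops side by side.

    boxes: list of (width, height) tuples; returns the scaled sizes and the
    total width of the fused standing character view.
    """
    placed, total_width = [], 0
    for w, h in boxes:
        scale = target_height / h          # uniform scale keeps aspect ratio
        new_w = round(w * scale)
        placed.append((new_w, target_height))
        total_width += new_w
    return placed, total_width
```

Because a single scale factor is applied to both dimensions of each box, no crop is distorted; the fused view's width is simply the sum of the scaled widths.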
In some embodiments, the view acquisition module 10 is further specifically configured to acquire a view of a person including all the persons in the panorama based on the video image sequence images according to a multi-person cropping algorithm.
In some embodiments, the view acquiring module 10 is specifically configured to acquire parameter information of the panoramic view, the standing character view, and the close-up view, which are sequentially arranged from low to high in priority level, based on the video image sequence images according to a preset period.
In some embodiments, as shown in fig. 7, the automatic switching system for video framing provided in this embodiment further includes a timing module 40, configured to start a timer of the current output view to count again if the current output view is not the same as the previous output view in the priority level; otherwise, the timer keeps counting.
In the technical scheme of this embodiment, parameter information of a panoramic view, a standing character view and a close-up view is obtained based on the video image sequence images. If the output time of the current output view exceeds the threshold value and the state information of the close-up view is valid, the close-up view is cut from the video image sequence according to the bounding box information of the close-up view and output. If the state information of the close-up view is invalid, whether the state information of the standing character view is valid is judged; if valid, the standing character view is cut from the video image sequence according to the bounding box information of the standing character view and output. If the state information of the standing character view is invalid, the panoramic view is cut from the video image sequence according to the bounding box information of the panoramic view and output. The invention can perform real-time close-up shooting of students who stand up to speak in a classroom, can automatically handle any number of students standing up to speak, can complete automatic switching among different standing students, and realizes automatic switching among different visual angles.
EXAMPLE IV
Fig. 8 is a block diagram of a video player according to a fourth embodiment of the present invention. Fig. 8 illustrates a block diagram of an exemplary video player 80 suitable for use in implementing embodiments of the present invention. The video player 80 shown in fig. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 8, the video player 80 is in the form of a general purpose computing device. The components of the video player 80 may include, but are not limited to: one or more processors or processing units 82, a system memory 81, and a bus 83 that couples the various system components including the system memory 81 and the processing unit 82.
Bus 83 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
The video player 80 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by video player 80 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 81 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)811 and/or cache memory 814. The video player 80 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 812 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, and commonly referred to as a "hard drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 83 by one or more data media interfaces. Memory 81 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 813 having a set (at least one) of program modules 8131 may be stored, for example, in system memory 81, such program modules 8131 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination of which may comprise an implementation of a network environment. Program modules 8131 generally perform the functions and/or methodologies of the described embodiments of the invention.
The video player 80 may also communicate with one or more external devices 100 (e.g., keyboard, pointing device, display 90, etc.), with one or more devices that enable a user to interact with the video player 80, and/or with any devices (e.g., network card, modem, etc.) that enable the video player 80 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 84. Also, the video player 80 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 85. As shown, the network adapter 85 communicates with the other modules of the video player 80 over the bus 83. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the video player 80, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 82 executes various functional applications and data processing by running programs stored in the system memory 81, thereby implementing, for example, the automatic video cropping and switching method provided by embodiments of the present invention, the method including:
acquiring parameter information of a panoramic view, a standing figure view and a close-up view which are sequentially arranged from low priority to high priority based on a video image sequence image, wherein the parameter information comprises bounding box information and state information;
determining that an output time of a current output view has exceeded a threshold;
if only one standing character in the standing character views speaks, the state information of the close-up view is effective, and the close-up view is cut from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing figure exists in the panoramic view, the state information of the standing figure view is effective, and the standing figure view is cut from the video image sequence and output according to the bounding box information of the standing figure view; otherwise, outputting the panoramic view.
EXAMPLE V
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements an automatic video cropping and switching method, the method including:
acquiring parameter information of a panoramic view, a standing figure view and a close-up view which are sequentially arranged from low priority to high priority based on a video image sequence image, wherein the parameter information comprises bounding box information and state information;
determining that an output time of a current output view has exceeded a threshold;
if only one standing character in the standing character views speaks, the state information of the close-up view is effective, and the close-up view is cut from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing figure exists in the panoramic view, the state information of the standing figure view is effective, and the standing figure view is cut from the video image sequence and output according to the bounding box information of the standing figure view; otherwise, outputting the panoramic view.
Of course, the computer program of the computer-readable storage medium provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the automatic video switching method provided in any embodiment of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (12)

1. An automatic video clipping and switching method is characterized by comprising the following steps:
A. acquiring parameter information of a panoramic view, a standing figure view and a close-up view which are sequentially arranged from low priority to high priority based on a video image sequence image, wherein the parameter information comprises bounding box information and state information;
B. determining that an output time of a current output view has exceeded a threshold;
C. if only one standing character in the standing character views speaks, the state information of the close-up view is effective, and the close-up view is cut from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing figure exists in the panoramic view, the state information of the standing figure view is effective, and the standing figure view is cut from the video image sequence and output according to the bounding box information of the standing figure view; otherwise, outputting the panoramic view.
2. The method for automatically cropping and switching a video according to claim 1, wherein the obtaining of the standing character view specifically comprises:
directly cutting, from the panoramic view, a minimum view containing the bounding boxes of all standing characters, and using it as the standing character view; or
and acquiring the independent bounding box of each standing character, and performing region fusion on the independent bounding boxes of all the standing characters to obtain the standing character view.
3. The method according to claim 2, wherein the obtaining of the independent bounding box of each standing character and the region fusion of the independent bounding boxes of all standing characters to obtain the standing character view specifically comprises:
acquiring an independent bounding box of each standing figure;
and sequentially performing scaled splicing on the independent bounding boxes of the standing characters so that the heights of the standing characters are consistent, wherein each independent bounding box is scaled in proportion, consistent with its original aspect ratio.
4. The method for automatically cutting and switching videos according to claim 1, wherein the obtaining of the close-up view parameter information specifically includes:
carrying out person detection on the video image frame sequence to obtain person detection information of each standing person in the current frame video image, wherein the person detection information comprises: face feature information, lip feature information and coordinate information;
maintaining a character position sequence table according to the character detection information;
determining that only one standing character is speaking according to the lip characteristic information in the character position sequence list;
and taking the obtained bounding box information corresponding to the close-up picture of the talking standing character as the bounding box information of the close-up view, and setting the state information of the close-up view as effective.
5. The method of claim 4, wherein determining that only one standing character is speaking according to the lip feature information in the character position sequence list comprises:
sequentially extracting K frames of lip characteristic information in the character position sequence information corresponding to each standing character from the current frame and back at intervals of N frames, and sending the lip characteristic information corresponding to the extracted K frames of video images into a speaking classifier to obtain the real-time score of each standing character;
carrying out average calculation on the real-time scores of the previous M times of calculation of each standing figure to obtain the speaking score of the current frame video image, and if the speaking score is greater than or equal to a preset threshold value, judging that the standing figure is speaking;
the number of standing characters that are speaking is counted, and if this number is equal to 1, it is determined that only one standing character is speaking.
6. The method according to claim 1, wherein the parameter information of the panoramic view, the standing character view, and the close-up view sequentially arranged from low to high in priority level based on the video image sequence image acquisition is specifically: and acquiring parameter information of the panoramic view, the standing figure view and the close-up view which are sequentially arranged from low priority to high priority based on the video image sequence image according to a preset period.
7. The method for automatically cutting and switching video according to claim 1, wherein said step C is followed by further comprising: if the current output view and the previous output view are not the same priority view, starting a timer of the current output view to count again; otherwise, the timer keeps counting.
8. An automatic video clipping and switching system, characterized by comprising:
the view acquisition module is used for acquiring parameter information of a panoramic view, a standing figure view and a close-up view which are sequentially arranged from low priority to high priority based on a video image sequence image, wherein the parameter information comprises bounding box information and state information;
an output time determination module for determining that the output time of the current output view has exceeded a threshold;
a view cropping output module, configured to: if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cut from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output.
9. The system for automatically cropping and switching videos according to claim 8, wherein the view acquisition module is specifically configured to:
obtaining a minimum view which directly cuts a bounding box containing all standing people from the panoramic view and is used as a standing people view; or
And acquiring the independent boundary frame of each standing figure, and performing region fusion on the independent boundary frames of all the standing figures to obtain a standing figure view.
10. The system for automatically switching videos according to claim 9, wherein the view acquisition module specifically includes:
a detection unit, configured to acquire person detection information of each standing character in the standing character view, wherein the person detection information includes: face feature information, lip feature information and coordinate information;
a list maintenance unit configured to maintain a character position sequence list according to the character detection information;
the speaking determining unit is used for determining that only one standing character is speaking according to the lip characteristic information in the character position sequence list;
and the information acquisition unit is used for taking the acquired bounding box information corresponding to the close-up picture of the talking standing character as the bounding box information of the close-up view and setting the state information of the close-up view as effective.
11. A video player, the video player comprising:
one or more processors;
a system memory to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for automatic cropping and switching of video according to any one of claims 1-7.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for automatic cropping and switching of videos according to any one of claims 1 to 7.
CN202111576101.XA 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium Active CN114257757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111576101.XA CN114257757B (en) 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111576101.XA CN114257757B (en) 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium

Publications (2)

Publication Number Publication Date
CN114257757A (en) 2022-03-29
CN114257757B (en) 2023-07-28

Family

ID=80796537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111576101.XA Active CN114257757B (en) 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium

Country Status (1)

Country Link
CN (1) CN114257757B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905734A (en) * 2014-04-17 2014-07-02 苏州科达科技股份有限公司 Method and device for intelligent tracking and photographing
CN106791485A (en) * 2016-11-16 2017-05-31 深圳市异度信息产业有限公司 The changing method and device of video
CN111785279A (en) * 2020-05-18 2020-10-16 北京奇艺世纪科技有限公司 Video speaker identification method and device, computer equipment and storage medium
CN112019759A (en) * 2020-10-26 2020-12-01 深圳点猫科技有限公司 Method, device and equipment for tracking and shooting students in recorded and broadcast classes
CN113055604A (en) * 2021-04-01 2021-06-29 四川新视创伟超高清科技有限公司 Optimal visual angle video processing system and method based on 8K video signal and AI technology
CN113177531A (en) * 2021-05-27 2021-07-27 广州广电运通智能科技有限公司 Speaking identification method, system, equipment and medium based on video analysis


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016978A (en) * 2023-01-05 2023-04-25 香港中文大学(深圳) Picture guiding and broadcasting method and device for online class, electronic equipment and storage medium
CN116016978B (en) * 2023-01-05 2024-05-24 香港中文大学(深圳) Picture guiding and broadcasting method and device for online class, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114257757B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
US20220254158A1 (en) Learning situation analysis method, electronic device, and storage medium
JP2022528294A (en) Video background subtraction method using depth
CN107909022B (en) Video processing method and device, terminal equipment and storage medium
CN110475069B (en) Image shooting method and device
JP2001273505A (en) Visual language classification system
JP7209851B2 (en) Image deformation control method, device and hardware device
CN113052085A (en) Video clipping method, video clipping device, electronic equipment and storage medium
KR20220148915A (en) Audio processing methods, apparatus, readable media and electronic devices
WO2020052062A1 (en) Detection method and device
CN114257757B (en) Automatic video clipping and switching method and system, video player and storage medium
CN115311178A (en) Image splicing method, device, equipment and medium
CN114245032B (en) Automatic switching method and system for video framing, video player and storage medium
CN112532785B (en) Image display method, image display device, electronic apparatus, and storage medium
JP2018041049A (en) Learning support system, learning support program and learning support method
JP4110323B2 (en) Information output method and apparatus, program, and computer-readable storage medium storing information output program
CN111161592B (en) Classroom supervision method and supervising terminal
US20230030502A1 (en) Information play control method and apparatus, electronic device, computer-readable storage medium and computer program product
CN112330579A (en) Video background replacing method and device, computer equipment and computer readable medium
WO2020244076A1 (en) Face recognition method and apparatus, and electronic device and storage medium
CN112668561B (en) Teaching video segmentation determination method and device
CN114281236A (en) Text processing method, device, equipment, medium and program product
CN107609018B (en) Search result presenting method and device and terminal equipment
CN111062337B (en) People stream direction detection method and device, storage medium and electronic equipment
CN115410232B (en) Blackboard writing snapshot method and device, electronic equipment and readable storage medium
WO2021145715A1 (en) Apparatus and method for enhancing videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant