CN114257757B - Automatic video clipping and switching method and system, video player and storage medium - Google Patents


Info

Publication number
CN114257757B
Authority
CN
China
Prior art keywords: view, standing, character, information, speaking
Legal status: Active
Application number
CN202111576101.XA
Other languages
Chinese (zh)
Other versions
CN114257757A (en)
Inventor
张明
董健
Current Assignee
Ruimo Intelligent Technology Shenzhen Co ltd
Original Assignee
Ruimo Intelligent Technology Shenzhen Co ltd
Application filed by Ruimo Intelligent Technology Shenzhen Co ltd
Priority to CN202111576101.XA
Publication of CN114257757A
Application granted
Publication of CN114257757B


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2628Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Studio Circuits (AREA)
  • Studio Devices (AREA)

Abstract

The invention discloses an automatic video cutting and switching method and system, a video player and a storage medium. The method comprises the following steps: acquiring, based on the images of a video image sequence, parameter information of a panoramic view, a standing character view and a close-up view whose priority levels are arranged in order from low to high; when the output time of the current output view exceeds a threshold value, if only one standing character in the standing character view is speaking, cutting the close-up view out of the video image sequence according to the boundary box information of the close-up view and outputting it; otherwise, if at least one standing character exists in the panoramic view, cutting the standing character view out of the video image sequence according to the boundary box information of the standing character view and outputting it; otherwise, outputting the panoramic view. The invention can perform real-time close-up shooting of students who stand up to speak in a classroom, can automatically handle any number of standing speakers, can complete automatic switching between standing speakers, and can realize automatic switching between different viewing angles.

Description

Automatic video clipping and switching method and system, video player and storage medium
Technical Field
The embodiment of the invention relates to the technical field of videos, in particular to an automatic cutting and switching method and system of videos, a video player and a storage medium.
Background
The functions of recorded-broadcast classroom applications in teaching are increasingly prominent in the accumulation of quality courses, the construction of school resources and the improvement of teaching and research levels. In an intelligent classroom recording system, close-up shooting is required for students who stand up to answer questions. The prior art mainly adopts an optical zoom method or a simple digital cropping method to perform close-up shooting of standing students, but the optical zoom method has the following problems: 1. close-up shooting has dynamic zoom-in and zoom-out processes, and these dynamic processes affect the display effect of the output video; 2. during close-up shooting, the viewing angle of the lens is very small and the perception of the whole classroom is lost; for example, when another student stands up, the limited viewing angle of the lens means the system cannot handle the situation; 3. the cost is high: for ordinary classrooms, or places and regions with limited education budgets, the cost of using a high-cost optical zoom lens is prohibitive. The conventional simple digital cropping method can avoid these problems, but its execution logic is very simple, it cannot handle the situation of multiple standing students, and its degree of intelligence is very low.
Disclosure of Invention
The invention provides an automatic video cutting and switching method and system, a video player and a storage medium, which are used for realizing real-time close-up shooting of students who stand up to speak in a classroom, automatically handling any number of standing speakers, automatically switching between different standing speakers, and automatically switching between different viewing angles.
In a first aspect, an embodiment of the present invention provides an automatic cropping switching method of a video, where the automatic cropping switching method of a video includes:
A. acquiring, based on the images of a video image sequence, parameter information of a panoramic view, a standing character view and a close-up view whose priority levels are arranged in order from low to high, wherein the parameter information comprises boundary box information and state information;
B. determining that the output time of the current output view has exceeded a threshold;
C. if only one of the standing character views is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, outputting the panoramic view.
In a second aspect, an embodiment of the present invention provides an automatic cropping switching system for a video, where the automatic cropping switching system for a video includes:
the view acquisition module is used for acquiring, based on the images of the video image sequence, parameter information of a panoramic view, a standing character view and a close-up view whose priority levels are arranged in order from low to high, wherein the parameter information comprises boundary box information and state information;
an output time determining module, configured to determine that an output time of a current output view has exceeded a threshold;
the view cropping output module is used for: if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, the panoramic view is output.
In a third aspect, an embodiment of the present invention further provides a video player, including:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the automatic crop switching method for video as described above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the automatic cropping switching method of video as described above.
When the output time of the current output view exceeds a threshold value, if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, the panoramic view is output. The invention can perform real-time close-up shooting of students who stand up to speak in a classroom, can automatically handle any number of standing speakers, can complete automatic switching between standing speakers, and can realize automatic switching between different viewing angles.
Drawings
Fig. 1 is a method flowchart of a method for automatically cropping and switching a video according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a sub-method of an automatic cropping switching method for video according to a first embodiment of the present invention;
FIG. 3 is a flow chart of another method for automatic cropping switching of video according to the second embodiment of the present invention;
FIG. 4 is a flowchart of a sub-method of an automatic cropping switching method for video according to a second embodiment of the present invention;
FIG. 5 is a flow chart of another sub-method of the automatic cropping switching method of a video provided in the second embodiment of the present invention;
FIG. 6 is a block diagram of an automatic cropping switching system for video according to a third embodiment of the present invention;
FIG. 7 is a block diagram of an automatic cropping switching system for another video provided in a third embodiment of the present invention;
fig. 8 is a block diagram of a video player according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a method for automatically cutting and switching video according to the first embodiment of the present invention. The method is applicable to systems that need to switch between different views of a classroom recording, may be performed by a video player, and specifically includes the following steps:
Step S110, acquiring, based on the images of the video image sequence, parameter information of a panoramic view, a standing character view and a close-up view arranged in order of priority from low to high, wherein the parameter information comprises boundary box information and state information.
In this embodiment, different framing views are assigned priority levels; the framing views are divided, in order of priority from low to high, into a panoramic view, a standing character view and a close-up view, and the boundary box information and state information corresponding to the framing view of each level are obtained, wherein the state information indicates whether the corresponding framing view is valid: if the state information is valid, the corresponding framing view meets a preset state condition; if the state information is invalid, the corresponding framing view does not meet the preset state condition.
Step S120, judging whether the output time of the current output view exceeds a threshold value; if yes, executing step S130; if not, executing step S140: continuing to output the current output view.
Whether the output time of the current output view exceeds the threshold value is judged; if yes, the current output view has been output for a long time, and whether the view needs to be switched is judged according to preset switching conditions; if not, no operation is performed and the current output view continues to be output. The threshold may be set according to actual needs and is not limited here.
Step S130, if only one standing person in the standing person views is speaking, the state information of the close-up views is valid, and the close-up views are cut out from the video image sequence according to the boundary box information of the close-up views and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, outputting the panoramic view.
Specifically, as shown in fig. 2, step S130 includes steps S131 to S135, and the specific contents are as follows:
Step S131, judging whether only one standing character in the standing character view is speaking; if yes, executing step S134; if not, executing step S132.
Step S132, judging whether at least one standing person exists in the panoramic view, if so, executing step S133, and if not, executing step S135.
Step S133, determining that the state information of the standing character view is valid, cropping the standing character view from the video image sequence according to the boundary box information of the standing character view, and outputting it.
Step S134, determining that the state information of the close-up view is valid, cropping the close-up view from the video image sequence according to the boundary box information of the close-up view, and outputting it.
Step S135, outputting the panoramic view.
In this embodiment, the framing views are divided into a panoramic view, a standing character view and a close-up view, whose priority levels increase in that order. If the output time of the current output view exceeds the threshold value, the state information of the framing views is checked in order of priority from high to low, that is, the state information of the close-up view, the standing character view and the panoramic view is checked in turn. First, whether the state information of the close-up view is valid is judged; if it is valid, that is, only one character in the standing character view (the minimum view containing all standing characters within the shooting range) is speaking, the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output, that is, a close-up shot of the speaking standing character is output. If the state information of the close-up view is invalid, whether the state information of the standing character view is valid is judged; if it is valid, that is, there is at least one standing character in the panoramic view, the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output. If the state information of the standing character view is invalid, whether the state information of the panoramic view is valid is judged; if it is valid, the panoramic view is cut out from the video image sequence according to the boundary box information of the panoramic view and output.
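The priority cascade described above can be sketched in Python. This is an illustrative sketch, not the patent's actual implementation; the `View` structure and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class View:
    name: str
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) boundary box within the panorama
    valid: bool                      # state information: does the view meet its condition?


def select_view(closeup: View, standing: View, panorama: View) -> View:
    """Check state information from highest to lowest priority and return the
    first valid view; the panoramic view's state information is always valid."""
    for view in (closeup, standing, panorama):
        if view.valid:
            return view
    return panorama
```

A switcher would call `select_view` once the current view's output time exceeds the threshold, then crop the frame using the returned boundary box.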
In this embodiment, the state information of the panoramic view is always valid, and if the state information of the standing character view is invalid, the panoramic view is output, making the video output logic more robust. This embodiment can be applied to a classroom recording and broadcasting system: it can perform real-time close-up shooting of students who stand up to speak in a classroom, can automatically handle any number of standing speakers, and can complete automatic switching between different standing speakers; when no student is speaking, or two or more students are speaking, it switches to the standing character view, and when there is no standing character, it automatically switches to the panoramic view.
When the output time of the current output view exceeds a threshold value, if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, the panoramic view is output. The invention can perform real-time close-up shooting of students who stand up to speak in a classroom, can automatically handle any number of standing speakers, can complete automatic switching between standing speakers, realizes automatic switching between different viewing angles, and has low implementation cost and simple deployment.
Example 2
Fig. 3 is a flowchart of a method for automatically cutting and switching video according to a second embodiment of the present invention, where the method may be applied to a classroom recording and playing system, and the method may be executed by a video player, and specifically includes the following steps:
Step S210, acquiring, according to a preset period and based on the images of the video image sequence, parameter information of a panoramic view, a standing character view and a close-up view arranged in order of priority from low to high, wherein the parameter information comprises boundary box information and state information.
In this embodiment, different framing views are assigned priorities, and the boundary box information and state information corresponding to each framing view are obtained according to a preset period, wherein the state information indicates whether the corresponding framing view is valid: if the state information is valid, the corresponding framing view meets a preset state condition; if the state information is invalid, the corresponding framing view does not meet the preset state condition. The preset period can be set according to actual needs and depends on the required switching response time: if the period is too short, the system load is excessive; if it is too long, the response time of picture switching is too long. The preset period generally ranges from 0.1 s to 0.5 s.
In some embodiments, the panoramic view is the image shot by the camera, the boundary box information of the panoramic view is the entire shot picture of the camera, and the state information of the panoramic view is always valid. Obtaining the standing character view specifically comprises: acquiring, directly cut from the panoramic view, the minimum view of a bounding box containing all standing characters as the standing character view; or acquiring an independent boundary box for each standing character and performing region fusion on the independent boundary boxes of all standing characters to obtain the standing character view. If the number of standing characters in the panoramic image is zero, the state information of the standing character view is invalid; otherwise, the state information of the standing character view is valid, and the standing character view is the view corresponding to the minimum bounding box containing all standing characters, or the view obtained by fusing the independent boundary box regions of all standing characters. For the close-up view, if it is judged, based on a deep convolutional neural network, that only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the boundary box information of the close-up view is the boundary box information corresponding to the close-up picture of that standing character; that is, the close-up view is the close-up picture of the speaking standing character. Otherwise, the state information of the close-up view is invalid. In some embodiments, acquiring the independent boundary box of each standing character and performing region fusion on the independent boundary boxes of all standing characters to obtain the standing character view specifically comprises:
a1, acquiring an independent boundary box of each standing character;
a2, sequentially scaling and splicing the independent boundary boxes of the standing characters so that their heights are consistent, wherein the scaling ratio of each independent boundary box is consistent with its original aspect ratio. This makes the composition of the resulting standing character view aesthetically pleasing, and keeping the scaling ratio of each independent boundary box consistent with its original aspect ratio avoids stretching the picture.
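Steps a1-a2 can be sketched as follows. The target height and the side-by-side splice layout are illustrative assumptions; the key property is that each box is scaled uniformly, preserving its original aspect ratio.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def fuse_standing_boxes(boxes: List[Box], target_h: int = 360):
    """Scale each standing character's independent boundary box to a common
    height, keeping its original aspect ratio to avoid stretching, then
    splice the scaled boxes side by side into one standing character view.
    Returns the fused view size and (source box, destination box) pairs."""
    placements = []
    fused_w = 0
    for (x, y, w, h) in boxes:
        scale = target_h / h            # uniform scale preserves the aspect ratio
        new_w = round(w * scale)
        placements.append(((x, y, w, h), (fused_w, 0, new_w, target_h)))
        fused_w += new_w
    return (fused_w, target_h), placements
```

A renderer would then crop each source box from the panorama, resize it to its destination box, and paste the results into the fused canvas.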
In some embodiments, as shown in fig. 4, the acquisition of the close-up view parameter information specifically includes steps S211 to S214, and the specific contents are as follows:
step S211, performing person detection on the video image frame sequence to obtain person detection information of each standing person in the current frame video image, where the person detection information includes: face feature information, lip feature information and coordinate information.
In this embodiment, a face detection algorithm is used to detect the faces of the characters in the video image frame sequence, thereby obtaining the face feature information and lip feature information of each standing character in the current frame, and the coordinate information of each standing character is obtained from the current frame video image.
In some embodiments, extracting lip feature information from the face detection result includes steps a to c, which are specifically as follows:
a. detecting key points of the face in the face detection result to locate the position of the lips;
b. cutting out an image block centered on the lip position and scaling it to a fixed size to obtain a lip image;
c. inputting the lip image into a convolutional neural network to obtain the lip feature information.
Face detection and lip feature extraction must run close to real time, so a lightweight network architecture and fast image processing algorithms can be used to improve the processing efficiency of the device and its response speed.
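Step b might look like the following sketch. The patch size and the nested-list image representation are placeholders for illustration; a real system would use an image library for the resize and a trained CNN for step c.

```python
def crop_lip_patch(frame, lip_center, patch_size=32):
    """Cut a square image block centered on the detected lip position (step b).
    `frame` is a 2-D grid of pixels; the block is clamped to the frame edges."""
    cx, cy = lip_center
    half = patch_size // 2
    h, w = len(frame), len(frame[0])
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, w), min(cy + half, h)
    return [row[x0:x1] for row in frame[y0:y1]]
```

The cropped block would then be rescaled to the network's fixed input size and passed through the convolutional feature extractor (step c).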
Step S212, maintaining a character position sequence table according to the person detection information.
A piece of character position sequence information is constructed for each standing character, comprising face feature information, lip feature information, coordinate information and the timestamp of the current frame image; all of these entries form the character position sequence table. Preferably, the character position sequence table is maintained according to the person detection information detected in each frame of the video image, specifically comprising: matching the detected face feature information against the face feature information of the characters in the character position sequence table; if a matching character exists, updating the character position sequence information corresponding to that character; if no matching character exists, constructing new character position sequence information for the character. Alternatively, the coordinate information detected in the current frame image is matched against the coordinate information of the characters in the character position sequence table; if a matching character exists, the position sequence information corresponding to that character is updated; if no matching character exists, new character position sequence information is constructed for the character. Matching by face feature information improves detection accuracy and thus the accuracy of locating the close-up shot; for places with fixed seats, especially fixed and orderly arranged seats, updating the character position sequence table by coordinate information imposes a small burden on the system and gives a faster response speed while still ensuring accuracy.
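The table maintenance described above can be sketched as follows. The entry fields, the similarity callback (which could compare face features or coordinates) and the match threshold are illustrative assumptions.

```python
def update_sequence_table(table, detections, similarity, threshold=0.8):
    """Match each detection against existing character position sequence
    entries via `similarity` (face-feature or coordinate matching); update
    the best-matching entry, otherwise create a new one."""
    for det in detections:
        best, best_score = None, threshold
        for entry in table:
            score = similarity(entry, det)
            if score >= best_score:
                best, best_score = entry, score
        if best is not None:
            # matched: update coordinates, extend the lip track, refresh timestamp
            best["coords"] = det["coords"]
            best["lip_track"].append(det["lip_feat"])
            best["ts"] = det["ts"]
        else:
            # no match: construct new character position sequence information
            table.append({"face": det["face_feat"],
                          "coords": det["coords"],
                          "lip_track": [det["lip_feat"]],
                          "ts": det["ts"]})
    return table
```

The accumulated `lip_track` of each entry is what the speaking classifier in step S213 consumes.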
Step S213, determining, according to the lip feature information in the character position sequence table, that only one standing character is speaking.
Each standing character corresponds to a sequence of lip feature information in the character position sequence table, that is, a lip track recording the lips' change over time; whether the corresponding character is speaking can be accurately judged from this sequence, and it is then determined from the judgment results whether only one standing character is speaking.
In some embodiments, as shown in fig. 5, step S213 of determining that only one standing character is speaking according to the lip feature information in the character position sequence table comprises steps S2131 to S2133, as follows:
step S2131, sequentially extracting K frames from the lip feature information in the character position sequence information corresponding to each simultaneous character in a mode of starting from the current frame and N frames at intervals, and sending the lip feature information corresponding to the extracted K frame video images into a speaking classifier to obtain the real-time score of each simultaneous character.
Each time the real-time score of a standing character's lip track is calculated, lip feature information is extracted backwards from the current frame at intervals of N frames; the lip feature information corresponding to the extracted K video frames is concatenated in order and then fed into the speaking classifier, yielding the real-time score of that standing character's lip track.
Step S2132, averaging the real-time scores calculated in the previous M rounds for each standing character to obtain the speaking score for the current frame video image, and judging that the standing character is speaking if the speaking score is greater than or equal to a preset threshold value.
The lip track of each standing character consists of a series of real-time scores; the higher the value, the more likely the corresponding character is speaking. The real-time scores calculated in the previous M rounds for each standing character are averaged to give the speaking score for the current frame video image, and if a standing character's speaking score for the current frame is greater than or equal to the preset threshold value, that standing character is judged to be speaking. M need only equal the number of frames the camera captures in 1 s: if the value of M is too small, the calculation load is too large and system function is affected; if the value of M is too large, real-time performance decreases.
Step S2133, counting the number of standing characters who are speaking, and if the number is equal to 1, determining that only one standing character is speaking.
In this embodiment, whether a character is speaking can be judged from the character's lip features, and the number of speaking characters is counted; if the number of speaking standing characters is equal to 1, that is, only one standing character is speaking, a close-up shot of that character is cut out from the current frame video image according to the corresponding boundary box information.
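The K-frame sampling and M-round averaging of steps S2131-S2133 might be sketched as follows. The speaking classifier itself is not shown, and the threshold value is an illustrative assumption.

```python
def sample_lip_track(lip_feats, K, N):
    """Pick K lip features, starting from the current (last) frame and
    stepping back N frames at a time, returned oldest first so they can
    be concatenated and fed to the speaking classifier (step S2131)."""
    picked, idx = [], len(lip_feats) - 1
    while len(picked) < K and idx >= 0:
        picked.append(lip_feats[idx])
        idx -= N
    return picked[::-1]


def speaking_score(realtime_scores, M):
    """Average the real-time scores of the previous M rounds (step S2132)."""
    recent = realtime_scores[-M:]
    return sum(recent) / len(recent) if recent else 0.0


def speakers(score_by_character, M, threshold=0.5):
    """Return the characters whose speaking score meets the threshold;
    a close-up view is only produced when exactly one qualifies (step S2133)."""
    return [c for c, scores in score_by_character.items()
            if speaking_score(scores, M) >= threshold]
```

The switching logic would then check `len(speakers(...)) == 1` before marking the close-up view's state information as valid.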
Step S214, setting the acquired boundary box information corresponding to the close-up picture of the speaking standing character as the boundary box information of the close-up view, and setting the state information of the close-up view to valid.
If there is no speaking standing character in the video image, the state information of the close-up view is invalid; if two or more standing characters are speaking in the video image, the state information of the close-up view is also invalid; the state information of the close-up view is valid only when exactly one standing character in the video image is speaking. In that case, the boundary box information corresponding to the close-up picture of the speaking standing character in the current frame video image is obtained; this boundary box information is the boundary box information of the close-up view, and the state information of the close-up view is set to valid.
According to this embodiment, whether a character is speaking is judged from a sequence of the character's lip features (i.e., the lip track), which allows speaking to be determined accurately. If only one standing character is speaking, the bounding box information corresponding to that character's close-up picture in the current frame video image is obtained; that is, the close-up picture of the speaking standing character is cropped from the current frame video image according to the corresponding bounding box information. Locating the standing speaker in the video through visual processing removes the need for a microphone array in a classroom recording-and-broadcasting system, which is of significant value for classroom recording and low-cost classrooms: the positioning and shooting function is obtained without added cost.
Step S220, judging whether the output time of the current output view exceeds a threshold value, if so, executing step S230; if not, go to step S290, do not operate, continue to output the current output view.
Judge whether the output time of the current output view exceeds a threshold value. If so, the current output view has already been output for a long time, and whether the framed view needs to be switched is judged according to the preset conditions; otherwise, the current view has not yet been displayed for the designated time, so no further operation is needed and the current output view continues to be output. The threshold may be set according to actual needs and is not limited here.
Step S230, judging whether only the standing character in the standing character view is speaking, if so, executing step S250; if not, go to step S240.
Step S240, judging whether at least one standing person exists in the panoramic view, if so, executing step S260, and if not, executing step S270.
And step S250, judging that the state information of the close-up view is effective, and cutting out the close-up view from the video image sequence according to the boundary box information of the close-up view and outputting the cut-up view.
Step S260, judging that the state information of the standing character view is valid, cutting the standing character view from the video image sequence according to the boundary box information of the standing character view, and outputting the cut standing character view.
Step S270, outputting the panoramic view. In this embodiment, the state information of the panoramic view is set to valid. In actual operation, step S270 specifically includes: judging whether the state information of the panoramic view is valid; if so, cropping the panoramic view from the video image sequence according to the bounding box information corresponding to the panoramic view and outputting it; if not, a designated screen may be output. Since the state information of the panoramic view is set to valid in this embodiment, the panoramic view is output directly whenever the state information of the standing character view is invalid. This keeps the framed video output smooth and logical and better matches users' viewing habits.
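Steps S220 through S270 form a fixed priority cascade (close-up > standing character > panorama). A minimal sketch of that decision, with assumed argument names and a simplified boolean interface:

```python
def select_view(output_time, time_threshold, one_speaker, any_standing,
                current_view):
    """Sketch of the switching decision in steps S220-S270.

    Views in descending priority: close-up, standing character, panorama.
    Returns the name of the view to output for this frame."""
    if output_time <= time_threshold:   # S220: keep outputting current view
        return current_view
    if one_speaker:                     # S230: exactly one standing speaker
        return "closeup"                # S250: close-up view is valid
    if any_standing:                    # S240: at least one standing character
        return "standing"               # S260: standing character view
    return "panorama"                   # S270: fall back to the panoramic view
```

Note that the panoramic branch needs no validity check here because, as the embodiment states, its state information is always set to valid.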
Step S280 is further included after step S250, step S260 and step S270, and the specific contents are as follows:
step S280, if the current output view and the last output view are not views of the same priority level, restart the timer of the current output view; otherwise, keep the timer counting. The timer measures the output time of the output view: if consecutive outputs are views of the same priority level, the timer keeps counting; if they are views of different priority levels, timing restarts. This maintains the integrity of the system logic.
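The timer behavior of step S280 can be illustrated as follows. The class and method names are assumptions for the sketch; `time.monotonic()` stands in for whatever clock the system uses.

```python
import time

class ViewTimer:
    """Sketch of step S280: restart on a priority-level change, keep
    counting while consecutive outputs share the same priority level."""

    def __init__(self):
        self.level = None
        self.start = None

    def on_output(self, priority_level, now=None):
        now = time.monotonic() if now is None else now
        if priority_level != self.level:   # different priority: re-time
            self.level = priority_level
            self.start = now
        # same priority: keep the existing start time

    def elapsed(self, now=None):
        now = time.monotonic() if now is None else now
        return 0.0 if self.start is None else now - self.start
```

`elapsed()` supplies the output time that step S220 compares against the threshold, so a switch to a different-priority view always restarts the dwell-time countdown.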
According to the technical scheme of this embodiment, parameter information of the panoramic view, the standing character view, and the close-up view is obtained based on the video image sequence images. Whether a character is speaking is judged from a sequence of the character's lip features (i.e., the lip track), so speaking can be determined accurately. If only one standing character is speaking, the state information of the close-up view is valid, and the bounding box information corresponding to the speaking character's close-up picture in the current frame video image is obtained; the close-up picture of the speaking standing character is then cropped from the current frame video image according to that bounding box information. If the state information of the close-up view is invalid and at least one standing character is within the shooting range, the state information of the standing character view is valid, and the standing character view is cropped from the video image sequence according to its bounding box information and output. If the state information of the standing character view is invalid, the panoramic view is cropped from the video image sequence according to its bounding box information and output. The invention automatically switches framed views according to their priority levels and states, realizing automatic switching between different framed views in small application scenes and non-professional scenes, enriching video content, meeting users' needs in different scenes, with low implementation cost and simple deployment.
Example III
The automatic video cropping and switching system provided by this embodiment of the invention can execute the automatic video cropping and switching method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method; for details not covered in this embodiment, refer to the corresponding content in the first and second embodiments of the invention.
Fig. 6 is a block diagram of an automatic video cropping switching system according to a third embodiment of the present invention, and as shown in fig. 6, the automatic video cropping switching system according to the present embodiment includes a view acquisition module 10, an output time determination module 20, and a view cropping output module 30, and the specific contents are as follows:
the view acquisition module 10 is configured to acquire parameter information of viewfinder views with different priority levels based on a video image sequence image, where the parameter information includes bounding box information and state information.
An output time determination module 20 is configured to determine that the output time of the current output view has exceeded a threshold.
A view clipping output module 30, configured to: if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cropped from the video image sequence according to the bounding box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cropped from the video image sequence according to the bounding box information of the standing character view and output; otherwise, the panoramic view is output.
According to this embodiment, standing, speaking students in a classroom can be given real-time close-up shots; any number of standing speakers can be handled automatically, switching between standing speakers is completed automatically, and automatic switching between different viewing angles is achieved. In some embodiments, as shown in fig. 7, the view acquisition module 10 specifically includes a detection unit 11, a list maintenance unit 12, a talk determination unit 13, and an information acquisition unit 14, with the specific contents as follows:
a detecting unit 11, configured to perform person detection on a sequence of video image frames, and obtain person detection information of each standing person in a current frame video image, where the person detection information includes: face feature information, lip feature information and coordinate information.
A list maintenance unit 12 for maintaining a person position sequence list according to the person detection information.
A speaking determining unit 13 for determining that only one standing person is speaking based on the lip characteristic information in the person position sequence list.
An information acquisition unit 14 for setting the acquired bounding box information corresponding to the close-up picture of the speaking standing person as the bounding box information of the close-up view and setting the state information of the close-up view to be valid.
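The detection unit 11 and list maintenance unit 12 above might be sketched as below. The data-structure layout, field names, and ID-based matching are assumptions for illustration; the patent does not specify how detections are associated across frames.

```python
from dataclasses import dataclass

@dataclass
class PersonDetection:
    """Assumed structure of the per-frame person detection information."""
    face_features: list   # face feature information
    lip_features: list    # lip feature information for this frame
    bbox: tuple           # coordinate information, e.g. (x, y, w, h)

class PositionSequenceList:
    """Minimal sketch of the person position sequence list: detections
    matched to a tracked person ID, accumulated frame by frame."""

    def __init__(self):
        self.tracks = {}  # person id -> list of per-frame detections

    def update(self, person_id, detection):
        self.tracks.setdefault(person_id, []).append(detection)

    def lip_sequence(self, person_id):
        """The lip-feature history consumed by the talk determination unit."""
        return [d.lip_features for d in self.tracks[person_id]]
```

The talk determination unit 13 would then read `lip_sequence()` for each tracked standing character rather than re-detecting lips from scratch.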
In some embodiments, the speech determination unit 13 is specifically configured to:
sequentially extracting K frames of lip feature information from the character position sequence information corresponding to each standing character, in a mode of starting from the current frame and spacing N frames apart, and sending the lip feature information corresponding to the extracted K frames of video images into a speaking classifier to obtain a real-time score of each standing character;
calculating the average value of the real-time scores calculated for the previous M times of each standing character to obtain the speaking score of the video image of the current frame, and judging that the standing character is speaking if the speaking score is greater than or equal to a preset threshold value;
counting the number of standing characters in speaking, and if the number of standing characters in speaking is equal to 1, determining that only one standing character is speaking.
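The K-frame sampling step can be sketched as below. The function name and the interpretation of "spacing N frames apart" as a stride of N+1 are assumptions, as are the default values of K and N.

```python
def sample_lip_frames(lip_sequence, k=8, n=2):
    """Extract K entries from a lip-feature history, starting at the
    current (last) frame and skipping N frames between samples.

    lip_sequence is ordered oldest to newest. Returns the K sampled
    entries, oldest first, ready to feed the speaking classifier."""
    idxs = [len(lip_sequence) - 1 - i * (n + 1) for i in range(k)]
    if idxs[-1] < 0:
        raise ValueError("sequence too short for K samples at interval N")
    return [lip_sequence[i] for i in reversed(idxs)]
```

Sampling at an interval rather than taking K consecutive frames lets the classifier see a longer temporal window of lip motion for the same input size.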
The present embodiment can accurately determine whether a person is speaking by determining whether the person is speaking based on a sequence of lip features (i.e., lip trajectories) of the person. In this embodiment, if only one standing person is speaking, the bounding box information corresponding to the close-up frame of the speaking standing person in the current frame video image is obtained, that is, the close-up frame of the speaking standing person is cut out from the current frame video image according to the bounding box information corresponding to the close-up frame.
In some embodiments, the view acquisition module 10 is also specifically configured to: acquire, as the standing character view, the minimum view of a bounding box that contains all standing characters and is directly cropped from the panoramic view; or acquire the independent bounding box of each standing character and perform region fusion on the independent bounding boxes of all standing characters to obtain the standing character view. As a preferred embodiment, the view acquisition module 10 is specifically configured to:
obtaining an independent boundary box of each standing character;
and sequentially scale and stitch the independent bounding boxes of the standing characters so that the standing characters' heights are consistent, where each independent bounding box is scaled in proportion to its original aspect ratio.
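A minimal sketch of this scale-and-stitch fusion follows. The function name, the target height parameter, and the left-to-right placement are assumptions; only the aspect-ratio-preserving scaling and side-by-side splicing come from the text above.

```python
def fuse_standing_boxes(boxes, target_h=200):
    """Scale each standing character's independent bounding box to a
    common height, preserving its aspect ratio, then stitch the boxes
    side by side. Returns placed boxes as (x, y, w, h)."""
    placed, x = [], 0
    for (w, h) in boxes:
        scale = target_h / h            # uniform scale keeps aspect ratio
        new_w = round(w * scale)
        placed.append((x, 0, new_w, target_h))
        x += new_w                      # splice to the right of the previous box
    return placed
```

Because every box is scaled uniformly, characters of different on-screen sizes appear at a consistent height in the fused standing character view without distortion.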
In some embodiments, the view acquisition module 10 is also specifically configured to acquire, based on the video image sequence images and a multi-person cropping algorithm, a standing character view containing all standing characters in the panoramic view.
In some embodiments, the view acquisition module 10 is specifically configured to acquire parameter information of a panoramic view, an upright character view, and a close-up view, which are sequentially arranged from low to high in priority, based on the video image sequence image in a preset period.
In some embodiments, as shown in fig. 7, the automatic video cropping and switching system provided in this embodiment further includes a timing module 40, configured to restart the timer of the current output view if the current output view and the previous output view are not views of the same priority level; otherwise, the timer keeps counting.
According to the technical scheme of this embodiment, parameter information of the panoramic view, the standing character view, and the close-up view is obtained based on the video image sequence images. If the output time of the current output view exceeds the threshold and the state information of the close-up view is valid, the close-up view is cropped from the video image sequence according to its bounding box information and output. If the state information of the close-up view is invalid, whether the state information of the standing character view is valid is judged; if valid, the standing character view is cropped from the video image sequence according to its bounding box information and output; if invalid, the panoramic view is cropped from the video image sequence according to its bounding box information and output. The invention can take real-time close-up shots of standing, speaking students in a classroom, automatically handle any number of standing speakers, complete automatic switching between standing speakers, and realize automatic switching between different viewing angles.
Example IV
Fig. 8 is a block diagram of a video player according to a fourth embodiment of the present invention. Fig. 8 shows a block diagram of an exemplary video player 80 suitable for use in implementing embodiments of the present invention. The video player 80 shown in fig. 8 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 8, the video player 80 is in the form of a general purpose computing device. Components of video player 80 may include, but are not limited to: one or more processors or processing units 82, a system memory 81, and a bus 83 that connects the various system components, including the system memory 81 and the processing units 82.
Bus 83 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Video player 80 typically includes a variety of computer system readable media. Such media can be any available media that can be accessed by video player 80 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 81 may include computer system readable media in the form of volatile memory such as Random Access Memory (RAM) 811 and/or cache memory 814. Video player 80 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 812 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be coupled to bus 83 via one or more data medium interfaces. Memory 81 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
Programs/utilities 813 having a set (at least one) of program modules 8131 may be stored in, for example, system memory 81, such program modules 8131 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 8131 generally perform the functions and/or methodologies of the described embodiments of the present invention.
The video player 80 may also communicate with one or more external devices 100 (e.g., keyboard, pointing device, display 90, etc.), one or more devices that enable a user to interact with the video player 80, and/or any devices (e.g., network card, modem, etc.) that enable the video player 80 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 84. Also, video player 80 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet, via network adapter 85. As shown, the network adapter 85 communicates with other modules of the video player 80 via the bus 83. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with video player 80, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 82 executes various functional applications and data processing by running programs stored in the system memory 81, for example, implementing the automatic video cropping and switching method provided by the embodiments of the present invention, the method including:
acquiring parameter information of a panoramic view, an upright character view and a close-up view, which are arranged in sequence from low priority level to high priority level, based on a video image sequence image, wherein the parameter information comprises boundary box information and state information;
determining that the output time of the current output view has exceeded a threshold;
if only one of the standing character views is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, outputting the panoramic view.
Example five
A fifth embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an automatic video cropping and switching method comprising:
Acquiring parameter information of a panoramic view, an upright character view and a close-up view, which are arranged in sequence from low priority level to high priority level, based on a video image sequence image, wherein the parameter information comprises boundary box information and state information;
determining that the output time of the current output view has exceeded a threshold;
if only one of the standing character views is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, outputting the panoramic view.
Of course, in the computer-readable storage medium provided by the embodiments of the present invention, the computer program is not limited to the method operations described above; it may also perform related operations in the automatic video cropping and switching method provided by any embodiment of the present invention.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. The automatic video cropping switching method is characterized by comprising the following steps of:
A. acquiring parameter information of a panoramic view, an upright character view and a close-up view, which are arranged in sequence from low priority level to high priority level, based on a video image sequence image, wherein the parameter information comprises boundary box information and state information;
B. determining that the output time of the current output view has exceeded a threshold;
C. if only one of the standing character views is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, outputting the panoramic view;
The acquiring of the close-up view parameter information specifically comprises the following steps:
performing person detection on the video image frame sequence to obtain person detection information of each standing person in the current frame video image, wherein the person detection information comprises: face feature information, lip feature information and coordinate information;
maintaining a person position sequence table according to the person detection information;
determining that only one standing character is speaking according to lip characteristic information in the character position sequence table;
taking the obtained boundary box information corresponding to the close-up picture of the speaking standing character as boundary box information of a close-up view, and setting the state information of the close-up view as valid;
the determining that only one standing character is speaking according to lip characteristic information in the character position sequence table comprises:
sequentially extracting K frames of lip feature information from the character position sequence information corresponding to each standing character, in a mode of starting from the current frame and spacing N frames apart, and sending the lip feature information corresponding to the extracted K frames of video images into a speaking classifier to obtain a real-time score of each standing character;
calculating the average value of the real-time scores calculated for the previous M times of each standing character to obtain the speaking score of the video image of the current frame, and judging that the standing character is speaking if the speaking score is greater than or equal to a preset threshold value;
Counting the number of standing characters in speaking, and if the number of standing characters in speaking is equal to 1, determining that only one standing character is speaking.
2. The automatic cropping switching method of video according to claim 1, wherein the obtaining of the standing character view is specifically:
acquiring a minimum view of a bounding box which contains all standing figures and is directly cut from the panoramic view as a standing figure view; or alternatively
And acquiring independent boundary boxes of each standing person, and carrying out region fusion on the independent boundary boxes of all the standing persons to obtain a standing person view.
3. The method for automatically cropping and switching video according to claim 2, wherein the obtaining the independent bounding boxes of each standing character, and performing region fusion on the independent bounding boxes of all standing characters, specifically comprises:
obtaining an independent boundary box of each standing character;
and sequentially performing expansion and contraction splicing on the independent boundary boxes of each standing character to ensure that the lengths of the standing characters are consistent, wherein the expansion and contraction proportion of the independent boundary boxes is consistent with the original length-width ratio of the independent boundary boxes.
4. The automatic cropping switching method of video according to claim 1, wherein the acquiring, based on the video image sequence images, of parameter information of the panoramic view, the standing character view and the close-up view arranged in order of priority from low to high is specifically: acquiring, according to a preset period and based on the video image sequence images, the parameter information of the panoramic view, the standing character view and the close-up view arranged in order of priority from low to high.
5. The automatic cropping switching method of video according to claim 1, wherein after the step C, further comprises: if the current output view and the last output view are not the same priority view, starting a timer of the current output view to reckon; otherwise, the timer continues to keep counting.
6. An automatic video cropping switching system, characterized by comprising:
the view acquisition module is used for acquiring parameter information of a panoramic view, an upright character view and a close-up view, which are arranged in sequence from low priority to high priority, based on the video image sequence images, wherein the parameter information comprises boundary box information and state information;
an output time determining module, configured to determine that an output time of a current output view has exceeded a threshold;
the view clipping output module is used for: if only one standing character in the standing character view is speaking, the state information of the close-up view is valid, and the close-up view is cut out from the video image sequence according to the boundary box information of the close-up view and output; otherwise, if at least one standing character exists in the panoramic view, the state information of the standing character view is valid, and the standing character view is cut out from the video image sequence according to the boundary box information of the standing character view and output; otherwise, outputting the panoramic view;
The view acquisition module specifically comprises:
the detecting unit is used for carrying out person detection on the video image frame sequence to obtain person detection information of each standing person in the video image of the current frame, wherein the person detection information comprises: face feature information, lip feature information and coordinate information;
a list maintenance unit for maintaining a person position sequence list according to the person detection information;
a speaking determining unit for determining that only one standing person is speaking according to lip feature information in the person position sequence table;
an information acquisition unit configured to set, as the bounding box information of the close-up view, the acquired bounding box information corresponding to the close-up picture of the speaking standing person, and set the state information of the close-up view to be valid;
the speaking determining unit is specifically configured to:
sequentially extracting K frames of lip feature information from the character position sequence information corresponding to each standing character, in a mode of starting from the current frame and spacing N frames apart, and sending the lip feature information corresponding to the extracted K frames of video images into a speaking classifier to obtain a real-time score of each standing character;
calculating the average value of the real-time scores calculated for the previous M times of each standing character to obtain the speaking score of the video image of the current frame, and judging that the standing character is speaking if the speaking score is greater than or equal to a preset threshold value;
counting the number of standing persons who are speaking; if that number equals 1, determining that only one standing person is speaking.
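The patent provides no code; the following is a minimal Python sketch of the speaking-determination steps above (sample K lip-feature frames every N frames, score them with a classifier, average the last M real-time scores, compare against a threshold). All names (`SpeakingDetector`, `classifier`, `only_one_speaking`) and the default parameter values are illustrative assumptions, not recited in the claims.

```python
from collections import deque

def sample_lip_features(lip_history, K, N):
    """Take up to K lip-feature samples, starting at the current (last)
    frame and stepping back N frames between samples."""
    idx = len(lip_history) - 1
    samples = []
    while len(samples) < K and idx >= 0:
        samples.append(lip_history[idx])
        idx -= N
    return list(reversed(samples))  # restore chronological order

class SpeakingDetector:
    def __init__(self, classifier, K=5, N=2, M=4, threshold=0.5):
        self.classifier = classifier   # maps lip features to a score in [0, 1]
        self.K, self.N, self.M, self.threshold = K, N, M, threshold
        self.scores = {}               # person id -> deque of last M real-time scores

    def update(self, person_id, lip_history):
        """Score the current frame for one person; return True if speaking."""
        feats = sample_lip_features(lip_history, self.K, self.N)
        score = self.classifier(feats)
        q = self.scores.setdefault(person_id, deque(maxlen=self.M))
        q.append(score)
        speaking_score = sum(q) / len(q)  # average of the last M scores
        return speaking_score >= self.threshold

def only_one_speaking(flags):
    """True iff exactly one standing person is judged to be speaking."""
    return sum(flags) == 1
```

In practice the classifier would be a trained lip-motion model; here any callable returning a score works, which keeps the view-switching logic testable in isolation.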
7. The automatic clipping and switching system of claim 6, wherein the view acquisition module is specifically configured to:
directly clip from the panoramic view the smallest view whose bounding box contains all standing persons, as the standing-person view; or
acquire an independent bounding box for each standing person, and perform region fusion on the independent bounding boxes of all standing persons to obtain the standing-person view.
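The region fusion in claim 7 amounts to taking the smallest axis-aligned box enclosing every per-person box. A minimal sketch, assuming boxes are `(x, y, w, h)` tuples (the representation is an assumption; the claims do not fix a box format):

```python
def fuse_boxes(boxes):
    """Fuse independent per-person boxes (x, y, w, h) into the smallest
    single box that contains them all; None if there are no boxes."""
    if not boxes:
        return None
    x1 = min(x for x, y, w, h in boxes)          # leftmost edge
    y1 = min(y for x, y, w, h in boxes)          # topmost edge
    x2 = max(x + w for x, y, w, h in boxes)      # rightmost edge
    y2 = max(y + h for x, y, w, h in boxes)      # bottommost edge
    return (x1, y1, x2 - x1, y2 - y1)
```

For example, fusing a box at the left edge of frame with one lower right of it yields a box spanning both, which is then clipped from the panorama as the standing-person view.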
8. A video player, the video player comprising:
one or more processors;
a system memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the automatic video clipping and switching method of any one of claims 1 to 5.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the automatic video clipping and switching method of any one of claims 1 to 5.
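The view-priority logic recited in the claims (close-up view when exactly one standing person is speaking, otherwise the standing-person view when valid, otherwise the panorama) can be sketched as follows. Frames are represented here as plain lists of pixel rows and views as `(bbox, valid)` pairs; both representations are assumptions for illustration.

```python
def crop(frame, bbox):
    """Clip a frame (list of pixel rows) to bbox = (x, y, w, h)."""
    x, y, w, h = bbox
    return [row[x:x + w] for row in frame[y:y + h]]

def select_view(frame, close_up, standing):
    """close_up / standing are (bbox, valid) pairs.
    Priority order per the claims: close-up > standing-person view > panorama."""
    for bbox, valid in (close_up, standing):
        if valid and bbox is not None:
            return crop(frame, bbox)
    return frame  # fall back to the panoramic view
```

The switching decision is thus a pure function of the two views' state and bounding-box information, which the view acquisition module updates every frame.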
CN202111576101.XA 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium Active CN114257757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111576101.XA CN114257757B (en) 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111576101.XA CN114257757B (en) 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium

Publications (2)

Publication Number Publication Date
CN114257757A CN114257757A (en) 2022-03-29
CN114257757B true CN114257757B (en) 2023-07-28

Family

ID=80796537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111576101.XA Active CN114257757B (en) 2021-12-21 2021-12-21 Automatic video clipping and switching method and system, video player and storage medium

Country Status (1)

Country Link
CN (1) CN114257757B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016978B (en) * 2023-01-05 2024-05-24 香港中文大学(深圳) Picture guiding and broadcasting method and device for online class, electronic equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN111785279A (en) * 2020-05-18 2020-10-16 北京奇艺世纪科技有限公司 Video speaker identification method and device, computer equipment and storage medium
CN113177531A (en) * 2021-05-27 2021-07-27 广州广电运通智能科技有限公司 Speaking identification method, system, equipment and medium based on video analysis

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN103905734A (en) * 2014-04-17 2014-07-02 苏州科达科技股份有限公司 Method and device for intelligent tracking and photographing
CN106791485B (en) * 2016-11-16 2020-02-07 深圳市异度信息产业有限公司 Video switching method and device
CN112019759A (en) * 2020-10-26 2020-12-01 深圳点猫科技有限公司 Method, device and equipment for tracking and shooting students in recorded and broadcast classes
CN113055604A (en) * 2021-04-01 2021-06-29 四川新视创伟超高清科技有限公司 Optimal visual angle video processing system and method based on 8K video signal and AI technology

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN111785279A (en) * 2020-05-18 2020-10-16 北京奇艺世纪科技有限公司 Video speaker identification method and device, computer equipment and storage medium
CN113177531A (en) * 2021-05-27 2021-07-27 广州广电运通智能科技有限公司 Speaking identification method, system, equipment and medium based on video analysis

Also Published As

Publication number Publication date
CN114257757A (en) 2022-03-29

Similar Documents

Publication Publication Date Title
US10706892B2 (en) Method and apparatus for finding and using video portions that are relevant to adjacent still images
CN107909022B (en) Video processing method and device, terminal equipment and storage medium
TWI253860B (en) Method for generating a slide show of an image
KR20220058857A (en) Learning situation analysis method and apparatus, electronic device and storage medium, computer program
US11924580B2 (en) Generating real-time director's cuts of live-streamed events using roles
JP2001273505A (en) Visual language classification system
JP7209851B2 (en) Image deformation control method, device and hardware device
CN113052085A (en) Video clipping method, video clipping device, electronic equipment and storage medium
JP2014139681A (en) Method and device for adaptive video presentation
CN111083397A (en) Recorded broadcast picture switching method, system, readable storage medium and equipment
JP2010503006A5 (en)
US10250803B2 (en) Video generating system and method thereof
KR20220148915A (en) Audio processing methods, apparatus, readable media and electronic devices
CN114257757B (en) Automatic video clipping and switching method and system, video player and storage medium
CN112954450A (en) Video processing method and device, electronic equipment and storage medium
US10224073B2 (en) Auto-directing media construction
CN114245032B (en) Automatic switching method and system for video framing, video player and storage medium
JP4110323B2 (en) Information output method and apparatus, program, and computer-readable storage medium storing information output program
CN108268847B (en) Method and system for analyzing movie montage language
JP2018041049A (en) Learning support system, learning support program and learning support method
CN111161592B (en) Classroom supervision method and supervising terminal
CN114222065A (en) Image processing method, image processing apparatus, electronic device, storage medium, and program product
WO2021217897A1 (en) Positioning method, terminal device and conference system
Ronzhin et al. A software system for the audiovisual monitoring of an intelligent meeting room in support of scientific and education activities
Choudhary et al. Real time video summarization on mobile platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant