CN113052085A - Video clipping method, video clipping device, electronic equipment and storage medium


Info

Publication number
CN113052085A
Authority
CN
China
Prior art keywords: time, teacher, video, teaching, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110328319.7A
Other languages
Chinese (zh)
Inventor
赵飞
吴伯川
贾兆柱
王麒铭
栾鹏龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Oriental Education Technology Group Co ltd
Original Assignee
New Oriental Education Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New Oriental Education Technology Group Co ltd filed Critical New Oriental Education Technology Group Co ltd
Priority to CN202110328319.7A
Publication of CN113052085A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/70 - Multimodal biometrics, e.g. combining information from different biometric modalities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure relates to a video clipping method and device, an electronic device, and a storage medium, belonging to the field of computer information. The method includes: acquiring a teaching video obtained by shooting a classroom scene in a classroom; determining, according to image information and/or audio information in the teaching video, the start time and end time of a video clip irrelevant to the teaching content; and deleting the video clip from the teaching video according to the start time and the end time. Because teaching-irrelevant clips are deleted automatically based on the image and/or audio information, the method reduces the time students waste when watching make-up lesson videos, reduces the waste of network and storage resources, and, since no manual processing is needed, improves processing speed and lowers labor cost.

Description

Video clipping method, video clipping device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer information, and in particular, to a video clipping method and apparatus, an electronic device, and a storage medium.
Background
In an online class scenario, students may need to watch a recorded make-up lesson video because they missed a class or want to review. An unprocessed make-up lesson video can contain a large amount of content with low relevance to the teaching content; this redundant content wastes the students' viewing time and also greatly wastes network and storage resources. In the related art, even when the make-up lesson video is processed, it must be edited manually, which is not only inefficient but also incurs high labor cost.
Disclosure of Invention
To solve the problems in the related art, the present disclosure provides a video clipping method, apparatus, electronic device, and storage medium.
To achieve the above object, a first aspect of the present disclosure provides a video clipping method, the method including:
acquiring a teaching video, wherein the teaching video is obtained by shooting a classroom scene in a classroom;
determining the starting time and the ending time of a video clip irrelevant to teaching contents in the teaching video according to image information and/or audio information in the teaching video;
and deleting the video clip from the teaching video according to the starting time and the ending time.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and audio information in the teaching video includes:
determining whether the teacher in the image information is in a roll-call posture;
when it is determined that the teacher is in the roll-call posture, transcribing the audio information corresponding to the roll-call posture into text through a speech recognition transcription technology, and comparing the text with a student list to obtain a first comparison result;
when the first comparison result indicates that the text matches a student name in the student list, taking the moment when the teacher in the image information is in the roll-call posture as the start time of a roll-call stage; and,
according to the first comparison result, taking the moment when the transcribed text last matches a student name in the student list as the end time of the roll-call stage.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information in the teaching video includes:
when it is determined that the teacher is in the roll-call posture, inputting the images corresponding to the roll-call posture into a pre-trained lip-reading model to obtain text corresponding to the teacher's lip movements, and comparing that text with a student list to obtain a lip-reading comparison result;
when the lip-reading comparison result indicates that the text corresponding to the teacher's lip movements matches a student name in the student list, taking the moment when the teacher in the image information is in the roll-call posture as the start time of a roll-call stage; and,
according to the lip-reading comparison result, taking the moment when the text corresponding to the teacher's lip movements last matches a student name in the student list as the end time of the roll-call stage.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and audio information in the teaching video includes:
when it is determined that the teacher is in the roll-call posture, inputting the images corresponding to the roll-call posture into a pre-trained lip-reading model to obtain lip features, and processing the audio information corresponding to the roll-call posture through a speech recognition transcription technology to obtain speech features;
obtaining, in chronological order, the temporal feature sequences of the lip features and of the speech features;
fusing the temporal feature sequence of the lip features with the temporal feature sequence of the speech features, and then transcribing the fused sequence to obtain text corresponding to the teacher's speech;
and comparing the text corresponding to the teacher's speech with a student list, and determining the start time and end time of the roll-call stage according to the comparison result.
Optionally, determining whether the teacher in the image information is in the roll-call posture includes:
determining, according to a skeleton point detection technology, whether the teacher in the image information is holding a mobile phone or a paper list;
determining, according to a gaze detection technology, whether the teacher's gaze in the image information switches multiple times between the mobile phone or paper list and other positions;
and when the teacher is holding the mobile phone or paper list and the teacher's gaze switches multiple times between the mobile phone or paper list and other positions, determining that the teacher in the image information is in the roll-call posture.
Optionally, determining according to the gaze detection technology whether the teacher's gaze in the image information switches multiple times between the mobile phone or paper list and other positions includes:
obtaining, according to the gaze detection technology, the teacher's gaze angles in a plurality of sampling frames of the image information;
for every two adjacent sampling frames among the plurality of sampling frames, calculating the cosine distance between the teacher's gaze angles in the two frames; if the cosine distance is smaller than a gaze concentration threshold, determining that the teacher is looking at the same position in the two frames, and if it is larger than the threshold, determining that the teacher is looking at different positions;
when the teacher is looking at the same position in two adjacent sampling frames, recording the teacher's gaze angle in the two frames;
determining, from the distribution of the recorded gaze angles, the angle range in which the teacher is looking at the mobile phone or paper list;
generating a feature vector from the teacher's gaze angle in each sampling frame and the angle range, where each feature value in the vector indicates whether the gaze angle in one sampling frame falls within the angle range;
and inputting the feature vector into a pre-trained binary classification model to obtain a result indicating whether the teacher's gaze switches multiple times between the mobile phone or paper list and other positions.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and audio information in the teaching video includes:
determining a suspected start moment of a pre-class preparation stage according to the image information in the teaching video;
determining, from the audio information after the suspected start moment, whether nobody speaks within a first preset duration after that moment;
when it is determined that nobody speaks within the first preset duration after the suspected start moment, taking the suspected start moment as the start time of the pre-class preparation stage; and,
taking the moment at which the teacher is detected, based on the audio information after the start time, to begin speaking as the end time of the pre-class preparation stage.
Optionally, the image information in the teaching video includes a teaching projection image, and determining the suspected start moment of the pre-class preparation stage according to the image information in the teaching video includes:
when it is determined that the teaching projection image in the image information keeps the same picture for more than a second preset duration, taking the moment when the projection image begins to hold that picture as a first suspected start moment; or,
when it is determined that the teacher stands at the same position in the image information for more than a third preset duration, taking the moment when the teacher begins standing at that position as a second suspected start moment; or,
when it is determined that the students' state in the image information matches a preset pre-class preparation state, taking the moment at which the match is first determined as a third suspected start moment.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and/or audio information in the teaching video includes:
starting break detection within a preset time period, where the start of the preset time period is a first interval duration from the start of the teaching video, and the end of the preset time period is a second interval duration from the end of the teaching video;
where the break detection includes determining the start time of the break period according to the detection result of at least one of the following detection modes:
detecting, according to the image information, the moment when students enter or leave the classroom; detecting the moment when the projection image in the image information switches to a projection image irrelevant to the teaching content; and detecting, according to the audio information, the moment when the noise in the classroom exceeds a preset threshold;
and the break detection further includes determining the end time of the break period according to the detection result of at least one of the following detection modes: detecting the moment when the projection image in the image information switches to a teaching-related projection image; and detecting, according to the audio information, the moment when the noise in the classroom falls below the preset threshold.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and audio information in the teaching video includes:
when it is determined from the image information that the postures or positions of more than a preset number of students in the classroom have changed, performing sound classification on the audio information corresponding to those changes;
when the sound classification result indicates that multiple people are speaking, separating the teacher's speech from that audio information and transcribing it into text through a speech recognition transcription technology;
matching the text against a preset dictionary;
if the match succeeds, determining the moment when the postures or positions of more than the preset number of students changed as the start time of an order-maintenance stage; and,
determining the moment when the sound classification result indicates that the multi-person speaking has ended as the end time of the order-maintenance stage; or determining, according to the image information, the moment when the postures or positions of fewer than the preset number of students are changing as the end time of the order-maintenance stage.
Optionally, the image information further includes a projection image, and determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and audio information in the teaching video includes:
when it is determined that a projector in the classroom is playing a video, comparing the projection images in the image information and/or the corresponding audio information during playback with a pre-established white list to obtain a second comparison result;
and if the second comparison result indicates no match with the white list, taking the moment when the video playback starts as the start time of playing a teaching-irrelevant video, and the moment when the playback ends as the end time of playing the teaching-irrelevant video.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the image information and audio information in the teaching video includes:
obtaining the movement trajectories of the students in the classroom according to the image information, and determining from the audio information whether music is being played in the classroom and whether the teacher's speech matches a pre-calibrated game dictionary;
when the movement trajectories match a preset student movement trajectory that characterizes a game interaction scene, if music is being played in the classroom or the teacher's speech matches the game dictionary, determining that the current video clip is a game interaction stage;
determining the start time of the game interaction stage according to at least one of the moment the students begin to move, the moment music begins to play in the classroom, and the moment the teacher's speech first matches the game dictionary;
and determining, according to the audio information, the end time of the game interaction stage based on the moment the music stops playing in the classroom or the moment speech in the classroom is last detected to match the game dictionary; or determining, according to the movement trajectories, the moment the students return to their seats as the end time of the game interaction stage.
Optionally, determining the start time and end time of a video clip irrelevant to the teaching content according to the audio information in the teaching video includes:
transcribing the teacher's speech in the audio information into text through a speech recognition transcription technology;
matching the text against a preset sensitive-word dictionary;
and if a word matching the preset sensitive-word dictionary is found, taking the start time of the sentence in which the word appears as the start time of the non-standard teaching behavior, and the end time of that sentence as the end time of the non-standard teaching behavior.
Optionally, deleting the video clip from the teaching video according to the start time and the end time includes:
generating a thumbnail of the video clip, displaying it to a user, and asking the user to confirm whether to delete the clip;
and upon receiving a deletion confirmation instruction from the user, deleting the video clip from the teaching video according to the start time and the end time.
A second aspect of the present disclosure provides a video clipping device, the device comprising:
the acquisition module is used for acquiring a teaching video, wherein the teaching video is obtained by shooting a classroom scene in a classroom;
the determining module is used for determining the starting time and the ending time of a video clip irrelevant to teaching contents in the teaching video according to image information and/or audio information in the teaching video;
and the deleting module is used for deleting the video clip from the teaching video according to the starting time and the ending time.
A third aspect of the present disclosure provides an electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of the first aspect of the present disclosure.
A fourth aspect of the disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspects of the disclosure.
Through the above technical solutions, video clips irrelevant to teaching are automatically deleted from the teaching video based on its image and/or audio information. On the one hand this reduces the time students waste when watching make-up lesson videos, and on the other hand it reduces the waste of network and storage resources; in addition, since no manual processing is needed, processing speed is improved and labor cost is lowered.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram illustrating a method of video clipping in accordance with an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating an implementation scenario of a video clipping method according to an exemplary embodiment.
FIG. 3 is a block diagram illustrating a video clipping device according to an example embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
FIG. 1 is a flow diagram illustrating a video clipping method according to an exemplary embodiment; the method may be executed by an edge computing box. As shown in FIG. 1, the method includes:
s101, obtaining a teaching video, wherein the teaching video is obtained by shooting a classroom scene in a classroom.
S102, determining the starting time and the ending time of a video clip irrelevant to teaching contents in the teaching video according to image information and/or audio information in the teaching video.
S103, deleting the video clip from the teaching video according to the starting time and the ending time.
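The three steps can be pictured as a small pipeline. The sketch below is illustrative only: the per-stage detector callables and the span representation are assumptions of this sketch, not structures prescribed by the patent.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_time, end_time) in seconds

def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans so each second is deleted at most once."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def spans_to_delete(video, detectors: List[Callable]) -> List[Span]:
    """S101-S103 in outline: run every stage detector (roll call, break,
    game interaction, ...) over the acquired video's image/audio information
    and return the merged list of spans to cut from the teaching video."""
    spans: List[Span] = []
    for detect in detectors:
        spans.extend(detect(video))  # each detector yields (start, end) spans
    return merge_spans(spans)
```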
FIG. 2 is a schematic diagram of an implementation scenario of the video clipping method according to an exemplary embodiment. As shown in FIG. 2, two monitoring cameras may be configured in the classroom: a first camera 21 points toward the blackboard and can capture images of the teacher, the blackboard, and the projection, while a second camera 22 may be disposed above the blackboard, pointing toward the students to capture images in their direction. A projector 23 and a microphone 24 may also be provided; the microphone may be arranged near the podium to pick up the teacher's voice. The edge computing box 20 that executes the method acquires the teaching video, including image information and audio information, through the two monitoring cameras 21 and 22, the projector 23, and the microphone 24, where the image information may include three video streams: the streams captured by the two monitoring cameras and the stream from the projector. After the edge computing box 20 performs the video clipping method, the processed teaching video can be uploaded to a target server for the students to review. For the make-up lesson video the students watch, the image information may be the video stream captured by the first camera 21, processed according to the embodiments of the present disclosure.
It should be understood by those skilled in the art that artificial intelligence algorithms place high demands on image quality, and therefore on the focal length, field of view, angle, and height of the camera. If a conventional image algorithm is applied to classroom images captured by a wide-angle camera, image distortion may introduce errors into the start and end times determined in step S102, so that content related to the teaching content is deleted by mistake.
Therefore, when shooting the classroom scene, it is preferable to use a high-definition camera with a field of view of about 50 degrees and to constrain the camera's height, position, and angle; for example, as shown in FIG. 2, the second camera 22 that captures the students needs to be disposed above the blackboard, 2 meters from the ground and on the center line of the blackboard. Alternatively, in step S102, an image algorithm adapted to wide-angle cameras may be adopted when determining the start and end times of teaching-irrelevant video clips from the image information; this relaxes the requirements on the position, definition, and even number of cameras in the implementation scenario, but raises the requirements on the image algorithm and, in turn, on the computing power of the edge computing box.
In the embodiments of the present disclosure, video clips irrelevant to teaching are automatically deleted from the teaching video based on its image and/or audio information, which reduces the time students waste when watching make-up lesson videos, reduces the waste of network and storage resources, and, since no manual processing is needed, improves processing speed and lowers labor cost.
In the embodiments of the present disclosure, the deleted segments may include, for example, the segment in which the teacher calls the roll, the pre-class preparation segment, the break segment, the segment in which a video unrelated to the teaching content is played, the game interaction segment, the order-maintenance segment, and segments with inappropriate teaching content.
For example, for a video clip in which the teacher calls the roll, in some optional embodiments, determining the start time and end time of a teaching-irrelevant video clip according to the image information and audio information in the teaching video includes:
determining whether the teacher in the image information is in a roll-call posture;
when it is determined that the teacher is in the roll-call posture, transcribing the audio information corresponding to the roll-call posture into text through a speech recognition transcription technology, and comparing the text with a student list to obtain a first comparison result;
when the first comparison result indicates that the text matches a student name in the student list, taking the moment when the teacher in the image information is in the roll-call posture as the start time of a roll-call stage; and,
according to the first comparison result, taking the moment when the transcribed text last matches a student name in the student list as the end time of the roll-call stage.
The teacher in the image information may be captured by the first camera 21 facing the blackboard as shown in FIG. 2, and the student list comes from the course information that the edge computing box 20 pulls from the educational administration system or the attendance system. For example, when the edge computing box 20 detects that the current time is the class time of the target course, it requests the educational administration system or the attendance system to send the course information of the target course, the course information including the student list of the target course; or it requests the attendance system to send the student list of the target course according to the course information. The embodiments of the present disclosure do not limit how the student list is obtained.
When calling the roll, the teacher usually keeps a fixed posture, for example holding a mobile phone and looking at it, while pronouncing each student's name one by one.
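A minimal sketch of this roll-call boundary rule follows. The posture detector, the speech transcriber, and the sample format are hypothetical stand-ins for the models described above; only the start/end logic comes from the text.

```python
from typing import Callable, List, Optional, Tuple

def rollcall_span(samples: List[Tuple[float, object, object]],
                  roster: List[str],
                  is_rollcall_posture: Callable[[object], bool],
                  transcribe: Callable[[object], str]) -> Optional[Tuple[float, float]]:
    """samples: (timestamp, frame, audio_chunk) triples in time order.
    Start = first moment a transcribed name matches the roster while the
    teacher is in the roll-call posture; end = last such match."""
    start = end = None
    for t, frame, audio in samples:
        if not is_rollcall_posture(frame):
            continue
        text = transcribe(audio)
        # Compare the transcription against the class roster.
        if any(name in text for name in roster):
            if start is None:
                start = t  # first matched name
            end = t        # latest matched name so far
    return (start, end) if start is not None else None
```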
In still other optional embodiments, a lip-reading technology may also be adopted: when it is determined that the teacher is in the roll-call posture, the images corresponding to the roll-call posture are input into a pre-trained lip-reading model to obtain text corresponding to the teacher's lip movements, and that text is compared with the student list to obtain a lip-reading comparison result;
when the lip-reading comparison result indicates that the text corresponding to the teacher's lip movements matches a student name in the student list, the moment when the teacher in the image information is in the roll-call posture is taken as the start time of the roll-call stage; and,
according to the lip-reading comparison result, the moment when the text corresponding to the teacher's lip movements last matches a student name in the student list is taken as the end time of the roll-call stage.
In the related art, however, lip reading is a near-field task: the recognized face needs to occupy the main area of the camera frame. In the implementation scenario of the present disclosure, the camera that captures the teacher is far from the teacher's face, and the focal length of the first camera 21 cannot simply be increased to obtain a larger face image, because the students must still be able to watch the resulting teaching video normally. To solve the problem that the teacher's face is small in the captured image, super-resolution can be used.
Thus, in one possible embodiment, the image information may be input into a pre-trained target super-resolution model to obtain a target teacher face image;
and lip reading is performed on the target teacher face image.
Most existing super-resolution methods obtain a low-quality image from a high-definition image by down-sampling, and then train a GAN (Generative Adversarial Network) model with the low-quality image as input and the high-definition image as the label.
Therefore, in one possible implementation, the target super-resolution model can be trained by installing 2 cameras for collecting images of the teacher, the 2 cameras having identical intrinsic parameters, installation height, and depression angle. One of them may be the first camera 21, which shoots the teaching image at a normal focal length, the teaching image containing a first face image of the teacher; the other camera zooms in on the teacher's face optically to shoot a second, larger face image of the teacher. The target super-resolution model is then obtained by training with the first face image as input and the second face image as output.
Further, the input and output of the training data need to be aligned at the pixel level when the target super-resolution model is trained. Therefore, binocular ranging of the teacher can be performed based on the distance between the 2 cameras, their installation height, and their depression angle; after the point clouds in three-dimensional space are aligned, they are converted into coordinate points in two-dimensional space for alignment, so that the first face image and the second face image are aligned at the pixel level, and the aligned first and second face images are used as input and output respectively for training.
Furthermore, because the degree of image-quality distortion of the first and second face images is affected by the cameras' intrinsic parameters, a first convolution kernel composed of the 2 cameras' 2x2 intrinsic distortion parameters and a second convolution kernel composed of their 3x3 intrinsic distortion parameters can be extracted, and the first and second convolution kernels applied to the three channels of the first and second face images respectively, yielding six new channels for each image; the six new channels plus the three channels of the first face image are used as input, and the six new channels plus the three channels of the second face image as output, to train the target super-resolution model, thereby reducing the influence of the cameras' image-quality distortion on the training.
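As an illustration of the channel construction just described, the sketch below applies two kernels to each of an image's three channels, producing six extra channels. Treating the cameras' intrinsic distortion parameters as ready-made 2x2 and 3x3 kernels is an assumption of this sketch.

```python
import numpy as np
from scipy.ndimage import convolve

def add_distortion_channels(img: np.ndarray, k2: np.ndarray, k3: np.ndarray) -> np.ndarray:
    """img: HxWx3 float array; k2/k3: 2x2 and 3x3 kernels assumed to be
    built from the cameras' intrinsic distortion parameters.
    Returns HxWx9: the 3 original channels plus 2 kernels x 3 channels = 6 new ones."""
    extra = [convolve(img[..., c], k, mode="nearest")
             for k in (k2, k3) for c in range(3)]
    return np.concatenate([img] + [e[..., None] for e in extra], axis=-1)

# The 9-channel versions of the first (input) and second (label) face images
# would then be fed to the super-resolution model during training.
```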
While the face image required for lip reading in the related art is a frontal face image, in the implementation scenario of the present disclosure the collected teacher image is usually captured from above at a downward angle. To obtain a frontal face image of the teacher, the target teacher face image may be input into a perspective correction model to obtain a frontal-view target teacher face image.
Further, the input of the perspective correction model comprises a three-channel image and the two-dimensional spatial coordinates of each pixel in the image, and the output comprises a three-channel image and the three-dimensional spatial coordinates of each pixel, so that the model predicts the three-dimensional coordinates corresponding to each pixel's two-dimensional coordinates and uses them to assist the correction of the teacher's head image.
In some optional embodiments, the images corresponding to the roll-call posture may be input into a pre-trained lip-reading model to obtain lip features, and the audio information corresponding to the roll-call posture processed through a speech recognition transcription technology to obtain speech features; the temporal feature sequences of the lip features and the speech features are obtained in chronological order;
and the temporal feature sequence of the lip features is fused with the temporal feature sequence of the speech features before transcription, to obtain the text corresponding to the teacher's speech with improved transcription accuracy.
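A sketch of the fusion step, assuming both feature sequences have been resampled onto a common timeline; the patent does not fix the fusion operator, so per-timestep concatenation is used here as one simple choice.

```python
import numpy as np

def fuse_features(lip_seq: np.ndarray, speech_seq: np.ndarray) -> np.ndarray:
    """lip_seq: (T, D_lip); speech_seq: (T, D_speech), aligned in time.
    Returns (T, D_lip + D_speech): the fused sequence that the transcription
    head would decode into the teacher's spoken text."""
    T = min(len(lip_seq), len(speech_seq))  # guard against length drift
    return np.concatenate([lip_seq[:T], speech_seq[:T]], axis=-1)
```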
In some optional embodiments, determining whether the teacher in the image information is in the roll-call posture includes:
determining, according to a skeleton point detection technology, whether the teacher in the image information is holding a mobile phone or a paper list;
determining, according to a gaze detection technology, whether the teacher's gaze in the image information switches multiple times between the mobile phone or paper list and other positions;
and when the teacher is holding the mobile phone or paper list and the teacher's gaze switches multiple times between the mobile phone or paper list and other positions, determining that the teacher in the image information is in the roll-call posture.
The skeleton point detection technology mainly detects key points of the human body, such as the joints and facial features, and describes the body's skeleton information through these key points; whether the teacher in the image information is holding a mobile phone or a paper list can be determined by inputting the skeleton information into a pre-trained neural network model and reading the model's output.
Optionally, determining whether the teacher in the image information is holding a mobile phone or a paper list may further include determining whether the teacher is looking at a computer display or at an object on the desk. Further, the object the teacher holds or looks at may be determined through object detection techniques known in the art.
With this scheme, whether the teacher is in the roll-call posture can be determined more accurately through the skeleton point detection technology and the gaze detection technology.
Further, determining according to the gaze detection technology whether the teacher's gaze in the image information switches multiple times between the mobile phone or paper list and other positions includes:
obtaining, according to the gaze detection technology, the teacher's gaze angles in a plurality of sampling frames of the image information;
for every two adjacent sampling frames among the plurality of sampling frames, calculating the cosine distance between the teacher's gaze angles in the two frames; if the cosine distance is smaller than a gaze concentration threshold, determining that the teacher is looking at the same position in the two frames, and if it is larger than the threshold, determining that the teacher is looking at different positions;
when the teacher is looking at the same position in two adjacent sampling frames, recording the teacher's gaze angle in the two frames;
determining, from the distribution of the recorded gaze angles, the angle range in which the teacher is looking at the mobile phone or paper list;
generating a feature vector from the teacher's gaze angle in each sampling frame and the angle range, where each feature value in the vector indicates whether the gaze angle in one sampling frame falls within the angle range;
and inputting the feature vector into a pre-trained binary classification model to obtain a result indicating whether the teacher's gaze switches multiple times between the mobile phone or paper list and other positions.
The cosine distance reflects the relative difference in the teacher's gaze direction. The relative difference between the gaze angles in adjacent sampling frames is computed and compared against a gaze concentration threshold; when the difference is large, the positions the teacher looks at during the two frames are confirmed to be different. Because the position of the student list is relatively fixed, the distribution of the recorded gaze angles can be counted, and an angle range whose gaze-angle density exceeds a certain threshold can be regarded as the range in which the teacher is looking at the paper list. Further, a sampling frame whose gaze angle lies within that range, i.e., the teacher looking at the student list, may be recorded as 0, and a frame in which the teacher is not looking at the list as 1. Every 20 sampling frames can then be collected into a 20-dimensional feature vector of 0/1 values, which is input into a pre-trained binary classification model to determine whether the teacher switches multiple times between the mobile phone or paper list and other positions.
With this scheme, the teacher's gaze angles in the sampling frames are detected through the gaze detection technology; the angle range corresponding to looking at the student list is determined by computing the cosine distances between the gaze angles of adjacent sampling frames; the frames looking and not looking at the list are then counted, and the pre-trained binary classification model determines whether the teacher switched multiple times between the mobile phone or paper list and other positions in the video period corresponding to the sampling frames.
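The sketch below reproduces the numeric pipeline just described: cosine distances between adjacent gaze directions, a dense angle range standing in for the roster position, and 20-frame 0/1 feature vectors for the binary classifier. Representing gaze as unit 2-D direction vectors and the concrete threshold values are assumptions of this sketch.

```python
import numpy as np
from typing import List, Tuple

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def roster_angle_range(gaze_dirs: List[np.ndarray],
                       focus_threshold: float = 0.05) -> Tuple[float, float]:
    """Record gaze angles of adjacent frames that look at the same spot and
    take the densest region of their distribution as the roster's range."""
    angles = []
    for a, b in zip(gaze_dirs, gaze_dirs[1:]):
        if cosine_distance(a, b) < focus_threshold:  # same position
            angles.append(np.degrees(np.arctan2(a[1], a[0])))
    hist, edges = np.histogram(angles, bins=36)
    peak = int(np.argmax(hist))                      # densest angle bin
    return float(edges[peak]), float(edges[peak + 1])

def gaze_feature_vectors(gaze_dirs: List[np.ndarray],
                         window: int = 20) -> List[List[int]]:
    """0 = gaze inside the roster range, 1 = elsewhere; every `window`
    sampling frames form one feature vector for the binary classifier."""
    lo, hi = roster_angle_range(gaze_dirs)
    feats = [0 if lo <= np.degrees(np.arctan2(d[1], d[0])) <= hi else 1
             for d in gaze_dirs]
    return [feats[i:i + window] for i in range(0, len(feats) - window + 1, window)]
```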
For the pre-class preparation video clip, in some optional embodiments, determining the start time and end time of a teaching-irrelevant video clip according to the image information and audio information in the teaching video includes:
determining a suspected start moment of the pre-class preparation stage according to the image information in the teaching video;
determining, from the audio information after the suspected start moment, whether nobody speaks within a first preset duration after that moment;
when it is determined that nobody speaks within the first preset duration after the suspected start moment, taking the suspected start moment as the start time of the pre-class preparation stage; and,
taking the moment at which the teacher is detected, based on the audio information after the start time, to begin speaking as the end time of the pre-class preparation stage.
It should be noted that the audio information may be picked up by the microphone 24 arranged near the podium as shown in FIG. 2. Since the microphone 24 is at some distance from the students and, in order to better pick up the teacher's voice, low-volume sounds are automatically removed from the audio information as noise, it may still be determined that nobody is speaking even when a few students chat briefly in low voices.
In addition, the suspected start moment of the pre-class preparation stage may be restricted to a certain period after the class begins; for example, once the edge computing box 20 detects that the current time is more than half an hour after the target course began, the step of determining the suspected start moment of the pre-class preparation stage from the image information in the teaching video no longer needs to be performed.
With this scheme, through image processing and audio processing technologies, videos of preparation stages irrelevant to teaching, such as the teacher organizing a lesson plan or debugging slides (PPT) before class, can be deleted.
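A minimal sketch of the start/end rule for the pre-class preparation stage. Speech detection is abstracted into a sorted list of timestamped events, and the 60-second value for the first preset duration is an assumption.

```python
from typing import List, Optional, Tuple

def preclass_span(suspected_start: float,
                  speech_events: List[Tuple[float, bool]],
                  silence_window: float = 60.0) -> Optional[Tuple[float, Optional[float]]]:
    """speech_events: sorted (timestamp, is_teacher) pairs for detected speech.
    The suspected start is confirmed only if nobody speaks within
    `silence_window` seconds after it; the end of the stage is the first
    teacher speech after the confirmed start."""
    after = [(t, teacher) for t, teacher in speech_events if t > suspected_start]
    if after and after[0][0] <= suspected_start + silence_window:
        return None  # someone spoke too soon: not a preparation stage
    end = next((t for t, teacher in after if teacher), None)
    return (suspected_start, end)
```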
Optionally, the image information in the teaching video includes a teaching projection image, and determining the suspected start moment of the pre-class preparation stage according to the image information in the teaching video includes:
when it is determined that the teaching projection image in the image information keeps the same picture for more than a second preset duration, taking the moment when the projection image begins to hold that picture as a first suspected start moment; or,
when it is determined that the teacher stands at the same position in the image information for more than a third preset duration, taking the moment when the teacher begins standing at that position as a second suspected start moment; or,
when it is determined that the students' state in the image information matches a preset pre-class preparation state, taking the moment at which the match is first determined as a third suspected start moment.
As shown in FIG. 2, the projection image may be acquired by connecting the edge computing box 20 to the projector 23, the projector 23 taking its signal from the HDMI interface; alternatively, the projection image may be captured by a camera pointed at the picture projected by the projector 23. The students in the image information may be captured by the second camera 22 in FIG. 2. The students' state may be considered to match the preset pre-class preparation state when more than a preset number of students are in their seats doing exercises, turning pages, or whispering to each other.
With this scheme, when the slides are not being played or remain unchanged on one page for a long time, when the teacher stays at one position for a long time, or when the students are detected to be in the pre-class preparation state, the detected moment is taken as a suspected start moment. Further, when more than one of the three cases is detected, the earliest detected moment may be taken as the suspected start moment, as in the sketch below.
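The earliest-cue selection can be stated in a few lines; the argument names are illustrative.

```python
from typing import Optional

def earliest_suspected_start(t_slides_still: Optional[float],
                             t_teacher_still: Optional[float],
                             t_students_ready: Optional[float]) -> Optional[float]:
    """Each argument is the moment one cue was detected, or None if it was not;
    the earliest detected moment becomes the suspected start moment."""
    cues = [t for t in (t_slides_still, t_teacher_still, t_students_ready)
            if t is not None]
    return min(cues) if cues else None
```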
For the break video clip, in some optional embodiments, determining the start time and end time of a teaching-irrelevant video clip according to the image information and/or audio information in the teaching video includes:
starting break detection within a preset time period, where the start of the preset time period is a first interval duration from the start of the teaching video, and the end of the preset time period is a second interval duration from the end of the teaching video;
where the break detection includes determining the start time of the break period according to the detection result of at least one of the following detection modes:
detecting, according to the image information, the moment when students enter or leave the classroom; detecting the moment when the projection image in the image information switches to a projection image irrelevant to the teaching content; and detecting, according to the audio information, the moment when the noise in the classroom exceeds a preset threshold;
and the break detection further includes determining the end time of the break period according to the detection result of at least one of the following detection modes: detecting the moment when the projection image in the image information switches to a teaching-related projection image; and detecting, according to the audio information, the moment when the noise in the classroom falls below the preset threshold.
If a lesson lasts 40 minutes, the preset time period may be the 15th to the 25th minute after the lesson begins, that is, the first interval duration and the second interval duration are both 15 minutes. Whether the classroom noise is above or below the preset threshold may be determined by detecting the number of people in the audio information whose speech exceeds a preset decibel level, and judging the noise to be above the preset threshold when that number exceeds a corresponding preset count; alternatively, the speech of those speaking above the preset decibel level may be transcribed, the relevance of the transcription to the lesson content matched, and the noise judged to be above the preset threshold when the relevance is low.
With this scheme, the break period can be determined from the image information and/or audio information of the teaching video, so that the video clips corresponding to the break are deleted from the teaching video, reducing the time students waste when watching the make-up lesson video, reducing the waste of network and storage resources, improving processing speed, and lowering labor cost.
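One way to realize the head-count noise test described above is sketched below; the decibel and speaker-count thresholds are illustrative placeholders for the patent's preset values.

```python
from typing import List

def classroom_is_noisy(speaker_levels_db: List[float],
                       db_threshold: float = 65.0,
                       max_loud_speakers: int = 3) -> bool:
    """speaker_levels_db: estimated volume of each detected speaker in one
    audio chunk. The classroom counts as noisy when more people than
    `max_loud_speakers` speak above `db_threshold` dB."""
    return sum(level > db_threshold for level in speaker_levels_db) > max_loud_speakers
```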
In still other optional embodiments, for a video clip of classroom order maintenance, determining the start time and end time of a teaching-irrelevant video clip according to the image information and audio information in the teaching video includes:
when it is determined from the image information that the postures or positions of more than a preset number of students in the classroom have changed, performing sound classification on the audio information corresponding to those changes;
when the sound classification result indicates that multiple people are speaking, separating the teacher's speech from that audio information and transcribing it into text through a speech recognition transcription technology;
matching the text against a preset dictionary;
if the match succeeds, determining the moment when the postures or positions of more than the preset number of students changed as the start time of the order-maintenance stage; and,
determining the moment when the sound classification result indicates that the multi-person speaking has ended as the end time of the order-maintenance stage; or determining, according to the image information, the moment when the postures or positions of fewer than the preset number of students are changing as the end time of the order-maintenance stage.
The teacher's speech may be separated from the audio information by identifying the loudest voice in it; alternatively, the teacher's voiceprint may be stored in advance, and the sound in the audio information that matches the stored voiceprint separated out. The preset dictionary may, for example, include phrases such as "quiet" and "start class".
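A sketch of the start-time rule for the order-maintenance stage; the sound classifier, speaker-separation, and transcription models are passed in as hypothetical callables, and the two dictionary phrases mirror the examples above.

```python
from typing import Callable, List, Optional, Tuple

def order_maintenance_start(events: List[Tuple[float, int, object]],
                            preset_count: int,
                            classify: Callable[[object], str],
                            separate_teacher: Callable[[object], object],
                            transcribe: Callable[[object], str],
                            dictionary: Tuple[str, ...] = ("quiet", "start class")
                            ) -> Optional[float]:
    """events: (timestamp, number_of_students_whose_posture_changed, audio)."""
    for t, n_changed, audio in events:
        if n_changed <= preset_count:
            continue                                   # not enough movement
        if classify(audio) != "multiple_speakers":     # sound classification
            continue
        text = transcribe(separate_teacher(audio))     # teacher's words only
        if any(phrase in text for phrase in dictionary):
            return t                                   # stage start time
    return None
```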
With this scheme, the video clips in which the teacher maintains teaching order can be deleted from the teaching video, reducing the time students spend watching the make-up lesson video, reducing the waste of network and storage resources, improving processing speed, and lowering labor cost.
In some optional embodiments, for a video clip of playing a video unrelated to the class, the image information further includes a projection image, and determining the start time and end time of a teaching-irrelevant video clip according to the image information and audio information in the teaching video includes:
when it is determined that a projector in the classroom is playing a video, comparing the projection images in the image information and/or the corresponding audio information during playback with a pre-established white list to obtain a second comparison result;
and if the second comparison result indicates no match with the white list, taking the moment when the video playback starts as the start time of playing a teaching-irrelevant video, and the moment when the playback ends as the end time of playing the teaching-irrelevant video.
With this scheme, the projected images or the corresponding audio can be compared with the white list through image processing and/or audio processing technologies to determine whether the video the teacher is playing is related to the class; when it is not, for example when the teacher plays a cartoon, the corresponding clip is deleted from the teaching video, reducing the time students waste when watching the make-up lesson video, reducing the waste of network and storage resources, improving processing speed, and lowering labor cost.
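A sketch of the white-list check; representing the projected frames and/or audio as comparable fingerprints is an assumption of this sketch, since the patent only specifies a comparison against a pre-established white list.

```python
from typing import Optional, Set, Tuple

def offtopic_video_span(play_start: float, play_end: float,
                        fingerprint: str,
                        whitelist: Set[str]) -> Optional[Tuple[float, float]]:
    """fingerprint: a hash/feature of the projected images and/or audio
    captured while the projector plays a video. An unmatched playback is
    returned as a (start, end) span to delete."""
    if fingerprint in whitelist:
        return None                    # approved teaching video
    return (play_start, play_end)      # teaching-irrelevant playback
```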
For video clips of game interaction, in further optional embodiments, determining the start time and end time of a teaching-irrelevant video clip according to the image information and audio information in the teaching video includes:
obtaining the movement trajectories of the students in the classroom according to the image information, and determining from the audio information whether music is being played in the classroom and whether the teacher's speech matches a pre-calibrated game dictionary;
when the movement trajectories match a preset student movement trajectory that characterizes a game interaction scene, if music is being played in the classroom or the teacher's speech matches the game dictionary, determining that the current video clip is a game interaction stage;
determining the start time of the game interaction stage according to at least one of the moment the students begin to move, the moment music begins to play in the classroom, and the moment the teacher's speech first matches the game dictionary;
and determining, according to the audio information, the end time of the game interaction stage based on the moment the music stops playing in the classroom or the moment speech in the classroom is last detected to match the game dictionary; or determining, according to the movement trajectories, the moment the students return to their seats as the end time of the game interaction stage.
In lower-grade classroom scenes, classroom interaction games often occur. Although such games promote the on-site learning of the students in class, they are of little help to students reviewing or making up the class afterwards, and therefore need to be clipped out. With this scheme, when the students are determined from their trajectories to be playing a game, or when the audio information shows that music is being played in the classroom or that the teacher is directing or joining a game, the video clips corresponding to the game interaction are deleted from the teaching video, reducing the time students waste when watching the make-up lesson video, reducing the waste of network and storage resources, improving processing speed, and lowering labor cost.
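The start/end selection for the game interaction stage can be sketched as below; treating each cue as an optional timestamp and preferring the trajectory-based end cue are choices of this sketch, not requirements of the patent.

```python
from typing import Optional, Tuple

def game_interaction_span(move_start: Optional[float],
                          music_start: Optional[float],
                          dict_first_match: Optional[float],
                          music_stop: Optional[float],
                          dict_last_match: Optional[float],
                          students_back: Optional[float]
                          ) -> Optional[Tuple[float, float]]:
    """Each argument is the detection moment of one cue, or None.
    Start: earliest available start cue; end: students returning to their
    seats if detected, otherwise an audio-based end cue."""
    starts = [t for t in (move_start, music_start, dict_first_match) if t is not None]
    ends = [t for t in (students_back, music_stop, dict_last_match) if t is not None]
    if not starts or not ends:
        return None
    return min(starts), ends[0]
```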
In still other optional embodiments, for video clips containing inappropriate teaching content, determining the start time and the end time of a video clip unrelated to teaching content in the teaching video according to the audio information in the teaching video includes:
transcribing the teacher's speech in the audio information into text through a speech recognition transcription technology;
matching the text against a preset sensitive-word dictionary;
and if a word matching the sensitive-word dictionary is found, taking the start time of the sentence in which the word appears as the start time of the non-standard teaching behavior, and the end time of that sentence as the end time of the non-standard teaching behavior.
The sensitive-word dictionary can contain as many sensitive words as possible, such as expletives, and can be continuously expanded. By adopting this scheme, segments of non-standard teaching behavior by the teacher can be removed from the original teaching video through audio processing technology, which reduces the time students spend watching the make-up lesson video, reduces the waste of network and storage resources, improves the processing speed, and lowers the labor cost. A minimal sketch follows.
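A minimal sketch, assuming an upstream speech recognizer that yields sentence-level timestamps; the dictionary contents are placeholders.

    # Hypothetical, continuously expanded dictionary; contents are placeholders.
    SENSITIVE_WORDS = {"expletive1", "expletive2"}

    def nonstandard_spans(sentences):
        """sentences: iterable of (start_sec, end_sec, text) from the recognizer.
        Return the (start, end) spans of sentences containing a sensitive word."""
        return [(s, e) for s, e, text in sentences
                if any(w in text for w in SENSITIVE_WORDS)]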
Optionally, the deleting the video segment from the teaching video according to the start time and the end time includes:
generating a thumbnail of the video segment, displaying it to a user, and reminding the user to confirm whether to delete the video segment;
and deleting the video segment from the teaching video according to the start time and the end time upon receiving a deletion confirmation instruction from the user.
By adopting this scheme, intelligent processing and manual review can be combined, which avoids mistaken deletion of video segments that do contain classroom content during automatic processing. A minimal sketch of the final cut follows.
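The disclosure leaves the cutting mechanism itself open. A minimal sketch, assuming ffmpeg is available on the host: it keeps the material before start and after end and concatenates the two parts with stream copy.

    import subprocess

    def cut_out(src: str, start: float, end: float, dst: str) -> None:
        """Remove [start, end) seconds from src and write the result to dst.
        Stream copy is fast but cuts on keyframes; re-encode for frame accuracy."""
        subprocess.run(["ffmpeg", "-y", "-i", src, "-to", str(start),
                        "-c", "copy", "part1.mp4"], check=True)
        subprocess.run(["ffmpeg", "-y", "-ss", str(end), "-i", src,
                        "-c", "copy", "part2.mp4"], check=True)
        with open("parts.txt", "w") as f:
            f.write("file 'part1.mp4'\nfile 'part2.mp4'\n")
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", "parts.txt", "-c", "copy", dst], check=True)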
FIG. 3 is a block diagram illustrating a video clipping device 30 according to an exemplary embodiment. The device 30 may be an edge computing box, or a part of one, and includes:
the acquisition module 31 is configured to acquire a teaching video, where the teaching video is obtained by shooting a classroom scene in a classroom;
a determining module 32, configured to determine, according to image information and/or audio information in the teaching video, a start time and an end time of a video clip that is irrelevant to teaching content in the teaching video;
and a deleting module 33, configured to delete the video segment from the teaching video according to the start time and the end time.
Optionally, the determining module 32 further includes:
the first determining submodule is used for determining whether the teacher in the image information is in a roll-call posture;
the first comparison submodule is used for transcribing, when it is determined that the teacher is in the roll-call posture, the audio information corresponding to the teacher's roll-call posture into text through a speech recognition transcription technology, and comparing the text with a student list to obtain a first comparison result;
the second determining submodule is used for determining, when the first comparison result indicates that the text matches a student name in the student list, the moment when the teacher in the image information assumes the roll-call posture as the start time of a roll-call stage;
and the third determining submodule is used for taking, according to the first comparison result, the time when the transcribed text last matches a student name in the student list as the end time of the roll-call stage. A minimal sketch of this timing logic follows.
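A minimal sketch of the timing logic of these three submodules, assuming speech-recognition segments restricted to the period in which the roll-call posture holds; exact substring matching against the roster is a simplification.

    def rollcall_bounds(posture_start, asr_segments, roster):
        """posture_start: second at which the roll-call posture was first detected.
        asr_segments: (start_sec, end_sec, text) tuples while the posture holds.
        roster: iterable of student names. Returns (start, end) of the roll-call
        stage, or None when no roster name is ever matched."""
        hits = [(s, e) for s, e, text in asr_segments
                if any(name in text for name in roster)]
        if not hits:
            return None
        # The stage opens at the posture and closes at the last roster match.
        return posture_start, hits[-1][1]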
Optionally, the determining module 32 further includes:
the lip-language comparison submodule is used for inputting, when it is determined that the teacher is in the roll-call posture, the images corresponding to the teacher's roll-call posture into a pre-trained lip-language recognition model to obtain the text corresponding to the teacher's lip language, and comparing that text with a student list to obtain a lip-language comparison result;
a fourth determining submodule, configured to determine, when the lip-language comparison result indicates that the text corresponding to the teacher's lip language matches a student name in the student list, the moment when the teacher in the image information assumes the roll-call posture as the start time of a roll-call stage;
and the fifth determining submodule is used for taking, according to the lip-language comparison result, the time when the text corresponding to the teacher's lip language last matches a student name in the student list as the end time of the roll-call stage.
Optionally, the determining module 32 further includes:
the feature acquisition module is used for inputting, when it is determined that the teacher is in the roll-call posture, the images corresponding to the teacher's roll-call posture into a pre-trained lip-language recognition model to obtain lip-language features, and transcribing the audio information corresponding to the roll-call posture through a speech recognition transcription technology to obtain voice features;
the feature sorting module is used for arranging the lip-language features and the voice features into time feature sequences in chronological order;
the transcription submodule is used for fusing the time feature sequence of the lip-language features with that of the voice features, and then transcribing the fused sequence to obtain the text corresponding to the teacher's speech;
and the sixth determining submodule is used for comparing the text corresponding to the teacher's speech with a student list and determining the start time and the end time of the roll-call stage according to the comparison result. A minimal sketch of the fusion step follows.
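The disclosure does not specify the fusion operator. One simple realization concatenates the two time feature sequences per time step before decoding; the sketch below assumes both sequences were already aligned to a common frame rate upstream.

    import numpy as np

    def fuse_time_sequences(lip_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
        """lip_feats: shape (T1, D_lip); audio_feats: shape (T2, D_audio), both
        ordered in time. Truncate to the common length and concatenate per step;
        the fused (T, D_lip + D_audio) sequence is then passed to a decoder."""
        t = min(len(lip_feats), len(audio_feats))
        return np.concatenate([lip_feats[:t], audio_feats[:t]], axis=1)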
Optionally, the first determining submodule is specifically configured to:
determining, according to a skeleton-point detection technology, whether the teacher in the image information is holding a mobile phone or a paper list;
determining, according to a sight detection technology, whether the teacher's sight in the image information switches multiple times between the mobile phone or paper list and other positions;
and determining that the teacher in the image information is in the roll-call posture when the teacher is holding the mobile phone or paper list and the teacher's sight switches multiple times between the mobile phone or paper list and other positions.
Optionally, the first determining submodule is further configured to:
acquire a plurality of teacher sight angles from a plurality of sampling frames in the image information according to a sight detection technology;
calculate, for every two adjacent sampling frames among the plurality of sampling frames, the cosine distance between the teacher sight angles in the two frames; if the cosine distance is smaller than a sight concentration threshold, determine that the teacher looks at the same position in the two frames, and if the cosine distance is larger than the threshold, determine that the teacher looks at different positions;
record the teacher sight angles of the two frames when the teacher looks at the same position in the two adjacent sampling frames;
determine, from the distribution of the recorded sight angles, the angle range within which the teacher is looking at the mobile phone or paper list;
generate a feature vector from the teacher sight angle in each sampling frame and the angle range, wherein each feature value in the feature vector indicates whether the sight angle in one sampling frame falls within the angle range;
and input the feature vector into a pre-trained binary classification model to obtain an output indicating whether the teacher's sight switches multiple times between the mobile phone or paper list and other positions. A minimal sketch of this sight logic follows.
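A minimal sketch of this sight logic, treating each sight angle as a 3D direction vector; the concentration threshold, the pitch-based range test, and the downstream binary classifier are assumptions.

    import numpy as np

    def cosine_distance(a, b) -> float:
        a, b = np.asarray(a, float), np.asarray(b, float)
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def fixation_angles(gaze, focus_thresh=0.05):
        """gaze: sequence of per-frame sight direction vectors (assumed unit).
        Keep the directions of adjacent-frame pairs judged to fixate one spot;
        the phone/list angle range would be estimated from their distribution."""
        return [gaze[i] for i in range(len(gaze) - 1)
                if cosine_distance(gaze[i], gaze[i + 1]) < focus_thresh]

    def sight_feature_vector(gaze, angle_range):
        """One feature per sampled frame: 1.0 when that frame's downward pitch
        falls inside angle_range = (lo, hi) radians. The vector is what would
        be fed to the pre-trained binary classification model (not shown)."""
        lo, hi = angle_range
        pitch = np.arcsin(np.clip(np.asarray(gaze, float)[:, 1], -1.0, 1.0))
        return ((pitch >= lo) & (pitch <= hi)).astype(float)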
Optionally, the determining module 32 further includes:
a seventh determining submodule, configured to determine a suspected start time of a preparation stage before a class according to image information in the teaching video;
the judging submodule is used for judging, according to the audio information after the suspected start time, whether no one speaks within a first preset duration after the suspected start time;
an eighth determining submodule, configured to take the suspected start time as the start time of the pre-class preparation stage when it is determined that no one speaks within the first preset duration after the suspected start time;
and a ninth determining submodule for taking the time, after the start time, at which the teacher is detected to start speaking based on the audio information as the end time of the pre-class preparation stage.
Optionally, the seventh determining submodule is configured to:
take, when it is determined that the teaching projection image in the image information holds the same picture for more than a second preset duration, the moment when the teaching projection image begins to hold that picture as a first suspected start time; or,
take, when it is determined that the teacher stands in the same position in the image information for more than a third preset duration, the moment when the teacher takes that position as a second suspected start time; or,
take, when it is determined that the students' state in the image information matches a preset pre-class preparation state, the moment when the students' state is first determined to match the pre-class preparation state as a third suspected start time. A minimal sketch of confirming such a suspected start time follows.
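A minimal sketch of confirming a suspected start time against the audio track; anyone_speaks stands in for a voice-activity detector, which is an assumption.

    from typing import Callable, List, Optional, Tuple

    def confirm_preclass(
        suspected_start: float,
        anyone_speaks: Callable[[float, float], bool],  # VAD over [t0, t1), assumed
        teacher_speech_onsets: List[float],             # teacher utterance starts (sec)
        quiet_window: float = 60.0,                     # the "first preset duration"
    ) -> Optional[Tuple[float, float]]:
        """Return (start, end) of the pre-class preparation stage, or None."""
        if anyone_speaks(suspected_start, suspected_start + quiet_window):
            return None  # someone spoke right away: not a pre-class lull
        later = [t for t in teacher_speech_onsets if t > suspected_start]
        return (suspected_start, later[0]) if later else None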
Optionally, the determining module 32 is further configured to:
starting break detection within a preset time period, wherein the start of the preset time period is a first interval after the start time of the teaching video, and the end of the preset time period is a second interval before the end time of the teaching video;
wherein the break detection comprises determining the start time of the break period according to the detection result of at least one of the following detection modes:
detecting, according to the image information, the times at which students enter and exit the classroom; detecting the time at which the projected image in the image information switches to a projection unrelated to teaching content; detecting, according to the audio information, the time at which the noise in the classroom exceeds a preset threshold;
the break detection further comprises determining the end time of the break period according to the detection result of at least one of the following detection modes: detecting the time at which the projected image in the image information switches to a projection related to teaching; and detecting, according to the audio information, the time at which the noise in the classroom falls below the preset threshold. A minimal sketch of the noise cue follows.
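A minimal sketch of the noise cue, evaluated only inside the preset window; per-second RMS levels are assumed to be produced by the audio pipeline.

    def noisy_intervals(rms_per_sec, threshold, window_start, window_end):
        """rms_per_sec: per-second loudness values (assumed precomputed).
        Return [start, end) second intervals inside [window_start, window_end)
        where classroom noise exceeds the preset threshold: break candidates."""
        intervals, start = [], None
        for t in range(window_start, min(window_end, len(rms_per_sec))):
            loud = rms_per_sec[t] > threshold
            if loud and start is None:
                start = t
            elif not loud and start is not None:
                intervals.append((start, t))
                start = None
        if start is not None:
            intervals.append((start, min(window_end, len(rms_per_sec))))
        return intervals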
Optionally, the determining module 32 is further configured to:
performing, when it is determined according to the image information that more than a preset number of students in the classroom change posture or position, sound classification on the audio information corresponding to those changes;
performing, when the sound classification result indicates that multiple people are speaking, speech separation on that audio information to obtain the teacher's speech, and transcribing the speech into text through a speech recognition transcription technology;
matching the text against a preset dictionary;
if the matching succeeds, determining the moment when more than the preset number of students in the classroom change posture or position as the start time of a classroom commotion stage; and,
determining the moment when the sound classification result indicates that the multi-person speaking has ended as the end time of the classroom commotion stage; or determining, according to the image information, the moment when fewer than the preset number of students change posture or position as the end time of the classroom commotion stage.
Optionally, the image information further includes a projection image, and the determining module 32 is further configured to:
comparing, when it is determined that a projector in the classroom is playing a video, the projected images in the image information and/or the corresponding audio information during playback with a pre-established white list to obtain a second comparison result;
and if the second comparison result indicates that there is no match with the white list, taking the moment when the video starts playing as the start time of the teaching-unrelated video, and taking the moment when the video finishes playing as the end time of the teaching-unrelated video.
Optionally, the determining module 32 is further configured to:
acquiring the movement tracks of students in the classroom according to the image information, and judging, according to the audio information, whether music is played in the classroom and whether the teacher's speech matches a pre-calibrated game dictionary;
determining, when the movement tracks match a preset student movement track representing a game interaction scene, that the current video clip is a game interaction stage if music is played in the classroom or the teacher's speech matches the game dictionary;
determining the start time of the game interaction stage according to at least one of the time when the students start to move, the time when music starts to play in the classroom, and the time when the teacher's speech first matches the game dictionary;
and determining, according to the audio information, the end time of the game interaction stage based on the time when music stops playing in the classroom or the time when the teacher's speech is last detected to match the game dictionary; or determining, according to the movement tracks, the time when the students return to their seats as the end time of the game interaction stage.
Optionally, the determining module 32 is further configured to:
transcribing the teacher's speech in the audio information into text through a speech recognition transcription technology;
matching the text against a preset sensitive-word dictionary;
and if a word matching the sensitive-word dictionary is found, taking the start time of the sentence in which the word appears as the start time of the non-standard teaching behavior, and the end time of that sentence as the end time of the non-standard teaching behavior.
Optionally, the deleting module 33 further includes:
the prompting submodule is used for generating a thumbnail of the video segment, displaying it to a user, and reminding the user to confirm whether to delete the video segment;
and the deleting submodule is used for deleting the video segment from the teaching video according to the start time and the end time upon receiving a deletion confirmation instruction from the user.
In the embodiments of the disclosure, video clips unrelated to teaching are automatically deleted from the teaching video based on the image information and/or audio information in the teaching video. On the one hand, this reduces the time students waste when watching the make-up lesson video; on the other hand, it reduces the waste of network and storage resources. In addition, no manual processing is needed, which improves the processing speed and lowers the labor cost.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating an electronic device 40 according to an example embodiment. As shown in fig. 4, the electronic device 40 may include: a processor 41 and a memory 42. The electronic device 40 may also include one or more of a multimedia component 43, an input/output (I/O) interface 44, and a communications component 45.
The processor 41 controls the overall operation of the electronic device 40 to complete all or part of the steps of the video clipping method. The memory 42 stores various types of data to support operation of the electronic device 40, such as instructions for any application or method operating on the device, and application-related data such as contact data, messages, course information, pictures, audio, and video. The memory 42 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 43 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signals may further be stored in the memory 42 or transmitted through the communication component 45. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 44 provides an interface between the processor 41 and other interface modules such as a keyboard, a mouse, or buttons, which may be virtual or physical. The communication component 45 is used for wired or wireless communication between the electronic device 40 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, or 5G, or a combination of one or more of them, which is not limited herein. The corresponding communication component 45 may thus comprise a Wi-Fi module, a Bluetooth module, an NFC module, and so on.
In an exemplary embodiment, the electronic device 40 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the video clipping method described above.
In another exemplary embodiment, there is also provided a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the video clipping method described above. For example, the computer readable storage medium may be the memory 42 described above including program instructions that are executable by the processor 41 of the electronic device 40 to perform the video clipping method described above.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of those embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical concept, and all such simple modifications fall within the protection scope of the present disclosure.
It should further be noted that the specific features described in the above embodiments may be combined in any suitable manner; to avoid unnecessary repetition, the possible combinations are not described separately.
In addition, the various embodiments of the present disclosure may be combined with one another arbitrarily, and such combinations should likewise be regarded as disclosed herein, as long as they do not depart from the spirit of the present disclosure.

Claims (17)

1. A method of video clipping, the method comprising:
acquiring a teaching video, wherein the teaching video is obtained by shooting a classroom scene in a classroom;
determining the starting time and the ending time of a video clip irrelevant to teaching contents in the teaching video according to image information and/or audio information in the teaching video;
and deleting the video clip from the teaching video according to the starting time and the ending time.
2. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to image information and audio information in the teaching video comprises:
determining whether the teacher in the image information is in a roll-call posture;
transcribing, when it is determined that the teacher is in the roll-call posture, the audio information corresponding to the teacher's roll-call posture into text through a speech recognition transcription technology, and comparing the text with a student list to obtain a first comparison result;
taking, when the first comparison result indicates that the text matches a student name in the student list, the moment when the teacher in the image information assumes the roll-call posture as the start time of a roll-call stage; and,
taking, according to the first comparison result, the time when the transcribed text last matches a student name in the student list as the end time of the roll-call stage.
3. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to image information in the teaching video comprises:
inputting, when it is determined that the teacher is in the roll-call posture, the images corresponding to the teacher's roll-call posture into a pre-trained lip-language recognition model to obtain the text corresponding to the teacher's lip language, and comparing that text with a student list to obtain a lip-language comparison result;
taking, when the lip-language comparison result indicates that the text corresponding to the teacher's lip language matches a student name in the student list, the moment when the teacher in the image information assumes the roll-call posture as the start time of a roll-call stage; and,
taking, according to the lip-language comparison result, the time when the text corresponding to the teacher's lip language last matches a student name in the student list as the end time of the roll-call stage.
4. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to image information and audio information in the teaching video comprises:
inputting, when it is determined that the teacher is in the roll-call posture, the images corresponding to the teacher's roll-call posture into a pre-trained lip-language recognition model to obtain lip-language features, and transcribing the audio information corresponding to the roll-call posture through a speech recognition transcription technology to obtain voice features;
arranging the lip-language features and the voice features into time feature sequences in chronological order;
fusing the time feature sequence of the lip-language features with that of the voice features, and then transcribing the fused sequence to obtain the text corresponding to the teacher's speech;
and comparing the text corresponding to the teacher's speech with a student list, and determining the start time and the end time of the roll-call stage according to the comparison result.
5. The method of claim 2, wherein determining whether the teacher in the image information is in a roll-call posture comprises:
determining, according to a skeleton-point detection technology, whether the teacher in the image information is holding a mobile phone or a paper list;
determining, according to a sight detection technology, whether the teacher's sight in the image information switches multiple times between the mobile phone or paper list and other positions;
and determining that the teacher in the image information is in the roll-call posture when the teacher is holding the mobile phone or paper list and the teacher's sight switches multiple times between the mobile phone or paper list and other positions.
6. The method of claim 5, wherein determining, according to the sight detection technology, whether the teacher's sight in the image information switches multiple times between the mobile phone or paper list and other positions comprises:
acquiring a plurality of teacher sight angles from a plurality of sampling frames in the image information according to the sight detection technology;
calculating, for every two adjacent sampling frames among the plurality of sampling frames, the cosine distance between the teacher sight angles in the two frames; if the cosine distance is smaller than a sight concentration threshold, determining that the teacher looks at the same position in the two frames, and if the cosine distance is larger than the threshold, determining that the teacher looks at different positions;
recording the teacher sight angles of the two frames when the teacher looks at the same position in the two adjacent sampling frames;
determining, from the distribution of the recorded sight angles, the angle range within which the teacher is looking at the mobile phone or paper list;
generating a feature vector from the teacher sight angle in each sampling frame and the angle range, wherein each feature value in the feature vector indicates whether the sight angle in one sampling frame falls within the angle range;
and inputting the feature vector into a pre-trained binary classification model to obtain an output indicating whether the teacher's sight switches multiple times between the mobile phone or paper list and other positions.
7. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to image information and audio information in the teaching video comprises:
determining a suspected start time of a pre-class preparation stage according to the image information in the teaching video;
judging, according to the audio information after the suspected start time, whether no one speaks within a first preset duration after the suspected start time;
taking the suspected start time as the start time of the pre-class preparation stage when it is determined that no one speaks within the first preset duration after the suspected start time; and,
taking the time, after the start time, at which the teacher is detected to start speaking based on the audio information as the end time of the pre-class preparation stage.
8. The method of claim 7, wherein the image information in the teaching video comprises a teaching projection image, and determining the suspected start time of the pre-class preparation stage according to the image information in the teaching video comprises:
taking, when it is determined that the teaching projection image in the image information holds the same picture for more than a second preset duration, the moment when the teaching projection image begins to hold that picture as a first suspected start time; or,
taking, when it is determined that the teacher stands in the same position in the image information for more than a third preset duration, the moment when the teacher takes that position as a second suspected start time; or,
taking, when it is determined that the students' state in the image information matches a preset pre-class preparation state, the moment when the students' state is first determined to match the pre-class preparation state as a third suspected start time.
9. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video, which is irrelevant to teaching contents, according to image information and/or audio information in the teaching video comprises:
starting break detection within a preset time period, wherein the start of the preset time period is a first interval after the start time of the teaching video, and the end of the preset time period is a second interval before the end time of the teaching video;
wherein the break detection comprises determining the start time of the break period according to the detection result of at least one of the following detection modes:
detecting, according to the image information, the times at which students enter and exit the classroom; detecting the time at which the projected image in the image information switches to a projection unrelated to teaching content; detecting, according to the audio information, the time at which the noise in the classroom exceeds a preset threshold;
and the break detection further comprises determining the end time of the break period according to the detection result of at least one of the following detection modes: detecting the time at which the projected image in the image information switches to a projection related to teaching; and detecting, according to the audio information, the time at which the noise in the classroom falls below the preset threshold.
10. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to image information and audio information in the teaching video comprises:
performing, when it is determined according to the image information that more than a preset number of students in the classroom change posture or position, sound classification on the audio information corresponding to those changes;
performing, when the sound classification result indicates that multiple people are speaking, speech separation on that audio information to obtain the teacher's speech, and transcribing the speech into text through a speech recognition transcription technology;
matching the text against a preset dictionary;
if the matching succeeds, determining the moment when more than the preset number of students in the classroom change posture or position as the start time of a classroom commotion stage; and,
determining the moment when the sound classification result indicates that the multi-person speaking has ended as the end time of the classroom commotion stage; or determining, according to the image information, the moment when fewer than the preset number of students change posture or position as the end time of the classroom commotion stage.
11. The method of claim 1, wherein the image information further comprises a projected image, and determining a start time and an end time of a video clip of the teaching video, which is not related to the teaching content, according to the image information and the audio information in the teaching video comprises:
comparing, when it is determined that a projector in the classroom is playing a video, the projected images in the image information and/or the corresponding audio information during playback with a pre-established white list to obtain a second comparison result;
and if the second comparison result indicates that there is no match with the white list, taking the moment when the video starts playing as the start time of the teaching-unrelated video, and taking the moment when the video finishes playing as the end time of the teaching-unrelated video.
12. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to image information and audio information in the teaching video comprises:
acquiring the movement tracks of students in the classroom according to the image information, and judging, according to the audio information, whether music is played in the classroom and whether the teacher's speech matches a pre-calibrated game dictionary;
determining, when the movement tracks match a preset student movement track representing a game interaction scene, that the current video clip is a game interaction stage if music is played in the classroom or the teacher's speech matches the game dictionary;
determining the start time of the game interaction stage according to at least one of the time when the students start to move, the time when music starts to play in the classroom, and the time when the teacher's speech first matches the game dictionary;
and determining, according to the audio information, the end time of the game interaction stage based on the time when music stops playing in the classroom or the time when the teacher's speech is last detected to match the game dictionary; or determining, according to the movement tracks, the time when the students return to their seats as the end time of the game interaction stage.
13. The method of claim 1, wherein determining a start time and an end time of a video clip of the teaching video that is not related to teaching content according to audio information in the teaching video comprises:
transcribing the teacher's speech in the audio information into text through a speech recognition transcription technology;
matching the text against a preset sensitive-word dictionary;
and if a word matching the sensitive-word dictionary is found, taking the start time of the sentence in which the word appears as the start time of the non-standard teaching behavior, and the end time of that sentence as the end time of the non-standard teaching behavior.
14. The method of any one of claims 1-13, wherein said deleting the video clip from the instructional video based on the start time and the end time comprises:
generating a thumbnail of the video segment, displaying it to a user, and reminding the user to confirm whether to delete the video segment;
and deleting the video segment from the teaching video according to the start time and the end time upon receiving a deletion confirmation instruction from the user.
15. A video clipping apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a teaching video, wherein the teaching video is obtained by shooting a classroom scene in a classroom;
the determining module is used for determining the starting time and the ending time of a video clip irrelevant to teaching contents in the teaching video according to image information and/or audio information in the teaching video;
and the deleting module is used for deleting the video clip from the teaching video according to the starting time and the ending time.
16. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 14.
17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 14.
CN202110328319.7A 2021-03-26 2021-03-26 Video clipping method, video clipping device, electronic equipment and storage medium Pending CN113052085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110328319.7A CN113052085A (en) 2021-03-26 2021-03-26 Video clipping method, video clipping device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113052085A true CN113052085A (en) 2021-06-29

Family

ID=76515829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110328319.7A Pending CN113052085A (en) 2021-03-26 2021-03-26 Video clipping method, video clipping device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113052085A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011049353A2 * 2009-10-21 2011-04-28 DVS Korea Co., Ltd. System and method for providing electronic learning content
US20170168697A1 (en) * 2015-12-09 2017-06-15 Shahar SHPALTER Systems and methods for playing videos
CN106781763A * 2016-12-16 2017-05-31 景德镇陶瓷大学 University applied mathematics teaching system
CN108305512A * 2018-01-05 2018-07-20 珠海向导科技有限公司 Classroom electronic notebook recording system and method
CN110010164A (en) * 2018-11-13 2019-07-12 成都依能科技股份有限公司 Multi-channel sound video automated intelligent edit methods
CN212112575U (en) * 2020-06-12 2020-12-08 西北师范大学 Intelligent educational administration supervision equipment
CN112004042A (en) * 2020-08-10 2020-11-27 深圳创维数字技术有限公司 Course recording method, device, system and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘玲; 冯菲: "Standards and Implementation for Converting Classroom Recordings into Flipped-Classroom Teaching Videos", 现代教育技术 (Modern Educational Technology), no. 07, page 62.
杨红业: "Analysis of Video Production Techniques for Online Open Courses", 软件导刊(教育技术) (Software Guide, Educational Technology), no. 06, page 73.

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023011142A1 (en) * 2021-08-03 2023-02-09 北京字跳网络技术有限公司 Video processing method and apparatus, electronic device and storage medium
CN114025234A (en) * 2021-11-08 2022-02-08 北京高途云集教育科技有限公司 Video editing method and device, electronic equipment and storage medium
CN113810766A (en) * 2021-11-17 2021-12-17 深圳市速点网络科技有限公司 Video clip combination processing method and system
CN114519789A (en) * 2022-01-27 2022-05-20 北京精鸿软件科技有限公司 Classroom scene classroom switching discrimination method and device and electronic equipment
CN115767174A (en) * 2022-10-31 2023-03-07 上海卓越睿新数码科技股份有限公司 Online video editing method
CN116226453A (en) * 2023-05-10 2023-06-06 北京小糖科技有限责任公司 Method, device and terminal equipment for identifying dancing teaching video clips
CN116226453B (en) * 2023-05-10 2023-09-26 北京小糖科技有限责任公司 Method, device and terminal equipment for identifying dancing teaching video clips


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination