CN117319578A - Surgical video editing method and system based on deep learning - Google Patents

Surgical video editing method and system based on deep learning

Info

Publication number
CN117319578A
CN117319578A
Authority
CN
China
Prior art keywords
video
initial
editing
training set
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311289118.6A
Other languages
Chinese (zh)
Inventor
陶然
王东明
张汉江
武宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Shendi Information Technology Co ltd
Hubei Luojia Laboratory
Original Assignee
Hangzhou Shendi Information Technology Co ltd
Hubei Luojia Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2023-10-08
Filing date: 2023-10-08
Publication date: 2023-12-29
Application filed by Hangzhou Shendi Information Technology Co ltd, Hubei Luojia Laboratory filed Critical Hangzhou Shendi Information Technology Co ltd
Priority to CN202311289118.6A
Publication of CN117319578A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep learning-based surgical video editing method and system, belonging to the technical field of video image processing, comprising the following steps: acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set; adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model; inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result; and post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result. The invention adopts an end-to-end general deep learning model and introduces temporal information via dense optical flow maps during editing training, enhancing the stability and accuracy of the classification judgments.

Description

Surgical video editing method and system based on deep learning
Technical Field
The invention relates to the technical field of video image processing, in particular to a surgical video editing method and system based on deep learning.
Background
Hospitals continuously generate large volumes of surgical videos during operations. For different organs and anatomical sites, these valuable video resources serve, on the one hand, as data for post-operative study of surgical technique and, on the other hand, as learning templates and material from successful operations for clinical surgery trainees.
With the popularization of endoscopic surgery, videos captured by endoscopes typically contain many redundant segments, such as shots in which no scalpel is visible or in which the scalpel is stationary. These segments disrupt teaching and degrade post-operative discussion, so they must be removed in post-production. At present, hospitals still crop these medical videos manually, which is time-consuming, labor-intensive, and extremely inefficient, creating a heavy burden for medical staff and researchers.
More efficient technical means are therefore needed to process these videos. Some existing approaches crop video automatically using traditional digital image processing, but these models lack accuracy, their hyperparameter tuning consumes manpower and often yields poor results, and the generated cut information is incoherent. Traditional methods also generalize poorly and can only handle one specific kind of medical video. Machine-learning-based video editing methods exist as well, but they first require manual extraction of key frames; this extraction not only discards information from the original video but also greatly reduces editing efficiency. Even where such a method considers the temporal information of the video, the result is choppy.
Disclosure of Invention
The invention provides a deep learning-based surgical video editing method and system to address the defects of the prior art: editing surgical videos consumes manpower and material resources, the edited video is incoherent, and editing accuracy is low.
In a first aspect, the present invention provides a deep learning-based surgical video editing method, including:
acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set;
adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model;
inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result;
and post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
According to the deep learning-based surgical video editing method provided by the invention, acquiring an original surgical video and a manually edited video to construct an initial training set and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set includes:
taking the video frames of the manually edited video as positive samples;
screening the video frames that are contained in the original surgical video but not in the manually edited video as negative samples;
and constructing the preprocessed training set from the positive samples and the negative samples.
According to the deep learning-based surgical video editing method provided by the invention, screening the video frames contained in the original surgical video but not in the manually edited video as negative samples includes:
inputting each video frame of the original surgical video and of the manually edited video into a pre-trained classification network to obtain an output vector for each frame;
and comparing the distances between the output vectors of the frames against a preset threshold to obtain the video frames that are contained in the original surgical video but not in the manually edited video, which serve as the negative samples.
According to the deep learning-based surgical video editing method provided by the invention, adopting a preset deep learning network as an initial classification network model and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model includes:
acquiring the dense optical flow maps of adjacent image frames in the preprocessed training set;
adopting a residual network with a preset number of layers as the initial classification network model, and dividing the preprocessed training set into a training set and a test set according to a preset split ratio;
taking the dense optical flow maps of adjacent image frames as binary pairs, and acquiring video sequence-number classification labels of the video frames in the preprocessed training set, wherein the video sequence-number classification labels correspond to the binary pairs;
and inputting the training set, the test set, the binary pairs, and the video sequence-number classification labels into the initial classification network model for training, and outputting the surgical video classification editing model.
According to the deep learning-based surgical video editing method provided by the invention, the dense optical flow map comprises optical flow information in the horizontal-axis direction and optical flow information in the vertical-axis direction.
According to the deep learning-based surgical video editing method provided by the invention, inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result includes:
dividing the video to be edited into time-ordered video frames, and obtaining the dense optical flow maps corresponding to the time-ordered video frames;
inputting the binary pairs corresponding to the time-ordered video frames and the dense optical flow maps into the surgical video classification editing model to obtain retained video frames and cut video frames;
and sorting the retained video frames by their time sequence numbers to obtain the initial video editing result.
According to the deep learning-based surgical video editing method provided by the invention, post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result includes:
acquiring the original surgical video frame rate;
dividing the initial video editing result into a plurality of video groups at the original surgical video frame rate;
determining a preset video frame proportion, and taking the video groups exceeding the preset video frame proportion as retained video frame results;
and merging all the retained video frame results, and outputting the surgical video editing result.
In a second aspect, the present invention also provides a deep learning-based surgical video editing system, including:
a preprocessing module, used for acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set;
a training module, used for adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model;
an editing module, used for inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result;
and a post-processing module, used for post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the deep learning based surgical video editing method as described in any one of the above when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a deep learning based surgical video editing method as described in any of the above.
According to the deep learning-based surgical video editing method and system, an end-to-end general deep learning model is adopted and temporal information is introduced via dense optical flow maps during editing training, which enhances the stability and accuracy of the classification judgments and makes the video editing more precise.
Drawings
To illustrate the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the invention, and that a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a first schematic flowchart of the deep learning-based surgical video editing method provided by the invention;
FIG. 2 is a second schematic flowchart of the deep learning-based surgical video editing method provided by the invention;
FIG. 3 is a schematic structural diagram of the deep learning-based surgical video editing system provided by the invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
To address the many defects of surgical video editing in the prior art, such as the consumption of manpower and material resources, incoherent edited video, and low editing accuracy, the embodiments of the invention provide a deep learning-based surgical video editing method, taking kidney surgery videos as an example.
Fig. 1 is a schematic flowchart of a deep learning-based surgical video editing method according to an embodiment of the present invention, which, as shown in fig. 1, includes:
Step 100: acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set;
Step 200: adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model;
Step 300: inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result;
Step 400: post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
Specifically, as shown in fig. 2, in the embodiment of the present invention, a kidney surgery video and the corresponding manually edited video are first input. Each video is split into frames, each frame is classified to generate positive and negative samples, and a dense optical flow map is generated for each video frame. Each video frame and its corresponding dense optical flow map form a binary pair, and the labeled binary pairs are used to train the classification neural network. A video to be edited is then input, and the sequence numbers of frames labeled "retain" or "cut" are output. Based on the original surgical video frame rate, the frames are grouped, for example 30 frames per group, and the number of frames labeled "retain" in each group is counted. If more than 15 of the 30 frames in a group are labeled "retain", all 30 frames are retained. Finally, all retained video frames are merged into a new video, and the final video is output.
According to the invention, an end-to-end general deep learning model is adopted and temporal information is introduced via dense optical flow maps during editing training, which enhances the stability and accuracy of the classification judgments and makes the video editing more precise.
Based on the above embodiment, constructing an initial training set from the acquired original surgical video and manually edited video and performing binary-classification preprocessing on it to obtain a preprocessed training set includes:
taking the video frames of the manually edited video as positive samples;
screening the video frames that are contained in the original surgical video but not in the manually edited video as negative samples;
and constructing the preprocessed training set from the positive samples and the negative samples.
Here, screening the video frames contained in the original surgical video but not in the manually edited video as negative samples includes:
inputting each video frame of the original surgical video and of the manually edited video into a pre-trained classification network to obtain an output vector for each frame;
and comparing the distances between the output vectors of the frames against a preset threshold to obtain the video frames that are contained in the original surgical video but not in the manually edited video, which serve as the negative samples.
Specifically, the training set is preprocessed first. The training set of the embodiment of the invention consists of kidney surgery demonstration videos captured by an endoscope: one is the original kidney surgery video, the other is the manually edited surgery video. Owing to the settings of the editing software, the resolution of the original surgical teaching video is 640×480, while that of the manually edited teaching video is 640×1280. Editing a surgical video is essentially a binary-classification problem on video frames, i.e., labeling each frame of the input video as a positive or a negative sample. To realize this classification, the preprocessing stage must separate positive samples from negative samples, i.e., the frames to be retained from the frames to be cut. Since the manually edited video already contains the positive-sample frames, only the frames that appear in the original kidney surgery video but not in the manually edited video need to be found; these are the negative samples. However, because different editing software outputs video at different resolutions, the original kidney surgery video and the manually edited video differ in resolution, and frames cannot be matched directly. In this embodiment, every frame of the original kidney surgery video and of the manually edited video is therefore passed through a pre-trained network, typically a mainstream deep learning neural network, and the output vectors of the frames are compared. Based on the principle that images with similar features produce similar output vectors, applying a tuned threshold to the distances between the output vectors of different frames identifies the frames contained in the original kidney surgery video but absent from the manually edited video, which constitute the negative sample set. On this training set, the embodiment obtains 1,410,391 positive samples and 387,917 negative samples.
This preprocessing calibrates the positive and negative samples automatically; the whole process requires no manual labeling or classification.
With a sufficiently large dataset, the deep learning classification network architecture ensures reliable classification.
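For illustration, a minimal sketch of this screening step is given below. The patent specifies only a "mainstream deep learning neural network" and a tuned distance threshold; the ResNet-18 backbone, cosine distance, and the 0.15 threshold used here are assumptions made for the sketch.

```python
# A minimal sketch of negative-sample screening by comparing frame embeddings.
# Backbone, distance metric, and threshold are illustrative assumptions.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # use the pooled features as the output vector
backbone.eval().to(device)

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_vectors(path: str) -> torch.Tensor:
    """Return one feature vector per frame of the video at `path`."""
    cap, feats = cv2.VideoCapture(path), []
    ok, frame = cap.read()
    while ok:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        feats.append(backbone(preprocess(rgb).unsqueeze(0).to(device)).squeeze(0))
        ok, frame = cap.read()
    cap.release()
    return torch.stack(feats)

def negative_frame_indices(original: str, edited: str, threshold: float = 0.15):
    """Indices of original-video frames with no close match in the edited video."""
    orig = torch.nn.functional.normalize(frame_vectors(original), dim=1)
    kept = torch.nn.functional.normalize(frame_vectors(edited), dim=1)
    # cosine distance of each original frame to its nearest edited frame
    dist = 1.0 - (orig @ kept.T).max(dim=1).values
    return (dist > threshold).nonzero(as_tuple=True)[0].tolist()
```

Because matching runs in feature space rather than pixel space, this comparison works even though the two videos have different resolutions.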
Based on the above embodiment, adopting a preset deep learning network as an initial classification network model and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model includes:
acquiring the dense optical flow maps of adjacent image frames in the preprocessed training set;
adopting a residual network with a preset number of layers as the initial classification network model, and dividing the preprocessed training set into a training set and a test set according to a preset split ratio;
taking the dense optical flow maps of adjacent image frames as binary pairs, and acquiring video sequence-number classification labels of the video frames in the preprocessed training set, wherein the video sequence-number classification labels correspond to the binary pairs;
and inputting the training set, the test set, the binary pairs, and the video sequence-number classification labels into the initial classification network model for training, and outputting the surgical video classification editing model.
Here, the dense optical flow map comprises optical flow information in the horizontal-axis direction and optical flow information in the vertical-axis direction.
Specifically, the embodiment of the invention adopts ResNet150 as the architecture of the classification network; 70% of the positive and negative samples are randomly selected as the training set, and the remaining 30% serve as the test set for evaluating the classification accuracy of the network. Because the segments in which the scalpel is stationary must be deleted from the original kidney surgery video, positive and negative samples cannot simply be fed to the classification network as in conventional classification training; the temporal information of the video must be introduced. In this embodiment, in addition to each frame of the positive and negative samples, the dense optical flow map between that frame and the next frame is fed into the neural network. The dense optical flow map is a two-channel image containing the optical flow information along the vertical axis and the horizontal axis, respectively. Treating each frame together with the dense optical flow map between it and the next frame as a binary pair preserves both spatial and temporal information, which improves the recognition accuracy of the neural network. Since every binary pair fed in was labeled positive or negative during preprocessing, training is supervised, further improving recognition accuracy. On the test set, this embodiment achieves a classification accuracy of 98.1%.
The method uses dense optical flow maps to introduce temporal information while training the classification neural network, which strengthens the robustness of the judgments and lays the foundation for the coherence of the output video. Moreover, the proposed training and testing model is end-to-end: the general framework avoids hand-designed feature extraction and selection methods, reduces the dependence on domain expertise, and improves the generalization and adaptation capability of the model.
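Since each binary pair carries a positive or negative label from preprocessing, training is ordinary supervised classification. A minimal sketch follows, reusing `model` from the sketch above and assuming a Dataset that yields the five-channel tensors with their labels; the batch size, optimizer, and learning rate are illustrative choices, not values from the patent.

```python
# A minimal supervised training loop over labeled binary pairs.
from torch.utils.data import DataLoader

def train(model, dataset, epochs: int = 10, lr: float = 1e-4):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for pairs, labels in loader:        # labels: 1 = "retain", 0 = "cut"
            optimizer.zero_grad()
            loss = criterion(model(pairs), labels)
            loss.backward()
            optimizer.step()
```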
Based on the above embodiment, inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result includes:
dividing the video to be edited into time-ordered video frames, and obtaining the dense optical flow maps corresponding to the time-ordered video frames;
inputting the binary pairs corresponding to the time-ordered video frames and the dense optical flow maps into the surgical video classification editing model to obtain retained video frames and cut video frames;
and sorting the retained video frames by their time sequence numbers to obtain the initial video editing result.
Specifically, the video to be edited is input into the trained classification network. The video to be edited is a kidney surgery video that contains both frames to be retained and frames to be cut. After the video is divided into a time-ordered set of frames, the dense optical flow map corresponding to each frame is generated. Binary pairs of video frames and their corresponding dense optical flow maps are fed into the neural network, which classifies them one by one as "1" ("retain") or "0" ("cut"). The sequence numbers of the frames judged "1" ("retain") are then saved.
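A minimal inference sketch, reusing the `dense_flow` and `binary_pair` helpers and a trained `model` from the sketches above; skipping the final frame, which has no successor for computing flow, is a simplifying assumption.

```python
# A minimal sketch of classifying each frame of the video to be edited.
import cv2
import torch

@torch.no_grad()
def classify_frames(video_path: str, model) -> list[int]:
    """Return time-ordered indices of the frames the network judges 'retain'."""
    cap = cv2.VideoCapture(video_path)
    retained, idx = [], 0
    ok, prev = cap.read()
    ok_next, nxt = cap.read() if ok else (False, None)
    model.eval()
    while ok and ok_next:
        pair = binary_pair(prev, dense_flow(prev, nxt)).unsqueeze(0)
        if model(pair).argmax(dim=1).item() == 1:   # 1 = "retain", 0 = "cut"
            retained.append(idx)
        prev, idx = nxt, idx + 1
        ok_next, nxt = cap.read()
    cap.release()
    return retained
```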
Based on the above embodiment, post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result includes:
acquiring the original surgical video frame rate;
dividing the initial video editing result into a plurality of video groups at the original surgical video frame rate;
determining a preset video frame proportion, and taking the video groups exceeding the preset video frame proportion as retained video frame results;
and merging all the retained video frame results, and outputting the surgical video editing result.
Specifically, the embodiment of the invention also post-processes the classification result. Normally, the frames judged "retain" could simply be spliced together to form the edited video, but the trained network may misjudge some frames. The embodiment therefore adds a post-processing step to compensate for possible misjudgments by the neural network.
In this embodiment, the test videos have a frame rate of 30, so the whole video is first divided into groups of 30 frames. If more than 50% of the frames in a group are judged "retain", all frames in that group are retained. This post-processing ensures that all required frames remain in the edited video and preserves its coherence, without noticeable dropped frames or frame skipping.
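A minimal sketch of this grouping rule; the frame rate of 30 and the 50% proportion from this embodiment are exposed as parameters.

```python
# A minimal sketch of frame-rate-based post-processing: frames are grouped by
# the original frame rate, and a whole group is kept when more than the preset
# proportion of its frames were judged "retain".
def postprocess(retained: list[int], total_frames: int,
                fps: int = 30, keep_ratio: float = 0.5) -> list[int]:
    retained_set = set(retained)
    kept = []
    for start in range(0, total_frames, fps):
        group = range(start, min(start + fps, total_frames))
        votes = sum(1 for i in group if i in retained_set)
        if votes > keep_ratio * len(group):
            kept.extend(group)          # keep the whole group
    return kept
```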
Further, the edited video is output. Once all the frames to be retained have been obtained, they are merged into a new video. The output is a coherent video that retains the useful information and removes the redundant information.
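Merging the retained frames into the output video can be sketched with OpenCV's VideoWriter; the mp4v codec and the output filename are illustrative assumptions.

```python
# A minimal sketch of writing the retained frames to a new video file.
import cv2

def write_video(src_path: str, kept: list[int], dst_path: str = "edited.mp4"):
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    keep, idx = set(kept), 0
    ok, frame = cap.read()
    while ok:
        if idx in keep:
            out.write(frame)            # copy only the retained frames
        idx += 1
        ok, frame = cap.read()
    cap.release()
    out.release()
```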
It will be appreciated that, beyond kidney surgery videos, the invention can edit surgical videos in other domains simply by replacing the training and test sets with videos of other organs, such as the heart. For hospitals that produce thousands of gigabytes of surgical video per year, this video editing technique can save considerable cost and labor.
The deep learning-based surgical video editing system provided by the invention is described below; the system described below and the deep learning-based surgical video editing method described above may be referred to in correspondence with each other.
Fig. 3 is a schematic structural diagram of a deep learning-based surgical video editing system according to an embodiment of the present invention; as shown in fig. 3, the system includes a preprocessing module 31, a training module 32, an editing module 33, and a post-processing module 34, wherein:
the preprocessing module 31 is used for acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set; the training module 32 is used for adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model; the editing module 33 is used for inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result; and the post-processing module 34 is used for post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 4, the electronic device may include: a processor 410, a communication interface (Communications Interface) 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform the deep learning-based surgical video editing method, comprising: acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set; adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model; inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result; and post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
Further, the logic instructions in the memory 430 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the deep learning-based surgical video editing method provided by the above methods, the method comprising: acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set; adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model; inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result; and post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A deep learning-based surgical video editing method, comprising:
acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set;
adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model;
inputting a surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result;
and post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
2. The deep learning-based surgical video editing method according to claim 1, wherein acquiring an original surgical video and a manually edited video to construct an initial training set and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set comprises:
taking the video frames of the manually edited video as positive samples;
screening the video frames that are contained in the original surgical video but not in the manually edited video as negative samples;
and constructing the preprocessed training set from the positive samples and the negative samples.
3. The deep learning-based surgical video editing method according to claim 2, wherein screening the video frames contained in the original surgical video but not in the manually edited video as negative samples comprises:
inputting each video frame of the original surgical video and of the manually edited video into a pre-trained classification network to obtain an output vector for each frame;
and comparing the distances between the output vectors of the frames against a preset threshold to obtain the video frames that are contained in the original surgical video but not in the manually edited video, which serve as the negative samples.
4. The deep learning-based surgical video editing method according to claim 1, wherein adopting a preset deep learning network as an initial classification network model and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model comprises:
acquiring the dense optical flow maps of adjacent image frames in the preprocessed training set;
adopting a residual network with a preset number of layers as the initial classification network model, and dividing the preprocessed training set into a training set and a test set according to a preset split ratio;
taking the dense optical flow maps of adjacent image frames as binary pairs, and acquiring video sequence-number classification labels of the video frames in the preprocessed training set, wherein the video sequence-number classification labels correspond to the binary pairs;
and inputting the training set, the test set, the binary pairs, and the video sequence-number classification labels into the initial classification network model for training, and outputting the surgical video classification editing model.
5. The deep learning-based surgical video editing method according to claim 4, wherein the dense optical flow map comprises optical flow information in the horizontal-axis direction and optical flow information in the vertical-axis direction.
6. The deep learning-based surgical video editing method according to claim 1, wherein inputting a surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result comprises:
dividing the video to be edited into time-ordered video frames, and obtaining the dense optical flow maps corresponding to the time-ordered video frames;
inputting the binary pairs corresponding to the time-ordered video frames and the dense optical flow maps into the surgical video classification editing model to obtain retained video frames and cut video frames;
and sorting the retained video frames by their time sequence numbers to obtain the initial video editing result.
7. The deep learning-based surgical video editing method according to claim 1, wherein post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result comprises:
acquiring the original surgical video frame rate;
dividing the initial video editing result into a plurality of video groups at the original surgical video frame rate;
determining a preset video frame proportion, and taking the video groups exceeding the preset video frame proportion as retained video frame results;
and merging all the retained video frame results, and outputting the surgical video editing result.
8. A deep learning-based surgical video editing system, comprising:
a preprocessing module, used for acquiring an original surgical video and a manually edited video to construct an initial training set, and performing binary-classification preprocessing on the initial training set to obtain a preprocessed training set;
a training module, used for adopting a preset deep learning network as an initial classification network model, and inputting the preprocessed training set, dense optical flow maps of adjacent image frames in the preprocessed training set, and video sequence-number classification labels into the initial classification network model for training to obtain a surgical video classification editing model;
an editing module, used for inputting the surgical video to be edited into the surgical video classification editing model and outputting an initial video editing result;
and a post-processing module, used for post-processing the initial video editing result based on the original surgical video frame rate to obtain a surgical video editing result.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep learning-based surgical video editing method of any of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the deep learning-based surgical video editing method of any of claims 1 to 7.
CN202311289118.6A 2023-10-08 2023-10-08 Surgical video editing method and system based on deep learning Pending CN117319578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311289118.6A 2023-10-08 2023-10-08 CN117319578A (en) Surgical video editing method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311289118.6A 2023-10-08 2023-10-08 CN117319578A (en) Surgical video editing method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN117319578A true CN117319578A (en) 2023-12-29

Family

ID=89249518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311289118.6A Pending CN117319578A (en) 2023-10-08 2023-10-08 Surgical video editing method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN117319578A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination