CN113742527A - Method and system for retrieving and extracting operation video clips based on artificial intelligence - Google Patents

Method and system for retrieving and extracting operation video clips based on artificial intelligence

Info

Publication number
CN113742527A
Authority
CN
China
Prior art keywords
video
pictures
identification model
extracting
operation stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111310650.2A
Other languages
Chinese (zh)
Inventor
刘杰
王玉贤
刘润文
吴少南
沈小江
王昕�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yurui Innovation Technology Co ltd
Original Assignee
Chengdu Yurui Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Yurui Innovation Technology Co ltd filed Critical Chengdu Yurui Innovation Technology Co ltd
Priority to CN202111310650.2A
Publication of CN113742527A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74 Browsing; Visualisation therefor
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content, the detected or recognised objects being people
    • G06F16/7844 Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a method and a system for retrieving and extracting surgical video clips based on artificial intelligence. The method comprises: dividing a video into clips and inputting pictures extracted from each clip at equal intervals into an operation stage identification model and an operation event identification model for identification; mapping the identification results of the operation stages and operation events from the start-stop times of each clip to the start-stop times of the complete video, and storing the identification results together with their corresponding times in a video retrieval and video extraction system; and loading the identification results and corresponding time data from the video retrieval and video extraction system and displaying them on a progress bar of the video playing system. By identifying the operation stages and operation events in a surgical video and displaying the corresponding time points on the progress bar of the playback timeline, the invention enables medical personnel to quickly locate the operation stage or operation event of interest when watching a surgical video, which greatly saves time and improves learning efficiency.

Description

Method and system for retrieving and extracting operation video clips based on artificial intelligence
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a system for retrieving and extracting operation video clips based on artificial intelligence.
Background
According to China's annual health statistics, roughly 61.72 million to 69.30 million surgical operations are performed nationwide per year. As expectations for medical quality keep rising, so does the demand for skilled surgical practitioners. On the one hand, medical students need to learn surgical skills and how to handle intraoperative emergencies as quickly as possible; on the other hand, to improve surgical quality, surgeons repeatedly review their own or others' surgical videos in order to refine their technique.
However, watching surgical videos for long periods takes a great deal of a doctor's time, and prolonged viewing reduces the attention of doctors and students, so important steps or operation events of interest may be missed. This is especially true for ultra-long surgical videos (for example, laparoscopic pancreaticoduodenectomy typically lasts 6-8 hours), which place a heavy burden on doctors reviewing them and make learning extremely inefficient. Moreover, when a specific surgical procedure, a particular operation event, or a certain category of operation events cannot be quickly located, the video cannot be searched efficiently to find the moments of interest for focused browsing. In addition, for management or business purposes, hospitals and administrative bodies need to extract key information from the surgical process, that is, to automatically extract key video clips or pictures for learning and research, knowledge base construction, safety behavior evaluation samples, and the selection of samples for evaluating the lead surgeon's technical competence; the existing technology, however, cannot meet these requirements.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method and a system for retrieving and extracting surgical video clips based on artificial intelligence, thereby solving the problems medical personnel currently face when watching surgical videos.
The purpose of the invention is realized by the following technical scheme: a method for retrieving and extracting surgical video clips based on artificial intelligence, the method comprising:
dividing a video into a plurality of video clips, extracting a plurality of pictures from each clip at equal intervals, inputting the pictures into an operation stage identification model and an operation event identification model, and, after image feature extraction, identifying the operation stage and the operation event in the pictures respectively;
mapping the identification results of the operation stages and operation events from the start-stop times of the video clips to the start-stop times of the complete video, and storing the identification results and their corresponding times in a video retrieval system and a video extraction system;
and loading the identification results and the corresponding time data from the video retrieval system and the video extraction system, displaying them on a progress bar of the video playing system, and marking the operation stages and the time periods in which operation events occur.
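As an illustration of the time-mapping step above, the following is a minimal Python sketch (the data structure and the fixed 30-second clip length are illustrative assumptions, not part of the claimed method) that converts clip-relative recognition results into start-stop times on the complete video:

```python
from dataclasses import dataclass

@dataclass
class ClipResult:
    clip_index: int   # position of the clip within the full video
    label: str        # recognized operation stage or operation event
    start_s: float    # start time within the clip, in seconds
    end_s: float      # end time within the clip, in seconds

def to_absolute_times(results, clip_length_s):
    """Map clip-relative recognition results to start-stop times on the complete video."""
    index = []
    for r in results:
        offset = r.clip_index * clip_length_s
        index.append({"label": r.label,
                      "start": offset + r.start_s,
                      "end": offset + r.end_s})
    return index

# Example: two recognized items found in 30-second clips
clips = [ClipResult(0, "stage: dissection", 5.0, 30.0),
         ClipResult(7, "event: bleeding", 2.5, 12.0)]
print(to_absolute_times(clips, clip_length_s=30.0))
```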
The method further comprises extracting the corresponding videos or pictures according to the operation events and the occurrence times of the operation stages in the video to build a surgical knowledge base, so that doctors or experts can quickly browse it for safety evaluation and evaluation of the lead surgeon's skill.
The steps of dividing a video into a plurality of video clips, extracting a plurality of pictures from each clip at equal intervals, inputting the pictures into the operation stage identification model and the operation event identification model, and, after image feature extraction, identifying the operation stage and the operation event in the pictures respectively, comprise:
dividing a video into a plurality of video clips, extracting N pictures from each clip at equal intervals, and storing the pictures as a four-dimensional tensor in the (N, C, H, W) format, wherein N represents the number of frames extracted from each video clip, C represents the number of channels of each picture, H represents the height of each picture, and W represents the width of each picture;
feeding the pictures represented by the four-dimensional tensor into a ResNet network consisting of a plurality of 2D convolutions, ReLU activation layers, batch normalization layers and a fully connected layer to extract image features, and storing the image features in the (M, S) format, wherein M represents the number of input pictures and S represents the length of a preset feature vector;
and feeding the M image feature vectors of the (M, S) format into an LSTM network one time step at a time, so as to identify the operation stage and the operation event across consecutive video frames.
The video retrieval system and the video extraction system are constructed by the following method:
performing deep-learning-based inference on the input pictures with the operation stage identification model and the operation event identification model to obtain operation stage information and operation event information;
displaying the operation stages and the time periods in which operation events occur on a progress bar for the user through customized video playing software, thereby constructing the video retrieval system;
and using customized extraction software to extract the corresponding video clips from the video, or to export the video as still pictures at a certain frame rate, based on the start and stop time points of the key surgical processes identified by the operation stage identification model and the operation event identification model, thereby constructing the video extraction system.
The method further comprises constructing the operation stage identification model and the operation event identification model before dividing the video; the construction steps of the operation stage identification model and the operation event identification model comprise:
establishing an operation stage theoretical model and an operation event theoretical model according to expert experience, guidelines and theory, and dividing the collected surgical videos into operation stages and operation events along the boundaries defined by the two theoretical models;
collecting a large amount of video data, converting it into pictures that meet the resolution requirement, and labeling the collected pictures with operation stage time periods and operation event time periods;
and randomly splitting the labeled operation stage data and operation event data into a training set, a validation set and a test set in corresponding proportions, and training, validating and testing on the pictures of these sets with a ResNet network and an LSTM network to complete the construction of the operation stage identification model and the operation event identification model.
The collecting of a large amount of video data as pictures that meet the resolution requirement, and the labeling of the collected pictures with operation stage time periods and operation event time periods, comprise the following steps:
collecting surgical video data such that the resolution of each surgical video is not lower than a preset value and the frame rate is not lower than a preset number of frames per second, and storing the data in the form of pictures;
uniformly transcoding the collected data into the same format with the ffmpeg software, and completing the preliminary labeling of the operation stage time periods and operation event time periods with the Anvil labeling software;
and having professionals manually label the preliminarily labeled video data and correct the pictures whose preliminary labels are unqualified, so as to obtain qualified labeled pictures.
The training, validating and testing on the pictures of the training set, the validation set and the test set with the ResNet network and the LSTM network to complete the construction of the operation stage identification model and the operation event identification model comprises the following steps:
storing the pictures of the training set, the validation set and the test set as four-dimensional tensors in the (N, C, H, W) format, wherein N represents the number of frames of each video clip, C represents the number of channels of each picture, H represents the height of each picture, and W represents the width of each picture;
feeding the pictures represented by the four-dimensional tensors into a ResNet network consisting of a plurality of 2D convolutions, ReLU activation layers, batch normalization layers and a fully connected layer to extract image features, and storing the image features in the (M, S) format;
feeding the image feature vectors of the (M, S) format into an LSTM network one time step at a time, identifying the operation stage and the operation event of consecutive video frames, and recording their start and end times;
and putting the recognition results into the cross entropy loss function

CELoss = -\sum_{c=1}^{M} y_{ic} \log(p_{ic})

to calculate the loss, and updating the model parameters by gradient descent, thereby completing the construction of the operation stage identification model and the operation event identification model.
A system for retrieving and extracting operation video clips based on artificial intelligence comprises an identification module, a video retrieving and extracting module and a video playing module;
the identification module is used for dividing the video into a plurality of video segments, extracting a plurality of pictures of the video segments at equal intervals, inputting the pictures into the operation stage identification model and the operation event identification model, and respectively identifying the operation stage and the operation event in the pictures after image feature extraction;
the video retrieval and extraction module is used for mapping the identification results of the operation stages and operation events from the start-stop times of the video clips to the start-stop times of the complete video, and then storing the identification results and their corresponding times in the video retrieval and extraction unit;
the video playing module is used for loading the identification results and the corresponding time data in the video retrieval system and the video extraction system, displaying the identification results and the corresponding time data in a progress bar of the video playing system, and marking an operation stage and a time period when an operation event occurs.
The system further comprises a construction module, wherein the construction module is used for constructing the video retrieval and extraction unit, the operation stage identification model and the operation event identification model.
The system also comprises a video collection and labeling module, wherein the video collection and labeling module is used for collecting a large amount of video data, converting it into pictures that meet the resolution requirement, and labeling the collected pictures with operation stage time periods and operation event time periods.
The invention has the following advantages: in the disclosed method and system for retrieving and extracting surgical video clips based on artificial intelligence, the operation stages and the associated operation events in a surgical video are identified and the corresponding time points are displayed on the progress bar of the playback timeline, so that medical staff can quickly locate the operation stage or operation event they need to focus on while watching a surgical video, greatly reducing viewing time and improving learning efficiency; at the same time, a surgical knowledge base and related resources are built to serve medical students and surgeons, yielding considerable social and economic benefits.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
fig. 2 is a schematic diagram of a display effect of a progress bar in a video playing system.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention relates to a method for retrieving and extracting a surgical video clip based on artificial intelligence, which specifically includes the following steps:
s1, constructing an operation stage identification model and an operation event identification model; the method specifically comprises the following steps:
s11, establishing an operation stage theoretical model and an operation event theoretical model according to expert experience, guidelines and theory, and dividing the collected surgical videos into operation stages and operation events along the boundaries defined by the two theoretical models;
and S12, collecting a large amount of video data, converting it into pictures that meet the resolution requirement, and labeling the collected pictures with operation stage time periods and operation event time periods. This step specifically comprises:
collecting surgical video data with a resolution of not lower than 720 x 560 and a frame rate of not lower than 21 frames per second, and storing the data in the form of pictures;
uniformly transcoding the collected data into the same MPEG-4 format with the ffmpeg software, and completing the preliminary labeling of the operation stage time periods and operation event time periods with the Anvil Video Annotation Research Tool;
and 6 qualified surgical specialists who received prior training are responsible for quality control: they manually label the preliminarily labeled video data and correct the pictures whose preliminary labels are unqualified, so as to obtain qualified labeled pictures.
And S13, randomly splitting the labeled operation stage data and operation event data into a training set, a validation set and a test set in a ratio of 8:1:1, and training, validating and testing on the pictures of these sets with a ResNet network and an LSTM network to complete the construction of the operation stage identification model and the operation event identification model.
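A minimal sketch of the 8:1:1 random split described in S13, assuming the labeled samples are simply identifiers in a Python list (function and variable names are illustrative):

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split labeled samples into training, validation and test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Example with dummy sample identifiers
train, val, test = split_dataset([f"clip_{i:04d}" for i in range(1000)])
print(len(train), len(val), len(test))  # 800 100 100
```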
All models are developed on the Anaconda and Qt Creator platforms, and image processing uses an NVIDIA Tesla V100 graphics processor.
S2, dividing the video into a plurality of video clips, extracting a plurality of pictures from each clip at equal intervals, inputting the pictures into the operation stage identification model and the operation event identification model, and, after image feature extraction, identifying the operation stage and the operation event in the pictures respectively;
further, it specifically includes:
s21, dividing a video into a plurality of video clips, extracting N pictures from each clip at equal intervals, and storing the pictures as a four-dimensional tensor in the (N, C, H, W) format, wherein N represents the number of frames extracted from each video clip, C represents the number of channels of each picture, H represents the height of each picture, and W represents the width of each picture;
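One possible realization of step S21, assuming OpenCV and PyTorch are used (the description does not prescribe specific libraries): N pictures are sampled at equal intervals from a clip and stacked into an (N, C, H, W) tensor.

```python
import cv2
import numpy as np
import torch

def sample_clip_frames(video_path, n_frames=16):
    """Extract n_frames pictures at equal intervals and return an (N, C, H, W) float tensor."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(torch.from_numpy(frame).permute(2, 0, 1))  # (C, H, W)
    cap.release()
    return torch.stack(frames).float() / 255.0  # (N, C, H, W)
```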
s22, putting the pictures represented by the four-dimensional tensor into a ResNet network consisting of a plurality of 2D convolutions, a ReLU active layer, a batch normalization layer and a full connection layer to extract image features, and storing the format of the image features as (M, S), wherein M represents the number of input pictures, and S represents the length of a preset feature vector;
further, the convolution layer or the full link layer has a calculation formula as follows:
Figure 336071DEST_PATH_IMAGE002
where y represents the computational output, n represents the number of neurons,
Figure 842139DEST_PATH_IMAGE003
represents the weight of the ith neuron,
Figure 120674DEST_PATH_IMAGE004
representing input data of the ith neuron, b adding an offset to the result of the computation, when convolution is performed
Figure 737600DEST_PATH_IMAGE003
And
Figure 621242DEST_PATH_IMAGE004
is a two-dimensional matrix when calculated for full connectivity
Figure 360528DEST_PATH_IMAGE005
And
Figure 532883DEST_PATH_IMAGE004
is a one-dimensional vector.
The batch normalization layer is computed as:

BN(x) = \gamma \cdot \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}} + \beta

where BN represents the batch normalization output, x represents the input data, E[x] represents the mean of the x tensor, Var[x] represents the variance of the x tensor, \epsilon is a very small constant that ensures the denominator is not 0, and \gamma and \beta are learnable coefficients.
The ReLU activation function is:

ReLU(z) = \max(0, z)

where ReLU represents the computed output, z represents the input tensor, and \max() takes the larger of its arguments.
And S23, feeding the image feature vectors of the (M, S) format into the LSTM network one time step at a time, so as to identify the operation stage and the operation event across consecutive video frames. The LSTM network is composed of a plurality of cells; it forgets information that lies too far from the current time and updates its current state as new video frames are input, thereby completing the recognition.
Further, one cell of the LSTM network is computed as:

i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)

wherein h_t represents the hidden state at time t; c_t represents the cell calculation result of the LSTM at time t; x_t represents the input data at time t; h_{t-1} represents the hidden state at time t-1; i_t represents the result of the input operation; f_t represents the result of the forget operation; g_t represents the result of the output operation; o_t represents the output of the temporal information fusion operation on the cell; W_{ii}, W_{hi}, b_{ii}, b_{hi} respectively represent the weight parameters and biases for extracting features from the input data in the LSTM cell; W_{if}, W_{hf}, b_{if}, b_{hf} respectively represent the weight parameters and biases of the forget gate in the LSTM cell; W_{ig}, W_{hg}, b_{ig}, b_{hg} respectively represent the weight parameters and biases of the output gate in the LSTM cell; W_{io}, W_{ho}, b_{io}, b_{ho} respectively represent the weight parameters and biases of the update gate in the LSTM cell; \sigma() denotes the sigmoid function, and \odot denotes element-wise multiplication.
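Steps S22-S23 could be sketched in PyTorch roughly as follows; ResNet-18, the feature length S = 512, the hidden size and the number of classes are assumptions made for illustration, since the description only requires a ResNet backbone followed by an LSTM:

```python
import torch
import torch.nn as nn
from torchvision import models

class PhaseEventRecognizer(nn.Module):
    """ResNet image features of shape (M, S), fed frame by frame into an LSTM."""

    def __init__(self, feature_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.backbone = backbone
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):            # frames: (N, C, H, W) pictures of one clip
        feats = self.backbone(frames)     # (M, S) with M = N, S = feature_dim
        feats = feats.unsqueeze(0)        # (1, M, S): one sequence of M time steps
        out, _ = self.lstm(feats)         # (1, M, hidden_dim)
        return self.head(out.squeeze(0))  # per-frame class scores, (M, num_classes)

model = PhaseEventRecognizer()
scores = model(torch.randn(16, 3, 224, 224))  # 16 pictures sampled from a clip
print(scores.shape)                           # torch.Size([16, 10])
```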
S3, mapping the identification results of the operation stages and operation events from the start-stop times of the video clips to the start-stop times of the complete video, and storing the identification results and their corresponding times in the video retrieval system and the video extraction system;
S4, loading the identification results and the corresponding time data from the video retrieval system and the video extraction system, displaying them on a progress bar of the video playing system, and marking the operation stages and the time periods in which operation events occur.
As shown in fig. 2, the operation stage information and the operation event information are presented on the playback timeline so that the surgeon can quickly retrieve the video clip to be watched; three progress bars are arranged below and aligned with the playback timeline, and when a certain event occurs or a certain stage is in progress during a period of time, the corresponding operation stage and operation event are marked at the corresponding time positions.
Meanwhile, according to the occurrence times of the operation events and operation stages in the video, the corresponding videos or pictures are extracted with the ffmpeg software to build a surgical knowledge base, so that doctors or experts can quickly browse it for safety evaluation and evaluation of the lead surgeon's skill.
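A sketch of such extraction with ffmpeg, using standard command-line options (paths, times and output names are placeholders):

```python
import subprocess

def extract_clip(src, start_s, end_s, dst):
    """Cut the video between start_s and end_s into a separate clip without re-encoding."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", str(start_s), "-to", str(end_s),
                    "-c", "copy", dst], check=True)

def extract_frames(src, start_s, end_s, fps, pattern):
    """Export still pictures from the same interval at the given frame rate."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", str(start_s), "-to", str(end_s),
                    "-vf", f"fps={fps}", pattern], check=True)

# Placeholders: an event recognized between 1325.0 s and 1410.5 s of surgery.mp4
extract_clip("surgery.mp4", 1325.0, 1410.5, "event_bleeding.mp4")
extract_frames("surgery.mp4", 1325.0, 1410.5, fps=1, pattern="event_%04d.jpg")
```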
Further, the video retrieval system and the video extraction system are constructed by the following method:
performing deep-learning-based inference on the input pictures with the operation stage identification model and the operation event identification model to obtain operation stage information and operation event information;
displaying the operation stages and the time periods in which operation events occur on a progress bar for the user through customized video playing software, thereby constructing the video retrieval system;
and using customized extraction software to extract the corresponding video clips from the video, or to export the video as still pictures at a certain frame rate, based on the start and stop time points of the key surgical processes identified by the operation stage identification model and the operation event identification model, thereby constructing the video extraction system.
Further, training, verifying and testing the images of the training set, the verifying set and the testing set through a ResNet network and an LSTM network, and completing the construction of the operation stage identification model and the operation event identification model comprises the following steps:
storing the pictures of the training set, the validation set and the test set as four-dimensional tensors in the (N, C, H, W) format, wherein N represents the number of frames of each video clip, C represents the number of channels of each picture, H represents the height of each picture, and W represents the width of each picture;
putting the picture represented by the four-dimensional tensor into a ResNet network consisting of a plurality of 2D convolutions, a ReLU activation layer, a batch normalization layer and a full connection layer to extract image characteristics, and storing the format of the image characteristics as (M, S);
and feeding the image feature vectors of the (M, S) format into an LSTM network one time step at a time, so as to identify the operation stage and the operation event of consecutive video frames. The LSTM network is composed of a plurality of cells; it forgets information that lies too far from the current time and updates its current state as new video frames are input, thereby completing the recognition.
The recognition results are put into the cross entropy loss function

CELoss = -\sum_{c=1}^{M} y_{ic} \log(p_{ic})

to calculate the loss, and the model parameters are updated by gradient descent, thereby completing the construction of the operation stage identification model and the operation event identification model, wherein CELoss represents the computed loss; M represents the number of classes; y_{ic} is an indicator variable (0 or 1) that equals 1 if class c is the true class of the sample and 0 otherwise; and p_{ic} represents the predicted probability that the observed sample belongs to class c.
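Continuing the assumptions of the earlier PyTorch sketch (a frame-level classifier returning (M, num_classes) scores), one training step with the cross entropy loss and gradient descent might look like:

```python
import torch
import torch.nn as nn

def train_one_clip(model, optimizer, frames, labels):
    """One gradient-descent update on a single clip.

    frames: (N, C, H, W) tensor of sampled pictures
    labels: (N,) tensor of per-frame stage/event class indices
    """
    model.train()
    criterion = nn.CrossEntropyLoss()   # CELoss = -sum_c y_ic * log(p_ic)
    optimizer.zero_grad()
    scores = model(frames)              # (M, num_classes) per-frame scores
    loss = criterion(scores, labels)
    loss.backward()                     # compute gradients
    optimizer.step()                    # gradient-descent parameter update
    return loss.item()

# Example usage with the recognizer sketched earlier (names are assumptions)
# model = PhaseEventRecognizer()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# loss = train_one_clip(model, optimizer,
#                       torch.randn(16, 3, 224, 224), torch.randint(0, 10, (16,)))
```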
The invention also relates to a system for searching and extracting operation video clips based on artificial intelligence, which comprises an identification module, a video searching and extracting module and a video playing module;
the identification module is used for dividing the video into a plurality of video segments, extracting a plurality of pictures of the video segments at equal intervals, inputting the pictures into the operation stage identification model and the operation event identification model, and respectively identifying the operation stage and the operation event in the pictures after image feature extraction;
the video retrieval and extraction module is used for mapping the identification results of the operation stages and operation events from the start-stop times of the video clips to the start-stop times of the complete video, and then storing the identification results and their corresponding times in the video retrieval and extraction unit;
the video playing module is used for loading the identification results and the corresponding time data in the video retrieval system and the video extraction system, displaying the identification results and the corresponding time data in a progress bar of the video playing system, and marking an operation stage and a time period when an operation event occurs.
Further, the system comprises a construction module, wherein the construction module is used for constructing the video retrieval and extraction unit, the operation stage identification model and the operation event identification model.
The system further comprises a video collection and labeling module, wherein the video collection and labeling module is used for collecting a large amount of video data, converting it into pictures that meet the resolution requirement, and labeling the collected pictures with operation stage time periods and operation event time periods.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein; various other combinations, modifications, and environments may be resorted to within the scope of the inventive concept described herein, whether through the teachings above or the skill and knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (10)

1. A method for searching and extracting operation video clips based on artificial intelligence is characterized by comprising the following steps: the method comprises the following steps:
dividing a video into a plurality of video segments, extracting a plurality of pictures of the video segments at equal intervals, inputting the pictures into an operation stage identification model and an operation event identification model, and respectively identifying an operation stage and an operation event in the pictures after image feature extraction;
mapping the identification results of the operation stage and the operation event from the start-stop time of the video clip to the start-stop time of the complete video, and storing the identification results and their corresponding times in a video retrieval system and a video extraction system;
and loading the recognition results and the corresponding time data in the video retrieval system and the video extraction system, displaying the recognition results and the corresponding time data in a progress bar of the video playing system, and marking the operation stage and the time period of the operation event.
2. The method for retrieving and extracting the surgical video clip based on artificial intelligence of claim 1, wherein: the method further comprises extracting the corresponding videos or pictures according to the operation events and the occurrence times of the operation stages in the video to build a surgical knowledge base, so that doctors or experts can quickly browse it for safety evaluation and evaluation of the lead surgeon's skill.
3. The method for retrieving and extracting the surgical video clip based on artificial intelligence of claim 1, wherein: the method comprises the following steps of dividing a video into a plurality of video segments, extracting a plurality of pictures of the video segments at equal intervals, inputting the pictures into an operation stage identification model and an operation event identification model, and respectively identifying the operation stage and the operation event in the pictures after image feature extraction, wherein the steps comprise:
dividing a video into a plurality of video clips, extracting N pictures from each clip at equal intervals, and storing the pictures as a four-dimensional tensor in the (N, C, H, W) format, wherein N represents the number of frames extracted from each video clip, C represents the number of channels of each picture, H represents the height of each picture, and W represents the width of each picture;
putting the pictures represented by the four-dimensional tensor into a ResNet network consisting of a plurality of 2D convolutions, a ReLU activation layer, a batch normalization layer and a full connection layer to extract image characteristics, and storing the format of the image characteristics as (M, S), wherein M represents the number of input pictures, and S represents the length of a preset characteristic vector;
and feeding the M image feature vectors of the (M, S) format into an LSTM network one time step at a time, so as to identify the operation stage and the operation event across consecutive video frames.
4. The method for retrieving and extracting the surgical video clip based on artificial intelligence of claim 1, wherein: the video retrieval system and the video extraction system are constructed by the following method:
performing deep learning-based reasoning on the input picture by adopting an operation stage identification model and an operation event identification model to obtain operation stage information and operation event information;
outputting an operation stage and an operation event occurrence time period for a user through customized video playing software to be displayed on a progress bar, and realizing the construction of a video retrieval system;
and using customized extraction software to extract the corresponding video clips from the video, or to export the video as still pictures at a certain frame rate, based on the start and stop time points of the key surgical processes identified by the operation stage identification model and the operation event identification model, thereby constructing the video extraction system.
5. The method for retrieving and extracting the surgical video clip based on artificial intelligence of claim 1, wherein: the method also comprises the step of constructing an operation stage identification model and an operation event identification model before dividing the video; the construction steps of the surgery stage identification model and the surgery event identification model comprise:
establishing an operation stage theoretical model and an operation event theoretical model according to expert experience, guidelines and theories, and carrying out boundary division on the operation stage and the operation event according to the operation stage theoretical model and the operation event theoretical model on the collected operation video;
collecting a large amount of video data into pictures according to the requirement of resolution, and labeling the collected pictures with operation stage time periods and operation event time periods;
and randomly distributing the labeled surgical phase data and surgical event data into a training set, a verification set and a test set according to corresponding proportions, and training, verifying and testing images of the training set, the verification set and the test set through a ResNet network and an LSTM network to complete construction of a surgical phase identification model and a surgical event identification model.
6. The method for retrieving and extracting operation video clip based on artificial intelligence as claimed in claim 5, wherein: the collecting a large amount of video data into pictures according to the requirement of resolution, and labeling the collected pictures in the operation stage time period and the operation event time period comprises the following steps:
collecting surgical video data such that the resolution of each surgical video is not lower than a preset value and the frame rate is not lower than a preset number of frames per second, and storing the data in the form of pictures;
uniformly transcoding the collected pictures into the same format through ffmpeg software, and completing primary labeling of the operation stage time period and the operation event time period through labeling software Anvil;
and manually labeling the video data subjected to preliminary labeling by a professional, and modifying the picture with unqualified preliminary labeling to obtain a picture with qualified labeling.
7. The method for retrieving and extracting operation video clip based on artificial intelligence as claimed in claim 5, wherein: the training, verifying and testing of the training set, the verifying set and the testing set pictures through the ResNet network and the LSTM network to complete the construction of the operation stage identification model and the operation event identification model comprises the following steps:
storing the pictures of the training set, the validation set and the test set as four-dimensional tensors in the (N, C, H, W) format, wherein N represents the number of frames of each video clip, C represents the number of channels of each picture, H represents the height of each picture, and W represents the width of each picture;
putting the pictures represented by the four-dimensional tensor into a ResNet network consisting of a plurality of 2D convolutions, a ReLU activation layer, a batch normalization layer and a full connection layer to extract image characteristics, and storing the format of the image characteristics as (M, S), wherein M represents the number of input pictures, and S represents the length of a preset characteristic vector;
feeding the image feature vectors of the (M, S) format into an LSTM network one time step at a time, identifying the operation stage and the operation event of consecutive video frames, and recording their start and end times;
and putting the recognition results into the cross entropy loss function

CELoss = -\sum_{c=1}^{M} y_{ic} \log(p_{ic})

to calculate the loss, and updating the model parameters by gradient descent, thereby completing the construction of the operation stage identification model and the operation event identification model.
8. A system for retrieving and extracting operation video clips based on artificial intelligence is characterized in that: the system comprises an identification module, a video retrieval and extraction module and a video playing module;
the identification module is used for dividing the video into a plurality of video segments, extracting a plurality of pictures of the video segments at equal intervals, inputting the pictures into the operation stage identification model and the operation event identification model, and respectively identifying the operation stage and the operation event in the pictures after image feature extraction;
the video retrieval and extraction module is used for mapping the identification results of the operation stage and the operation event from the start-stop time of the video clip to the start-stop time of the complete video, and then storing the identification results and their corresponding times in the video retrieval and extraction unit;
the video playing module is used for loading the identification results and the corresponding time data in the video retrieval system and the video extraction system, displaying the identification results and the corresponding time data in a progress bar of the video playing system, and marking an operation stage and a time period when an operation event occurs.
9. The system for retrieving and extracting video clip of surgery based on artificial intelligence as claimed in claim 8, wherein: the system further comprises a construction module, wherein the construction module is used for constructing the video retrieval and extraction unit, the operation stage identification model and the operation event identification model.
10. The system for retrieving and extracting video clip of surgery based on artificial intelligence as claimed in claim 8, wherein: the system also comprises a video collection and labeling module, wherein the video collection and labeling module is used for collecting a large amount of video data into pictures according to the requirement of resolution, and labeling the collected pictures in the operation stage time period and the operation event time period.
CN202111310650.2A 2021-11-08 2021-11-08 Method and system for retrieving and extracting operation video clips based on artificial intelligence Pending CN113742527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111310650.2A CN113742527A (en) 2021-11-08 2021-11-08 Method and system for retrieving and extracting operation video clips based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111310650.2A CN113742527A (en) 2021-11-08 2021-11-08 Method and system for retrieving and extracting operation video clips based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN113742527A true CN113742527A (en) 2021-12-03

Family

ID=78727661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111310650.2A Pending CN113742527A (en) 2021-11-08 2021-11-08 Method and system for retrieving and extracting operation video clips based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN113742527A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200237452A1 (en) * 2018-08-13 2020-07-30 Theator inc. Timeline overlay on surgical video
US10729502B1 (en) * 2019-02-21 2020-08-04 Theator inc. Intraoperative surgical event summary
CN112932663A (en) * 2021-03-02 2021-06-11 成都与睿创新科技有限公司 Intelligent auxiliary method and system for improving safety of laparoscopic cholecystectomy

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339299A (en) * 2021-12-27 2022-04-12 司法鉴定科学研究院 Video evidence obtaining method for automobile driving recorder
TWI778900B (en) * 2021-12-28 2022-09-21 慧術科技股份有限公司 Marking and teaching of surgical procedure system and method thereof
CN116193231A (en) * 2022-10-24 2023-05-30 成都与睿创新科技有限公司 Method and system for handling minimally invasive surgical field anomalies

Similar Documents

Publication Publication Date Title
CN113742527A (en) Method and system for retrieving and extracting operation video clips based on artificial intelligence
CN110464366A (en) A kind of Emotion identification method, system and storage medium
WO2021047237A1 (en) Uploader matching method and device
JPH10326286A (en) Similarity retrieval device and recording medium where similarity retrival program is recorded
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN112309215A (en) Demonstration system for clinical medicine internal medicine teaching and control method thereof
CN107292103A (en) A kind of prognostic chart picture generation method and device
JP2021108146A (en) Information processing device, information processing method and information processing program
CN116862931A (en) Medical image segmentation method and device, storage medium and electronic equipment
CN116016869A (en) Campus safety monitoring system based on artificial intelligence and Internet of things
CN107707940A (en) Video sequencing method, device, server and system
Bärmann et al. Where did i leave my keys?-episodic-memory-based question answering on egocentric videos
Beriwal et al. Techniques for suicidal ideation prediction: a qualitative systematic review
Song RETRACTED: Image processing technology in American football teaching
CN113313254B (en) Deep learning model unbiasing method for memory enhancement element learning
CN112183108B (en) Inference method, system, computer equipment and storage medium for short text topic distribution
CN114862141A (en) Method, device and equipment for recommending courses based on portrait relevance and storage medium
CN112396114A (en) Evaluation system, evaluation method and related product
CN113822389B (en) Digestive tract disease classification system based on endoscope picture
JP7180921B1 (en) Program, information processing device and information processing method
WO2021131762A1 (en) Exercise menu evaluating device, method, and computer-readable medium
KR102648225B1 (en) Method of providing emotional intelligence training for the mentally disadvantaged based on virtual reality and device using the same
CN110457507B (en) Picture identification processing method and device, electronic equipment and storage medium
CN117743407A (en) Method, device, equipment and readable storage medium for recommending sports item
CN113948210A (en) Data evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211203)