WO2023116351A1 - Responsibility frame extraction method, video classification method, device and medium - Google Patents

Responsibility frame extraction method, video classification method, device and medium

Info

Publication number
WO2023116351A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
frame
image
responsible
matrix
Prior art date
Application number
PCT/CN2022/134699
Other languages
French (fr)
Chinese (zh)
Inventor
蒋逸韬
石思远
崔晨
Original Assignee
上海微创卜算子医疗科技有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202111572826.1A external-priority patent/CN116343073A/en
Priority claimed from CN202210639251.9A external-priority patent/CN117237263A/en
Application filed by 上海微创卜算子医疗科技有限公司
Publication of WO2023116351A1 publication Critical patent/WO2023116351A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • the present application relates to the technical field of image processing, in particular to a responsible frame extraction method, a video classification method, electronic equipment and a storage medium.
  • Ultrasound is a common medical imaging examination modality that can be used for the diagnosis of diseases of various tissues and organs.
  • ultrasound hardware has been continuously improved in terms of portability, and handheld ultrasound devices have achieved a balance of functionality and portability, making them suitable for grassroots (primary-care) disease screening scenarios.
  • due to the high granularity of ultrasound images, they contain a large amount of speckle noise, artifacts, attenuation and other problems; ultrasound diagnosis is therefore difficult to standardize and relies heavily on the clinical experience of sonographers.
  • Grassroots medical institutions such as primary hospitals, community hospitals, and township clinics lack experienced sonographers, and it is difficult to make accurate benign and malignant judgments on ultrasound videos.
  • the purpose of this application is to provide a responsible frame extraction method, video classification method, electronic device and storage medium that can automatically find, within a video, the responsible frames that contribute different important features to video classification (such as the classification of benign and malignant nodule videos), thereby improving the accuracy of video classification, such as the classification of benign and malignant nodule videos.
  • the application provides a method for extracting responsible frames, including: obtaining the video to be extracted; using the skeleton network of a static image classification neural network model to perform feature extraction on each frame image in the video to be extracted, so as to obtain the feature matrix of each frame image; performing a maximum pooling operation on the feature matrices of all frame images to obtain the video feature matrix of the video to be extracted; and extracting a preset number of responsible frames according to the feature matrix of each frame image and the video feature matrix.
  • extracting a preset number of responsible frames according to the feature matrix of each frame image and the video feature matrix includes: multiplying the feature value of each feature dimension in the video feature matrix by the importance value of the feature dimension to obtain the video feature importance matrix; for each frame image, multiplying the feature value of each feature dimension in the feature matrix of the frame image by the importance value of the feature dimension to obtain the feature importance matrix of the frame image; and extracting a preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame image.
  • extracting a preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame image includes: step A1, using the video feature importance matrix as the current video feature importance matrix; step B1, for each frame image, subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image; step C1, for each frame image, adding the eigenvalues of each feature dimension in the remaining feature importance matrix corresponding to the frame image to obtain the remaining information entropy corresponding to the frame image; step D1, taking the image with the smallest remaining information entropy as the current responsible frame; step E1, using the remaining feature importance matrix corresponding to the current responsible frame as the new current video feature importance matrix; and repeating steps B1 to E1 until a preset number of responsible frames are extracted.
  • subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image includes: subtracting the eigenvalue of the corresponding feature dimension in the feature importance matrix of the frame image from the eigenvalue of each feature dimension in the current video feature importance matrix to obtain the eigenvalue difference of each feature dimension; for each feature dimension, if the eigenvalue difference of the feature dimension is less than 0, using 0 as the eigenvalue of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame image; and if the eigenvalue difference of the feature dimension is greater than or equal to 0, using the eigenvalue difference of the feature dimension as the eigenvalue of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame image.
  • extracting a preset number of responsible frames according to the feature matrix of each frame image includes: for each frame image, multiplying the feature value of each feature dimension in the feature matrix of the frame image by the contribution weight value of the feature dimension to obtain the feature entropy matrix of the frame image; performing a maximum pooling operation on the feature entropy matrices of all frame images to obtain the video feature entropy matrix of the video to be extracted; and extracting a preset number of responsible frames according to the feature entropy matrix of each frame image and the video feature entropy matrix.
  • extracting a preset number of responsible frames according to the feature entropy matrix of each frame image and the video feature entropy matrix includes: for each frame image, adding the eigenvalues of all feature dimensions in the feature entropy matrix of the frame image to obtain the evaluation score of the frame image; adding the eigenvalues of all feature dimensions in the video feature entropy matrix to obtain the evaluation score of the video to be extracted; and extracting a preset number of responsible frames according to the evaluation score of each frame image and the evaluation score of the video to be extracted, wherein the difference between the evaluation score of the video to be extracted and the evaluation score of the image set formed by the preset number of responsible frames is the smallest.
  • extracting a preset number of responsible frames according to the evaluation score of each frame image and the evaluation score of the video to be extracted includes: step A2, for each frame image, calculating the difference between the evaluation score of the video to be extracted and the evaluation score of the frame image to obtain the feature entropy difference of the frame image; step B2, determining the image with the smallest feature entropy difference as a responsible frame; step C2, forming an image set from all responsible frames together with each non-responsible frame, respectively, and calculating the evaluation score of each image set; step D2, for each image set, calculating the difference between the evaluation score of the video to be extracted and the evaluation score of the image set to obtain the feature entropy difference of the image set; step E2, determining all images in the image set with the smallest feature entropy difference as responsible frames; and repeating steps C2 to E2 until the preset number of responsible frames are extracted.
  • the responsible frame extraction method further includes: using a target detection neural network model to perform region of interest extraction on each frame image in the acquired video to be extracted, so as to obtain the region of interest image corresponding to each frame image; using the skeleton network of the static image classification neural network model to perform feature extraction on each frame of region of interest image to obtain the feature matrix of each frame of region of interest image; and, according to the feature matrix of each frame of region of interest image, extracting malignant responsible frames until the malignant feature entropy corresponding to the malignant responsible frame set formed by all the malignant responsible frames reaches a minimum value, and/or extracting benign responsible frames until the benign feature entropy corresponding to the benign responsible frame set formed by all the benign responsible frames reaches a minimum value.
  • the present application also provides a video classification method, including: using the responsible frame extraction method described above to extract a preset number of responsible frames from the acquired video; and classifying the video according to the feature matrices of the preset number of responsible frames.
  • classifying the video according to the feature matrices of the preset number of responsible frames includes: performing a maximum pooling operation on the feature matrices of the preset number of responsible frames to obtain the feature matrix of the responsible frame set; and performing video classification according to the feature matrix of the responsible frame set.
  • the performing video classification according to the feature matrix of the responsible frame set includes: inputting the feature matrix of the responsible frame set into a video classification model to perform video classification.
  • the video classification model is a random forest classification model.
  • the video classification method further includes displaying the classification result of the video and the extracted preset number of responsible frames.
  • the present application also provides an electronic device, including a processor and a memory, where a computer program is stored on the memory, and when the computer program is executed by the processor, the responsible frame extraction method described above or the video classification method described above is implemented.
  • the present application also provides a readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the responsible frame extraction method described above or the video classification method described above is implemented.
  • the responsible frame extraction method, video classification method, electronic equipment and storage medium provided by this application have the following advantages:
  • in the responsible frame extraction method, electronic device and storage medium provided by this application, the video to be extracted is first obtained; then the skeleton network of the static image classification neural network model is used to perform feature extraction on each frame image in the video to be extracted to obtain the feature matrix of each frame image; and finally a preset number of responsible frames is extracted according to the feature matrix of each frame image.
  • the extracted responsible frames can lay a good foundation for subsequent video classification and effectively eliminate the interference caused by noise frame images during video classification.
  • the video classification method provided by this application extracts a preset number of responsible frames by using the above-mentioned responsible frame extraction method, and classifies the video according to the feature matrices of the extracted preset number of responsible frames. Since the video classification method provided by this application uses the above-mentioned responsible frame extraction method to extract a preset number of responsible frames, the video classification method provided by this application has all the advantages of that responsible frame extraction method. In addition, since the video classification method provided by this application classifies the video based on the extracted preset number of responsible frames, it can effectively reduce the interference of noise frames in the video and effectively improve the accuracy of video classification.
  • FIG. 1 is a schematic flow diagram of a responsibility frame extraction method in an embodiment of the present application
  • Fig. 2 is a schematic diagram of an adjusted single frame image in the video to be extracted in a specific example
  • Fig. 3 is a schematic diagram of obtaining the feature matrix of each frame image in the video to be extracted in a specific example of the present application
  • Fig. 4 is a schematic diagram of acquiring video feature importance matrix and feature importance matrix of each frame image in a specific example of the present application
  • FIG. 5 is a schematic diagram of a specific flow of extracting a responsibility frame in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of obtaining the remaining feature importance matrix in a specific example of the present application.
  • FIG. 7 is a schematic diagram of a specific flow of extracting a responsibility frame in another embodiment of the present application.
  • Figure 8a is a schematic diagram of obtaining a video feature entropy matrix in a specific example of the present application.
  • Fig. 8b is a schematic diagram of selecting the first frame responsibility frame in a specific example of the present application.
  • Fig. 8c is a schematic diagram of selecting the second frame responsibility frame in a specific example of the present application.
  • FIG. 9 is a flowchart of a responsibility frame extraction method provided in an embodiment of the present application.
  • Fig. 10 is a schematic diagram of a medical image provided by a specific example of the present application.
  • Fig. 11 is a schematic diagram of the region of interest image extracted from Fig. 10;
  • FIG. 12 is a schematic flowchart of extracting malignant responsible frames provided by an embodiment of the present application.
  • Fig. 13 is a schematic diagram of the relationship between the feature entropy of the responsible frame image set and the number of responsible frames provided by an embodiment of the present application;
  • FIG. 14 is a schematic diagram of a specific flow for extracting benign responsibility frames provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of video classification using a random forest classifier provided in an embodiment of the present application.
  • FIG. 16 is a schematic diagram of an adjustment responsibility frame provided by an embodiment of the present application.
  • FIG. 17 is a schematic flow diagram of a video classification method in an embodiment of the present application.
  • FIG. 18 is a schematic block diagram of an electronic device in an embodiment of the present application.
  • the core idea of this application is to provide a responsible frame extraction method, video classification method, electronic device and storage medium that can automatically find, within a video, the responsible frames that contribute different important features to video classification (such as the classification of benign and malignant nodules), so as to improve the accuracy of video classification (such as the classification of benign and malignant nodule videos).
  • the responsible frame extraction method and the video classification method of the embodiments of the present application can be applied to the electronic device of the embodiments of the present application, where the electronic device can be a personal computer, a mobile terminal, etc., and the mobile terminal can be a mobile phone, a tablet computer, or another hardware device with any of various operating systems.
  • although this document takes as an example the extraction of a preset number of responsible frames that can contribute different features to the classification of medical videos, as those skilled in the art can understand, this application can also extract a preset number of responsible frames that can contribute different features to the classification of videos in fields other than medical video, and this application does not limit this.
  • the present application provides a responsibility frame extraction method, please refer to FIG. 1 , which schematically shows a flow diagram of the responsibility frame extraction method provided by an embodiment of the present application.
  • the responsibility frame extraction method includes the following steps:
  • Step S110 acquiring the video to be extracted.
  • Step S120 using the skeleton network of the static image classification neural network model to perform feature extraction on each frame of image in the video to be extracted, so as to obtain a feature matrix of each frame of image.
  • Step S130 extracting a preset number of responsible frames according to the feature matrix of each frame of image.
  • the video to be extracted can be an ultrasound scan video (such as scan data of breast cancer, thyroid nodules, etc.); of course, as those skilled in the art can understand, the video to be extracted can also be a medical video collected by another imaging device, for example, a medical video collected by an endoscope.
  • the video to be extracted may also be a non-medical video, which is not limited in this application. The responsible frame extraction method provided by this application can automatically extract multiple responsible frames whose contributed features do not repeat, that is, responsible frames with diverse features can be extracted, which lays a good foundation for subsequent video classification and effectively eliminates the interference caused by noise frame images during video classification. For example, by extracting a preset number of responsible frames from an ultrasound video of a thyroid nodule, a good foundation can be laid for subsequently judging accurately whether the thyroid nodule in the ultrasound video is benign or malignant.
  • the method also includes adjusting the size of each frame image in the video to be extracted, so as to adjust the size of each frame image in the video to be extracted to a preset size.
  • the preset size can be set according to specific conditions, which is not limited in this application.
  • the size of the adjusted video to be extracted is 100 ⁇ 224 ⁇ 224 ⁇ 3 (number of frames ⁇ width ⁇ height ⁇ number of channels).
  • FIG. 2 schematically shows a schematic diagram of an adjusted single-frame image in a video to be extracted in a specific example.
  • the static image classification neural network model includes a skeleton network for feature extraction and a classification network for classification.
  • the skeleton network can use different convolutional neural networks, such as MobileNet network, DenseNet121 network, Xception network and so on.
  • the classification network includes at least one fully connected layer, and the fully connected layer is used to perform nonlinear mapping regression on the features extracted by the classification network to obtain classification results.
  • FIG. 3 schematically shows a schematic diagram of acquiring the feature matrix of each frame of image in the video to be extracted in a specific example of the present application.
  • the skeleton network in the static image classification neural network model performs multiple convolution operations (for example, N convolutions) on each frame image in the video to be extracted to obtain the feature matrix of each frame image. The feature matrix of each frame image can be represented by a 1×k matrix, where k represents the feature dimension, which is determined by the structure of the static image classification neural network model.
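  • As an illustrative sketch only (the patent does not prescribe a specific framework), the per-frame feature extraction described above can be performed with a standard pretrained backbone; the use of TensorFlow/Keras and MobileNetV2, the 224×224 input size, and the variable names below are assumptions.

```python
# Illustrative sketch: extract a 1 x k feature matrix for every frame of a video
# using the backbone (skeleton network) of an image classification model.
import tensorflow as tf

# Backbone with global average pooling, so each frame yields a 1 x k feature vector.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg")

def extract_frame_features(video):
    """video: array of shape (num_frames, 224, 224, 3) with pixel values in [0, 255]."""
    frames = tf.keras.applications.mobilenet_v2.preprocess_input(video)
    return backbone.predict(frames, verbose=0)  # shape: (num_frames, k)
```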
  • the static image classification neural network model is obtained through the following steps of training:
  • obtaining an original training sample, the original training sample including an original sample image and a classification label corresponding to the original sample image;
  • the pre-built static image classification neural network model is trained according to the expanded training samples and the initial values of the model parameters of the static image classification neural network model until the preset training end condition is satisfied.
  • a data amplification operation is required to increase the performance of the static image classification neural network model.
  • a random rigid transformation may be performed on the original sample image, specifically including rotation, scaling, translation, flipping, and grayscale transformation. More specifically, the original sample image can be translated by -10 to 10 pixels, rotated by -10° to 10°, flipped horizontally, flipped vertically, scaled by 0.9 to 1.1 times, subjected to a grayscale transformation, etc., to complete the augmentation of the training sample data.
  • the classification label does not need to be transformed when performing sample expansion; that is, since each expanded sample image is obtained by a different transformation of the same original sample image, the classification labels corresponding to the obtained expanded sample images are all consistent with the classification label corresponding to the original sample image.
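  • A minimal sketch of the random rigid-transformation augmentation described above, assuming SciPy/NumPy for the transforms; the ranges follow the text (±10 px translation, ±10° rotation, flips, 0.9-1.1 scaling), and the label is returned unchanged.

```python
# Illustrative augmentation sketch; library choices are assumptions.
import random
import numpy as np
from scipy import ndimage

def augment(image, label):
    """image: (H, W, C) array; label is unchanged by augmentation."""
    img = ndimage.shift(image, shift=(random.uniform(-10, 10), random.uniform(-10, 10), 0))
    img = ndimage.rotate(img, angle=random.uniform(-10, 10), reshape=False)
    if random.random() < 0.5:
        img = np.fliplr(img)                      # horizontal flip
    if random.random() < 0.5:
        img = np.flipud(img)                      # vertical flip
    s = random.uniform(0.9, 1.1)                  # scale by 0.9-1.1 times
    img = ndimage.zoom(img, zoom=(s, s, 1), order=1)
    img = np.clip(img * random.uniform(0.9, 1.1), 0, 255)  # simple grayscale transformation
    return img, label
```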
  • the model parameters of the static image classification neural network model include two categories: feature parameters and hyperparameters.
  • the feature parameter is a parameter for learning image features.
  • Feature parameters include weight parameters and bias parameters.
  • Hyperparameters are parameters that are artificially set during training. Only by setting appropriate hyperparameters can feature parameters be learned from samples. Hyperparameters can include learning rate, number of hidden layers, convolution kernel size, number of training iterations, and batch size for each iteration. The learning rate can be thought of as a step size. For example, in this application, the learning rate can be set to 0.001, and the number of training iterations is 100.
  • the preset training end condition is that the error value between the predicted classification result of the sample image in the expanded training sample and the corresponding classification label converges to a preset error value.
  • the training process of the static image classification neural network model is a multi-cycle iterative process. Therefore, the training can be ended by setting the number of iterations, that is, the preset training end condition can also be that the number of iterations reaches the preset number of iterations.
  • the training of the pre-built static image classification neural network model according to the expanded training samples and the initial values of the model parameters of the static image classification neural network model includes:
  • the pre-built static image classification neural network model is trained using a stochastic gradient descent method.
  • this method of obtaining derivatives of the loss and updating the parameters along them is the gradient descent method. Therefore, using the gradient descent method to train the static image classification neural network model allows the training of the static image classification neural network model to be realized quickly and simply.
  • the gradient descent method is mainly used to train the static image classification neural network model, and then the back propagation algorithm is used to update and optimize the weight parameters and bias parameters in the static image classification neural network model.
  • the gradient descent method treats the direction in which the slope of the curve is largest as the direction in which the optimal value can be reached fastest.
  • the backpropagation method uses the chain rule to calculate the partial derivatives used to update the weights, and updates the parameters through continuous iterative training so as to learn image features.
  • the method by which the backpropagation algorithm updates the weight parameters and bias parameters is as follows:
  • where y is the real value of the sample, ŷ is the predicted value of the output layer, and δ^L denotes the partial derivative (sensitivity) with respect to the output layer parameters;
  • W^l represents the weight parameter of the l-th layer;
  • δ^(l+1) represents the sensitivity value of the (l+1)-th layer;
  • f'(z^l) represents the partial derivative of the l-th layer;
  • W^l and b^l represent the weight parameter and bias parameter of layer l, respectively;
  • a^l represents the output value of layer l;
  • δ^(l+1) represents the sensitivity value of layer l+1.
  • the stochastic gradient descent method is used to train the pre-built static image classification neural network model, including:
  • Step 1 using the expanded training sample as the input of the static image classification neural network model, and obtaining the predicted classification result of the expanded sample image according to the initial value of the model parameter of the static image classification neural network model;
  • Step 2 Calculate a loss function value according to the predicted classification result of the expanded sample image and the classification label corresponding to the expanded sample image;
  • Step 3: judging whether the loss function value converges to the preset error value; if yes, the training ends; if not, adjusting the model parameters of the static image classification neural network model, updating the initial values of the model parameters of the static image classification neural network model to the adjusted model parameters, and returning to step 1.
  • the loss function value does not converge to the preset error value, it means that the static image classification neural network model is not accurate, and it is necessary to continue training the static image classification neural network model.
  • the loss function is the objective function used to optimize the neural network, and the neural network can learn better by minimizing the loss function. Because the static image classification neural network model needs to learn image features in a certain situation, that is, it needs to define a suitable loss function to learn effective features. This application uses the binary classification network loss function L(W,b) as the loss function.
  • the binary classification network loss function L(W, b) is as follows:
  • where W and b represent the weight parameters and bias parameters of the static image classification neural network model;
  • m is the number of training samples, and m is a positive integer;
  • x_i represents the input of the i-th training sample;
  • f_{W,b}(x_i) represents the predicted classification result of the i-th training sample;
  • y_i represents the classification label of the i-th training sample.
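  • The loss expression itself is not reproduced in this text; written out with the symbols defined above, the standard binary cross-entropy form it describes is (a reconstruction, not a verbatim quotation of the patent):

```latex
L(W,b) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log f_{W,b}(x_i) + (1 - y_i) \log\big(1 - f_{W,b}(x_i)\big) \Big]
```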
  • the extraction of a preset number of responsible frames according to the feature matrix of each frame image includes:
  • performing the maximum pooling operation on the feature matrices of all frame images means taking, along the column direction (that is, the direction of the feature dimension), the largest eigenvalue across the feature matrices of all frame images (for example, 100 frames) in the video to be extracted, so as to obtain a 1×k video feature matrix in which the eigenvalue of each feature dimension is the largest eigenvalue of that feature dimension over the feature matrices of all frame images. The obtained video feature matrix integrates the important information that each frame image can contribute. Since the video is essentially a superposition of multiple frames of images, the feature information of the video is scattered across the frames, so the video feature matrix obtained by performing the maximum pooling operation on the feature matrices of all frame images represents the features of the video to be extracted.
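  • A minimal sketch of this max-pooling step, assuming the per-frame feature matrices have already been stacked into a (num_frames, k) NumPy array:

```python
import numpy as np

def video_feature_matrix(frame_features: np.ndarray) -> np.ndarray:
    """frame_features: (num_frames, k) per-frame feature matrices.
    Returns the 1 x k video feature matrix: the largest eigenvalue of each
    feature dimension over all frames."""
    return frame_features.max(axis=0, keepdims=True)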
  • extracting a preset number of responsible frames according to the feature matrix of each frame image and the video feature matrix including:
  • the importance value of each feature dimension can represent the importance of the feature of this feature dimension in the random forest classification model described below; the importance values are defined by the random forest classification model and are all positive numbers. Of course, as those skilled in the art can understand, in some other implementations the importance value of each feature dimension can also represent the importance of the features of this feature dimension in a classification model other than the random forest classification model, and this application does not limit this.
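  • Where the importance values come from a random forest, they can for example be read from scikit-learn's feature_importances_ attribute; the training data, the feature dimension k = 1280 and the variable names below are placeholders, not values from the patent.

```python
# Illustrative sketch: obtain per-dimension importance values from a random forest
# and use them to build the feature importance matrices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

k = 1280                                             # assumed feature dimension
train_video_features = np.random.rand(200, k)        # placeholder training data
train_labels = np.random.randint(0, 2, size=200)     # placeholder benign/malignant labels

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(train_video_features, train_labels)
importance = rf.feature_importances_                 # shape (k,), all non-negative

frame_features = np.random.rand(100, k)              # per-frame feature matrices (placeholder)
video_feature = frame_features.max(axis=0)           # video feature matrix (max pooling)
video_importance_matrix = video_feature * importance     # video feature importance matrix
frame_importance_matrices = frame_features * importance  # feature importance matrix per frame
```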
  • FIG. 4 schematically shows a schematic diagram of acquiring video feature importance matrix and feature importance matrix of each frame image in a specific example of the present application.
  • FIG. 5 schematically shows a specific flowchart of extracting a responsibility frame provided by an embodiment of the present application.
  • according to the video feature importance matrix and the feature importance matrix of each frame image, a preset number of responsible frames is extracted, including:
  • Step A1 using the video feature importance matrix as the current video feature importance matrix
  • Step B1 for each frame of image, subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image;
  • Step C1 For each frame of image, add the eigenvalues of each feature dimension in the remaining feature importance matrix corresponding to the frame of image to obtain the remaining information entropy corresponding to the frame of image;
  • Step D1 taking the image with the smallest remaining information entropy as the current responsible frame
  • Step E1 using the remaining feature importance matrix corresponding to the current responsible frame as a new current video feature importance matrix
  • the eigenvalue of each feature dimension in a feature importance matrix can be regarded as an amount of information; correspondingly, the eigenvalue of each feature dimension in the video feature importance matrix can be regarded as the total amount of information contributed by the entire video under this feature dimension, and the eigenvalue of each feature dimension in the feature importance matrix of each frame image can be regarded as the single-frame amount of information contributed by that frame image under this feature dimension.
  • the feature importance matrix of a frame image is subtracted from the video feature importance matrix, and the obtained matrix is the remaining feature importance matrix corresponding to that frame image.
  • the amounts of information (i.e., feature values) of all feature dimensions in the remaining feature importance matrix are added, and the obtained sum is the residual information entropy after subtracting that frame image from the video; finding the frame that produces the smallest residual information entropy means finding the most important responsible frame.
  • after the most important responsible frame is found, its corresponding remaining feature importance matrix is regarded as the new video feature importance matrix, and the same method is used to find the second most important responsible frame; the remaining feature importance matrix corresponding to the second most important responsible frame is in turn regarded as the new video feature importance matrix, and the same method is used to find the next important responsible frame, until a preset number of responsible frames are found.
  • with the responsible frame extraction method provided by this embodiment, responsible frames with diverse features can be extracted without defining an extraction distance between frames.
  • the responsible frames in the video that contribute different important features to video classification (such as the classification of benign and malignant nodules) are automatically found.
  • the responsibility frame extraction method provided has strong versatility, can be applied to various CNN (convolutional neural network) models, and has good applicability and transferability.
  • the subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image includes:
  • the eigenvalue of the corresponding feature dimension in the feature importance matrix of the frame image is subtracted from the eigenvalue of each feature dimension in the current video feature importance matrix to obtain the eigenvalue difference of each feature dimension;
  • for the eigenvalue difference of each feature dimension, if the eigenvalue difference of the feature dimension is less than 0, 0 is used as the eigenvalue of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame image; if the eigenvalue difference of the feature dimension is greater than or equal to 0, the eigenvalue difference of the feature dimension is used as the eigenvalue of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame image.
  • FIG. 6 schematically shows a schematic diagram of obtaining the remaining feature importance matrix in a specific example of the present application.
  • as shown in Figure 6, by subtracting the eigenvalue of the corresponding feature dimension in the feature importance matrix of a frame image from the eigenvalue of each feature dimension in the video feature importance matrix, the remaining feature importance matrix corresponding to that frame image can be obtained.
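  • Putting steps A1 to E1 and the clamped subtraction together, a compact sketch of the greedy extraction could look as follows; the array shapes, the names, and the exclusion of already selected frames are assumptions.

```python
import numpy as np

def extract_responsible_frames(frame_imp: np.ndarray, video_imp: np.ndarray, num: int):
    """frame_imp: (num_frames, k) feature importance matrices of all frames.
    video_imp: (k,) video feature importance matrix.
    Returns the indices of `num` responsible frames."""
    current = video_imp.copy()
    chosen = []
    for _ in range(num):
        # Step B1: remaining importance = max(current - frame, 0) for every frame
        remaining = np.maximum(current - frame_imp, 0.0)   # (num_frames, k)
        # Step C1: remaining information entropy = sum over feature dimensions
        remaining_entropy = remaining.sum(axis=1)          # (num_frames,)
        remaining_entropy[chosen] = np.inf                 # assumption: never pick a frame twice
        # Step D1: the frame with the smallest remaining entropy is the next responsible frame
        idx = int(np.argmin(remaining_entropy))
        chosen.append(idx)
        # Step E1: its remaining matrix becomes the new current video importance matrix
        current = remaining[idx]
    return chosen
```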
  • the extraction of a preset number of responsible frames according to the feature matrix of each frame image includes:
  • a video can be regarded as a collection of a series of frames, and the information of the entire video is scattered across the frames; the contribution of each frame image in each feature dimension is represented by its feature matrix, where the number of feature dimensions is determined by the skeleton network, and each feature dimension represents an image feature in a depth space (such as a feature of malignant nodules or benign nodules). The feature matrix is multiplied by contribution weight values, where the contribution weight values can be determined by the channel weight difference of the classification network (such as the fully connected layer) of the static image classification neural network model.
  • the classification network is used to classify benign and malignant; one channel of the classification network corresponds to the malignant category, and the other channel corresponds to the benign category, where the weight of the channel corresponding to the malignant category is W_1 and the weight of the channel corresponding to the benign category is W_0.
  • the output Y_pred predicted by the model in the basic CNN architecture can be expressed in terms of the following quantities:
  • Sigmoid represents the activation function;
  • X represents the feature matrix;
  • Y_0 represents the benign probability;
  • Y_1 represents the malignant probability;
  • MaxPooling represents the maximum pooling operation.
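  • The expression itself is not legible in this text; a plausible reconstruction from the symbols listed above (bias terms omitted; W_0 and W_1 are the benign and malignant channel weights defined earlier) is:

```latex
Y_{pred} = [\,Y_0,\ Y_1\,] = \mathrm{Sigmoid}\big(\, [\,\mathrm{MaxPooling}(X)\cdot W_0,\ \ \mathrm{MaxPooling}(X)\cdot W_1\,] \,\big)
```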
  • the feature value of each feature dimension in the video feature matrix above can also be directly multiplied by the contribution weight value of the feature dimension, to get the video feature entropy matrix.
  • extracting a preset number of responsible frames according to the feature entropy matrix of each frame image and the video feature entropy matrix including:
  • a preset number of responsible frames is extracted, wherein the difference between the evaluation score of the video to be extracted and the evaluation score of the image set formed by the preset number of responsible frames is the smallest.
  • where FScore represents the evaluation score and A = [frame_a, frame_b, ..., frame_n] represents an image set composed of several frame images; the evaluation score FScore of the image set A satisfies the following relationship:
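  • The relationship itself is not legible in this text; a reconstruction consistent with the description below (max-pool the feature entropy matrices of the frames in A, then sum over the k feature dimensions; FE_i^j is an assumed symbol for the feature entropy value of frame i in dimension j) is:

```latex
FScore_{A} = \sum_{j=1}^{k} \max_{i \in A} FE_{i}^{\,j}
```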
  • since the difference between the evaluation score of the video to be extracted and the evaluation score of the image set composed of the finally extracted responsible frames is the smallest, it can not only ensure that the information contained in the image set composed of the responsible frames is as close as possible to that of the entire video, but also ensure that the selected responsible frames form complementary features.
  • FIG. 7 schematically shows a specific flowchart of extracting the responsibility frame provided by another embodiment of the present application.
  • according to the evaluation score of each frame image and the evaluation score of the video to be extracted, a preset number of responsible frames is extracted, including:
  • Step A2 For each frame of image, calculate the difference between the evaluation score of the video to be extracted and the evaluation score of the frame image, so as to obtain the feature entropy difference of the frame image;
  • Step B2 determining the image with the smallest feature entropy difference as the responsible frame
  • Step C2. Composing all responsible frames and each non-responsible frame into an image set respectively, and calculating the evaluation score of each image set respectively;
  • Step D2 For each image set, calculate the difference between the evaluation score of the video to be extracted and the evaluation score of the image set to obtain the feature entropy difference of the image set;
  • Step E2 determining all images in the image set with the smallest feature entropy difference as responsible frames
  • the maximum pooling operation can be performed on the feature entropy matrices of all frame images in the image set to obtain the feature entropy matrix of the image set; the eigenvalues of all feature dimensions of the resulting matrix are then added, and the obtained sum is the evaluation score of the image set.
  • all determined responsible frames top1, top2, ..., top(i-1) can be combined with each remaining image frame (that is, each frame image except the responsible frames) to form an image set, respectively.
  • the feature entropy difference of each image set is obtained by subtracting the evaluation score of the image set from the evaluation score of the video to be extracted. In this way, the feature entropy difference of every image set can be obtained; all images in the image set with the smallest feature entropy difference are determined as responsible frames, that is, the remaining frame image in the image set with the smallest feature entropy difference is the i-th responsible frame.
  • the responsible frame extraction method provided in this embodiment can realize the extraction of multiple responsible frames whose contribution features are not repeated for video classification (such as the classification of benign and malignant nodule videos) without adding additional training parameters.
  • This embodiment can be applied to various CNN models, and has good applicability and portability.
  • FIG. 8a schematically shows a schematic diagram of obtaining a video feature entropy matrix in a specific example of the present application
  • FIG. 8b schematically shows a schematic diagram of selecting the first responsible frame in a specific example of the present application.
  • FIG. 8c schematically shows a schematic diagram of selecting a second frame responsibility frame in a specific example of the present application.
  • the video to be extracted includes 3 frames of images, the total number of depth feature dimensions is 3, and the number of responsible frames to be extracted is 2.
  • the evaluation score FScore_video of the video to be extracted is 24, the evaluation score FScore_frame1 of the first frame image is 16, the evaluation score FScore_frame2 of the second frame image is 14, and the evaluation score FScore_frame3 of the third frame image is 11.
  • the difference between the evaluation score FScore_video of the video to be extracted and the evaluation score FScore_frame1 of the first frame image is 8 (that is, the feature entropy difference of the first frame image is 8); the difference between FScore_video and the evaluation score FScore_frame2 of the second frame image is 10 (that is, the feature entropy difference of the second frame image is 10); and the difference between FScore_video and the evaluation score FScore_frame3 of the third frame image is 13 (that is, the feature entropy difference of the third frame image is 13).
  • the first frame image is therefore determined as the first responsible frame.
  • the first responsible frame (i.e., the first frame image) and the second frame image form an image set [frame1, frame2]; performing the maximum pooling operation on the feature entropy matrix of the first responsible frame and the feature entropy matrix of the second frame image yields the feature entropy matrix of the image set [frame1, frame2], and the evaluation score FScore_[frame1, frame2] of the image set [frame1, frame2] is 16.
  • the difference between the evaluation score FScore_video of the video to be extracted and the evaluation score FScore_[frame1, frame2] of the image set [frame1, frame2] is therefore 8 (that is, the feature entropy difference of the image set [frame1, frame2] is 8). Similarly, the first responsible frame and the third frame image form the image set [frame1, frame3], and the difference between FScore_video and the evaluation score FScore_[frame1, frame3] of the image set [frame1, frame3] is 0 (that is, the feature entropy difference of the image set [frame1, frame3] is 0). Since the feature entropy difference of the image set [frame1, frame3] composed of the first responsible frame and the third frame image is smaller than the feature entropy difference of the image set [frame1, frame2] composed of the first responsible frame and the second frame image, the third frame image is determined as the second responsible frame.
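  • The numbers in this example can be reproduced with a few lines; the three 3-dimensional feature entropy matrices below are assumed values chosen only so that the per-frame and per-set scores match the figures quoted above.

```python
import numpy as np

# Assumed per-frame feature entropy matrices (3 frames x 3 feature dimensions).
FE = np.array([[8, 6, 2],    # frame 1 -> FScore 16
               [7, 5, 2],    # frame 2 -> FScore 14
               [0, 1, 10]])  # frame 3 -> FScore 11

def fscore(frames):
    """Evaluation score of an image set: max-pool over its frames, then sum."""
    return FE[list(frames)].max(axis=0).sum()

fscore_video = fscore([0, 1, 2])                         # 24
diffs = [fscore_video - fscore([i]) for i in range(3)]   # [8, 10, 13]
first = int(np.argmin(diffs))                            # frame 1 is the first responsible frame
second = min((i for i in range(3) if i != first),
             key=lambda i: fscore_video - fscore([first, i]))
print(first + 1, second + 1)                             # 1 3 -> frames 1 and 3 are responsible frames
```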
  • the present application also provides a responsibility frame extraction method, please refer to FIG. 9, which schematically shows a flow chart of the responsibility frame extraction method provided by an embodiment of the present application, as shown in FIG. 9, the The method for extracting the responsibility frame comprises the following steps:
  • Step S210 using the object detection neural network model to extract the region of interest for each frame of medical image in the acquired medical video, so as to obtain the region of interest image corresponding to each frame of medical image.
  • Step S220 using the skeleton network of the static image classification neural network model to perform feature extraction on each frame of the ROI image, so as to obtain a feature matrix of each frame of the ROI image.
  • Step S230: according to the feature matrix of each frame of region of interest image, extracting malignant responsible frames until a first preset end condition is met; and/or, according to the feature matrix of each frame of region of interest image, extracting benign responsible frames until a second preset end condition is met.
  • the responsible frame extraction method provided by this application first uses the target detection neural network model to extract the region of interest image from each frame of medical image in the acquired medical video, and then extracts the malignant responsible frames and/or benign responsible frames according to the feature matrix of each frame of region of interest image. This effectively reduces the interference of image noise in the process of extracting the malignant responsible frames and/or benign responsible frames, and further improves the efficiency and accuracy of extracting the malignant responsible frames and/or benign responsible frames.
  • using the target detection neural network model to perform region of interest extraction on each frame of medical image in the acquired medical video, so as to obtain the region of interest image corresponding to each frame of medical image, includes:
  • the corresponding region is cut out on each frame of medical image, so as to obtain the image of the region of interest corresponding to each frame of medical image.
  • according to the position information of the region of interest (that is, the ultrasound window), the medical image can be cropped to obtain the corresponding region of interest image.
  • before using the skeleton network of the static image classification neural network model to perform feature extraction on each frame of region of interest image, the method further includes: adjusting the size of the region of interest image, so as to adjust the size of the region of interest image to a preset size.
  • the preset size can be set according to specific conditions, which is not limited in this application.
  • in the preset size, the height dimension of the image is consistent with the width dimension; that is, the region of interest image adjusted to the preset size is a square image, for example, the preset size is 448×448.
  • the ROI image may be filled with a "zero pixel" filling method, so as to adjust the width and height dimensions of the ROI image to be consistent.
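  • A small sketch of the "zero pixel" padding described above; the function name and the centering choice are assumptions.

```python
import numpy as np

def pad_to_square(roi: np.ndarray) -> np.ndarray:
    """Pad an (H, W, ...) region of interest image with zero pixels so that H == W."""
    h, w = roi.shape[:2]
    size = max(h, w)
    top, left = (size - h) // 2, (size - w) // 2
    out = np.zeros((size, size) + roi.shape[2:], dtype=roi.dtype)
    out[top:top + h, left:left + w] = roi
    return out  # afterwards the image is resized to the preset size, e.g. 448 x 448
```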
  • this improves the extraction efficiency of the responsible frame extraction method provided by this application. It should be noted that, as those skilled in the art can understand, the total number of frames of region of interest images that can be processed in parallel each time is determined by the computing power of the GPU of the computer; the stronger the computing power of the GPU, the more frames of region of interest images can be processed in parallel.
  • extracting the malignant responsible frames according to the feature matrix of each frame of region of interest image until the first preset end condition is met includes: acquiring the malignant feature matrix of the region of interest image according to the feature matrix of the region of interest image and the difference between the malignant feature weight parameter and the benign feature weight parameter corresponding to the static image classification neural network model; and extracting the malignant responsible frames according to the malignant feature matrix of each frame of region of interest image until the first preset end condition is met.
  • the benign or malignant judgment of each frame of region of interest image is based on the feature matrix of the region of interest image, and the output probability Y_pred predicted by the static image classification neural network model can be expressed in terms of the following quantities:
  • Y_0 represents the probability that the region of interest image belongs to the benign category;
  • Y_1 represents the probability that the region of interest image belongs to the malignant category;
  • W_1 represents the malignant feature weight parameter corresponding to the static image classification neural network model, and W_0 represents the benign feature weight parameter corresponding to the static image classification neural network model;
  • B_0 and B_1 represent the bias parameters corresponding to the static image classification neural network model.
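  • With the garbled symbols restored, the per-frame prediction described above takes a form such as the following (a reconstruction consistent with the symbol list, not a verbatim quotation), where X is the feature matrix of the region of interest image:

```latex
Y_{pred} = [\,Y_0,\ Y_1\,] = \mathrm{Sigmoid}\big(\, [\,X \cdot W_0 + B_0,\ \ X \cdot W_1 + B_1\,] \,\big)
```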
  • the present application obtains the malignant feature matrix of the region of interest image according to the feature matrix of the region of interest image and the difference between the malignant feature weight parameter and the benign feature weight parameter corresponding to the static image classification neural network model, and then extracts the malignant responsible frames according to the malignant feature matrix of each frame of region of interest image until the first preset end condition is met, so that malignant responsible frames carrying a large amount of malignant contribution information can be accurately extracted.
  • the malignant feature weight parameter W_1 is a matrix with k malignant feature weights, and the benign feature weight parameter W_0 is a matrix with k benign feature weights; that is, each feature dimension corresponds to one malignant feature weight and one benign feature weight.
  • obtaining the malignant feature matrix of the region of interest image according to the feature matrix of the region of interest image and the difference between the malignant feature weight parameter and the benign feature weight parameter corresponding to the static image classification neural network model includes:
  • obtaining the malignant feature matrix of the region of interest image according to the following formula (3):
  • [FM]_i^j = X_i^j × (W_1^j − W_0^j)  (3)
  • where [FM]_i represents the malignant feature matrix of the i-th frame region of interest image; X_i^j represents the eigenvalue of the j-th feature dimension in the feature matrix of the i-th frame region of interest image; W_1^j represents the malignant feature weight of the j-th feature dimension corresponding to the static image classification neural network model; W_0^j represents the benign feature weight of the j-th feature dimension corresponding to the static image classification neural network model; and [FM]_i^j represents the malignant feature value of the j-th feature dimension in the malignant feature matrix of the i-th frame region of interest image.
  • extracting the malignant responsible frames according to the malignant feature matrix of each frame of region of interest image until the first preset end condition is met includes:
  • for each frame of region of interest image, the malignant feature values of all feature dimensions in the malignant feature matrix of the region of interest image are added to obtain the total malignant feature value of the region of interest image; the malignant responsible frames are then extracted according to the total malignant feature value of each frame of region of interest image until the first preset end condition is satisfied, including:
  • Step A10 sorting the total malignant feature values of the ROI images of each frame, and determining the ROI image with the largest total malignant feature value as the malignant responsible frame;
  • Step A20: forming a first image set from all malignant responsible frames together with each non-malignant responsible frame, and calculating the total malignant feature value of each first image set, respectively, wherein the total malignant feature value of a first image set is equal to the sum of the malignant feature values of all feature dimensions in the malignant feature matrix obtained after performing the maximum pooling operation on the malignant feature matrices of all frames of region of interest images in the first image set, and a non-malignant responsible frame is a region of interest image that has not been determined to be a malignant responsible frame;
  • Step A30: judging whether the malignant feature entropy corresponding to the first image set with the largest total malignant feature value is greater than the malignant feature entropy corresponding to the malignant responsible frame set composed of all malignant responsible frames;
  • if not, performing step A40; if so, performing step A50;
  • Step A40: determining all frames of region of interest images in the first image set with the largest total malignant feature value as malignant responsible frames, and returning to step A20;
  • Step A50: ending the extraction of malignant responsible frames.
  • performing the maximum pooling operation on the malignant feature matrices of all frames of region of interest images in the first image set means taking, along the column direction (that is, the direction of the feature dimension), the maximum malignant eigenvalue across the malignant feature matrices of all frames of region of interest images in the first image set, so that the malignant eigenvalue of each feature dimension is the maximum value of that feature dimension over the malignant feature matrices of all frames of region of interest images in the first image set.
  • the responsible frame extraction method first determines the region of interest image with the largest total malignant feature value as the first malignant responsible frame in the malignant responsible frame set, and then forms a first image set from each remaining frame of region of interest image that has not been determined as a malignant responsible frame together with the first malignant responsible frame (at this time, each first image set includes the first malignant responsible frame and one region of interest image not determined as a malignant responsible frame), and calculates the total malignant feature value of each first image set; the region of interest image not determined as a malignant responsible frame in the first image set with the largest total malignant feature value is then the second malignant responsible frame in the malignant responsible frame set.
  • next, first image sets are formed again (each now including the first malignant responsible frame, the second malignant responsible frame and one region of interest image not determined as a malignant responsible frame); by calculating the total malignant feature value of each first image set, the first image set with the largest total malignant feature value can be found. If the malignant feature entropy of the first image set with the largest total malignant feature value is greater than the malignant feature entropy of the malignant responsible frame set composed of the first malignant responsible frame and the second malignant responsible frame, the extraction of malignant responsible frames ends, and the extracted first and second malignant responsible frames are taken as the final malignant responsible frames; if the malignant feature entropy of that first image set is less than or equal to the malignant feature entropy of the malignant responsible frame set composed of the first and second malignant responsible frames, all region of interest images in the first image set with the largest total malignant feature value are determined as malignant responsible frames, and the process continues in the same way.
  • the malignant feature entropy of an image set is calculated according to the following formulas (6) and (7):
  • H_1(A) = −p_1(A) × log_2 p_1(A)  (6)
  • p_1(A) = MScore_A / (MScore_A + BScore_A)  (7)
  • where H_1(A) represents the malignant feature entropy of image set A, p_1(A) represents the malignant probability of image set A, MScore_A represents the total malignant feature value of image set A, and BScore_A represents the total benign feature value of image set A.
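  • A condensed sketch of steps A10 to A50, combining the malignant feature matrix of formula (3), the max-pooled total malignant feature value, and the entropy stopping criterion of formulas (6) and (7); the candidate set is chosen by the largest total malignant feature value as in the detailed description above, and all names and the clipping of p_1 are assumptions.

```python
import numpy as np

def set_entropy(m_feat, b_feat, idx):
    """Malignant feature entropy H1(A) of the image set given by the indices `idx`."""
    mscore = m_feat[idx].max(axis=0).sum()      # total malignant feature value of the set
    bscore = b_feat[idx].max(axis=0).sum()      # total benign feature value of the set
    p1 = np.clip(mscore / (mscore + bscore + 1e-9), 1e-9, 1.0)  # assumed form of formula (7)
    return -p1 * np.log2(p1)                    # formula (6)

def extract_malignant_frames(X, W1, W0):
    """X: (num_frames, k) ROI feature matrices; W1 / W0: (k,) malignant / benign weights."""
    m_feat = X * (W1 - W0)                      # formula (3): malignant feature matrices
    b_feat = X * (W0 - W1)                      # benign counterpart (cf. formula (9) below)
    chosen = [int(np.argmax(m_feat.sum(axis=1)))]          # step A10
    while True:
        rest = [i for i in range(len(X)) if i not in chosen]
        if not rest:
            return chosen
        # step A20: candidate first image sets = chosen frames plus one remaining frame
        best = max(rest, key=lambda i: m_feat[chosen + [i]].max(axis=0).sum())
        # steps A30/A50: stop when adding a frame would increase the malignant feature entropy
        if set_entropy(m_feat, b_feat, chosen + [best]) > set_entropy(m_feat, b_feat, chosen):
            return chosen
        chosen.append(best)                     # step A40
```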
  • by judging whether the feature entropy has increased in order to decide whether to stop the extraction of malignant responsible frames, the present application can automatically extract, based on the content of the acquired medical video, the required number of malignant responsible frames that can contribute important features to the classification of the medical video. Please refer to FIG. 13, which schematically shows the relationship between the feature entropy of the responsible frame image set and the number of responsible frames provided by an embodiment of the present application.
  • the extraction of benign responsible frames according to the feature matrix of each frame of the region-of-interest image until the second preset end condition is met includes:
  • the benign feature matrix of the region of interest image is acquired according to the feature matrix of the region of interest image and the difference between the benign feature weight parameter and the malignant feature weight parameter corresponding to the static image classification neural network model; and the benign responsible frames are extracted according to the benign feature matrix of each frame of region of interest image until the second preset end condition is met.
  • the present application obtains the benign feature matrix of the image of the region of interest according to the feature matrix of the image of the region of interest and the difference between the benign feature weight parameter and the malignant feature weight parameter corresponding to the static image classification neural network model , and then extract the benign responsible frames according to the benign feature matrix of the ROI image in each frame, until the second preset end condition is met, so that the benign responsible frames with a large amount of benign contribution information can be accurately extracted.
  • Further, acquiring the benign feature matrix of a region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the benign feature weight parameters and the malignant feature weight parameters corresponding to the static image classification neural network model includes calculating formulas (8) and (9):
  • [FB]i = [fb_i^1, fb_i^2, ..., fb_i^k]  (8)
  • fb_i^j = f_i^j × (wb^j - wm^j)  (9)
  • where [FB]i represents the benign feature matrix of the i-th frame of region-of-interest image, f_i^j represents the feature value of the j-th feature dimension in the feature matrix of the i-th frame of region-of-interest image, wb^j represents the benign feature weight of the j-th feature dimension corresponding to the static image classification neural network model, wm^j represents the malignant feature weight of the j-th feature dimension corresponding to the static image classification neural network model, and fb_i^j represents the benign feature value of the j-th feature dimension in the benign feature matrix of the i-th frame of region-of-interest image.
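  • Purely as an illustration (array names and shapes are assumptions of this sketch, not taken from the patent), the per-dimension weighting described above can be written as one vectorized operation:

```python
import numpy as np

def benign_feature_matrix(frame_feats: np.ndarray,
                          w_benign: np.ndarray,
                          w_malignant: np.ndarray) -> np.ndarray:
    """frame_feats: (n_frames, k) feature matrices from the backbone network.
    w_benign, w_malignant: (k,) benign / malignant feature weight parameters of the classification model.
    Each feature value is scaled by the margin between its benign and malignant weights."""
    return frame_feats * (w_benign - w_malignant)
```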
  • Further, the extraction of benign responsible frames according to the benign feature matrix of each frame of region-of-interest image, until the second preset end condition is met, includes: calculating the total benign feature value of each frame of region-of-interest image according to its benign feature matrix, and extracting benign responsible frames according to the total benign feature value of each frame of region-of-interest image until the second preset end condition is satisfied.
  • FIG. 14 schematically shows a specific flowchart of extracting benign responsibility frames provided by an embodiment of the present application.
  • the extraction of benign responsible frames is carried out according to the total benign feature value of the region-of-interest image of each frame until the second preset end condition is met, including:
  • Step B10 sorting the total benign feature values of the ROI images of each frame, and determining the ROI image with the largest total benign feature value as the benign responsible frame;
  • Step B20 Composing all benign responsible frames and each non-benign responsible frame into a second image set, and calculating the total benign feature value of each second image set, wherein the total benign feature of the second image set The value is equal to the sum of the benign eigenvalues of all the feature dimensions in the benign feature matrix obtained after performing the maximum pooling operation on the benign feature matrices of all frame ROI images in the second image collection, and the non-benign responsible frame is an image of a region of interest that has not been determined to be a benign responsible frame;
  • Step B30 judging whether the benign feature entropy corresponding to the second image set with the smallest total benign feature value is greater than the benign feature entropy corresponding to the benign responsible frame set composed of all benign responsible frames;
  • If not, perform Step B40; if so, perform Step B50;
  • Step B40 determine all ROI images in the second image set with the smallest total benign feature value as benign responsible frames, and return to step B20;
  • Step B50 ending the extraction of benign responsibility frames.
  • Performing the maximum pooling operation on the benign feature matrices of all frames of region-of-interest images in the second image set means taking, along the column direction (i.e., per feature dimension), the maximum benign feature value over the benign feature matrices of all frames of region-of-interest images in the set, so that the benign feature value of each feature dimension in the pooled matrix is the maximum value of that feature dimension over all frames of region-of-interest images in the second image set.
  • That is, the responsible frame extraction method first determines the region-of-interest image with the largest total benign feature value as the first benign responsible frame in the benign responsible frame set, and then combines each remaining frame of region-of-interest image that has not been determined to be a benign responsible frame with the first benign responsible frame to form a second image set (each second image set at this stage includes the first benign responsible frame and one region-of-interest image not yet determined to be a benign responsible frame), and calculates the total benign feature value of each second image set; the region-of-interest image that has not been determined to be a benign responsible frame in the second image set with the largest total benign feature value is then taken as the second benign responsible frame in the benign responsible frame set.
  • Next, the first benign responsible frame, the second benign responsible frame and each frame of region-of-interest image not yet determined to be a benign responsible frame form a second image set respectively (each second image set at this stage includes the first benign responsible frame, the second benign responsible frame and one region-of-interest image not yet determined to be a benign responsible frame). By calculating the total benign feature value of each second image set, the second image set with the largest total benign feature value can be found. If the benign feature entropy of the second image set with the largest total benign feature value is greater than the benign feature entropy of the benign responsible frame set composed of the first benign responsible frame and the second benign responsible frame, the extraction of benign responsible frames ends, and the extracted first and second benign responsible frames are taken as the final benign responsible frames; if the benign feature entropy of the second image set with the largest total benign feature value is less than or equal to the benign feature entropy of the benign responsible frame set composed of the first and second benign responsible frames, all region-of-interest images in that second image set are determined to be benign responsible frames, and the above procedure is repeated until the benign feature entropy of the second image set with the largest total benign feature value is greater than the benign feature entropy of the benign responsible frame set composed of all benign responsible frames.
  • Since visually identical region-of-interest images usually share similar benign feature matrices, adding a similar region-of-interest image will not have a significant impact on the total benign feature value of the image set; therefore, this benign responsible frame extraction method will not repeatedly select similar benign responsible frames.
  • The benign feature entropy of an image set is calculated according to the following formulas (12) and (13):
  • H0(A) = -p0(A) × log2 p0(A)  (12)
  • p0(A) = BScoreA / (MScoreA + BScoreA)  (13)
  • where H0(A) represents the benign feature entropy of image set A, p0(A) represents the benign score proportion of image set A, MScoreA represents the total malignant feature value of image set A, and BScoreA represents the total benign feature value of image set A.
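  • For illustration only (the frame feature arrays and score names are assumptions of this sketch), the total scores and the benign feature entropy of a candidate image set A could be evaluated as:

```python
import numpy as np

def set_scores_and_benign_entropy(benign_feats: np.ndarray, malig_feats: np.ndarray):
    """benign_feats / malig_feats: (m, k) benign / malignant feature matrices of the m frames in set A."""
    bscore = float(benign_feats.max(axis=0).sum())   # column-wise max pooling, then sum: BScoreA
    mscore = float(malig_feats.max(axis=0).sum())    # MScoreA, total malignant feature value of the set
    p0 = bscore / (bscore + mscore + 1e-12)          # formula (13)
    h0 = 0.0 if p0 <= 0.0 else -p0 * np.log2(p0)     # formula (12), benign feature entropy H0(A)
    return bscore, mscore, h0
```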
  • In this way, by judging whether to stop the extraction of benign responsible frames according to whether the feature entropy has increased, the present application can automatically extract, according to the content of the acquired medical video, the required number of benign responsible frames that can contribute important features to the classification of medical videos.
  • Please refer to FIG. 16, which schematically shows a software interface for a doctor to adjust responsible frames provided by an embodiment of the present application.
  • The extracted malignant responsible frames and/or benign responsible frames can be displayed in the responsible frame recommendation window of the software interface, and the acquired medical video to be classified can also be displayed in the video playback window of the software interface.
  • The doctor can browse the adjacent frames (that is, region-of-interest images) near a responsible frame through the "previous frame" and "next frame" buttons, and can choose to accept or reject the current frame (that is, the region-of-interest image currently being viewed) as a responsible frame; the system automatically records the responsible frames confirmed by the doctor.
  • In addition, the inventors of the present application collected a total of 13702 2D ultrasound breast nodule images (including 9177 images from 2457 patients with benign pathology and 4545 images from 991 patients with malignant pathology) and 2141 breast ultrasound videos (including 1227 videos from 560 patients with benign pathology and 914 videos from 412 patients with malignant pathology) for the training and validation of the static image classification neural network model and the video classification model.
  • The performance of the video classification method provided by this application was evaluated using the AUROC (area under the receiver operating characteristic curve), accuracy, sensitivity and specificity indicators.
  • The verification results of the 5-fold cross-validation (dividing the data set into 5 equal parts, taking one part in each round as the test set and the rest as the training set) are shown in Table 1 below, and the results on the test set are shown in Table 2 below.
  • The AUROC, accuracy, sensitivity and specificity obtained in this way are significantly better than the AUROC, accuracy, sensitivity and specificity of benign-malignant classification of breast nodules based on responsible frames manually selected by doctors.
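  • A hedged sketch of such a 5-fold evaluation (the feature array, label array and classifier factory are placeholders; the patent does not disclose the actual splitting code):

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_accuracy(features: np.ndarray, labels: np.ndarray, build_model):
    """features: (n_videos, k) responsible-frame-set feature matrices; labels: benign/malignant labels.
    build_model: factory returning a fresh, untrained classifier for each round."""
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        model = build_model()
        model.fit(features[train_idx], labels[train_idx])
        fold_scores.append(model.score(features[test_idx], labels[test_idx]))  # per-round accuracy
    return float(np.mean(fold_scores)), float(np.std(fold_scores))
```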
  • FIG. 17 schematically shows a flowchart of the video classification method provided by an embodiment of the present application.
  • The video classification method comprises the following steps:
  • Step S310 using the method for extracting responsible frames described above, extracting a preset number of responsible frames from the acquired medical video.
  • Step S320 classify the video according to the feature matrix of the preset number of responsible frames.
  • Since the video classification method provided by this application uses the above-described responsible frame extraction method to extract a preset number of responsible frames, the video classification method provided by this application has all the advantages of the above-described responsible frame extraction method.
  • In addition, since the video classification method provided by the present application classifies the video according to the extracted preset number of responsible frames, it can effectively reduce the interference of noise frames in the medical video and effectively improve the accuracy of video classification (such as the classification of benign and malignant nodule videos).
  • Specifically, performing video classification according to the feature matrices of the preset number of responsible frames includes:
  • performing a maximum pooling operation on the feature matrices of the preset number of responsible frames to obtain the feature matrix of the responsible frame set, and performing video classification according to the feature matrix of the responsible frame set.
  • In this way, the feature matrix contributed by all responsible frames, that is, the feature matrix of the responsible frame set, can be obtained, and video classification can then be performed accurately according to the obtained feature matrix of the responsible frame set.
  • Further, performing video classification according to the feature matrix of the responsible frame set includes:
  • inputting the feature matrix of the responsible frame set into a video classification model, so that the final video classification can be performed.
  • the video classification model is a random forest classification model.
  • the random forest classification model consists of multiple classification trees, and each classification tree classifies the input feature matrix.
  • the random forest classification model votes according to the classification results of all classification trees, and finally makes a judgment of benign and malignant lesions.
  • the video classification model may also be other classification models than the random forest classification model, which is not limited in this application.
  • The random forest classification model is obtained through pre-training; specifically, a video training set (which includes the feature matrix of the responsible frame set of each video and the corresponding classification label) can be used to train a pre-built random forest classification model to obtain the video classification model.
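  • The following is a minimal sketch of this training and inference pipeline, assuming scikit-learn's RandomForestClassifier as the classifier and assuming the feature matrices of each video's extracted responsible frames are already available (names and hyperparameters are illustrative only, not those of the patent):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def responsible_set_feature(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (n_responsible_frames, k) feature matrices of one video's responsible frames.
    Column-wise max pooling yields the (k,) feature matrix of the responsible frame set."""
    return frame_feats.max(axis=0)

def train_video_classifier(videos_frame_feats, labels):
    """videos_frame_feats: list of (n_i, k) arrays, one per training video; labels: 0 = benign, 1 = malignant."""
    X = np.stack([responsible_set_feature(f) for f in videos_frame_feats])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, labels)  # each classification tree learns on the pooled features; the forest votes at prediction time
    return clf

def classify_video(clf, frame_feats: np.ndarray):
    pooled = responsible_set_feature(frame_feats).reshape(1, -1)
    return int(clf.predict(pooled)[0]), clf.predict_proba(pooled)[0]
```

  • In this sketch each video contributes a single pooled feature vector, which mirrors the way the feature matrix of the responsible frame set is obtained by max pooling before classification.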
  • ROC-AUC is 0.885 (95% CI: 0.830-0.939)
  • PR-AUC is 0.876 (95% CI: 0.831-0.927)
  • Accuracy is 0.82
  • F1-Score is 0.819
  • all evaluation indicators are better than those when directly using video to predict benign and malignant.
  • Here, ROC denotes the receiver operating characteristic curve, AUC denotes the area under the curve, ROC-AUC represents the area under the ROC curve, CI denotes the confidence interval, and PR-AUC represents the area under the precision-recall curve.
  • ROC-AUC is 0.891 (95% CI: 0.835-0.947)
  • PR-AUC is 0.908 (95% CI: 0.876-0.940)
  • Accuracy is 0.85
  • F1-Score is 0.838.
  • The ROC-AUC and PR-AUC are basically the same (a difference of 0.002), the accuracy is 0.01 higher, and the F1-Score improves significantly from 0.819 to 0.838. It can be seen that using different skeleton networks for feature extraction has different effects on the prediction performance of the classification model, so an appropriate network model can be selected as the skeleton network according to the specific conditions of the classification task.
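  • For reference, these indicators could be computed with scikit-learn as sketched below (`clf`, `X_test` and `y_test` denote any trained classifier and held-out data; they are not objects defined in this application):

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score, f1_score)

def evaluate(clf, X_test, y_test):
    prob = clf.predict_proba(X_test)[:, 1]   # predicted probability of the malignant class
    pred = clf.predict(X_test)
    return {
        "ROC-AUC": roc_auc_score(y_test, prob),
        "PR-AUC": average_precision_score(y_test, prob),  # average precision, an estimate of the PR-curve area
        "Accuracy": accuracy_score(y_test, pred),
        "F1-Score": f1_score(y_test, pred),
    }
```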
  • Optionally, the video classification method further includes displaying the classification result of the video and the extracted preset number of responsible frames.
  • In this way, the responsible frames on which the video classification is based can be presented, so that doctors can judge, based on the extracted responsible frames, whether the obtained video classification result is accurate, thereby further improving the accuracy of video classification.
  • For example, when the video is an ultrasound video, outputting the preset number of responsible frames extracted from the ultrasound video may help to further reduce the missed diagnosis rate and misdiagnosis rate during ultrasound screening.
  • the present application also provides an electronic device.
  • FIG. 18 schematically shows a block structure diagram of the electronic device provided in an embodiment of the present application.
  • the electronic device includes a processor 101 and a memory 103, and a computer program is stored on the memory 103.
  • When the computer program is executed by the processor 101, the above-described responsible frame extraction method or video classification method is implemented. Since the electronic device provided by this application and the responsible frame extraction method described above belong to the same inventive concept, the electronic device provided by this application has all the advantages of the responsible frame extraction method described above, and no further details are given here.
  • the electronic device further includes a communication interface 102 and a communication bus 104 , wherein the processor 101 , the communication interface 102 , and the memory 103 communicate with each other through the communication bus 104 .
  • the communication bus 104 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (Extended Industry Standard Architecture, EISA) bus or the like.
  • the communication bus 104 can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 102 is used for communication between the electronic device and other devices.
  • the processor 101 mentioned in this application can be a central processing unit (Central Processing Unit, CPU), and can also be other general processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the processor 101 is the control center of the electronic device, connecting various parts of the entire electronic device with various interfaces and lines.
  • the memory 103 can be used to store the computer program, and the processor 101 implements various functions of the electronic device by running or executing the computer program stored in the memory 103 and calling the data stored in the memory 103. Function.
  • the memory 103 may include non-volatile and/or volatile memory.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
  • the present application also provides a readable storage medium, wherein a computer program is stored in the readable storage medium, and when the computer program is executed by a processor, the method for extracting responsible frames or the video classification method described above can be implemented. Since the storage medium provided by this application and the method for extracting responsible frames described above belong to the same inventive concept, the storage medium provided by this application has all the advantages of the method for extracting responsible frames described above, so no further details are given here.
  • the readable storage medium in the embodiments of the present application may use any combination of one or more computer-readable media.
  • the readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof. More specific examples (non-exhaustive list) of computer readable storage media include: electrical connection with one or more wires, portable computer hard disk, hard disk, random access memory (RAM), read only memory (ROM), Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a data signal carrying computer readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Computer program code for carrying out the operations of the present application may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • the responsible frame extraction method, video classification method, electronic equipment and storage medium provided by this application have the following advantages:
  • The responsible frame extraction method, electronic device and storage medium provided by the application first obtain the video to be extracted; then use the skeleton network of the static image classification neural network model to perform feature extraction on each frame of image in the video to be extracted, so as to obtain the feature matrix of each frame of image; and finally extract a preset number of responsible frames according to the feature matrix of each frame of image.
  • The extracted responsible frames can lay a good foundation for subsequent video classification, effectively eliminating the interference caused by noise frame images on video classification during the video classification process.
  • The video classification method provided by this application extracts a preset number of responsible frames by using the above-described responsible frame extraction method, and classifies the video according to the feature matrices of the extracted preset number of responsible frames. Since the video classification method provided by this application uses the above-described responsible frame extraction method to extract the preset number of responsible frames, it has all the advantages of the above-described responsible frame extraction method. In addition, since the video classification method provided by this application classifies videos based on the extracted preset number of responsible frames, it can effectively reduce the interference of noise frames in the video and effectively improve the accuracy of video classification.
  • It should be noted that each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, and the module, program segment or portion of code contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures.
  • Each block in the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.
  • the functional modules in the various embodiments herein can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

Abstract

Provided in the present application are a responsibility frame extraction method, a video classification method, an electronic device and a storage medium. The responsibility frame extraction method comprises: acquiring a video to be subjected to extraction; performing feature extraction on each frame of image in said video by using a backbone network of a static image classification neural network model, so as to acquire a feature matrix of each frame of image; performing a maximum pooling operation on the feature matrices of all the frames of image, so as to acquire a video feature matrix of said video; and extracting a preset number of responsibility frames according to the feature matrix of each frame of image and the video feature matrix.

Description

责任帧提取方法、视频分类方法、设备和介质Responsible frame extraction method, video classification method, device and medium 技术领域technical field
本申请涉及图像处理技术领域,特别涉及一种责任帧提取方法、视频分类方法、电子设备和存储介质。The present application relates to the technical field of image processing, in particular to a responsible frame extraction method, a video classification method, electronic equipment and a storage medium.
背景技术Background technique
超声是疾病医学影像检查的常用手段,可用于各类组织和脏器的疾病诊断,具有适用疾病广泛、成本较CT、MRI等大型影像设备低的特点。同时超声硬件在便携性上不断升级,掌上超声的产品形态实现了功能与便携性的统一,适用于基层疾病筛查场景。然而因为超声图像颗粒度高,存在大量散斑噪声、伪影、衰减等问题,超声诊断难以规范化、标准化,十分依赖超声医师的临床经验。一级医院、社区医院、乡镇诊所等基层医疗机构缺少有经验的超声医师,难以对超声视频做出准确的良恶性判断。Ultrasound is a common means of disease medical imaging examination, which can be used for disease diagnosis of various tissues and organs. At the same time, ultrasound hardware has been continuously upgraded in terms of portability, and the product form of handheld ultrasound has realized the unity of function and portability, which is suitable for grassroots disease screening scenarios. However, due to the high granularity of ultrasound images, there are a large number of speckle noise, artifacts, attenuation and other problems, it is difficult to standardize and standardize ultrasound diagnosis, and it relies heavily on the clinical experience of sonographers. Grassroots medical institutions such as primary hospitals, community hospitals, and township clinics lack experienced sonographers, and it is difficult to make accurate benign and malignant judgments on ultrasound videos.
临床上,在超声医生进行初诊、复核、向主治医师传达诊断建议时,均会使用到医师从视频中抽取的责任帧(具有明显良恶性指征的影像图片)。一个理想的人工智能超声系统应当可以自动地给出判断视频良恶性所依据的责任帧,该功能一方面可以进一步降低医师的工作量,另一方面可以支持医师来判断是否采用AI判断的结果。因此,如何在视频中提取责任帧显的尤为重要。Clinically, when the sonographer conducts the initial diagnosis, review, and conveys diagnostic suggestions to the attending physician, the physician will use the responsibility frames (images with obvious benign and malignant indications) extracted from the video. An ideal artificial intelligence ultrasound system should be able to automatically provide the responsible frame for judging whether the video is benign or malignant. On the one hand, this function can further reduce the workload of doctors, and on the other hand, it can support doctors to judge whether to use the results of AI judgment. Therefore, how to extract responsible frames in video is particularly important.
发明内容Contents of the invention
本申请的目的在于提供一种责任帧提取方法、视频分类方法、电子设备和存储介质,可以自动地在视频中找出为视频分类(例如良恶性结节视频的分类)贡献出不同重要特征的责任帧,以提高视频分类(例如良恶性结节视频的分类)的准确性。The purpose of this application is to provide a responsible frame extraction method, video classification method, electronic equipment and storage medium, which can automatically find out in the video that contributes different important features to video classification (such as the classification of benign and malignant nodule videos). Responsibility frames to improve the accuracy of video classification, such as the classification of benign and malignant nodule videos.
为达到上述目的,本申请提供一种责任帧提取方法,包括:获取待提取视频;采用静态图像分类神经网络模型的骨架网络对所述待提取视频中的每一帧图像进行特征提取,以获取每一帧图像的特征矩阵;对所有帧图像的特征矩阵进行最大池化操作,以获取所述待提取视频的视频特征矩阵;根据每一帧图像的特征矩阵和所述视频特征矩阵,提取出预设数量的责任帧。In order to achieve the above object, the application provides a method for extracting responsible frames, including: obtaining the video to be extracted; using the skeleton network of the static image classification neural network model to perform feature extraction on each frame image in the video to be extracted, to obtain The feature matrix of each frame image; the maximum pooling operation is performed on the feature matrix of all frame images to obtain the video feature matrix of the video to be extracted; according to the feature matrix of each frame image and the video feature matrix, extract Preset number of responsible frames.
可选的,所述根据每一帧图像的特征矩阵和所述视频特征矩阵,提取出预设数量的责任帧,包括:将所述视频特征矩阵中的每个特征维度的特征值乘以该特征维度的重要性值,以获取视频特征重要性矩阵;针对每一帧图像,将该帧图像的特征矩阵中的每个特征维度的特征值乘以该特征维度的重要性值,以获取该帧图像的特征重要性矩阵;根据所述视频特征重要性矩阵和每一帧图像的特征重要性矩阵,提取出预设数量的责任帧。Optionally, extracting a preset number of responsible frames according to the feature matrix of each frame image and the video feature matrix includes: multiplying the feature value of each feature dimension in the video feature matrix by the The importance value of the feature dimension to obtain the video feature importance matrix; for each frame image, the feature value of each feature dimension in the feature matrix of the frame image is multiplied by the importance value of the feature dimension to obtain the A feature importance matrix of a frame image; extracting a preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame image.
可选的,所述根据所述视频特征重要性矩阵和每一帧图像的特征重要性矩阵,提取出预设数量的责任帧,包括:步骤A1、以所述视频特征重要性矩阵作为当前视频特征重要性矩阵;步骤B1、针对每一帧图像,将所述当前视频特征重要性矩阵减去该帧图像的特征重要性矩阵,以获取该帧图像所对应的剩余特征重要性矩阵;步骤C1、针对每一帧图像,将该帧图像所对应的剩余特征重要性矩阵中的各个特征维度的特征值相加,以获取该帧图像所对应的剩余信息熵;步骤D1、将剩余信息熵最小的图像作为当前责任帧;步骤E1、将所述当前责任帧所对应的剩余特征重要性矩阵作为新的当前视频特征重要性矩阵;重复上述步骤B1至步骤E1,直至提取出预设数量的责任帧。Optionally, extracting a preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame image includes: step A1, using the video feature importance matrix as the current video Feature importance matrix; step B1, for each frame of image, subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image; step C1 1. For each frame image, add the eigenvalues of each feature dimension in the remaining feature importance matrix corresponding to the frame image to obtain the remaining information entropy corresponding to the frame image; step D1, minimize the remaining information entropy image as the current responsibility frame; step E1, using the remaining feature importance matrix corresponding to the current responsibility frame as a new current video feature importance matrix; repeat the above steps B1 to step E1 until a preset number of responsibility is extracted frame.
可选的,所述将所述当前视频特征重要性矩阵减去该帧图像的特征重要性矩阵,以获取该帧图像所对应的剩余特征重要性矩阵,包括:将所述当前视频特征重要性矩阵中的每一特征维度的特征值减去该帧图像的特征重要性矩阵中的对应特征维度的特征值,以获得每一特征维度的特征值差;针对每一特征维度的特征值差,若该特征维度的特征值差小于0,则将0作为该帧图像所对应的剩余特征重要性矩阵中的对应特征维度的特征值;若该特征维度的特征值差大于或等于0,则将该特征维度的特征值差作为该帧图像所对应的剩余特征重要性矩阵中的对应特征维度的特征值。Optionally, the subtracting the feature importance matrix of the frame image from the feature importance matrix of the current video to obtain the remaining feature importance matrix corresponding to the frame image includes: dividing the feature importance matrix of the current video The eigenvalue of each feature dimension in the matrix is subtracted from the eigenvalue of the corresponding feature dimension in the feature importance matrix of the frame image to obtain the eigenvalue difference of each feature dimension; for the eigenvalue difference of each feature dimension, If the eigenvalue difference of this feature dimension is less than 0, then use 0 as the eigenvalue of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame image; if the eigenvalue difference of this feature dimension is greater than or equal to 0, then set The eigenvalue difference of the feature dimension is used as the eigenvalue of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame image.
可选的,所述根据每一帧图像的特征矩阵,提取出预设数量的责任帧,包括:针对每一帧图像,将该帧图像的特征矩阵中的每个特征维度的特征值乘以该特征维度的贡献权重值,以获取该帧图像的特征熵矩阵;对所有帧图像的特征熵矩阵进行最大池化操作,以获取所述待提取视频的视频特征熵矩阵;根据每一帧图像的特征熵矩阵和所述视频特征熵矩阵,提取出预设数量的责任帧。Optionally, extracting a preset number of responsible frames according to the feature matrix of each frame image includes: for each frame image, multiplying the feature value of each feature dimension in the feature matrix of the frame image by The contribution weight value of the feature dimension to obtain the feature entropy matrix of the frame image; perform a maximum pooling operation on the feature entropy matrix of all frame images to obtain the video feature entropy matrix of the video to be extracted; according to each frame of image The feature entropy matrix and the video feature entropy matrix are used to extract a preset number of responsible frames.
可选的,所述根据每一帧图像的特征熵矩阵和所述视频特征熵矩阵,提取出预设数量的责任帧,包括:针对每一帧图像,将该帧图像的特征熵矩阵中的所有特征维度的特征值相加,以获取该帧图像的 评估分值;将所述视频特征熵矩阵中的所有特征维度的特征值相加,以获取所述待提取视频的评估分值;根据每一帧图像的评估分值和所述待提取视频的评估分值,提取出预设数量的责任帧,其中,所述待提取视频的评估分值与由所述预设数量的责任帧所构成的图像集合的评估分值的差值最小。Optionally, the extracting a preset number of responsible frames according to the feature entropy matrix of each frame image and the video feature entropy matrix includes: for each frame image, the The eigenvalues of all feature dimensions are added to obtain the evaluation score of the frame image; the eigenvalues of all the feature dimensions in the video feature entropy matrix are added to obtain the evaluation score of the video to be extracted; The evaluation score of each frame image and the evaluation score of the video to be extracted extract a preset number of responsible frames, wherein the evaluation score of the video to be extracted is the same as that determined by the preset number of responsible frames The resulting set of images has the smallest difference in evaluation scores.
可选的,所述根据每一帧图像的评估分值和所述待提取视频的评估分值,提取出预设数量的责任帧,包括:步骤A2、针对每一帧图像,计算所述待提取视频的评估分值与该帧图像的评估分值的差值,以获取该帧图像的特征熵差;步骤B2、将特征熵差最小的图像确定为责任帧;步骤C2、将所有的责任帧与每一非责任帧分别组成一图像集合,并分别计算每一图像集合的评估分值;步骤D2、针对每一图像集合,计算所述待提取视频的评估分值与该图像集合的评估分值的差值,以获取该图像集合的特征熵差;步骤E2、将特征熵差最小的图像集合中的所有图像确定为责任帧;重复上述步骤C2至步骤E2,直至提取出预设数量的责任帧。Optionally, the extraction of a preset number of responsible frames according to the evaluation score of each frame of image and the evaluation score of the video to be extracted includes: step A2, for each frame of image, calculating the Extract the difference between the evaluation score of the video and the evaluation score of the frame image to obtain the feature entropy difference of the frame image; step B2, determine the image with the smallest feature entropy difference as the responsibility frame; step C2, set all responsibility The frame and each non-responsible frame form an image set respectively, and calculate the evaluation score of each image set respectively; step D2, for each image set, calculate the evaluation score of the video to be extracted and the evaluation of the image set score difference to obtain the feature entropy difference of the image set; step E2, determine all the images in the image set with the smallest feature entropy difference as responsible frames; repeat the above steps C2 to step E2 until the preset number is extracted responsibility frame.
可选的,所述责任帧提取方法还包括:采用目标检测神经网络模型对所获取的待提取视频中的每一帧图像进行感兴趣区域的提取,以获取每一帧图像所对应的感兴趣区域图像;采用静态图像分类神经网络模型的骨架网络对每一帧感兴趣区域图像进行特征提取,以获取每一帧感兴趣区域图像的特征矩阵;根据各帧感兴趣区域图像的特征矩阵,进行恶性责任帧的提取,直至由所有的所述恶性责任帧所构成的恶性责任帧集合所对应的恶性特征熵达到最小值;和/或者根据各帧感兴趣区域图像的特征矩阵,进行良性责任帧的提取,直至由所有的所述良性责任帧所构成的良性责任帧集合所对应的良性特征熵达到最小值。Optionally, the responsible frame extraction method further includes: using a target detection neural network model to extract the region of interest for each frame of image in the acquired video to be extracted, so as to obtain the region of interest corresponding to each frame of image Region image; use the skeleton network of the static image classification neural network model to extract the features of each frame of the region of interest image to obtain the feature matrix of each frame of the region of interest image; according to the feature matrix of each frame of the region of interest image, perform The extraction of malicious responsible frames until the malignant feature entropy corresponding to the set of malicious responsible frames formed by all the malicious responsible frames reaches a minimum value; until the benign feature entropy corresponding to the benign responsibility frame set composed of all the benign responsibility frames reaches the minimum value.
为达到上述目的,本申请还提供一种视频分类方法,包括:采用上文所述的责任帧提取方法,从所获取的视频中提取出预设数量的责任帧;根据所述预设数量的责任帧的特征矩阵,进行视频的分类。In order to achieve the above purpose, the present application also provides a video classification method, including: using the method for extracting responsible frames described above to extract a preset number of responsible frames from the acquired video; The feature matrix of the responsible frame is used to classify the video.
可选的,所述根据所述预设数量的责任帧的特征矩阵,进行视频的分类,包括:对所述预设数量的责任帧的特征矩阵进行最大池化操作,以获取责任帧集合的特征矩阵;根据所述责任帧集合的特征矩阵进行视频的分类。Optionally, classifying the video according to the feature matrix of the preset number of responsible frames includes: performing a maximum pooling operation on the feature matrix of the preset number of responsible frames to obtain a set of responsible frames. Feature matrix: video classification is performed according to the feature matrix of the responsible frame set.
可选的,所述根据所述责任帧集合的特征矩阵进行视频的分类,包括:将所述责任帧集合的特征矩阵输入视频分类模型中,以进行视频的分类。Optionally, the performing video classification according to the feature matrix of the responsible frame set includes: inputting the feature matrix of the responsible frame set into a video classification model to perform video classification.
可选的,所述视频分类模型为随机森林分类模型。Optionally, the video classification model is a random forest classification model.
可选的,所述视频分类方法还包括对所述视频的分类结果以及所提取出的预设数量的责任帧进行显示。Optionally, the video classification method further includes displaying the classification result of the video and the extracted preset number of responsible frames.
为达到上述目的,本申请还提供一种电子设备,包括处理器和存储器,所述存储器上存储有计算机程序,所述计算机程序被所述处理器执行时,实现上文所述的责任帧提取方法或上文所述的视频分类方法。In order to achieve the above purpose, the present application also provides an electronic device, including a processor and a memory, and a computer program is stored on the memory, and when the computer program is executed by the processor, the above-mentioned responsibility frame extraction is realized method or the video classification method described above.
为达到上述目的,本申请还提供一种可读存储介质,所述可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时,实现上文所述的责任帧提取方法或视频分类方法。In order to achieve the above purpose, the present application also provides a readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the method for extracting responsible frames or video Classification.
与现有技术相比,本申请提供的责任帧提取方法、视频分类方法、电子设备和存储介质具有以下优点:Compared with the prior art, the responsible frame extraction method, video classification method, electronic equipment and storage medium provided by this application have the following advantages:
(1)本申请提供的责任帧提取方法、电子设备和存储介质,通过先获取待提取视频;再采用静态图像分类神经网络模型的骨架网络对所述待提取视频中的每一帧图像进行特征提取,以获取每一帧图像的特征矩阵;最后根据每一帧图像的特征矩阵,提取出预设数量的责任帧。由此,可以自动提取出贡献特征不重复的多张责任帧,实现了不需人为定义隔帧提取距离即可以提取出特征多样化的责任帧,提取出的责任帧可以为后续的视频分类奠定良好的基础,有效消除了在视频分类过程中,噪声帧图像对视频分类所造成的干扰。(1) The responsible frame extraction method, electronic equipment and storage medium provided by the application, by first obtaining the video to be extracted; then using the skeleton network of the static image classification neural network model to perform features on each frame of the image in the video to be extracted Extract to obtain the feature matrix of each frame of image; finally, extract a preset number of responsible frames according to the feature matrix of each frame of image. As a result, multiple responsible frames with non-repetitive contribution features can be automatically extracted, and it is possible to extract responsible frames with diverse features without manually defining the extraction distance between frames. The extracted responsible frames can lay the foundation for subsequent video classification. A good foundation effectively eliminates the interference caused by noise frame images on video classification during the video classification process.
(2)本申请提供的视频分类方法通过采用上文所述的责任帧提取方法提取出预设数量的责任帧;并根据所提取出的预设数量的责任帧的特征矩阵,进行视频的分类。由于本申请提供的视频分类方法是采用上文所述的责任帧提取方法提取出预设数量的责任帧,由此,本申请提供的视频分类方法具有上文所述的责任帧提取方法的所有优点。此外,由于本申请提供的视频分类方法是根据所提取出的预设数量的责任帧进行视频的分类,由此可以有效减少所述视频中的噪声帧的干扰,有效提高了视频分类的准确率。(2) The video classification method provided by this application extracts a preset number of responsible frames by using the above-mentioned responsible frame extraction method; and classifies the video according to the feature matrix of the extracted preset number of responsible frames . Since the video classification method provided by this application uses the above-mentioned responsible frame extraction method to extract a preset number of responsible frames, thus, the video classification method provided by this application has all the above-mentioned responsible frame extraction methods. advantage. In addition, since the video classification method provided by this application classifies videos based on the extracted preset number of responsible frames, it can effectively reduce the interference of noise frames in the video and effectively improve the accuracy of video classification .
附图说明Description of drawings
图1为本申请一实施方式中的责任帧提取方法的流程示意图;FIG. 1 is a schematic flow diagram of a responsibility frame extraction method in an embodiment of the present application;
图2为一具体示例中的调整后的待提取视频中的单帧图像的示意图;Fig. 2 is a schematic diagram of an adjusted single frame image in the video to be extracted in a specific example;
图3为本申请一具体示例中的获取待提取视频中的每一帧图像的特征矩阵的示意图;Fig. 3 is a schematic diagram of obtaining the feature matrix of each frame image in the video to be extracted in a specific example of the present application;
图4为本申请一具体示例中的获取视频特征重要性矩阵和每一帧图像的特征重要性矩阵的示意图;Fig. 4 is a schematic diagram of acquiring video feature importance matrix and feature importance matrix of each frame image in a specific example of the present application;
图5为本申请一实施方式中的提取责任帧的具体流程示意图;FIG. 5 is a schematic diagram of a specific flow of extracting a responsibility frame in an embodiment of the present application;
图6为本申请一具体示例中的获取剩余特征重要性矩阵的示意图;6 is a schematic diagram of obtaining the remaining feature importance matrix in a specific example of the present application;
图7为本申请另一实施方式中的提取责任帧的具体流程示意图;FIG. 7 is a schematic diagram of a specific flow of extracting a responsibility frame in another embodiment of the present application;
图8a为本申请一具体示例中的生产视频特征熵矩阵的示意图;Figure 8a is a schematic diagram of a production video feature entropy matrix in a specific example of the present application;
图8b为本申请一具体示例中的选取第一帧责任帧的示意图;Fig. 8b is a schematic diagram of selecting the first frame responsibility frame in a specific example of the present application;
图8c为本申请一具体示例中的选取第二帧责任帧的示意图;Fig. 8c is a schematic diagram of selecting the second frame responsibility frame in a specific example of the present application;
图9为本申请一实施方式提供的责任帧提取方法的流程图;FIG. 9 is a flowchart of a responsibility frame extraction method provided in an embodiment of the present application;
图10为本申请一具体示例提供的医学图像的示意图;Fig. 10 is a schematic diagram of a medical image provided by a specific example of the present application;
图11为从图10中提取出的感兴趣区域图像的示意图;Fig. 11 is a schematic diagram of the region of interest image extracted from Fig. 10;
图12为本申请一实施方式提供的提取恶性责任帧的具体流程示意图;FIG. 12 is a schematic flowchart of extracting malicious responsibility frames provided by an embodiment of the present application;
图13为本申请一实施方式提供的责任帧图像集合的特征熵与责任帧数量之间的关系示意图;Fig. 13 is a schematic diagram of the relationship between the feature entropy of the responsible frame image set and the number of responsible frames provided by an embodiment of the present application;
图14为本申请一实施方式提供的提取良性责任帧的具体流程示意图;FIG. 14 is a schematic diagram of a specific flow for extracting benign responsibility frames provided by an embodiment of the present application;
图15为本申请一实施方式提供的采用随机森林分类器进行视频分类的示意图;FIG. 15 is a schematic diagram of video classification using a random forest classifier provided in an embodiment of the present application;
图16为本申请一实施方式提供的调整责任帧的示意图;FIG. 16 is a schematic diagram of an adjustment responsibility frame provided by an embodiment of the present application;
图17为本申请一实施方式中的视频分类方法的流程示意图;FIG. 17 is a schematic flow diagram of a video classification method in an embodiment of the present application;
图18为本申请一实施方式中的电子设备的方框结构示意图。FIG. 18 is a schematic block diagram of an electronic device in an embodiment of the present application.
其中,附图标记如下:Wherein, the reference signs are as follows:
处理器-101;通信接口-102;存储器-103;通信总线104。Processor-101; communication interface-102; memory-103; communication bus 104.
具体实施方式Detailed ways
以下结合附图和具体实施方式对本申请提出的责任帧提取方法、视频分类方法、电子设备和存储介质作进一步详细说明。根据下面说明,本申请的优点和特征将更清楚。需要说明的是,附图采用非常简化的形式且均使用非精准的比例,仅用以方便、明晰地辅助说明本申请实施方式的目的。为了使本申请的目的、特征和优点能够更加明显易懂,请参阅附图。须知,本说明书所附图式所绘示的结构、比例、大小等,均仅用以配合说明书所揭示的内容,以供熟悉此技术的人士了解与阅读,并非用以限定本申请实施的限定条件,任何结构的修饰、比例关系的改变或大小的调整,在与本申请所能产生的功效及所能达成的目的相同或近似的情况下,均应仍落在本申请所揭示的技术内容能涵盖的范围内。The responsible frame extraction method, video classification method, electronic equipment, and storage medium proposed in this application will be further described in detail below in conjunction with the accompanying drawings and specific embodiments. The advantages and features of the present application will become clearer from the following description. It should be noted that the drawings are in a very simplified form and all use imprecise scales, which are only used to facilitate and clearly assist the purpose of illustrating the embodiments of the present application. In order to make the object, features and advantages of the present application more comprehensible, please refer to the accompanying drawings. It should be noted that the structures, proportions, sizes, etc. shown in the drawings attached to this specification are only used to match the content disclosed in the specification, for those who are familiar with this technology to understand and read, and are not used to limit the implementation of this application. Conditions, any modification of structure, change of proportional relationship or adjustment of size, under the same or similar situation as the effect and purpose that this application can produce, should still fall within the technical content disclosed in this application. within the range that can be covered.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that in this article, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that there is a relationship between these entities or operations. There is no such actual relationship or order between them. Furthermore, the term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus comprising a set of elements includes not only those elements, but also includes elements not expressly listed. other elements of or also include elements inherent in such a process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising said element.
本申请的核心思想在于提供一种责任帧提取方法、视频分类方法、电子设备和存储介质,可以自动地在视频中找出为视频分类(例如良恶性结节视频的分类)贡献出不同重要特征的责任帧,以提高视频分类(例如良恶性结节视频的分类)的准确性。The core idea of this application is to provide a responsible frame extraction method, video classification method, electronic equipment and storage medium, which can automatically find out in the video that contributes to different important features for video classification (such as the classification of benign and malignant nodules) frame of responsibility to improve the accuracy of video classification (such as the classification of benign and malignant nodule videos).
需要说明的是,本申请实施方式的责任帧提取方法和视频分类方法可应用于本申请实施方式的电子设备上,其中,该电子设备可以是个人计算机、移动终端等,该移动终端可以是手机、平板电脑等具有各种操作系统的硬件设备。此外,需要说明的是,虽然本文是以从医学视频中提取出预设数量的能够为医学视频的分类贡献不同特征的责任帧为例进行说明,但是如本领域技术人员所能理解的,本申请还可以从除医学视频以外的其它领域的视频中提取出预设数量的能够为除医学视频的分类以外的其它领域的视频的分类贡献不同特征的责任帧,本申请对此并不进行限定。It should be noted that the responsible frame extraction method and the video classification method of the embodiments of the present application can be applied to the electronic device of the embodiment of the present application, wherein the electronic device can be a personal computer, a mobile terminal, etc., and the mobile terminal can be a mobile phone , Tablet PC and other hardware devices with various operating systems. In addition, it should be noted that although this article takes the example of extracting a preset number of responsible frames from medical videos that can contribute different features to the classification of medical videos, as those skilled in the art can understand, this The application can also extract a preset number of responsible frames that can contribute different features to the classification of videos in other fields than medical videos, and this application does not limit this .
为实现上述思想,本申请提供一种责任帧提取方法,请参考图1,其示意性地给出了本申请一实施方式提供的责任帧提取方法的流程示意图。如图1所示,所述责任帧提取方法包括如下步骤:In order to realize the above idea, the present application provides a responsibility frame extraction method, please refer to FIG. 1 , which schematically shows a flow diagram of the responsibility frame extraction method provided by an embodiment of the present application. As shown in Figure 1, the responsibility frame extraction method includes the following steps:
步骤S110、获取待提取视频。Step S110, acquiring the video to be extracted.
步骤S120、采用静态图像分类神经网络模型的骨架网络对所述待提取视频中的每一帧图像进行特征提取,以获取每一帧图像的特征矩阵。Step S120, using the skeleton network of the static image classification neural network model to perform feature extraction on each frame of image in the video to be extracted, so as to obtain a feature matrix of each frame of image.
步骤S130、根据每一帧图像的特征矩阵,提取出预设数量的责任帧。Step S130, extracting a preset number of responsible frames according to the feature matrix of each frame of image.
具体地,所述待提取视频可以为超声扫查视频(例如乳腺癌、甲状腺结节等扫查数据),当然,如本领域技术人员所能理解的,所述待提取视频还可以为其它医学影像设备采集的医学视频,例如,内窥镜采集的医学视频等,此外,所述待提取视频还可以为非医学视频,本申请对此并不进行限定。由此,本申请提供的责任帧提取方法可以自动提取出贡献特征不重复的多张责任帧,实现了不需人为定义隔帧提取距离(即每隔多少帧提取一张责任帧)即可以提取出特征多样化的责任帧,提取出的责任帧可以为后续的视频分类奠定良好的基础,有效消除了在视频分类过程中,噪声帧图像对视频分类所造成的干扰。例如,通过在甲状腺结节的超声视频中提取出预设数量的责任帧,可以为后续准确的判断该超声视频中的甲状腺结节是良性的还是恶性的奠定良好的基础。Specifically, the video to be extracted can be an ultrasound scan video (such as scan data of breast cancer, thyroid nodule, etc.), of course, as those skilled in the art can understand, the video to be extracted can also be other medical The medical video collected by an imaging device, for example, the medical video collected by an endoscope, etc. In addition, the video to be extracted may also be a non-medical video, which is not limited in this application. Therefore, the responsibility frame extraction method provided by this application can automatically extract multiple responsibility frames with non-repetitive contribution features, and realizes that it can extract Responsibility frames with diverse features can be extracted, which can lay a good foundation for subsequent video classification, and effectively eliminate the interference caused by noise frame images on video classification during the video classification process. For example, by extracting a preset number of responsible frames from an ultrasound video of a thyroid nodule, a good foundation can be laid for subsequent accurate judgment of whether the thyroid nodule in the ultrasound video is benign or malignant.
进一步地,由于所述神经网络模型需要统一大小的图像作为输入,因此,在采用静态图像分类神经网络模型的骨架网络对所述待提取视频中的每一帧图像进行特征提取之前,所述方法还包括对所述待提取视频中的每一帧图像的尺寸进行调整,以将所述待提取视频中的每一帧图像的尺寸调整至预设尺寸。所述预设尺寸可以根据具体情况进行设置,本申请对此并不进行限定。作为一种示例,调整后的待提取视频的尺寸为100×224×224×3(帧数×宽×高×通道数)。请参考图2,其示意性地给出了一具体示例中的调整后的待提取视频中的单帧图像的示意图。Further, since the neural network model requires an image of uniform size as input, before using the skeleton network of the static image classification neural network model to perform feature extraction on each frame image in the video to be extracted, the method It also includes adjusting the size of each frame of image in the video to be extracted, so as to adjust the size of each frame of image in the video to be extracted to a preset size. The preset size can be set according to specific conditions, which is not limited in this application. As an example, the size of the adjusted video to be extracted is 100×224×224×3 (number of frames×width×height×number of channels). Please refer to FIG. 2 , which schematically shows a schematic diagram of an adjusted single-frame image in a video to be extracted in a specific example.
在一种示范性的实施方式中,所述静态图像分类神经网络模型包括用于进行特征提取的骨架网络和用于进行分类的分类网络。其中,所述骨架网络可以选用不同的卷积神经网络,例如MobileNet网络、DenseNet121网络、Xception网络等。关于MobileNet网络、DenseNet121网络、Xception网络的更多内容可以参考现有技术,故本文对此不再进行赘述。所述分类网络包括至少一个全连接层,所述全连接层用于对所述分类网络所提取出的特征进行非线性映射回归,以获取分类结果。请参考图3,其示意性地给出了本申请一具体示例中的获取待提取视频中的每一帧图像的特征矩阵的示意图。如图3所示,通过所述静态图像分类神经网络模型中的骨架网络对所述待提取视频中的每一帧图像进行多次(例如N次卷积)卷积操作,即可获取每一帧图像的特征矩阵,每一帧图像的特征矩阵可以用一个1×k的矩阵表示,其中k表示特征维度,由静态图像分类神经网络模型的结构决定。In an exemplary implementation, the static image classification neural network model includes a skeleton network for feature extraction and a classification network for classification. Wherein, the skeleton network can use different convolutional neural networks, such as MobileNet network, DenseNet121 network, Xception network and so on. For more information about the MobileNet network, DenseNet121 network, and Xception network, you can refer to the existing technology, so this article will not repeat them. The classification network includes at least one fully connected layer, and the fully connected layer is used to perform nonlinear mapping regression on the features extracted by the classification network to obtain classification results. Please refer to FIG. 3 , which schematically shows a schematic diagram of acquiring the feature matrix of each frame of image in the video to be extracted in a specific example of the present application. As shown in Figure 3, the skeleton network in the static image classification neural network model performs multiple (for example N convolution) convolution operations on each frame image in the video to be extracted to obtain each The feature matrix of the frame image, the feature matrix of each frame image can be represented by a 1×k matrix, where k represents the feature dimension, which is determined by the structure of the static image classification neural network model.
具体地,所述静态图像分类神经网络模型通过以下步骤训练得到:Specifically, the static image classification neural network model is obtained through the following steps of training:
获取原始训练样本,所述原始训练样本包括原始样本图像和与所述原始样本图像对应的分类标签;Obtaining an original training sample, the original training sample including an original sample image and a classification label corresponding to the original sample image;
对所述原始训练样本进行扩展,以获取扩展后的训练样本;expanding the original training samples to obtain expanded training samples;
设置静态图像分类神经网络模型的模型参数的初始值;Set the initial values of the model parameters of the static image classification neural network model;
根据所述扩展后的训练样本和所述静态图像分类神经网络模型的模型参数的初始值对预先搭建的静态图像分类神经网络模型进行训练,直至满足预设训练结束条件。The pre-built static image classification neural network model is trained according to the expanded training samples and the initial values of the model parameters of the static image classification neural network model until the preset training end condition is satisfied.
由于原始训练样本的数据有限,而深度学习需要在一定数据上进行学习才能具有一定的鲁棒性,为了增加鲁棒性,需要做数据扩增操作,以增加所述静态图像分类神经网络模型的泛化能力。具体地,可以通过对所述原始样本图像进行随机刚性变换,具体包括:旋转、缩放、平移、翻转和灰度变换。更具体地,可以对所述原始样本图像平移-10到10个像素、旋转-10°到10°、水平翻转、垂直翻转、缩放0.9到1.1倍、灰度变换等以完成对训练样本的数据扩增。需要说明的是,由于对原始样本图像所进行的随机刚性变换不会对其分类结果造成影响,因此在进行样本扩展时,分类标签不需进行变换,即由同一个原始样本图像进行不同变换所得到的多个扩展后的样本图像所对应的分类标签都是与该原始样本图像所对应的分类标签一致的。Due to the limited data of the original training samples, deep learning needs to learn on certain data to have a certain robustness. In order to increase the robustness, a data amplification operation is required to increase the performance of the static image classification neural network model. Generalization. Specifically, a random rigid transformation may be performed on the original sample image, specifically including: rotation, scaling, translation, flipping, and grayscale transformation. More specifically, the original sample image can be translated by -10 to 10 pixels, rotated by -10° to 10°, horizontally flipped, vertically flipped, scaled by 0.9 to 1.1 times, grayscale transformation, etc. to complete the training sample data Amplify. It should be noted that since the random rigid transformation performed on the original sample image will not affect its classification results, the classification label does not need to be transformed when performing sample expansion, that is, it is obtained by different transformations of the same original sample image. The classification labels corresponding to the obtained expanded sample images are all consistent with the classification labels corresponding to the original sample image.
The model parameters of the static image classification neural network model fall into two categories: feature parameters and hyperparameters. Feature parameters are the parameters used to learn image features and include weight parameters and bias parameters. Hyperparameters are parameters set manually before training; only with suitable hyperparameters can the feature parameters be learned from the samples. The hyperparameters may include the learning rate, the number of hidden layers, the convolution kernel size, the number of training iterations, and the batch size of each iteration. The learning rate can be regarded as a step size. For example, in the present application the learning rate may be set to 0.001 and the number of training iterations to 100.
The preset training end condition is that the error between the predicted classification results of the sample images in the expanded training samples and the corresponding classification labels converges to a preset error value. In addition, since the training of the static image classification neural network model is an iterative process with multiple cycles, the training can also be ended after a set number of iterations; that is, the preset training end condition may alternatively be that the number of iterations reaches a preset number of iterations.
Further, training the pre-built static image classification neural network model according to the expanded training samples and the initial values of the model parameters of the static image classification neural network model includes:
training the pre-built static image classification neural network model with a stochastic gradient descent method according to the expanded training samples and the initial values of the model parameters of the static image classification neural network model.
Since the model training process is essentially a process of minimizing a loss function, and computing derivatives achieves this goal quickly and simply, the derivative-based method used is the gradient descent method. Training the static image classification neural network model with the gradient descent method therefore allows the training to be accomplished quickly and simply.
In the deep learning of the present application, the gradient descent method is mainly used to train the static image classification neural network model, and the backpropagation algorithm is then used to update and optimize the weight parameters and bias parameters of the model. The gradient descent method takes the direction of steepest slope as the direction that reaches the optimal value fastest, while the backpropagation method uses the chain rule of differentiation to compute partial derivatives and update the weights; the parameters are updated through repeated iterative training so that the image features are learned. The backpropagation algorithm updates the weight parameters and bias parameters as follows:
1. First, perform forward propagation: the parameters are updated through repeated iterative training so that the image is learned, and the activation values of all layers (convolutional layers and deconvolutional layers) are computed, that is, the activation maps obtained after the image has passed through the convolution operations;
2. For the output layer (layer n_l), compute the sensitivity value δ^(n_l):

δ^(n_l) = -(y - ŷ) · f'(z^(n_l))

where y is the ground-truth value of the sample, ŷ is the predicted value of the output layer, and f'(z^(n_l)) denotes the partial derivative with respect to the output layer parameters;
3. For each of the layers l = n_l - 1, n_l - 2, ..., compute the sensitivity value δ^(l):

δ^(l) = ((W^(l))^T · δ^(l+1)) · f'(z^(l))
where W^(l) denotes the weight parameters of layer l, δ^(l+1) denotes the sensitivity value of layer l+1, and f'(z^(l)) denotes the partial derivative of layer l;
4. Update the weight parameters and bias parameters of each layer:

W^(l) = W^(l) - α · δ^(l+1) · (a^(l))^T

b^(l) = b^(l) - α · δ^(l+1)

where W^(l) and b^(l) denote the weight parameters and bias parameters of layer l, respectively, α is the learning rate, a^(l) denotes the output value of layer l, and δ^(l+1) denotes the sensitivity value of layer l+1.
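As a rough illustration of the update rule above (itself reconstructed from the surrounding definitions), the following NumPy sketch performs one gradient-descent update of a single layer; all variable names are assumptions introduced for illustration, and the sensitivity values are assumed to have been computed by the backward pass.

```python
import numpy as np


def update_layer(W, b, a_l, delta_next, lr=0.001):
    """One gradient-descent update of the weights and biases of layer l.

    W          : weight matrix of layer l, shape (n_next, n_l)
    b          : bias vector of layer l, shape (n_next,)
    a_l        : output value of layer l, shape (n_l,)
    delta_next : sensitivity value of layer l+1, shape (n_next,)
    lr         : learning rate (the step size alpha above)
    """
    W_new = W - lr * np.outer(delta_next, a_l)   # W <- W - alpha * delta^(l+1) (a^(l))^T
    b_new = b - lr * delta_next                  # b <- b - alpha * delta^(l+1)
    return W_new, b_new
```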
Still further, training the pre-built static image classification neural network model with the stochastic gradient descent method according to the expanded training samples and the initial values of the model parameters of the static image classification neural network model includes:
Step 1: taking the expanded training samples as the input of the static image classification neural network model, and obtaining the predicted classification results of the expanded sample images according to the initial values of the model parameters of the static image classification neural network model;
Step 2: calculating a loss function value according to the predicted classification results of the expanded sample images and the classification labels corresponding to the expanded sample images; and
Step 3: judging whether the loss function value has converged to the preset error value; if so, ending the training; if not, adjusting the model parameters of the static image classification neural network model, updating the initial values of the model parameters of the static image classification neural network model to the adjusted model parameters, and returning to step 1.
When the loss function value has not converged to the preset error value, the static image classification neural network model is not yet accurate and needs further training; in that case the model parameters are adjusted, the initial values of the model parameters are updated to the adjusted model parameters, and the process returns to step 1 to enter the next iteration.
It can be seen that the loss function is the objective function used to optimize the neural network, and minimizing it makes the neural network learn better. Since the static image classification neural network model can only learn image features within an appropriate setting, a suitable loss function must be defined for effective features to be learned. The present application uses a binary classification network loss function L(W, b) as the loss function.
The binary classification network loss function L(W, b) is as follows:

L(W, b) = -(1/m) Σ_{i=1}^{m} [ y_i · log f_{W,b}(x_i) + (1 - y_i) · log(1 - f_{W,b}(x_i)) ]
where W and b denote the weight parameters and bias parameters of the static image classification neural network model, m is the number of training samples (a positive integer), x_i denotes the i-th input training sample, f_{W,b}(x_i) denotes the predicted classification result of the i-th training sample, and y_i denotes the classification label of the i-th training sample.
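A minimal NumPy sketch of this loss together with the iterative steps 1 to 3 above might look as follows; model_forward, model_backward, and the stopping threshold are placeholders introduced for illustration, and the loss shown is the standard binary cross-entropy form consistent with the symbols defined above.

```python
import numpy as np


def binary_classification_loss(y_pred, y_true):
    """L(W, b): mean binary cross-entropy over the m training samples."""
    eps = 1e-12  # numerical guard against log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))


def train(model_forward, model_backward, params, x, y, lr=0.001, max_iter=100, tol=1e-3):
    """Steps 1-3: predict, compute the loss, and update until convergence."""
    for _ in range(max_iter):
        y_pred = model_forward(params, x)              # step 1: predicted classification results
        loss = binary_classification_loss(y_pred, y)   # step 2: loss function value
        if loss <= tol:                                # step 3: converged to the preset error value
            break
        grads = model_backward(params, x, y)           # gradients from backpropagation (see above)
        params = {k: v - lr * grads[k] for k, v in params.items()}
    return params
```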
In an exemplary embodiment, extracting a preset number of responsible frames according to the feature matrix of each frame image includes:
performing a maximum pooling operation on the feature matrices of all frame images to obtain the video feature matrix of the video to be extracted; and
extracting a preset number of responsible frames according to the feature matrix of each frame image and the video feature matrix.
Specifically, performing the maximum pooling operation on the feature matrices of all frame images means taking, along the column direction (that is, along the feature dimensions), the maximum feature value of the feature matrices of all frame images in the video to be extracted (for example, 100 frames), so as to obtain a 1×k video feature matrix in which the value of each feature dimension is the maximum value of that feature dimension over the feature matrices of all frame images. The video feature matrix thus obtained combines the important information that each frame image can contribute. Since a video is essentially a superposition of multiple frame images and the feature information of the video is scattered over the individual frames, the video feature matrix obtained by max pooling the feature matrices of all frame images represents the features of the video to be extracted.
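As a concrete illustration, this max pooling of the per-frame 1×k feature matrices into a single video feature matrix takes only a couple of lines of NumPy; the array shapes below are illustrative assumptions.

```python
import numpy as np

# frame_features: one 1×k feature matrix per frame, stacked into shape (n_frames, k)
frame_features = np.random.rand(100, 512)   # e.g. 100 frames, k = 512 (illustrative values)

# Max pooling along the frame axis: each feature dimension keeps its maximum
# value over all frames, giving the 1×k video feature matrix.
video_feature_matrix = frame_features.max(axis=0, keepdims=True)   # shape (1, 512)
```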
Further, extracting a preset number of responsible frames according to the feature matrix of each frame image and the video feature matrix includes:
multiplying the feature value of each feature dimension in the video feature matrix by the importance value of that feature dimension to obtain a video feature importance matrix;
for each frame image, multiplying the feature value of each feature dimension in the feature matrix of the frame image by the importance value of that feature dimension to obtain the feature importance matrix of the frame image; and
extracting a preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame image.
Specifically, the importance value of each feature dimension may represent the importance of the features of that dimension in a random forest classification model described below; these values are defined by the random forest classification model and are all positive. Of course, as can be understood by those skilled in the art, in some other embodiments the importance value of each feature dimension may instead represent the importance of the features of that dimension in a classification model other than the random forest classification model, and the present application is not limited in this respect. Please refer to FIG. 4, which schematically shows how the video feature importance matrix and the feature importance matrix of each frame image are obtained in a specific example of the present application.
Further, please refer to FIG. 5, which schematically shows a specific flowchart of extracting responsible frames provided by an embodiment of the present application. As shown in FIG. 5, extracting a preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame image includes:
Step A1: taking the video feature importance matrix as the current video feature importance matrix;
Step B1: for each frame image, subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image;
Step C1: for each frame image, adding up the feature values of all feature dimensions in the remaining feature importance matrix corresponding to the frame image to obtain the remaining information entropy corresponding to the frame image;
Step D1: taking the image with the smallest remaining information entropy as the current responsible frame;
Step E1: taking the remaining feature importance matrix corresponding to the current responsible frame as the new current video feature importance matrix; and
repeating steps B1 to E1 until the preset number of responsible frames have been extracted.
Specifically, the feature value of each feature dimension in a feature importance matrix can be regarded as an amount of information. Accordingly, the feature value of each dimension in the video feature importance matrix is regarded as the total amount of information contributed by the entire video in that dimension, and the feature value of each dimension in the feature importance matrix of a frame image is regarded as the single-frame amount of information contributed by that frame in that dimension. For each frame image, the remaining feature importance matrix corresponding to that frame is obtained by subtracting the feature importance matrix of the frame from the video feature importance matrix. For each frame image, the amounts of information (that is, the feature values) of all feature dimensions in its remaining feature importance matrix are added up, and the resulting sum is the remaining information entropy of the video after that frame has been removed; finding the frame that produces the smallest remaining information entropy therefore finds the most important responsible frame. After the most important responsible frame has been found, its remaining feature importance matrix is treated as the new video feature importance matrix, and the same method is used to find the second most important responsible frame; the remaining feature importance matrix of that frame is then treated as the new video feature importance matrix, and the same method is used to find the next most important responsible frame, and so on, until the preset number of responsible frames have been found.
Because the feature importance matrix of the most important responsible frame has been subtracted from the video feature importance matrix, the feature dimension that once contributed the most information has had the large amount of information contributed by that responsible frame removed, so the remaining information in that dimension becomes very small. Consequently, when the second most important responsible frame is selected, the feature dimensions to which it contributes a large amount of information will differ from those of the most important responsible frame selected first. It can thus be seen that the responsible frame extraction method provided by this embodiment can extract responsible frames with diverse features without needing to manually define an inter-frame extraction distance. By performing a reverse feature contribution calculation combined with the idea of decreasing information entropy, this embodiment automatically finds the responsible frames in the video that contribute different important features to video classification (for example, the classification of benign and malignant nodule videos). The responsible frame extraction method provided by this embodiment is highly versatile, can be applied to various CNN (convolutional neural network) models, and has good applicability and transferability.
In an exemplary embodiment, subtracting the feature importance matrix of the frame image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame image includes:
subtracting the feature value of the corresponding feature dimension in the feature importance matrix of the frame image from the feature value of each feature dimension in the current video feature importance matrix to obtain the feature value difference of each feature dimension; and
for the feature value difference of each feature dimension, if the difference is less than 0, taking 0 as the feature value of the corresponding feature dimension in the remaining feature importance matrix of the frame image; and if the difference is greater than or equal to 0, taking the difference as the feature value of the corresponding feature dimension in the remaining feature importance matrix of the frame image.
Please refer to FIG. 6, which schematically shows how the remaining feature importance matrix is obtained in a specific example of the present application. As shown in FIG. 6, the remaining feature importance matrix corresponding to a frame image can be obtained by subtracting, for each feature dimension, the feature value in the feature importance matrix of that frame image from the feature value of the corresponding feature dimension in the video feature importance matrix.
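The following NumPy sketch puts steps A1 to E1 together with the clamped subtraction just described. It is a minimal illustration under the assumption that the per-frame feature matrices and the per-dimension importance values (for example, from a random forest classifier) are already available; all names are introduced for illustration only.

```python
import numpy as np


def extract_responsible_frames(frame_features, importance, num_frames):
    """Iteratively select responsible frames by minimizing the remaining information entropy.

    frame_features : (n_frames, k) feature matrices of all frame images
    importance     : (k,) positive importance value of each feature dimension
    num_frames     : preset number of responsible frames to extract
    """
    frame_importance = frame_features * importance        # per-frame feature importance matrices
    current = frame_features.max(axis=0) * importance     # step A1: video feature importance matrix

    selected = []
    for _ in range(num_frames):
        best_frame, best_entropy, best_remaining = None, None, None
        for i in range(len(frame_features)):
            if i in selected:
                continue
            # Step B1: clamped subtraction gives the remaining feature importance matrix.
            remaining = np.maximum(current - frame_importance[i], 0.0)
            # Step C1: the remaining information entropy is the sum over all feature dimensions.
            entropy = remaining.sum()
            if best_entropy is None or entropy < best_entropy:
                best_frame, best_entropy, best_remaining = i, entropy, remaining
        selected.append(best_frame)   # step D1: smallest remaining information entropy wins
        current = best_remaining      # step E1: its remaining matrix becomes the new current matrix
    return selected
```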
In another exemplary embodiment, extracting a preset number of responsible frames according to the feature matrix of each frame image includes:
for each frame image, multiplying the feature value of each feature dimension in the feature matrix of the frame image by the contribution weight value of that feature dimension to obtain the feature entropy matrix of the frame image;
performing a maximum pooling operation on the feature entropy matrices of all frame images to obtain the video feature entropy matrix of the video to be extracted; and
extracting a preset number of responsible frames according to the feature entropy matrix of each frame image and the video feature entropy matrix.
Specifically, a video can be regarded as a collection of frames, and the information of the entire video is scattered over each frame. The contribution of each frame image in each feature dimension is represented by its feature matrix, where the number of feature dimensions is determined by the skeleton network and each feature dimension represents an image feature in the deep feature space (for example, a feature of a malignant nodule or of a benign nodule). The feature matrix is multiplied by a contribution weight value, which can be determined by the channel weight difference of the classification network (for example, the fully connected layer) in the static image classification neural network model. For example, if the classification network is used for benign/malignant classification, one channel of the classification network corresponds to the malignant category and the other channel corresponds to the benign category, where the weight of the channel corresponding to the malignant category is W_1 and the weight of the channel corresponding to the benign category is W_0. In the basic CNN architecture, the output Y_pred predicted by the model can be expressed as:
Y_pred = Sigmoid([W_0, W_1]^T · X + B) = [Y_0, Y_1]
where Sigmoid denotes the activation function, X denotes the feature matrix, Y_0 denotes the benign probability, and Y_1 denotes the malignant probability.
(W_1^i - W_0^i) · x_i represents the malignant contribution of a single feature dimension i, where (W_1^i - W_0^i) represents the strength with which feature dimension i contributes to the final malignancy and x_i represents the amount of information contributed by the image in feature dimension i. If attention is to be focused on the deep spatial features that characterize malignancy, the contribution of the i-th frame image in the j-th feature dimension can be described by the following equation:

FE_i^j = max(W_1^j - W_0^j, 0) · x_i^j

Therefore, the entire feature entropy matrix of the i-th frame image can be expressed as:

[FE]_i = [FE_i^1, FE_i^2, ..., FE_i^k]

In order to focus on the most representative deep features of the video, MaxPooling (maximum pooling) is used to process the feature entropy matrices of all frame images so as to construct the video feature entropy matrix [FE]_video:

[FE]_video = MaxPooling([FE]_1, [FE]_2, ..., [FE]_n)
It should be noted that, as can be understood by those skilled in the art, in some other embodiments the feature value of each feature dimension in the video feature matrix described above may also be multiplied directly by the contribution weight value of that feature dimension to obtain the video feature entropy matrix.
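Under the reconstruction of the formulas above, a NumPy sketch of the per-frame and video feature entropy matrices might look as follows; w_malignant and w_benign stand for the two channel weights W_1 and W_0 of the classification network, and the exact form of the contribution weight is an assumption consistent with the description rather than a verbatim reproduction of the original formulas.

```python
import numpy as np


def feature_entropy_matrices(frame_features, w_malignant, w_benign):
    """Per-frame feature entropy matrices [FE]_i and the video feature entropy matrix [FE]_video.

    frame_features : (n_frames, k) per-frame feature matrices
    w_malignant    : (k,) channel weights W_1 of the classification network
    w_benign       : (k,) channel weights W_0 of the classification network
    """
    # Contribution weight of each feature dimension (only positive malignant contributions kept).
    contribution = np.maximum(w_malignant - w_benign, 0.0)   # (k,)
    frame_fe = frame_features * contribution                 # (n_frames, k): [FE]_i for every frame
    video_fe = frame_fe.max(axis=0)                          # (k,): [FE]_video via max pooling
    return frame_fe, video_fe
```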
Further, extracting a preset number of responsible frames according to the feature entropy matrix of each frame image and the video feature entropy matrix includes:
for each frame image, adding up the feature values of all feature dimensions in the feature entropy matrix of the frame image to obtain the evaluation score of the frame image;
adding up the feature values of all feature dimensions in the video feature entropy matrix to obtain the evaluation score of the video to be extracted; and
extracting a preset number of responsible frames according to the evaluation score of each frame image and the evaluation score of the video to be extracted, wherein the difference between the evaluation score of the video to be extracted and the evaluation score of the image set formed by the preset number of responsible frames is the smallest.
Specifically, FScore (the evaluation score) is defined as the sum of the feature values of all feature dimensions in a feature entropy matrix; the FScore of the i-th frame image then satisfies the following relation:

FScore_i = Σ_{j=1}^{k} FE_i^j

FScore can be extended from a single frame image to a set of images, and a video can likewise be regarded as a set of images. For an image set A (A = [frame_a, frame_b, ..., frame_n]), the evaluation score FScore of the image set A satisfies the following relation:

FScore_A = Σ_{j=1}^{k} max(FE_a^j, FE_b^j, ..., FE_n^j)
Since the difference between the evaluation score of the video to be extracted and the evaluation score of the image set formed by the finally extracted responsible frames is the smallest, this not only ensures that the information contained in the image set formed by the multiple responsible frames is as close as possible to that of the entire video, but also ensures that the selected responsible frames are complementary to one another in features.
Please continue to refer to FIG. 7, which schematically shows a specific flowchart of extracting responsible frames provided by another embodiment of the present application. As shown in FIG. 7, extracting a preset number of responsible frames according to the evaluation score of each frame image and the evaluation score of the video to be extracted includes:
Step A2: for each frame image, calculating the difference between the evaluation score of the video to be extracted and the evaluation score of the frame image to obtain the feature entropy difference of the frame image;
Step B2: determining the image with the smallest feature entropy difference as a responsible frame;
Step C2: forming an image set from all the responsible frames together with each non-responsible frame, and calculating the evaluation score of each image set;
Step D2: for each image set, calculating the difference between the evaluation score of the video to be extracted and the evaluation score of the image set to obtain the feature entropy difference of the image set;
Step E2: determining all the images in the image set with the smallest feature entropy difference as responsible frames; and
repeating steps C2 to E2 until the preset number of responsible frames have been extracted.
Specifically, for each image set, the maximum pooling operation may first be performed on the feature entropy matrices of all frame images in the image set to obtain the feature entropy matrix of the image set; the feature values of all feature dimensions in the feature entropy matrix of the image set are then added up, and the resulting sum is the evaluation score of the image set. Thus, when the i-th responsible frame top_i is being selected, all the responsible frames already determined (top_1, top_2, ..., top_{i-1}) are combined with each remaining frame image (that is, each frame other than the responsible frames) to form an image set. For example, taking a remaining frame image a as an example, the image set formed by all the determined responsible frames together with the frame image a is [top_1, top_2, ..., top_{i-1}, a], and the difference between the evaluation score of the video to be extracted and the evaluation score of this image set is calculated as:

FScore_video - FScore_[top_1, top_2, ..., top_{i-1}, a]
Thus, by calculating the difference between the evaluation score of the video to be extracted and the evaluation score of each image set, the feature entropy difference of each image set can be obtained, wherein all the images in the image set with the smallest feature entropy difference are responsible frames, that is, the remaining frame image in the image set with the smallest feature entropy difference is the i-th responsible frame. The responsible frame extraction method provided by this embodiment can extract multiple responsible frames whose contributed features for video classification (for example, the classification of benign and malignant nodule videos) do not overlap, without adding extra training parameters; it can be applied to various CNN models and has good applicability and transferability.
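A minimal sketch of steps A2 to E2, building on the feature entropy matrices above, might look as follows; all names are illustrative, and the evaluation score of an image set is computed as described, by max pooling followed by a sum over the feature dimensions.

```python
import numpy as np


def select_by_fscore(frame_fe, num_frames):
    """Greedy responsible-frame selection that minimizes the feature entropy difference.

    frame_fe   : (n_frames, k) feature entropy matrices of all frame images
    num_frames : preset number of responsible frames to extract
    """
    fscore_video = frame_fe.max(axis=0).sum()       # evaluation score of the whole video
    selected = []
    while len(selected) < num_frames:
        best_frame, best_diff = None, None
        for i in range(len(frame_fe)):
            if i in selected:
                continue
            candidate = selected + [i]
            # Evaluation score of the image set: max pooling, then sum over feature dimensions.
            fscore_set = frame_fe[candidate].max(axis=0).sum()
            diff = fscore_video - fscore_set        # feature entropy difference of the image set
            if best_diff is None or diff < best_diff:
                best_frame, best_diff = i, diff
        selected.append(best_frame)                 # the set with the smallest difference wins
    return selected
```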
Please refer to FIG. 8a to FIG. 8c, where FIG. 8a schematically shows how the video feature entropy matrix is obtained in a specific example of the present application, FIG. 8b schematically shows the selection of the first responsible frame in this example, and FIG. 8c schematically shows the selection of the second responsible frame in this example. As shown in FIG. 8a to FIG. 8c, in this specific example the video to be extracted includes 3 frame images, the total number of deep feature dimensions is 3, and the number of responsible frames to be extracted is 2. First, the video feature entropy matrix is obtained by performing the maximum pooling operation on the feature entropy matrices of the 3 frame images. Calculation shows that the evaluation score FScore_video of the video to be extracted is 24, the evaluation score FScore_frame1 of the first frame image is 16, the evaluation score FScore_frame2 of the second frame image is 14, and the evaluation score FScore_frame3 of the third frame image is 11. Further calculation shows that the difference between FScore_video and FScore_frame1 is 8 (that is, the feature entropy difference of the first frame image is 8), the difference between FScore_video and FScore_frame2 is 10 (that is, the feature entropy difference of the second frame image is 10), and the difference between FScore_video and FScore_frame3 is 13 (that is, the feature entropy difference of the third frame image is 13). Since the feature entropy difference of the first frame image is the smallest, the first frame image is determined as the first responsible frame. The first responsible frame (that is, the first frame image) and the second frame image then form an image set [frame1, frame2]; by performing the maximum pooling operation on the feature entropy matrix of the first responsible frame and the feature entropy matrix of the second frame image, the feature entropy matrix of the image set [frame1, frame2] is obtained, and calculation shows that its evaluation score FScore_[frame1,frame2] is 16, so the difference between FScore_video and FScore_[frame1,frame2] is 8 (that is, the feature entropy difference of this image set is 8). The first responsible frame and the third frame image form another image set [frame1, frame3]; by performing the maximum pooling operation on the feature entropy matrix of the first responsible frame and the feature entropy matrix of the third frame image, the feature entropy matrix of the image set [frame1, frame3] is obtained, and calculation shows that its evaluation score FScore_[frame1,frame3] is 24, so the difference between FScore_video and FScore_[frame1,frame3] is 0 (that is, the feature entropy difference of this image set is 0). Since the feature entropy difference of the image set [frame1, frame3] formed by the first responsible frame and the third frame image is smaller than that of the image set [frame1, frame2] formed by the first responsible frame and the second frame image, the third frame image is determined as the second responsible frame.
In another aspect, the present application further provides a responsible frame extraction method. Please refer to FIG. 9, which schematically shows a flowchart of the responsible frame extraction method provided by an embodiment of the present application. As shown in FIG. 9, the responsible frame extraction method includes the following steps:
Step S210: using a target detection neural network model to extract a region of interest from each frame of medical image in the acquired medical video, so as to obtain the region-of-interest image corresponding to each frame of medical image.
Step S220: using the skeleton network of the static image classification neural network model to perform feature extraction on each frame of region-of-interest image, so as to obtain the feature matrix of each frame of region-of-interest image.
Step S230: extracting malignant responsible frames according to the feature matrices of the region-of-interest images of the frames until a first preset end condition is satisfied; and/or extracting benign responsible frames according to the feature matrices of the region-of-interest images of the frames until a second preset end condition is satisfied.
Please refer to FIG. 10 and FIG. 11. Because devices and examination modes differ, the style of the information bar outside the imaging window varies. The responsible frame extraction method provided by the present application therefore first uses a target detection neural network model to extract the region-of-interest image from each frame of medical image in the acquired medical video, and then extracts the malignant responsible frames and/or benign responsible frames according to the feature matrix of each frame of region-of-interest image. This effectively reduces the interference of image noise in the process of extracting malignant and/or benign responsible frames, and further improves the efficiency and accuracy of their extraction.
In an exemplary embodiment, using the target detection neural network model to extract the region of interest from each frame of medical image in the acquired medical video so as to obtain the region-of-interest image corresponding to each frame of medical image includes:
using the target detection neural network model to extract the region of interest from each frame of medical image in the acquired medical video, so as to obtain the position information of the region of interest corresponding to each frame of medical image; and
cropping the corresponding region from each frame of medical image according to the position information of the region of interest corresponding to that frame, so as to obtain the region-of-interest image corresponding to each frame of medical image.
In this way, the position information of the region of interest (that is, the ultrasound window) in each frame of medical image can be accurately obtained by using the target detection neural network model, so that for each frame of medical image the corresponding region-of-interest image can be cropped from the medical image according to the position information of its region of interest.
In an exemplary embodiment, before the skeleton network of the static image classification neural network model is used to perform feature extraction on each frame of region-of-interest image, the method further includes:
for each frame of region-of-interest image:
taking the larger of the width dimension and the height dimension of the region-of-interest image as the target side length;
padding the region-of-interest image so as to adjust the smaller of the width dimension and the height dimension of the region-of-interest image to the target side length; and
enlarging or reducing the region-of-interest image that has been adjusted to the target side length, so as to adjust the size of the region-of-interest image to a preset size.
Since the static image classification neural network model requires images of a uniform size as input, before the skeleton network of the static image classification neural network model is used to perform feature extraction on each frame of region-of-interest image, the size of the region-of-interest image needs to be adjusted to a preset size. The preset size may be set according to specific circumstances and is not limited by the present application. Preferably, the height and width of the preset size are identical, that is, the region-of-interest image adjusted to the preset size is a square image, for example 448×448; setting the height and width of the preset size to be identical makes it easier to adjust the size of the region-of-interest image to the preset size. Specifically, a "zero-pixel" padding method may be used to pad the region-of-interest image so that its width and height become identical. It should be noted that, as can be understood by those skilled in the art, since the target detection neural network model also requires images of a uniform size as input, before the target detection neural network model is used to extract the region of interest from each frame of medical image in the acquired medical video, the size of each frame of medical image also needs to be adjusted to a target size to meet the input requirements of the target detection neural network model.
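A possible sketch of this padding-and-resizing step, assuming a single-channel (grayscale) NumPy image and using scikit-image for the resizing (an assumed choice; any image resizing routine would do), is shown below.

```python
import numpy as np
from skimage.transform import resize


def pad_and_resize(roi: np.ndarray, preset: int = 448) -> np.ndarray:
    """Zero-pad the ROI image to a square and then rescale it to preset x preset."""
    h, w = roi.shape[:2]
    target = max(h, w)                      # target side length = larger of height and width
    padded = np.zeros((target, target), dtype=roi.dtype)
    padded[:h, :w] = roi                    # "zero-pixel" padding of the shorter side
    return resize(padded, (preset, preset), preserve_range=True)   # enlarge or reduce to the preset size
```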
In this way, by performing feature extraction on multiple frames of region-of-interest images simultaneously in a parallel manner, so as to obtain the feature matrices of multiple frames of region-of-interest images at the same time (one frame of region-of-interest image corresponds to one feature matrix), the extraction efficiency of the responsible frame extraction method provided by the present application can be further improved. It should be noted that, as can be understood by those skilled in the art, the total number of frames of region-of-interest images that can be processed in parallel at a time is determined by the computing power of the computer's GPU; the stronger the GPU's computing power, the more frames of region-of-interest images can be processed in parallel at a time.
In an exemplary embodiment, extracting malignant responsible frames according to the feature matrices of the region-of-interest images of the frames until the first preset end condition is satisfied includes:
for each frame of region-of-interest image, obtaining the malignant feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the malignant feature weight parameter and the benign feature weight parameter corresponding to the static image classification neural network model; and
extracting malignant responsible frames according to the malignant feature matrices of the region-of-interest images of the frames until the first preset end condition is satisfied.
Specifically, in the static image classification neural network model, the benign/malignant judgment of each frame of region-of-interest image is made on the basis of the feature matrix of the region-of-interest image, and the output probability Y_pred predicted by the static image classification neural network model can be expressed as:

Y_pred = Sigmoid([W_0, W_1]^T · X + [B_0, B_1]^T) = [Y_0, Y_1]    (1)

In formula (1), Y_0 denotes the probability that the region-of-interest image belongs to the benign category, Y_1 denotes the probability that it belongs to the malignant category, W_1 denotes the malignant feature weight parameter corresponding to the static image classification neural network model, W_0 denotes the benign feature weight parameter corresponding to the static image classification neural network model, and B_0 and B_1 denote the bias parameters corresponding to the static image classification neural network model.
It can be seen from formula (1) above that the probability Y_1 of the region-of-interest image belonging to the malignant category is determined only by the relative difference between the malignant feature weight parameter and the benign feature weight parameter and by the feature matrix X of the region-of-interest image. Therefore, the present application obtains the malignant feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the malignant feature weight parameter and the benign feature weight parameter corresponding to the static image classification neural network model, and then extracts malignant responsible frames according to the malignant feature matrices of the region-of-interest images of the frames until the first preset end condition is satisfied, so that malignant responsible frames carrying a large amount of malignancy-related information can be accurately extracted. It should be noted that, as can be understood by those skilled in the art, the malignant feature weight parameter W_1 is a matrix of k malignant feature weights and the benign feature weight parameter W_0 is a matrix of k benign feature weights; that is, each feature dimension corresponds to one malignant feature weight and one benign feature weight.
In an exemplary embodiment, obtaining the malignant feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the malignant feature weight parameter and the benign feature weight parameter corresponding to the static image classification neural network model includes:
obtaining the malignant feature matrix of the region-of-interest image according to the following formulas (2) and (3):

[FM]_i = [FM_i^1, FM_i^2, ..., FM_i^k]    (2)

FM_i^j = max((W_1^j - W_0^j) · x_i^j, 0)    (3)
In formula (2), [FM]_i denotes the malignant feature matrix of the i-th frame of region-of-interest image. In formula (3), x_i^j denotes the feature value of the j-th feature dimension in the feature matrix of the i-th frame of region-of-interest image, W_1^j denotes the malignant feature weight of the j-th feature dimension corresponding to the static image classification neural network model, W_0^j denotes the benign feature weight of the j-th feature dimension corresponding to the static image classification neural network model, and FM_i^j denotes the malignant feature value of the j-th feature dimension in the malignant feature matrix of the i-th frame of region-of-interest image.
It should be noted that, as can be understood by those skilled in the art, max((W_1^j - W_0^j) · x_i^j, 0) means taking the larger of 0 and (W_1^j - W_0^j) · x_i^j; that is, if (W_1^j - W_0^j) · x_i^j is greater than 0, then FM_i^j takes the value (W_1^j - W_0^j) · x_i^j, and if (W_1^j - W_0^j) · x_i^j is less than 0, then FM_i^j takes 0.
Further, in an exemplary embodiment, extracting malignant responsible frames according to the malignant feature matrices of the region-of-interest images of the frames until the first preset end condition is satisfied includes:
for each frame of region-of-interest image, adding up the malignant feature values of all feature dimensions in the malignant feature matrix of the region-of-interest image to obtain the total malignant feature value of the region-of-interest image; and
extracting malignant responsible frames according to the total malignant feature values of the region-of-interest images of the frames until the first preset end condition is satisfied.
Specifically, combining formulas (2) and (3) above, the total malignant feature value of the i-th frame of region-of-interest image can be expressed as:

MScore_i = Σ_{j=1}^{k} FM_i^j = Σ_{j=1}^{k} max((W_1^j - W_0^j) · x_i^j, 0)    (4)
Thus, by extracting malignant responsible frames according to the total malignant feature values of the region-of-interest images of the frames, the extraction efficiency of the malignant responsible frames can be further improved, and the extraction of malignant responsible frames with similar features can be effectively prevented.
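Following formulas (2) to (4) above, the per-frame malignant feature matrices and total malignant feature values can be sketched as follows; the names are illustrative and the weights are assumed to be the two channel-weight vectors of the classification network.

```python
import numpy as np


def malignant_scores(roi_features, w_malignant, w_benign):
    """Per-frame malignant feature matrices [FM]_i and total malignant feature values MScore_i.

    roi_features : (n_frames, k) feature matrices of the region-of-interest images
    w_malignant  : (k,) malignant feature weights W_1 of the classification network
    w_benign     : (k,) benign feature weights W_0 of the classification network
    """
    # Formula (3): keep only the positive malignant contribution of each feature dimension.
    fm = np.maximum((w_malignant - w_benign) * roi_features, 0.0)   # (n_frames, k)
    # Formula (4): total malignant feature value of each frame.
    mscore = fm.sum(axis=1)                                         # (n_frames,)
    return fm, mscore
```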
Please continue to refer to FIG. 12. As shown in FIG. 12, in an exemplary embodiment, extracting malignant responsible frames according to the total malignant feature values of the region-of-interest images of the frames until the first preset end condition is satisfied includes:
Step A10: ranking the total malignant feature values of the region-of-interest images of the frames, and determining the region-of-interest image with the largest total malignant feature value as a malignant responsible frame;
Step A20: forming a first image set from all the malignant responsible frames together with each non-malignant-responsible frame, and calculating the total malignant feature value of each first image set, where the total malignant feature value of a first image set is equal to the sum, over all feature dimensions, of the malignant feature values in the malignant feature matrix obtained by performing the maximum pooling operation on the malignant feature matrices of all the frames of region-of-interest images in the first image set, and a non-malignant-responsible frame is a region-of-interest image that has not been determined as a malignant responsible frame;
Step A30: judging whether the malignant feature entropy corresponding to the first image set with the largest total malignant feature value is greater than the malignant feature entropy corresponding to the malignant responsible frame set formed by all the malignant responsible frames;
if not, executing step A40; if so, executing step A50;
Step A40: determining all the frames of region-of-interest images in the first image set with the largest total malignant feature value as malignant responsible frames, and returning to step A20; and
Step A50: ending the extraction of malignant responsible frames.
Specifically, performing the maximum pooling operation on the malignant feature matrices of all the frames of region-of-interest images in the first image set means taking, along the column direction (that is, along the feature dimensions), the maximum malignant feature value over the malignant feature matrices of all the frames of region-of-interest images in the first image set, so as to obtain a 1×k malignant feature matrix in which the value of each feature dimension is the maximum malignant feature value of that dimension over all the frames in the set; the malignant feature matrix obtained by the maximum pooling operation combines the malignant information that each frame of region-of-interest image in the first image set can contribute. That is, for an image set A (A = [frame_a, frame_b, ..., frame_n]), the total malignant feature value of the image set A satisfies the following relation:

MScore_A = Σ_{j=1}^{k} max(FM_a^j, FM_b^j, ..., FM_n^j)    (5)
Thus, in the responsible frame extraction method provided by the present application, the region-of-interest image with the largest total malignant feature value is first identified as the first malignant responsible frame in the malignant responsible frame set. Each remaining region-of-interest image that has not been determined to be a malignant responsible frame is then combined with the first malignant responsible frame to form a first image set (at this point every first image set includes the first malignant responsible frame and one region-of-interest image not yet determined to be a malignant responsible frame), and the total malignant feature value of each first image set is calculated; the region-of-interest image that is not yet a malignant responsible frame in the first image set with the largest total malignant feature value is then the second malignant responsible frame in the malignant responsible frame set. Next, the first malignant responsible frame and the second malignant responsible frame are combined with each remaining region-of-interest image that has not been determined to be a malignant responsible frame to form new first image sets (every first image set now includes the first malignant responsible frame, the second malignant responsible frame and one region-of-interest image not yet determined to be a malignant responsible frame), and the first image set with the largest total malignant feature value is found by calculating the total malignant feature value of each first image set. If the malignant feature entropy of the first image set with the largest total malignant feature value is greater than the malignant feature entropy of the malignant responsible frame set composed of the first malignant responsible frame and the second malignant responsible frame, the extraction of malignant responsible frames ends, and the extracted first and second malignant responsible frames are taken as the final malignant responsible frames; if the malignant feature entropy of the first image set with the largest total malignant feature value is less than or equal to the malignant feature entropy of the malignant responsible frame set composed of the first malignant responsible frame and the second malignant responsible frame, the region-of-interest image that is not yet a malignant responsible frame in that first image set is determined to be the third malignant responsible frame in the malignant responsible frame set. The above steps are repeated until the malignant feature entropy of the first image set with the largest total malignant feature value is greater than the malignant feature entropy of the malignant responsible frame set composed of all the malignant responsible frames.
Since visually identical region-of-interest images usually share similar malignant feature matrices, adding a similar region-of-interest image does not significantly change the total malignant feature value of an image set; therefore, the malignant responsible frame extraction method described above does not repeatedly select similar malignant responsible frames.
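In an illustrative, non-limiting sketch (the helper functions set_total_malignant_value and set_malignant_entropy stand for the set-level quantities defined in this application, and all names are illustrative assumptions), the greedy selection described above can be written as:

```python
def select_malignant_responsible_frames(n_frames,
                                         set_total_malignant_value,
                                         set_malignant_entropy):
    """Greedy extraction of malignant responsible frames.

    set_total_malignant_value(indices) -> total malignant feature value
    set_malignant_entropy(indices)     -> malignant feature entropy
    of the image set formed by the given frame indices.
    """
    # first responsible frame: the frame with the largest total malignant feature value
    selected = [max(range(n_frames),
                    key=lambda i: set_total_malignant_value([i]))]
    while len(selected) < n_frames:
        remaining = [i for i in range(n_frames) if i not in selected]
        # candidate first image sets: current responsible frames + one remaining frame
        best = max(remaining,
                   key=lambda i: set_total_malignant_value(selected + [i]))
        # stop as soon as adding the best candidate would increase the entropy
        if set_malignant_entropy(selected + [best]) > set_malignant_entropy(selected):
            break
        selected.append(best)
    return selected
```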
Further, in an exemplary embodiment, the malignant feature entropy of an image set is calculated according to the following formulas (6) and (7):
H_1(A) = -p_1(A) × log_2 p_1(A)    (6)
[Formula (7), defining p_1(A) in terms of MScore_A and BScore_A, is given as an image in the original.]
In the formulas, H_1(A) denotes the malignant feature entropy of image set A, MScore_A denotes the total malignant feature value of image set A, and BScore_A denotes the total benign feature value of image set A.
Specifically, when the malignant feature entropy increases, the uncertainty of the prediction result begins to rise; therefore, when the malignant feature entropy increases, the extraction of new malignant responsible frames should stop. Thus, by judging whether the extraction of malignant responsible frames should stop according to whether the feature entropy rises, the present application can automatically extract, based on the content of the acquired medical video, the required number of malignant responsible frames that contribute important features to the classification of the medical video. Please refer to FIG. 13, which schematically shows the relationship between the feature entropy of a responsible frame image set and the number of responsible frames according to an embodiment of the present application. As shown in FIG. 13, when the number of image frames in the malignant responsible frame set is less than 5, the malignant feature entropy of the malignant responsible frame set keeps decreasing; once the number of image frames in the malignant responsible frame set exceeds 5, the malignant feature entropy begins to increase, which means that the uncertainty of the prediction result increases. Therefore, for the example shown in FIG. 13, after the number of malignant responsible frames in the malignant responsible frame set reaches 5, no new malignant responsible frames need to be extracted.
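For illustration only, the stopping criterion can be sketched as follows; because formula (7) is reproduced only as an image, the definition of p_1(A) used below (the share of the combined score attributed to the malignant class) is an assumption made for this sketch, not the expression of the original formula:

```python
import math

def malignant_feature_entropy(mscore_a, bscore_a, eps=1e-12):
    # assumed p1(A): fraction of the combined score attributed to the malignant class
    p1 = mscore_a / (mscore_a + bscore_a + eps)
    return -p1 * math.log2(p1 + eps)

def should_stop_extracting(entropy_with_candidate, entropy_current):
    # per the text: stop extracting new malignant responsible frames
    # as soon as the malignant feature entropy starts to increase
    return entropy_with_candidate > entropy_current
```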
In an exemplary embodiment, extracting benign responsible frames according to the feature matrices of the frames of region-of-interest images until a second preset end condition is met includes:
for each frame of region-of-interest image, obtaining a benign feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the benign feature weight parameters and the malignant feature weight parameters corresponding to the static image classification neural network model; and
extracting benign responsible frames according to the benign feature matrices of the frames of region-of-interest images until the second preset end condition is met.
It can be seen from formula (1) above that the probability Y_0 that a region-of-interest image belongs to the benign category is determined only by the relative difference between the benign feature weight parameters and the malignant feature weight parameters and by the feature matrix X of the region-of-interest image. Therefore, the present application obtains the benign feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the benign feature weight parameters and the malignant feature weight parameters corresponding to the static image classification neural network model, and then extracts benign responsible frames according to the benign feature matrices of the frames of region-of-interest images until the second preset end condition is met, so that benign responsible frames that contribute a large amount of benign information can be accurately extracted.
In an exemplary embodiment, obtaining the benign feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the benign feature weight parameters and the malignant feature weight parameters corresponding to the static image classification neural network model includes:
obtaining the benign feature matrix of the region-of-interest image according to the following formulas (8) and (9):
[FB]_i = [fb_i^1, fb_i^2, …, fb_i^k]    (8)
fb_i^j = max(0, x_i^j × (w_0^j - w_1^j))    (9)
In formula (8), [FB]_i denotes the benign feature matrix of the i-th frame of region-of-interest image; in formula (9), x_i^j denotes the feature value of the j-th feature dimension in the feature matrix of the i-th frame of region-of-interest image, w_0^j denotes the benign feature weight of the j-th feature dimension corresponding to the static image classification neural network model, w_1^j denotes the malignant feature weight of the j-th feature dimension corresponding to the static image classification neural network model, and fb_i^j denotes the benign feature value of the j-th feature dimension in the benign feature matrix of the i-th frame of region-of-interest image.
It should be noted that, as can be understood by those skilled in the art, max(0, x_i^j × (w_0^j - w_1^j)) means taking the larger of 0 and x_i^j × (w_0^j - w_1^j); that is, if x_i^j × (w_0^j - w_1^j) is greater than 0, fb_i^j takes the value x_i^j × (w_0^j - w_1^j), and if x_i^j × (w_0^j - w_1^j) is less than 0, fb_i^j takes the value 0.
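A minimal sketch of formulas (8) and (9) as reconstructed above (assuming the per-frame feature matrix and the class weight parameters are NumPy vectors of length k; names are illustrative):

```python
import numpy as np

def benign_feature_matrix(x_i, w_benign, w_malignant):
    """fb_i^j = max(0, x_i^j * (w0^j - w1^j)) for every feature dimension j.

    x_i: (k,) feature matrix of the i-th region-of-interest image.
    w_benign, w_malignant: (k,) benign / malignant feature weights of the
    static image classification neural network model.
    """
    return np.maximum(0.0, x_i * (w_benign - w_malignant))
```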
Further, in an exemplary embodiment, extracting benign responsible frames according to the benign feature matrices of the frames of region-of-interest images until the second preset end condition is met includes:
for each frame of region-of-interest image, adding up the benign feature values of all the feature dimensions in the benign feature matrix of the region-of-interest image to obtain a total benign feature value of the region-of-interest image; and
extracting benign responsible frames according to the total benign feature values of the frames of region-of-interest images until the second preset end condition is met.
Specifically, in combination with formulas (8) and (9) above, the total benign feature value of the i-th frame of region-of-interest image can be expressed as:
BScore_i = Σ_{j=1}^{k} fb_i^j = Σ_{j=1}^{k} max(0, x_i^j × (w_0^j - w_1^j))
Thus, by extracting benign responsible frames according to the total benign feature values of the frames of region-of-interest images, not only can the efficiency of extracting benign responsible frames be further improved, but the extraction of benign responsible frames with similar features can also be effectively prevented.
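For example (continuing the sketch above; names are illustrative), the total benign feature value of a frame is simply the sum of its benign feature values over the k feature dimensions:

```python
import numpy as np

def total_benign_value(fb_i: np.ndarray) -> float:
    # total benign feature value of a single region-of-interest image:
    # sum over all k feature dimensions of its benign feature matrix
    return float(np.sum(fb_i))
```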
Please continue to refer to FIG. 14, which schematically shows a flowchart of extracting benign responsible frames according to an embodiment of the present application. As shown in FIG. 14, in an exemplary embodiment, extracting benign responsible frames according to the total benign feature values of the frames of region-of-interest images until the second preset end condition is met includes:
Step B10: sorting the total benign feature values of the frames of region-of-interest images, and determining the region-of-interest image with the largest total benign feature value as a benign responsible frame;
Step B20: combining all the benign responsible frames with each non-benign-responsible frame to form second image sets, and calculating the total benign feature value of each second image set, where the total benign feature value of a second image set is equal to the sum of the benign feature values of all the feature dimensions in the benign feature matrix obtained by performing a maximum pooling operation on the benign feature matrices of all the frames of region-of-interest images in the second image set, and a non-benign-responsible frame is a region-of-interest image that has not yet been determined to be a benign responsible frame;
Step B30: judging whether the benign feature entropy corresponding to the second image set with the largest total benign feature value is greater than the benign feature entropy corresponding to the benign responsible frame set composed of all the benign responsible frames;
if not, performing step B40; if so, performing step B50;
Step B40: determining all the region-of-interest images in the second image set with the largest total benign feature value as benign responsible frames, and returning to step B20;
Step B50: ending the extraction of benign responsible frames.
Specifically, performing a maximum pooling operation on the benign feature matrices of all the frames of region-of-interest images in the second image set means taking, in the column direction (that is, along the feature dimensions), the maximum benign feature value of the benign feature matrices of all the frames of region-of-interest images in the second image set, so as to obtain a 1×k benign feature matrix in which the benign feature value of each feature dimension is the maximum benign feature value of that feature dimension among the benign feature matrices of all the frames of region-of-interest images in the second image set. The benign feature matrix obtained by the maximum pooling operation integrates the benign information that every frame of region-of-interest image in the second image set can contribute. That is, for an image set A (A = [frame_a, frame_b, …, frame_n]), the total benign feature value of the image set A satisfies the following relationship:
BScore_A = Σ_{j=1}^{k} max(fb_a^j, fb_b^j, …, fb_n^j)
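A minimal sketch of this relationship (assuming fb_frames stacks the benign feature matrices of the frames in image set A as rows; names are illustrative):

```python
import numpy as np

def set_total_benign_value(fb_frames: np.ndarray) -> float:
    """fb_frames: (n_frames, k) benign feature matrices of the frames in set A.

    Column-wise (feature-dimension-wise) max pooling yields a 1 x k benign
    feature matrix; summing it over the feature dimensions gives BScore_A.
    """
    pooled = fb_frames.max(axis=0)
    return float(pooled.sum())
```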
Thus, in the responsible frame extraction method provided by the present application, the region-of-interest image with the largest total benign feature value is first identified as the first benign responsible frame in the benign responsible frame set. Each remaining region-of-interest image that has not been determined to be a benign responsible frame is then combined with the first benign responsible frame to form a second image set (at this point every second image set includes the first benign responsible frame and one region-of-interest image not yet determined to be a benign responsible frame), and the total benign feature value of each second image set is calculated; the region-of-interest image that is not yet a benign responsible frame in the second image set with the largest total benign feature value is then the second benign responsible frame in the benign responsible frame set. Next, the first benign responsible frame and the second benign responsible frame are combined with each remaining region-of-interest image that has not been determined to be a benign responsible frame to form new second image sets (every second image set now includes the first benign responsible frame, the second benign responsible frame and one region-of-interest image not yet determined to be a benign responsible frame), and the second image set with the largest total benign feature value is found by calculating the total benign feature value of each second image set. If the benign feature entropy of the second image set with the largest total benign feature value is greater than the benign feature entropy of the benign responsible frame set composed of the first benign responsible frame and the second benign responsible frame, the extraction of benign responsible frames ends, and the extracted first and second benign responsible frames are taken as the final benign responsible frames; if the benign feature entropy of the second image set with the largest total benign feature value is less than or equal to the benign feature entropy of the benign responsible frame set composed of the first benign responsible frame and the second benign responsible frame, the region-of-interest image that is not yet a benign responsible frame in that second image set is determined to be the third benign responsible frame in the benign responsible frame set. The above steps are repeated until the benign feature entropy of the second image set with the largest total benign feature value is greater than the benign feature entropy of the benign responsible frame set composed of all the benign responsible frames.
Since visually identical region-of-interest images usually share similar benign feature matrices, adding a similar region-of-interest image does not significantly change the total benign feature value of an image set; therefore, the benign responsible frame extraction method described above does not repeatedly select similar benign responsible frames.
Specifically, in an exemplary embodiment, the benign feature entropy of an image set is calculated according to the following formulas (12) and (13):
H_0(A) = -p_0(A) × log_2 p_0(A)    (12)
[Formula (13), defining p_0(A) in terms of MScore_A and BScore_A, is given as an image in the original.]
In the formulas, H_0(A) denotes the benign feature entropy of image set A, MScore_A denotes the total malignant feature value of image set A, and BScore_A denotes the total benign feature value of image set A.
Specifically, when the benign feature entropy increases, the uncertainty of the prediction result begins to rise; therefore, when the benign feature entropy increases, the extraction of new benign responsible frames should stop. Thus, by judging whether the extraction of benign responsible frames should stop according to whether the feature entropy rises, the present application can automatically extract, based on the content of the acquired medical video, the required number of benign responsible frames that contribute important features to the classification of the medical video.
Please continue to refer to FIG. 16, which schematically shows a software interface through which a doctor adjusts responsible frames according to an embodiment of the present application. As shown in FIG. 16, the extracted malignant responsible frames and/or benign responsible frames can be displayed in a responsible-view recommendation window of the software interface, and the acquired medical video to be classified can also be displayed in a video playback window of the software interface. The doctor can access the frames adjacent to a responsible frame (that is, the neighboring region-of-interest images) through the "previous frame" and "next frame" buttons, and can choose to accept or reject the current frame (that is, the currently accessed region-of-interest image) as a responsible frame; the system automatically records the responsible frames confirmed by the doctor.
The inventors of the present application collected a total of 13,702 2D ultrasound breast nodule images (including 9,177 images from 2,457 patients with benign pathology and 4,545 images from 991 patients with malignant pathology) and 2,141 breast ultrasound videos (including 1,227 videos from 560 patients with benign pathology and 914 videos from 412 patients with malignant pathology) for training and validation of the static image classification neural network model and the video classification model. The performance of the video classification method provided by the present application was evaluated using AUROC (area under the receiver operating characteristic curve), accuracy, sensitivity and specificity. The results of five-fold cross-validation (the data set is evenly divided into five parts; in each round one part is used as the test set and the rest as the training set) are shown in Table 1 below, and the results on the test set are shown in Table 2 below.
Table 1. Five-fold cross-validation results for the benign/malignant classification of breast nodules [table provided as an image in the original]
Table 2. Test set results for the benign/malignant classification of breast nodules [table provided as an image in the original]
As can be seen from Tables 1 and 2, the AUROC, accuracy, sensitivity and specificity of the benign/malignant classification of breast nodules based on the responsible frames (including malignant responsible frames and/or benign responsible frames) extracted by the responsible frame extraction method provided by the present application are all significantly better than the AUROC, accuracy, sensitivity and specificity of the benign/malignant classification of breast nodules based on responsible frames manually selected by doctors.
Corresponding to the responsible frame extraction method described above, the present application further provides a video classification method. Please refer to FIG. 17, which schematically shows a flowchart of the video classification method according to an embodiment of the present application. As shown in FIG. 17, the video classification method includes the following steps:
Step S310: extracting a preset number of responsible frames from an acquired medical video by using the responsible frame extraction method described above.
Step S320: classifying the video according to the feature matrices of the preset number of responsible frames.
Since the video classification method provided by the present application extracts the preset number of responsible frames by using the responsible frame extraction method described above, it has all the advantages of the responsible frame extraction method described above. In addition, since the video classification method provided by the present application classifies the video according to the extracted preset number of responsible frames, the interference of noise frames in the medical video can be effectively reduced, and the accuracy of video classification (for example, the classification of benign and malignant nodule videos) is effectively improved.
Further, classifying the video according to the feature matrices of the preset number of responsible frames includes:
performing a maximum pooling operation on the feature matrices of the preset number of responsible frames to obtain a feature matrix of the responsible frame set; and
classifying the video according to the feature matrix of the responsible frame set.
Specifically, by performing maximum pooling in the column direction over all the responsible frames, the feature matrix contributed by all the responsible frames, that is, the feature matrix of the responsible frame set, can be obtained; the video can then be classified accurately according to the obtained feature matrix of the responsible frame set.
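A minimal sketch of this pooling step (assuming each responsible frame contributes a 1 × k feature matrix stacked as a row; names are illustrative):

```python
import numpy as np

def responsible_set_feature_matrix(frame_features: np.ndarray) -> np.ndarray:
    # frame_features: (n_responsible_frames, k) feature matrices of the responsible frames;
    # column-wise max pooling gives the 1 x k feature matrix of the responsible frame set
    return frame_features.max(axis=0)
```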
In an exemplary embodiment, classifying the video according to the feature matrix of the responsible frame set includes:
inputting the feature matrix of the responsible frame set into a video classification model to classify the video.
Thus, by inputting the feature matrix contributed by all the responsible frames into a pre-trained video classification model, the final video classification can be performed.
Further, the video classification model is a random forest classification model. The random forest classification model is composed of a plurality of classification trees, each of which classifies the input feature matrix; the random forest classification model votes according to the classification results of all the classification trees and finally makes a judgment on whether the lesion is benign or malignant. It should be noted that, as can be understood by those skilled in the art, in some other embodiments the video classification model may also be a classification model other than the random forest classification model, which is not limited in the present application. In addition, as can be understood by those skilled in the art, the random forest classification model is obtained through pre-training; specifically, a pre-built random forest classification model may be trained with a video training set (the video training set includes the feature matrices of the responsible frame sets of videos and the corresponding classification labels) to obtain the video classification model.
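For illustration, a sketch of training and applying such a random forest classifier with scikit-learn is given below; the file names and hyperparameters are assumptions made for this sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train: (n_videos, k) responsible-frame-set feature matrices of the training videos
# y_train: (n_videos,) classification labels (e.g. 0 = benign, 1 = malignant)
X_train = np.load("train_set_features.npy")   # illustrative file names
y_train = np.load("train_set_labels.npy")

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# classify a new video from the feature matrix of its responsible frame set
video_feature = np.load("new_video_responsible_set_feature.npy")  # shape (k,)
prediction = clf.predict(video_feature.reshape(1, -1))[0]
```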
When the information-entropy-reduction method is used to extract responsible frames, the number of responsible frames that can be extracted when the information entropy decreases to 0 differs for different skeleton networks (for example, MobileNet, DenseNet121 and Xception); therefore, performing feature extraction with different skeleton networks also affects the performance of the classification model differently. For example, when the skeleton network adopts the MobileNet model and the number of extracted responsible frames is 5, the evaluation metrics of the classification model are: ROC-AUC of 0.885 (95% CI: 0.830-0.939), PR-AUC of 0.876 (95% CI: 0.831-0.927), accuracy of 0.82 and F1-Score of 0.819, all of which are better than the metrics obtained when the benign/malignant prediction is made directly on the video. Here, ROC (receiver operating characteristic) denotes the receiver operating characteristic curve, AUC (area under the curve) denotes the area under a curve, ROC-AUC denotes the area under the ROC curve, CI (confidence interval) denotes the confidence interval, and PR-AUC denotes the area under the precision-recall curve. When the skeleton network adopts the DenseNet121 model and the number of extracted responsible frames is 10, the evaluation metrics of the classification model are: ROC-AUC of 0.891 (95% CI: 0.835-0.947), PR-AUC of 0.908 (95% CI: 0.876-0.940), accuracy of 0.85 and F1-Score of 0.838; compared with making the benign/malignant judgment directly on the video, ROC-AUC and PR-AUC are essentially unchanged (a difference of 0.002), accuracy is 0.01 higher, and F1-Score is substantially improved, rising from 0.819 to 0.838. It can be seen that feature extraction with different skeleton networks has different effects on the prediction performance of the classification model, so an appropriate network model can be selected as the skeleton network according to the specific situation of the classification model.
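As a sketch of how a skeleton network such as MobileNet can produce the per-frame feature matrix (assuming TensorFlow/Keras is available; in practice the backbone would be the trained skeleton network of the static image classification neural network model rather than generic ImageNet weights):

```python
import numpy as np
import tensorflow as tf

# include_top=False with global average pooling yields a 1 x k feature matrix per frame
backbone = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(224, 224, 3)
)

def frame_feature_matrix(frame_rgb: np.ndarray) -> np.ndarray:
    """frame_rgb: (H, W, 3) video frame; returns a (k,) feature vector (k = 1024 for MobileNet)."""
    x = tf.image.resize(frame_rgb.astype("float32"), (224, 224))
    x = tf.keras.applications.mobilenet.preprocess_input(x)
    return backbone(tf.expand_dims(x, 0), training=False).numpy()[0]
```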
In an exemplary embodiment, the video classification method further includes:
displaying the classification result of the video and the extracted preset number of responsible frames.
Thus, by displaying the extracted preset number of responsible frames, the responsible frames on which the video classification is based can be presented, so that the doctor can judge from the extracted responsible frames whether the obtained video classification result is accurate, thereby further improving the accuracy of video classification. For example, when the video is an ultrasound video, outputting the preset number of responsible frames extracted from the ultrasound video can help further reduce the missed-diagnosis rate and misdiagnosis rate in the ultrasound screening process.
Based on the same inventive concept, the present application further provides an electronic device. Please refer to FIG. 18, which schematically shows a block diagram of the electronic device according to an embodiment of the present application. As shown in FIG. 18, the electronic device includes a processor 101 and a memory 103, and a computer program is stored in the memory 103; when the computer program is executed by the processor 101, the responsible frame extraction method or the video classification method described above is implemented. Since the electronic device provided by the present application and the responsible frame extraction method described above belong to the same inventive concept, the electronic device provided by the present application has all the advantages of the responsible frame extraction method described above, which will not be repeated here.
As shown in FIG. 18, the electronic device further includes a communication interface 102 and a communication bus 104, wherein the processor 101, the communication interface 102 and the memory 103 communicate with one another through the communication bus 104. The communication bus 104 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 104 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus. The communication interface 102 is used for communication between the electronic device and other devices.
The processor 101 referred to in the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 101 is the control center of the electronic device and connects the various parts of the entire electronic device through various interfaces and lines.
The memory 103 may be used to store the computer program, and the processor 101 implements the various functions of the electronic device by running or executing the computer program stored in the memory 103 and calling the data stored in the memory 103.
The memory 103 may include non-volatile and/or volatile memory. The non-volatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM) or a flash memory. The volatile memory may include a random access memory (RAM) or an external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The present application further provides a readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the responsible frame extraction method or the video classification method described above can be implemented. Since the storage medium provided by the present application and the responsible frame extraction method described above belong to the same inventive concept, the storage medium provided by the present application has all the advantages of the responsible frame extraction method described above, which will not be repeated here.
The readable storage medium in the embodiments of the present application may adopt any combination of one or more computer-readable media. The readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection with one or more wires, a portable computer hard disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, the computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by or in combination with an instruction execution system, apparatus or device.
The computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and the computer-readable medium can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device.
The computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet by using an Internet service provider).
In summary, compared with the prior art, the responsible frame extraction method, the video classification method, the electronic device and the storage medium provided by the present application have the following advantages:
(1) In the responsible frame extraction method, the electronic device and the storage medium provided by the present application, the video to be extracted is first acquired; the skeleton network of the static image classification neural network model is then used to perform feature extraction on each frame of image in the video to be extracted, so as to obtain the feature matrix of each frame of image; finally, a preset number of responsible frames are extracted according to the feature matrices of the frames of images. In this way, a plurality of responsible frames whose contributed features do not repeat can be automatically extracted, so that responsible frames with diverse features can be extracted without manually defining an inter-frame extraction distance; the extracted responsible frames lay a good foundation for subsequent video classification and effectively eliminate the interference caused by noise frame images to video classification during the video classification process.
(2) The video classification method provided by the present application extracts a preset number of responsible frames by using the responsible frame extraction method described above, and classifies the video according to the feature matrices of the extracted preset number of responsible frames. Since the video classification method provided by the present application extracts the preset number of responsible frames by using the responsible frame extraction method described above, it has all the advantages of the responsible frame extraction method described above. In addition, since the video classification method provided by the present application classifies the video according to the extracted preset number of responsible frames, the interference of noise frames in the video can be effectively reduced, and the accuracy of video classification is effectively improved.
It should be noted that the apparatus and methods disclosed in the embodiments herein may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions and operations of the apparatus, methods and computer program products according to the embodiments herein. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, and the module, program segment or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the various embodiments herein may be integrated together to form an independent part, each module may exist independently, or two or more modules may be integrated to form an independent part.
The above description is merely a description of the preferred embodiments of the present application and does not limit the scope of the present application in any way. Any changes or modifications made by those of ordinary skill in the art based on the above disclosure fall within the protection scope of the present application. Obviously, those skilled in the art may make various changes and modifications to the present application without departing from the spirit and scope of the present application. Thus, provided that these modifications and variations fall within the scope of the present application and its equivalent technology, the present application is also intended to include them.

Claims (20)

1. A responsible frame extraction method, characterized by comprising:
    acquiring a video to be extracted;
    performing feature extraction on each frame of image in the video to be extracted by using a skeleton network of a static image classification neural network model, to obtain a feature matrix of each frame of image;
    performing a maximum pooling operation on the feature matrices of all the frames of images to obtain a video feature matrix of the video to be extracted; and
    extracting a preset number of responsible frames according to the feature matrix of each frame of image and the video feature matrix.
2. The responsible frame extraction method according to claim 1, wherein extracting a preset number of responsible frames according to the feature matrix of each frame of image and the video feature matrix comprises:
    multiplying the feature value of each feature dimension in the video feature matrix by the importance value of that feature dimension to obtain a video feature importance matrix;
    for each frame of image, multiplying the feature value of each feature dimension in the feature matrix of the frame of image by the importance value of that feature dimension to obtain a feature importance matrix of the frame of image; and
    extracting the preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame of image.
3. The responsible frame extraction method according to claim 2, wherein extracting the preset number of responsible frames according to the video feature importance matrix and the feature importance matrix of each frame of image comprises:
    step A1: taking the video feature importance matrix as a current video feature importance matrix;
    step B1: for each frame of image, subtracting the feature importance matrix of the frame of image from the current video feature importance matrix to obtain a remaining feature importance matrix corresponding to the frame of image;
    step C1: for each frame of image, adding up the feature values of the feature dimensions in the remaining feature importance matrix corresponding to the frame of image to obtain a remaining information entropy corresponding to the frame of image;
    step D1: taking the image with the smallest remaining information entropy as a current responsible frame;
    step E1: taking the remaining feature importance matrix corresponding to the current responsible frame as a new current video feature importance matrix; and
    repeating steps B1 to E1 until the preset number of responsible frames are extracted.
4. The responsible frame extraction method according to claim 3, wherein subtracting the feature importance matrix of the frame of image from the current video feature importance matrix to obtain the remaining feature importance matrix corresponding to the frame of image comprises:
    subtracting the feature value of the corresponding feature dimension in the feature importance matrix of the frame of image from the feature value of each feature dimension in the current video feature importance matrix to obtain a feature value difference of each feature dimension; and
    for the feature value difference of each feature dimension, if the feature value difference of the feature dimension is less than 0, taking 0 as the feature value of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame of image; and if the feature value difference of the feature dimension is greater than or equal to 0, taking the feature value difference of the feature dimension as the feature value of the corresponding feature dimension in the remaining feature importance matrix corresponding to the frame of image.
5. The responsible frame extraction method according to claim 1, wherein extracting a preset number of responsible frames according to the feature matrix of each frame of image comprises:
    for each frame of image, multiplying the feature value of each feature dimension in the feature matrix of the frame of image by the contribution weight value of that feature dimension to obtain a feature entropy matrix of the frame of image;
    performing a maximum pooling operation on the feature entropy matrices of all the frames of images to obtain a video feature entropy matrix of the video to be extracted; and
    extracting the preset number of responsible frames according to the feature entropy matrix of each frame of image and the video feature entropy matrix.
6. The responsible frame extraction method according to claim 5, wherein extracting the preset number of responsible frames according to the feature entropy matrix of each frame of image and the video feature entropy matrix comprises:
    for each frame of image, adding up the feature values of all the feature dimensions in the feature entropy matrix of the frame of image to obtain an evaluation score of the frame of image;
    adding up the feature values of all the feature dimensions in the video feature entropy matrix to obtain an evaluation score of the video to be extracted; and
    extracting the preset number of responsible frames according to the evaluation score of each frame of image and the evaluation score of the video to be extracted, wherein the difference between the evaluation score of the video to be extracted and the evaluation score of the image set composed of the preset number of responsible frames is the smallest.
7. The responsible frame extraction method according to claim 6, wherein extracting the preset number of responsible frames according to the evaluation score of each frame of image and the evaluation score of the video to be extracted comprises:
    step A2: for each frame of image, calculating the difference between the evaluation score of the video to be extracted and the evaluation score of the frame of image to obtain a feature entropy difference of the frame of image;
    step B2: determining the image with the smallest feature entropy difference as a responsible frame;
    step C2: combining all the responsible frames with each non-responsible frame to form image sets, and calculating the evaluation score of each image set;
    step D2: for each image set, calculating the difference between the evaluation score of the video to be extracted and the evaluation score of the image set to obtain a feature entropy difference of the image set;
    step E2: determining all the images in the image set with the smallest feature entropy difference as responsible frames; and
    repeating steps C2 to E2 until the preset number of responsible frames are extracted.
8. The responsible frame extraction method according to claim 1, characterized by further comprising:
    extracting a region of interest from each frame of image in the acquired video to be extracted by using a target detection neural network model, to obtain a region-of-interest image corresponding to each frame of image;
    performing feature extraction on each frame of region-of-interest image by using the skeleton network of the static image classification neural network model, to obtain a feature matrix of each frame of region-of-interest image;
    extracting malignant responsible frames according to the feature matrices of the frames of region-of-interest images until the malignant feature entropy corresponding to the malignant responsible frame set composed of all the malignant responsible frames reaches a minimum value; and/or
    extracting benign responsible frames according to the feature matrices of the frames of region-of-interest images until the benign feature entropy corresponding to the benign responsible frame set composed of all the benign responsible frames reaches a minimum value.
9. The responsible frame extraction method according to claim 8, wherein extracting malignant responsible frames according to the feature matrices of the frames of region-of-interest images until a first preset end condition is met comprises:
    for each frame of region-of-interest image, obtaining a malignant feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the malignant feature weight parameters and the benign feature weight parameters corresponding to the static image classification neural network model; and
    extracting malignant responsible frames according to the malignant feature matrices of the frames of region-of-interest images until the first preset end condition is met; and/or
    wherein extracting benign responsible frames according to the feature matrices of the frames of region-of-interest images until a second preset end condition is met comprises:
    for each frame of region-of-interest image, obtaining a benign feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the benign feature weight parameters and the malignant feature weight parameters corresponding to the static image classification neural network model; and
    extracting benign responsible frames according to the benign feature matrices of the frames of region-of-interest images until the second preset end condition is met.
10. The responsible frame extraction method according to claim 9, wherein obtaining the malignant feature matrix of the region-of-interest image according to the feature matrix of the region-of-interest image and the difference between the malignant feature weight parameters and the benign feature weight parameters corresponding to the static image classification neural network model comprises:
    obtaining the malignant feature matrix of the region-of-interest image according to the following formulas:
    [FM]_i = [fm_i^1, fm_i^2, …, fm_i^k]
    fm_i^j = max(0, x_i^j × (w_1^j - w_0^j))
    where x_i^j denotes the feature value of the j-th feature dimension in the feature matrix of the i-th frame of region-of-interest image, w_1^j denotes the malignant feature weight of the j-th feature dimension corresponding to the static image classification neural network model, w_0^j denotes the benign feature weight of the j-th feature dimension corresponding to the static image classification neural network model, fm_i^j denotes the malignant feature value of the j-th feature dimension in the malignant feature matrix of the i-th frame of region-of-interest image, and [FM]_i denotes the malignant feature matrix of the i-th frame of region-of-interest image; and/or
    所述根据所述感兴趣区域图像的特征矩阵以及所述静态图像分类神经网络模型所对应的良性特征权重参数和恶性特征权重参数之差,获取所述感兴趣区域图像的良性特征矩阵,包括:The acquisition of the benign feature matrix of the ROI image according to the feature matrix of the ROI image and the difference between the benign feature weight parameters and the malignant feature weight parameters corresponding to the static image classification neural network model includes:
    按照如下公式,获取所述感兴趣区域图像的良性特征矩阵:Obtain the benign feature matrix of the ROI image according to the following formula:
    Figure PCTCN2022134699-appb-100007
    Figure PCTCN2022134699-appb-100007
    Figure PCTCN2022134699-appb-100008
    Figure PCTCN2022134699-appb-100008
    式中,
    Figure PCTCN2022134699-appb-100009
    表示第i帧感兴趣区域图像的特征矩阵中的第j个特征维度的特征值,
    Figure PCTCN2022134699-appb-100010
    表示所述静态图像分类神经网络模型所对应的第j个特征维度的良性特征权重,
    Figure PCTCN2022134699-appb-100011
    表示所述静态图像分类神经网络模型所对应的第j个特征维度的恶性特征权重,
    Figure PCTCN2022134699-appb-100012
    表示第i帧感兴趣区域图像的良性特征矩阵中的第j个特征维度的良性特征值,[FB] i表示第i帧感兴趣区域图像的良性特征矩阵。
    In the formula,
    Figure PCTCN2022134699-appb-100009
    Represents the eigenvalue of the jth feature dimension in the feature matrix of the i-th frame region of interest image,
    Figure PCTCN2022134699-appb-100010
    Indicates the benign feature weight of the jth feature dimension corresponding to the static image classification neural network model,
    Figure PCTCN2022134699-appb-100011
    Indicates the malignant feature weight of the jth feature dimension corresponding to the static image classification neural network model,
    Figure PCTCN2022134699-appb-100012
    Indicates the benign eigenvalue of the jth feature dimension in the benign feature matrix of the i-th frame ROI image, and [FB] i indicates the benign feature matrix of the i-th frame ROI image.
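For illustration only (not part of the claims), the weight-difference computation of claim 10 can be sketched in Python as follows. The per-frame feature matrix is assumed here to be a one-dimensional vector with one value per feature dimension, and w_malignant / w_benign stand in for the malignant and benign classification weights of the static image classification neural network model; all names are illustrative.

```python
import numpy as np

def malignant_benign_feature_matrices(frame_features, w_malignant, w_benign):
    """Per-frame malignant/benign feature matrices, in the spirit of claim 10.

    frame_features: array of shape (num_frames, num_dims), one feature vector
        per region-of-interest image (assumed flattened for this sketch).
    w_malignant, w_benign: arrays of shape (num_dims,), the per-dimension
        malignant and benign weights of the static image classifier.
    """
    fm = frame_features * (w_malignant - w_benign)  # FM_i^j
    fb = frame_features * (w_benign - w_malignant)  # FB_i^j
    return fm, fb
```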
  11. The responsible frame extraction method according to claim 9, wherein the extracting malignant responsible frames according to the malignant feature matrix of each frame of region-of-interest image until the first preset end condition is met comprises:
    for each frame of region-of-interest image, summing the malignant feature values of all feature dimensions in the malignant feature matrix of the region-of-interest image, to obtain a total malignant feature value of the region-of-interest image;
    extracting malignant responsible frames according to the total malignant feature value of each frame of region-of-interest image, until the first preset end condition is met; and/or
    the extracting benign responsible frames according to the benign feature matrix of each frame of region-of-interest image until the second preset end condition is met comprises:
    for each frame of region-of-interest image, summing the benign feature values of all feature dimensions in the benign feature matrix of the region-of-interest image, to obtain a total benign feature value of the region-of-interest image;
    extracting benign responsible frames according to the total benign feature value of each frame of region-of-interest image, until the second preset end condition is met.
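Continuing the sketch above (same assumptions and array shapes), the total malignant and benign feature values of claim 11 would simply be per-frame sums over the feature dimensions:

```python
def total_scores(fm, fb):
    """Total malignant / benign feature value per frame (claim 11): the sum of
    that frame's per-dimension malignant / benign feature values."""
    m_score = fm.sum(axis=1)  # shape (num_frames,)
    b_score = fb.sum(axis=1)
    return m_score, b_score
```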
  12. The responsible frame extraction method according to claim 11, wherein the extracting malignant responsible frames according to the total malignant feature value of each frame of region-of-interest image until the first preset end condition is met comprises:
    step A10: sorting the total malignant feature values of all frames of region-of-interest images, and determining the region-of-interest image with the largest total malignant feature value as a malignant responsible frame;
    step A20: for each non-malignant responsible frame, forming a first image set consisting of all the malignant responsible frames and that non-malignant responsible frame, and calculating the total malignant feature value of each first image set, wherein the total malignant feature value of a first image set equals the sum of the malignant feature values of all feature dimensions in the malignant feature matrix obtained by performing a maximum pooling operation on the malignant feature matrices of all frames of region-of-interest images in the first image set, and a non-malignant responsible frame is a region-of-interest image that has not been determined as a malignant responsible frame;
    step A30: judging whether the malignant feature entropy corresponding to the first image set with the smallest total malignant feature value is greater than the malignant feature entropy corresponding to the malignant responsible frame set composed of all the malignant responsible frames;
    if not, performing step A40; if yes, performing step A50;
    step A40: determining all frames of region-of-interest images in the first image set with the smallest total malignant feature value as malignant responsible frames, and returning to step A20;
    step A50: ending the extraction of malignant responsible frames; and/or
    the extracting benign responsible frames according to the total benign feature value of each frame of region-of-interest image until the second preset end condition is met comprises:
    step B10: sorting the total benign feature values of all frames of region-of-interest images, and determining the region-of-interest image with the largest total benign feature value as a benign responsible frame;
    step B20: for each non-benign responsible frame, forming a second image set consisting of all the benign responsible frames and that non-benign responsible frame, and calculating the total benign feature value of each second image set, wherein the total benign feature value of a second image set equals the sum of the benign feature values of all feature dimensions in the benign feature matrix obtained by performing a maximum pooling operation on the benign feature matrices of all frames of region-of-interest images in the second image set, and a non-benign responsible frame is a region-of-interest image that has not yet been determined as a benign responsible frame;
    step B30: judging whether the benign feature entropy corresponding to the second image set with the smallest total benign feature value is greater than the benign feature entropy corresponding to the benign responsible frame set composed of all the benign responsible frames;
    if not, performing step B40; if yes, performing step B50;
    step B40: determining all region-of-interest images in the second image set with the smallest total benign feature value as benign responsible frames, and returning to step B20;
    step B50: ending the extraction of benign responsible frames.
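A minimal sketch of the greedy loop of steps A10–A50 follows (the benign loop B10–B50 is symmetric). It reuses the per-frame matrices fm / fb from the earlier sketch, and feature_entropy is the malignant feature entropy of claim 13, for which a sketch is given after that claim below. This is one interpretation of the claimed procedure, not a reference implementation.

```python
import numpy as np

def extract_malignant_responsible_frames(fm, fb, feature_entropy):
    """Greedy extraction of malignant responsible frames (steps A10-A50)."""
    num_frames = fm.shape[0]
    # Step A10: the frame with the largest total malignant feature value starts the set.
    responsible = {int(np.argmax(fm.sum(axis=1)))}

    def set_scores(frames):
        idx = sorted(frames)
        pooled_fm = fm[idx].max(axis=0)  # max pooling over the frames in the set
        pooled_fb = fb[idx].max(axis=0)
        return pooled_fm.sum(), pooled_fb.sum()

    while len(responsible) < num_frames:
        # Step A20: pair the current responsible frames with each remaining frame.
        candidates = [f for f in range(num_frames) if f not in responsible]
        scored = [(set_scores(responsible | {f}), f) for f in candidates]
        (m_best, b_best), f_best = min(scored, key=lambda s: s[0][0])  # smallest total malignant value
        # Step A30: stop once adding a frame would raise the malignant feature entropy.
        if feature_entropy(m_best, b_best) > feature_entropy(*set_scores(responsible)):
            break  # Step A50
        responsible.add(f_best)  # Step A40
    return sorted(responsible)
```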
  13. The responsible frame extraction method according to claim 8, wherein the malignant feature entropy of an image set is calculated according to the following formulas:
    H_1(A) = −p_1(A) × log_2 p_1(A)
    p_1(A) = MScoreA / (MScoreA + BScoreA)
    where H_1(A) denotes the malignant feature entropy of the image set A, MScoreA denotes the total malignant feature value of the image set A, and BScoreA denotes the total benign feature value of the image set A; and/or
    the benign feature entropy of an image set is calculated according to the following formulas:
    H_0(A) = −p_0(A) × log_2 p_0(A)
    p_0(A) = BScoreA / (MScoreA + BScoreA)
    where H_0(A) denotes the benign feature entropy of the image set A, MScoreA denotes the total malignant feature value of the image set A, and BScoreA denotes the total benign feature value of the image set A.
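Assuming p_1 and p_0 are the malignant and benign shares of the set's total feature value as written above, the two entropies can be computed directly; the guard against non-positive proportions is a design choice of this sketch, not part of the claim.

```python
import numpy as np

def malignant_feature_entropy(m_score, b_score):
    """H1(A) = -p1(A) * log2 p1(A), with p1(A) = MScoreA / (MScoreA + BScoreA)."""
    p1 = m_score / (m_score + b_score)
    return -p1 * np.log2(p1) if p1 > 0 else 0.0

def benign_feature_entropy(m_score, b_score):
    """H0(A) = -p0(A) * log2 p0(A), with p0(A) = BScoreA / (MScoreA + BScoreA)."""
    p0 = b_score / (m_score + b_score)
    return -p0 * np.log2(p0) if p0 > 0 else 0.0
```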
  14. The responsible frame extraction method according to claim 8, wherein the using a target detection neural network model to extract a region of interest from each frame of image in the acquired video to be extracted, so as to obtain the region-of-interest image corresponding to each frame of image, comprises:
    using the target detection neural network model to extract a region of interest from each frame of image in the acquired video to be extracted, so as to obtain position information of the region of interest corresponding to each frame of image;
    cropping the corresponding region from each frame of image according to the position information of the region of interest corresponding to that frame of image, so as to obtain the region-of-interest image corresponding to each frame of image.
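As a rough illustration of claim 14, with a hypothetical detector callback that returns one bounding box per frame in pixel coordinates (the detector itself is outside the scope of this sketch):

```python
def crop_rois(frames, detector):
    """Crop one region-of-interest image per frame (claim 14 sketch).

    frames: list of H x W (x C) numpy arrays.
    detector: assumed to return (x_min, y_min, x_max, y_max) for a frame.
    """
    rois = []
    for frame in frames:
        x_min, y_min, x_max, y_max = detector(frame)
        rois.append(frame[y_min:y_max, x_min:x_max])  # crop the detected region
    return rois
```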
  15. A video classification method, comprising:
    extracting a preset number of responsible frames from an acquired video by using the responsible frame extraction method according to any one of claims 1 to 14;
    performing a maximum pooling operation on the feature matrices of the preset number of responsible frames, to obtain a feature matrix of the responsible frame set; and
    classifying the video according to the feature matrix of the responsible frame set.
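Claim 15 can be pictured as max-pooling the responsible-frame feature matrices into a single set-level feature vector and handing it to a classifier; a minimal sketch, with illustrative names, assuming per-frame features are stacked into a 2-D array:

```python
def classify_video(responsible_features, classifier):
    """responsible_features: array of shape (num_responsible, num_dims).
    classifier: any fitted model exposing predict(), e.g. a random forest."""
    set_features = responsible_features.max(axis=0, keepdims=True)  # max pooling
    return classifier.predict(set_features)[0]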
  16. The video classification method according to claim 15, wherein the classifying the video according to the feature matrix of the responsible frame set comprises:
    inputting the feature matrix of the responsible frame set into a video classification model, to classify the video.
  17. The video classification method according to claim 16, wherein the video classification model is a random forest classification model.
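If the video classification model is a random forest as in claim 17, scikit-learn's RandomForestClassifier is one possible realization; the data below is synthetic and only shows the call pattern, with assumed feature dimensionality and labels (0 = benign, 1 = malignant):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))      # one pooled feature vector per labelled video
y_train = rng.integers(0, 2, size=200)     # synthetic benign/malignant labels

video_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
video_classifier.fit(X_train, y_train)

new_video_features = rng.normal(size=(1, 512))  # feature matrix of a responsible frame set
predicted_label = video_classifier.predict(new_video_features)[0]
```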
  18. The video classification method according to claim 15, wherein the video classification method further comprises:
    displaying the classification result of the video and the extracted preset number of responsible frames.
  19. An electronic device, comprising a processor and a memory, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the responsible frame extraction method according to any one of claims 1 to 14 or the video classification method according to any one of claims 15 to 18 is implemented.
  20. A readable storage medium, wherein a computer program is stored in the readable storage medium, and when the computer program is executed by a processor, the responsible frame extraction method according to any one of claims 1 to 14 or the video classification method according to any one of claims 15 to 18 is implemented.
PCT/CN2022/134699 2021-12-21 2022-11-28 Responsibility frame extraction method, video classification method, device and medium WO2023116351A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202111572826.1 2021-12-21
CN202111572826.1A CN116343073A (en) 2021-12-21 2021-12-21 Responsibility frame extraction method, video classification method, equipment and medium
CN202210639251.9 2022-06-07
CN202210639251.9A CN117237263A (en) 2022-06-07 2022-06-07 Responsibility frame extraction method, medical video classification method, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023116351A1 (en) 2023-06-29

Family

ID=86901197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134699 WO2023116351A1 (en) 2021-12-21 2022-11-28 Responsibility frame extraction method, video classification method, device and medium

Country Status (1)

Country Link
WO (1) WO2023116351A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120283569A1 (en) * 2011-05-04 2012-11-08 Boston Scientific Scimed, Inc. Systems and methods for navigating and visualizing intravascular ultrasound sequences
CN110569702A (en) * 2019-02-14 2019-12-13 阿里巴巴集团控股有限公司 Video stream processing method and device
CN111160191A (en) * 2019-12-23 2020-05-15 腾讯科技(深圳)有限公司 Video key frame extraction method and device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko: "A Multi-scale Multiple Instance Video Description Network", 19 March 2016 (2016-03-19), XP055462088, retrieved from the Internet <URL:https://arxiv.org/pdf/1505.05914.pdf> [retrieved on 2018-03-22] *

Similar Documents

Publication Publication Date Title
US20200250497A1 (en) Image classification method, server, and computer-readable storage medium
JP7086336B2 (en) Tissue nodule detection method and its model Training method, equipment, equipment, system, and its computer program
Lee et al. Detection and classification of intracranial haemorrhage on CT images using a novel deep-learning algorithm
Al-Antari et al. Deep learning computer-aided diagnosis for breast lesion in digital mammogram
US11182894B2 (en) Method and means of CAD system personalization to reduce intraoperator and interoperator variation
WO2021164306A1 (en) Image classification model training method, apparatus, computer device, and storage medium
TWI754195B (en) Image processing method and device, electronic device and computer-readable storage medium
WO2018120942A1 (en) System and method for automatically detecting lesions in medical image by means of multi-model fusion
Murakami et al. Automatic identification of bone erosions in rheumatoid arthritis from hand radiographs based on deep convolutional neural network
US10726948B2 (en) Medical imaging device- and display-invariant segmentation and measurement
Byra et al. Impact of ultrasound image reconstruction method on breast lesion classification with deep learning
Shia et al. Classification of malignant tumours in breast ultrasound using unsupervised machine learning approaches
Wankhade et al. A novel hybrid deep learning method for early detection of lung cancer using neural networks
Zhao et al. Bascnet: Bilateral adaptive spatial and channel attention network for breast density classification in the mammogram
CN111524109A (en) Head medical image scoring method and device, electronic equipment and storage medium
Hu et al. A multi-instance networks with multiple views for classification of mammograms
Sendra-Balcells et al. Generalisability of fetal ultrasound deep learning models to low-resource imaging settings in five African countries
Younas et al. An ensemble framework of deep neural networks for colorectal polyp classification
WO2023116351A1 (en) Responsibility frame extraction method, video classification method, device and medium
Nemade et al. Deep learning-based ensemble model for classification of breast cancer
Lee et al. Computational discrimination of breast cancer for Korean women based on epidemiologic data only
CN116416221A (en) Ultrasonic image analysis method
Kim et al. Prediction of locations in medical images using orthogonal neural networks
CN116343073A (en) Responsibility frame extraction method, video classification method, equipment and medium
CN117237263A (en) Responsibility frame extraction method, medical video classification method, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909666

Country of ref document: EP

Kind code of ref document: A1