CN112001224A - Video acquisition method and video acquisition system based on convolutional neural network - Google Patents

Video acquisition method and video acquisition system based on convolutional neural network

Info

Publication number
CN112001224A
CN112001224A (application CN202010632493.6A)
Authority
CN
China
Prior art keywords: video, key, detection, video capture, video acquisition
Prior art date
Legal status
Pending
Application number
CN202010632493.6A
Other languages
Chinese (zh)
Inventor
王宇
宗文
Current Assignee
Beijing Aowei Video Technology Co ltd
Original Assignee
Beijing Aowei Video Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aowei Video Technology Co ltd
Priority to CN202010632493.6A priority Critical patent/CN112001224A/en
Publication of CN112001224A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video acquisition method based on a convolutional neural network, which comprises the following steps: S101: receiving a video acquired by a video acquisition device; S102: extracting key parts from video frames of the video through a convolutional neural network model; S103: adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.

Description

Video acquisition method and video acquisition system based on convolutional neural network
Technical Field
The present invention generally relates to the field of video image processing, and more particularly, to a video acquisition method and a video acquisition system based on a convolutional neural network.
Background
The object detection and tracking technique for video content aims to detect and locate an object of interest (generally a moving object) along a time axis in a video sequence, so as to form a motion track of the object. The target detection and tracking technology is widely applied in different fields, such as intelligent video monitoring, automatic driving, unmanned aerial vehicle navigation, industrial robots and the like.
Moving-object detection methods mainly include background subtraction, temporal (frame) differencing, and optical-flow methods; all of them are spatial-domain detection methods. The first two have low computational complexity and good real-time performance but are limited in applicability: background subtraction works well when the background is highly stable, yet changes in illumination, weather changes, or camera shake will introduce detection errors; temporal differencing works well when the change between the compared frames is small, and its detection accuracy is likewise severely affected by changes in illumination. Optical-flow methods have high computational complexity and higher detection accuracy than the other two, but still cannot cope with changes in illumination conditions.
Early moving-object tracking algorithms mainly searched the current frame, by template matching, for the candidate most similar to the target being tracked; classical algorithms include the Kalman filter and Mean Shift. In recent years, with the maturing of deep learning theory and the continuous improvement of application frameworks, tracking algorithms based on deep learning, together with algorithms that essentially use correlation filters (Correlation Filter) as classifiers, have become the mainstream techniques.
The statements in the background section merely represent technologies known to the inventors and do not necessarily constitute prior art in the field.
Disclosure of Invention
In view of at least one of the drawbacks of the prior art, the present invention provides a video capture method based on a convolutional neural network, including:
S101: receiving a video acquired by a video acquisition device;
S102: extracting key parts from video frames of the video through a convolutional neural network model;
S103: adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.
According to an aspect of the present invention, the video capturing method further comprises: building the convolutional neural network model by:
selecting a model and a training framework;
establishing a training set in which the key parts are labeled;
training on the training set using the model and the training framework, and outputting the model topology and parameters of the convolutional neural network model.
According to one aspect of the invention, the model comprises one of an R-CNN model, a Fast R-CNN model, a Faster R-CNN model and a Mask R-CNN model, wherein the video acquisition method further comprises:
selecting a plurality of boxes in a predetermined area of the video frame as candidates for the detection and localization information of the key part.
According to an aspect of the present invention, the step S102 includes: extracting the key parts from multiple video frames of the video through the convolutional neural network model,
wherein the video capture method further comprises: when the same key part is detected in a preset number of consecutive frames and the coordinate difference and/or the size difference of that key part across the consecutive frames is smaller than a threshold, determining that the detection of that key part is consistent,
wherein the step S103 includes: when the detections of all the key parts are consistent, adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.
According to an aspect of the present invention, the video capturing method further comprises: determining whether different key parts conform to a preset positional relationship; when different key parts conform to the preset positional relationship, determining that the detection of the key parts is normal;
wherein the step S103 includes: when different key parts conform to the preset positional relationship and the detections of all the key parts are consistent, adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.
According to one aspect of the invention, the critical sites include a hand of a surgeon and a surgical site of a patient, and the physical parameters include an angle and a focal length of the video capture device.
According to an aspect of the invention, said step S103 comprises: adjusting the angle of the video capture device by:
presetting a sliding time window of length N3 frames, in which a series of displacement vectors representing the motion trajectory of a key part is
[Formula: the displacement vectors of the key part within the window]
wherein 0 ≤ i < N3; the displacement vector of the key part after the angle adjustment of the video acquisition device is ADJ_k, and ADJ_k satisfies:
[Formula: a constraint on ADJ_k relative to the displacement vectors]
According to one aspect of the invention, ADJ_k should satisfy
[Formula: ADJ_k expressed using a coefficient γ]
wherein γ < 1.
According to an aspect of the invention, said step S103 comprises:
for the calculated ADJ_k, determining whether the key part still falls within a specific field-of-view region after the adjustment:
if it falls within the specific field-of-view region, adjusting the angle of the video acquisition device according to the calculated ADJ_k; otherwise, increasing or decreasing the value of γ and recalculating ADJ_k until Object_k falls within the specific field-of-view region.
According to an aspect of the invention, said step S103 comprises: adjusting the focal length of the video acquisition device according to the proportion of the video frame occupied by the key part.
The present invention also provides a video capture system, comprising:
the video acquisition device is configured to acquire videos in real time;
an image processing unit in communication with the video capture device to receive video captured by the video capture device and configured to perform the video capture method of any of claims 1-10 and adjust physical parameters of the video capture device.
According to one aspect of the invention, the image processing unit is further configured to encode the video collected by the video collecting device and output a code stream.
According to one aspect of the invention, the video acquisition system further comprises an ultrasound imaging device, the ultrasound imaging device is in communication with the image processing unit and is configured to acquire ultrasound images of a patient, and the image processing unit is configured to encode the ultrasound images and output a code stream.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure. In the drawings:
FIG. 1 is a diagram of the steps of CNN-based key surgical site detection and localization model generation;
FIG. 2 is a schematic view of the adjustment of the pan/tilt angle of a camera;
FIG. 3 is a hardware configuration and functional block diagram of the remote ultrasound guidance system based on a convolutional neural network;
FIG. 4 is a work flow chart of the remote ultrasound guidance system based on a convolutional neural network;
FIG. 5 shows an end-cloud integrated remote ultrasound guidance system based on a convolutional neural network.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience and simplicity of description; they do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be considered as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description of the present invention, it should be noted that unless otherwise explicitly stated or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection, either mechanically, electrically, or in communication with each other; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "above" or "below" a second feature means that the first and second features are in direct contact, or that they are not in direct contact but contact each other via another feature between them. Moreover, the first feature being "on", "above" or "over" the second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. The first feature being "under", "below" or "beneath" the second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.
The following disclosure provides many different embodiments or examples for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit the present invention. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples, such repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. In addition, the present invention provides examples of various specific processes and materials, but one of ordinary skill in the art may recognize applications of other processes and/or uses of other materials.
The present invention relates to a remote operation system and a remote operation device, which are in essence video transmission (and, in some applications, video storage) systems; their application fields include remote operation teaching, remote operation guidance and remote operation coordination. By applying a convolutional neural network to the remote operation system, the key surgical sites in the operation are automatically detected and located, and physical parameters of the camera such as the pan/tilt angle and the focal length are automatically adjusted based on the detection and positioning information of the key surgical sites, so that the key surgical sites are not lost from the video picture and the picture quality remains clear. Meanwhile, the detection and positioning information of the key surgical sites can be used as retrieval information to realize structured storage of the surgical video, which saves storage resources and facilitates subsequent retrieval and viewing of the content.
Fig. 1 illustrates a convolutional neural network-based video acquisition method 100 according to an embodiment of the present invention, which is described in detail below with reference to the accompanying drawings.
In step S101, the video acquired by a video acquisition device is received. The video acquisition device is, for example, a camera. The scene of the video may be a telesurgical scene, with image elements including the operator, the patient, surgical instruments, and the surgical site. For convenience, a telesurgical scenario is described below as an example. Those skilled in the art will readily appreciate that the concepts and aspects of the present invention may be applied to video processing of other scenes while remaining within the scope of the present invention.
In step S102, key parts are extracted from the video frames of the video through a convolutional neural network model. Examples of such key parts include, but are not limited to, the hands of the operator, the surgical site of the patient, and the like. The convolutional neural network model may be trained in advance specifically for extracting the key parts of a video in a specific scene, as described in detail below.
In step S103, the physical parameters of the video acquisition device are adjusted according to the detection information and the positioning information of the key parts.
Taking a video of a remote operation scene as an example, key parts such as the hands of the operator and the surgical site of the patient are required to appear in the video, preferably near the center of the picture. Therefore, physical parameters of the video acquisition device, such as the shooting angle and the focal length, are adjusted through the detection and positioning information of the key parts, so that the key surgical sites are not lost from the video picture and the picture remains clear. For example, when a key part cannot be detected, the shooting angle of the camera needs to be adjusted; when the sharpness of some key part is low, the focal length of the camera needs to be adjusted so that the camera focuses on that key part and presents it clearly. The detection information includes, but is not limited to, the name or type of the extracted key part, or whether a certain key part has been extracted. According to one embodiment, for each extracted key part, the convolutional neural network model identifies its name or type, and this name or type belongs to the detection information in the present invention.
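For orientation, the following is a minimal sketch of the S101-S103 loop described above, assuming a generic detector object and a generic pan/tilt/zoom camera interface; the class and method names (read_frame, adjust_angle, adjust_focus, and so on) are illustrative assumptions, not part of the patented system.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    name: str                               # detection information: name/type of the key part
    box: Tuple[float, float, float, float]  # positioning information: (x, y, w, h)

def acquisition_loop(camera, detector):
    """S101-S103: receive frames, extract key parts, adjust the camera's physical parameters."""
    while True:
        frame = camera.read_frame()                            # S101: receive a video frame
        detections: List[Detection] = detector.detect(frame)   # S102: CNN extracts key parts
        if not detections:
            camera.adjust_angle_to_search()                    # key part lost: change the shooting angle
            continue
        # S103: use the detection and positioning information to adjust physical parameters
        camera.adjust_angle(detections)                        # keep the key parts in the picture
        camera.adjust_focus(detections)                        # keep the key parts sharp and well-sized
```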
According to a preferred embodiment of the present invention, the convolutional neural network model is built in the following manner.
In step S1, a model and a training framework are selected. The detection and localization of the key parts belongs to the category of object detection. Preferably, the model comprises one of an R-CNN model, a Fast R-CNN model and a Mask R-CNN model. The training framework may be selected according to the application requirements. For example, when pixel-level image segmentation is required, such as an AI green-screen technique that automatically separates the foreground and background of the picture during video acquisition, a Mask R-CNN framework may be adopted. The present invention needs to identify and locate key parts in the video input signal, so Fast R-CNN may be selected, which achieves a good balance between inference speed and inference performance (a comprehensive evaluation of indicators such as the false-detection rate and the missed-detection rate).
In step S2, a training set in which the key parts are labeled is established.
An original training set is first selected according to the specific telesurgical application scenario. The original training set comprises original positive samples and original negative samples: the original positive samples are pictures of the same type of operation, or images extracted from related video content, with the corresponding key parts labeled; the original negative samples can be chosen arbitrarily. For an original positive sample, the key surgical sites are cropped and added to the formal positive training set; for an original negative sample, different regions of the same image can be selected and added to the formal negative training set.
In step S3, the model is trained on the training set using the training framework, and the model topology and parameters of the convolutional neural network model are output. The convolutional neural network model obtained through this training can then be used to identify the key parts in the application scene.
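As an illustration of steps S1 to S3, here is a minimal training sketch assuming PyTorch/torchvision as the training framework and a Faster R-CNN model; the number of classes, the data loader, and the output file name are hypothetical and are not specified by the patent.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Hypothetical label set: background + two key surgical sites (hand, ultrasound probe).
NUM_CLASSES = 3

def build_model(num_classes=NUM_CLASSES):
    # Start from a detector pre-trained on COCO and replace the box head
    # so it predicts the key surgical sites instead of the COCO classes.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train(model, data_loader, num_epochs=10, device="cuda"):
    # data_loader yields (images, targets) where each target has "boxes" and "labels",
    # i.e. the key parts labeled in step S2.
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
    for epoch in range(num_epochs):
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            losses = model(images, targets)       # dict of RPN and ROI-head losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()                        # backpropagation (cf. classification G06N3/084)
            optimizer.step()
    # "Model topology and parameters" of step S3: the topology is the network built above,
    # the trained parameters are saved here.
    torch.save(model.state_dict(), "key_site_detector.pth")
    return model
```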
According to a preferred embodiment of the present invention, the video capturing method 100 further comprises: selecting a plurality of boxes in a predetermined area of the video frame as candidates for the detection and localization information of the key part. For an R-CNN model and similar models, in order to determine the size and position of an object or key part to be identified, frames with different sizes, positions and aspect ratios need to be selected in a predetermined area of the video image as candidates for the detection and positioning information of that object; these are called bounding boxes. Different bounding boxes may be set for different key parts. For example, for the hands of an operator, a roughly rectangular bounding box may be provided; for a patient's head, a roughly square bounding box may be provided; and for a patient's arm, an elongated bounding box may be provided. For early R-CNN models, the bounding boxes may be pre-selected. For the Faster R-CNN model, the bounding boxes are obtained automatically by adding a Region Proposal Network (RPN) layer to the R-CNN model. For the detection and localization of key surgical sites in particular, how the bounding boxes are obtained should be chosen according to the actual application requirements: if there is only one key part and its size and position in the video picture are strictly constrained, a small number of bounding boxes can be preset; if there is more than one key part and there are no strict limitations on their size and position in the video picture, the RPN is preferably used to obtain the set of bounding boxes automatically.
According to an embodiment of the present invention, in step S102 a key part is extracted from a single video frame of the video through the convolutional neural network model, and a physical parameter of the video capture device is adjusted according to the detection information and positioning information of that key part.
Also preferably, in the case of a telesurgery video, the trained neural network model (such as an R-CNN model) is applied to all or some of the frames of the telesurgery input video signal to infer information about the key surgical sites, including target detection information and positioning information. Preferably, when the same key part is detected in a preset number of consecutive frames (for example, two or more frames) and the coordinate difference and/or size difference of that key part across the consecutive frames is smaller than a threshold, the detection of that key part is determined to be consistent. When key parts are extracted from single or multiple frames of the video through the convolutional neural network model, the category of each detected target can be labeled at the same time. For example, for a telesurgical system with K key surgical sites, they may be labeled Object_k, k = 1, …, K. The position information of each identified key surgical site may be described by a rectangular box (x, y, w, h), where x and y are the coordinates of the upper-left corner of the rectangle, and w and h are its width and height, respectively. The complete information descriptor of a key surgical site is thus Object_k(x_k, y_k, w_k, h_k). According to an embodiment of the present invention, the neural network model need not be applied to every frame of the input video signal; depending on the specific situation, it may for example be applied once every (N1-1) frames (N1 is a system preset value) to extract the key surgical site information. Without loss of generality, in the subsequent description of the invention the R-CNN model or other models described above are applied by default to each frame of the input video signal.
At the initial stage when the remote operation system accesses the input video signal, the key surgical sites need to be detected, and accurate detection of the key surgical sites is critical for subsequent tracking and positioning. Therefore, the R-CNN model or another model needs to be applied to multiple frames, and all of the extracted key surgical site information is screened; the screening method is as follows.
For a certain key surgical site Object_k, only if the detection information of Object_k is consistent across at least N2 consecutive frames of the video image can the target be confirmed as correctly detected and the tracking and positioning of that key surgical site be started. N2 is a system preset value, for example greater than or equal to 2. "Consistency of the detection information" means that the coordinate difference and/or the size difference of the same key part across the consecutive frames is smaller than a threshold, i.e. one or both of the following two conditions:
the complete information of Object_k detected in the N2 consecutive frames of the video image,
Object_k^i(x_k^i, y_k^i, w_k^i, h_k^i), i = 0, …, N2-1, should satisfy:
[Formula: a coordinate-difference condition with threshold d and a size-difference condition with thresholds τ1 and τ2]
wherein d, τ1 and τ2 are system preset values, i ≠ j, and i, j ∈ [0, …, N2-1]. The first condition defines the coordinate difference and the second condition defines the size difference. The coordinate difference and the size difference may also be defined by other formulas.
The step S103 then includes: when the detections of all key parts are consistent, adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts. If the detection of a key part is not consistent, the physical parameters of the video acquisition device are not adjusted.
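A minimal sketch of this consistency screen, assuming the Euclidean distance of the top-left corners as the coordinate difference and absolute width/height differences as the size difference (the patent allows other formulations); d, tau1, tau2 and N2 are the system preset values named above.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) of one key part in one frame

def detection_is_consistent(track: List[Box], n2: int, d: float, tau1: float, tau2: float) -> bool:
    """Return True if the same key part was detected in the last n2 consecutive
    frames and every pair of detections agrees in position and size."""
    if len(track) < n2:
        return False                      # not yet seen in N2 consecutive frames
    recent = track[-n2:]
    for i in range(n2):
        for j in range(i + 1, n2):
            xi, yi, wi, hi = recent[i]
            xj, yj, wj, hj = recent[j]
            if math.hypot(xi - xj, yi - yj) >= d:                # coordinate-difference condition
                return False
            if abs(wi - wj) >= tau1 or abs(hi - hj) >= tau2:     # size-difference condition
                return False
    return True
```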
According to a preferred embodiment of the present invention, the video capturing method further comprises: determining whether different key parts conform to a preset positional relationship; when different key parts conform to the preset positional relationship, determining that the detection of the key parts is normal. When the preset positional relationship is not satisfied, the result is regarded as a false detection and is rejected. An example is given below for illustration.
If all or some of the K key surgical sites in the system have fixed relative positional relationships, for example if Object_j and Object_k must always have an overlapping portion, then during screening it is necessary to determine whether the following condition is satisfied:
[Formula (Eq. 2): an overlap condition between the rectangles of Object_j and Object_k]
Object_j and Object_k satisfying (Eq. 2) can be deemed to be detected correctly.
The step S103 includes: when different key parts conform to the preset positional relationship and the detections of all key parts are consistent, adjusting physical parameters of the video acquisition device, such as its angle and focal length, according to the detection information and the positioning information of the key parts.
After the detection of the key surgical sites is completed, the key surgical sites can be detected and located in the input video signal, i.e. the R-CNN model is applied to each video frame and Object_k(x_k, y_k, w_k, h_k) is output.
According to a preferred embodiment of the present invention, the step S103 includes: the angle of the video capture device is adjusted in the following manner.
A sliding time window of length N3 frames is preset; from the i-th frame within the window, the complete information of Object_k is extracted as
Object_k^i(x_k^i, y_k^i, w_k^i, h_k^i),
where N3 is a system preset value, for example greater than or equal to 2.
Within the time window, the motion trajectory of Object_k is represented by a series of displacement vectors
[Formula: the displacement vectors between the positions of Object_k in consecutive frames of the window]
As shown in fig. 2 (with N3 = 3), let the displacement vector of Object_k after the pan/tilt angle adjustment be ADJ_k; ADJ_k should then point in the direction indicated by the dashed arrow in fig. 2. In order to avoid over-adjustment, ADJ_k after the pan/tilt angle adjustment should satisfy:
[Formula: a bound on ADJ_k relative to the accumulated displacement vectors]
In practical applications, a coefficient γ smaller than 1 may be selected such that
[Formula: ADJ_k expressed in terms of the coefficient γ and the displacement vectors]
Meanwhile, a region may be pre-defined in the video picture, according to the requirements of the practical application, as the "effective area" of Object_k. Each time ADJ_k is calculated, it is checked whether Object_k would still fall within the effective area after the pan/tilt angle is adjusted in this way: if so, the adjustment is effective, and the system sends a command to the camera to adjust the pan/tilt angle; if not, the value of γ is increased or decreased and ADJ_k is recalculated until Object_k falls within the effective area.
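A minimal sketch of this angle-adjustment logic, assuming the unreproduced formulas amount to ADJ_k = γ times the accumulated displacement of Object_k over the window with γ < 1, and that the "effective area" is given as a rectangle; the default values and the step used to shrink γ are illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # position of Object_k in one frame (e.g. its box centre)

def accumulated_displacement(positions: List[Point]) -> Tuple[float, float]:
    """Sum of the displacement vectors of Object_k over the sliding window
    (the sum telescopes to last position minus first position)."""
    dx = positions[-1][0] - positions[0][0]
    dy = positions[-1][1] - positions[0][1]
    return dx, dy

def compute_adjustment(positions: List[Point],
                       effective_area: Tuple[float, float, float, float],
                       gamma: float = 0.8,
                       gamma_step: float = 0.1,
                       max_iter: int = 10) -> Tuple[float, float]:
    """Scale the accumulated displacement by gamma (< 1) to avoid over-adjustment,
    then shrink gamma until the predicted position stays inside the effective area."""
    x_min, y_min, x_max, y_max = effective_area
    last_x, last_y = positions[-1]
    dx, dy = accumulated_displacement(positions)
    for _ in range(max_iter):
        adj = (gamma * dx, gamma * dy)
        px, py = last_x + adj[0], last_y + adj[1]   # predicted position after the adjustment
        if x_min <= px <= x_max and y_min <= py <= y_max:
            return adj                               # adjustment is effective: send it to the camera
        gamma -= gamma_step                          # otherwise recalculate ADJ_k with a smaller gamma
        if gamma <= 0:
            break
    return (0.0, 0.0)                                # leave the pan/tilt unchanged
```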
In addition, the focal length of the video acquisition device can be adjusted according to the proportion of the video frame occupied by the key part; that is, the focal length of the camera can be adjusted automatically from the detection and positioning information of the key surgical site. The purpose of automatically adjusting the focal length is to present the key surgical site at a suitable size in the video frame (if too small, details cannot be seen clearly; if too large, part of it may fall outside the frame). The following method may be used:
a sliding time window of length N4 frames is preset, and from the i-th frame within the window the complete information of Object_k is extracted as
Object_k^i(x_k^i, y_k^i, w_k^i, h_k^i),
where N4 is a system preset value, for example greater than or equal to 2. If the following holds for all i:
[Formula: the proportion of the frame area S occupied by Object_k is below the threshold θ1]
the system sends a zoom command to the camera to zoom in and enlarge the picture.
Conversely, if the following holds for all i:
[Formula: the proportion of the frame area S occupied by Object_k is above the threshold θ2]
the system sends a zoom command to the camera to zoom out and shrink the picture. Here S is the area of the video frame, for example calculated in pixels; θ1 and θ2 are system preset values greater than zero and smaller than 1, with θ1 < θ2.
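A sketch of this zoom decision, assuming the occupied proportion is measured as (w · h) / S and that theta1 and theta2 are the preset bounds described above; the returned command strings are placeholders, not an actual camera API.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def zoom_decision(window: List[Box], frame_area: float,
                  theta1: float, theta2: float) -> Optional[str]:
    """Decide a zoom command from the last N4 detections of a key surgical site.

    window     : Object_k boxes for the N4 frames of the sliding window
    frame_area : S, the area of the video frame in pixels
    """
    ratios = [(w * h) / frame_area for (_, _, w, h) in window]
    if all(r < theta1 for r in ratios):
        return "zoom_in"    # key site too small in every frame of the window
    if all(r > theta2 for r in ratios):
        return "zoom_out"   # key site too large in every frame of the window
    return None             # size acceptable, leave the focal length unchanged
```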
The invention also relates to a video acquisition system comprising: the video acquisition device is configured to acquire videos in real time; an image processing unit (video intelligent terminal, as shown in fig. 3 and 5) communicating with the video capture device to receive the video captured by the video capture device, and configured to perform the video capture method as described above and adjust the physical parameters of the video capture device.
According to one aspect of the invention, the image processing unit is further configured to encode the video collected by the video collecting device and output a code stream. According to an embodiment of the present invention, the video acquisition system further includes an ultrasound imaging device, the ultrasound imaging device being in communication with the image processing unit and configured to acquire ultrasound images of a patient, and the image processing unit being configured to encode the ultrasound images and output a code stream.
[ EXAMPLES one ]
Fig. 3 and 4 show a system according to a first embodiment of the present invention, a remote ultrasound guidance system based on a convolutional neural network. Fig. 3 is its hardware configuration and functional block diagram, and fig. 4 is its work flow diagram. Remote ultrasound guidance is an important scenario in telemedicine and computer-aided remote teaching. In a remote ultrasound guidance system, the ultrasound probe, the examined body site, and the doctor's examination technique must be presented clearly in the video picture. Therefore, two video input signals are configured in the system: one is shot in real time by a pan-tilt camera and captures the ultrasound examination procedure; the other is connected to the ultrasound image output. Both video streams are encoded and transmitted in real time. The system may implement the convolutional neural network-based video acquisition method 100 as described above.
The pan-tilt camera captures the ultrasound examination procedure in real time. Since the key surgical sites in this video must subsequently be detected and tracked in real time, the pan-tilt camera is preferably a camera with high image quality and an uncompressed output format.
In addition, the video intelligent terminal is the core device of the system; it receives the output signal of the pan-tilt camera and the output signal of the ultrasound imaging device and processes and transmits them in real time. Its functions include:
i) extracting detection and positioning information of the key surgical sites from the output signal of the pan-tilt camera in real time, and, by analysing this information in real time, sending commands to the pan-tilt camera to adjust the pan/tilt angle and/or the camera focal length;
ii) performing real-time high-definition video encoding of the pan-tilt camera output signal and sending the code stream to the terminal receiving device specified by the application;
iii) performing real-time high-quality video encoding of the ultrasound imaging device output signal and sending the code stream to the terminal receiving device specified by the application.
The work flow of the remote ultrasonic guidance system based on the convolutional neural network is shown in figure 4.
First, according to the above embodiment of the present invention, offline model training is performed following the convolutional-neural-network-based procedure for detecting and locating key surgical sites: videos of ultrasound examinations and ultrasound-guided operations are selected as the original training set, and the human hand and the ultrasound probe are selected as the key surgical sites. Model training of the R-CNN is preferably performed on hardware with strong parallel computing capability, for example a computer cluster equipped with multiple GPUs.
Then, the trained R-CNN model is used to process the video input signal of the pan-tilt camera and detect the human hand and the ultrasound probe, and the relevant parameters of the pan-tilt camera, including its angle and/or focal length, are adjusted according to the detection result.
The video input signal of the pan-tilt camera is then encoded in real time with high quality and a code stream is output.
In addition, the ultrasound image output by the ultrasound imaging device may likewise be encoded in real time with high quality and a code stream output (not shown in fig. 4).
[ example two ]
The second embodiment is an end-cloud integrated variant of the first embodiment, suitable for a remote ultrasound guidance system in a high-bandwidth, low-latency mobile communication network environment, as shown in fig. 5.
On the basis of the first embodiment (fig. 4), the video communication terminal of the remote ultrasound guidance system is upgraded to a cloud video intelligent communication terminal; the output code stream is transmitted to a cloud platform over the high-bandwidth, low-latency mobile communication network, and the cloud platform distributes the code stream over different transmission networks to different end users, including mobile video communication terminals, mobile phones, computers, and the like.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A video acquisition method based on a convolutional neural network comprises the following steps:
S101: receiving a video acquired by a video acquisition device;
S102: extracting key parts from video frames of the video through a convolutional neural network model;
S103: adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.
2. The video capture method of claim 1, further comprising: building the convolutional neural network model by:
selecting a model and a training frame;
establishing a training set for marking the key parts;
and training the training set by utilizing the model and the training frame, and outputting the model topology and parameters of the convolutional neural network model.
3. The video capture method of claim 2, wherein the model comprises one of an R-CNN model, a Fast R-CNN model, a Mask R-CNN model, wherein the video capture method further comprises:
selecting a plurality of boxes in a predetermined area of the video frame as candidates for the detection and localization information of the key part.
4. The video capturing method according to any one of claims 1 to 3, wherein the step S102 includes: extracting the key parts from the multi-frame video frames of the video through the convolutional neural network model,
wherein the video capture method further comprises: when the same key part is detected in a preset number of continuous multiframes and the coordinate difference and/or the size difference of the same key part in the continuous multiframes are smaller than a threshold value, judging that the detection of the same key part has consistency,
wherein the step S103 includes: and when the detection of all the same key parts has consistency, adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.
5. The video capture method of claim 4, further comprising: judging whether different key parts accord with a preset position relation or not; when different key parts accord with a preset position relation, determining that the detection of the key parts is normal;
wherein the step S103 includes: and when different key parts accord with a preset position relation and the detection of all the same key parts has consistency, adjusting the physical parameters of the video acquisition device according to the detection information and the positioning information of the key parts.
6. The video capture method of claim 5, wherein the critical sites comprise a hand of a surgeon and a surgical site of a patient, and the physical parameters comprise an angle and a focal length of the video capture device.
7. The video capturing method of claim 6, wherein the step S103 includes: adjusting the angle of the video capture device by:
presetting a sliding time window of length N3 frames, in which a series of displacement vectors representing the motion trajectory of a key part is
[Formula: the displacement vectors of the key part within the window]
wherein 0 ≤ i < N3; the displacement vector of the key part after the angle adjustment of the video acquisition device is ADJ_k, and ADJ_k satisfies:
[Formula: a constraint on ADJ_k relative to the displacement vectors]
8. The video capture method of claim 7, wherein ADJ_k should satisfy
[Formula: ADJ_k expressed using a coefficient γ]
wherein γ < 1.
9. The video capturing method according to claim 8, wherein the step S103 includes:
for the calculated ADJ_k, determining whether the key part still falls within a specific field-of-view region after the adjustment:
if it falls within the specific field-of-view region, adjusting the angle of the video acquisition device according to the calculated ADJ_k; otherwise, increasing or decreasing the value of γ and recalculating ADJ_k until Object_k falls within the specific field-of-view region.
10. The video capturing method of claim 6, wherein the step S103 includes: adjusting the focal length of the video acquisition device according to the proportion of the video frame occupied by the key part.
11. A video capture system, comprising:
the video acquisition device is configured to acquire videos in real time;
an image processing unit in communication with the video capture device to receive video captured by the video capture device and configured to perform the video capture method of any of claims 1-10 and adjust physical parameters of the video capture device.
12. The video capture system of claim 11, wherein the image processing unit is further configured to encode video captured by the video capture device and output a codestream.
13. The video acquisition system of claim 11 or 12, further comprising an ultrasound imaging device in communication with the image processing unit and configured to acquire ultrasound images of a patient, the image processing unit configured to encode the ultrasound images and output a codestream.
CN202010632493.6A 2020-07-02 2020-07-02 Video acquisition method and video acquisition system based on convolutional neural network Pending CN112001224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010632493.6A CN112001224A (en) 2020-07-02 2020-07-02 Video acquisition method and video acquisition system based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010632493.6A CN112001224A (en) 2020-07-02 2020-07-02 Video acquisition method and video acquisition system based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN112001224A true CN112001224A (en) 2020-11-27

Family

ID=73466406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010632493.6A Pending CN112001224A (en) 2020-07-02 2020-07-02 Video acquisition method and video acquisition system based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN112001224A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735163A (en) * 2020-12-25 2021-04-30 北京百度网讯科技有限公司 Method for determining static state of target object, road side equipment and cloud control platform
CN112735163B (en) * 2020-12-25 2022-08-02 阿波罗智联(北京)科技有限公司 Method for determining static state of target object, road side equipment and cloud control platform
CN116208586A (en) * 2023-05-04 2023-06-02 广东珠江智联信息科技股份有限公司 Low-delay medical image data transmission method and system
CN116208586B (en) * 2023-05-04 2023-06-30 广东珠江智联信息科技股份有限公司 Low-delay medical image data transmission method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination