CN109886243B - Image processing method, device, storage medium, equipment and system - Google Patents

Image processing method, device, storage medium, equipment and system

Info

Publication number
CN109886243B
CN109886243B (application CN201910156660.1A)
Authority
CN
China
Prior art keywords
image
frame image
current frame
focus
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910156660.1A
Other languages
Chinese (zh)
Other versions
CN109886243A (en)
Inventor
郑贺
姚建华
韩骁
黄俊洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Healthcare Shenzhen Co Ltd
Original Assignee
Tencent Healthcare Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Healthcare Shenzhen Co Ltd filed Critical Tencent Healthcare Shenzhen Co Ltd
Priority to CN201910757897.5A (CN110458127B)
Priority to CN201910156660.1A (CN109886243B)
Publication of CN109886243A
Application granted
Publication of CN109886243B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/70 - Denoising; Smoothing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/0002 - Inspection of images, e.g. flaw detection
    • G06T7/0012 - Biomedical image inspection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/136 - Segmentation; Edge detection involving thresholding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/194 - Segmentation; Edge detection involving foreground-background segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method, an image processing device, a storage medium, equipment and a system, and belongs to the technical field of machine learning. The method comprises the following steps: acquiring a video image stream of a body part to be detected; sequentially carrying out focus detection on each frame of image in the video image stream; classifying a current frame image according to a first focus detection result of a preamble frame image and a second focus detection result of the current frame image; wherein the preamble frame image is at least one frame image located in front of the current frame image in time sequence. When the method is used for processing an image, the prediction result of the preamble frame image is taken into account in the prediction of the current frame image, so that the advantages of high efficiency and no accumulated error of a single-frame image detection method are retained, the accuracy of image classification is obviously improved by fusing the related information of other frame images, and the coherence of the prediction results is ensured.

Description

Image processing method, device, storage medium, equipment and system
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to an image processing method, an image processing apparatus, a storage medium, a device, and a system.
Background
Machine learning technology, as the core of artificial intelligence, is currently applied in various fields, such as the medical field. In the medical field, machine learning technology is used to process medical images, so that whether a patient suffers from a certain disease can be identified. Taking colorectal cancer as an example, enteroscopy is currently widely applied to colorectal cancer screening: after a medical image of the colorectal part of a patient is obtained, the medical image is processed by using a computer-aided detection technology to detect whether polyps exist on the intestinal wall, thereby assisting a doctor in identifying whether the patient suffers from colorectal cancer according to the presence of polyps.
Continuing with the example of colorectal cancer, in the related art, when polyp detection is performed by image processing, although a video image stream is acquired, only single frame images from the video image stream are input into a polyp detection model; that is, after receiving a single frame image, the polyp detection model first performs feature extraction on the frame image and then determines whether a polyp exists in the frame image based on the extracted features.
The image processing method has at least the following problems:
First, the above image processing method places a high requirement on the accuracy of the polyp detection model, and considering the complexity of actual scenes, the accuracy of the polyp detection model has a bottleneck. For example, occlusion, over-bright or over-dark lighting, motion blur, defocus blur, etc. may occur during enteroscopy; in addition, polyp size, morphology, color, etc. may vary from patient to patient, as may the camera-to-polyp distance, the terminal model, and so on. All of the above factors may affect the detection accuracy of the model.
Second, the prediction results of the above image processing method lack coherence. Noise is often introduced while the video image stream is acquired, and even if the camera does not move, slight differences usually exist between two adjacent frame images due to intestinal movement. These differences sometimes cause two adjacent frame images to produce quite different prediction results; that is, for a visually identical region, the polyp detection model gives different, incoherent and inconsistent prediction results.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, a storage medium, equipment and a system, and solves the problem of low detection accuracy in the related technology. The technical scheme is as follows:
in one aspect, an image processing method is provided, and the method includes:
acquiring a video image stream of a body part to be detected;
sequentially carrying out focus detection on each frame of image in the video image stream;
classifying a current frame image according to a first focus detection result of a preamble frame image and a second focus detection result of the current frame image;
wherein the preamble frame image is at least one frame image located in front of the current frame image in time sequence.
In another aspect, there is provided an image processing apparatus, the apparatus including:
the acquisition module is used for acquiring a video image stream of the body part to be detected;
the detection module is used for sequentially carrying out focus detection on each frame of image in the video image stream;
the processing module is used for classifying the current frame image according to a first focus detection result of the preamble frame image and a second focus detection result of the current frame image;
wherein the preamble frame image is at least one frame image located in front of the current frame image in time sequence.
In a possible implementation manner, the detection module is further configured to input the current frame image into a detection model, and obtain a first segmented image output by the detection model, where each pixel in the first segmented image represents a probability value that a pixel at a corresponding position in the current frame image is a lesion; adjusting the first segmentation image, and post-processing the adjusted first segmentation image; calculating a connected component of at least one foreground region in the post-processed first segmentation image; sorting the at least one connected component by size and sorting the at least one connected component by similarity to a target shape; and when the maximum connected component is consistent with the connected component closest to the target shape, determining a foreground region indicated by the maximum connected component as a focus central point of the current frame image.
In a possible implementation manner, the detection module is further configured to obtain an adjusted second segmented image matched with an image of a previous frame, and obtain an average value of the first segmented image and the adjusted second segmented image to obtain the adjusted first segmented image; performing binarization processing on the adjusted first segmentation image by taking a specified numerical value as a threshold value; and removing noise points in the first segmentation image after the binarization processing and smoothing the foreground edge.
In a possible implementation manner, the processing module is further configured to stop tracking the predicted lesion center point corresponding to the predicted position coordinate in a next frame of image when the predicted position coordinate exceeds the image range of the current frame of image; or when the classifier judges the predicted focus central point corresponding to the predicted position coordinate as the background, stopping tracking the predicted focus central point corresponding to the predicted position coordinate in the next frame of image.
In a possible implementation manner, the processing module is further configured to stop tracking the focus central point when the number of tracking frames for any focus central point is greater than a first number; or, stopping tracking of a lesion center point when tracking of the lesion center point fails in a second number of consecutive images.
In a possible implementation manner, the processing module is further configured to connect neighboring lesion center points with euclidean distances smaller than a target threshold when the number of lesion center points given by the second prediction result and the third prediction result is at least two; and calculating the connected components of the central points of the at least two focuses, and determining the central point of the focus corresponding to the maximum connected component as the final central point of the focus of the current frame image.
In another aspect, a storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the image processing method described above.
In another aspect, an image processing apparatus is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the image processing method described above.
In another aspect, an image processing system is provided, the system comprising: the system comprises image acquisition equipment, image processing equipment and display equipment;
the image acquisition equipment is used for acquiring images of the body part to be detected to obtain a video image stream of the body part to be detected;
the image processing apparatus includes a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement: acquiring a video image stream of a body part to be detected; sequentially carrying out focus detection on each frame of image in the video image stream; classifying a current frame image according to a first focus detection result of a pre-frame image and a second focus detection result of the current frame image, wherein the pre-frame image is at least one frame image positioned in front of the current frame image in time sequence;
the display device is used for displaying the result output by the image processing device.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
when image processing is carried out, the prediction result of the preamble frame image is taken into account in the prediction of the current frame image, namely the prediction result of the preamble frame image and the image information of the current frame image are combined to finish the final prediction of the single frame image, so that the advantages of high efficiency and no accumulated error of single-frame image detection are retained, the accuracy of image classification is obviously improved by fusing the related information of other frame images, and the coherence of the prediction results is ensured.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation environment related to an image processing method provided in an embodiment of the present application;
fig. 2 is a schematic view illustrating a lesion detection process of an image processing method according to an embodiment of the present disclosure;
fig. 3 is a flowchart of an image processing method provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a U-net network provided in an embodiment of the present application;
FIG. 5 is a flowchart of a method for polyp detection in a single frame image according to an embodiment of the present application;
fig. 6 is a schematic flowchart of an online training CNN classifier according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image processing apparatus 800 according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, some terms related to the embodiments of the present application will be explained.
CNN: the full English name is Convolutional Neural Network.
In short, CNN is a computational network composed of multiple convolution operations, mostly used for deep learning. The deep learning technology is a technology for performing machine learning by using a deep neural network system.
CAD: computer air diagnosed Diagnosis, Chinese name Computer Aided Diagnosis.
The CAD is used for assisting in finding the focus and improving the diagnosis accuracy by combining the analysis and calculation of a computer through the imaging technology, the medical image processing technology and other possible physiological and biochemical means.
Video image stream: refers to a video stream formed by image-capturing a body part (a target organ on the human body) by an image-capturing device.
For example, taking the target organ as the colon and the rectum as an example, the video image stream refers to a video stream including a plurality of frames of intestinal tract images formed by image acquisition of the colon and the rectum by a medical instrument.
Enteroscopy: an endoscope for medical examination of the intestinal tract.
Polyps: refers to the growth of neoplasms on the surface of human tissue, and modern medicine generally refers to the neoplasms growing on the surface of human mucosa as polyps, including hyperplastic, inflammatory, hamartoma, adenoma and other tumors. It should be noted that polyps are one of the benign tumors.
Focus: a focus (lesion) generally refers to the part of the body where a pathological change occurs. Alternatively, a limited area of diseased tissue containing pathogenic microorganisms may be referred to as a lesion.
Illustratively, if a lobe of the lung is destroyed by tubercle bacillus, then this portion is the tuberculosis lesion.
In the embodiments of the present application, a lesion refers to a polyp; in one possible implementation, the lesion is specifically referred to herein as an intestinal polyp.
Image classification: namely, the category to which the content contained in the image belongs is determined through image classification. In the embodiment of the present application, by classifying the medical image, it is possible to specify whether or not a polyp is present on the target organ of the patient. Illustratively, whether intestinal polyps exist on the intestinal tract of a patient can be identified through the image processing method provided by the embodiment of the application.
Optical flow: pixel motion exists between two adjacent frame images, that is, the position of a pixel in the previous frame image changes slightly in the next frame image; this change, i.e., the displacement vector, is the optical flow of the pixel.
It is well known that colorectal cancer is one of the common causes of cancer death worldwide. Currently, the standard method to reduce colorectal cancer mortality is to screen for polyps by colorectal screening, and enteroscopy has been widely used in colorectal cancer screening as a common practice today. During an enteroscopy, a clinician photographs the intestinal wall through the image acquisition device of a medical instrument and performs polyp detection based on the acquired medical images. However, once the clinician misses a detection, the patient may miss the opportunity for early disease detection and treatment, which may present a significant health risk. Therefore, in order to reduce the risk of misdiagnosis and reduce the burden on clinicians, the embodiment of the application realizes automatic polyp detection during enteroscopy of a patient through a computer-aided diagnosis method, namely the image processing method described below.
The following first describes an implementation environment related to the image processing method provided by the embodiment of the present invention.
Fig. 1 is a schematic diagram of an implementation environment related to an image processing method according to an embodiment of the present application. Referring to fig. 1, the implementation environment includes an image processing device 101, a display device 102, and an image acquisition device 103. The above-described image processing apparatus 101, display apparatus 102, and image pickup apparatus 103 constitute an image processing system. The display device 102 may be a display, and the image processing device 101 includes, but is not limited to, a fixed terminal and a mobile terminal, which is not particularly limited in this embodiment of the present application.
The image acquisition equipment 103 is used for acquiring images of a body part to be detected to obtain a video image stream of the body part to be detected; the image processing apparatus 101 comprises a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement: acquiring a video image stream; sequentially detecting the focus of each frame of image in the video image stream; classifying the current frame image according to a first focus detection result of a preamble frame image and a second focus detection result of the current frame image, wherein the preamble frame image is at least one frame image positioned in front of the current frame image in time sequence; the display device 102 is used to display the result output by the image processing device.
Taking a lesion as a polyp and performing polyp detection on the colorectal as an example, in the embodiment of the application, a clinician observes the colorectal of a patient through an enteroscope. The image capturing device 103, which is a medical instrument for performing enteroscopy, extends deep into the intestinal tract to capture images of the intestinal wall, and transmits the captured video image stream to the image processing device 101. Wherein the image capturing device 103 is a camera.
The image processing apparatus 101 is responsible for determining whether intestinal polyps are present in the currently acquired video image stream by the image processing method provided in the embodiment of the present application. If so, the image processing apparatus 101 is responsible for controlling the display apparatus 102 to display the output results, and to prompt the clinician.
The prompting method includes, but is not limited to: an audio prompt, a special warning prompt on the display device 102 or an indicator light, and highlighting the detected polyp region in the video picture displayed on the display device 102; this is not particularly limited in the embodiment of the present application.
Based on the above description about the implementation environment, the image processing method provided by the embodiment of the present application, at an architecture level, on one hand, completes polyp prediction on a single frame image through an end-to-end deep learning network. On the other hand, the embodiment of the application also adds a tracking method for polyp detection, and integrates the prediction result of the pre-frame image and the image information of the current frame image to complete the final polyp prediction of the current frame image.
Continuing with the example of taking a lesion as a polyp and performing polyp detection on a colorectal region, referring to fig. 2, the detailed implementation steps of the image processing method provided by the embodiment of the present application include, but are not limited to:
1) Acquire a colorectal video image stream by enteroscopy.
2) For each frame image in the video image stream, detect whether polyps exist in the current frame image through an end-to-end deep learning network.
That is, the present embodiment first detects and segments polyps in images based on a deep learning network. For any frame image, as long as a polyp is detected in the frame image, the center point coordinates of the polyp are calculated; that is, if a polyp exists in the current frame image, the above-mentioned deep learning network will usually also give the spatial position of the polyp.
Additionally, the deep learning network described above is also referred to herein as a detection model.
3) Track, in the next frame image, the appearance position of the polyp detected by the single-frame detection method given in step 2).
For example, it may be attempted to track the appearance position of the polyp in the next frame image by using the optical flow tracking method, and for the complex case that the optical flow tracking method fails, the optical flow tracking convolutional neural network is used to track the appearance position of the polyp in the other frame images continuously.
In the embodiment of the present application, after a polyp is detected in one frame of image, it will continue to be tracked in subsequent frames until the stopping rule is satisfied. During tracking, optical flow tracking is used for easier cases, while optical flow tracking convolutional neural networks are used for more difficult cases.
4) For each frame image, combine the polyps obtained by tracking with the polyp prediction of the current frame image to obtain a final prediction result of whether polyps exist in the current frame image and of the positions at which the polyps appear in the current frame image.
Note that if a polyp is not included in a certain frame image, the frame image is regarded as a negative frame. If multiple polyp center points are included in the frame image (some of which are inherited by tracking from previous frame images), the embodiments of the present application employ a spatially weighted voting algorithm to retain the polyp center point with the highest confidence and use it as the final polyp center point, while deleting the other polyp center points.
In summary, based on the characteristic that polyp features in the same patient and the same enteroscopy are relatively consistent, the embodiment of the application provides an image classification mode combining means such as single-frame detection, feature inheritance of previous and subsequent frames, moving target tracking and the like, the scheme not only absorbs the advantages of high efficiency and no accumulated error of a single-frame detection method provided in the related technology, but also obviously improves the polyp detection accuracy and ensures the coherence of prediction results by fusing video information.
Fig. 3 is a flowchart of an image processing method according to an embodiment of the present application. The main execution body of the method is the image processing device 101 shown in fig. 1, taking polyp detection on the colorectal as an example, referring to fig. 3, the method flow provided by the embodiment of the present application includes:
301. acquiring a video image stream of a body part to be detected.
Herein, the body part refers to a human organ, and exemplarily, the body part refers to a colorectal in the present embodiment. The video image stream is usually obtained by the camera of the medical instrument extending into the body part for image acquisition. And the camera can directly transmit the image to the image processing equipment after acquiring the image.
302. Sequentially carrying out focus detection on each frame of image in the video image stream; and classifying the current frame image according to a first focus detection result of the preamble frame image and a second focus detection result of the current frame image, wherein the preamble frame image is at least one frame image positioned in front of the current frame image in time sequence.
Single frame image lesion detection
Since the body region is referred to as a colorectal region in the embodiment of the present application, polyp detection is performed on a single frame image in this step. In the embodiment of the application, a single-frame image polyp detection is performed by adopting an end-to-end method based on deep learning.
Illustratively, referring to FIG. 4, an embodiment of the present application employs a fully convolutional neural network named U-net to segment each frame image. The U-net network is an end-to-end CNN (Convolutional Neural Network), whose input is an image and whose output is the segmentation result of the object of interest in the image.
Stated another way, the input and output of the U-net network are both images, the U-net network does not include a fully connected layer, and image segmentation is used to segment the exact contour of the object of interest.
As shown in fig. 4, the left half of the U-net network is used for feature extraction, including a convolutional layer and a pooling layer, and after an image is input into the U-net network, layer-by-layer feature extraction of the input image can be completed through cooperation of the convolutional layer and the pooling layer.
The convolution layer executes convolution operation through a convolution kernel, and further realizes feature extraction of the input image. It should be noted that the output of the previous convolutional layer can be used as the input of the next convolutional layer, and the extracted feature information is generally characterized by a feature map (feature map). In addition, since the features learned by one layer of convolution tend to be local, and the higher the number of layers of the convolutional layer, the more global the learned features become, in order to extract the global features of the input image, a plurality of convolutional layers are generally included in the U-net network, and a plurality of convolution kernels are generally included in each convolutional layer.
The pooling layer is specifically used for reducing the dimension to reduce the amount of calculation and avoid overfitting, for example, a large image can be reduced by using the pooling layer, and important information in the image is kept.
Further, the right half of the U-net network is used for performing deconvolution operations, including deconvolution layers, convolution layers, and splicing steps. As shown in FIG. 4, the network structure is similar to a U-shaped structure, so the network structure is called a U-net network. It should be noted that, each time of deconvolution, feature fusion with the feature extraction part is performed correspondingly once, that is, feature concatenation is performed once.
Deconvolution is also called transposed convolution: the forward propagation process of the deconvolution layer is the backward propagation process of the convolution layer, and the backward propagation process of the deconvolution layer is the forward propagation process of the convolution layer. Since the deconvolution process goes from a small size to a large size, the input image and the output segmented image are of the same size.
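To make the U-shaped structure described above concrete, the following is a minimal sketch of a U-net-style fully convolutional network in PyTorch. The channel sizes, number of encoder/decoder levels and class names are illustrative assumptions and do not come from the patent, which only specifies convolution and pooling on the left, deconvolution and feature concatenation on the right, and an output segmentation map of the same size as the input.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two 3x3 convolutions with ReLU, the basic U-net building block
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Illustrative two-level U-net: encoder (conv + pooling),
    decoder (transposed conv + concatenation), 1-channel probability map."""
    def __init__(self):
        super().__init__()
        self.enc1 = double_conv(3, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = double_conv(64, 32)          # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.enc1(x)                        # feature extraction, full resolution
        f2 = self.enc2(self.pool(f1))            # deeper, lower-resolution features
        u = self.up(f2)                          # deconvolution back to full resolution
        d = self.dec1(torch.cat([u, f1], dim=1)) # concatenate encoder features (skip connection)
        return torch.sigmoid(self.head(d))       # per-pixel polyp probability, same size as input

# x = torch.randn(1, 3, 288, 384); MiniUNet()(x).shape == (1, 1, 288, 384)
```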
In one possible implementation, referring to fig. 5, taking the currently processed image of one frame (referred to as the current frame image for short) as an example, polyp detection on the image of a single frame includes, but is not limited to, the following steps:
302a, inputting the current frame image into the trained detection model, and obtaining a first segmentation image of the current frame image output by the detection model.
The detection model refers to the above mentioned U-net network, and the size of the current frame image is consistent with the size of the first segmentation image. In another expression, the input image is passed through a U-net network to obtain a segmented image, and the size of the segmented image is the same as the size of the input image.
The first and the subsequent second are only used for distinguishing different segmented images, and do not constitute any other limitation.
In addition, each pixel point in the first segmentation image represents the probability value that the pixel point at the corresponding position in the current frame image is a polyp; in another expression, each pixel in the segmented image refers to a probability value that a region where the corresponding pixel in the original image is located is a polyp. Illustratively, 1 denotes a polyp, and 0 denotes a non-polyp, which is not particularly limited in the embodiments of the present application.
302b, adjusting the first segmentation image, and post-processing the adjusted first segmentation image.
In order to reduce the jitter effect, the embodiment of the application also performs a weighted average of the first segmented image and the segmented images of the previous multiple frame images, and takes the weighted average result as the final segmented image of the current frame image. Illustratively, the weight may be 0.5^(d+1), where d is the distance (i.e., the number of frames) from the frame image to which a segmented image corresponds to the current frame image.
Taking the segmented image of the t-th frame image as S_t, the adjusted segmented image \tilde{S}_t is:

\tilde{S}_t = \frac{\sum_{d=0}^{t-1} 0.5^{d+1} S_{t-d}}{\sum_{d=0}^{t-1} 0.5^{d+1}}

Because the denominator in the above formula is close to 1 when t is large, the denominator can be discarded; merging the terms of the numerator gives:

\tilde{S}_t = 0.5\, S_t + 0.5\, \tilde{S}_{t-1} = \frac{S_t + \tilde{S}_{t-1}}{2}
that is, in the embodiment of the present application, the final divided image of the current frame image is calculated using the divided image calculated based on the current frame image and the adjusted divided image of the previous frame image, and the final divided image is an average value of the two. Put another way, the first segmented image is adjusted, including but not limited to: and acquiring an adjusted second segmentation image matched with the previous frame image, and calculating the average value of the first segmentation image and the adjusted second segmentation image to obtain an adjusted first segmentation image.
In a possible implementation manner, in order to reduce the risk of obtaining a false positive prediction result, after obtaining the adjusted first segmented image, the embodiment of the present application further performs a post-processing operation on the adjusted first segmented image. Wherein, the post-processing of the adjusted first segmentation image includes but is not limited to: performing binarization processing on the adjusted first segmentation image by taking the designated numerical value as a threshold value; and removing noise points in the first segmentation image after the binarization processing and smoothing the foreground edge.
The value of the designated value may be 0.5 or 0.6, and the like, which is not specifically limited in the embodiment of the present application. For example, taking the value of the designated value as 0.5 as an example, in the post-processing, the first divided image after adjustment is first binarized with 0.5 as a threshold value. And then, carrying out erosion operation on the segmented image after the binarization processing, thereby removing small noise points and smoothing the foreground edge.
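The binarization and erosion described above might be implemented as follows, assuming an OpenCV-based implementation; the 0.5 threshold is one of the example values given in the text, while the 5x5 structuring element is an assumption the patent does not fix.

```python
import cv2
import numpy as np

def postprocess(adjusted_seg, threshold=0.5):
    """Binarize the adjusted segmented image and erode it to remove small
    noise points and smooth the foreground edge."""
    binary = (adjusted_seg >= threshold).astype(np.uint8)  # threshold to {0, 1}
    kernel = np.ones((5, 5), np.uint8)                     # assumed structuring element
    return cv2.erode(binary, kernel, iterations=1)
```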
302c, calculating connected components of at least one foreground region in the post-processed first segmented image, and sorting the at least one connected component by size and, separately, by degree of similarity to a target shape.
Since polyps tend to appear circular or elliptical in segmented images, the target shape may be circular or elliptical, which is not particularly limited in the embodiments of the present application.
For example, the calculated connected components may be sorted in descending order of size, and separately sorted in descending order of ellipticity.
And 302d, when the maximum connected component is consistent with the connected component closest to the target shape, determining the foreground region indicated by the maximum connected component as the focus central point of the current frame image.
Since polyps tend to appear circular or elliptical in segmented images and their regions are large, only when the largest connected component and the connected component closest to the target shape are the same connected component is the foreground region indicated by that connected component determined as the polyp center point. Continuing with the above example, for the two sorted lists, if the connected component ranked at the head of both lists is the same connected component, the foreground region corresponding to that connected component is determined as the polyp center point of the current frame image.
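A sketch of steps 302c-302d is given below, using OpenCV connected-component analysis and a simple circularity measure as the "similarity to the target shape"; the circularity formula is an assumption, since the patent does not state how the ellipse degree is computed.

```python
import cv2
import numpy as np

def find_lesion_center(binary_mask):
    """Return the centroid of the foreground component that is both the largest
    and the most circular, or None if the two rankings disagree."""
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_mask, connectivity=8)
    if n <= 1:                                   # label 0 is the background
        return None
    scores = []
    for lbl in range(1, n):
        area = stats[lbl, cv2.CC_STAT_AREA]
        component = (labels == lbl).astype(np.uint8)
        contours, _ = cv2.findContours(component, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        perimeter = cv2.arcLength(contours[0], True)
        circularity = 4 * np.pi * area / (perimeter ** 2 + 1e-6)  # 1.0 for a perfect circle
        scores.append((lbl, area, circularity))
    largest = max(scores, key=lambda s: s[1])[0]
    roundest = max(scores, key=lambda s: s[2])[0]
    if largest == roundest:                      # both rankings agree on the same component
        return tuple(centroids[largest])         # (x, y) lesion center point
    return None
```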
The above describes a process of performing lesion detection on a single frame image by taking a currently processed frame image as an example; continuing with the current frame image as an example, in addition to the above description, the image processing apparatus may further track a focus center point in the current frame image based on a focus detection result of the previous frame image.
Focus tracking
After a polyp is detected in one frame of image, the embodiment of the application tracks the polyp in the following frames of images until the stopping tracking rule is satisfied. It should be noted that during tracking, the optical flow method is used for easier tracking, and the optical flow tracking convolutional neural network is used for more difficult processing.
Here, optical flow is the apparent motion pattern of image objects between two successive frames; it is a 2D vector field, each vector being a displacement vector representing the movement of a point from the first frame image to the second frame image.
In general, the optical flow method is based on the following two assumptions:
1. the pixel intensity of the same object does not change between successive frames;
2. adjacent pixels have similar motion. In the present embodiment, given the polyp center coordinates (x, y) of frame t, the position where it appears in the next frame image can be tracked using optical flow.
Based on the above description, the principle of optical flow for polyp tracking is as follows:
for each frame image in the video image stream, detect the foreground object, namely the polyp, that may appear; if a polyp center point appears in a certain frame image, then for any two adjacent frame images, search the current frame image for the position at which the polyp center point that appeared in the previous frame image now appears, thereby obtaining the position coordinates of the foreground object in the current frame image; iterating in this way achieves polyp tracking.
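A sketch of this iteration using Lucas-Kanade sparse optical flow in OpenCV follows; the patent does not name a specific optical flow algorithm, so this choice and its parameters are assumptions.

```python
import cv2
import numpy as np

def track_center_by_optical_flow(prev_gray, curr_gray, prev_center):
    """Track a single polyp center point from the previous frame to the current frame.

    prev_gray, curr_gray: consecutive grayscale frames.
    prev_center:          (x, y) polyp center point in the previous frame.
    Returns the tracked (x, y) in the current frame, or None if tracking failed.
    """
    p0 = np.array([[prev_center]], dtype=np.float32)   # shape (1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None,
                                                winSize=(21, 21), maxLevel=3)
    if status[0][0] == 1:
        return tuple(p1[0][0])
    return None   # fall back to the motion regression model described below
```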
However, blurred images or image artifacts may not satisfy the above two assumptions, and polyp tracking using the optical flow method may fail. In the embodiment of the present application, in order to determine whether to continue polyp tracking, a more robust motion regression model is used to evaluate whether tracking should continue and to carry tracking forward when optical flow tracking fails.
Motion regression model
In the embodiment of the application, the motion regression model is adopted to predict the motion condition of the polyp center point in the current frame image by utilizing the motion condition of the polyp center point in the previous frame image.
Illustratively, assume ΔP_t = P_t − P_{t−1} is the motion vector of the polyp center point in the t-th frame, where P_t is the position of the polyp center point in the t-th frame and P_{t−1} is the position of the polyp center point in the (t−1)-th frame. The embodiment of the present application predicts the motion vector ΔP_t of the polyp center point in the current frame image by linear fitting, using the motion vectors of the polyp center point in the previous frame images.
Illustratively, the embodiment of the present application uses the motion vectors of the polyp center point in the preceding three frames, [ΔP_{t−3}, ΔP_{t−2}, ΔP_{t−1}], to predict the motion vector ΔP_t of the current frame image, and then obtains the position P_t of the polyp center point in the current frame image through the formula P_t = ΔP_t + P_{t−1}. The position P_t is the prediction of the polyp location in the current frame image from the previous frame images; that is, the polyp center point is tracked frame by frame.
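A minimal sketch of this motion regression is given below, assuming the motion vectors of the last three frames are available; fitting a first-order polynomial over the frame index and extrapolating one step is one plausible reading of "linear fitting", not a detail fixed by the patent.

```python
import numpy as np

def predict_center(prev_deltas, prev_center):
    """Predict the polyp center point of the current frame.

    prev_deltas: recent motion vectors [dP_{t-3}, dP_{t-2}, dP_{t-1}],
                 each a (dx, dy) pair.
    prev_center: position P_{t-1} of the center point in the previous frame.
    Returns the predicted position P_t = dP_t + P_{t-1}.
    """
    deltas = np.asarray(prev_deltas, dtype=np.float64)   # shape (3, 2)
    idx = np.arange(len(deltas))
    # fit dx and dy separately against the frame index and extrapolate one step
    dp_t = np.array([np.polyval(np.polyfit(idx, deltas[:, k], 1), len(deltas))
                     for k in range(2)])
    return np.asarray(prev_center, dtype=np.float64) + dp_t
```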
In one possible implementation, if the position P_t is within the image range of the current frame image, a classifier, as described below, is further used to determine whether the predicted position P_t is an actual polyp center point; if so, tracking continues in the subsequent frames; otherwise, tracking is stopped. The classifier may be a CNN classifier, which is not specifically limited in this embodiment of the present application.
Based on the above description, the current frame image is classified according to the first polyp detection result of the preamble frame image and the second polyp detection result of the current frame image, which includes but is not limited to:
polyps are tracked in the current frame image: predicting the motion vector of the polyp center point of the current frame image through linear fitting according to the motion vector of the polyp center point of at least one frame image in the pre-frame image; then, based on the motion vector of the polyp central point predicted by the current frame image, tracking the position coordinate of the polyp central point predicted in the current frame image; and when the predicted position coordinate obtained by tracking is positioned in the image range of the current frame image, judging the predicted position coordinate based on the classifier to obtain a third polyp detection result.
Wherein, the polyp central point of at least one frame image in the preorder frame images is obtained based on a first polyp detection result; the first polyp detection result is a general term for polyp detection results of all images in a preceding frame image.
In the embodiment of the present application, since a polyp center point is obtained by performing the above-mentioned single frame detection on one frame image and is then continuously tracked in the subsequent multiple frame images, a frame image may contain several polyp center points, coming, for example, from the single frame detection result and from the tracking inheritance result respectively. Thus, the polyp center point of the at least one frame image may include both the single frame detection result and the tracking inheritance result. In addition, regarding the tracking inheritance process, tracking of a polyp center point starts from single frame detection; put another way, after a polyp center point is obtained through single frame detection, prediction of whether that polyp center point appears in subsequent frame images begins.
The at least one frame of image may be a partial image or a whole image in a preamble frame of image, which is not specifically limited in this embodiment of the present application. For example, the at least one frame of image may be three frames of images temporally located before the current frame of image.
Then, the current frame image is classified based on a second polyp detection result obtained by performing single frame detection on the current frame image and a third polyp detection result obtained by performing polyp tracking.
As described above, after the polyp prediction result based on the preamble frame image completes the prediction of the current frame image, the classifier trained on line is also used to actually determine the polyp center point. Before explaining this decision process, the classifier's on-line training process is described.
On-line training classifier
In an actual scene, the appearance of polyps is observed to be consistent from frame to frame according to experience, so the embodiment of the application provides an online-trained optical flow tracking CNN framework for determining whether the polyp center point predicted by a motion regression model is a real polyp, and further determining whether the motion regression model should stop tracking.
Since the intermediate feature map extracted in the U-net network calculation process includes the polyp feature required for optical flow tracking CNN calculation, and since each frame image is calculated through the U-net network in the embodiment of the present application, the intermediate feature map generated in the U-net network calculation process can be directly used as the input of the optical flow tracking CNN in order to reduce the calculation complexity and improve the calculation efficiency. Referring to fig. 6, for the tracking process of the current frame image, the embodiment of the present application uses the feature map extracted after the previous frame image of the current frame image is input into the U-net network shown in fig. 4 as the shared feature.
In addition, in order to optimize the classifier and determine whether the tracked polyp still exists in the current frame image, in the embodiment of the present application, positive samples of the target number are collected in a region near the polyp detected in the previous frame image, negative samples of the target number are collected in a region far away from the polyp, and then, pooling operation is performed on the shared feature map corresponding to the region where the samples are located, so that shared feature length is standardized, and then, on-line training of the classifier is realized based on the positive samples of the target number and the negative samples of the target number. Put another way, the embodiments of the present application fine-tune the classifier by a certain number of positive and negative samples from the previous frame of image, thereby completing the classification of polyps.
The value of the target number may be 4, which is not specifically limited in this embodiment of the present application.
In one possible implementation, the target number of positive samples and the target number of negative samples are generated based on a shared profile, including but not limited to: in a final segmentation image acquired from the previous frame of image, cutting an image region with the overlapping range with the polyp region larger than a first value to obtain positive samples with a target quantity; and in the final segmentation image acquired from the previous frame of image, cutting an image region of which the overlapping range with the polyp region is smaller than a second value to obtain negative samples of the target quantity. The first value may be 0.7, and the second value may be 0.1, which is not specifically limited in this embodiment of the application.
It should be noted that the final segmented image obtained from the previous frame of image is referred to as a third segmented image in this document, and the third segmented image is a segmented image after being subjected to the similar adjustment process shown in the above step 302 b.
Illustratively, as shown in fig. 6, four positive samples and four negative samples may be generated based on the previous frame image of the current frame image, such as positive samples whose Jaccard overlap with the polyp region is more than 0.7 and negative samples whose Jaccard overlap with the polyp region is less than 0.1. The classifier is fine-tuned (trained on line) with these eight samples, and the polyp predicted for the current frame image by the motion regression model is then classified based on the on-line trained classifier.
Here, the Jaccard coefficient is used to compare similarity and difference between finite sample sets; the larger the value of the Jaccard coefficient, the higher the sample similarity.
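The sampling rule above could be sketched as follows; the 48 × 48 region size matches the ROI size mentioned later, but the box jittering strategy and the Jaccard computation over axis-aligned boxes are assumptions, since the patent only states the overlap thresholds (more than 0.7 for positives, less than 0.1 for negatives) and the number of samples.

```python
import numpy as np

def jaccard(box_a, box_b):
    """Jaccard (intersection over union) of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def sample_boxes(polyp_box, image_size, num=4, size=48, rng=np.random):
    """Draw `num` positive (Jaccard > 0.7) and `num` negative (Jaccard < 0.1)
    candidate regions based on the polyp detected in the previous frame."""
    h, w = image_size
    px, py = polyp_box[0], polyp_box[1]
    positives, negatives = [], []
    while len(positives) < num:                       # jitter around the polyp box
        x = int(np.clip(px + rng.randint(-5, 6), 0, w - size))
        y = int(np.clip(py + rng.randint(-5, 6), 0, h - size))
        box = (x, y, x + size, y + size)
        if jaccard(box, polyp_box) > 0.7:
            positives.append(box)
    while len(negatives) < num:                       # anywhere far from the polyp
        x, y = rng.randint(0, w - size), rng.randint(0, h - size)
        box = (x, y, x + size, y + size)
        if jaccard(box, polyp_box) < 0.1:
            negatives.append(box)
    return positives, negatives
```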
It should be noted that, the position coordinates of the polyp center point in the current frame image have been predicted by the motion regression model, so this step uses the classifier trained on line to classify it, i.e. determine whether it is an actual polyp; if the classifier determines it as a non-polyp, i.e., classifies it as a background region rather than a foreground region, the continuation of the tracking is stopped, i.e., is cancelled in the subsequent frame image.
In a possible implementation manner, considering the problem of operation speed, as shown in fig. 6, the embodiment of the present application takes the segmentation image output by the U-net network as the input of the classifier. In fig. 6, the size of an input image fed into the U-net network is 288 × 384 × 3, and the U-net network produces shared features with dimensions 18 × 24 × 512, which are reduced by a factor of 16 from the size of the input original image. Therefore, when extracting the feature map of an ROI, the ROI is directly scaled down by a factor of 16.
In addition, the feature map of the ROI is directly cropped to the corresponding size; for example, for convenience of operation, the length and width of the ROI are fixed to 48 × 48, and a 3 × 3 region is then cropped out of the feature map of the ROI, completing the cropping of the positive and negative samples. A 1 × 1 × 256 convolutional layer plus a nonlinear layer is then connected, followed by two fully connected layers and a softmax layer; finally, polyp classification is completed using a cross entropy loss function, giving the classification result of whether the result predicted by the preceding motion regression model is an actual polyp.
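A sketch of this classification head in PyTorch follows; the hidden size of the first fully connected layer is an assumption, while the 3 × 3 × 512 input crop, the 1 × 1 × 256 convolution plus nonlinearity, the two fully connected layers, the softmax and the cross entropy loss follow the description above.

```python
import torch
import torch.nn as nn

class OnlineClassifier(nn.Module):
    """Binary polyp/background classifier over a 3x3 crop of the 512-channel
    shared feature map produced by the U-net."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(inplace=True))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 3 * 3, 128), nn.ReLU(inplace=True),  # hidden size assumed
            nn.Linear(128, 2),                                   # polyp vs. background logits
        )

    def forward(self, roi_feat):              # roi_feat: (N, 512, 3, 3)
        return self.fc(self.conv(roi_feat))   # softmax is applied inside the loss

# Online fine-tuning on the 4 positive and 4 negative samples from the previous frame:
# loss = nn.CrossEntropyLoss()(model(batch_feats), batch_labels)   # cross entropy loss
```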
Tracing stop rules
In the embodiment of the present application, if the predicted position coordinates of the polyp center point of the current frame image exceed the image range of the current frame image, or the online-trained classifier classifies the polyp center point predicted for the current frame image as the background, the embodiment of the present application stops tracking the predicted polyp center point.
That is, when the predicted position coordinate of the polyp center point of the current frame image exceeds the image range of the current frame image, tracking of the predicted polyp center point corresponding to the predicted position coordinate is stopped in the next frame image; or, when the classifier judges the predicted polyp center point corresponding to the predicted position coordinate as background, tracking of the predicted polyp center point corresponding to the predicted position coordinate is stopped in the next frame image.
In one possible implementation, since the same polyp can be tracked by a plurality of polyp center points generated from different frame images, in order to save calculation time and reduce unnecessary tracking, when the number of tracking frames for any one polyp center point is greater than a first number, the polyp center point is stopped from being tracked; the value of the first number may be 10, which is not specifically limited in this embodiment of the present application.
In addition, to reduce errors caused by online trained classifiers, tracking of a polyp center point is stopped when it fails to track the polyp center point in a second number of consecutive images. The value of the second quantity may be 3, which is not specifically limited in this embodiment of the present application.
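The stop rules above could be combined into a small helper; the field names and counters are illustrative, while the default thresholds (a first number of 10 and a second number of 3) are the example values given in the text.

```python
def should_stop_tracking(track, image_size, max_frames=10, max_misses=3):
    """Decide whether tracking of one polyp center point should stop.

    `track` is assumed to carry: the predicted (x, y), whether the classifier
    judged it as background, how many frames it has been tracked, and how many
    consecutive frames tracking has failed.
    """
    h, w = image_size
    x, y = track["predicted_center"]
    out_of_image = not (0 <= x < w and 0 <= y < h)        # prediction left the frame
    return (out_of_image
            or track["classified_as_background"]           # classifier rejected it
            or track["frames_tracked"] > max_frames         # tracked longer than the first number
            or track["consecutive_misses"] >= max_misses)   # failed in a second number of frames
```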
In the embodiment of the present application, the current frame image is classified based on the second lesion detection result and the third lesion detection result.
Spatial voting algorithm
In the embodiment of the present application, after tracking is performed, a frame image may contain a plurality of polyp center points; however, some of these polyp center points may be outliers, which may seriously affect the classification result, while according to empirical observation the correct polyp center points are concentrated in a small region. In short, adjacent polyp center points whose Euclidean distance is smaller than the target threshold are first connected, then connected components are calculated, and the polyp center point of the largest connected component is taken as the final polyp center point of the current frame image.
In another expression, the current frame image is classified based on the second lesion detection result and the third lesion detection result, which includes but is not limited to: when the number of the polyp central points given by the second focus detection result and the third focus detection result is at least two, connecting the adjacent polyp central points with Euclidean distance smaller than a target threshold value; and calculating connected components of the center points of at least two polyps, and determining the polyp center point corresponding to the maximum connected component as the final polyp center point of the current frame image.
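A sketch of the spatial voting algorithm follows; the distance threshold is left as a parameter because the patent does not fix its value, and returning the mean of the largest component as the representative center point is an illustrative simplification.

```python
import numpy as np

def spatial_vote(centers, dist_threshold):
    """Keep the polyp center points belonging to the largest connected component.

    centers: list of (x, y) candidate polyp center points (from single-frame
             detection and from tracking).
    Two centers are connected if their Euclidean distance is below dist_threshold;
    the mean of the largest component is returned as the final center point.
    """
    pts = np.asarray(centers, dtype=np.float64)
    n = len(pts)
    if n == 0:
        return None
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    adjacent = dists < dist_threshold
    labels = -np.ones(n, dtype=int)
    comp = 0
    for i in range(n):                      # flood-fill labelling of components
        if labels[i] >= 0:
            continue
        queue, labels[i] = [i], comp
        while queue:
            j = queue.pop()
            for k in np.flatnonzero(adjacent[j]):
                if labels[k] < 0:
                    labels[k] = comp
                    queue.append(k)
        comp += 1
    biggest = np.bincount(labels).argmax()  # component with the most center points
    return tuple(pts[labels == biggest].mean(axis=0))
```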
In summary, the method provided by the embodiment of the present application inherits the advantages of high computation efficiency and no accumulated error of the single frame detection method, and at least has the following beneficial effects:
(1) The detection rate of lesions can be obviously improved. On one hand, polyp prediction on a single frame image is completed through an end-to-end deep learning network; on the other hand, the embodiment of the application also adds a tracking method for polyp detection, so that the final prediction of the current frame image can be completed by combining the prediction result of the preamble frame image with the image information of the current frame image. This image classification mode places a lower requirement on the accuracy of the polyp detection model, and polyps missed by the single frame detection method can be recovered through the video tracking method provided by the embodiment of the application.
(2) The temporal consistency of the prediction results can be improved. Compared with the single frame detection method, the image processing method provided by the embodiment of the application fuses the prediction results of the previous multiple frame images into the prediction of the current frame image, so that two adjacent frame images will not produce quite different prediction results; this improves the temporal coherence of polyp detection and avoids the situation in which the polyp detection model gives different, incoherent and inconsistent prediction results for a visually identical region.
(3) The false detection probability in the detection process can be reduced. In the embodiment of the application, erroneous results detected in only a few frames can be eliminated through the spatial voting algorithm.
In another possible implementation, the image processing method provided above has a wide range of application scenarios: it is not only applicable to polyp detection, or only to intestinal polyp detection, but also to detection of other types of diseases. That is, for another type of disease or another body part, detection of that type of disease can also be achieved based on the image processing method provided by the embodiments of the present application.
In another expression, the image processing method provided in the embodiment of the present application can implement detection of various medical diseases, not limited to polyp detection, but only exemplified by intestinal polyp detection.
In another possible implementation, the foregoing embodiment only detects whether a polyp is present in each image and the position where the polyp appears in the image. In addition, the present embodiment may further provide more information, such as the size, type, and shape of the polyp, and the generation of a diagnosis report regarding the current detection result, and the present embodiment is not particularly limited thereto.
In another possible implementation, in addition to polyp detection using the image processing method provided by the above-described embodiment, polyp detection may also be performed using a detection method of a single-frame still image. For example, when a certain frame image is predicted, the prediction result of the preamble image frame and the image feature information may also be used as input for prediction of the current frame image, that is, polyp prediction may be performed together with the current frame image.
In another possible implementation, polyp detection may also be performed with other video tracking methods. For example, an end-to-end deep learning method may be used: the video is fed into a long short-term memory network or a similar deep learning network, which directly generates the polyp prediction result for each frame image in the video. However, this kind of method requires a large number of fully labeled videos for training, and accumulated errors grow as the video length increases.
In another possible implementation, when performing single-frame polyp detection, another end-to-end deep learning approach is to treat the time dimension as a third dimension, stack the two-dimensional images into a three-dimensional matrix, and perform the computation with three-dimensional convolutions. However, this method adds convolution operations along one more dimension and therefore has higher computational complexity than two-dimensional convolution.
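For illustration only, the following PyTorch sketch contrasts the two computation patterns mentioned above: per-frame two-dimensional convolution versus stacking frames along a time dimension and applying three-dimensional convolution. The tensor shapes, channel counts, and kernel size are assumptions, not values taken from the patent.

import torch
import torch.nn as nn

frames = torch.randn(8, 3, 256, 256)                # T=8 RGB frames, 256x256 each

# Per-frame 2D convolution: every frame is processed independently.
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out2d = conv2d(frames)                               # -> (8, 16, 256, 256)

# 3D convolution: frames stacked into an (N, C, T, H, W) volume, so the kernel
# also slides along time, which adds one more dimension of multiply-accumulates
# and therefore a higher computational cost than the 2D case.
volume = frames.permute(1, 0, 2, 3).unsqueeze(0)     # -> (1, 3, 8, 256, 256)
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out3d = conv3d(volume)                               # -> (1, 16, 8, 256, 256)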
Fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. Referring to fig. 7, the apparatus includes:
an obtaining module 701, configured to obtain a video image stream of a body part to be detected;
a detection module 702, configured to perform lesion detection on each frame of image in the video image stream in sequence;
a processing module 703, configured to classify a current frame image according to a first focus detection result of a preamble frame image and a second focus detection result of the current frame image;
wherein the preamble frame image is at least one frame image located in front of the current frame image in time sequence.
During image processing, the apparatus provided by the embodiment of the present application takes the prediction results of the preceding frame images into account when predicting the current frame image; that is, the apparatus completes the final prediction of a single frame image by combining the prediction results of the preceding frame images with the image information of the current frame image. This not only retains the advantages of the single-frame detection method, namely high efficiency and no accumulated error, but also, by fusing the relevant information of other frame images, significantly improves the accuracy of image classification and ensures the continuity of the prediction results.
In a possible implementation manner, the detection module is further configured to input a current frame image into a detection model, and obtain a first segmented image output by the detection model, where each pixel in the first segmented image represents a probability value that a pixel at a corresponding position in the current frame image is a focus; adjusting the first segmentation image, and post-processing the adjusted first segmentation image; calculating a connected component of at least one foreground region in the post-processed first segmentation image; sorting the at least one connected component by size and sorting the at least one connected component by similarity to a target shape; and when the maximum connected component is consistent with the connected component closest to the target shape, determining the foreground region indicated by the maximum connected component as the focus central point of the current frame image.
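As an illustration of the component-selection step just described, the following Python sketch uses OpenCV connected-component analysis. It assumes the post-processed first segmentation image is already a binary uint8 mask, and that the unspecified "target shape" is approximately circular; both are assumptions rather than details stated in the patent.

import cv2
import numpy as np

def pick_lesion_center(binary_mask):
    """binary_mask: uint8 image with foreground > 0. Returns the centroid of the
    foreground region that is both the largest and the closest to a circle,
    or None when the two rankings disagree."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_mask)
    if num <= 1:                                     # label 0 is the background
        return None

    best_by_size, best_by_shape = None, None
    max_area, max_circularity = -1.0, -1.0
    for lbl in range(1, num):
        component = (labels == lbl).astype(np.uint8)
        area = float(stats[lbl, cv2.CC_STAT_AREA])
        contours, _ = cv2.findContours(component, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        perimeter = cv2.arcLength(contours[0], True)
        circularity = 4.0 * np.pi * area / (perimeter ** 2 + 1e-6)
        if area > max_area:
            max_area, best_by_size = area, lbl
        if circularity > max_circularity:
            max_circularity, best_by_shape = circularity, lbl

    # Accept only when the largest component is also the most circle-like one.
    if best_by_size == best_by_shape:
        cx, cy = centroids[best_by_size]
        return int(round(cx)), int(round(cy))
    return None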
In a possible implementation manner, the detection module is further configured to obtain an adjusted second segmented image matched with an image of a previous frame, and obtain an average value of the first segmented image and the adjusted second segmented image to obtain the adjusted first segmented image; performing binarization processing on the adjusted first segmentation image by taking a specified numerical value as a threshold value; and removing noise points in the first segmentation image after the binarization processing and smoothing the foreground edge.
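A minimal sketch of this adjustment and post-processing is shown below. The binarization threshold (the "specified numerical value") and the morphological kernel size are illustrative assumptions; opening and closing stand in for the unspecified noise-removal and edge-smoothing operations.

import cv2
import numpy as np

def adjust_and_postprocess(first_seg, prev_adjusted_seg, threshold=0.5):
    """first_seg, prev_adjusted_seg: float arrays of per-pixel lesion probabilities
    for the current frame and the matched previous frame, respectively."""
    # Average with the adjusted segmentation of the previous frame.
    adjusted = (first_seg + prev_adjusted_seg) / 2.0

    # Binarize with the specified threshold.
    binary = (adjusted >= threshold).astype(np.uint8)

    # Opening removes isolated noise points; closing smooths the foreground edges.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return adjusted, binary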
In a possible implementation manner, the processing module is further configured to predict a motion vector of a focus center point of the current frame image through linear fitting according to a motion vector of a focus center point of at least one frame image in the preamble frame image, where the focus center point of the at least one frame image in the preamble frame image is obtained based on the first focus detection result; tracking the position coordinates of the predicted focus central point in the current frame image based on the motion vector of the predicted focus central point of the current frame image; when the predicted position coordinate obtained by tracking is located in the image range of the current frame image, judging the predicted position coordinate based on a classifier to obtain a third focus detection result; classifying the current frame image based on the second lesion detection result and the third lesion detection result.
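The linear-fitting prediction described above could look like the following sketch, assuming the lesion center points of at least two preceding frames are available as time-ordered (x, y) coordinates. The classifier call at the end is only a placeholder for the online classifier mentioned in the text.

import numpy as np

def predict_center(prev_centers, image_shape):
    """Extrapolate the lesion center of the current frame from the time-ordered
    center points of the preceding frames; return None if it leaves the image."""
    pts = np.asarray(prev_centers, dtype=float)      # shape (T, 2), T >= 2
    t = np.arange(len(pts))
    # First-order (linear) fit of x(t) and y(t), i.e. a constant motion vector.
    fx = np.polyfit(t, pts[:, 0], 1)
    fy = np.polyfit(t, pts[:, 1], 1)
    x_pred = float(np.polyval(fx, len(pts)))
    y_pred = float(np.polyval(fy, len(pts)))

    h, w = image_shape[:2]
    if not (0 <= x_pred < w and 0 <= y_pred < h):
        return None                                  # outside the current frame
    return x_pred, y_pred

# The predicted coordinate would then be judged by the online classifier, e.g.
#   label = classifier.predict(crop_patch(current_frame, (x_pred, y_pred)))
# where classifier and crop_patch are placeholders for the components in the text.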
In one possible implementation, the apparatus further includes:
the training module is used for acquiring a third segmentation image obtained by inputting the previous frame image of the current frame image into the detection model; generating a target number of positive samples and a target number of negative samples based on the third segmentation image; training the classifier on-line based on the target number of positive samples and the target number of negative samples.
In a possible implementation manner, the training module is further configured to determine a lesion region in the third segmented image; in the third segmentation image, cutting an image area with the overlapping range with the focus area larger than a first value to obtain positive samples of the target number; and in the third segmentation image, cutting an image area of which the overlapping range with the focus area is smaller than a second value to obtain a negative sample of the target quantity.
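A minimal sketch of this sample generation is given below. The overlap measure (fraction of lesion pixels inside a patch), the patch size, and the concrete first/second threshold values are illustrative assumptions, since the patent does not fix them.

import numpy as np

def generate_samples(third_seg_mask, image, num_samples=32, patch=64,
                     pos_overlap=0.7, neg_overlap=0.1, rng=None):
    """Crop patches from `image`; a patch is positive when the fraction of lesion
    pixels inside it exceeds pos_overlap, negative when it is below neg_overlap."""
    rng = rng or np.random.default_rng()
    h, w = third_seg_mask.shape                      # binary lesion mask (0/1)
    positives, negatives = [], []
    for _ in range(100 * num_samples):               # guard against sparse masks
        if len(positives) >= num_samples and len(negatives) >= num_samples:
            break
        y = int(rng.integers(0, h - patch))
        x = int(rng.integers(0, w - patch))
        overlap = third_seg_mask[y:y + patch, x:x + patch].mean()
        crop = image[y:y + patch, x:x + patch]
        if overlap > pos_overlap and len(positives) < num_samples:
            positives.append(crop)
        elif overlap < neg_overlap and len(negatives) < num_samples:
            negatives.append(crop)
    return positives, negatives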
In a possible implementation manner, the processing module is further configured to stop tracking the predicted lesion center point corresponding to the predicted position coordinate in a next frame of image when the predicted position coordinate exceeds the image range of the current frame of image; or when the classifier judges the predicted focus central point corresponding to the predicted position coordinate as the background, stopping tracking the predicted focus central point corresponding to the predicted position coordinate in the next frame of image.
In a possible implementation manner, the processing module is further configured to stop tracking a focus central point when the number of tracking frames for that focus central point is greater than a first number; or, alternatively,
stopping tracking of a lesion center point when tracking of the lesion center point fails in a second number of consecutive images.
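These two stopping rules can be summarized in a small sketch such as the following; the per-lesion tracking state and the concrete values chosen for the first and second numbers are assumptions.

class TrackState:
    """Tracking state of one lesion center point."""

    def __init__(self, max_frames=60, max_failures=5):
        self.max_frames = max_frames          # the "first number" of tracked frames
        self.max_failures = max_failures      # the "second number" of failed frames
        self.frames_tracked = 0
        self.consecutive_failures = 0

    def update(self, tracked_ok: bool) -> bool:
        """Return True while tracking of this lesion center point should continue."""
        self.frames_tracked += 1
        self.consecutive_failures = 0 if tracked_ok else self.consecutive_failures + 1
        if self.frames_tracked > self.max_frames:
            return False
        if self.consecutive_failures >= self.max_failures:
            return False
        return True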
In a possible implementation manner, the processing module is further configured to connect neighboring lesion center points with Euclidean distances smaller than a target threshold when the number of the lesion center points given by the second lesion detection result and the third lesion detection result is at least two; and calculating the connected components of the central points of the at least two focuses, and determining the central point of the focus corresponding to the maximum connected component as the final central point of the focus of the current frame image.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the image processing apparatus provided in the above embodiment, only the division of the functional modules is illustrated when performing image processing, and in practical applications, the functions may be distributed by different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 8 shows a block diagram of an image processing apparatus 800 according to an exemplary embodiment of the present application. The device 800 may be a portable mobile device such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Device 800 may also be referred to by other names such as user equipment, portable device, laptop device, desktop device, and so forth.
In general, the apparatus 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the image processing methods provided by method embodiments herein.
In some embodiments, the apparatus 800 may further optionally include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a touch screen display 805, a camera 806, an audio circuit 807, a positioning component 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 804 may communicate with other devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above the surface of the display 805. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 805 may be one, providing the front panel of the device 800; in other embodiments, the display 805 may be at least two, respectively disposed on different surfaces of the device 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display, disposed on a curved surface or on a folded surface of the device 800. Even further, the display 805 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 805 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of the apparatus, and a rear camera is disposed on a rear surface of the apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. The microphones may be multiple and placed at different locations of the device 800 for stereo sound acquisition or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic location of the device 800 for navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
A power supply 809 is used to power the various components in the device 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power supply 809 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the device 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the apparatus 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the touch screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the device 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the device 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side bezel of device 800 and/or underneath touch display 805. When the pressure sensor 813 is arranged on the side frame of the device 800, the holding signal of the user to the device 800 can be detected, and the processor 801 performs left-right hand identification or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the touch display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used for collecting a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 814 may be disposed on the front, back, or side of device 800. When a physical key or vendor Logo is provided on the device 800, the fingerprint sensor 814 may be integrated with the physical key or vendor Logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch screen 805 based on the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the device 800. The proximity sensor 816 is used to capture the distance between the user and the front of the device 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front of the device 800 gradually decreases, the processor 801 controls the touch display 805 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 816 detects that the distance between the user and the front of the device 800 gradually increases, the processor 801 controls the touch display 805 to switch from the dark-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of the apparatus 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. An image processing method, characterized in that the method comprises:
acquiring a video image stream of a body part to be detected;
sequentially carrying out focus detection on each frame of image in the video image stream;
for a current frame image, predicting a motion vector of a focus central point of the current frame image through linear fitting according to a motion vector of the focus central point of at least one frame image in a preamble frame image, wherein the focus central point of at least one frame image in the preamble frame image is obtained based on a first focus detection result of the preamble frame image; tracking the position coordinates of the predicted focus central point in the current frame image based on the motion vector of the predicted focus central point of the current frame image; when the predicted position coordinate obtained by tracking is located in the image range of the current frame image, judging the predicted position coordinate based on a classifier to obtain a third focus detection result; classifying the current frame image based on the second focus detection result and the third focus detection result of the current frame image;
wherein, the preorder frame image is at least one frame image which is positioned in front of the current frame image in time sequence; performing focus detection on the current frame image, including:
inputting the current frame image into a detection model, and acquiring a first segmentation image output by the detection model, wherein each pixel point in the first segmentation image represents the probability value that the pixel point at the corresponding position in the current frame image is a focus; acquiring an adjusted second segmentation image matched with the previous frame image; calculating the average value of the first segmentation image and the adjusted second segmentation image to obtain an adjusted first segmentation image; performing binarization processing on the adjusted first segmentation image by taking a specified numerical value as a threshold value; removing noise points in the first segmentation image after the binarization processing and smoothing the foreground edge; calculating a connected component of at least one foreground region in the post-processed first segmentation image; sorting the at least one connected component by size and sorting the at least one connected component by similarity to a target shape; and when the maximum connected component is consistent with the connected component closest to the target shape, determining a foreground region indicated by the maximum connected component as a focus central point of the current frame image.
2. The method of claim 1, further comprising:
acquiring a third segmentation image obtained by inputting a previous frame image of the current frame image into a detection model;
generating a target number of positive samples and a target number of negative samples based on the third segmentation image;
training the classifier on-line based on the target number of positive samples and the target number of negative samples.
3. The method of claim 2, wherein generating a target number of positive samples and a target number of negative samples based on the third segmentation image comprises:
determining a lesion region in the third segmented image;
in the third segmentation image, cutting an image area with the overlapping range with the focus area larger than a first value to obtain positive samples of the target number;
and in the third segmentation image, cutting an image area of which the overlapping range with the focus area is smaller than a second value to obtain a negative sample of the target quantity.
4. The method of claim 1, further comprising:
when the predicted position coordinate exceeds the image range of the current frame image, stopping tracking the predicted focus central point corresponding to the predicted position coordinate in the next frame image; or the like, or, alternatively,
and when the classifier judges the predicted focus central point corresponding to the predicted position coordinate as the background, stopping tracking the predicted focus central point corresponding to the predicted position coordinate in the next frame of image.
5. The method according to any one of claims 1 to 4, further comprising:
when the number of tracking frames for any focus central point is larger than a first number, stopping tracking the focus central point; or the like, or, alternatively,
stopping tracking of a lesion center point when tracking of the lesion center point fails in a second number of consecutive images.
6. The method of claim 1, wherein classifying the current frame image based on the second lesion detection result and the third lesion detection result comprises:
when the number of the focus central points given by the second prediction result and the third prediction result is at least two, connecting the adjacent focus central points with Euclidean distance smaller than a target threshold value;
and calculating the connected components of the central points of the at least two focuses, and determining the central point of the focus corresponding to the maximum connected component as the final central point of the focus of the current frame image.
7. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a video image stream of the body part to be detected;
the detection module is used for sequentially carrying out focus detection on each frame of image in the video image stream;
the processing module is used for predicting the motion vector of the focus central point of the current frame image through linear fitting according to the motion vector of the focus central point of at least one frame image in the pre-frame image, wherein the focus central point of at least one frame image in the pre-frame image is obtained based on a first focus detection result of the pre-frame image; tracking the position coordinates of the predicted focus central point in the current frame image based on the motion vector of the predicted focus central point of the current frame image; when the predicted position coordinate obtained by tracking is located in the image range of the current frame image, judging the predicted position coordinate based on a classifier to obtain a third focus detection result; classifying the current frame image based on the second focus detection result and the third focus detection result of the current frame image;
wherein, the preorder frame image is at least one frame image which is positioned in front of the current frame image in time sequence; the detection module is further configured to input the current frame image into a detection model, and obtain a first segmentation image output by the detection model, where each pixel point in the first segmentation image represents a probability value that a pixel point at a corresponding position in the current frame image is a focus; acquiring an adjusted second segmentation image matched with the previous frame image; calculating the average value of the first segmentation image and the adjusted second segmentation image to obtain an adjusted first segmentation image; performing binarization processing on the adjusted first segmentation image by taking a specified numerical value as a threshold value; removing noise points in the first segmentation image after the binarization processing and smoothing the foreground edge; calculating a connected component of at least one foreground region in the post-processed first segmentation image; sorting the at least one connected component by size and sorting the at least one connected component by similarity to a target shape; and when the maximum connected component is consistent with the connected component closest to the target shape, determining a foreground region indicated by the maximum connected component as a focus central point of the current frame image.
8. The apparatus of claim 7, further comprising:
the training module is used for acquiring a third segmentation image obtained by inputting the previous frame image of the current frame image into the detection model; generating a target number of positive samples and a target number of negative samples based on the third segmentation image; training the classifier on-line based on the target number of positive samples and the target number of negative samples.
9. The apparatus of claim 8, wherein the training module is further configured to determine a lesion region in the third segmented image; in the third segmentation image, cutting an image area with the overlapping range with the focus area larger than a first value to obtain positive samples of the target number; and in the third segmentation image, cutting an image area of which the overlapping range with the focus area is smaller than a second value to obtain a negative sample of the target quantity.
10. An image processing apparatus, characterized in that the apparatus comprises a processor and a memory, in which at least one instruction is stored, which is loaded and executed by the processor to implement the image processing method according to any one of claims 1 to 6.
11. An image processing system, characterized in that the system comprises: the system comprises image acquisition equipment, image processing equipment and display equipment;
the image acquisition equipment is used for acquiring images of the body part to be detected to obtain a video image stream of the body part to be detected;
the image processing apparatus includes a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement: acquiring the video image stream; sequentially carrying out focus detection on each frame of image in the video image stream; for a current frame image, predicting a motion vector of a focus central point of the current frame image through linear fitting according to a motion vector of the focus central point of at least one frame image in a preamble frame image, wherein the focus central point of at least one frame image in the preamble frame image is obtained based on a first focus detection result of the preamble frame image; tracking the position coordinates of the predicted focus central point in the current frame image based on the motion vector of the predicted focus central point of the current frame image; when the predicted position coordinate obtained by tracking is located in the image range of the current frame image, judging the predicted position coordinate based on a classifier to obtain a third focus detection result; classifying the current frame image based on a second focus detection result and a third focus detection result of the current frame image, wherein the preorder frame image is at least one frame image positioned in front of the current frame image in time sequence; wherein, the focus detection is carried out on the current frame image, and the focus detection comprises the following steps:
inputting the current frame image into a detection model, and acquiring a first segmentation image output by the detection model, wherein each pixel point in the first segmentation image represents the probability value that the pixel point at the corresponding position in the current frame image is a focus; acquiring an adjusted second segmentation image matched with the previous frame image; calculating the average value of the first segmentation image and the adjusted second segmentation image to obtain an adjusted first segmentation image; performing binarization processing on the adjusted first segmentation image by taking a specified numerical value as a threshold value; removing noise points in the first segmentation image after the binarization processing and smoothing the foreground edge; calculating a connected component of at least one foreground region in the post-processed first segmentation image; sorting the at least one connected component by size and sorting the at least one connected component by similarity to a target shape; when the maximum connected component is consistent with the connected component closest to the target shape, determining a foreground region indicated by the maximum connected component as a focus central point of the current frame image;
the display device is used for displaying the result output by the image processing device.
CN201910156660.1A 2019-03-01 2019-03-01 Image processing method, device, storage medium, equipment and system Active CN109886243B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910757897.5A CN110458127B (en) 2019-03-01 2019-03-01 Image processing method, device, equipment and system
CN201910156660.1A CN109886243B (en) 2019-03-01 2019-03-01 Image processing method, device, storage medium, equipment and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910156660.1A CN109886243B (en) 2019-03-01 2019-03-01 Image processing method, device, storage medium, equipment and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910757897.5A Division CN110458127B (en) 2019-03-01 2019-03-01 Image processing method, device, equipment and system

Publications (2)

Publication Number Publication Date
CN109886243A CN109886243A (en) 2019-06-14
CN109886243B true CN109886243B (en) 2021-03-26

Family

ID=66930257

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910156660.1A Active CN109886243B (en) 2019-03-01 2019-03-01 Image processing method, device, storage medium, equipment and system
CN201910757897.5A Active CN110458127B (en) 2019-03-01 2019-03-01 Image processing method, device, equipment and system

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910757897.5A Active CN110458127B (en) 2019-03-01 2019-03-01 Image processing method, device, equipment and system

Country Status (1)

Country Link
CN (2) CN109886243B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516620B (en) 2019-08-29 2023-07-28 腾讯科技(深圳)有限公司 Target tracking method and device, storage medium and electronic equipment
CN110706222B (en) * 2019-09-30 2022-04-12 杭州依图医疗技术有限公司 Method and device for detecting bone region in image
CN110717441B (en) * 2019-10-08 2021-04-16 腾讯医疗健康(深圳)有限公司 Video target detection method, device, equipment and medium
CN110992290B (en) * 2019-12-09 2023-09-15 深圳先进技术研究院 Training method and system for low-dose CT image denoising network
CN111028219B (en) * 2019-12-10 2023-06-20 浙江核睿医疗科技有限公司 Colon image recognition method and device and related equipment
CN110974179A (en) * 2019-12-20 2020-04-10 山东大学齐鲁医院 Auxiliary diagnosis system for stomach precancer under electronic staining endoscope based on deep learning
CN111223488B (en) * 2019-12-30 2023-01-17 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111311635A (en) * 2020-02-08 2020-06-19 腾讯科技(深圳)有限公司 Target positioning method, device and system
CN111383214B (en) * 2020-03-10 2021-02-19 长沙慧维智能医疗科技有限公司 Real-time endoscope enteroscope polyp detection system
CN111339999A (en) * 2020-03-23 2020-06-26 东莞理工学院 Image processing system and method for visual navigation robot
CN111598133B (en) * 2020-04-22 2022-10-14 腾讯医疗健康(深圳)有限公司 Image display method, device, system, equipment and medium based on artificial intelligence
CN111738998B (en) * 2020-06-12 2023-06-23 深圳技术大学 Method and device for dynamically detecting focus position, electronic equipment and storage medium
CN111932492B (en) * 2020-06-24 2021-05-11 数坤(北京)网络科技有限公司 Medical image processing method and device and computer readable storage medium
CN111915573A (en) * 2020-07-14 2020-11-10 武汉楚精灵医疗科技有限公司 Digestive endoscopy focus tracking method based on time sequence feature learning
CN111899268B (en) * 2020-08-17 2022-02-18 上海商汤智能科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN111950517A (en) * 2020-08-26 2020-11-17 司马大大(北京)智能系统有限公司 Target detection method, model training method, electronic device and storage medium
CN112085760B (en) * 2020-09-04 2024-04-26 厦门大学 Foreground segmentation method for laparoscopic surgery video
CN112669283B (en) * 2020-12-29 2022-11-01 杭州优视泰信息技术有限公司 Enteroscopy image polyp false detection suppression device based on deep learning
CN112686865B (en) * 2020-12-31 2023-06-02 重庆西山科技股份有限公司 3D view auxiliary detection method, system, device and storage medium
CN112766066A (en) * 2020-12-31 2021-05-07 北京小白世纪网络科技有限公司 Method and system for processing and displaying dynamic video stream and static image
CN112785573A (en) * 2021-01-22 2021-05-11 上海商汤智能科技有限公司 Image processing method and related device and equipment
KR20220111526A (en) * 2021-02-02 2022-08-09 자이메드 주식회사 Apparatus and method for identifying real-time biometric image
CN113116305B (en) * 2021-04-20 2022-10-28 深圳大学 Nasopharyngeal endoscope image processing method and device, electronic equipment and storage medium
CN113379723B (en) * 2021-06-29 2023-07-28 上海闻泰信息技术有限公司 Irregular glue overflow port detection method, device, equipment and storage medium
US20230034727A1 (en) * 2021-07-29 2023-02-02 Rakuten Group, Inc. Blur-robust image segmentation
KR20230091645A (en) * 2021-12-16 2023-06-23 주식회사 온택트헬스 Method for providng information of organ function and device for providng information of organ function using the same
CN114066781B (en) * 2022-01-18 2022-05-10 浙江鸿禾医疗科技有限责任公司 Capsule endoscope intestinal image identification and positioning method, storage medium and equipment
CN114842239B (en) * 2022-04-02 2022-12-23 北京医准智能科技有限公司 Breast lesion attribute prediction method and device based on ultrasonic video
CN114511558B (en) * 2022-04-18 2022-07-19 武汉楚精灵医疗科技有限公司 Method and device for detecting cleanliness of intestinal tract
WO2023226009A1 (en) * 2022-05-27 2023-11-30 中国科学院深圳先进技术研究院 Image processing method and device
CN115035153B (en) * 2022-08-12 2022-10-28 武汉楚精灵医疗科技有限公司 Medical image processing method, device and related equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110032347A1 (en) * 2008-04-15 2011-02-10 Gerard Lacey Endoscopy system with motion sensors
US8712122B2 (en) * 2011-03-31 2014-04-29 International Business Machines Corporation Shape based similarity of continuous wave doppler images
CN104063885A (en) * 2014-07-23 2014-09-24 山东建筑大学 Improved movement target detecting and tracking method
US10366491B2 (en) * 2017-03-08 2019-07-30 Siemens Healthcare Gmbh Deep image-to-image recurrent network with shape basis for automatic vertebra labeling in large-scale 3D CT volumes
CN109272457B (en) * 2018-08-09 2022-07-22 腾讯科技(深圳)有限公司 Image mask generation method and device and server

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102056530A (en) * 2008-06-05 2011-05-11 奥林巴斯株式会社 Image processing apparatus, image processing program and image processing method
KR101010927B1 (en) * 2008-11-26 2011-01-25 서울대학교산학협력단 Automated Polyps Detection Method using computer tomographic colonography and Automated Polyps Detection System using the same
CN105243360A (en) * 2015-09-21 2016-01-13 西安空间无线电技术研究所 Ship object self-organizing cluster method based on distance search
CN108470355A (en) * 2018-04-04 2018-08-31 中山大学 Merge the method for tracking target of convolutional network feature and discriminate correlation filter
CN109003672A (en) * 2018-07-16 2018-12-14 北京睿客邦科技有限公司 A kind of early stage of lung cancer detection classification integration apparatus and system based on deep learning
CN109360226A (en) * 2018-10-17 2019-02-19 武汉大学 A kind of multi-object tracking method based on time series multiple features fusion

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
End-to-End Learning of Motion Representation for Video Understanding;Lijie Fan 等;《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;20181231;6016-6025 *
Fast Visual Tracking via Dense Spatio-Temporal Context Learning;Kaihua Zhang 等;《COMPUTER VISION-ECCV 2014》;20141231;1-15 *
Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism;Qi Chu 等;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171231;第3节 *
Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker;Ruikai Zhang 等;《Pattern Recognition》;20181130;第83卷;第3节 *
Ruikai Zhang 等.Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker.《Pattern Recognition》.2018,第83卷第3节. *
Towards Real-Time Polyp Detection in Colonoscopy Videos: Adapting Still Frame-Based Methodologies for Video Sequences Analysis;Quentin Angermann 等;《CARE/CLIP 2017》;20171231;29-41 *
U-Net: Convolutional Networks for Biomedical Image Segmentation;Olaf Ronneberger 等;《MICCAI 2015》;20151231;摘要,第1节 *
基于线性拟合的多运动目标跟踪算法;李涛 等;《西南师范大学学报(自然科学版)》;20150531;44-49 *
基于运动目标分类的监控视频检索系统;徐娟 等;《工业控制计算机》;20151231;第28卷(第7期);115-116、118 *
超声造影病灶区目标跟踪方法的研究;许新;《中国优秀硕士学位论文全文数据库 信息科技辑(月刊)》;20181015;第2018年卷(第10期);I138-819 *

Also Published As

Publication number Publication date
CN110458127B (en) 2021-02-26
CN110458127A (en) 2019-11-15
CN109886243A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109886243B (en) Image processing method, device, storage medium, equipment and system
CN109410220B (en) Image segmentation method and device, computer equipment and storage medium
TWI770754B (en) Neural network training method electronic equipment and storage medium
CN110210571B (en) Image recognition method and device, computer equipment and computer readable storage medium
CN111079576B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN110059744B (en) Method for training neural network, method and equipment for processing image and storage medium
CN107833219B (en) Image recognition method and device
JP7154678B2 (en) Target position acquisition method, device, computer equipment and computer program
WO2022151755A1 (en) Target detection method and apparatus, and electronic device, storage medium, computer program product and computer program
CN110097576B (en) Motion information determination method of image feature point, task execution method and equipment
CN110570460B (en) Target tracking method, device, computer equipment and computer readable storage medium
CN110992327A (en) Lens contamination state detection method and device, terminal and storage medium
CN108776822B (en) Target area detection method, device, terminal and storage medium
CN109360222B (en) Image segmentation method, device and storage medium
CN111680697A (en) Method, apparatus, electronic device, and medium for implementing domain adaptation
CN114693593A (en) Image processing method, device and computer device
CN111754386A (en) Image area shielding method, device, equipment and storage medium
CN112749590B (en) Object detection method, device, computer equipment and computer readable storage medium
CN115170464A (en) Lung image processing method and device, electronic equipment and storage medium
CN111598896A (en) Image detection method, device, equipment and storage medium
CN110705438A (en) Gait recognition method, device, equipment and storage medium
CN108304841B (en) Method, device and storage medium for nipple positioning
CN112508959B (en) Video object segmentation method and device, electronic equipment and storage medium
CN113711123B (en) Focusing method and device and electronic equipment
CN110675473A (en) Method, device, electronic equipment and medium for generating GIF dynamic graph

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20190911

Address after: Room 201, Building A, No. 1 Qianwan 1st Road, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000

Applicant after: Tencent Medical Health (Shenzhen) Co., Ltd.

Address before: 35th floor, Tencent Building, Science and Technology Zone, Nanshan District, Shenzhen, Guangdong 518057

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

GR01 Patent grant
GR01 Patent grant